Zero-Shot Selection of Pretrained Models

Deep learning (DL) has celebrated many successes, but it’s still a challenge to find the right model for a given dataset, especially with a limited budget. Adapting DL models to new problems can be computationally intensive and requires comprehensive training data. On tabular data, AutoML solutions like Auto-sklearn and AutoGluon work very well. However, there is no equivalent that allows the selection of pretrained deep models in vision or natural language processing, along with the right finetuning hyperparameters. We combine AutoML, meta-learning, and pretrained models to offer a solution named ZAP (Zero-Shot AutoML with Pretrained Models) to automatically select the best DL pipeline (pretrained model and its finetuning hyperparameters) based on dataset characteristics. In this post, we introduce our ZAP method and outline our experiments and results.

Presenting: ZAP

ZAP (Zero-Shot AutoML with Pretrained Models) is a domain-independent meta-learning approach that learns a zero-shot surrogate model for the problem described above. That is, at test time, for a new dataset D, ZAP selects a suitable DL pipeline, including the pretrained model and finetuning hyperparameters, using only trivial meta-features, such as image resolution or the number of classes. This process is visualized in the following figure:

Two-stage process of ZAP-HPO

In regular use, zero-shot refers to generalizing on a novel category of samples. In the context of our method, we use the term zero-shot to express that we do not perform any exploratory evaluation of the pipeline but must select a suitable one without evaluating any alternatives on a holdout dataset.

In summary, our key contributions are:

  1. A framework for choosing a pretrained model and its finetuning hyperparameters for a new dataset by meta-learning a domain-independent zero-shot surrogate model using trivial meta-features.
  2. The release of a large DL image classification meta-dataset that is 1000 times larger than existing ones. This contains the performance of 525 DL pipelines, consisting of the pretrained model, finetuning hyperparameters, and training data metadata across various datasets and augmentations [Github Link].
  3. Our approach outperforms other contenders in the AutoDL challenge evaluation.

To solve the problem of selecting the most cost-effective deep learning pipeline for a given dataset, we begin with a set of pretrained models and their associated finetuning hyperparameters, as well as a collection of datasets and their corresponding meta-features. We also have a cost matrix specifying the cost of training a dataset on each pipeline. The goal of Zero-Shot AutoML with pretrained models is to find a mapping that minimizes the cost across all datasets.
We propose two approaches to finding this mapping.
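The objective described above can be written compactly. The notation here is our own shorthand for this post and is an assumption, not necessarily the notation of the paper: $\mathcal{X}$ is the set of candidate pipelines, $D_1, \dots, D_N$ are the meta-training datasets with meta-features $\phi_1, \dots, \phi_N \in \Phi$, and $C(x, D_i)$ is the cost-matrix entry for pipeline $x$ on dataset $D_i$.

```latex
f^{*} \in \operatorname*{arg\,min}_{f \colon \Phi \to \mathcal{X}}
\; \sum_{i=1}^{N} C\bigl(f(\phi_i),\, D_i\bigr)
```

The mapping $f$ only sees the meta-features of a dataset, never the dataset itself, which is what makes the selection zero-shot in the sense used here.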

ZAP-AS with AutoFolio

Our first approach for solving ZAP treats the selection of the deep learning (DL) pipeline as an algorithm selection (AS) problem. Algorithm selection is the classic problem, due to Rice (dating back to 1976!), of identifying the best algorithm from a pool of candidate algorithms based on dataset meta-features. A strong method from the literature for solving AS is AutoFolio, which offers diverse approaches, including regression, classification, clustering, and cost-sensitive classification, to select the best candidate pipeline for a given dataset.


Our second approach, ZAP-HPO, also accounts for the fact that the different DL pipelines we consider are related to each other. We use this to reframe the DL pipeline selection problem as one of hyperparameter optimization, where the choice of pretrained model is a categorical variable and the parameters for finetuning are continuous variables.
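To make the reframing concrete, here is a minimal sketch of how one pipeline could be turned into a single numeric vector: a one-hot block for the categorical model choice followed by the continuous finetuning parameters. The model names, the hyperparameter names (`lr`, `dropout`), and the function `encode_pipeline` are hypothetical and chosen only for illustration; they are not taken from the ZAP code.

```python
import numpy as np

# Hypothetical pool of pretrained models (placeholder names).
MODELS = ["model_a", "model_b", "model_c"]


def encode_pipeline(model: str, hparams: dict[str, float], hparam_names: list[str]) -> np.ndarray:
    """Encode a pipeline as [one-hot model choice | continuous hyperparameters]."""
    one_hot = np.zeros(len(MODELS))
    one_hot[MODELS.index(model)] = 1.0
    # Fix the order of the continuous values so every pipeline uses the same layout.
    continuous = np.array([hparams[name] for name in hparam_names], dtype=float)
    return np.concatenate([one_hot, continuous])


vec = encode_pipeline("model_b", {"lr": 0.01, "dropout": 0.2}, ["lr", "dropout"])
```

With such an encoding, two pipelines that share a model and differ slightly in learning rate end up close to each other, which is the relatedness a pure algorithm-selection view ignores.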

We use a surrogate function to estimate the test cost of finetuning a pretrained model on a dataset, given the choice of pretrained model, the finetuning hyperparameters, and the dataset meta-features. The surrogate is trained with a pairwise ranking loss, which increases the gap between the scores of high- and poor-performing pipelines. This is visualized in the following figure:
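As an illustration of the idea, the following sketch computes a logistic pairwise ranking loss for the pipelines evaluated on one dataset. The logistic form is an assumption made for this example, and the exact loss used in the paper may differ. The function name `pairwise_ranking_loss` is our own.

```python
import numpy as np


def pairwise_ranking_loss(pred_cost: np.ndarray, true_cost: np.ndarray) -> float:
    """Mean logistic loss over all pairs (i, j) where pipeline i is truly cheaper than j.

    The loss is small when the surrogate predicts a clearly lower cost for i than for j.
    """
    # diff_pred[i, j] = pred_cost[i] - pred_cost[j]
    diff_pred = pred_cost[:, None] - pred_cost[None, :]
    # better[i, j] is True when pipeline i has strictly lower true cost than pipeline j.
    better = true_cost[:, None] < true_cost[None, :]
    if not better.any():
        return 0.0
    # log(1 + exp(z)), computed in a numerically stable way.
    return float(np.logaddexp(0.0, diff_pred[better]).mean())


true_cost = np.array([0.1, 0.5, 0.9])
loss_good = pairwise_ranking_loss(np.array([0.0, 1.0, 2.0]), true_cost)  # same order as truth
loss_bad = pairwise_ranking_loss(np.array([2.0, 1.0, 0.0]), true_cost)   # reversed order
```

Because only the ordering of pipelines matters for selection, a ranking loss of this kind does not require the surrogate to predict the absolute cost accurately.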

Visual representation of the ranking objective used in testing

Creation of the Candidate DL pipelines (Meta dataset)

Cost matrix as a heatmap, where color indicates the ALC score and lighter is better.

ZAP’s strength derives from its meta-dataset, which comprises annotated meta-features and candidate DL pipelines. Trivial meta-features, such as the number of samples, channels, classes, and the resolution, are used to characterize the datasets. The datasets cover a wide range, which improves the ZAP system’s generalization. We selected 35 diverse datasets from the image domain and augmented them 15 times each to create 525 unique datasets.

To choose candidate DL pipelines, we extended the code of the winning baselines of the AutoCV competition by adding multiple degrees of freedom to the process.

We use BOHB to optimize the DL pipeline for each dataset in the meta-dataset, with a 26-dimensional hyperparameter search space. To create the meta-dataset, we evaluated 525 candidate DL pipeline instantiations on 525 datasets, resulting in a cost matrix of 275,625 pipeline-dataset pairs. The ZAP-HPO method has an offline phase, in which the meta-dataset is created, and a subsequent phase in which a two-layer MLP surrogate model is trained on the meta-dataset. During the test phase, the surrogate model takes in the meta-features of the test dataset and outputs a pipeline configuration. We use this configuration to train a deep learning model on the test dataset.
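The sketch below checks the meta-dataset arithmetic and illustrates one plausible way the test-time selection could work: score every candidate pipeline with a small MLP and pick the one with the lowest predicted cost. Everything beyond the numbers 35, 15, 525, and 26 is an assumption for illustration: the hidden width, the ReLU activation, the random (untrained) weights, the numeric 26-dimensional pipeline encoding, the meta-feature values, and the helper names `init_params`, `predict_cost`, and `select_pipeline`.

```python
import numpy as np

N_CORE_DATASETS, N_AUGMENTATIONS = 35, 15
n_datasets = N_CORE_DATASETS * N_AUGMENTATIONS  # 525 datasets
n_pipelines = 525                                # candidate pipelines
n_pairs = n_datasets * n_pipelines               # 275,625 cost-matrix entries

HIDDEN = 32  # assumed hidden width


def init_params(in_dim: int, rng: np.random.Generator):
    """Random weights for a hidden layer plus an output layer (untrained stand-in)."""
    return (rng.normal(0.0, 0.1, (in_dim, HIDDEN)), np.zeros(HIDDEN),
            rng.normal(0.0, 0.1, (HIDDEN, 1)), np.zeros(1))


def predict_cost(x: np.ndarray, params) -> np.ndarray:
    """Two-layer MLP: ReLU hidden layer, linear output. Returns one score per row."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(0.0, x @ w1 + b1)
    return (hidden @ w2 + b2).ravel()


def select_pipeline(candidates: np.ndarray, meta_features: np.ndarray, params) -> int:
    """Return the index of the candidate with the lowest predicted cost."""
    # Pair every candidate encoding with the same dataset meta-features.
    tiled = np.tile(meta_features, (len(candidates), 1))
    x = np.concatenate([candidates, tiled], axis=1)
    return int(np.argmin(predict_cost(x, params)))


rng = np.random.default_rng(0)
candidates = rng.random((n_pipelines, 26))            # placeholder pipeline encodings
meta_features = np.log10([5000.0, 3.0, 10.0, 64.0])   # samples, channels, classes, resolution
params = init_params(26 + 4, rng)
best = select_pipeline(candidates, meta_features, params)
```

Since no pipeline is trained on the new dataset before the choice is made, the selection itself costs only one forward pass per candidate.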

About the AutoDL Challenge

One way in which we evaluate our ZAP methods is with respect to how well they do in an existing challenge setting. The AutoDL challenges are a series of machine learning competitions focusing on applying Automated Machine Learning (AutoML) to various data modalities, including image, video, text, speech, and tabular data. Due to training time limitations, participants are evaluated using an anytime learning metric and must use efficient techniques. The competition starts with a 20-minute initialization, followed by 20 minutes of training and testing with the test data to provide predictions. The primary evaluation metric is the Area under the Learning Curve (ALC), and the model must generate predictions as early as possible to maximize the ALC since probing is costly.
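To show why early predictions matter for an anytime metric, here is a rough sketch of an ALC-style computation. It assumes a step-shaped learning curve (the score of the latest prediction holds until the next one) and a logarithmic rescaling of time with a reference constant `t0`. The rescaling, the value `t0=60.0`, and the function names are assumptions of this example, so consult the official challenge scoring code for the exact definition. The 1200-second budget corresponds to the 20 minutes mentioned above.

```python
import math


def transform_time(t: float, budget: float = 1200.0, t0: float = 60.0) -> float:
    """Map wall-clock seconds in [0, budget] to [0, 1] on a log scale (assumed form)."""
    return math.log(1.0 + t / t0) / math.log(1.0 + budget / t0)


def alc(timestamps: list[float], scores: list[float],
        budget: float = 1200.0, t0: float = 60.0) -> float:
    """Area under a step learning curve; the score counts as 0 before the first prediction."""
    points = sorted(zip(timestamps, scores))
    area = 0.0
    for k, (t, score) in enumerate(points):
        if t >= budget:
            break  # predictions after the budget do not count
        # The current score holds until the next prediction or the end of the budget.
        t_next = points[k + 1][0] if k + 1 < len(points) else budget
        t_next = min(t_next, budget)
        area += score * (transform_time(t_next, budget, t0) - transform_time(t, budget, t0))
    return area


early = alc([60.0], [0.8])   # one prediction after one minute
late = alc([600.0], [0.8])   # the same score, but after ten minutes
```

Under these assumptions, the same final score is worth considerably more when it arrives early, which favors pipelines that finetune quickly.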

Evaluation and Experiments

We compared the performance of ZAP-AS and ZAP-HPO against previous AutoDL challenge winners (DeepWisdom, DeepBlueAI, PASA-NJU) and against three reference baselines: Random Selection (a pipeline drawn from the 525 candidates), Single-best (the pipeline that is best on average across the datasets), and the Oracle (the best pipeline per dataset). We evaluated on two benchmarks, the AutoDL benchmark and our ZAP benchmark based on the ZAP meta-dataset. For the AutoDL benchmark, we meta-trained the candidates on the entire meta-dataset and uploaded them to the official platform for evaluation. We reported results on the five undisclosed final datasets from different domains.
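The three reference baselines can be read directly off a score matrix. The sketch below uses a small random matrix as a stand-in for the real one (rows are datasets, columns are pipelines, entries are ALC scores where higher is better); the matrix size and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((6, 4))  # toy stand-in: 6 datasets x 4 pipelines

# Oracle: the best pipeline for each dataset, averaged over datasets.
oracle = scores.max(axis=1).mean()

# Single-best: the one pipeline with the highest mean score across datasets.
single_best_idx = int(scores.mean(axis=0).argmax())
single_best = scores[:, single_best_idx].mean()

# Random selection: expected score when a pipeline is drawn uniformly at random.
random_selection = scores.mean()
```

By construction the Oracle is an upper bound for any selector that picks from this pool, and Single-best is at least as good as random selection in expectation. In a held-out evaluation, the Single-best pipeline would be chosen on the meta-training datasets only.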

For the ZAP Benchmark, we used the Leave-One-Core-Dataset-Out protocol, meta-training the methods on 34 of the 35 datasets, testing on the held-out dataset, and repeating 35 times while averaging the results. We reported the evaluation results on each benchmark averaged over ten repetitions.
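A minimal sketch of such a split is shown below. It assumes that the 525 datasets are indexed so that the 15 augmented variants of each core dataset are contiguous, and that all variants of the held-out core dataset are removed from meta-training together. Both points are assumptions of this example, as is the function name.

```python
from typing import Iterator

N_CORE, N_AUG = 35, 15


def leave_one_core_dataset_out(
    n_core: int = N_CORE, n_aug: int = N_AUG
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices), holding out one core dataset per fold."""
    for held_out in range(n_core):
        train, test = [], []
        for idx in range(n_core * n_aug):
            core = idx // n_aug  # assumed layout: variants of a core dataset are contiguous
            (test if core == held_out else train).append(idx)
        yield train, test


folds = list(leave_one_core_dataset_out())
```

Holding out every variant of a core dataset at once prevents the surrogate from seeing a near-duplicate of the test dataset during meta-training.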


On the AutoDL benchmark, ZAP-HPO outperformed the winner baselines, single-best, and random baselines. On the other hand, when generalizing to out-of-distribution data, the more conservative ZAP-AS method performed slightly better than ZAP-HPO.

ZAP-HPO approach vs. prior AutoDL winners on the AutoDL benchmark.

Disclaimer: Unlike the winner baselines, we did not use the challenge feedback datasets to optimize the base model and zero-shot model hyperparameters for the final submission but used (only) the ones from the ZAP benchmark. Due to a difference in distributions between these benchmarks (ZAP vs. AutoDL), it could not be taken for granted that our method generalizes to the AutoDL competition datasets. Still, it performed best on the test server.

Ranking ZAP-HPO on the ZAP benchmark

On the ZAP benchmark, the algorithm-selection-based ZAP-AS outperformed the competition winners, but the geometry-aware ranking-based model, ZAP-HPO, performed even better.

Even when using a different metric, normalized Area under the curve (Normalized AUC), ZAP-HPO was better than the AutoDL winners, with performance close to the (hypothetical best) Oracle performance.

ALC scores of the ZAP approach vs. AutoDL winner baselines


In conclusion, ZAP expands the scope of AutoML to address the challenge of finetuning pretrained deep learning models on new image classification datasets. By releasing our meta-dataset, which includes roughly 275,000 evaluations of DL pipelines on datasets derived from 35 popular image datasets, we have established a foundation for further research in this area, and we hope that other researchers will find it useful. We are working on expanding our framework to include additional models and datasets, and we are also exploring how this approach can extend to other domains. We hope that our work will pave the way for more effective and efficient finetuning of deep learning models, ultimately leading to improved performance across various applications.

You can read all the details in our full ICML 2022 paper.