Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn

Meta-learning and other approaches to few-shot learning are widely studied for image recognition, and are increasingly applied to other vision tasks such as pose estimation and dense prediction. This naturally raises the question of whether there is any few-shot meta-learning algorithm capable of generalizing across these diverse task types. To support the community in answering this question, we introduce Meta Omnium, a dataset-of-datasets spanning multiple vision tasks including recognition, keypoint localization, semantic segmentation and regression. We experiment with popular few-shot meta-learning baselines and analyze their ability to generalize across tasks and to transfer knowledge between them. Meta Omnium enables meta-learning researchers to evaluate model generalization to a much wider array of tasks than previously possible, and provides a single framework for evaluating meta-learners across a wide suite of vision applications in a consistent manner. Code and dataset are available at https://github.com/edi-meta-learning/meta-omnium.


Introduction
Meta-learning is a long-standing research area that aims to replicate the human ability to learn from a few examples by learning-to-learn from a large number of learning problems [61]. This area has become increasingly important recently, as a paradigm with the potential to break the data bottleneck of traditional supervised learning [26,70]. While the largest body of work is applied to image recognition, few-shot learning algorithms have now been studied in most corners of computer vision, from semantic segmentation [37] to pose estimation [49] and beyond. Nevertheless, most of these applications of few-shot learning are advancing independently, with increasingly divergent application-specific methods and benchmarks. This makes it hard to evaluate whether few-shot meta-learners can solve diverse vision tasks. Importantly, it also discourages the development of meta-learners with the ability to learn-to-learn across tasks, transferring knowledge from, e.g., keypoint localization to segmentation, a capability that would be highly valuable for vision systems if achieved. The overall trend in computer vision [20,52] and AI [5,55] more generally is towards more general-purpose models and algorithms that support many tasks and ideally leverage synergies across them. However, it has not yet been possible to explore this trend in meta-learning, due to the lack of few-shot benchmarks spanning multiple tasks. State-of-the-art benchmarks [63,65] cover only a handful of visual domains. There is no few-shot benchmark that poses the more substantial challenge [57,77] of generalizing across different tasks. We remark that the term task is used differently in the few-shot meta-learning literature [16,26,70] (to mean different image recognition problems, such as cat vs. dog or car vs. bus) and the multi-task literature [20,57,74,77] (to mean different kinds of image understanding problems, such as classification vs. segmentation).
In this paper, we will use the term task in the multi-task literature sense, and the term episode to refer to tasks in the meta-learning literature sense, corresponding to a support and query set. We introduce Meta Omnium, a dataset-of-datasets spanning multiple vision tasks including recognition, semantic segmentation, keypoint localization/pose estimation, and regression, as illustrated in Figure 1. Specifically, Meta Omnium provides the following important contributions: (1) Existing benchmarks only test the ability of meta-learners to learn-to-learn within tasks such as classification [63,65] or dense prediction [37]. Meta Omnium uniquely tests the ability of meta-learners to learn across multiple task types.
(2) Meta Omnium covers multiple visual domains (from natural to medical and industrial images). (3) Meta Omnium provides the ability to thoroughly evaluate both in-distribution and out-of-distribution generalisation. (4) Meta Omnium has a clear hyper-parameter optimization (HPO) and model selection protocol, to facilitate future fair comparison across current and future meta-learning algorithms. (5) Unlike popular predecessors [63], and despite the diversity of tasks, Meta Omnium has been carefully designed to be of moderate computational cost, making it accessible for research in modestly-resourced universities as well as large institutions. Table 1 compares Meta Omnium to other relevant meta-learning datasets.
We expect Meta Omnium to advance the field by encouraging the development of meta-learning algorithms capable of knowledge transfer across different tasks, as well as across learning episodes within individual tasks as is popularly studied today [16,70]. In this regard, it provides the next step beyond the currently topical challenge of dealing with heterogeneity in meta-learning [1,35,63,67]. While existing benchmarks have tested multi-domain heterogeneity (e.g., recognition of written characters and plants within a single network) [63,65] and shown it to be challenging, Meta Omnium tests multi-task learning (e.g., character recognition vs. plant segmentation). This is substantially more ambitious when considered from the perspective of common representation learning. For example, a representation tuned for recognition might benefit from rotation invariance, while one tuned for segmentation might benefit from rotation equivariance [11,15,71]. Thus, in contrast to conventional within-task meta-learning benchmarks that have been criticized as relying more on common representation learning than learning-to-learn [53,62], Meta Omnium better tests the ability to learn-to-learn, since the constituent tasks require more diverse representations.

Meta-learning Benchmarks
The classic datasets in few-shot meta-learning for computer vision are Omniglot [30] and miniImageNet [66]. Later work criticized these for having insufficient task (episode) diversity, and tieredImageNet [56] used the class hierarchy of ImageNet to enforce more diversity between meta-train and meta-test episodes. The main contemporary benchmarks are CD-FSL [22], which challenges few-shot learners to generalize to new visual domains; and Meta-Dataset [63] and Meta-Album [65], which go further in requiring few-shot learners to learn from a mixture of visual domains. Such multi-domain heterogeneous meta-learning turns out to be challenging. A benchmark related to Meta-Dataset is VTAB [80], which similarly provides multiple domains for evaluating data-efficient visual recognition, but its focus is on evaluating representation transfer from large-scale pre-training rather than learning-to-learn and meta-learning. VTAB+MD [13] compares representation transfer and meta-learning approaches on the Meta-Dataset tasks. However, none of these benchmarks address multi-task meta-learning as considered here (Figure 1).
Outside of recognition, task-specific few-shot benchmarks have been proposed for vision problems such as semantic segmentation [37], regression [19], and pose/keypoint estimation [73]. These mostly lag slightly behind the recognition benchmarks in complexity, being single-domain, with the exception of [73]. With regard to multi-task meta-learning as considered here, the only existing benchmark is Meta-World [78], which is specific to robotics and reinforcement learning rather than vision.
We also mention Taskonomy [79] as a popular dataset that has been used for multi-task learning. However, it is not widely used for few-shot meta-learning. This is because, although Taskonomy has many tasks, unlike the main meta-learning benchmarks [37,63] it does not contain enough visual concepts within each task to provide a large number of concepts for meta-training together with disjoint sets of concepts for meta-validation and meta-testing of few-shot learning.

Heterogeneity in Meta-Learning
Several sophisticated methods in the literature have highlighted the challenge of addressing heterogeneity in meta-learning. These have gone under various names, such as multi-modal meta-learning [1,38,67], in the sense of a multi-modal probability distribution over tasks/episodes. However, with the exception of [38], these have mostly not been shown to scale to the main multi-modal benchmarks such as Meta-Dataset [63]. A more common approach to achieving high performance across multiple heterogeneous domains, such as those in Meta-Dataset, is to train an ensemble of feature extractors across the available training domains and fuse them during meta-testing [14,36]. However, this obviously incurs the substantial additional cost of maintaining a model ensemble. In our evaluation, we focus on the simpler meta-learners that have been shown to work in challenging multi-domain learning scenarios [63,65], while leaving sophisticated algorithmic and ensemble-based approaches for future researchers to evaluate on the benchmark.

Motivation and Guiding Principles
We first explain the motivating goals and guiding principles behind the design of Meta Omnium. The goal is to build a benchmark for multi-task meta-learning that will: (i) Encourage the community to develop meta-learners that are flexible enough to deal with greater task heterogeneity than before, and thus are more likely to be useful in practice with less curated episode distributions. This was identified as a major challenge in the discussion arising in several recent meta-learning and computer vision workshops and challenges. (ii) Ultimately, progress on this benchmark should provide practical improvements in data-efficient learning for computer vision, through the development of methods that can better transfer across different task types.
In developing this benchmark, we established a few principles that we used to guide design choices. These include: (i) The benchmark should be lightweight in terms of storage and compute, making it accessible to a broad range of researchers and not only large corporations. (ii) The benchmark should cover multiple tasks with heterogeneous output spaces (as opposed to all classification, all regression, or all dense prediction), as well as multiple visual domains. In these regards, Meta Omnium is compared to alternatives in Table 1. (iii) The initial baselines should have only minimal task-specific decoders. This is in contrast to the state of the art within various sub-disciplines of FSL, such as segmentation [25,42], keypoint estimation [39,72], and classification [2,75], where specially designed decoders are often used. This is to evaluate and encourage future research on learning-to-learn across tasks, rather than primarily benchmarking how well we can manually engineer prior knowledge of optimal task-specific decoders. While we are not opposed to future competitors on this benchmark developing task-specific decoders, these should be evaluated separately against the minimal-decoder competitors. (iv) The benchmark should provide distinct datasets for in-distribution (ID) training and out-of-distribution (OOD) evaluation, to evaluate robustness to distribution shift. This is already provided by [63,65] for classification, and we extend such an ID and OOD dataset ensemble to multiple tasks. Figure 2 illustrates our dataset and task splits. (v) The benchmark should provide a clear hyper-parameter tuning protocol. With a number of recent studies showing that hyper-parameter tuning can dominate other effects of interest in computer vision [21,32,45], this is important for a future-proof meta-learning benchmark. This is also related to the cost point (i) above: only for a benchmark with a modest cost can most institutions realistically expect to conduct hyper-parameter tuning. We
provide the hyper-parameter tuning protocol. (vi) Finally, following the debate in [35,53,62] as to the value of meta-learning vs. conventional transfer learning, the dataset should support both episodic meta-learning and conventional transfer learning approaches.

Data Splits and Tasks
For each main task (classification, segmentation, keypoint localization), we split the datasets into seen datasets available for meta-training, and unseen datasets that are completely held out for out-of-domain meta-validation and meta-testing. Similarly to [63,65], for the seen datasets we construct category-wise splits into meta-train/val/test. For the unseen datasets there is no category-wise split, as episodes from all categories of the whole dataset are used for validation and testing respectively. The overall split organization is illustrated in Figure 2. We additionally have a completely held-out task: regression. Datasets from this task are not used during meta-training.

Figure 2. Schematic of benchmark and dataset splits. For each task, there are multiple datasets, which are divided into seen (solid border) and unseen (dashed border) datasets. The seen (ID) datasets are divided class-wise into meta-train/meta-val/meta-test splits. The unseen datasets are held out for out-of-distribution (OOD) evaluation. Meta-training is conducted on the ID meta-train split of the seen datasets (blue). Models are validated on ID validation class splits, or OOD validation datasets (green). Results are reported on the ID test class splits and OOD test datasets (orange). We also hold out an entire task family (regression) for evaluating novel task generalisation.
Our multi-task setup enables us to define and compare two training protocols: Single-task meta-learning, which evaluates how well meta-learning performs when trained and tested within a particular task family (within each plate in Figure 2); and Multi-task meta-learning, which evaluates how well meta-learning performs when trained across all available task families (across all plates in Figure 2).
With this organization we can separately evaluate: In-distribution generalization (ID): How well do meta-learners generalize to novel test concepts within the seen datasets?; and Out-of-distribution generalization (OOD): How well do meta-learners generalize to novel concepts in unseen datasets?
We provide two sources of validation data, ID and OOD, and our models are selected based on the combined performance across both. OOD validation is not supported by the most popular benchmark, Meta-Dataset [63], which, despite its larger size, does not provide OOD validation datasets.

Datasets and Metrics
Given the considerations in Section 3.1, our benchmark consists of three main tasks (classification, segmentation, keypoints/pose) and one held-out task (regression).

Classification. For classification we take the 10 datasets from the initial public release of Meta-Album [65]. These images are all 128 × 128 and contain 19-706 classes per dataset, with 40 images per class. Three of these datasets are reserved for out-of-distribution meta-validation, and four for out-of-distribution meta-testing.

Segmentation. For segmentation, we take FSS1000 [37] for in-distribution (10,000 images, 1,000 classes), and combine it with VizWiz [64] for OOD meta-validation (862 images, 22 classes), and a modified Pascal 5i [58] (7,242 images, 6 classes) and the very distinct medical imaging dataset PH2 [41] (200 images, 3 classes) for OOD meta-testing. The segmentation images were originally of diverse sizes; we resize them all to 224 × 224 for Meta Omnium. Note that the VizWiz and Pascal datasets originally contain more classes and images. We exclude the classes that overlap with those in FSS1000, so that no classes overlap among the datasets.

Keypoints. For keypoints/pose, we take Animal Pose [10] for in-distribution, Synthetic Animal Pose [44] for OOD meta-validation, and MPII human pose [4] for OOD meta-testing. All images are resized to 128 × 128. MPII includes about 40k people in over 25k images with annotated body keypoints. Animal Pose includes 5 animal categories with 6k instances in over 4k images; each animal is cropped from the original image. We keep cats and dogs for training, horses and sheep for in-domain validation, and cows for in-domain testing. Synthetic Animal Pose generates synthetic images using animal CAD models rendered from various viewpoints and lighting conditions on random backgrounds. We keep only the horse and tiger categories in our final datasets.

Regression. For evaluating regression as a held-out task, we use four datasets corresponding to the
test splits of [19]: ShapeNet1D, ShapeNet2D, Distractor and Pascal1D [76]. All images are resized to 128 × 128. ShapeNet1D aims to predict azimuth angles. It contains 30 categories in total, and we keep the 3 categories from the test set. ShapeNet2D further includes 2D rotation, with azimuth angles and elevation. The test set of ShapeNet2D contains 300 categories in total, with 30 images per category. Distractor aims to predict the position of a target object in the presence of a distractor. It contains 12 categories in total, and the test set has 2 categories. Each category contains 1000 objects with 36 images each. Pascal1D aims to predict azimuth angles. The whole Pascal1D dataset contains 65 objects from 10 categories. The test set contains 15 objects with 100 images per object. The supplementary material provides full details of all datasets and splits.

Training API
For episodic learning, we proceed by (i) sampling a task, (ii) sampling a dataset, and (iii) sampling an episode. Under our main protocol we consider variable 1- to 5-shot evaluation, but also evaluate separate 1-shot and 5-shot settings (training is always done with a variable number of shots, i.e., support examples). For classification tasks, we follow [65] in generating 5-way episodes. For segmentation tasks, we follow [37] in considering each episode to be a binary foreground/background classification problem for a novel class, and generate 2-way episodes. For keypoint tasks, we form episodes by randomly selecting a class (e.g., an animal category) and then randomly selecting a subset of 5 keypoints to localize for each episode. For regression tasks, we generate variable 5- to 25-shot episodes, because it is common practice to use more shots for regression tasks [19].
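The task → dataset → episode sampling hierarchy above can be sketched as follows; the dataset registry and shot ranges here are illustrative placeholders, not the benchmark's actual API:

```python
import random

# Hypothetical registry of task families (names are illustrative only).
TASKS = {
    "classification": {"datasets": ["Meta-Album-1", "Meta-Album-2"], "shots": (1, 5)},
    "segmentation":   {"datasets": ["FSS1000"],                      "shots": (1, 5)},
    "keypoints":      {"datasets": ["AnimalPose"],                   "shots": (1, 5)},
    "regression":     {"datasets": ["ShapeNet1D"],                   "shots": (5, 25)},
}

def sample_episode(rng):
    task = rng.choice(sorted(TASKS))        # (i) sample a task family
    spec = TASKS[task]
    dataset = rng.choice(spec["datasets"])  # (ii) sample a dataset
    shots = rng.randint(*spec["shots"])     # (iii) sample a variable-shot episode
    return {"task": task, "dataset": dataset, "shots": shots}

episode = sample_episode(random.Random(0))
```

Seeding the generator, as here, is what allows identical episode sequences to be replayed across the methods being compared.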
For non-episodic/transfer learning, we provide access to the meta-train portions of the seen datasets in conventional mini-batches, for conventional single-task and multi-task supervised learning.

Evaluation Metrics. For classification tasks, we use standard top-1 accuracy; for segmentation tasks, we use the standard mean intersection-over-union (mIoU), which averages the IoU values over all object classes; for keypoint prediction, we report the Percentage of Correct Keypoints (PCK), in which a detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold. In our experiments, the threshold is 0.01 on normalized coordinates, which corresponds to about 12.8 pixels at the input image resolution. For regression tasks, we follow [19] and use the same metrics.
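As an illustration, the PCK metric described above can be sketched as follows (a minimal sketch with an illustrative threshold value, not the benchmark's evaluation code):

```python
import math

def pck(pred, true, threshold):
    """Percentage of Correct Keypoints: a keypoint counts as correct when
    the Euclidean distance between prediction and ground truth (here in
    normalized image coordinates) is within the threshold."""
    correct = sum(math.dist(p, t) <= threshold for p, t in zip(pred, true))
    return correct / len(true)

# Two of the three keypoints below fall within an illustrative 0.05 threshold.
score = pck([(0.10, 0.10), (0.50, 0.50), (0.90, 0.20)],
            [(0.12, 0.11), (0.50, 0.58), (0.90, 0.21)],
            threshold=0.05)  # → 2/3
```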

Architecture and Baseline Competitors
As discussed in Section 3.1, we aim to establish baselines that can be adapted to tasks with heterogeneous outputs, with minimal reliance on task-specific decoders. We follow [63,65] in using a ResNet-18 CNN [23] as the feature extractor architecture. For recognition tasks, we perform multi-class classification immediately after ResNet's global average pooling (GAP). For regression tasks, we perform linear regression directly after the ResNet's GAP. For keypoint tasks, we treat them as a regression problem from the feature map to the keypoint locations. For segmentation tasks, we use a simplified PSPNet-like [81] strategy: we concatenate the extracted feature maps from ResNet's feature pyramid, with upsampling where appropriate, to generate a feature map of size w × h, and then perform pixel-wise classification with a 1 × 1 convolutional layer to obtain the final segmentation map. All tasks thus have only a single learnable layer as a minimal classifier/decoder after the common ResNet feature encoder. Based on this common encoder and minimal decoder architecture, we describe our meta-learning baselines.

Prototypical Network [59] is a classic meta-learner that exploits nearest-centroid metric learning for few-shot classification. ProtoNets were adapted to segmentation tasks in PANet [68], by performing pixel-level feature matching between support prototypes and query pixels. We use the same principle together with the PSPNet-like features described earlier. To generalize ProtoNets to regression tasks such as keypoint prediction, we must relax the prototype assumption and use them as simple Gaussian kernel-regression models [7]. Specifically, we generate a feature embedding for each support example; for a query example, we then calculate the negative exponential of the distance to each support example, and use this inverse-distance-weighted sum of support-set labels as the prediction. Thus, for a regression task with support set S = {(x_i, y_i)} and query example x_q, ProtoRegression predicts

ŷ_q = Σ_{(x_i, y_i) ∈ S} exp(−d(f_θ(x_q), f_θ(x_i))) · y_i / Σ_{(x_i, y_i) ∈ S} exp(−d(f_θ(x_q), f_θ(x_i))),

where d(·, ·) denotes distance in the feature embedding space, enabling us to learn
the deep feature f_θ in the usual episodic meta-learning way. We use cross-entropy loss for classification and segmentation, and MAE loss for regression tasks.

DDRR. Deep differentiable ridge regression has been considered for few-shot recognition [6], tracking [82], and other tasks. It is related to ProtoNet in that the feature extractor is not adapted after the meta-train stage, but differs in that the decoder/classifier layer is learned by differentiable ridge regression rather than nearest-centroid or kernel regression. An elegant property of DDRR methods is that they naturally address regression tasks, although they have also been repurposed for classification [6] by conducting MSE-loss regression to a one-hot target vector. Thus they are a natural choice for our benchmark. For application to segmentation, we apply DDRR in a 1 × 1 convolution-like way, performing pixel classification for the output mask with a DDRR classifier at each pixel. Further, we calibrate the prediction for the binary cross-entropy loss with a learnable scale and bias, following [6]. DDRR uses the MAE loss for regression tasks only, and the MSE loss for all other tasks.

MAML. The seminal few-shot meta-learner MAML [16] aims to learn an initial condition for per-episode gradient descent. MAML is straightforward to adapt to different types of tasks: based on each episode's support set, a new output layer is learned and the feature extractor is updated, both by a few steps of gradient descent. Similarly to Meta-Dataset [63], we do not learn an initialization for the output layer, since it can change size between episodes drawn from multiple tasks. To alleviate this challenge, we also follow Meta-Dataset in evaluating Proto-MAML, a variant that initializes the MAML output layer from the linear classifier/regressor suggested by nearest-centroid matching prior to gradient descent. Going beyond this, to adapt Proto-MAML to regression tasks, we also initialize the output layer based on the ridge-regression solution on the support set.

Meta-Curvature.
Meta-Curvature [48] is an enhancement of MAML that learns a pre-conditioning matrix to improve inner-loop adaptation, as well as an initial condition as in standard MAML. Meta-Curvature outperforms MAML in simpler single-task few-shot benchmarks.
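As a concrete illustration of the kernel-regression ProtoNet extension described earlier, the prediction rule can be sketched as follows (a minimal sketch on plain feature vectors; the squared-Euclidean distance is an assumption, not necessarily the distance used in the benchmark code):

```python
import math

def proto_regress(support_feats, support_labels, query_feat):
    """Weight each support label by the negative exponential of its
    (squared Euclidean) feature distance to the query, then normalize:
    the inverse-distance-weighted sum of support labels is the prediction."""
    weights = [
        math.exp(-sum((q - s) ** 2 for q, s in zip(query_feat, feat)))
        for feat in support_feats
    ]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, support_labels)) / total

# The prediction is pulled toward the labels of nearby support embeddings.
y_hat = proto_regress([[0.0, 0.0], [1.0, 1.0]], [0.0, 10.0], [0.1, 0.0])
```

Because the weights are a differentiable function of the embeddings, the feature extractor can be meta-trained episodically through this predictor.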
Transfer Learning. We also consider standard supervised learning on the meta-train tasks for transfer to the target tasks, a strategy reported to be competitive with meta-learning [62]. For adaptation, we explore both linear readout [62,69] and fine-tuning [65]. Besides learning a new output layer from scratch, we also consider a fine-tuning variant that initializes the classifier weights using class prototypes (recognition/segmentation) or ridge-regression weights (keypoints/regression), inspired by Proto-MAML.

Train-from-Scratch (TFS). We lastly consider training on each episode from scratch, using only the support set [65].

Hyperparameter Optimization
As part of our benchmark, we perform hyperparameter optimization (HPO) to ensure we select appropriate hyperparameters for the diverse tasks that we consider. Multi-objective HPO under restricted resources is challenging, so we devise the following HPO protocol: estimate the performance of each candidate configuration at a lower fidelity (a smaller number of iterations), and then identify the configuration that works best across all validation datasets considered (a combination of in-domain and out-of-domain datasets, across the various task families). Note that fast multi-fidelity methods such as Hyperband [34], ASHA [33] or PASHA [8] are not applicable out of the box in our multi-objective setup, so we train each candidate configuration for a fixed 5,000 training iterations. Since different tasks/datasets have different difficulties (and use different metrics), we normalize the score of each configuration on each validation dataset by the best score for that dataset across all candidate configurations. We then select the configuration with the best average normalized score.
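The normalize-then-average selection rule above can be sketched as follows (configuration names and scores are hypothetical; this is not the released HPO code):

```python
def select_config(scores):
    """scores[config][dataset] -> validation score, assumed positive and
    higher-is-better (losses should be converted first). Normalize per
    dataset by the best score across configurations, then pick the
    configuration with the best average normalized score."""
    datasets = next(iter(scores.values())).keys()
    best = {d: max(s[d] for s in scores.values()) for d in datasets}
    avg = {c: sum(s[d] / best[d] for d in datasets) / len(best)
           for c, s in scores.items()}
    return max(avg, key=avg.get)

# Hypothetical scores for two candidate configurations on two validation sets.
cfg = select_config({
    "cfg_a": {"cls_id": 0.60, "seg_ood": 0.30},
    "cfg_b": {"cls_id": 0.55, "seg_ood": 0.40},
})  # → "cfg_b" (normalized averages 0.875 vs. ~0.958)
```

The normalization step is what allows accuracy-style and IoU-style metrics of very different scales to be averaged meaningfully.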
Note that due to resource constraints we are only able to sample a relatively small number of candidates (30), so we utilize a sample-efficient state-of-the-art multi-objective TPE method [47], available from the Optuna library [3]. We perform HPO separately for the multi-task and single-task setups, so single-task classification, segmentation and keypoint estimation each have their own set of hyperparameters, and the multi-task case has its own set. The hyperparameters include the meta-learning rate and optimizer, momentum, and various method-specific hyperparameters (full details are in the supplementary). Once the hyperparameters are chosen, we perform standard training of the model for the full number of iterations.

Experiments
In this section, we aim to use our benchmark to answer the following questions: (1) Which meta-learner performs best on average across a heterogeneous range of tasks? Existing benchmarks have evaluated meta-learners for one task at a time; we now use our common evaluation platform to find out whether any meta-learner can provide general-purpose learning-to-learn across different task types, or whether each task type prefers a different learner. Similarly, we can ask which meta-learner is most robust to out-of-distribution tasks. (2) Having defined the first multi-task meta-learning benchmark, and generalizations of seminal meta-learners to different kinds of output spaces, we ask which meta-learner performs best for multi-task meta-learning? More generally, is there a trend in gradient-based vs. metric-based meta-learner success? (3) Does multi-task meta-learning improve or worsen performance compared to single-task?
The former obviously provides more meta-training data, which should be advantageous, but the increased heterogeneity across meta-training episodes in the multi-task case also makes it harder to learn [63,67]. (4) How does meta-learning perform compared to simple transfer learning, or learning from scratch?

Experimental Settings
We train each meta-learner for 30,000 meta-training iterations, with meta-validation after every 2,500 iterations (used for checkpoint selection). For evaluation during meta-testing we use 600 episodes for each corresponding dataset, and for meta-validation we use 1,200 episodes in total. We use random seeds to ensure that the same episodes are used across all methods that we compare. For transfer learning approaches (fine-tuning, training from scratch, etc.), we use 20 update steps during evaluation. We only retain the meta-learned shared feature extractor across tasks, and for each new evaluation episode we randomly initialize the output layer, so that we can support any number of classes as well as novel task families during meta-testing (in line with [65]).

Results
The main experimental results are shown in Table 2, where rows correspond to different few-shot learners, and columns report the average performance on test episodes, aggregated across multiple datasets in each task family, and broken down by "seen" datasets (ID) and "unseen" datasets (OOD). The table also reports the average rank of each meta-learner across the datasets, both overall and broken down by ID and OOD datasets. More specifically, for each setting (e.g., classification ID) we calculate the rank of each method (separately for STL and MTL), and then average those ranks across classification, segmentation and keypoints. From the results, we can draw the following main conclusions: (1) ProtoNet is the most versatile meta-learner, as shown by its highest average rank in the single-task scenario. This validates our novel kernel-regression extension of ProtoNet for tackling regression-type keypoint localization tasks. Somewhat surprisingly, ProtoNet is also the most robust to out-of-distribution (OOD) episodes, which differs from the conclusion of [63] and others, who suggested that gradient-based adaptation is crucial for adapting to OOD data. However, it is in line with the results of [65] and the strong performance of prototypes more broadly [9].
(2) Turning to multi-task meta-learning, the situation is similar in that ProtoNet dominates the other competitors, but it now shares first place with Proto-MAML.
(3) To compare single-task and multi-task meta-learning (top and bottom blocks of Table 2) more easily, Figure 3 shows the difference in meta-testing episode performance between STL and MTL meta-training for each method. Overall, STL outperforms the MTL condition, showing that the difficulty of learning from heterogeneous tasks [57,77] outweighs the benefit of the extra available multi-task data. (4) Finally, comparing meta-learning methods with simple transfer learning methods, as discussed in [13,62], the best meta-learners are clearly better than transfer learning in both single- and multi-task scenarios.
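The average-rank aggregation used for these comparisons can be sketched as follows (method names and scores are hypothetical, and ties are ignored for simplicity):

```python
def average_ranks(scores):
    """scores[setting][method] -> metric value (higher is better).
    Rank methods within each setting (rank 1 is best), then average
    each method's ranks across settings."""
    methods = next(iter(scores.values())).keys()
    ranks = {m: [] for m in methods}
    for setting in scores.values():
        ordered = sorted(setting, key=setting.get, reverse=True)
        for r, m in enumerate(ordered, start=1):
            ranks[m].append(r)
    return {m: sum(rs) / len(rs) for m, rs in ranks.items()}

# Hypothetical per-setting scores for two methods across three settings.
avg = average_ranks({
    "cls_id":  {"ProtoNet": 0.7, "MAML": 0.6},
    "seg_ood": {"ProtoNet": 0.4, "MAML": 0.5},
    "kp_id":   {"ProtoNet": 0.8, "MAML": 0.7},
})  # → ProtoNet 4/3, MAML 5/3
```

Averaging ranks rather than raw metrics sidesteps the fact that accuracy, mIoU and PCK live on incomparable scales.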
We also note that Proto-MAML is better than MAML in the multi-task case, likely due to the importance of a good output-layer initialization in the case of heterogeneous episodes, as per [63]. Meta-Curvature outperforms MAML in single-task in-domain scenarios, in line with previous results [48], but it did not achieve stronger performance out-of-domain or in the multi-task case. Finally, while DDRR is perhaps the most elegant baseline in terms of most naturally spanning all task types, its overall performance is middling.

Additional Analysis
How well can multi-task meta-learners generalize to completely new held-out tasks? We take the multi-task meta-learners (trained on classification, segmentation, and keypoints) and evaluate them on four regression benchmarks inspired by [19]: ShapeNet1D, ShapeNet2D, Distractor and Pascal1D. Because the metrics differ across datasets, we analyse the rankings and summarise the results in Table 3. We see that the basic TFS performs the worst, with MAML, ProtoNets and DDRR being the best. However, in several cases the results were not better than predicting the mean (full results in the appendix), showing that learning-to-learn of completely new task families is an open challenge.

How much does external pre-training help? Our focus is on assessing the efficacy of meta-learning rather than representation transfer, but we also aim to support researchers investigating the impact of representation learning on external data prior to meta-learning [13,27,80]. We therefore specify evaluation conditions where external data outside our defined in-distribution meta-training set is allowed. We take two high-performing approaches in the multi-task scenario, Proto-MAML and ProtoNet, and investigate to what extent external pre-training helps. We use a standard ImageNet1k pre-trained ResNet-18, prior to conducting our meta-learning pipeline as usual. We use the same hyperparameters as selected earlier for these models, ensuring that any differences in performance are not due to a better selection of hyperparameters. The results in Table 4 show that pre-training is not necessarily helpful in the considered multi-task setting, in contrast to purely recognition-focused evaluations [13,27,80], which were unambiguously positive about representation transfer from external data.

Analysis of Runtimes
We analyze the time that the different meta-learning approaches spend on meta-training, meta-validation and meta-testing in the multi-task case of our benchmark. The results in Table 5 show that all experiments are relatively lightweight, despite the ambitious goal of our benchmark to learn a meta-learner that can generalize across various task families. Most notably, we observe that ProtoNets are the fastest approach, alongside being the best-performing one. Note that fine-tuning and training from scratch are expensive at test time, as they use backpropagation with a larger number of steps. Beyond benchmarking multi-task meta-learning per se, future directions include: studying multi-task optimisation [77] in meta-learning, studying HPO for meta-learning, developing validation strategies for meta-learning (using ID vs OOD validation sets [32]), and studying the benefit of task-specific decoders and external data.

Conclusion
We have introduced Meta Omnium, the first multi-task few-shot meta-learning benchmark for computer vision. The benchmark is challenging in multiple highly topical ways, including requiring learning on heterogeneous task distributions, evaluating generalization to out-of-distribution datasets, and uniquely challenging meta-learners to learn-to-learn and transfer knowledge across tasks with heterogeneous output spaces. Meta Omnium is nevertheless lightweight enough to be of broad interest and use for driving future research, and even to support future research in hyper-parameter optimization for meta-learning.

A. Full Dataset Details
We describe the full details of our multi-task meta-dataset in Table 6 and provide further high-level details in this section.
• Segmentation datasets: We first split the FSS1000 [37] dataset into in-domain train, validation and test sets, i.e. FSS1000-Trn, FSS1000-Val and FSS1000-Test. We use the VizWiz [64] dataset for out-of-domain validation, and modified versions of the Pascal 5i [58] and PH2 [41] datasets for out-of-domain testing. We exclude the object classes from the out-of-domain datasets that overlap with FSS1000, to ensure the classes used during validation and testing are never seen during training.
• Keypoint estimation datasets: We use three keypoint datasets in the paper: animal pose [10], synthetic animal pose [44] and human pose [4]. A single animal/human image is cropped from the original picture according to the maximum and minimum absolute keypoint coordinates. The boundary is extended by 5 pixels to avoid losing important information at object edges. Different keypoint datasets have different target keypoints, so we cannot use a trivial N-way K-shot formulation as in classification (i.e. sampling K examples from each of N categories). Instead, we sample each keypoint task from one object category with a fixed number of keypoints. In detail, we randomly select 5 keypoints per task, and train and fit the model to predict only those 5 keypoints. This leads to a general meta-learning keypoint prediction model that learns to predict the corresponding keypoints from the limited support labels, making it applicable to an arbitrary number of keypoint prediction tasks when applied to more complex keypoint datasets.
• Regression datasets: We use regression datasets only for out-of-task (OOT) meta-test evaluation, so they are not used during meta-training. More specifically, we use ShapeNet1D [19], ShapeNet2D [19], Distractor [19] and Pascal1D [76]. Because regression problems typically require a larger number of examples for adaptation, we use 5 times as many support examples as in the other cases (e.g. instead of 5-shot we have a 25-shot case). For our analysis experiments we consider the equivalent of the variable 1-to-5-shot setting: a variable 5-to-25-shot setting.
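The keypoint episode construction described above (one category per task, with a fixed random subset of 5 keypoints) can be sketched as follows. The data layout and function name are illustrative assumptions, not the released code:

```python
import random

def sample_keypoint_task(annotations, k_support, k_query, n_keypoints=5, rng=None):
    """annotations: category -> list of (image, {keypoint_name: (x, y)}).
    Samples one category, fixes a random subset of `n_keypoints` keypoints,
    and returns support/query examples restricted to those keypoints."""
    rng = rng or random.Random()
    category = rng.choice(sorted(annotations))
    examples = annotations[category]
    # Only keypoints annotated in every example of this category are eligible.
    common = sorted(set.intersection(*(set(kps) for _, kps in examples)))
    chosen = rng.sample(common, n_keypoints)
    picked = rng.sample(examples, k_support + k_query)
    episode = [(img, {k: kps[k] for k in chosen}) for img, kps in picked]
    return episode[:k_support], episode[k_support:]
```

The model then only ever sees 5 keypoint targets per episode, which is what makes a single regression head reusable across datasets with different keypoint vocabularies.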

B. Additional Analysis
How do gradient-based meta-learners adapt their layers? A recent debate in few-shot meta-learning has been around whether gradient-based meta-learners really learn to adapt, or simply reuse features without adaptation. [53] claimed that feature reuse was the dominant effect, after measuring the representational change pre- and post-adaptation and finding that representational change occurred primarily in the output layer. We analyze this using Canonical Correlation Analysis (CCA) [43,54] for Meta Omnium, reporting the representational change of multi-task MAML by layer for each task family during meta-testing. From the results in Figure 4, we observe that: (1) the degree of representational change varies substantially with tasks; (2) similar to [53], there is greater representational change at the later layers, especially the final output layer. However, a significant amount of adaptation also occurs in the earlier layers, which we attribute to the greater diversity of tasks and visual domains in Meta Omnium compared to the simple recognition episodes in miniImageNet studied by [53].
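The per-layer similarity underlying this analysis can be sketched with a QR/SVD-based mean canonical correlation, where values near 1 indicate little representational change during adaptation. This is a simplified numpy sketch, not the exact CCA variant of [43,54]:

```python
import numpy as np

def mean_cca(feats_pre, feats_post):
    """Mean canonical correlation between pre- and post-adaptation features
    of one layer (both n_examples x n_features). Values near 1.0 mean the
    layer's representation barely changed during adaptation."""
    X = feats_pre - feats_pre.mean(axis=0)
    Y = feats_post - feats_post.mean(axis=0)
    qx, _ = np.linalg.qr(X)
    qy, _ = np.linalg.qr(Y)
    # Canonical correlations are the singular values of Qx^T Qy.
    sing = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(sing.mean())

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))
unchanged = mean_cca(feats, feats)                      # identical features
changed = mean_cca(feats, rng.normal(size=(100, 8)))    # unrelated features
```

Representational change per layer can then be reported as `1 - mean_cca(...)`, computed separately for each layer and task family.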
Our HPO is reasonably fast, generally taking from a few hours up to two days in the slowest cases (using a single NVIDIA 1080 Ti GPU with 12GB of memory and 4 CPUs). As a result, it is feasible to run the HPO even with modest resources when designing new approaches for our multi-task scenario. We provide the found hyperparameters within the released code.

C.2. Experimental Settings
Many of our experimental settings follow Meta-Album [65], whose code-base we have also used as the starting point. All approaches use one task per meta-batch. We use 5 inner-loop steps during meta-training and 10 inner-loop steps during evaluation for MAML, Proto-MAML and Meta-Curvature. We use gradient clipping of 5. DDRR uses an adjustment layer, the scale of which is initialized to 5.0 (with the adjust base set to 1.0). Proto-FineTuning, FineTuning, Linear-Readout and TFS use 20 fine-tuning steps during evaluation. The training minibatch size for these approaches is 16, while the testing minibatch size is 4. We use standard ImageNet normalization for segmentation tasks, but do not use normalization in the other cases, following earlier work [65].
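For reference, the settings above can be collected into a single configuration. The key names are illustrative, not the identifiers used in the released code:

```python
# Experimental settings as stated in the text (key names are illustrative).
CONFIG = {
    "meta_batch_size": 1,        # one task per meta-batch
    "inner_steps_train": 5,      # MAML, Proto-MAML, Meta-Curvature
    "inner_steps_eval": 10,
    "grad_clip": 5.0,
    "ddrr_scale_init": 5.0,      # DDRR adjustment-layer scale
    "ddrr_adjust_base": 1.0,
    "finetune_steps_eval": 20,   # (Proto-)FineTuning, Linear-Readout, TFS
    "train_minibatch_size": 16,
    "test_minibatch_size": 4,
}

def inner_loop_steps(phase):
    """More inner-loop steps are used at evaluation than at meta-training."""
    return CONFIG["inner_steps_train"] if phase == "train" else CONFIG["inner_steps_eval"]
```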
We train each model for 30,000 iterations and evaluate it on validation data after every 2,500 tasks, including at the beginning and the end (used for early stopping and model selection). We use 5-way tasks during both training and evaluation. The number of shots is between 1 and 5 during meta-training, and we consider three setups for evaluation: variable 1-to-5-shot (primary), 1-shot and 5-shot (presented in the appendix). The query size is 5 examples per category, selected to be consistent across the different datasets. Validation uses 600 tasks for each of in-domain and out-of-domain evaluation. Testing uses 600 tasks per dataset to provide a more rigorous evaluation.
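The variable-shot episode construction can be sketched as follows. We assume a single shot count is drawn per episode and shared by all classes (a simplifying assumption; shots could also vary per class), and the data layout is illustrative:

```python
import random

def sample_episode(class_to_images, n_way=5, min_shot=1, max_shot=5,
                   query_per_class=5, rng=None):
    """Sample one variable-shot episode: n_way classes, a shot count drawn
    uniformly from [min_shot, max_shot], and a fixed-size query set."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(class_to_images), n_way)
    k_shot = rng.randint(min_shot, max_shot)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = rng.sample(class_to_images[cls], k_shot + query_per_class)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    return support, query
```

With the defaults this yields 5-way episodes with 5 to 25 support examples and always 25 query examples, matching the fixed query size used for consistent evaluation across datasets.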
During evaluation, we randomly initialize the top-layer weights (classifier) to enable any-way predictions, in line with previous literature [65]. We do this for the approaches that perform fine-tuning (e.g. MAML or the Fine-Tuning baseline). Note that in approaches such as Proto-MAML, the top layer is instead initialized using weights derived from the prototypes or the ridge regression solution.
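The prototype-derived initialization can be written so that the linear layer's logits equal the squared-Euclidean ProtoNet scores up to a term that is constant across classes, following the Proto-MAML formulation of [63]. A minimal numpy sketch (function name is illustrative):

```python
import numpy as np

def proto_init(support_feats, support_labels, n_way):
    """Initialize a linear output layer from class prototypes: with
    weight = 2*c_k and bias = -||c_k||^2, the logit x.W_k + b_k equals
    -||x - c_k||^2 + ||x||^2, i.e. the ProtoNet score up to a
    class-independent term, so argmax predictions match ProtoNet."""
    protos = np.stack([support_feats[support_labels == k].mean(axis=0)
                       for k in range(n_way)])
    weight = 2.0 * protos                 # (n_way, feat_dim)
    bias = -np.sum(protos ** 2, axis=1)   # (n_way,)
    return weight, bias
```

This gives the inner-loop optimization a sensible starting point for any-way episodes, which is one plausible reason Proto-MAML outperforms MAML on heterogeneous episodes.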

D. Detailed Per-Dataset Results
We include detailed per-dataset results (variable-shot evaluation), first showing the single-task learning results for classification, segmentation and keypoint estimation, followed by multi-task learning results. In each case, we separately report the results for in-domain and out-of-domain evaluation. We also include detailed results for our out-of-task evaluation using regression datasets. Summary 1-shot and 5-shot results are included for both the single-task and multi-task settings.

Figure 1 .
Figure 1. Illustration of the diverse visual domains and task types in Meta Omnium. Meta-learners are required to generalize across multiple task types, multiple datasets, and held-out datasets.

Figure 3 .
Figure 3. Analysis of the differences in scores between single-task (STL) and multi-task (MTL) learning for different methods.

Figure 4 .
Figure 4. Analysis of the layer adaptation by MAML in Meta Omnium.

Table 1 .
Feature comparison between Meta Omnium and other few-shot meta-learning benchmarks. Meta Omnium uniquely combines a rich set of tasks and visual domains with a lightweight size for accessible use.
arXiv:2305.07625v1 [cs.CV] 12 May 2023

Table 2 .
Main Results. Results are presented as averages across the datasets within each task type, and separately for in-distribution (ID) and out-of-distribution (OOD) datasets. Classification, segmentation and keypoint results are reported in accuracy (%), mIoU (%) and PCK (%) respectively. The upper and lower groups correspond to multi-task and single-task meta-training prior to evaluation on the same set of meta-testing episodes. Upper and lower sub-row groups correspond to meta-learners and non-meta-learners respectively. See the appendix for a full breakdown over individual datasets.

Table 3 .
Average ranking of the different methods across four out-of-task regression datasets.

Table 4 .
Analysis of the impact of external-data pre-training for selected meta-learners in the multi-task learning condition. The results show that ImageNet pre-training does not necessarily help improve performance. Cls., Segm. and Keyp. represent classification, segmentation and keypoint respectively.
Discussion and Future Work. In future, Meta Omnium can be used in a variety of ways beyond benchmarking.

Table 5 .
Analysis of the times needed by different algorithms in the multi-task setting (using one NVIDIA 1080 Ti GPU and 4 CPUs).

Table 6 .
Details of all task families included in Meta Omnium.

Acknowledgements
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) grant number EP/S000631/1; the MOD University Defence Research Collaboration (UDRC) in Signal Processing; the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh; the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics; and Samsung AI Center, Cambridge. This project was supported by the Royal Academy of Engineering under the Research Fellowship programme. This work was also supported by the Key Project Plan of Blockchain in Ministry of Education of the People's Republic of China (Grant No. 2020KJ010802), the Innovation and Transformation Fund of Peking University Third Hospital (Grant No. BYSYZHKC2021115) and the China Scholarship Council. Yongshuo Zong is supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.

Table 7 .
In-domain single-task classification results. Mean test accuracy (%) and 95% confidence interval across test tasks.
± 1.07 52.00 ± 1.13 45.11 ± 1.06 86.55 ± 0.98
ProtoNet 59.68 ± 1.15 51.17 ± 1.04 43.65 ± 1.06 83.02 ± 1.03
DDRR 60.28 ± 1.19 48.70 ± 1.04 42.83 ± 1.04 83.17 ± 1.10
Proto-FineTuning 51.50 ± 1.27 41.92 ± 1.06 39.54 ± 1.03 69.77 ± 1.39
FineTuning 46.83 ± 1.13 41.37 ± 0.96 36.03 ± 0.90 68.39 ± 1.23
Linear-Readout 52.68 ± 1.02 46.24 ± 1.00 41.39 ± 0.94 73.45 ± 1.13
TFS 43.87 ± 1.03 36.36 ± 0.90 35.07 ± 0.91 52.54 ± 1.33

Table 8 .
Out-of-domain single-task classification results. Mean test accuracy (%) and 95% confidence interval across test tasks.

Table 10 .
Out-of-domain single-task segmentation results. Mean test mIoU (%) and 95% confidence interval across test tasks. Larger mIoU is better.

Table 11 .
In-domain single-task keypoint estimation results. Mean test PCK (%) and 95% confidence interval across test tasks. Larger PCK is better.

Table 12 .
Out-of-domain single-task keypoint estimation results. Mean test PCK (%) and 95% confidence interval across test tasks. Larger PCK is better.

Table 15 .
Evaluation of multi-task models on out-of-task regression datasets, using variable 5-to-25-shot episodes. Lower value is better.

Table 16 .
5-way 1-shot results, reporting the same metrics as in our primary table with variable-shot results.

Table 17 .
5-way 5-shot results, reporting the same metrics as in our primary table with variable-shot results.