Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and has capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, and GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large-scale models.
In computer vision, pioneering work has shown the effectiveness of single-encoder models pre-trained for image classification to capture generic visual representations that are effective for other downstream tasks. More recently, contrastive dual-encoder (CLIP, ALIGN, Florence) and generative encoder-decoder (SimVLM) approaches trained using web-scale noisy image-text pairs have been explored. Dual-encoder models exhibit remarkable zero-shot image classification capabilities but are less effective for joint vision-language understanding. On the other hand, encoder-decoder methods are good at image captioning and visual question answering but cannot perform retrieval-style tasks.
In “CoCa: Contrastive Captioners are Image-Text Foundation Models”, we present a unified vision backbone model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder approach that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be directly applicable to all types of downstream tasks. Specifically, CoCa achieves state-of-the-art results on a series of vision and vision-language tasks spanning vision recognition, cross-modal alignment, and multimodal understanding. Furthermore, it learns highly generic representations, so that with zero-shot learning or frozen encoders it can perform as well as or better than fully fine-tuned models.
|Overview of Contrastive Captioners (CoCa) compared to single-encoder, dual-encoder and encoder-decoder models.|
We propose CoCa, a unified training framework that combines contrastive loss and captioning loss on a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging single-encoder, dual-encoder and encoder-decoder paradigms.
To this end, we present a novel encoder-decoder architecture where the encoder is a vision transformer (ViT), and the text decoder transformer is decoupled into two parts, a unimodal text decoder and a multimodal text decoder. We skip cross-attention in unimodal decoder layers to encode text-only representations for contrastive loss, and cascade multimodal decoder layers with cross-attention to image encoder outputs to learn multimodal image-text representations for captioning loss. This design maximizes the model’s flexibility and universality in accommodating a wide spectrum of tasks. At the same time, it can be efficiently trained with a single forward and backward propagation for both training objectives, resulting in minimal computational overhead. Thus, the model can be trained end-to-end from scratch with training costs comparable to a naïve encoder-decoder model.
|Illustration of forward propagation used by CoCa for both contrastive and captioning losses.|
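To make the two training objectives concrete, the following is a minimal sketch of how a contrastive loss and a captioning loss might be combined into one scalar. It is an illustration under stated assumptions, not the CoCa implementation: the function names, the temperature value, and the weights `w_con` and `w_cap` are hypothetical, and the inputs are plain lists standing in for the image encoder output, the unimodal text decoder output, and the multimodal text decoder logits.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text plus text-to-image cross-entropy over a batch.

    Row i of each input is assumed to come from the same image-text pair.
    The temperature value is an assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    text_to_image = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return image_to_text + text_to_image


def captioning_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each decoding step."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids)]
    return sum(nll) / len(nll)


def combined_loss(image_embs, text_embs, step_logits, target_ids,
                  w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights here are hypothetical."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(step_logits, target_ids))
```

Because both terms are computed from the outputs of one forward pass, a single backward pass through the summed loss would update the whole model, which is the property the design relies on.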
The CoCa model can be directly fine-tuned on many tasks with minimal adaptation. By doing so, our model achieves a series of state-of-the-art results on popular vision and multimodal benchmarks, including (1) visual recognition: ImageNet, Kinetics-400/600/700, and MiT; (2) cross-modal alignment: MS-COCO, Flickr30K, and MSR-VTT; and (3) multimodal understanding: VQA, SNLI-VE, NLVR2, and NoCaps.
|Comparison of CoCa with other image-text backbone models (without task-specific customization) and multiple state-of-the-art task-specialized models.|
It is noteworthy that CoCa attains these results as a single model adapted for all tasks, while often being lighter than prior top-performing specialized models. For example, CoCa obtains 91.0% ImageNet top-1 accuracy while using less than half the parameters of prior state-of-the-art models. In addition, CoCa shows strong generative capability, producing high-quality image captions.
|Image classification scaling performance comparing fine-tuned ImageNet top-1 accuracy versus model size.|
|Text captions generated by CoCa with NoCaps images as input.|
Besides achieving excellent performance with fine-tuning, CoCa also outperforms previous state-of-the-art models on zero-shot learning tasks, including image classification and cross-modal retrieval. CoCa obtains 86.3% zero-shot accuracy on ImageNet while also robustly outperforming prior models on challenging variant benchmarks, such as ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. As shown in the figure below, CoCa obtains better zero-shot accuracy with smaller model sizes compared to prior methods.
|Image classification scaling performance comparing zero-shot ImageNet top-1 accuracy versus model size.|
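As a rough illustration of how aligned image and text embeddings can support zero-shot classification, the sketch below picks the class whose text embedding is most similar to an image embedding. This is a common recipe for contrastively trained models and is assumed here rather than taken from this post; the function names and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose text embedding best matches the image embedding.

    class_text_embs maps a label to the embedding of a text prompt for that
    label; in practice both embeddings would come from the trained model.
    """
    return max(class_text_embs,
               key=lambda label: cosine(image_emb, class_text_embs[label]))


# Toy 2-D embeddings, made up purely for illustration.
labels = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
prediction = zero_shot_classify([0.9, 0.2], labels)
```

No classifier weights are trained in this setup: adding a new class only requires embedding a new text prompt.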
Frozen Encoder Representation
One particularly exciting observation is that CoCa achieves results comparable to the best fine-tuned models using only a frozen visual encoder, in which features extracted after model training are used to train a classifier, rather than the more computationally intensive effort of fine-tuning a model. On ImageNet, a frozen CoCa encoder with a learned classification head obtains 90.6% top-1 accuracy, which is better than the fully fine-tuned performance of existing backbone models (90.1%). We also find this setup to work extremely well for video recognition. We feed sampled video frames into the CoCa frozen image encoder individually, and fuse output features by attentional pooling before applying a learned classifier. This simple approach using a CoCa frozen image encoder achieves video action recognition top-1 accuracy of 88.0% on the Kinetics-400 dataset and demonstrates that CoCa learns a highly generic visual representation with the combined training objectives.
|Comparison of Frozen CoCa visual encoder with (multiple) best-performing fine-tuned models.|
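The frozen-encoder video setup described above can be sketched as follows. This is a simplified illustration, assuming a single learned query vector for the attentional pooling and a linear classifier; the names `attentional_pool` and `classify`, and the choice of single-query scaled dot-product attention, are assumptions for this sketch rather than details from the post. Per-frame features are treated as given, since the encoder is frozen.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(frame_feats, query):
    """Fuse per-frame features into one vector with single-query attention.

    frame_feats: one feature vector per sampled frame (frozen encoder output).
    query: a learned vector of the same width (hypothetical parameter).
    """
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in frame_feats]
    weights = softmax(scores)
    return [sum(w * feat[k] for w, feat in zip(weights, frame_feats))
            for k in range(d)]


def classify(pooled, class_weights):
    """Linear classifier: index of the class with the largest logit."""
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]
    return max(range(len(logits)), key=logits.__getitem__)


# Two toy frames; a zero query gives uniform weights, i.e. mean pooling.
pooled = attentional_pool([[1.0, 0.0], [3.0, 0.0]], [0.0, 0.0])
predicted_class = classify(pooled, [[1.0, 0.0], [0.0, 1.0]])
```

Only the query and the classifier weights would be trained in such a setup, which is why it is much cheaper than fine-tuning the encoder.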
We present Contrastive Captioner (CoCa), a novel pre-training paradigm for image-text backbone models. This simple method is widely applicable to many types of vision and vision-language downstream tasks, and obtains state-of-the-art performance with minimal or even no task-specific adaptations.
We would like to thank our co-authors Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu who have been involved in all aspects of the project. We also would like to thank Yi-Ting Chen, Kaifeng Chen, Ye Xia, Zhen Li, Chao Jia, Yinfei Yang, Zhengdong Zhang, Wei Han, Yuan Cao, Tao Zhu, Futang Peng, Soham Ghosh, Zihang Dai, Xin Li, Anelia Angelova, Jason Baldridge, Izhak Shafran, Shengyang Dai, Abhijit Ogale, Zhifeng Chen, Claire Cui, Paul Natsev, Tom Duerig for helpful discussions, Andrew Dai for help with contrastive models, Christopher Fifty and Bowen Zhang for help with video models, Yuanzhong Xu for help with model scaling, Lucas Beyer for help with data preparation, Andy Zeng for help with MSR-VTT evaluation, Hieu Pham and Simon Kornblith for help with zero-shot evaluations, Erica Moreira and Victor Gomes for help with resource coordination, Liangliang Cao for proofreading, Tom Small for creating the animations used in this blogpost, and others in the Google Brain team for support throughout this project.