Learning Transferable Visual Models From Natural Language Supervision

作者： Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (2021)

arXiv： 2103.00020

领域

多模态

TLDR（中文）

用 4 亿对图文做对比学习，得到通用视觉 encoder。CLIP embedding 至今是几乎所有多模态系统（DALL·E、Stable Diffusion、LLaVA）的视觉前端。

TLDR (English)

Uses 400M image-text pairs for contrastive learning to obtain universal vision encoder. CLIP embeddings remain the vision frontend for almost all multimodal systems (DALL·E, Stable Diffusion, LLaVA) today.

出现在这些文章里

RAG 与检索增强：让模型有外部记忆
RAG and Retrieval Augmentation: Giving Models External Memory

同被引用

这些论文与本文出现在同一篇文章中

Learning Transferable Visual Models From Natural Language Supervision

领域

TLDR（中文）

TLDR (English)

出现在这些文章里

同被引用

相关论文