radford2021-clip
arXiv: 2103.00020
TLDR(中文)
用 4 亿对图文做对比学习,得到通用视觉 encoder。CLIP embedding 至今是几乎所有多模态系统(DALL·E、Stable Diffusion、LLaVA)的视觉前端。
TLDR (English)
Uses 400M image-text pairs for contrastive learning to obtain universal vision encoder. CLIP embeddings remain the vision frontend for almost all multimodal systems (DALL·E, Stable Diffusion, LLaVA) today.