跳转到内容

radford2021-clip

arXiv: 2103.00020

TLDR(中文)

用 4 亿对图文做对比学习,得到通用视觉 encoder。CLIP embedding 至今是几乎所有多模态系统(DALL·E、Stable Diffusion、LLaVA)的视觉前端。

TLDR (English)

Uses 400M image-text pairs for contrastive learning to obtain universal vision encoder. CLIP embeddings remain the vision frontend for almost all multimodal systems (DALL·E, Stable Diffusion, LLaVA) today.