Learning Transferable Visual Models From Natural Language Supervision
arXiv: 2103.00020
TLDR (English)
The original CLIP paper, which proposes learning transferable visual representations from natural language supervision. Trained with a contrastive objective on 400 million image-text pairs, CLIP performs zero-shot image classification and transfers well across tasks, establishing a new paradigm for vision-language alignment.
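To make the contrastive objective and the zero-shot step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the embeddings are hand-written toy vectors standing in for encoder outputs, the temperature is fixed at an illustrative 0.07 rather than learned, and the function names (`contrastive_loss`, `zero_shot_classify`, and the helpers) are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by a temperature.

    The temperature is a fixed illustrative value here (an assumption of this sketch).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy(logits_row, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of the batch is the only positive for item k."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    n = len(logits)
    # Image-to-text direction: each row should peak on the diagonal.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Return the index of the class whose text embedding best matches the image."""
    logits = similarity_logits([image_emb], class_text_embs, temperature)[0]
    return max(range(len(logits)), key=lambda k: logits[k])


# Toy 2-D embeddings (made up for illustration).
aligned_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
predicted_class = zero_shot_classify([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

In this sketch the loss is low when each image embedding sits closest to its own caption and high when the pairs are shuffled. Zero-shot classification reuses the same similarity scores: each class name is embedded as text, and the image is assigned to the closest one, with no classifier trained on the target task.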