Learning Transferable Visual Models From Natural Language Supervision

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (2021)

arXiv: 2103.00020

Field

Multimodal

TLDR

The original CLIP paper. It proposes learning transferable visual representations from natural language supervision: a contrastive model is trained on 400 million image-text pairs to match each image with its caption. The resulting model performs zero-shot image classification and transfers well across tasks, and it helped establish vision-language alignment as a widely followed approach.
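
The contrastive objective and the zero-shot step described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists in place of real encoders, the toy embeddings are made up, and the fixed temperature of 0.07 is an assumption for illustration (the paper learns the temperature during training). The function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by the temperature.

    temperature=0.07 is a fixed value assumed here for illustration only.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy(logits_row, target):
    """Numerically stable softmax cross-entropy for a single row of logits."""
    m = max(logits_row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum_exp - logits_row[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image i and text i form the positive pair; every other pairing in the
    batch acts as a negative. The loss averages both matching directions."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    n = len(logits)
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class text most similar to the image."""
    logits = similarity_logits([image_emb], class_text_embs)[0]
    return max(range(len(logits)), key=lambda k: logits[k])


# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(images, matching_texts)
misaligned_loss = symmetric_contrastive_loss(images, swapped_texts)
predicted_class = zero_shot_classify([0.9, 0.1], matching_texts)
```

In this toy setup the loss is far lower when each image sits next to its own text than when the pairs are swapped, which is the signal that drives training. Zero-shot classification then needs no extra training: each candidate label is embedded as text, and the label closest to the image embedding is chosen.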
