Visual Instruction Tuning

Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee (2023)

Field

Multimodal

TLDR

LLaVA connects a CLIP vision encoder to a LLaMA-based language model and fine-tunes it on multimodal instruction-following data synthesized with GPT-4. With modest compute, it produced one of the first open-source GPT-4V-style models and became the starting point for the open-source multimodal ecosystem (LLaVA-1.5/1.6, Qwen-VL, InternVL).
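
The summary does not say how the vision encoder and the language model are joined. A minimal sketch, assuming a single trainable linear projection that maps visual patch features into the language model's embedding space and prepends them to the text tokens: the toy dimensions, random stand-in features, and the helper names `linear_project` and `build_input_sequence` are all hypothetical and chosen only for illustration.

```python
import random

def linear_project(features, weight, bias):
    """Map each visual feature vector (length d_vis) to the LLM embedding size (d_llm)."""
    return [
        [sum(f * w for f, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]

def build_input_sequence(visual_tokens, text_embeddings):
    """Prepend the projected visual tokens to the text token embeddings."""
    return visual_tokens + text_embeddings

# Toy sizes (assumptions for illustration; real models are far larger).
D_VIS, D_LLM, N_PATCHES, N_TEXT = 4, 6, 3, 2
rng = random.Random(0)

# Random stand-ins for vision-encoder patch features and LLM text embeddings.
patch_features = [[rng.uniform(-1, 1) for _ in range(D_VIS)] for _ in range(N_PATCHES)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(N_TEXT)]

# Trainable projection: weight is D_LLM rows by D_VIS columns, bias has D_LLM entries.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(D_VIS)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

visual_tokens = linear_project(patch_features, weight, bias)
sequence = build_input_sequence(visual_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of size 6
```

The point of the sketch is that the language model sees image patches as extra tokens in its own embedding space, so the projection is the only new component that has to be learned from scratch.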

Appears in these articles

RAG and Retrieval Augmentation: Giving Models External Memory
