frantar2022-gptq
arXiv: 2210.17323
TLDR
First method to quantize 175B-parameter models to 3 or 4 bits per weight on a single GPU with negligible accuracy loss. For 175B models this cut inference from a multi-GPU A100 setup to a single 80 GB A100 (at 3 bits); for smaller open models, 4-bit weights fit on one consumer GPU, which helped popularize running open-source LLMs locally.
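The hardware claim comes down to simple arithmetic on weight storage. The sketch below is an illustration only, not something from the paper: the helper name `weight_memory_gb` is hypothetical, 1 GB is taken as 1e9 bytes, and the estimate ignores activations, the KV cache, and quantization metadata such as scales and zero points.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Rough estimate only: ignores activations, KV cache, and the
    per-group scales/zero points that a quantized format also stores.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Compare a 175B-parameter model at FP16, 4-bit, and 3-bit.
    for bits in (16, 4, 3):
        print(f"175B @ {bits}-bit: {weight_memory_gb(175e9, bits):.1f} GB")
    # A smaller model for comparison with consumer-GPU memory sizes.
    print(f"7B @ 4-bit: {weight_memory_gb(7e9, 4):.1f} GB")
```

Under these assumptions, 175B weights need about 350 GB at FP16 and about 87.5 GB at 4 bits, so the consumer-GPU case applies to much smaller models, where 4-bit weights come to only a few GB.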