AI summary
Today's large language models (LLMs) deliver excellent performance across a wide range of domains, but deploying and serving them requires enormous computational resources, which drives up cost. To address this, many researchers apply quantization to shrink the memory footprint of LLMs and speed up inference. Notably, however, low-bit quantization often degrades performance. Unlike earlier work that studies how quantization affects the overall performance of LLMs, this paper digs deeper into how quantization affects the emergent abilities that are unique to and a key strength of LLMs, such as ICL, CoT, and IF, and into how to mitigate the performance loss of quantized models on these tasks.
notion image

Background & Motivation

  1. Large language models (LLMs) achieve superior performance but require significant computational resources to deploy and use.
  2. LLMs exhibit emergent abilities as a prominent feature, and there is a strong dependency between emergent abilities and parameter scale.
  3. Quantization methods have been widely applied to reduce the memory footprint of LLMs and to increase inference speed.
🤕
But low-bit quantization methods often lead to performance degradation. It is therefore important to understand (1) how quantization impacts the capacity of LLMs, and (2) what the lowest bit precision is at which quantization still achieves decent performance on a specific task. Previous studies focused on overall performance and lack a deep investigation into LLMs' abilities on complex tasks.
So this paper made the following contribution 🤗:
  1. Examine whether emergent abilities exist in quantized LLMs.
  2. If so, what level of performance can they achieve?
  3. How can the performance of low-bit models be improved?
    1. Study which components (or substructures) are more sensitive to quantization.
    2. Compensate for the performance loss through model fine-tuning.

Related work

LLMs

First, a brief introduction to language modeling (LM):
🗣️
LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens.
The development of language models can be divided into four stages:
  1. 📏 Statistical Language Models (SLM): these models predict the likelihood of word sequences based on probability and statistics, using word probability distributions and conditional probabilities between words. A well-known example is the n-gram model, which predicts the probability of the next word given the previous n-1 words. A major limitation is that such models handle long-range dependencies and rare words poorly.
  2. 🧠 Neural Language Models (NLM): these models use neural networks, such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), to predict the likelihood of word sequences. They can capture longer dependencies and learn continuous word representations, which helps them handle rare words and more complex semantic relationships.
  3. ♾️ Pretrained Language Models (PLM): the neural network is first pre-trained on large amounts of unlabeled data and then fine-tuned on a specific task (this sets the "pre-training and fine-tuning" learning paradigm). Pre-training usually involves unsupervised or self-supervised objectives such as language modeling. The main advantage is that the rich linguistic knowledge learned during pre-training improves performance on downstream tasks. Well-known PLMs include BERT, GPT, and RoBERTa.
  4. 💥 Large Language Models (LLM): these models involve very large numbers of parameters and massive amounts of data, allowing them to capture more complex language patterns and structures. They typically require substantial computational resources to train, but they achieve state-of-the-art performance on a wide range of language tasks. OpenAI's GPT-3 is a famous example.
Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters, which are trained on massive text data, such as GPT-3, PaLM, Galactica, and LLaMA. [1]

Emergent abilities

An ability is emergent if it is not present in smaller models but is present in larger models. [2]
In other words, these abilities emerge suddenly only once the parameter scale (among other factors) reaches a certain threshold; they include chain-of-thought reasoning (CoT), in-context learning (ICL), and instruction following (IF).

In-context learning (ICL)

In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration without requiring additional training or gradient update. Essentially, it estimates the likelihood of the potential answer conditioned on the demonstration by using a well-trained language model. [3]
https://arxiv.org/abs/2301.00234

Chain-of-Thought (CoT)

A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output. [4]
https://arxiv.org/abs/2201.11903
Smaller language models usually struggle with complex tasks that involve multiple reasoning steps, such as mathematical word problems. With the CoT strategy, LLMs can solve such tasks. The core of CoT is to include intermediate reasoning steps in the in-context exemplars that lead to the final answer (zero-shot CoT variants appeared later). CoT opened up a new path, allowing models to solve problems step by step through multi-step reasoning.
Some studies suggest that this CoT ability is obtained by training on large amounts of programming-language data: code is essentially a sequence of reasoning and decision steps, so by learning this kind of structured reasoning the model transfers the ability to more complex problems.
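To make the <input, CoT, output> exemplar format used in the paper's CoT tests more concrete, here is a toy prompt; the questions and numbers are made up purely for illustration.

```python
# A toy few-shot CoT prompt in the <input, CoT, output> format.
# The questions and numbers below are invented purely for illustration.
cot_prompt = (
    "Q: Tom has 3 boxes with 4 apples in each box. How many apples does he have?\n"
    "A: Each box holds 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12.\n"
    "\n"
    "Q: A shop packs 5 pens per pack. How many pens are in 7 packs?\n"
    "A:"  # the model is expected to continue with the reasoning chain, then the answer
)
```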

Instruction following (IF)

Instruction Following (IF) refers to the superior ability of an LLM to follow human instructions and complete the target task as needed.
IF differs from the two abilities above: ICL and CoT are closer to prompting strategies, whereas IF is generally obtained by instruction tuning on instruction-following data, and at inference time the model usually does not need in-context exemplars in order to follow an instruction.

Quantization

The essential idea of quantization is to map floating-point numbers into low-bit integers (e.g., BF16 to INT8), so as to reduce the total model bits.
In a neural network, two kinds of data are usually quantized: weights (model parameters) and activations (hidden activations).
notion image
Weights are applied to the inputs as they travel along synapses to reach the neuron. The neuron then applies an activation function to the "sum of weighted inputs" from each incoming synapse and passes the result (the activation values) on to all the neurons in the next layer.

two quantization techniques [5]

absolute maximum (absmax) quantization
With absmax quantization, the original number is divided by the absolute maximum of the tensor and multiplied by a scaling factor (e.g., 127) to map the input into the range [-127, 127]. To recover the original FP16 value, the INT8 number is divided by the quantization factor, which inevitably loses some precision.
notion image
For example, suppose the absolute maximum is 3.2. The weight 0.1 is quantized to round(0.1 × 127/3.2) = 4. Dequantizing it gives 4 × 3.2/127 = 0.1008, i.e., an error of 0.008.
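The same arithmetic as the worked example above, as a minimal NumPy sketch (just an illustration, not any particular library's implementation):

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Symmetric (absmax) quantization of a tensor to INT8."""
    scale = 127.0 / np.max(np.abs(x))          # e.g. 127 / 3.2
    q = np.round(x * scale).astype(np.int8)    # map into [-127, 127]
    return q, scale

def absmax_dequantize(q: np.ndarray, scale: float):
    return q.astype(np.float32) / scale

x = np.array([0.1, -3.2, 1.6], dtype=np.float32)
q, scale = absmax_quantize(x)
print(q, absmax_dequantize(q, scale))   # 0.1 -> 4 -> ~0.1008, i.e. an error of ~0.008
```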
Zero-point quantization handles asymmetric input distributions, such as the output of a ReLU (positive values only). The input values are first scaled by a scaling factor, and then the distribution is shifted by a zero-point so that it maps into the target range (note the extra value compared with absmax). First, we compute the scale factor and the zero-point:
notion image
For 8-bit quantization, b = 8. With the quantization factor and the zero-point we can derive quant and dequant:
An example: the maximum is 3.2 and the minimum is -3.0. The scale is 255/(3.2 + 3.0) = 41.13, and the zero-point is -round(41.13 × -3.0) - 128 = 123 - 128 = -5, so the earlier weight 0.1 is quantized to round(41.13 × 0.1 - 5) = -1. This is quite different from the value obtained with absmax (4 vs. -1).
📌
After applying quantization, we sometimes need to recover the original values for subsequent computation or analysis, i.e., to dequantize.
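A minimal NumPy sketch of zero-point quantization and dequantization that reproduces the numbers in the example above (again, an illustration rather than a library implementation):

```python
import numpy as np

def zeropoint_quantize(x: np.ndarray, bits: int = 8):
    """Asymmetric (zero-point) quantization to a signed integer range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1    # -128, 127 for 8-bit
    scale = (qmax - qmin) / (x.max() - x.min())             # e.g. 255 / (3.2 + 3.0) = 41.13
    zero_point = -round(scale * x.min()) + qmin             # e.g. 123 - 128 = -5
    q = np.clip(np.round(scale * x + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def zeropoint_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) / scale

x = np.array([0.1, 3.2, -3.0], dtype=np.float32)
q, s, zp = zeropoint_quantize(x)
print(q)                               # 0.1 -> round(41.13 * 0.1 - 5) = -1
print(zeropoint_dequantize(q, s, zp))  # recover approximate FP values
```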
What are absolute maximum (absmax) quantization and zero-point quantization, and what is the difference between the two? (answered by GPT-4)
Quantization is a process used to map a large set of values (like floating-point numbers) to a smaller set (like integers). It's often used in the context of deep learning to reduce the memory footprint and computational requirements of models. Absolute Maximum (AbsMax) Quantization and Zero-Point Quantization are two methods used for this purpose.

Absolute Maximum (AbsMax) Quantization

AbsMax Quantization is a method where the absolute maximum value in the data is used to scale the entire range of values. This scaling factor is then used to map the floating-point numbers to a fixed range of integers.
The process typically involves:
  1. Finding the absolute maximum value in the data.
  2. Dividing all the values by this absolute maximum value, so that the data is now in the range of [-1, 1].
  3. Multiplying these values by the maximum integer value that can be represented in the target bit-width (e.g., 127 for int8).
  4. Rounding to the nearest integer.
This method ensures that the entire range of integers is used, maximizing the precision for the given bit-width. However, it might not be the best choice if the data has outliers, as they can disproportionately affect the scaling factor.

Zero-Point Quantization

Zero-Point Quantization is a more flexible method that allows for asymmetric quantization. Unlike AbsMax, which always centers the quantized range around zero, Zero-Point Quantization can shift the range to better fit the data.
The process typically involves:
  1. Finding the minimum and maximum values in the data.
  2. Calculating the scale and zero-point that map the floating-point range to the integer range.
  3. Using the scale and zero-point to map the floating-point numbers to integers.
The zero-point is an integer value that represents 0 in the floating-point range, and the scale is a factor that represents the step between consecutive integers. This method can provide a better approximation of the data distribution, especially if it's not symmetric around zero.

Difference Between the Two

  1. Symmetry: AbsMax ensures that the quantized range is symmetric around zero, while Zero-Point Quantization allows for an asymmetric range.
  2. Precision: AbsMax might lose precision if there are outliers in the data, as they can affect the scaling factor. Zero-Point Quantization can often provide a better approximation of the data distribution.
  3. Complexity: AbsMax is generally simpler to implement, as it only requires finding the absolute maximum value and scaling accordingly. Zero-Point Quantization requires calculating both a scale and a zero-point, making it slightly more complex.
In summary, AbsMax Quantization is a simpler method that works well if the data is symmetric around zero, while Zero-Point Quantization provides more flexibility and can often represent the data more accurately. The choice between the two will depend on the specific requirements of the application.
 

Post-Training Quantization (PTQ)

In PTQ, the model is quantized after it has been trained. PTQ is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types with little degradation in model accuracy.
PTQ methods can be roughly grouped into four categories:
  1. Mixed-precision decomposition: LLM.int8()
    1. It separates the feature dimensions with outliers from the remaining dimensions in matrix multiplication. The calculations for the two parts are then performed with 16-bit floating-point numbers and 8-bit integers, respectively, so as to recover these outliers at high precision. (A toy sketch of this decomposition follows the figure below.)
  2. Fine-grained quantization
  3. Balancing the quantization difficulty
  4. Layerwise quantization: find quantized weights $\hat{W}$ such that the layer outputs computed with the quantized weights are as close as possible to those computed with the original weights, i.e., minimize $\lVert WX - \hat{W}X \rVert_2^2$.
notion image
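As promised above, here is a toy sketch of the LLM.int8()-style mixed-precision decomposition: activation columns containing outliers go through an FP16 matmul, the rest through an INT8 (absmax-quantized) matmul. This only illustrates the idea and is not the bitsandbytes implementation; the 6.0 threshold follows the LLM.int8() paper, and per-tensor scaling is used here for brevity.

```python
import numpy as np

def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0):
    """Toy LLM.int8()-style decomposition of y = x @ w."""
    outlier_cols = np.any(np.abs(x) >= threshold, axis=0)   # feature dims containing outliers
    # FP16 path for the outlier dimensions
    y_fp16 = x[:, outlier_cols].astype(np.float16) @ w[outlier_cols, :].astype(np.float16)
    # INT8 (absmax) path for the remaining dimensions
    x_r, w_r = x[:, ~outlier_cols], w[~outlier_cols, :]
    sx, sw = 127.0 / np.max(np.abs(x_r)), 127.0 / np.max(np.abs(w_r))
    qx = np.round(x_r * sx).astype(np.int8)
    qw = np.round(w_r * sw).astype(np.int8)
    y_int8 = (qx.astype(np.int32) @ qw.astype(np.int32)) / (sx * sw)
    # Recombine the two partial results
    return y_fp16.astype(np.float32) + y_int8.astype(np.float32)
```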
This paper uses GPTQ (Generative Pretrained Transformer Quantization), a layer-wise quantization method for compressing the weights of large language models such as GPT-3. It builds on another quantization method, OBQ, and improves it in several respects. (The math is quite heavy 🥵 and I can't follow all of it, so I'm just linking an explanation here; for a more detailed introduction to PTQ methods, see this paper.)
👉
GPTQ improves the original optimal brain quantization (OBQ) method by fixing the quantization order of weights for all rows. Further, with specially designed methods (i.e., lazy batch-updates and Cholesky reformulation), GPTQ is feasible to quantize very large models (e.g., 175B OPT) in 3 or 4 bit precision.
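To make the description above slightly more concrete, the layer-wise objective and the OBQ-style update that GPTQ applies column by column are roughly the following (written from my reading of the GPTQ/OBQ papers, so treat this as a sketch rather than the exact derivation):

```latex
% Layer-wise objective: find quantized weights that best reproduce the layer outputs
\hat{W} = \arg\min_{\hat{W}} \lVert WX - \hat{W}X \rVert_2^2
% OBQ/GPTQ greedy step with Hessian H = 2XX^{\top}: after quantizing weight w_q,
% the remaining (not yet quantized) weights F are updated by
\delta_F = -\,\frac{w_q - \mathrm{quant}(w_q)}{[H_F^{-1}]_{qq}}\,(H_F^{-1})_{:,q}
```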

Do Emergent Abilities Exist in Quantized LLMs?

Experimental setup

| Test | Datasets | Method | Metrics |
| --- | --- | --- | --- |
| In-Context Learning Test | MMLU | few- and zero-shot | acc |
| Chain-of-Thought Reasoning Test | GSM8K | few-shot (formatted as <input, CoT, output>) | acc |
| Instruction Following Test | | | |
| Language Modeling Test | WikiText | | PPL |

Model sizes: 7B, 13B, 30B, 65B
Quantization levels: 16-bit (non-quantized), 8-bit, 4-bit, 2-bit

Results and Analysis

notion image
🌟 the three kinds of emergent abilities seem to be seldom affected with 4-bit quantization
  1. After 4-bit and 8-bit quantization, the 7B, 13B, and 30B models show no large performance drop on the three emergent-ability tests relative to the non-quantized models, and in some cases even exceed the original non-quantized models.
  2. After 2-bit quantization, all three abilities degrade dramatically, in some cases to zero.
🌟 4-bit precision exhibits a favorable trade-off in terms of both total bits and performance
notion image
💡
Total (model) bits: multiply the total number of parameters by the actual number of representation bits.
For each plot, if you draw a vertical line perpendicular to the x-axis, the highest intersection point lies on the 4-bit curve. That is, at the same total-bit budget, the 4-bit quantized model performs best, indicating the best balance between performance and memory.
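As a rough sanity check of the total-bits definition above (ignoring embeddings, quantization scales/zero-points, and other overhead, so the numbers are only approximate):

```python
# Approximate total model bits for a ~6.7B-parameter model (LLaMA-7B-sized)
n_params = 6.7e9
for bits in (16, 8, 4, 2):
    total_bits = n_params * bits
    print(f"{bits:>2}-bit: {total_bits:.2e} bits ≈ {total_bits / 8 / 2**30:.1f} GiB")
# 16-bit ≈ 12.5 GiB, 8-bit ≈ 6.2 GiB, 4-bit ≈ 3.1 GiB, 2-bit ≈ 1.6 GiB
```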
🌟 Low-bit quantization performance benefits from the demonstrations in ICL tests
The 2-bit quantized models show almost no ICL ability in the zero-shot setting, but with few-shot demonstrations their ICL performance improves markedly. However, this gain does not show up for the 7B model, suggesting that ICL requires both the parameter scale and the precision to reach a certain level.
🌟 The scaling effect depends on specific tasks, and increasing the model scale benefits the CoT task the most
In the CoT plots, accuracy grows fastest as the number of bits increases (except at 2-bit, where the ability is lost), and in the MMLU-based ICL test, performance also improves noticeably as the model scale grows.
🌟 For CoT tests, extreme 2-bit quantization requires a large model scale
For CoT, regardless of model scale, 2-bit quantization destroys the CoT ability.
notion image
Looking at the three model scales after 2-bit quantization: the 7B model simply produces gibberish. (Personally, I find this a bit puzzling: if it produces gibberish here, its basic language modeling ability should be largely gone too, so shouldn't its ICL outputs also be gibberish?) The 13B model does not understand CoT. The 30B model outputs a "chain of thought" but gets the final answer wrong. Only the 30B model under 4-bit quantization produces both a correct CoT and the correct answer.
This suggests that even under extreme 2-bit quantization, CoT ability can improve somewhat as the parameter scale grows, but the effect remains poor: scaling up lets the model grasp the reasoning format, yet it does not recover the abilities to decompose the reasoning and carry out the computation (even the quantized 65B model only reaches 0.8). So precision matters for representing information accurately.

How to Enhance the Performance of Low-bit Models?

Low-bit models are practical and promising, but compared with the original non-quantized models their performance still drops, and on some tasks the abilities are lost entirely. So the paper considers improving the performance of quantized models from two directions:
  1. Quantization Sensitivity Analysis: during quantization, some parts of the model architecture may be very sensitive to low precision; such parts may play an important role in certain capabilities, and converting them to low precision loses information and can cause a large performance drop. The idea is to identify these "sensitive" parts and keep them at their original precision.
  2. Fine-tuning Compensation Analysis: since a quantized model loses information and hence ability, can fine-tuning bring its performance back up?

Quantization Sensitivity Analysis

Different model components (or feature dimensions) might exhibit varied sensitivity to quantization, i.e., different levels of performance degradation.

Component & Substructure quantization analysis

🤗 First, let's get familiar with the concepts.
Since most current LLMs are based on the Transformer architecture, the analysis focuses on two Transformer components, attention layers (ATT) and feed-forward networks (FFN), and how each contributes to performance degradation.
 
📖
Definition: ¬ATT and ¬FFN mean that the corresponding ATT or FFN component is kept at its original precision (FP16), while all remaining components are quantized to low-bit.
notion image
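As a concrete (hypothetical) illustration of the ¬FFN setting, the sketch below quantizes every linear layer except the FFN projections. The module names follow the HuggingFace LLaMA naming convention and are my assumption, not the paper's code.

```python
import torch.nn as nn

# Hypothetical FFN substructure names (HuggingFace LLaMA convention, assumed here)
FFN_KEYS = ("gate_proj", "up_proj", "down_proj")

def quantize_except_ffn(model: nn.Module, quantize_fn):
    """¬FFN: quantize all linear layers except the FFN, which stays in FP16.
    `quantize_fn` is a placeholder for whatever quantizer is used (e.g., GPTQ)."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            if any(key in name for key in FFN_KEYS):
                continue                      # keep the FFN component at full precision
            quantize_fn(name, module)         # quantize everything else to low-bit
```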
 
However, existing studies of quantization performance focus on whole components and on outlier feature dimensions; the authors find that substructures of a component are also an important factor. For example, outlier dimensions mainly appear in the down projection of the FFN.
In a Transformer, the term "down projection" refers to reducing the dimensionality of a feature space. The feed-forward network (FFN) component of a Transformer consists of two linear transformations with a ReLU (or another nonlinearity) in between.
The down projection in the FFN component is the dimensionality reduction performed by the second linear transformation; it helps capture complex patterns while keeping the dimensionality consistent across the model:
  1. Input and hidden layer: the FFN takes an input of size d_model and transforms it into a hidden layer of size d_ff (usually d_ff > d_model), where d_model is the input dimension and d_ff is the hidden dimension.
  2. Down projection: after the activation function, a second linear transformation projects the hidden layer of size d_ff back to size d_model, matching the original input dimension. This is called the down projection because it reduces the dimensionality of the feature space.
  3. Why use a down projection? By first expanding the dimensionality (up projection) and then reducing it (down projection), the network can learn more complex relationships in the data. The higher-dimensional hidden layer allows more complex mappings, and the subsequent down projection ensures that the FFN output matches the size expected by the following layers or the final output. (A minimal FFN sketch follows this list.)
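A minimal PyTorch sketch of the FFN described above; the dimensions are LLaMA-7B-like, and the ReLU stands in for whatever nonlinearity the actual model uses (LLaMA uses SwiGLU):

```python
import torch.nn as nn

class FFN(nn.Module):
    """Minimal Transformer FFN: up projection -> nonlinearity -> down projection."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff)     # expand: d_model -> d_ff
        self.act = nn.ReLU()                        # placeholder nonlinearity
        self.down_proj = nn.Linear(d_ff, d_model)   # down projection: d_ff -> d_model

    def forward(self, x):
        return self.down_proj(self.act(self.up_proj(x)))
```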
The experiments below focus on the following substructures:
  1. The down projection in the FFN
  2. In the attention component, the projections with the largest layer-wise quantization error under GPTQ:
    1. For the 7B model, the query projections and key projections
    2. For the 13B model, the key projections and output projections
    3. https://github.com/hkproj/transformer-from-scratch-notes
🤗🤗 Next, let's analyze the experimental results.
notion image
First, the meaning of the "all" tick on the x-axis (the paper does not spell this out): "all" means that all weights are quantized. The figure then shows the accuracy obtained when certain components are kept at full precision, and the improvement relative to quantizing everything. The line shows the total bits of the whole model under each setting.
  1. The FFN component is important for 2-bit quantization performance: keeping FFN in FP16 improves LLaMA-7B-2bit's performance from 0.038 to 0.225 and LLaMA-13B-2bit's performance from 0.148 to 0.286.
  2. Compared with preserving the entire FFN component (¬FFN), preserving only the critical substructures within components yields better performance; moreover, keeping the substructure weights of both FFN and ATT still costs fewer total bits than keeping the whole FFN.
💡
These observations show the significance of exploring fine-grained quantization strategy in extreme 2-bit quantization.

Outlier quantization analysis

First, what is an outlier? 🤔
We define outliers according to the following criteria: the magnitude of the feature is at least 6.0, affects at least 25% of layers, and affects at least 6% of the sequence dimensions. More formally, given a transformer with $L$ layers and hidden states $X_l \in \mathbb{R}^{s \times h}$, $l = 0 \dots L$, where $s$ is the sequence dimension and $h$ the feature dimension, we define a feature to be a particular dimension $h_i$ in any of the hidden states $X_l$. We track dimensions $h_i$ (outlier dimensions) which have at least one value with a magnitude of $\alpha \ge 6$, and we only collect statistics if these outliers occur in the same feature dimension $h_i$ in at least 25% of transformer layers and appear in at least 6% of all sequence dimensions $s$ across all hidden states $X_l$.
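A toy version of this criterion, just to make the thresholds explicit (magnitude 6.0, 25% of layers, 6% of sequence positions); this is my own sketch, not the LLM.int8() code:

```python
import numpy as np

def find_outlier_dims(hidden_states, magnitude=6.0, layer_frac=0.25, seq_frac=0.06):
    """hidden_states: list of [seq_len, hidden_dim] arrays, one per Transformer layer.
    Returns the feature dimensions satisfying the outlier criterion above."""
    n_layers, hidden_dim = len(hidden_states), hidden_states[0].shape[1]
    layer_hits = np.zeros(hidden_dim)   # in how many layers dim i has an outlier
    seq_hits = np.zeros(hidden_dim)     # in how many sequence positions dim i has an outlier
    total_positions = 0
    for X in hidden_states:
        mask = np.abs(X) >= magnitude              # [seq_len, hidden_dim]
        layer_hits += mask.any(axis=0)
        seq_hits += mask.sum(axis=0)
        total_positions += X.shape[0]
    keep = (layer_hits / n_layers >= layer_frac) & (seq_hits / total_positions >= seq_frac)
    return np.where(keep)[0]
```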
It has been found that large outliers would occur for Transformer language models having a size of 6.7B or above [7].
Outlier feature (emergent feature) dimensions refer to a small number of dimensions in the feature activations of Transformer-based language models that have a significant impact on the model's attention mechanism and predictive performance (perhaps a definition in terms of their practical effect).
https://arxiv.org/abs/2208.07339
We identify certain outlier dimensions in Transformer layer outputs and show that they play a crucial role in both language modeling and downstream task performance. Disabling the weights for these output dimensions drastically degrades performance. [8]
https://www.youtube.com/watch?v=mii-xFaPCrA
Quantizing large magnitude feature dimensions (called outliers) can ruin quantization precision, especially when the outliers emerge in all Transformer layers.
⭐⭐ Why does quantizing these outliers ruin quantization precision?
As the two quantization techniques above show, quantizing high-precision values involves finding the max and min of the current value range. If most values lie within a small range, then because of the outliers the other values may lose a lot of precision after conversion. For example, take a set of values whose maximum is 9999 while most values are within 5: converting to INT8 gives a quantization factor of 255/9999, and after multiplying the original values by this factor they may lose all information and collapse to 0.
But simply dropping the outliers, or clamping them to the maximum value (e.g., 255), loses a lot of information about the outliers themselves. As discussed above, in LLMs some outliers appear in specific dimensions and are crucial to model performance, so naively quantizing them is bound to cause a large performance drop. This is why methods such as mixed-precision decomposition quantize the two parts separately.
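Going back to the "9999" example above, here is a tiny numeric demonstration (using the signed absmax variant with 127 rather than 255, but the effect is the same):

```python
import numpy as np

x = np.array([0.3, -1.2, 2.5, 4.8, 9999.0], dtype=np.float32)   # one extreme outlier
scale = 127.0 / np.abs(x).max()                                  # ≈ 0.0127
q = np.round(x * scale).astype(np.int8)
print(q)            # [0 0 0 0 127] -> every non-outlier value collapses to 0
print(q / scale)    # the information in the small values is gone after dequantization
```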
 
⭐⭐⭐ Analysis of the experimental results
📖
top-n dimensions: the outlier dimensions sorted by the number of layers they affect. "all non-outlier dimensions": only the non-outlier dimensions are quantized. "+top-1 outlier dimension": the top-1 outlier dimension is quantized in addition to all non-outlier dimensions. "+top-3 outlier dimension": the top-3 outlier dimensions are quantized in addition to all non-outlier dimensions. Since quantizing activations is more involved, the quantization level here is 8-bit.
The term "affect" here is a bit confusing. What does it mean? According to the LLM.int8() paper, "affect" refers to how many layers' activations an outlier dimension appears in, which can be gathered from the following passage:
At around 6.7B parameters, a phase shift occurs, and all transformer layers and 75% of all sequence dimensions are affected by extreme magnitude features. These outliers are highly systematic: at the 6.7B scale, 150,000 outliers occur per sequence, but they are concentrated in only 6 feature dimensions across the entire transformer.
notion image
  1. The top outliers have a significant impact on the degradation of the model's abilities, especially CoT and PPL.
  2. When the top outlier dimensions are quantized, the LLaMA-13B model suffers a more severe performance drop than the 7B model, which may mean that larger models are more sensitive to outliers.
  3. Another important finding is that outliers seem to concentrate in particular substructures of components. For example, in LLaMA-7B the outliers mainly appear in the down projections of the FFN components.

Fine-tuning Compensation Analysis

Experimental Setup

| Setting | Model sizes | Quantization precision | Tasks |
| --- | --- | --- | --- |
| Pre-Quantization Fine-tuning | 7B, 13B | 2-bit, 4-bit | MMLU (5-shot), GSM8K, AutoEval |
| Post-Quantization Fine-tuning | 7B, 13B, 65B | 2-bit, 4-bit | MMLU |
⏮️ Post-quantization fine-tuning
To address the performance drop after quantization, the authors explore the benefits of post-quantization fine-tuning in depth. They built a dedicated tool that allows fine-tuning the LLaMA-65B model at 2-bit precision on a single A100 80G, and the result outperforms the non-fine-tuned 16-bit LLaMA-13B model on MMLU (5-shot). To overcome the challenge of directly optimizing quantized weights, they adopt and modify the original LoRA method, achieving a significant reduction in memory consumption (a minimal sketch of this idea follows the list below):
  1. The pre-trained weights in the original LoRA method are replaced with the GPTQ-quantized weights.
  2. With LoRA, fine-tuning the LLaMA-65B model consumes only 17.8 GiB.
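A minimal sketch of the idea (LoRA adapters trained on top of a frozen, GPTQ-quantized weight). This mirrors the approach described above but is not the authors' actual tool, and for simplicity it stores the dequantized weight rather than packed low-bit integers.

```python
import torch
import torch.nn as nn

class QuantLoRALinear(nn.Module):
    """LoRA on top of a frozen quantized weight: only lora_A / lora_B are trained."""
    def __init__(self, dequantized_weight: torch.Tensor, r: int = 16, alpha: int = 32):
        super().__init__()
        out_features, in_features = dequantized_weight.shape
        # Frozen base weight (in practice this would be the packed GPTQ weight)
        self.weight = nn.Parameter(dequantized_weight, requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # zero init -> starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        base = x @ self.weight.T
        delta = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * delta
```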

Results and Analysis

📐 Pre-Quantization Fine-tuning
The experiments fine-tune the FP16 model with full fine-tuning (FFT) and with parameter-efficient LoRA, and then quantize it with GPTQ.
  1. Comparison with the base model
      • FFT achieves significant improvements on MMLU, GSM8K, and AutoEval.
      • Under 4-bit quantization, the gains from FFT are largely preserved, with almost no performance drop on MMLU and AutoEval.
      • Under 2-bit quantization, however, the gains from FFT shrink markedly, especially on GSM8K (e.g., 2.6 for LLaMA-7B and 2.0 for LLaMA-13B).
  2. Limitations of pre-quantization fine-tuning: the results indicate that pre-quantization fine-tuning cannot effectively compensate for the performance drop of low-bit (e.g., 2-bit) models on complex tasks.
  3. The gap between LoRA and FFT
      • In most cases, LoRA clearly improves the base model; under 4-bit quantization the benefit of fine-tuning persists, but under 2-bit quantization this is not always the case.
      • A significant gap remains between LoRA and FFT (e.g., 25.8 vs. 38.0 on GSM8K).
      • Under 4-bit quantization, LoRA-fine-tuned models degrade noticeably on GSM8K, suggesting that for complex reasoning tasks it may be better to quantize models fine-tuned with full parameters.
📐📐 Post-Quantization Fine-tuning
notion image
notion image
  1. One advantage of post-quantization fine-tuning is that even larger models such as 65B can be fine-tuned efficiently, and this fits on a single A100 card.
  2. Comparison with the base models shows that the gains from LoRA are especially pronounced under 2-bit quantization (e.g., for 65B, 42.0 vs. 9.0 in the 0-shot setting and 44.4 vs. 22.6 in the 5-shot setting).
  3. With fewer total bits, the 2-bit 65B model surpasses the non-fine-tuned FP16 13B model, i.e., 42.0 vs. 41.4 in the 0-shot setting.
  4. 🔥 These findings show that even after 2-bit quantization, fine-tuning can effectively restore a large model's ability to solve complex tasks.

Conclusion

  1. Large models (fine-tuned or not) can retain their emergent abilities well with 4-bit weight quantization, but experience substantial degradation at 2-bit precision.
  2. Low-bit quantized LLMs can be enhanced by effectively preserving the more crucial components, feature dimensions, and substructures at higher precision.
  3. Fine-tuning can alleviate the performance degradation from low-bit quantization, showing great potential to enhance the capacity of quantized LLMs.
 
Surprise,something interesting created by ChatGPT 🤖
Imagine you have a very detailed painting, full of colors and fine strokes. This painting is like a computer model, and it performs certain tasks for us, like recognizing images or translating languages.
  1. Low-bit Post-Training Quantization (PTQ):
      • Painting with Limited Colors: Think of the low-bit post-training quantization like trying to recreate that detailed painting, but using only a few colors. We still want it to look great, but we're limited in our tools.
      • Why Do This?: By reducing the colors (or bits), we can make the model simpler, and potentially faster, but we might lose some details (performance).
  2. Exploring Strategies for Higher Performance:
      • Finding the Best Way to Paint with Fewer Colors: We're looking for the best ways to keep as much of the original detail as possible, even with fewer colors. We're experimenting to find the right techniques to do this.
  3. Analysis Experiments:
      • Trying Different Brushes and Techniques: We're not just guessing how to do this. We're conducting specific experiments to figure out the best approach.
  4. Quantization Sensitivity of Fine-Grained Model Structures:
      • Examining How Different Parts of the Painting React to Fewer Colors: Some parts of the painting might not look as good with fewer colors, while others might not change much. We're studying these different areas to understand where we need to be extra careful.
  5. Effects of Performance Compensation via Model Fine-Tuning:
      • Adjusting and Retouching the Painting: Even after we've made the painting with fewer colors, we might go back and fine-tune some areas, adding a bit more detail where needed to make it look better. This is like adjusting the model to compensate for the loss in details and improve its performance.
In summary, this text is about finding the best ways to make a complex computer model simpler without losing too much of its ability to perform its tasks. It's like finding the right techniques to recreate a detailed painting with fewer colors without losing its beauty. By carefully examining how different parts of the model react to these changes and making careful adjustments, we can aim to achieve the best possible performance with these limitations.

Resource

GPTQ

others
