- Yandex Research, IST Austria, Neural Magic, and KAUST develop and open-source two large language model (LLM) compression methods, AQLM and PV-Tuning, reducing model size by up to 8 times while retaining 95% of response quality.
- The new methods reduce equipment costs by up to 8 times, significantly lowering the barrier to entry for AI deployment.
- Compressed models such as Llama 2 13B can run on 1 GPU instead of 4.
- The AQLM compression method has been showcased at the ICML conference, highlighting significant advancements in LLM technology.
Bangalore, Karnataka, India, 29 July 2024: The Yandex Research team, in collaboration with researchers from IST Austria, Neural Magic, and KAUST, has developed two innovative compression methods for large language models: Additive Quantization for Language Models (AQLM) and PV-Tuning. When combined, these methods allow for a reduction in model size by up to 8 times while preserving 95% of response quality. The methods aim to optimize resources and enhance efficiency in running large language models. The research article detailing this approach has been featured at the International Conference on Machine Learning (ICML), currently underway in Vienna, Austria.
Key features of AQLM and PV-Tuning
AQLM leverages additive quantization, traditionally used for information retrieval, for LLM compression. The resulting method preserves and even improves model accuracy under extreme compression. The significant reduction in memory consumption makes it possible to deploy LLMs on everyday devices like home computers and smartphones.
PV-Tuning addresses errors that may arise during the model compression
process. When combined, AQLM and PV-Tuning deliver optimal results — compact
models capable of providing high-quality responses even on limited computing
resources.
Method evaluation and recognition
The effectiveness of the methods was rigorously assessed using popular open-source models such as Llama 2, Llama 3, Mistral, and others. Researchers compressed these large language models and evaluated answer quality on English-language benchmarks (WikiText2 and C4), maintaining an impressive 95% answer quality even as the models were compressed by 8 times.
Who can benefit from AQLM and PV-Tuning
The new methods offer substantial resource savings for companies involved
in developing and deploying proprietary language models and open-source LLMs.
For instance, the Llama 2 model with 13 billion parameters can, after compression, run on just 1 GPU instead of 4, cutting hardware costs by up to 8 times. This means that startups, individual researchers, and LLM enthusiasts can run advanced LLMs such as Llama on their everyday computers.
Exploring new LLM applications
AQLM and PV-Tuning make it possible to deploy models offline on devices
with limited computing resources, enabling new use cases for smartphones, smart
speakers, and more. With advanced LLMs integrated into these devices, users gain access to text and image generation, voice assistance, personalized recommendations, and even real-time language translation without needing an active internet connection.
Moreover, models compressed using the methods can operate up to 4
times faster, as they require fewer computations.
Implementation and access
Developers and researchers worldwide can already use AQLM and PV-Tuning,
which are available on GitHub. Demo materials provided by the authors
offer guidance for effectively training compressed LLMs for various
applications. Additionally, developers can download popular open-source models that have already been
compressed using the methods.
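For example, a pre-compressed checkpoint can be loaded through the Hugging Face transformers library once the aqlm package is installed. The following is a minimal sketch, not an official recipe: the checkpoint name shown is illustrative, and the exact repository names and setup steps are listed in the project's GitHub materials.

```python
# pip install aqlm[gpu] transformers accelerate  (versions per the project's instructions)
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint name; see the project's GitHub page for published models.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Additive quantization lets large language models run on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```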
ICML highlight
A scientific article by Yandex Research on the AQLM compression method
has been featured at ICML, one of the world's most prestigious machine learning
conferences. Co-authored with researchers from IST Austria and experts from AI startup Neural Magic, this work marks a significant advancement in LLM compression technology.
Yandex is a global technology company that builds intelligent products
and services powered by machine learning. The company aims to help consumers
and businesses better navigate the online and offline world. Since 1997, Yandex
has been delivering world-class, locally relevant search and information
services and has also developed market-leading on-demand transportation
services, navigation products, and other mobile applications for millions of
consumers across the globe.
For reference [additional details for media
& journalists]
Deploying large language models (LLMs) on consumer hardware is challenging
due to the inherent trade-off between model size and computational efficiency.
Compression methods, such as quantization, have offered partial solutions, but
often compromise model performance.
To address this challenge, researchers from Yandex Research, IST Austria, KAUST, and Neural Magic developed two compression methods: Additive
Quantization for Language Models (AQLM) and PV-Tuning. AQLM reduces the bit
count per model parameter to 2–3 bits while preserving or even enhancing model
accuracy, particularly in extreme compression scenarios. PV-Tuning is a
representation-agnostic framework that generalizes and improves upon existing
fine-tuning strategies.
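To illustrate what 2 bits per parameter means in practice, the back-of-the-envelope estimate below (a sketch that ignores codebook storage and activation memory) compares the weight footprint of a 13-billion-parameter model stored in 16-bit precision against roughly 2 bits per parameter.

```python
# Back-of-the-envelope weight-memory estimate (illustrative; ignores codebook
# storage, activations, and the KV cache).
params = 13e9                      # Llama 2 13B parameter count
fp16_gb = params * 16 / 8 / 1e9    # ~26 GB of weights at 16 bits per parameter
aqlm_gb = params * 2 / 8 / 1e9     # ~3.25 GB at roughly 2 bits per parameter
print(f"FP16 weights: ~{fp16_gb:.1f} GB; 2-bit weights: ~{aqlm_gb:.2f} GB "
      f"(~{fp16_gb / aqlm_gb:.0f}x smaller)")
```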
AQLM's key innovations include learned additive quantization of weight matrices, which adapts to input variability, and joint optimization of codebook parameters across layer blocks. This dual strategy enables AQLM to outperform other compression techniques, setting new benchmarks in the field.
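To convey the underlying idea, the sketch below shows additive quantization of a single weight group in NumPy: the group is approximated as the sum of codewords drawn from several codebooks, so the stored codes take far fewer bits than the original weights. This is a toy example with random stand-in codebooks and greedy encoding, not the authors' implementation, which learns the codebooks and optimizes codes jointly across layer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

group_size = 8          # weights encoded together
num_codebooks = 2       # number of additive codebooks
codebook_bits = 8       # 2**8 = 256 codewords per codebook

# In AQLM the codebooks are learned; here they are random stand-ins.
codebooks = rng.normal(size=(num_codebooks, 2**codebook_bits, group_size)).astype(np.float32)
weights = rng.normal(size=group_size).astype(np.float32)

# Greedy encoding: for each codebook, pick the codeword closest to the residual.
residual = weights.copy()
codes = []
for cb in codebooks:
    idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
    codes.append(idx)
    residual -= cb[idx]

# Decoding: sum the selected codewords to reconstruct the group.
reconstruction = sum(cb[i] for cb, i in zip(codebooks, codes))

bits_per_weight = num_codebooks * codebook_bits / group_size   # 2 * 8 / 8 = 2 bits
print("codes:", codes)
print("bits per weight:", bits_per_weight)
print("reconstruction error:", float(((weights - reconstruction) ** 2).mean()))
```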
AQLM's practicality is demonstrated by its implementations on GPU and
CPU architectures, making it suitable for real-world applications. Comparative
analysis shows that AQLM can achieve extreme compression without compromising
model performance, as evidenced by its superior results in metrics like model
perplexity and accuracy in zero-shot tasks.
PV-Tuning provides convergence guarantees in restricted cases and has been shown to outperform previous methods when used for 1- to 2-bit vector quantization on highly performant models such as Llama and Mistral. By
leveraging PV-Tuning, the researchers achieved the first Pareto-optimal
quantization for Llama 2 models at 2 bits per parameter.
The effectiveness of the methods was rigorously assessed using popular open-source models such as Llama 2, Mistral, and Mixtral. Researchers compressed these large language models and evaluated answer quality on English-language benchmarks (WikiText2 and C4), maintaining an impressive 95% answer quality even as the models were compressed by 8 times.
| Model | Number of parameters | Relative quality of answers after compression* |
| --- | --- | --- |
| Llama 2 | 7 billion | 88% |
| Llama 2 | 13 billion | 97% |
| Llama 2 | 70 billion | 99% |
| Llama 3 | 8 billion | 92% |
| Llama 3 | 70 billion | 93% |
| Mistral | 8 billion | 96% |
| Average across all models in the test | | 95% |
* The closer the average accuracy of answers in tests is to that of the original model, the better the new methods are at preserving the quality of answers. The figures above show the combined results of the two methods, which compress the models by, on average, 8 times.