
Understanding LLM Size, Weights, Parameters, Quantization, KV Cache & Inference Memory
How much RAM do you need to run a 30 billion parameter model? Why are there multiple versions of the same model at different file sizes? What does "8-bit quantization" actually mean, and how does it affect performance and/or precision? If you're running language models locally or planning to, understanding the relationship between parameters, weights, quantization, and memory is essential.
If you've ever been confused about why models are certain sizes, what quantization means, or how much memory you actually need for inference, then Reddit user Daniel_H212's reply to an unrelated Reddit question of mine is a concise but extremely well-written overview of everything from weights and parameters to KV cache and memory calculation. I'm copy/pasting his reply below. Thanks a lot, Daniel_H212!
- The downloaded model file contains primarily weights, which are the numbers associated with the parameters you probably know from a model's parameter count.
- E.g. GLM-4.7-Flash has about 30B (billion) parameters and 3B activated parameters (activated parameters are, roughly, how many of the parameters are actually used to generate each token).
- It also contains a few other things, such as config files telling the program you're using for inference how to do math with those weights, but those are much smaller files. Loading a model into memory requires, at bare minimum, enough free memory to hold all the weights, plus whatever memory the inference program itself needs.
- Each weight takes up a certain number of bits. Most models are originally FP16 or FP32, meaning each weight is a 16- or 32-bit floating point number. Given that a gigabyte has a billion bytes and each byte has 8 bits, a model with n billion parameters at FP16 would be roughly 2n gigabytes, and at FP32 roughly 4n gigabytes (see the size sketch after this list).
- Models can be quantized from their original precision, meaning a significant portion of their weights can be stored less precisely, e.g. 0.1746583 can be stored as 0.175, without affecting performance too much (a toy example follows the list). This lets systems with less working memory still run larger models. For example, a model quantized to 8 bpw (bits per weight) with n billion parameters would take up about n gigabytes, and at 4 bpw about n/2 gigabytes. Note that bpw is an average measure: for optimal quality, modern quantization methods aiming for n bpw preserve some more important weights at higher precision than n bits while cutting some less important weights down to fewer than n bits, so the average lands at n bpw. MLX is a quantization format intended for and compatible with Apple silicon, and MLX 5-bit is an MLX quant at roughly 5 bpw precision. Other quantization formats include GGUF, AWQ, etc.
- However, the program also needs to store information about the text you input and the text it generates, which together form the model's context, usually measured in tokens, since models process and generate whole tokens at a time, not necessarily individual letters. Note that different models have different token vocabularies: e.g. the word "token" may be a single token for some models, while others don't have the whole word in their vocabulary and deconstruct it as "to" and "ken" instead (see the tokenizer sketch below).
- However, models can't do math with the tokens directly, so during inference the tokens in a model's context get converted into a format that is mathematically compatible with the model weights, and this processed form is known as the KV cache, which is the part that actually takes up a significant amount of memory. Math is then performed by applying the weights to this KV cache, which generates the next entry in the cache, which in turn gets translated back into a token and output to you (a KV cache sizing sketch closes out the examples below).
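
To make the size arithmetic concrete, here's a minimal Python sketch that estimates the weight footprint from parameter count and bits per weight. The 30B figure mirrors the example above; the function name and the loop are purely illustrative, and the result ignores the smaller config files and runtime overhead mentioned earlier.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone: parameters * bits per weight, converted to gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # 8 bits per byte, 1e9 bytes per gigabyte

# A ~30B-parameter model at common precisions / quantization levels:
for label, bpw in [("FP32", 32), ("FP16", 16), ("8 bpw", 8), ("5 bpw", 5), ("4 bpw", 4)]:
    print(f"{label:>6}: ~{weight_memory_gb(30, bpw):.0f} GB")
```

For the 30B example, that works out to roughly 120, 60, 30, 19, and 15 GB respectively, matching the 2n / n / n/2 rules of thumb in the list.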
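The quantization bullet mentions storing 0.1746583 as something coarser. Here's a toy sketch of that idea using plain symmetric 8-bit rounding with NumPy; real formats like GGUF or AWQ quantize block-wise, keep per-group scales, and protect the more important weights, so treat this as intuition only.

```python
import numpy as np

weights = np.array([0.1746583, -1.25, 0.003, 2.5], dtype=np.float32)

# Symmetric 8-bit quantization: map the largest magnitude onto the int8 range.
scale = np.abs(weights).max() / 127
q = np.round(weights / scale).astype(np.int8)   # stored as 1 byte per weight instead of 4
restored = q.astype(np.float32) * scale         # what inference effectively works with

print(q)         # integer codes, e.g. [  9 -64   0 127]
print(restored)  # close to the originals, but slightly off -- that's the quality trade-off
```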
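On tokenization: below is a toy greedy longest-match tokenizer over two made-up vocabularies, just to show how the same word can be one token for one model and two pieces for another. Real tokenizers (BPE, SentencePiece, etc.) are trained on large corpora and have vocabularies of tens or hundreds of thousands of tokens; nothing here reflects any specific model.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Repeatedly take the longest vocabulary entry that prefixes the remaining text."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            # Nothing matched: fall back to a single character
            # (real tokenizers typically fall back to bytes instead).
            tokens.append(text[0])
            text = text[1:]
    return tokens

vocab_a = {"token", "to", "ken"}  # has "token" as a single entry
vocab_b = {"to", "ken"}           # has to build the word from pieces

print(greedy_tokenize("token", vocab_a))  # ['token']
print(greedy_tokenize("token", vocab_b))  # ['to', 'ken']
```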
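Finally, the KV cache. Its size grows with context length and with the model's layer and attention-head dimensions rather than with the raw parameter count, following the standard transformer accounting of two cached tensors (keys and values) per layer. The architecture numbers plugged in below are hypothetical, chosen only to show the order of magnitude; check a real model's config for its actual layer count, KV head count, and head dimension.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_element: int = 2) -> float:
    """KV cache bytes = 2 (K and V) * layers * KV heads * head dim * tokens * element size."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element
    return total_bytes / 1e9

# Hypothetical mid-size architecture: 48 layers, 8 KV heads (grouped-query attention),
# head dimension 128, cache kept in FP16 (2 bytes per element).
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx, 48, 8, 128):.1f} GB")
```

With those assumptions, the cache runs from roughly 1.6 GB at 8K tokens to about 26 GB at 128K, which is why very long contexts can cost more memory than the weights of a small model. That, plus the weight footprint and the program's own overhead, is what has to fit in RAM or VRAM for inference.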