Standardizing next-generation narrow precision data formats for AI
opcode84 | 103 points | 23mon ago | www.opencompute.org
opcode84|23mon ago
Earlier this year, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. formed the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. The key enabling technology, referred to as microscaling, is what makes sub-8-bit formats work, and it builds on a foundation of years of design space exploration and research. MX enhances the robustness and ease-of-use of existing 8-bit formats such as FP8 and INT8, thus lowering the barrier for broader adoption of single-digit-bit training and inference.
Spec: https://www.opencompute.org/documents/ocp-microscaling-forma...
Whitepaper: https://arxiv.org/abs/2310.10537
Code: https://github.com/microsoft/microxcaling
ljosifov|23mon ago
Thanks - interesting. I wish
> Integer data types use a 2’s complement encoding, but the maximum negative representation (−2) may be left unused to maintain symmetry between the maximum positive and negative representations and avoid introducing a negative bias.
... the maximum negative representation was used for a NaN. IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers??
duskwuff|23mon ago
> IDK why and how it is that we all agree that NaNs are useful for floats (and they are super useful), but very few think the same for integers??
Because making an integer bit pattern act as a NaN would require specific semantics (e.g. NaN + X = NaN; NaN != NaN) which are difficult to implement efficiently in hardware. These properties would also potentially rule out some arithmetic optimizations which are currently possible.
ljosifov|23mon ago
Yep - agreed that there will be a price to be paid. Do you have any inkling of how costly it would be in h/w? Maybe compared to the equivalent cost for floats? For floats it's been decided that paying the price was worth it, it seems.
duskwuff|23mon ago
> Do you have any inkling of how costly in h/w would it be?
For something like addition, a ripple-carry adder is ~5 gates per bit. To check for NaN on input, you'd need a wide AND/OR across each input (roughly a couple of gates per bit per input, so ~4 gates per bit for the two inputs of a 64-bit adder), a multiplexer on the output (~3 gates per bit), and a bunch of fan-in/out for the "is this NaN" signal. That'd likely more than double the size of the cell.
Subtraction makes that even more awkward. With 2's complement, an adder can also perform subtraction by inverting the second operand and carrying in a 1. This trick stops working if one of your bit patterns is a special value, so you either have to add even more logic to specify that NaN is inverted, or duplicate the whole mess for subtraction.
You'd also have to add a bunch of completely new hardware to distinguish between e.g. "X is bitwise equal to Y" and "X is numerically equal to Y" in equality tests, because NaN != NaN. It's hard to speculate how expensive that would be, but it certainly wouldn't be trivial.
> For floats it's been decided that paying the price was worth it, it seems.
That's a bit of an oversimplification. I'd say that it's more that:
1) Floating-point arithmetic is already fairly complex; even if you didn't handle the special values it'd still require a lot more logic than integer math.
2) Handling special values like infinities and NaN was simply part of the "spec" for floating-point math. It wouldn't have been considered fit for purpose without those features.
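To make the idea concrete, here's a minimal Python sketch of the semantics being discussed - not hardware, just an illustration - assuming the otherwise-unused most-negative 8-bit pattern (0x80) were reserved as an integer NaN:

```python
# Hypothetical 8-bit integers where the bit pattern 0x80 (the otherwise
# unused "maximum negative" value) is reserved as an integer NaN.
INT_NAN = 0x80
MASK = 0xFF

def is_nan(x):
    # In hardware this is the wide check on each input: compare every
    # bit of the operand against the reserved pattern.
    return x == INT_NAN

def nan_add(a, b):
    # Extra check per input, plus a multiplexer on the output:
    # if either input is NaN, the result is NaN, otherwise the plain sum.
    if is_nan(a) or is_nan(b):
        return INT_NAN
    return (a + b) & MASK   # ordinary wrapping 8-bit add

def nan_eq(a, b):
    # "Numerically equal" now differs from "bitwise equal": NaN != NaN.
    if is_nan(a) or is_nan(b):
        return False
    return a == b
```

Every add and every comparison picks up an extra check per input plus a select on the output, which is roughly the overhead the gate-count estimate above is pricing in.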
cyrillite|23mon ago
For us less technical folks (in this field), what’s the big take away here / why does this matter / why should we be excited?
buildbot|23mon ago
Typically, you need some tricks to pre-train in lower precision (finetuning seems to work fine at low precision); with FP16, for example, you need loss scaling. With MX, you can train in 6 bits of precision without any tricks and hit the same loss as FP32.
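For reference, this is roughly what the FP16 "trick" looks like in practice - a minimal sketch using PyTorch's AMP GradScaler; the model, optimizer and data here are placeholders and it assumes a CUDA device:

```python
import torch

# Placeholder model, optimizer and data -- illustration only.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # the "loss scaling" trick

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).square().mean()
    # Scale the loss up so small FP16 gradients don't underflow to zero,
    # then unscale before the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The claim above is that MX formats let you drop the scaler and the autocast bookkeeping entirely for pre-training.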
andy99|23mon ago
On most hardware, handwritten math is required for all the nonstandard formats, e.g. for quantized int-8 https://github.com/karpathy/llama2.c/blob/master/runq.c#L317
Integer quantization doesn't typically just round; it applies scaling and other factors in blocks, so it's not just a question of manipulating int8s. And FP16/FP8 aren't supported by most processors, so they need their own custom routines as well. It would be great if you could just write code that operates with intrinsics on the quantized types.
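A rough numpy sketch of the block/group scaling in question (the group size, rounding and clipping choices are illustrative, not the exact runq.c implementation):

```python
import numpy as np

GROUP = 32  # illustrative block/group size

def quantize_q8(w):
    """Symmetric int8 quantization with one scale per group of GROUP values."""
    w = w.reshape(-1, GROUP)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def qdot(qa, sa, qb, sb):
    """Dot product done group-by-group: integer accumulate, then rescale."""
    acc = (qa.astype(np.int32) * qb.astype(np.int32)).sum(axis=1)
    return float((acc * sa.squeeze(1) * sb.squeeze(1)).sum())

w = np.random.randn(1024).astype(np.float32)
x = np.random.randn(1024).astype(np.float32)
qw, sw = quantize_q8(w)
qx, sx = quantize_q8(x)
print(qdot(qw, sw, qx, sx), "vs", float(w @ x))
```

The scales per group are exactly the part that today has to be handwritten per format and per processor.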
imjonse|23mon ago
Future hardware implementations for these <8bit data types will result in much larger (number of parameters) models fitting in the same memory. Unless they are standardized, each vendor and software framework will have their own slightly different approach.
superkuh|23mon ago
Actual article as text for those like myself getting completely blocked by the broken captcha wall. I must have hit the "prove you're not a robot" check-box 20 times.
https://web.archive.org/web/20231018183224/https://www.openc...
sanxiyn|23mon ago
This reminds me a lot of Nervana Systems' Flexpoint.
Dylan16807|23mon ago
Huh, for FP4 just E2M1 with no E3M0? I've seen a paper in the past that went so heavy on exponent it was skipping every other power of two, so I would have thought the demand was there.
Oddly they do have E8M0.
pclmulqdq|23mon ago
E3M0 was the format I was most excited to see here, but I guess not. E8M0 makes sense because of the relationship to E8M23 (float32) and E8M7 (bfloat16). Nvidia has their own E8M10 format (TF32) that uses the exponent logic of float32 and the mantissa logic of float16, allowing you to multiply 2x more numbers at a time than in E8M23 without adding more hardware or resorting to a narrower exponent.
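A quick sketch of that relationship - bfloat16 (E8M7) keeps float32's (E8M23) sign and 8-bit exponent fields and just truncates the mantissa (toy code, round-toward-zero only):

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Truncate an E8M23 float32 to E8M7 (bfloat16): the sign and 8-bit
    exponent fields are identical, only the mantissa is cut to 7 bits."""
    bits = np.float32(x).view(np.uint32)
    return np.uint16(bits >> 16)   # keep the top 16 bits

print(hex(float32_to_bfloat16_bits(1.5)))   # 0x3fc0, same exponent field as float32's 0x3fc00000
```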
buildbot|23mon ago
Copying my comment here too - point of clarification: there is no direct E8M0 datatype (unless I misunderstand something!). E8M0 is only used for the shared scaling of the block - there are 8 bits of scale per block.
pclmulqdq|23mon ago
I think you're right. In general, storage and operating formats seem to be decoupling for AI/ML.
Nvidia's E8M10 (TF32) is also a format specifically for operators - they expect you to store FP32 when you operate in it. Storage is almost always in power-of-2 sizes.
buildbot|23mon ago
I would hope so ;)
buildbot|23mon ago
Point of clarification: there is no direct E8M0 datatype (unless I misunderstand something!). E8M0 is only used for the shared scaling of the block - there are 8 bits of scale per block.
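For illustration, a rough sketch of decoding an MXFP4 block along those lines - E2M1 elements sharing one E8M0 scale, assuming the scale byte is interpreted as 2^(e-127); toy code, not the reference implementation:

```python
import numpy as np

E2M1_BIAS = 1  # FP4 (E2M1): 1 sign bit, 2 exponent bits, 1 mantissa bit

def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 element to a float.
    Representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}."""
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    exp = (nibble >> 1) & 0b11
    man = nibble & 0b1
    if exp == 0:                       # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - E2M1_BIAS)

def decode_mx_block(scale_e8m0, elements):
    """Decode an MXFP4 block: E2M1 elements (32 per block in the spec)
    sharing one E8M0 scale, here read as 2**(scale - 127)."""
    scale = 2.0 ** (int(scale_e8m0) - 127)
    return np.array([scale * decode_e2m1(e) for e in elements], dtype=np.float32)

# A toy block: scale 2^1 applied to a few 4-bit codes.
print(decode_mx_block(128, [0b0001, 0b0011, 0b0111, 0b1111]))   # [ 1.  3. 12. -12.]
```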