Byte Latent Transformer: Patches Scale Better Than Tokens
zxexz | 378 points | 7mon ago | ai.meta.com
dang|7mon ago
The paper: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/470135129_...
PaulHoule|7mon ago
The summer that BERT came out I was working at a startup that was using character-based CNN models for classification. We were thinking a lot about alternate representations. Other members of the team were keen on word vectors, but I wasn't, particularly because the documents we were working on frequently had out-of-dictionary words, because those words were important, and because discarding them would lead to failure.
(We were working on "foundation models" too, so it's not just being out-of-dictionary in the final model that's a problem but being out-of-dictionary in the foundation model which is more expensive to train.)
We were doing OK with character based models for classification but people believed that storing the "dictionary" inside the neural net was not a good use of the neural net so there was a lot of enthusiasm for tokens.
Meanwhile I felt so sure that schemes like Word2Vec were doomed that I had left an earlier project using RNNs where the goal was text understanding with a foundation model made by training an RNN to write fake abstracts for case reports from PubMed.
When byte-pair encoding was introduced I remember telling people in a meeting that it was the first tokenization scheme we'd looked at that I could endorse.
I have to admit though that I wish we could work at the character level.
binarymax|7mon ago
I was really excited for CANINE [1] but it never really went anywhere. Tokens are a hack. They work for the most part, but it’s clear when they don’t.
yndoendo|7mon ago
Do you mean that all produced output must be a chain of words found in a dictionary?
In the real world, humans create and use non-dictionary words to communicate daily. A good example is "notify", which is defined in the dictionary, versus "notifier", which is not, and which describes "a means to notify someone". The code that sends an email notification is an "email notifier", and then there are text message, voice call, and call-center callback notifiers ....
All industries and organizations have jargon, custom-defined words not found in a dictionary, and non-distinctive acronyms.
How would ML output be useful if it cannot handle real-world communication and is limited to sanitized, in-dictionary-only responses?
entilzha|7mon ago
(Author here)
If I understand your question right, this is one of the reasons BPE is nice and why the parent liked it. For any character sequence, provided the characters are in the alphabet used to create the BPE vocab, there are no unknown words/sequences. One downside of some previous tokenization methods, e.g. dictionary-based ones, is that you could end up with unknown/UNK tokens.
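To make that concrete, here is a rough Python sketch (not the tokenizer from the paper, and the merge bookkeeping is simplified) of why a byte-level base alphabet guarantees coverage: any UTF-8 string decomposes into base bytes, and learned merges only compress the sequence, they never introduce UNK tokens.

```python
# Minimal sketch of byte-level BPE encoding (illustrative only).
# The base vocabulary is all 256 byte values, so any UTF-8 string tokenizes.

def bpe_encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    # Start from raw bytes: every symbol is guaranteed to be in the vocab.
    ids = list(text.encode("utf-8"))
    while True:
        # Find the lowest-rank (earliest-learned) adjacent pair with a merge rule.
        best = None
        for i in range(len(ids) - 1):
            pair = (ids[i], ids[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return ids  # no more merges apply; remaining symbols are plain bytes
        i, pair = best
        # Replace the pair with its merged token id (here: 256 + merge rank).
        ids[i:i + 2] = [256 + merges[pair]]

# Even a made-up word like "notifierified" tokenizes fine with zero merges:
print(bpe_encode("notifierified", merges={}))
```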
In our paper with bytes, we also avoid the UNK issue, since we can have an embedding for every possible byte, as there aren't that many (and for sequences of bytes we use hash embeddings, although we did test n-gram lookups for the top-K most frequent byte n-grams in the training data).
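As a rough sketch of the general idea (the bucket count, n-gram sizes, and the way n-grams are combined with the byte embedding below are my illustrative assumptions, not the paper's exact setup): each byte gets its own embedding, and hashed byte n-grams ending at that position are added in from a fixed-size table.

```python
import torch
import torch.nn as nn

class ByteHashEmbedding(nn.Module):
    """Illustrative byte + hashed n-gram embedding, not the paper's implementation."""

    def __init__(self, dim: int = 256, num_buckets: int = 50_000, ngram_sizes=(3, 4, 5)):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)           # one row per possible byte value
        self.ngram_emb = nn.Embedding(num_buckets, dim)  # shared table for hashed n-grams
        self.ngram_sizes = ngram_sizes
        self.num_buckets = num_buckets

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (seq_len,) tensor of values in [0, 255]
        out = self.byte_emb(byte_ids).clone()
        seq = byte_ids.tolist()
        for n in self.ngram_sizes:
            for i in range(n - 1, len(seq)):
                # Hash the n-gram ending at position i into a fixed-size bucket
                # (a real implementation would use a rolling hash, not Python's hash()).
                bucket = hash(tuple(seq[i - n + 1 : i + 1])) % self.num_buckets
                out[i] = out[i] + self.ngram_emb(torch.tensor(bucket))
        return out  # (seq_len, dim): byte embeddings enriched with local n-gram context

emb = ByteHashEmbedding()
x = torch.tensor(list("patch".encode("utf-8")))
print(emb(x).shape)  # torch.Size([5, 256])
```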
cs702|7mon ago
Nice work. Thank you for commenting on HN!
Did you guys try using an RNN or some other kind of DNN to encode the patches?
entilzha|7mon ago
I don't believe so, or at least if someone tried it, it didn't work well enough that I remember :). Some of the motivation for the architecture changes in encoding patches stemmed from finding FLOP-efficient ways to express relationships between byte sequences. E.g., having a long context window makes sense when dealing with tokens, but you don't need as long an attention window if you're attending over byte sequences to make patch representations, since the patch representations will implicitly be part of a longer context window in terms of number of patches.
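To illustrate the idea (this is my own toy simplification, not the paper's local encoder): a small transformer with a short attention window runs over raw bytes, and the byte states are then pooled into patch vectors; long-range dependencies are handled later at the patch level, so the byte-level window can stay short.

```python
import torch
import torch.nn as nn

def pool_bytes_into_patches(byte_states: torch.Tensor,
                            patch_lengths: list[int]) -> torch.Tensor:
    # byte_states: (seq_len, dim); patch_lengths partitions the byte sequence.
    patches, start = [], 0
    for length in patch_lengths:
        patches.append(byte_states[start:start + length].mean(dim=0))
        start += length
    return torch.stack(patches)  # (num_patches, dim)

dim = 64
byte_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=1,
)

byte_emb = torch.randn(1, 12, dim)   # 12 byte embeddings, batch of 1
window = 4                           # short byte-level attention window
mask = torch.ones(12, 12, dtype=torch.bool)   # True = blocked
for i in range(12):
    mask[i, max(0, i - window + 1):i + 1] = False  # attend only to a local causal window

byte_states = byte_encoder(byte_emb, mask=mask)[0]
patch_vecs = pool_bytes_into_patches(byte_states, patch_lengths=[4, 5, 3])
print(patch_vecs.shape)  # torch.Size([3, 64])
```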
cs702|7mon ago
Thanks for the quick reply!
Interesting. I would have thought one of those "minimum viable" RNNs (like https://arxiv.org/abs/2410.01201) would have been ideal for this. I might tinker a bit with this :-)
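For reference, here is a sequential sketch of the minGRU recurrence from that paper as I read it (the paper's main point is that it also admits a parallel-scan form), used here just to summarize the bytes of one patch into a single vector:

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Sequential sketch of the minGRU recurrence (my reading of arXiv:2410.01201)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_z = nn.Linear(dim, dim)   # gate depends only on the input
        self.to_h = nn.Linear(dim, dim)   # candidate state, no hidden-state dependence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim); returns the final hidden state as a patch summary.
        h = torch.zeros(x.shape[-1])
        for x_t in x:
            z = torch.sigmoid(self.to_z(x_t))
            h_tilde = self.to_h(x_t)
            h = (1 - z) * h + z * h_tilde   # h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
        return h

rnn = MinGRU(dim=64)
patch_bytes = torch.randn(5, 64)   # 5 byte embeddings in one patch
print(rnn(patch_bytes).shape)      # torch.Size([64])
```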
phh|7mon ago
That's the OP's point. At the time, the community was split between word-level, which has the shortcomings you're describing, and byte-level, which is uselessly compute intensive. BPE was the first reasonable in-between. BLT improves on BPE by making the compression learnable rather than precomputed.