**Comparison of BPE implementations**

# Goal

I [rewrote](https://github.com/gwenzek/fastBPE) fastBPE from C++ to [Zig](https://ziglang.org/). FastBPE is a [C++ implementation](https://github.com/glample/fastBPE) of the Byte Pair Encoding algorithm introduced by [Sennrich] et al. The main goal was to actually learn Zig on a small performance-oriented project. I'm also interested in the performance impact of changing language: C++ claims to be as fast as C, while Zig claims to be faster than C thanks to better defaults.

[#Sennrich]: Sennrich, Rico, Barry Haddow, and Alexandra Birch. [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909) (2015)

# What's BPE

In NLP (Natural Language Processing), one of the first things we do when feeding text to another algorithm is to split it into "words". The problem with "words" is that the notion is very specific to each language, poorly defined for some languages, and on top of that you generally end up with a very large vocabulary. This makes writing the "tokenizer" relatively complex, with a lot of language-specific rules, and it ends up being relatively slow. It also assumes you know which language your text is in, which is not always easy to determine.

The large vocabulary is also a problem for neural networks trained to predict the next word in a sentence, because you need to compute a probability for each word in your vocabulary. And you always end up finding words that weren't part of your vocabulary, or that you saw so rarely that your model doesn't handle them correctly.

BPE aims to solve this issue by splitting the text into frequent chunks of characters. So "I'm eating some apples." might become "I'm eat ing some appl es.". It doesn't aim at being linguistically correct, but rather at being a good compromise between speed, correctness, and coverage. I'm personally not a big fan of BPE, because it uses frequencies rather than mutual information, but its implementation is relatively straightforward, so I chose it for this exercise. If this topic interests you, look into [sentence piece](https://github.com/google/sentencepiece).

Outline of the algorithm:

* At train time, use some basic tokenization (on spaces, newlines, ...)
* Compute the frequency of all n-grams
* At inference time, reuse the same basic tokenization
* Treat each character of the word as a token
* Recursively merge the tokens, prioritizing those with the highest score (see the sketch below): `e a t i n g -> ea t i n g -> ea t i ng -> ea t ing -> eat ing`
* The algorithm is linear in the number of words, but quadratic in the length of each word.
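To make the merge loop concrete, here is a minimal Python sketch of the apply phase. It is illustrative only: the names (`apply_word`, `codes`) are mine, the tie-breaking is arbitrary, and the real implementations are more careful about rescanning pairs.

```py
def apply_word(word: str, codes: dict) -> list:
    """Greedily merge character tokens; `codes` maps (left, right) pairs to a score."""
    tokens = list(word)  # start with one token per character
    while len(tokens) > 1:
        # score every adjacent pair, keeping only pairs known to `codes`
        scored = [(codes.get((tokens[i], tokens[i + 1])), i)
                  for i in range(len(tokens) - 1)]
        scored = [(score, i) for score, i in scored if score is not None]
        if not scored:
            break  # no more known pairs: done
        _, i = max(scored)  # merge the highest-scored pair first
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens
```

With codes scoring `("e", "a")` highest, then `("n", "g")`, `("i", "ng")` and `("ea", "t")`, `apply_word("eating", codes)` reproduces the merge sequence shown above. The rescan after every merge is where the quadratic behavior in word length comes from.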
# Zig implementation

The first implementation was very much a translation of the C++ one, as I wanted to keep things comparable. The main difference is memory allocation. In Zig memory allocation is more explicit than in C++ and a bit more cumbersome. I decided to make the Zig version preallocate memory and not allocate during processing. The C++ implementation also does a lot of small string concatenations, which I replaced with array slicing in Zig. I think this is fair because in Zig you have to think about memory allocation, and preallocating makes everything easier. The main downside is that it can't handle "words" longer than 4096 bytes (but that shouldn't happen too often), and arbitrarily splits them into 4096-byte chunks. When facing such a pathological case I prefer my computer to default to an approximate solution rather than churning CPU at it.

# Benchmark

I evaluated the "apply" part of the algorithm, starting from learned codes. The corpus used as input is 800k words long and is extracted from the French Wikipedia. All implementations produce the exact same files.

For C++ there are two implementations:

* `apply` takes a whole file, loads it into memory as one byte array and processes it with 4 threads in parallel.
* `apply_stream` reads from stdin and processes lines one by one in a single thread.

I'm more interested in the second version because there are different use cases where I won't have the file on disk:

* The text is stored compressed in a .gz file.
* The text on disk is not tokenized and is piped through a tokenizer before BPE.
* I don't want to preprocess 100GB of text before starting to work on the next step.

The Zig implementation only has the streaming, single-core mode implemented yet. So unless otherwise mentioned, "implementation" refers to the stream mode. `C++ (batch)` refers to the implementation with the in-memory file and 4 threads.

| Implementation | Time  | Words / s |
|:---------------|------:|----------:|
| C++ (batch)    | 0.47s |   1702k/s |
| C++            | 2.92s |    274k/s |
| Zig            | 0.97s |    825k/s |

# Python wrapper

Another thing I wanted to evaluate with this project is the different ways of writing a C extension for Python. There are 3 main alternatives: Cython, pybind11, and cffi (plus the built-in ctypes). Here is a table comparing them, summarized from Stefan [Behnel]'s blog:

| Tool     | API language | Static  | Fast |
|:--------:|:-------------|:--------|:-----|
| Cython   | Python-like  | static  | fast |
| pybind11 | C++          | static  | fast |
| cffi     | Python       | dynamic | slow |
| ctypes   | Python       | dynamic | slow |

[#Behnel]: [Cython, pybind11, cffi – which tool should you choose?](http://blog.behnel.de/posts/cython-pybind11-cffi-which-tool-to-choose.html) - Stefan Behnel

## Python Wrapper Benchmark

I was curious about the performance impact of the chosen Python wrapper and the general overhead. The Python code is pretty minimal, and the Python functions called are mostly direct C calls, with some string encoding. So I was expecting the Python impact to be negligible.

```py
import fastBPE, sys, time

def apply(file: str, codes: str) -> None:
    start = time.time()
    f = sys.stdin if file == "-" else open(file, mode="r")
    bpe = fastBPE.fastBPE(codes, "")
    for i, line in enumerate(f):
        s = bpe.apply([line[:-1]])[0]
        print(s)
    delay = time.time() - start
    print(
        f"Computed BPE on {i + 1} sentences in {delay:.2f}s, using cython wrapper around cpp implementation",
        file=sys.stderr,
    )
```

| Implementation | Time  | Words / s | Overhead |
|:---------------|------:|----------:|:---------|
| C++            | 2.92s |    274k/s | 1× (ref) |
| C++ - Cython   | 2.79s |    287k/s | 0.95×    |
| Zig            | 1.06s |    755k/s | 1× (ref) |
| Zig - Ctypes   | 1.07s |    748k/s | 1.01×    |

I'm not sure why the Cython implementation is faster than plain C++. The exact numbers vary across runs, but I found the small speed-up to be consistent. Maybe it's due to the Python IO being faster than `stdio.h`? I don't know ^^
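For context, the `Zig - Ctypes` row wraps the Zig library through ctypes, roughly as in the sketch below. Everything here is hypothetical (the library name, function name, and signature are not the actual binding code); it only illustrates the buffer-passing scheme discussed in the flame graph section that follows.

```py
import ctypes

# Hypothetical library and function names, for illustration only.
lib = ctypes.CDLL("./libfastbpe.so")
lib.apply_sentence.restype = ctypes.c_size_t  # number of bytes written

def apply_bpe(sentence: str) -> str:
    data = sentence.encode("utf-8")
    out = ctypes.create_string_buffer(2 * 4096)  # preallocated output buffer
    n = lib.apply_sentence(data, len(data), out, len(out))
    # Slicing `out.raw` copies the result into a fresh `bytes` object;
    # this copy shows up as overhead in the flame graph below.
    return out.raw[:n].decode("utf-8")
```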
## Flame graph

I used [py-spy](https://www.benfrederickson.com/profiling-native-python-extensions-with-py-spy/) to profile the two Python wrappers (click to enlarge).

![C++ - Cython](./test/cpp_cython.svg)

The Cython flame graph is a mess; it's hard to see what's going on. I think the -O3 optimization level is inlining so much stuff that the call stack is probably mostly wrong. In particular the string concatenation isn't calling `process_bpe`. So I'm not sure how many conclusions we can draw from this one. Most of the time seems to be spent inside `unordered_map::find`, which is logical. We can also see that we are spending some 10% of our time in `vector::push_back`, which is probably due to the vectors often resizing.

![Zig - Ctypes](./test/zig_ctypes.svg)

That's more readable. Apparently we spend 25% of the time in `HashMap.ensureCapacityExact`, which is called when the `HashMap` needs to grow. This happens only when loading the byte pairs and their scores from disk. I should probably be more aggressive here with the first allocation, and I could also store the number of pairs in the file. I have to be careful though if I want the Zig code to read old models trained with the C++ code.

I also spend 14% of the time copying the results back from the Zig code to Python. Currently the Python code passes a buffer to Zig and copies the result into a new `bytes` object. I think it'd be better if Zig could directly return a Python bytes object, but I'm not sure how to do that.

We also have `PyCFuncPtrCall`, which costs 6%. This is the cost of calling the Zig function from Python. It could be mitigated by calling a Zig function that reads the full file, but I think it's interesting to benchmark it this way, because one of the points of the Python wrapper is to grant more flexibility to the caller.

# Parting word

Overall it was fun to learn more about Zig, and the onboarding was OK. I found myself reading a lot of the source code of the [standard library](https://github.com/ziglang/zig/blob/master/lib/std/hash_map.zig), because the [documentation](https://ziglang.org/documentation/0.6.0/std/#std) is sometimes still a bit lacking. But reading Zig code was also a good way of learning it :-)

I'm new to Zig, so my code probably isn't the best, but you can [check it out](https://github.com/gwenzek/fastBPE/blob/master/fastBPE/applyBPE.zig).

Thanks for reading!

Cheers,
Guillaume