
Figure 1
The Colibri Core architecture; light green squares represent data models, yellowish rounded squares represent processes that manipulate data.
Algorithm 1
Informed Iterative Counting for n-grams. Take m to be the maximum n-gram order we intend to extract, t to be the minimum occurrence threshold, and M to be the pattern model in memory, with unigrams already counted in the more trivial fashion.
| for n ∈ 2..m do |
| for line ∈ corpus do |
| for ngram ∈ extractNGrams (line, n) do |
| nm1gram1, nm1gram2 ← extractNGrams(ngram, n – 1) |
| if M (nm1gram1) ≥ t & M (nm1gram2) ≥ t then |
| M (ngram) ← M (ngram) + 1 |
| end if |
| end for |
| end for |
| M ← pruneNGrams(M, n, t) |
| end for |
| return M |
Algorithm 2
Informed Counting for skipgrams. Take l to be the maximum skipgram order we intend to extract, t to be the minimum occurrence threshold, and M to be the pattern model in memory, with ngrams already counted.
| for n ∈ 3..l do |
| for ngram ∈ getNGrams(M, n, t) do |
| for skipgram ∈ possibleConfigurations(ngram) do |
| docount ∈ True |
| for part ∈ parts(skipgram) do |
| if M (part) < t then |
| docount ∈ False Break |
| end if |
| end for |
| if docount then |
| M(skipgram) ∈ M(skipgram) + 1 |
| end if |
| end for |
| end for |
| M ∈ pruneSkipgrams(M, n, t) |
| end for |
| return M |
Table 1
Colibri Core performance benchmarks on the IWSLT data set. Performed by the colibri-benchmarks tool and the benchmarks.py script for the Python baselines. All non-Colibri Core references are marked with an asterisk (*). Memory usage is measured as the difference in resident memory after training and before training. Peak memory usage is measured absolutely as reported by the OS and included in parentheses. All experiments were performed on a Linux system with an Intel Xeon CPU (E5-2630L v3) at 1.80GHz and 256GB RAM. Parameter t refers to the occurrence threshold, l to the maximum pattern length.
| Experiment | CPU time | Memory |
|---|---|---|
| IWSLT data, without thresholding (t = 1, l = 8) | ||
| * Naive Python implementation | 20.0 s | 1404 MB |
| * Python NLTK implementation | 18.7 s | 1413 MB |
| * Naive C++ implementation | 10.5 s | 1398 MB |
| Unindexed Pattern Model (from file) | 29.7 s | 775 MB (787 MB) |
| Unindexed Pattern Pointer Model (preloaded corpus) | 19.8 s | 615 MB (627 MB) |
| IWSLT data, with thresholding (t = 2, l = 8) | ||
| * Naive Python implementation | 28.8 s | 1485 MB |
| * Python with iterative counting | 70.9 s | 171 MB |
| Unindexed Pattern Model (from file) | 14.6 s | 64 MB (76 MB) |
| Unindexed Pattern Model (preloaded corpus) | 11.3 s | 64 MB (76 MB) |
| Unindexed Pattern Pointer Model (preloaded corpus) | 9.0 s | 50 MB (62 MB) |
| Indexed Pattern Model (preloaded corpus) | 13.6 s | 148 MB (160 MB) |
| Indexed Pattern Pointer Model (preloaded corpus) | 10.1 s | 133 MB (146 MB) |
| Unindexed Pattern Model (ordered map) | 30.5 s | 84 MB (96 MB) |
| Unindexed Pattern Model (preloaded corpus), with skipgrams | 62.4 s | 105 MB (118 MB) |
| Unindexed Pattern Pointer Model (preloaded), skipgrams | 50.0 s | 89 MB (102 MB) |
| Indexed Pattern Model (preloaded corpus), with skipgrams | 24.2 s | 165 MB (178 MB) |
| IWSLT data, LM comparison, trigram model (t = 1, l = 3) | ||
| Unindexed Pattern Model (from file) | 5.9 s | 152 MB (164 MB) |
| * SRILM ngram-count | 1.5 s | 80 MB |
Table 2
Benchmarks on two large data sets. Same setup as Table 1. Parameter y represents the occurrence threshold specifically for skipgrams.
| Experiment | CPU time | Memory |
|---|---|---|
| JRC-Acquis data, with thresholding (t = 2, l = 8) | ||
| *Naive Python implementatio | 8m 4s | 11407 MB |
| Unindexed Pattern Model (from file) | 4m 18s | 1704 MB (1803 MB) |
| Unindexed Pattern Model (preloaded corpus) | 3m 58s | 1700 MB (1800 MB) |
| Unindexed Pattern Pointer Model (preloaded corpus) | 3m 1s | 1344 MB (1445 MB) |
| Indexed Pattern Model (preloaded corpus) | 4m 36s | 4117 MB (4217 MB) |
| Unindexed Pattern Model (preloaded corpus), skipgrams | 61m 1s | 30420 MB (30520 MB) |
| Indexed Pattern Model (prel.), skipgrams, skiptypes = 12 | 80m 11s | 29746 MB (29846 MB) |
| JRC-Acquis data, LM comparison, trigram model (t = 1, l = 3) | ||
| Unindexed Pattern Model (from file) | 42s | 716 MB (847 MB) |
| *SRILM ngram-count | 12s | 348 MB |
| Google Billion Words corpus, with heavy thresholding (t = 10, l = 4, y = 20) | ||
| Unindexed Pattern Pointer Model (preloaded corpus) | 83m 14s | 12279 MB (15317 MB) |
| Google Billion Words corpus, trigram model (t = 1, l = 3) | ||
| Unindexed Pattern Model (from file) | 25m 16s | 16568 MB (19788 MB) |
