Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core

Maarten van Gompel; Antal van den Bosch

doi:10.5334/jors.105

Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core

Journal of Open Research Software

Volume 4 (2016): Issue 1

By: Maarten van Gompel and Antal van den Bosch

Open Access

|Aug 2016

Figures & Tables

The Colibri Core architecture; light green squares represent data models, yellowish rounded squares represent processes that manipulate data.

Algorithm 1

Informed Iterative Counting for n-grams. Take m to be the maximum n-gram order we intend to extract, t to be the minimum occurrence threshold, and M to be the pattern model in memory, with unigrams already counted in the more trivial fashion.

for n ∈ 2..m do

for line ∈ corpus do

for ngram ∈ extractNGrams (line, n) do

nm1gram₁, nm1gram₂ ← extractNGrams(ngram, n – 1)

if M (nm1gram₁) ≥ t & M (nm1gram₂) ≥ t then

M (ngram) ← M (ngram) + 1

end if

end for

M ← pruneNGrams(M, n, t)

end for

return M

Algorithm 2

Informed Counting for skipgrams. Take l to be the maximum skipgram order we intend to extract, t to be the minimum occurrence threshold, and M to be the pattern model in memory, with ngrams already counted.

for n ∈ 3..l do

for ngram ∈ getNGrams(M, n, t) do

for skipgram ∈ possibleConfigurations(ngram) do

docount ∈ True

for part ∈ parts(skipgram) do

if M (part) < t then

docount ∈ False Break

end if

end for

if docount then

M(skipgram) ∈ M(skipgram) + 1

end if

end for

M ∈ pruneSkipgrams(M, n, t)

end for

return M

Table 1

Colibri Core performance benchmarks on the IWSLT data set. Performed by the colibri-benchmarks tool and the benchmarks.py script for the Python baselines. All non-Colibri Core references are marked with an asterisk (*). Memory usage is measured as the difference in resident memory after training and before training. Peak memory usage is measured absolutely as reported by the OS and included in parentheses. All experiments were performed on a Linux system with an Intel Xeon CPU (E5-2630L v3) at 1.80GHz and 256GB RAM. Parameter t refers to the occurrence threshold, l to the maximum pattern length.

Experiment	CPU time	Memory
IWSLT data, without thresholding (t = 1, l = 8)
* Naive Python implementation	20.0 s	1404 MB
* Python NLTK implementation	18.7 s	1413 MB
* Naive C++ implementation	10.5 s	1398 MB
Unindexed Pattern Model (from file)	29.7 s	775 MB (787 MB)
Unindexed Pattern Pointer Model (preloaded corpus)	19.8 s	615 MB (627 MB)
IWSLT data, with thresholding (t = 2, l = 8)
* Naive Python implementation	28.8 s	1485 MB
* Python with iterative counting	70.9 s	171 MB
Unindexed Pattern Model (from file)	14.6 s	64 MB (76 MB)
Unindexed Pattern Model (preloaded corpus)	11.3 s	64 MB (76 MB)
Unindexed Pattern Pointer Model (preloaded corpus)	9.0 s	50 MB (62 MB)
Indexed Pattern Model (preloaded corpus)	13.6 s	148 MB (160 MB)
Indexed Pattern Pointer Model (preloaded corpus)	10.1 s	133 MB (146 MB)
Unindexed Pattern Model (ordered map)	30.5 s	84 MB (96 MB)
Unindexed Pattern Model (preloaded corpus), with skipgrams	62.4 s	105 MB (118 MB)
Unindexed Pattern Pointer Model (preloaded), skipgrams	50.0 s	89 MB (102 MB)
Indexed Pattern Model (preloaded corpus), with skipgrams	24.2 s	165 MB (178 MB)
IWSLT data, LM comparison, trigram model (t = 1, l = 3)
Unindexed Pattern Model (from file)	5.9 s	152 MB (164 MB)
* SRILM ngram-count	1.5 s	80 MB

Table 2

Benchmarks on two large data sets. Same setup as Table 1. Parameter y represents the occurrence threshold specifically for skipgrams.

Experiment	CPU time	Memory
JRC-Acquis data, with thresholding (t = 2, l = 8)
*Naive Python implementatio	8m 4s	11407 MB
Unindexed Pattern Model (from file)	4m 18s	1704 MB (1803 MB)
Unindexed Pattern Model (preloaded corpus)	3m 58s	1700 MB (1800 MB)
Unindexed Pattern Pointer Model (preloaded corpus)	3m 1s	1344 MB (1445 MB)
Indexed Pattern Model (preloaded corpus)	4m 36s	4117 MB (4217 MB)
Unindexed Pattern Model (preloaded corpus), skipgrams	61m 1s	30420 MB (30520 MB)
Indexed Pattern Model (prel.), skipgrams, skiptypes = 12	80m 11s	29746 MB (29846 MB)
JRC-Acquis data, LM comparison, trigram model (t = 1, l = 3)
Unindexed Pattern Model (from file)	42s	716 MB (847 MB)
*SRILM ngram-count	12s	348 MB
Google Billion Words corpus, with heavy thresholding (t = 10, l = 4, y = 20)
Unindexed Pattern Pointer Model (preloaded corpus)	83m 14s	12279 MB (15317 MB)
Google Billion Words corpus, trigram model (t = 1, l = 3)
Unindexed Pattern Model (from file)	25m 16s	16568 MB (19788 MB)