
Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core

Open Access | Aug 2016

Figures & Tables

Figure 1

The Colibri Core architecture; light green squares represent data models, and yellowish rounded squares represent processes that manipulate data.

Algorithm 1

Informed Iterative Counting for n-grams. Take m to be the maximum n-gram order we intend to extract, t to be the minimum occurrence threshold, and M to be the pattern model in memory, with unigrams already counted in the more trivial fashion.

for n ∈ 2..m do
    for line ∈ corpus do
        for ngram ∈ extractNGrams(line, n) do
            nm1gram1, nm1gram2 ← extractNGrams(ngram, n − 1)
            if M(nm1gram1) ≥ t and M(nm1gram2) ≥ t then
                M(ngram) ← M(ngram) + 1
            end if
        end for
    end for
    M ← pruneNGrams(M, n, t)
end for
return M
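Algorithm 1 can be sketched in Python as follows. This is an illustrative reimplementation, not the Colibri Core C++ code; the helper `extract_ngrams` and the pruning expression are our own, and the model M is a plain `Counter` over token tuples rather than Colibri Core's compact pattern encoding.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """All n-grams of order n in a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def informed_count(corpus, m, t):
    """Informed iterative counting: an n-gram is only counted if both of
    its (n-1)-gram parts already met the occurrence threshold t."""
    M = Counter()
    # Unigrams are counted in the trivial fashion first, then pruned.
    for line in corpus:
        M.update(extract_ngrams(line, 1))
    M = Counter({g: c for g, c in M.items() if c >= t})
    for n in range(2, m + 1):
        for line in corpus:
            for ngram in extract_ngrams(line, n):
                # The two (n-1)-grams contained in this n-gram.
                nm1gram1, nm1gram2 = extract_ngrams(ngram, n - 1)
                if M[nm1gram1] >= t and M[nm1gram2] >= t:
                    M[ngram] += 1
        # Prune n-grams of order n that fall below the threshold.
        M = Counter({g: c for g, c in M.items()
                     if len(g) != n or c >= t})
    return M
```

Note how the pruning pass after each order n keeps the intermediate model small: an n-gram that cannot reach the threshold never survives to mislead the (n+1)-gram pass.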
Algorithm 2

Informed Counting for skipgrams. Take l to be the maximum skipgram order we intend to extract, t to be the minimum occurrence threshold, and M to be the pattern model in memory, with n-grams already counted.

for n ∈ 3..l do
    for ngram ∈ getNGrams(M, n, t) do
        for skipgram ∈ possibleConfigurations(ngram) do
            docount ← True
            for part ∈ parts(skipgram) do
                if M(part) < t then
                    docount ← False
                    break
                end if
            end for
            if docount then
                M(skipgram) ← M(skipgram) + 1
            end if
        end for
    end for
    M ← pruneSkipgrams(M, n, t)
end for
return M
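A Python sketch of Algorithm 2 follows, again as an illustrative reimplementation rather than the Colibri Core internals: the gap marker `{*}`, the helpers `skip_configurations` and `parts`, and the in-place pruning are our own assumptions. Following the pseudocode, each qualifying n-gram drawn from the model contributes one count to each of its skipgram configurations.

```python
from collections import Counter
from itertools import product

GAP = "{*}"  # gap marker; chosen for this sketch

def skip_configurations(ngram):
    """All skipgrams derivable from an n-gram by replacing one or more
    interior tokens with a gap (first and last token never gap)."""
    n = len(ngram)
    configs = []
    for mask in product([False, True], repeat=n - 2):
        if not any(mask):
            continue  # no gaps would just reproduce the n-gram itself
        configs.append(tuple(
            GAP if 0 < i < n - 1 and mask[i - 1] else tok
            for i, tok in enumerate(ngram)))
    return configs

def parts(skipgram):
    """Contiguous non-gap parts of a skipgram, as tuples."""
    out, cur = [], []
    for tok in skipgram:
        if tok == GAP:
            if cur:
                out.append(tuple(cur))
            cur = []
        else:
            cur.append(tok)
    if cur:
        out.append(tuple(cur))
    return out

def informed_skipgram_count(M, l, t):
    """Add skipgram counts to a model M (a Counter over token tuples)
    that already holds n-gram counts above threshold t."""
    for n in range(3, l + 1):
        ngrams = [g for g, c in M.items()
                  if len(g) == n and GAP not in g and c >= t]
        for ngram in ngrams:
            for skipgram in skip_configurations(ngram):
                # Count only if every contiguous part meets the threshold.
                if all(M[p] >= t for p in parts(skipgram)):
                    M[skipgram] += 1
        # Prune skipgrams of order n that fall below the threshold.
        for g in [g for g, c in M.items()
                  if len(g) == n and GAP in g and c < t]:
            del M[g]
    return M
```

The early break in the pseudocode corresponds to the short-circuiting `all(...)` here: one sub-threshold part suffices to reject a skipgram configuration without checking the rest.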
Table 1

Colibri Core performance benchmarks on the IWSLT data set. Performed by the colibri-benchmarks tool and the benchmarks.py script for the Python baselines. All non-Colibri Core references are marked with an asterisk (*). Memory usage is measured as the difference in resident memory after training and before training. Peak memory usage is measured absolutely as reported by the OS and included in parentheses. All experiments were performed on a Linux system with an Intel Xeon CPU (E5-2630L v3) at 1.80GHz and 256GB RAM. Parameter t refers to the occurrence threshold, l to the maximum pattern length.

Experiment | CPU time | Memory
IWSLT data, without thresholding (t = 1, l = 8)
* Naive Python implementation | 20.0 s | 1404 MB
* Python NLTK implementation | 18.7 s | 1413 MB
* Naive C++ implementation | 10.5 s | 1398 MB
Unindexed Pattern Model (from file) | 29.7 s | 775 MB (787 MB)
Unindexed Pattern Pointer Model (preloaded corpus) | 19.8 s | 615 MB (627 MB)
IWSLT data, with thresholding (t = 2, l = 8)
* Naive Python implementation | 28.8 s | 1485 MB
* Python with iterative counting | 70.9 s | 171 MB
Unindexed Pattern Model (from file) | 14.6 s | 64 MB (76 MB)
Unindexed Pattern Model (preloaded corpus) | 11.3 s | 64 MB (76 MB)
Unindexed Pattern Pointer Model (preloaded corpus) | 9.0 s | 50 MB (62 MB)
Indexed Pattern Model (preloaded corpus) | 13.6 s | 148 MB (160 MB)
Indexed Pattern Pointer Model (preloaded corpus) | 10.1 s | 133 MB (146 MB)
Unindexed Pattern Model (ordered map) | 30.5 s | 84 MB (96 MB)
Unindexed Pattern Model (preloaded corpus), with skipgrams | 62.4 s | 105 MB (118 MB)
Unindexed Pattern Pointer Model (preloaded), skipgrams | 50.0 s | 89 MB (102 MB)
Indexed Pattern Model (preloaded corpus), with skipgrams | 24.2 s | 165 MB (178 MB)
IWSLT data, LM comparison, trigram model (t = 1, l = 3)
Unindexed Pattern Model (from file) | 5.9 s | 152 MB (164 MB)
* SRILM ngram-count | 1.5 s | 80 MB
Table 2

Benchmarks on two large data sets. Same setup as Table 1. Parameter y represents the occurrence threshold specifically for skipgrams.

Experiment | CPU time | Memory
JRC-Acquis data, with thresholding (t = 2, l = 8)
* Naive Python implementation | 8m 4s | 11407 MB
Unindexed Pattern Model (from file) | 4m 18s | 1704 MB (1803 MB)
Unindexed Pattern Model (preloaded corpus) | 3m 58s | 1700 MB (1800 MB)
Unindexed Pattern Pointer Model (preloaded corpus) | 3m 1s | 1344 MB (1445 MB)
Indexed Pattern Model (preloaded corpus) | 4m 36s | 4117 MB (4217 MB)
Unindexed Pattern Model (preloaded corpus), skipgrams | 61m 1s | 30420 MB (30520 MB)
Indexed Pattern Model (prel.), skipgrams, skiptypes = 12 | 80m 11s | 29746 MB (29846 MB)
JRC-Acquis data, LM comparison, trigram model (t = 1, l = 3)
Unindexed Pattern Model (from file) | 42s | 716 MB (847 MB)
* SRILM ngram-count | 12s | 348 MB
Google Billion Words corpus, with heavy thresholding (t = 10, l = 4, y = 20)
Unindexed Pattern Pointer Model (preloaded corpus) | 83m 14s | 12279 MB (15317 MB)
Google Billion Words corpus, trigram model (t = 1, l = 3)
Unindexed Pattern Model (from file) | 25m 16s | 16568 MB (19788 MB)
DOI: https://doi.org/10.5334/jors.105 | Journal eISSN: 2049-9647
Language: English
Submitted on: Nov 9, 2015
Accepted on: Jul 1, 2016
Published on: Aug 2, 2016
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2016 Maarten van Gompel, Antal van den Bosch, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.