proycon / colibri-core

Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.

Home Page: https://proycon.github.io/colibri-core

License: GNU General Public License v3.0

Python 2.60% Shell 2.52% C++ 69.76% TeX 0.19% Jupyter Notebook 10.09% Makefile 0.18% M4 0.85% Dockerfile 0.07% Cython 13.74%
c-plus-plus python nlp ngrams skipgram ngram corpus linguistics library text-processing

colibri-core's Introduction

Colibri Core

Project Status: Active – the project has reached a stable, usable state and is being actively developed.

by Maarten van Gompel, [email protected], Radboud University Nijmegen

Licensed under GPLv3 (See http://www.gnu.org/licenses/gpl-3.0.html)

Colibri Core is software to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on the extracted patterns, and to compute relations between the extracted patterns. The employed notion of pattern or construction encompasses the following categories:

  • n-gram -- n consecutive words
  • skipgram -- an abstract pattern of predetermined length with one or more gaps of fixed size
  • flexgram -- an abstract pattern with one or more gaps of variable size
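
In the Python binding, all three categories can be built with the same buildpattern call on a class encoder (that call appears in the issues further down this page). A minimal sketch, assuming a class file corpus.colibri.cls already exists and assuming that the {*}/{**} gap notation seen in decoded output is also accepted on input; verify against the documentation:

import colibricore

# Hypothetical class file; see the encoding sketch further down.
classencoder = colibricore.ClassEncoder("corpus.colibri.cls")

ngram = classencoder.buildpattern("to be or")       # n-gram: 3 consecutive words
skipgram = classencoder.buildpattern("to {*} or")   # skipgram: one gap of fixed size
flexgram = classencoder.buildpattern("to {**} or")  # flexgram: one gap of variable size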

N-gram extraction may seem fairly trivial at first: with a few lines in your favourite scripting language, you can move a simple sliding window of size n over your corpus and store the results in some kind of hashmap. This trivial approach, however, makes an unnecessarily high demand on memory resources, which often becomes prohibitive when unleashed on large corpora. Colibri Core tries to minimise these space requirements in several ways:
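
For contrast, the trivial sliding-window approach described above might look like the following sketch (plain Python, not part of Colibri Core). Note how every distinct n-gram is kept as a full string key in the hashmap, which is exactly what exhausts memory on large corpora:

from collections import Counter

def naive_ngram_count(tokens, n):
    # Slide a window of size n over the corpus and count each n-gram,
    # storing it as a full string key.
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "to be or not to be".split()
print(naive_ngram_count(tokens, 2))  # Counter({'to be': 2, 'be or': 1, ...})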

  • Compressed binary representation -- Each word type is assigned a numeric class, which is encoded in a compact binary format in which highly frequent classes take less space than less frequent classes. Colibri Core always uses this representation rather than a full string representation, both on disk and in memory (see the encoding sketch after this list).
  • Informed iterative counting -- Counting is performed more intelligently by iteratively processing the corpus in several passes and quickly discarding patterns that won't reach the desired occurrence threshold.
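
The compressed representation of the first bullet above is produced by the class encoder in the Python binding. A minimal sketch mirroring the usage shown in the issues below; file names are placeholders:

import colibricore

# Build a class file (word type -> numeric class) from a plain-text corpus,
# then encode the corpus into the compact binary format.
classencoder = colibricore.ClassEncoder()
classencoder.build("corpus.txt")
classencoder.save("corpus.colibri.cls")
classencoder.encodefile("corpus.txt", "corpus.colibri.dat")

# The decoder maps numeric classes back to word strings.
classdecoder = colibricore.ClassDecoder("corpus.colibri.cls")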

Skipgram and flexgram extraction are computationally more demanding but have been implemented with similar optimisations. Skipgrams are computed by abstracting over n-grams, and flexgrams in turn are computed either by abstracting over skipgrams, or directly from n-grams on the basis of co-occurrence information (pointwise mutual information).
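
Pointwise mutual information has its standard definition here; as a sketch (not Colibri Core's internal code):

import math

def pmi(p_xy, p_x, p_y):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ): positive when x and y
    # co-occur more often than independence would predict.
    return math.log(p_xy / (p_x * p_y))

print(pmi(0.1, 0.2, 0.2))  # ~0.916: co-occurrence above chance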

At the heart of the software is the notion of pattern models. The core tool, to be used from the command line, is colibri-patternmodeller, which enables you to build pattern models, generate statistical reports, query for specific patterns and relations, and manipulate models.

A pattern model is simply a collection of patterns (of any of the three categories) extracted from a specific corpus, along with their counts. Pattern models come in two varieties:

  • Unindexed Pattern Model -- The simplest form, which simply stores the patterns and their counts.
  • Indexed Pattern Model -- The more informed form, which retains all indices to the original corpus, at the cost of more memory/disk space.

The Indexed Pattern Model is much more powerful, and allows more statistics and relations to be inferred.
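
In the Python binding the two varieties correspond to two model classes. A minimal sketch, assuming an encoded corpus corpus.colibri.dat as produced earlier; IndexedPatternModel and its reverseindex parameter appear in the issues below, while the UnindexedPatternModel class name is an assumption to verify against the documentation:

import colibricore

options = colibricore.PatternModelOptions(mintokens=2, maxlength=8)

# Unindexed: stores patterns and counts only.
unindexed = colibricore.UnindexedPatternModel()
unindexed.train("corpus.colibri.dat", options)

# Indexed: retains references into the original corpus, enabling
# relations such as neighbours and co-occurrence, at a memory cost.
corpus = colibricore.IndexedCorpus("corpus.colibri.dat")
indexed = colibricore.IndexedPatternModel(reverseindex=corpus)
indexed.train("corpus.colibri.dat", options)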

The generation of pattern models is optionally parametrised by a minimum occurrence threshold, a maximum pattern length, and a lower bound on the number of distinct types that may instantiate a skipgram (i.e. possible fillings of the gaps).
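
These parameters map onto PatternModelOptions in the Python binding. The mintokens, maxlength and doskipgrams names appear verbatim in the issues below; minskiptypes as the name for the skip-instantiation lower bound is an assumption to be verified against the documentation:

import colibricore

options = colibricore.PatternModelOptions(
    mintokens=2,       # minimum occurrence threshold
    maxlength=8,       # maximum pattern length
    doskipgrams=True,  # also extract skipgrams
    minskiptypes=2,    # assumed name: lower bound on distinct gap fillings
)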

Technical Details

Colibri Core is available as a collection of standalone command-line tools, as a C++ library, and as a Python library.

Please consult the full documentation at https://proycon.github.io/colibri-core

Installation

Python binding

For the Colibri Core Python library, just install using:

pip install colibricore

We strongly recommend you use a Virtual Environment for this. Do note that this is only available for Unix-like systems; Windows is not supported.
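
A quick post-install sanity check (a sketch; it merely verifies that the extension module loads):

import colibricore

print(colibricore.__file__)           # where the compiled module lives
encoder = colibricore.ClassEncoder()  # construct an empty class encoder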

Installation from packages

For the command-line tools, check if your distribution has a package available. There are packages for Alpine Linux (apk add colibri-core) and for macOS with homebrew (brew tap fbkarsdorp/homebrew-lamachine && brew install colibri-core). Note that these do not contain the Python binding!

Otherwise you will need to either use the container image or to build and install from source.

Installation from source

If no packages are available, you will need to compile from source or use the container build (e.g. Docker) as explained later on.

In order to do so, you need a sane build environment; install the necessary dependencies for your distribution:

For Debian/Ubuntu:

$ sudo apt-get install make gcc g++ pkg-config autoconf-archive libtool autotools-dev libbz2-dev zlib1g-dev libtar-dev python3 python3-dev cython3

For RedHat-based systems (run as root):

# yum install pkgconfig libtool autoconf automake autoconf-archive make gcc gcc-c++ libtar libtar-devel python3 python3-devel zlib zlib-devel python3-pip bzip2 bzip2-devel cython3

For macOS with homebrew:

$ brew install autoconf automake libtool autoconf-archive python3 pkg-config

Then clone this repository and install as follows:

$ bash bootstrap
$ ./configure
$ make
$ sudo make install

Container usage

The Colibri Core command-line tools are also available as an OCI/Docker container.

A pre-made container image can be obtained from Docker Hub as follows:

docker pull proycon/colibri-core

You can also build a container image yourself as follows; make sure you are in the root of this repository:

docker build -t proycon/colibri-core .

This builds the latest stable release; if you want to use the latest development version from the git repository instead, do:

docker build -t proycon/colibri-core --build-arg VERSION=development .

Run the container interactively as follows; it will dump you into a shell where the various command-line tools are available:

docker run -t -i proycon/colibri-core

Add the -v /path/to/your/data:/data parameter if you want to mount your data volume into the container at /data.

Demo

Colibri Core Demo

Publication

This software is extensively described in the following peer-reviewed publication:

van Gompel, M and van den Bosch, A (2016)
Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core.
*Journal of Open Research Software*
4: e30, DOI: http://dx.doi.org/10.5334/jors.105

Access the publication via the DOI above and please cite it if you make use of Colibri Core in your work.

colibri-core's People

Contributors

kosloot, naiaden, prasastoadi, proycon


colibri-core's Issues

Class encoding fails if input only contains one line without new line?

Discovered by @fkunneman; the output file was only 2 bytes (the initial null byte and version marker).

Input text was just: prachtig apparaat en droogt goed kreukelvrij fijn de verlichting binnenin voelt heel robuust en ziet er ook erg leuk uit

Also verify this doesn't imply we lose the last sentence on larger encodings (can't imagine it does, as the tests probably cover this, but better check).

pip failed building wheel for colibricore Mac OSX 10.11.2

I brew installed the dependencies, but get "colibricore_wrapper.cpp:258:10: fatal error: 'unordered_map' file not found" error below after trying to pip install colibricore.

(coco)~/colibri-core - [master●] » pip install colibricore
Collecting colibricore
  Using cached colibricore-2.1.2.tar.gz
Requirement already satisfied (use --upgrade to upgrade): Cython>=0.23 in /Users/me/anaconda/envs/coco/lib/python3.4/site-packages (from colibricore)
Building wheels for collected packages: colibricore
  Running setup.py bdist_wheel for colibricore
  Complete output from command /Users/me/anaconda/envs/coco/bin/python3 -c "import setuptools;__file__='/private/var/folders/6n/__f45xnx36q9r_fy3jg68tz8tn99rh/T/pip-build-hat3uagr/colibricore/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /var/folders/6n/__f45xnx36q9r_fy3jg68tz8tn99rh/T/tmps16svvf_pip-wheel-:
  running bdist_wheel
  running build
  running build_ext
  cythoning colibricore_wrapper.pyx to colibricore_wrapper.cpp
  building 'colibricore' extension
  creating build
  creating build/temp.macosx-10.5-x86_64-3.4
  gcc -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/me/anaconda/envs/coco/include -arch x86_64 -I/usr/local/include/colibri-core -I/usr/include/colibri-core -I/usr/include/libxml2 -I/Users/me/anaconda/envs/coco/include/python3.4m -c colibricore_wrapper.cpp -o build/temp.macosx-10.5-x86_64-3.4/colibricore_wrapper.o --std=c++0x
  colibricore_wrapper.cpp:258:10: fatal error: 'unordered_map' file not found
  #include <unordered_map>
           ^
  1 error generated.
  (Writing /private/var/folders/6n/__f45xnx36q9r_fy3jg68tz8tn99rh/T/pip-build-hat3uagr/colibricore/colibricore_wrapper.pyx)
  /Users/me/anaconda/envs/coco/lib/python3.4/distutils/extension.py:132: UserWarning: Unknown Extension options: 'pyrex_gdb'
    warnings.warn(msg)
  warning: colibricore_wrapper.pyx:1003:12: Unreachable code
  warning: colibricore_wrapper.pyx:1247:8: Unreachable code
  warning: colibricore_wrapper.pyx:2050:8: Unreachable code
  warning: colibricore_wrapper.pyx:2951:8: Unreachable code
  warning: colibricore_wrapper.pyx:3425:8: Unreachable code
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
Failed building wheel for colibricore
Failed to build colibricore
Installing collected packages: colibricore
(coco)~/colibri-core - [master●] » brew --config
HOMEBREW_VERSION: 0.9.5
ORIGIN: https://github.com/Homebrew/homebrew
HEAD: 2ae9b385ff174db4e1ac713f47a88c0e7034c516
Last commit: 15 minutes ago
HOMEBREW_PREFIX: /usr/local
HOMEBREW_REPOSITORY: /usr/local
HOMEBREW_CELLAR: /usr/local/Cellar
HOMEBREW_BOTTLE_DOMAIN: https://homebrew.bintray.com
CPU: 8-core 64-bit haswell
OS X: 10.11.2-x86_64
Xcode: 7.2
CLT: 7.2.0.0.1.1447826929
Clang: 7.0 build 700
X11: N/A
System Ruby: 2.0.0-p645
Perl: /usr/bin/perl
Python: /Users/me/anaconda/envs/coco/bin/python => /Users/me/anaconda/envs/coco/bin/python3.4
Ruby: /usr/bin/ruby => /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/bin/ruby
Java: 1.8.0_66

Provide vocabulary file

I would like to have a feature which allows me to limit the classes to a certain vocabulary. If you want to reproduce experiments by others, often you are given a vocabulary as well. Right now there is not a trivial way to limit the words to a certain vocabulary, without sacrificing efficiency in the encoding.

What I want is to give a vocabulary as parameter, and that the class file is limited to the words found in the vocabulary. The other words are mapped to OOV.

[Queries] Ability to create a model and cls from multiple input files

Hi,

To begin with, Thank you.. For the amazing work you've done so far..
I have a few questions regarding my usage of colibri-core in my project.

What I am trying to build is a model that learns recurring patterns from a set of input text files. These are log files of a collection of software components.

Each line in my log file is converted to a unique hash representing that line, and the input to the training is a single line whose words are the hashes, word count is equal to the line count of the actual log file. This is done to generate patterns across lines and not words.

The model is then used to analyse whether patterns in a given test file match against the training data, to detect any anomalies or unknown patterns. I am using your library for its ability to create variable-length ngrams, skipgrams and flexgrams.
The questions that I have are as follows -

  1. How do I create a unified model and class file that contains patterns learnt from multiple input files?
  2. Do I save the class file and model after every instance of a model trained from an input file, or can I train from multiple input files and then finally call .save()/.write()?
  3. Is there a way to perform this training on multiple cores, while saving the information to a single model? Multithreading?
  4. Alternatively, is it possible to create multiple temporary models through a batch operation and then somehow merge them together into a single model file and .cls file?
  5. Also, I see random crashes sometimes while parsing a file. Re-running the training on the same file again sometimes results in a crash at the same point, and sometimes doesn't, which is weird. I'll try to get the backtraces for those crashes whenever I reproduce the issue again.

I am willing to contribute any changes done in regards to the above requirements if you could just guide me. I have also attached the relevant code that shows my usage of the library.

train_program.py.zip

Non-functioning constraints in .getrightneighbours(), .getcooc() etc.

I wanted to get only n-grams of a specific size following some other n-gram. However, I experienced that the output did not adhere to the given constraints, at least not as I expected. I've consulted the documentation to figure out if I simply misunderstood something; if so, please enlighten me. :-)

Here is a working example (most of it taken from the tutorial notebook) which shows it:

import colibricore
from urllib.request import urlopen


TMPDIR = '/tmp/'
corpusfile_plato_plaintext = TMPDIR + "republic.txt"
classfile_plato = TMPDIR + "republic.colibri.cls"
corpusfile_plato = TMPDIR + "republic.colibri.dat"

f = urlopen('http://lst.science.ru.nl/~proycon/republic.txt')
with open(corpusfile_plato_plaintext,'wb') as of:
    of.write(f.read())
print("Downloaded to " + corpusfile_plato_plaintext)

# make encoder, encode corpus and make decoder
classencoder = colibricore.ClassEncoder(classfile_plato)
classencoder.build(corpusfile_plato_plaintext)
classencoder.save(classfile_plato)
classencoder.encodefile(corpusfile_plato_plaintext, corpusfile_plato)
classdecoder = colibricore.ClassDecoder(classfile_plato)

# set options and train model
options = colibricore.PatternModelOptions(mintokens=2, maxlength=8,
                                          doskipgrams=True)
corpus_plato = colibricore.IndexedCorpus(corpusfile_plato)
model = colibricore.IndexedPatternModel(reverseindex=corpus_plato)
model.train(corpusfile_plato, options)

# make ngram and get its neighbours under different constraints
ngram = classencoder.buildpattern("the law")
no_constraint = {(pattern, count)
                 for pattern, count in model.getrightneighbours(ngram, 1)}

only_bigrams = {(pattern, count)
                 for pattern, count in model.getrightneighbours(ngram, 1, size=2)}

# we'd expect nothing besides bigrams, but ...
for pattern, count in only_bigrams:
    if not pattern.isskipgram() and len(pattern) != 2:
        print('Found a non-bigram where I should not!: ',
              pattern.tostring(classdecoder))
        break

only_ngrams = {(pattern, count)
               for pattern, count in model.getrightneighbours(
                   ngram, 1, category=colibricore.Category.NGRAM)}
# we'd expect no skipgrams, but ...
for pattern, count in only_ngrams:
    if pattern.isskipgram():
        print('Found a skipgram where I should not!',
              pattern.tostring(classdecoder))
        break

Output:

Found a non-bigram where I should not!:  ; at the same time
Found a skipgram where I should not! ; {*} their

Similar things happen for cooc methods and left neighbours.

Missing data in indexed model on large data set; yields much lower counts than unindexed model on the same data with the same parameters!

As reported by Pavel Vondřička, something fishy is going on in the computation of an indexed model on a large dataset (8.5GB compressed):

Indexed:

$ colibri-patternmodeller -l 1 -t 1 -f gigacorpus.colibri.dat                                                        
Loading corpus data...
Training model on  gigacorpus.colibri.dat
Training patternmodel, occurrence threshold: 1
Counting *all* n-grams (occurrence threshold=1)
 Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
Sorting all indices...

Unindexed (these are the correct):

$ colibri-patternmodeller -u -l 1 -t 1 -f gigacorpus.colibri.dat
Training unindexed model on  gigacorpus.colibri.dat
Training patternmodel, occurrence threshold: 1
Counting *all* n-grams (occurrence threshold=1)
 Found 11459477 ngrams... computing total word types prior to pruning...11459477...pruned 0...total kept: 11459477

The encoded corpus file has been verified to be fine (i.e. it decodes properly):

yes, I tried decoding the corpus back and it had a different size, but there was the whole contents - it seems that just some (white)spaces got lost, which is understandable. Anyway, the corpus wasn’t clipped.

I did some tests and the problem does NOT reproduce on a small text (counts are equal there as expected), which also explains why it isn't caught by our automated tests. So the cause is not yet clear and further debugging is needed.

Wrong threshold in model.filter

Hello!
In this command, options = colibricore.PatternModelOptions(mintokens=50, maxlength=6, doskipgrams=True), I set mintokens=50. But then I tried to extract skipgrams with the command self.model.filter(0, colibricore.Category.SKIPGRAM).
Results look like the threshold was 100 (I don't see any skipgram with an occurrence count below 100). Is it a bug or am I doing something wrong?

Can't compile on CentOS 6.6

I'm getting the following errors (gcc 4.4.7):

# pip3 install colibricore
[...]
    Bootstrapping colibri-core
    Autoconf archive found in /usr/share/aclocal/, good
    configure.ac:36: warning: AC_LANG_CONFTEST: no AC_LANG_SOURCE call detected in body
    ../../lib/autoconf/lang.m4:193: AC_LANG_CONFTEST is expanded from...
    ../../lib/autoconf/general.m4:2661: _AC_LINK_IFELSE is expanded from...
    ../../lib/autoconf/general.m4:2678: AC_LINK_IFELSE is expanded from...
    /usr/share/aclocal/libtool.m4:1022: _LT_SYS_MODULE_PATH_AIX is expanded from...
    /usr/share/aclocal/libtool.m4:4161: _LT_LINKER_SHLIBS is expanded from...
    /usr/share/aclocal/libtool.m4:5236: _LT_LANG_C_CONFIG is expanded from...
    /usr/share/aclocal/libtool.m4:138: _LT_SETUP is expanded from...
    /usr/share/aclocal/libtool.m4:67: LT_INIT is expanded from...
    configure.ac:36: the top level
[...]
    libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I.. -I../include -Wall -O3 -g -O2 -std=gnu++0x -MT pattern.lo -MD -MP -MF .deps/pattern.Tpo -c pattern.cpp  -fPIC -DPIC -o .libs/pattern.o
    In file included from ../include/patternstore.h:19,
                     from pattern.cpp:2:
    ../include/datatypes.h: In member function `std::string IndexReference::tostring() const':
    ../include/datatypes.h:73: error: call of overloaded `to_string(uint32_t)' is ambiguous
    /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
    /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note:                 std::string std::to_string(long long unsigned int)
    /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note:                 std::string std::to_string(long double)
    ../include/datatypes.h:73: error: call of overloaded `to_string(unsigned int)' is ambiguous
    /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2604: note: candidates are: std::string std::to_string(long long int)
    /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2610: note:                 std::string std::to_string(long long unsigned int)
    /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:2616: note:                 std::string std::to_string(long double)
    ../include/datatypes.h: In member function `void IndexedData::shrink_to_fit()':
    ../include/datatypes.h:151: error: `class std::vector<IndexReference, std::allocator<IndexReference> >' has no member named `shrink_to_fit'
    In file included from pattern.cpp:2:
    ../include/patternstore.h: In member function `void PatternSet<ReadWriteSizeType>::reserve(size_t)':
    ../include/patternstore.h:704: error: `class t_patternset' has no member named `reserve'
    pattern.cpp: In member function `const bool PatternPointer::unknown() const':
    pattern.cpp:408: warning: comparison between signed and unsigned integer expressions
    pattern.cpp: In constructor `Pattern::Pattern(std::istream*, bool, unsigned char, const unsigned char*, bool)':
    pattern.cpp:528: warning: comparison between signed and unsigned integer expressions
    make[2]: *** [pattern.lo] Error 1
    make[2]: Leaving directory `/home/avcrane1/src/colibri-core/tmp/pip-build-1al7wmwo/colibricore/src'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `/home/avcrane1/src/colibri-core/tmp/pip-build-1al7wmwo/colibricore'
    make: *** [all] Error 2
    Make of colibri-core failed

Load corpora with mmap

Would it be possible to load corpora with mmap? This would make it possible to work with corpora larger than the available RAM, and is much more efficient if only a small part of a file is going to be used anyway.

Discrepancy between totaloccurrencesingroup and patterns in getreverseindex

I'm training a 4-gram skipgram model with

MINTOKENS = MINTOKENS_SKIPGRAMS = 2
MINTOKENS_UNIGRAMS = 3
MINLENGTH = 3
MAXLENGTH = 4
DOREVERSEINDEX = true
DOSKIPGRAMS_EXHAUSTIVE = true

with these numbers reported for the pattern model:

                                 PATTERNS         TOKENS       COVERAGE          TYPES
Total:                                  -     1537297768              -        2425337
Uncovered:                              -              0         0.0000        1718067
Covered:                        273998512     1537297768         1.0000         707270

       CATEGORY      N (SIZE)        PATTERNS          TYPES    OCCURRENCES
            all            all      273998512         707270     3593418773
            all              2       16652489         707269      712369750
            all              3       75300479         582876     1205518415
            all              4      182045544         495923     1675530608
         n-gram            all      136902720         707269     1658562277
         n-gram              2       16652489         707269      712369750
         n-gram              3       52408087         580582      571966995
         n-gram              4       67842144         495586      374225532
       skipgram            all      137095792         553853     1934856496
       skipgram              3       22892392         553853      633551420
       skipgram              4      114203400         495923     1301305076

trainPatternModel.totaloccurrencesingroup(0,4) reports there are 1675530608
patterns of length 4, whereas I get between 1904680000 and 1904700000 patterns (the exact number is not reported by my code) with

for(IndexedCorpus::iterator iter = indexedCorpus->begin(); iter != indexedCorpus->end(); ++iter)
        {
            for(PatternPointer patternp : trainPatternModel.getreverseindex(iter.index(), 0, 0, 4))
            { ...

This is a difference of 13.7%.

So what is the right way to get the number of patterns, after pruning and thresholding, irrespective of the pattern type?

tokens/coverage results not split out per n category?

Running colibri-coverage the following results were obtained:

REPORT
----------------------------------
                                 PATTERNS         TOKENS       COVERAGE          TYPES
Total:                                  -         359381              -          22682
Uncovered:                              -         292470         0.8138          19710
Covered:                            10505          66911         0.1862           2972

       CATEGORY      N (SIZE)        PATTERNS         TOKENS       COVERAGE          TYPES    OCCURRENCES
            all            all          10505          66911         0.1862           2972          94966
            all              1           2972          66911         0.1862           2972          66911
            all              2           5691          66911         0.1862           1377          24638
            all              3           1651          66911         0.1862            548           3165
            all              4            176          66911         0.1862            177            231
            all              5             13          66911         0.1862             35             17
            all              6              2          66911         0.1862              9              4
         n-gram            all          10505          66911         0.1862           2972          94966
         n-gram              1           2972          66911         0.1862           2972          66911
         n-gram              2           5691          66911         0.1862           1377          24638
         n-gram              3           1651          66911         0.1862            548           3165
         n-gram              4            176          66911         0.1862            177            231
         n-gram              5             13          66911         0.1862             35             17
         n-gram              6              2          66911         0.1862              9              4

Legend:
 - PATTERNS    : The number of distinct patterns within the group
 - TOKENS      : The number of tokens that is covered by the patterns in the group.
 - COVERAGE    : The number of tokens covered, as a fraction of the total in the corpus
 - TYPES       : The number of unique *word/unigram* types in this group
 - OCCURRENCES : The total number of occurrences of the patterns in this group

Tokens and coverage are the same for all n categories; this looks wrong.

Unable to load large corpora into memory because PatternPointer length can't exceed 2^32 bytes (32 bit size descriptor)

Whilst fine in most situations, this doesn't work for IndexedCorpus, which loads an entire corpus into one PatternPointer. This prevents loading very large corpora (continuation of #41):

Loading corpus data...
Loaded 307725534 sentences; corpussize (bytes) = 9157735203
ERROR: Pattern too long for pattern pointer [9157735203 bytes,explicit]
terminate called after throwing an instance of 'InternalError'
  what():  Colibri internal error

Simply setting the size descriptor to a 64-bit integer would waste too much memory in most other situations, so that isn't an option either. I think we need a more flexible solution through templating.

Problems compiling with anaconda

I had two minor issues while building from source:

  1. First, the installation aborted with the following error:
libtool: Version mismatch error.  This is libtool 2.4.6, but the
libtool: definition of this LT_INIT comes from libtool 2.4.6.42-b88ce.
libtool: You should recreate aclocal.m4 with macros from libtool 2.4.6
libtool: and run autoconf again.
make[2]: *** [Makefile:798: SpookyV2.lo] Error 63
make[2]: Leaving directory '/home/marco/PycharmProjects/colibri-core/src'
make[1]: *** [Makefile:466: all-recursive] Error 1
make[1]: Leaving directory '/home/marco/PycharmProjects/colibri-core'
make: *** [Makefile:375: all] Error 2
Make of colibri-core failed

I solved this error as suggested, by recreating aclocal.m4 using autoreconf --force --install.

  2. Afterwards, the compilation aborted again with the following error:
/home/marco/anaconda3/envs/MedInf/compiler_compat/ld: build/temp.linux-x86_64-3.7/colibricore_wrapper.o: unable to initialize decompress status for section .debug_info
build/temp.linux-x86_64-3.7/colibricore_wrapper.o: file not recognized: file format not recognized
collect2: error: ld returned 1 exit status

I solved this problem with a strange workaround: giving conda's ld another name so that the system-wide ld was used.

I'm not sure if you are in a position to solve those problems, but I leave this here so that maybe I can save others some time.

Error with Tibetan Unicode

I'm working on a Tibetan language corpus and I get the following error message with the patternmodeller:

Loading pattern model legya.colibri.dat as model...
File is not a colibri model file (or a very old one)
terminate called after throwing an instance of 'InternalError'
what(): Colibri internal error

The command was:

colibri-patternmodeller -i legya.colibri.dat -t 10 -l 20 -T 3 -o legya.colibri.indexedpatternmodel

classdecode spat out the Unicode without complaining; idem for the script colibri-ngrams...

Here's the file:
legya.txt

Support initial and final skips in skipgram training? [vote if desired]

Currently, Colibri Core only extracts skipgrams in which the skip is not at an initial or final position, but in the middle. For example, patterns like x {*} and {*} x are never extracted, only x {*} y. This is to keep complexity down; to find patterns left and right of x, you are better off using the neighbour relations of an indexed pattern model.

Vote for this feature here if you do want support for initial and final skips.

skipgram training adds strange ngrams that do not exist

$ colibri-patternmodeller -c input.colibri.cls -f input.colibri.dat -o input.colibri.patternmodel -t 1 -l 4 -m 4 -u -P | cut -f1 > ngrams
$ colibri-patternmodeller -c input.colibri.cls -f input.colibri.dat -o input.colibri.patternmodel -t 1 -l 4 -m 4 -u -s  -P | cut -f1 > ngramsskipgrams 
$ cat ngramsskipgrams | grep -v "{*}" > ngramsskipgrams.filtered
$ wc -l ngrams*
    3001 ngrams
    11339 ngramsskipgrams
    5070 ngramsskipgrams.filtered

Example upon inspection of data:

existing good ngram: 10 December 2007 imposing
additional bad ngram: 10 December Other imposing
