Comments (3)
Confirmed: this reproduces on a smaller corpus, but the cause is still unknown.
from colibri-core.
Original issue:
$ colibri-patternmodeller -f europarl250000-en-train.colibri.dat -t 2 -W 3 -m 3 -l 4 -s -o model -u && test.py
...
totaloccurrencesingroup: 18501817
(helper structure has 14657 unigrams mapping to 42319 skipgrams total)
Iterations: 21284687
Simplified (no -W, no -m), the issue still occurs:
$ colibri-patternmodeller -f europarl250000-en-train.colibri.dat -t 2 -l 4 -s -o modelnowm -u && test.py modelnowm
...
totaloccurrencesingroup: 31341064
(helper structure has 15057 unigrams mapping to 43274 skipgrams total)
Iterations: 34122432
Without skipgrams (no -s), the issue does not occur:
$ colibri-patternmodeller -f europarl250000-en-train.colibri.dat -t 2 -W 3 -m 3 -l 4 -o modelnos -u && test.py
...
totaloccurrencesingroup: 6214396
Iterations: 6214396
OK, as I suspected, this is not a bug, but it is confusing nevertheless. You get more occurrences when you loop over the corpus explicitly and use getreverseindex() because Colibri Core actively matches the n-gram under consideration against all skipgrams in the model. If the n-gram is an instantiation of a skipgram, that skipgram is returned; the n-gram does not even need to be in the model (after all, this should also work for models that contain no n-grams at all). Such an n-gram may well have fallen below the threshold during computation, so it was not considered a candidate for the skipgram at that time, which is why you end up with a higher occurrence count.
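To make the discrepancy concrete, here is a minimal toy sketch in plain Python. It does not use the Colibri Core API and is not how Colibri Core is implemented. The gap marker, the function names and the tiny corpus are all made up for illustration, and it assumes a simplified training scheme in which skipgrams are derived only from trigrams that meet the threshold.

```python
from collections import Counter

GAP = "{*}"  # hypothetical gap marker, used only in this toy example


def trigrams(tokens):
    """Return all trigrams in the token list, in corpus order."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def with_gap(trigram):
    """Turn a trigram into a skipgram by replacing its middle token with a gap."""
    return (trigram[0], GAP, trigram[2])


def is_instance(trigram, skipgram):
    """True if the trigram fills in the gaps of the skipgram."""
    return all(s == GAP or s == t for s, t in zip(skipgram, trigram))


def trained_counts(tokens, threshold):
    """Assumed training scheme: only trigrams that meet the threshold
    contribute occurrences to their skipgram."""
    skip = Counter()
    for tri, count in Counter(trigrams(tokens)).items():
        if count >= threshold:
            skip[with_gap(tri)] += count
    return Counter({s: c for s, c in skip.items() if c >= threshold})


def matched_counts(tokens, skipgrams):
    """Explicit loop over the corpus: every trigram is matched against
    every skipgram in the model, whether or not it met the threshold."""
    matched = Counter()
    for tri in trigrams(tokens):
        for s in skipgrams:
            if is_instance(tri, s):
                matched[s] += 1
    return matched


tokens = "a b c a d c a b c a e c".split()
trained = trained_counts(tokens, threshold=2)
matched = matched_counts(tokens, trained.keys())

# Total skipgram occurrences: counted during training vs. found by matching.
print(sum(trained.values()), sum(matched.values()))  # prints: 4 6
```

In this toy corpus, "a d c" and "a e c" each occur only once, so they never count towards the skipgram `a {*} c` during training, yet both match it when the corpus is scanned afterwards. That is the same effect as the higher totaloccurrencesingroup and Iterations values reported above.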