alexeyg / pysais
A Python C module wrapper around the SA-IS suffix array construction algorithm by Yuta Mori.
License: MIT License
I'm testing the package on random data:
>>> sequence = '$'
>>> sa = pysais.sais(sequence)
>>> lcp, lcp_lm, lcp_mr = pysais.lcp(sequence, sa)
Segmentation fault (core dumped)
(I also sometimes observed segmentation faults on normal data when I omitted the terminal symbol, but I understand that you strictly require it.)
I'm using Python 3.6.6.
Calling pysais.lcp(sequence, sa) crashes silently: it seems to take down Python entirely. Whether I run a script that calls lcp or work in interactive mode, Python closes altogether. The following code triggers it on my end:
import pysais
sequence = "aaabbbcccdddaaacccbbbddd"
sa = pysais.sais(sequence)
o = pysais.lcp(sequence, sa)
print(o)
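While the C extension crashes, the expected values can still be cross-checked with a slow pure-Python reference. The sketch below is not the pysais implementation: `naive_suffix_array` and `naive_lcp` are hypothetical helper names, and the LCP convention used here (entry k compares the suffixes at positions k-1 and k of the suffix array, with entry 0 set to 0) is an assumption that may differ from what pysais.lcp returns in its three arrays.

```python
def naive_suffix_array(text):
    """Sort suffix start positions by the suffix they begin (slow, for checking only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def naive_lcp(text, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        # Extend the match character by character until a mismatch or end of text.
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


# Same input as in the report above; no terminal symbol is needed here.
sequence = "aaabbbcccdddaaacccbbbddd"
sa = naive_suffix_array(sequence)
lcp = naive_lcp(sequence, sa)
print(sa)
print(lcp)
```

This is quadratic or worse in the input length, so it is only suitable for small test strings, not for real data.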
I'm trying to understand why the output of LCP is so complicated, and I found a bug in bisect:
>>> sequence = 'abc'
>>> sa = pysais.sais(sequence)
>>> lcp, lcp_lm, lcp_mr = pysais.lcp(sequence, sa)
>>> pysais.bisect(sequence, 'c', sa, lcp_lm, lcp_mr)
(2, True) # OK
>>> pysais.bisect(sequence, 'b', sa, lcp_lm, lcp_mr)
(1, True) # OK
>>> pysais.bisect(sequence, 'a', sa, lcp_lm, lcp_mr)
(1, True) # BUG
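For comparison, a plain binary search over the suffix array (without the lcp_lm / lcp_mr acceleration) gives the result I would expect for 'a'. This is a minimal sketch, not the pysais code: `reference_bisect` is a hypothetical name, and the (index, found) return convention is inferred from the session above, with the index assumed to be the leftmost matching position.

```python
def reference_bisect(text, pattern, sa):
    """Return (index, found): the leftmost suffix-array position whose suffix
    is >= pattern, and whether that suffix starts with pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first len(pattern) characters of the suffix.
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sa) and text.startswith(pattern, sa[lo])
    return lo, found


sequence = "abc"
sa = [0, 1, 2]  # suffix array of 'abc': the suffixes are already in sorted order
for pattern in ("c", "b", "a"):
    print(pattern, reference_bisect(sequence, pattern, sa))
```

With this reference, 'a' maps to position 0, which is where the suffix 'abc' sits in the suffix array.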
It would be nice if you could release this under a clear license. It would make it easier for people to use it to build a larger piece of software. It could encourage contributions, too.
Would you like to choose the MIT license (as Yuta Mori did for sais.c and sais.h)?
lcp_int returned an array with no useful data (constant zeros).
I had a look into the code and found a suspicious line:
Line 290 in bc27f42
This looks obviously wrong. Correcting it to
T = pyvector_to_Carrayptrs(T_np);
gives me better results.
Now I am wondering: am I the first one to even use this function? Might there be other glitches that no one has found due to lack of testing? Is this project still alive? It would be a pity if not, because I think it is really cool!
Best regards and thanks, Andreas
Python: python 3.5
pysais: master
from sklearn.datasets import fetch_20newsgroups
import pysais
s = '$'.join(fetch_20newsgroups().data)
sa = pysais.sais(s)
print(len(s))
print(len(sa))
expected output:
22065807
22065807
actual output:
22065807
22065930
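The difference is 123 entries. One way to narrow this down is to check whether a returned array is a valid permutation of the text positions, and whether the surplus matches the number of extra bytes the text needs when encoded as UTF-8. That second check rests on an unverified guess that the wrapper works on encoded bytes rather than on characters. Both helper names below are hypothetical, and the sample string is made up for illustration.

```python
def is_permutation(sa, n):
    """True if sa contains each index 0..n-1 exactly once."""
    return len(sa) == n and sorted(sa) == list(range(n))


def utf8_overhead(text):
    """Number of extra bytes the UTF-8 encoding needs beyond one per character."""
    return len(text.encode("utf-8")) - len(text)


# Made-up sample with two non-ASCII characters, each taking two bytes in UTF-8.
sample = "na\u00efve caf\u00e9$"
print(len(sample), utf8_overhead(sample))
print(is_permutation([2, 0, 1], 3), is_permutation([0, 1, 1], 3))
```

If utf8_overhead(s) for the newsgroups string came out as 123, that would support the byte-versus-character explanation; if not, the cause lies elsewhere.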
This issue is to let you know that I've re-used your strategy for wrapping the C code from Yuta Mori, applied to an enhanced version found here: https://github.com/kurpicz/sais-lite-lcp . This enhancement enables computing the SA and LCP arrays simultaneously, and I have yet to find a segfault. According to Wikipedia, it was also the most efficient algorithm as of 2012.
My fork is here: https://github.com/fcharras/pyfischer
I've only ported the functions that meet my usecases, but the other functions could be ported too.
See https://docs.python.org/3.1/distutils/packageindex.html and http://stackoverflow.com/questions/9411494/how-do-i-create-a-pip-installable-project for the procedure.
While we are at it, I also suggest changing the name (as in setup.py) to "pysais". pip install pysais is easier to type and remember than pip install Py-SAIS (capitalized + hyphen + allcaps).