artfl-project / text-pair
High-performance text aligner for large collections of texts
License: GNU General Public License v3.0
include cython in setup_requires
Proposed logic:
We add minimum_matching_ngrams to max_gap once we reach multiples of minimum_matching_ngrams. For instance, if minimum_matching_ngrams is 4 and max_gap is 15:
if matching_ngrams == 4, max_gap += 4
if matching_ngrams == 8, max_gap += 4
if matching_ngrams == 12, max_gap += 4
....
until max_gap is about to reach our window size, at which point we freeze it. In the above example (minimum_matching_ngrams == 4, initial max_gap == 15), max_gap would top out at 27, which is the maximum value if window_size is 30.
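A minimal sketch of the proposed logic, assuming that max_gap must stay strictly below window_size. The function name adjusted_max_gap and its signature are hypothetical and are not taken from the text-pair code base.

```python
def adjusted_max_gap(matching_ngrams, base_max_gap, minimum_matching_ngrams, window_size):
    """Grow max_gap by minimum_matching_ngrams at each multiple reached.

    Hypothetical helper: growth stops as soon as another increment would
    reach or exceed window_size (assumed reading of "just before").
    """
    max_gap = base_max_gap
    # Number of multiples of minimum_matching_ngrams reached so far.
    steps = matching_ngrams // minimum_matching_ngrams
    for _ in range(steps):
        if max_gap + minimum_matching_ngrams >= window_size:
            break  # freeze max_gap just before the window size
        max_gap += minimum_matching_ngrams
    return max_gap


# Values from the example above: minimum_matching_ngrams=4, max_gap=15, window_size=30.
for count in (3, 4, 8, 12, 16):
    print(count, adjusted_max_gap(count, 15, 4, 30))
```

With these inputs the gap grows 15, 19, 23, 27 and then stays at 27, since one more increment would exceed the window.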
Hello, I have been using text-pair to generate text reuse for some EEBO-TCP files.
With some files I got the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/philologic/loadtime/Loader.py", line 532, in parse_file
    parser.parse(input_file)
  File "/usr/local/lib/python3.10/dist-packages/philologic/loadtime/PlainTextParser.py", line 118, in parse
    print(f"Long word in {input.name}: {word}", file=sys.stderr)
AttributeError: 'builtin_function_or_method' object has no attribute 'name'
I believe the error is in this line of the script PlainTextParser.py:
print(f"Long word in {input.name}: {word}", file=sys.stderr)
Here input.name should be input_file.name.
When I edit the PlainTextParser in the Docker instance, it runs successfully and seems to fix the problem.
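The failure can be reproduced outside PhiloLogic: input is Python's builtin function, so it has no name attribute, whereas an open file object does. The sketch below is an illustration only. report_long_word and the file name A00001.txt are hypothetical, and this is not the actual PlainTextParser code.

```python
import builtins
import io
from types import SimpleNamespace


def report_long_word(input_file, word, out):
    # Use the input_file parameter, not the builtin function input().
    print(f"Long word in {input_file.name}: {word}", file=out)


# The builtin input() is a function object without a .name attribute,
# which is what the AttributeError in the traceback reports.
builtin_has_name = hasattr(builtins.input, "name")

# Hypothetical stand-in for an open file handle with a .name attribute.
fake_file = SimpleNamespace(name="A00001.txt")
buffer = io.StringIO()
report_long_word(fake_file, "x" * 12, buffer)
print(builtin_has_name, buffer.getvalue(), end="")
```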
Just a quick thought -- having a minimal example of metadata.json could help folks who aren't in a Philo4 environment load in their texts...
This causes results to be unsorted when two works have multiple matching passages (and the metadata sorting is essentially a tie). We want to break the sorting tie with source_start_byte.
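One way to break the tie, sketched in plain Python: sort on a tuple whose second element is source_start_byte. The record layout and the source_title field are assumptions for illustration. Only source_start_byte comes from the issue, and the real results may well be sorted in SQL rather than in Python.

```python
def sort_alignments(alignments, metadata_field="source_title"):
    """Sort by a metadata field, using source_start_byte to break ties."""
    return sorted(
        alignments,
        key=lambda row: (row[metadata_field], row["source_start_byte"]),
    )


# Hypothetical rows: two passages from the same work tie on the metadata field.
rows = [
    {"source_title": "Work A", "source_start_byte": 9000},
    {"source_title": "Work B", "source_start_byte": 10},
    {"source_title": "Work A", "source_start_byte": 120},
]
ordered = sort_alignments(rows)
print([(r["source_title"], r["source_start_byte"]) for r in ordered])
```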
This is mostly visible on very large alignments (e.g. EEBO+ECCO), where even batching does not reduce memory pressure.
See here: https://github.com/kislyuk/argcomplete
We have the position of alignments in percentages in documents (start/end position). Why not build a visualization that would show the location of those alignments within a single document?
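As a rough illustration of the idea, the percentages can be mapped onto a fixed-width strip that marks where alignments fall in a document. This is a text-only sketch under assumptions: alignment_strip and the (start, end) tuple layout are hypothetical, and a real visualization would presumably be drawn in the web app.

```python
def alignment_strip(spans, width=50):
    """Render (start_pct, end_pct) spans as a strip of '.' and '#' cells."""
    cells = ["."] * width
    for start_pct, end_pct in spans:
        # Map each percentage to a cell index, clamped to the last cell.
        first = min(width - 1, int(start_pct / 100 * width))
        last = min(width - 1, int(end_pct / 100 * width))
        for i in range(first, last + 1):
            cells[i] = "#"
    return "".join(cells)


# Hypothetical alignments located at 50-100% of a document.
print(alignment_strip([(50, 100)], width=10))
```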
We currently use a with statement to implicitly close connections. This does not work with a connection pool. The connection should be opened without a with statement.
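The usual pattern with a pool is to borrow a connection, use it, and return it in a finally block instead of closing it. The sketch below uses a toy pool built on sqlite3 and queue from the standard library. TinyPool and fetch_one are hypothetical names, and the database driver text-pair actually uses is not assumed here.

```python
import queue
import sqlite3


class TinyPool:
    """Toy connection pool, for illustration only."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(sqlite3.connect(database))

    def getconn(self):
        return self._idle.get()

    def putconn(self, conn):
        self._idle.put(conn)


def fetch_one(pool, sql):
    conn = pool.getconn()  # borrowed explicitly, no with statement
    try:
        return conn.execute(sql).fetchone()
    finally:
        pool.putconn(conn)  # returned to the pool, not closed


pool = TinyPool(size=1)
first = fetch_one(pool, "SELECT 1")
second = fetch_one(pool, "SELECT 2")  # reuses the same pooled connection
```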
Hello!
I'm trying to run the command textalign right after successfully running the install.sh script, but the software rewards me with this error message:
$ textalign
Traceback (most recent call last):
  File "/usr/local/bin/textalign", line 12, in <module>
    from textalign import TEIParser, Ngrams, create_web_app, web_loader, parse_config
  File "/usr/local/lib/python3.5/dist-packages/textalign/__init__.py", line 2, in <module>
    from .generate_ngrams import Ngrams
  File "/usr/local/lib/python3.5/dist-packages/textalign/generate_ngrams.py", line 19, in <module>
    from text_preprocessing import PreProcessor, modernize
ImportError: No module named 'text_preprocessing'
It seems that the module text_preprocessing has not been included in the textalign Python package during my installation.
Do you know how to fix this error?
Thank you for your time and have a great day!
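A quick way to confirm which top-level modules the interpreter cannot find, using only the standard library. This is a diagnostic sketch: missing_modules is a hypothetical helper, it does not install anything, and it should be run with the same Python interpreter that runs textalign.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# text_preprocessing is the module named in the ImportError above.
print(missing_modules(["text_preprocessing"]))
```

An empty list means the module is visible to this interpreter. If the name is listed, the dependency was not installed for this Python version.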