Contributors: clovis, river974, valerie-hanoka

text-pair's Issues

Increase max gap between matching ngrams as match grows

Proposed logic:
We add minimum_matching_ngrams to max_gap once we reach multiples of minimum_matching_ngrams. For instance, if minimum_matching_ngrams is 4 and max_gap is 15:
if matching_ngrams == 4, max_gap += 4
if matching_ngrams == 8, max_gap += 4
if matching_ngrams == 12, max_gap += 4
....
until we reach our window size (actually just before), at which point we freeze max_gap. In the above example, at matching_ngrams == 12, max_gap == 27, which would be the max value if window_size is 30 (the next increment, to 31, would exceed it).
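
The proposed logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `adjusted_max_gap` is hypothetical, and "just before the window size" is read here as "the gap must stay strictly below window_size".

```python
def adjusted_max_gap(matching_ngrams, minimum_matching_ngrams, max_gap, window_size):
    """Return the gap allowed once `matching_ngrams` ngrams have matched.

    Hypothetical helper: names follow the issue text, not the code base.
    """
    gap = max_gap
    # One increment per completed multiple of minimum_matching_ngrams.
    for _ in range(matching_ngrams // minimum_matching_ngrams):
        # Assumption: freeze as soon as the next increment would reach
        # or exceed the window size.
        if gap + minimum_matching_ngrams >= window_size:
            break
        gap += minimum_matching_ngrams
    return gap


# Values from the example in the issue: minimum 4, initial gap 15, window 30.
gaps = [adjusted_max_gap(n, 4, 15, 30) for n in (3, 4, 8, 12, 16)]
```

With these inputs the gap grows in steps of 4 and stays at 27 from 12 matching ngrams onward.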

Philologic PlainTextParser.py causing error with longer words

Hello, I have been using text-pair to generate text reuse for some EEBO-TCP files.

With some files I got the following error:


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/philologic/loadtime/Loader.py", line 532, in parse_file
    parser.parse(input_file)
  File "/usr/local/lib/python3.10/dist-packages/philologic/loadtime/PlainTextParser.py", line 118, in parse
    print(f"Long word in {input.name}: {word}", file=sys.stderr)
AttributeError: 'builtin_function_or_method' object has no attribute 'name'

I believe this is an error in the line

print(f"Long word in {input.name}: {word}", file=sys.stderr)

in the script PlainTextParser.py.

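input.name should be input_file.name. As the traceback shows, input here resolves to Python's built-in input function, which has no name attribute.

After I make this change to PlainTextParser.py in the Docker instance, the load runs successfully, so it seems to fix the problem.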
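
The diagnosis can be checked in isolation. This is a minimal sketch: `report_long_word` is a hypothetical helper (not part of PhiloLogic) that mirrors the corrected print call, and the file object is simulated with a `SimpleNamespace`.

```python
import sys
from types import SimpleNamespace


def report_long_word(input_file, word, stream=sys.stderr):
    """Hypothetical helper mirroring the corrected line: uses input_file.name."""
    print(f"Long word in {input_file.name}: {word}", file=stream)


# The built-in `input` function has no `name` attribute, which is why
# `input.name` raises AttributeError inside the f-string.
builtin_input_has_name = hasattr(input, "name")

# Stand-in for an open file object; only the `name` attribute is needed here.
fake_file = SimpleNamespace(name="example.txt")
report_long_word(fake_file, "a" * 12)
```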

Document metadata.json format

Just a quick thought -- having a minimal example of metadata.json could help folks who aren't in a Philo4 environment load in their texts...

Memory leak in compareNgrams

This is mostly visible on very large alignments (e.g. EEBO+ECCO), where even batching does not reduce memory pressure.

Visualize reuses from a single document

We have the position of alignments in percentages in documents (start/end position). Why not build a visualization that would show the location of those alignments within a single document?
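
As a rough idea of what such a view could look like, here is a text-only sketch. It assumes each alignment is available as a (start, end) pair of percentages between 0 and 100; the function name `alignment_strip` and the input shape are hypothetical, not taken from the code base.

```python
import math


def alignment_strip(spans, width=50):
    """Render a document as `width` cells, marking cells covered by a reuse.

    `spans` is an iterable of (start_percent, end_percent) pairs (assumed shape).
    """
    cells = ["."] * width
    for start, end in spans:
        # Map percentages onto cell indices, clamped to the strip.
        first = min(width - 1, int(start * width / 100))
        last = min(width - 1, max(first, math.ceil(end * width / 100) - 1))
        for i in range(first, last + 1):
            cells[i] = "#"
    return "".join(cells)


# Two reuses: one at the start of the document, one just past the middle.
strip = alignment_strip([(0, 10), (50, 60)], width=10)
```

A graphical version would map the same percentages to positions along a bar or page-length axis.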

Close connection properly

Connections are currently closed implicitly by a with statement. This does not work with a connection pool, so the connection should be opened without a with statement.
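
One way to picture the explicit pattern is below. This is only a sketch under assumptions: `TinyPool` is a hypothetical stand-in for whatever pool the project uses, and sqlite3 is used solely so the example is self-contained. The connection is acquired without a with statement and returned to the pool in a finally clause.

```python
import queue
import sqlite3


class TinyPool:
    """Hypothetical minimal pool, for illustration only."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    def acquire(self):
        return self._idle.get()

    def release(self, conn):
        self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()


def run_query(pool, sql):
    # Acquire explicitly instead of relying on a with statement.
    conn = pool.acquire()
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Always hand the connection back, even if the query fails.
        pool.release(conn)


pool = TinyPool(2)
rows = run_query(pool, "SELECT 1")
```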

No module named 'text_preprocessing'

Hello!
I'm trying to run the textalign command right after successfully running the install.sh script, but it fails with this error message:

$ textalign 
Traceback (most recent call last):
  File "/usr/local/bin/textalign", line 12, in <module>
    from textalign import TEIParser, Ngrams, create_web_app, web_loader, parse_config
  File "/usr/local/lib/python3.5/dist-packages/textalign/__init__.py", line 2, in <module>
    from .generate_ngrams import Ngrams
  File "/usr/local/lib/python3.5/dist-packages/textalign/generate_ngrams.py", line 19, in <module>
    from text_preprocessing import PreProcessor, modernize
ImportError: No module named 'text_preprocessing'

It seems that the text_preprocessing module was not included in the textalign Python package during my installation.
Do you know how to fix this error?
Thank you for your time and have a great day!
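
A quick way to confirm which interpreter is missing the module is to ask Python directly. This is a generic diagnostic sketch using the standard library, not a fix; the helper name `module_available` is hypothetical.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter is running and whether it can see the module.
print(sys.executable)
print("text_preprocessing available:", module_available("text_preprocessing"))
```

Running this with the same Python that /usr/local/bin/textalign uses shows whether the dependency is visible to that interpreter.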

XML_parser

  • The function works well on Linux, but the metadata path has a problem on Windows (even though the PATH module is used).
  • The metadata file contains an attribute ("option":"xmltruc(blabla)") that the Philo4 ones did not contain, but this is not very serious.
  • It also seems to me that under Philo4 the first text file was 1, whereas here it is 0.
