yaledhlab / intertext



Detect and visualize text reuse

Home Page: https://lab-apps.s3-us-west-2.amazonaws.com/intertext/redesign/index.html

Python 99.76% Shell 0.24%

data-visualization minhash text-mining web-app

intertext's Issues

Correct XML Parsing

The final string extraction method used to store substrings in Mongo doesn't parse XML tags, but it should
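
One reading of this issue is that markup should be dropped before the substring is stored. A minimal sketch of that idea, using only the standard library, might look like the following. The function name `strip_xml_tags` is hypothetical and is not part of intertext; the sketch also assumes the input is a well-formed XML fragment with a single root element.

```python
import xml.etree.ElementTree as ET

def strip_xml_tags(xml_string):
    """Return the text content of a well-formed XML string, without tags.

    Hypothetical helper for illustration; not an intertext function.
    """
    root = ET.fromstring(xml_string)
    # itertext() yields only the text nodes, in document order
    text = ''.join(root.itertext())
    # collapse the whitespace runs that removed markup can leave behind
    return ' '.join(text.split())

print(strip_xml_tags('<text><l>Full many a <hi>gem</hi> of purest ray serene</l></text>'))
```

Input that is not well-formed would raise `xml.etree.ElementTree.ParseError`, so a real fix would need to decide how to handle such files.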

Add log scale boolean to scatterplot view

for the x axis, as the "sum" distributions tend to be of the "runaway" variety

I believe the experience of ARTFL was that it was useful to have one or more texts (the Bible, lists of commonplace expressions) declared as ignorable, to prevent uninteresting hits from cluttering up the results. Possibly one could declare such status in the metadata JSON:

"ignorethistext":true,

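A rough sketch of how such a flag could be honored at ingest time follows. It assumes the metadata is a mapping from file basename to a dictionary of attributes, which is an assumption about the format made for illustration; the helper name `drop_ignored` is hypothetical.

```python
import os

def drop_ignored(infiles, metadata):
    """Return the input files whose metadata does not flag them as ignorable.

    Assumes metadata maps a file's basename to a dict of attributes;
    this layout is an assumption made for illustration.
    """
    kept = []
    for path in infiles:
        entry = metadata.get(os.path.basename(path), {})
        # files without the flag are kept by default
        if not entry.get('ignorethistext', False):
            kept.append(path)
    return kept

metadata = {
    'bible.txt': {'author': 'Various', 'ignorethistext': True},
    'gray.txt': {'author': 'Thomas Gray'},
}
print(drop_ignored(['texts/bible.txt', 'texts/gray.txt'], metadata))
```

Dropping the text entirely is only one option. An alternative would be to keep the ignorable text in the corpus and discard any passage that also matches it, which is closer to suppressing commonplace expressions.
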
Add brush listeners to scatterplot

Create scatterplot logic

UI/UX Updates

Clarify UI of the sort dropdown:
- Add "Sort by" in the same type style as the word "Similarity"
- Place it to the left of the sort dropdown, which you can move over to make space
Clarifying text for previous use / later use:
- In these mocks, I suggest moving the results text down to its own line, to leave space for more descriptive language about the results and what users are seeing: Previous use - "There are 15064 passages of preceding works (left) that share textual similarity with the writing of Thomas Gray (right)." || Later use - "There are 15064 passages of subsequent works (right) that share textual similarity with the writing of Thomas Gray (left)."

Add a sample job submission file for SLURM

To help those on the Yale campus make use of the available supercomputing resources

Add read action placeholder

Update documentation

given api rewrite

Implement data processing logic

Add visualize action

Allow users to disregard same-author matches

Declare all PropTypes

Provide hooks for fetching user-provided author images from public urls

destination should be /assets/images/authors/shakespeare.jpg
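
A minimal sketch of such a hook, using only the standard library, is given below. The function names, the slug rule (lowercase, spaces to hyphens), and the relative `assets/images/authors` root are assumptions made for illustration; only the destination pattern comes from this issue.

```python
import os
import shutil
import urllib.parse
import urllib.request

def author_image_path(author, url, root='assets/images/authors'):
    """Build the destination path, e.g. assets/images/authors/shakespeare.jpg.

    Hypothetical helper; the slug rule and default root are assumptions.
    """
    # reuse the extension from the URL path, falling back to .jpg
    ext = os.path.splitext(urllib.parse.urlparse(url).path)[1].lower() or '.jpg'
    slug = author.strip().lower().replace(' ', '-')
    return os.path.join(root, slug + ext)

def fetch_author_image(author, url, root='assets/images/authors'):
    """Download the image at a public url and return the path it was saved to."""
    dest = author_image_path(author, url, root)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return dest

print(author_image_path('Shakespeare', 'https://example.com/portraits/ws.jpg'))
```

The example URL is a placeholder. A real hook would presumably also want to check the response's content type before writing the file.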

Error "File is not a zip file"

When running intertext --infiles "[...]/*.txt", a BadZipFile error occurs. It looks like something is going wrong on the package side; perhaps it is not downloading the right files?

Terminal output:

(intertext2) [...] % intertext --infiles "analysis/data/*.txt"       
                        
[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/vectorizedMinHash/bias_coef.npy
 * fetching client version 0.0.1a
Traceback (most recent call last):
  File "[...]/opt/anaconda3/envs/intertext2/bin/intertext", line 8, in <module>
    sys.exit(parse())
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/intertext/intertext.py", line 142, in parse
    download_client(**config)
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/intertext/intertext.py", line 164, in download_client
    with zipfile.ZipFile(zip_location, 'r') as z:
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/zipfile.py", line 1258, in __init__
    self._RealGetContents()
  File "/[...]/opt/anaconda3/envs/intertext2/lib/python3.7/zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Help is much appreciated!
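
One way to narrow this down is to check what the downloaded file actually contains before opening it as an archive. The sketch below is not the code in `download_client`; `safe_extract` is a hypothetical helper that uses the standard library's `zipfile.is_zipfile` to fail early and show the first bytes of the file, which usually reveals whether the download was an HTML error page or an empty file.

```python
import os
import tempfile
import zipfile

def safe_extract(zip_location, out_dir):
    """Extract zip_location into out_dir, failing early with a readable message.

    Hypothetical helper for illustration; not part of intertext.
    """
    if not zipfile.is_zipfile(zip_location):
        # show the first bytes so an HTML error page or empty file is obvious
        with open(zip_location, 'rb') as f:
            head = f.read(80)
        raise ValueError('{} is not a zip archive; it starts with {!r}'.format(zip_location, head))
    with zipfile.ZipFile(zip_location, 'r') as z:
        z.extractall(out_dir)
    return out_dir

# Demo: a file that holds an error page instead of an archive
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, 'client.zip')
    with open(bad, 'wb') as f:
        f.write(b'<html>404 Not Found</html>')
    try:
        safe_extract(bad, tmp)
    except ValueError as err:
        print(err)
```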

User Guide Layout

Hello:

Attached you will find the comp of the User Guide page. Once the interface updates from the other issue are implemented, I will need to update the screenshots for this page.

Let me know if you want to meet to talk through the comp.

All best, Monica

Add feature to reduce comparisons to one-against-all

For example, a user has a query document and a reference corpus and wants only the one-against-all results, not the all-against-all results.
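
The difference in the number of comparisons can be sketched as follows. This enumerates file pairs naively and says nothing about how the restriction would plug into the minhash-based matching; the function name `comparison_pairs` and its `query_files` parameter are hypothetical.

```python
from itertools import combinations

def comparison_pairs(corpus_files, query_files=None):
    """Return the (a, b) file pairs to compare.

    With query_files, pair each query only with the non-query files
    (one-against-all); otherwise pair every file with every other file.
    Hypothetical helper for illustration.
    """
    if query_files:
        queries = set(query_files)
        return [(q, f) for q in query_files for f in corpus_files if f not in queries]
    return list(combinations(corpus_files, 2))

corpus = ['a.txt', 'b.txt', 'c.txt']
print(len(comparison_pairs(corpus + ['query.txt'])))  # all-against-all
print(comparison_pairs(corpus, ['query.txt']))        # one-against-all
```

For a corpus of n reference files and one query, this reduces the pair count from n(n+1)/2 to n.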

Add compare action

Citation

Thanks for developing this software! Any suggestion for how to cite this repo (Bibtex-style)?

Consider nonconsumptive for corpus ingest/management

Hi guys, hope you don't mind if I spam you with a product placement.

I've been working on a complete rewrite of the tokenization and corpus management parts of bookworm into a standalone package called nonconsumptive. https://github.com/bmschmidt/nonconsumptive.

The most important part is a tokenization rewrite that probably doesn't matter for this project. But it also has a reasonably decent corpus ingest function. I thought of it for this project looking at the ingest files @pleonard212 put up for the Yale speeches. The goal there is to pull any of a variety of input formats (mallet-like, ndjson, keyed filenames) into a compressed, mem-mappable single file suitable for all sorts of stuff. I'm using it for scatterplots and bookworms, but I think it would also suit your use case here.

Can't remember how much I've waxed to @duhaime already about the parquet/feather ecosystem, but it's a hugely useful paradigm.

In the immediate term, the prime benefit for you would be that you'd be able to support ingest from a CSV file as well as the JSON format you currently have defined. Great, I guess. But on your side and mine, the real benefit would be that corpora built for this would be no-cost transportable to other systems, and that I could easily drop some intertext hooks into any previously built bookworms. Since the parquet/feather formats are quite portable, at the scale you're working at it might be possible to bundle it into some kind of upload service.

Let me know if you want to talk more.
