yaledhlab / intertext
Detect and visualize text reuse
Home Page: https://lab-apps.s3-us-west-2.amazonaws.com/intertext/redesign/index.html
The final string extraction method used to store substrings in Mongo doesn't parse XML tags, but it should
for the x axis, as the "sum" distributions tend to be of the "runaway" variety
I believe the experience of ARTFL was that there was utility in having one or more texts (the Bible, lists of commonplace expressions) declared as ignorable, to prevent uninteresting hits from cluttering up the results. Possibly one could declare such status in the metadata JSON:
"ignorethistext":true,
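As a rough illustration of how such a flag might be consumed, here is a minimal sketch. It assumes the metadata JSON maps each filename to an object of fields; that layout and the helper name split_ignorable are assumptions for illustration, not intertext's actual code. Only the "ignorethistext" key comes from the suggestion above.

```python
def split_ignorable(metadata):
    """Split filenames into (keep, ignored) based on an 'ignorethistext' flag.

    `metadata` is assumed to map filename -> dict of fields (hypothetical layout).
    Texts without the flag are kept by default.
    """
    keep, ignored = [], []
    for filename, fields in metadata.items():
        if fields.get("ignorethistext", False):
            ignored.append(filename)
        else:
            keep.append(filename)
    return keep, ignored


if __name__ == "__main__":
    sample = {
        "bible.txt": {"title": "Bible", "ignorethistext": True},
        "gray.txt": {"title": "Elegy"},
    }
    print(split_ignorable(sample))
```

Defaulting to "keep" means existing metadata files without the key would behave exactly as before.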
Clarify UI of the sort dropdown:
- Add "Sort by" in the same type style as the word "Similarity"
- Place it to the left of the sort dropdown, which you can move over to make space
Clarifying text for previous use / later use:
- In these mocks, I suggest moving down the results text to its own line in order to have some space for more descriptive language with regards to the results and what users are seeing: Previous use - "There are 15064 passages of preceding works (left) that share textual similarity with the writing of Thomas Gray (right)." || Later use - "There are 15064 passages of subsequent works (right) that share textual similarity with the writing of Thomas Gray (left)."
To help offer those on Yale campus supercompute opportunities
given api rewrite
destination should be /assets/images/authors/shakespeare.jpg
When running intertext --infiles "[...]/*.txt", a BadZipFile error occurs. It looks like something is going wrong on the package side; perhaps the right files are not being downloaded?
Terminal output:
(intertext2) [...] % intertext --infiles "analysis/data/*.txt"
[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/vectorizedMinHash/bias_coef.npy
* fetching client version 0.0.1a
Traceback (most recent call last):
  File "[...]/opt/anaconda3/envs/intertext2/bin/intertext", line 8, in <module>
    sys.exit(parse())
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/intertext/intertext.py", line 142, in parse
    download_client(**config)
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/intertext/intertext.py", line 164, in download_client
    with zipfile.ZipFile(zip_location, 'r') as z:
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/zipfile.py", line 1258, in __init__
    self._RealGetContents()
  File "/[...]/opt/anaconda3/envs/intertext2/lib/python3.7/zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Help is much appreciated!
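One way to narrow this down is to inspect the file that was saved at zip_location before it is opened. The sketch below is a diagnostic under stated assumptions, not intertext's implementation: describe_download is a hypothetical helper, and the guess that a failed download may have saved an HTML or XML error page instead of an archive is only one possible cause. zipfile.is_zipfile is part of the Python standard library.

```python
import zipfile
from pathlib import Path


def describe_download(zip_location):
    """Return a short label describing a file that was expected to be a zip."""
    path = Path(zip_location)
    if not path.exists():
        return "missing"
    if zipfile.is_zipfile(str(path)):
        return "zip"
    # Not a zip archive: peek at the first bytes to see what was saved instead.
    head = path.read_bytes()[:64].lstrip().lower()
    if head.startswith((b"<!doctype", b"<html", b"<?xml")):
        return "html-or-xml"
    return "unknown"


if __name__ == "__main__":
    import sys
    print(describe_download(sys.argv[1]))
```

If the label is "html-or-xml", the download step most likely received an error page rather than the client archive, which would explain why ZipFile rejects it.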
E.g., a user has a query document and a reference corpus and wants only the one-against-all results, not the all-against-all results.
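The difference between the two modes can be sketched in terms of which file pairs get compared. The function names below are hypothetical and say nothing about how intertext enumerates candidates internally; the point is only that one-against-all grows linearly with the corpus while all-against-all grows quadratically.

```python
from itertools import combinations


def all_against_all(paths):
    """Every unordered pair of texts: n * (n - 1) / 2 comparisons."""
    return list(combinations(paths, 2))


def one_against_all(query, reference_paths):
    """Only pairs that include the query text: at most n comparisons."""
    return [(query, ref) for ref in reference_paths if ref != query]


if __name__ == "__main__":
    corpus = ["a.txt", "b.txt", "c.txt", "d.txt"]
    print(len(all_against_all(corpus)))
    print(one_against_all("query.txt", corpus))
```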
Thanks for developing this software! Any suggestion for how to cite this repo (BibTeX-style)?
Hi guys, hope you don't mind if I spam you with a product placement.
I've been working on a complete rewrite of the tokenization and corpus management parts of bookworm into a standalone package called nonconsumptive: https://github.com/bmschmidt/nonconsumptive.
The most important part is a tokenization rewrite that probably doesn't matter for this project. But it also has a reasonably decent corpus ingest function. I thought of it for this project looking at the ingest files @pleonard212 put up for the Yale speeches. The goal there is to pull any of a variety of input formats (mallet-like, ndjson, keyed filenames) into a compressed, mem-mappable single file suitable for all sorts of stuff. I'm using it for scatterplots and bookworms, but I think it would also suit your use case here.
Can't remember how much I've waxed to @duhaime already about the parquet/feather ecosystem, but it's a hugely useful paradigm.
In the immediate term, the prime benefit for you would be that you'd be able to support ingest from a CSV file as well as the JSON format you currently have defined. Great, I guess. But on your side and mine, the real benefit would be that corpora built for this would be no-cost transportable to other systems, and that I could easily drop some intertext hooks into any previously built bookworms. Since the parquet/feather formats are quite portable, at the scale you're working at it might be possible to bundle it into some kind of upload service.
Let me know if you want to talk more.
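To make the CSV ingest idea concrete, here is a minimal standard-library sketch that turns CSV metadata rows into a JSON-style mapping keyed by filename. It does not use nonconsumptive or the parquet/feather formats; the function name csv_to_metadata, the "filename" key column, and the filename-to-fields layout are all assumptions for illustration.

```python
import csv
import io
import json


def csv_to_metadata(csv_text, key_column="filename"):
    """Convert CSV rows into {filename: {field: value}}, dropping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    metadata = {}
    for row in reader:
        filename = row.pop(key_column)
        metadata[filename] = {k: v for k, v in row.items() if v not in (None, "")}
    return metadata


if __name__ == "__main__":
    sample = "filename,author,title\ngray.txt,Thomas Gray,Elegy\nanon.txt,,Untitled\n"
    print(json.dumps(csv_to_metadata(sample), indent=2))
```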