Intertext

Detect and visualize text reuse within collections of plain text or XML documents.

Intertext uses machine learning and interactive visualizations to identify and display intertextual patterns in text collections. Text processing is based on minhashing vectorized strings, and the web viewer is built with interactive React components. [Demo]
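
The matching approach can be illustrated with a minimal minhash sketch. The following Python example is not Intertext's internal code (which relies on vectorized minhashing); it is only a small illustration, with an arbitrary shingle size and hash count, of how minhash signatures let two strings be compared by estimated Jaccard similarity:

# Minimal illustration of minhash-based similarity (not Intertext's implementation).
import hashlib

def shingles(text, k=4):
    """Return the set of k-character shingles in a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, n_hashes=64, k=4):
    """For each of n_hashes seeded hash functions, keep the smallest
    hash value observed over the string's shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text, k))
        for seed in range(n_hashes)
    ]

def estimated_similarity(a, b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    sig_a, sig_b = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

print(estimated_similarity(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
))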

App preview

Installation

To install Intertext, run the steps below:

# optional: install Anaconda and set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext

# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip

Usage

# search for intertextuality in some documents
intertext --infiles "sample_data/texts/*.txt"

# serve output
python -m http.server 8000

Then open a web browser to http://localhost:8000/output and you'll see any intertextualities the engine discovered!

CUDA Acceleration

To enable CUDA acceleration, we recommend the following steps when installing the module:

# set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext

# set up cuda and cupy
conda install cudatoolkit
conda install -c conda-forge cupy

# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip
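
Before running Intertext, it can be worth verifying that CuPy was installed with working CUDA support. This is just a quick sanity check, not part of Intertext itself:

# Quick sanity check that CuPy can see a CUDA device.
import cupy as cp

print("CUDA devices visible:", cp.cuda.runtime.getDeviceCount())

# Run a trivial computation on the GPU and bring the result back to the host.
x = cp.arange(10)
print(cp.asnumpy(x.sum()))  # expected output: 45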

Providing Metadata

To indicate the author and title of matching texts, pass the path to a metadata file to the intertext command with the --metadata flag, e.g.

intertext --infiles "sample_data/texts/*.txt" --metadata "sample_data/metadata.json"

Metadata files should be JSON files with the following format:

{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml"
  }
}
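
If your metadata lives somewhere else (a spreadsheet, a database), a short script can serialize it in this format. The sketch below simply writes a dictionary keyed by filename with the fields shown above; the records themselves are placeholders you would replace with your own:

# Sketch: write an Intertext-style metadata file from records you already have.
import json

records = {
    "a.xml": {"author": "Author A", "title": "Title A", "year": 1751,
              "url": "https://google.com?text=a.xml"},
    "b.xml": {"author": "Author B", "title": "Title B", "year": 1753,
              "url": "https://google.com?text=b.xml"},
}

with open("metadata.json", "w") as f:
    json.dump(records, f, indent=2)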

Deeplinking

If your text documents can be read on another website, you can add a url attribute to each of your files within your metadata JSON file (see example above).

If your documents are XML files and you would like to deeplink to specific pages within a reading environment, you can use the --xml_page_tag flag to designate the tag within which page breaks are identified. Additionally, you should include $PAGE_ID in the url attribute for the given file within your metadata file, e.g.

{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml&page=$PAGE_ID"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml&page=$PAGE_ID"
  }
}

If your page ids are specified within an attribute in the --xml_page_tag tag, you can specify the relevant attribute using the --xml_page_attr flag.
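
Conceptually, the deeplinks are produced by splicing each page identifier into that file's url template. The sketch below only illustrates this substitution and is not Intertext's implementation; the page identifier would come from the element named by --xml_page_tag (or its --xml_page_attr attribute):

# Conceptual sketch of $PAGE_ID substitution (not Intertext's code).
url_template = "https://google.com?text=a.xml&page=$PAGE_ID"

def deeplink(template, page_id):
    """Replace the $PAGE_ID placeholder with a page identifier."""
    return template.replace("$PAGE_ID", str(page_id))

print(deeplink(url_template, 17))  # https://google.com?text=a.xml&page=17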

Contributors

dependabot[bot], duhaime, pleonard212

Issues

UI/UX Updates

  • Clarify UI of the sort dropdown:
    - Add "Sort by" in the same type style as the word "Similarity"
    - Place it to the left of the sort dropdown, which you can move over to make space

  • Clarifying text for previous use / later use:
    - In these mocks, I suggest moving the results text down to its own line to make room for more descriptive language about the results and what users are seeing:
      - Previous use: "There are 15064 passages of preceding works (left) that share textual similarity with the writing of Thomas Gray (right)."
      - Later use: "There are 15064 passages of subsequent works (right) that share textual similarity with the writing of Thomas Gray (left)."

[Screenshot: Screen Shot 2021-02-18 at 5.04.05 PM]

Consider nonconsumptive for corpus ingest/management

Hi guys, hope you don't mind if I spam you with a product placement.

I've been working on a complete rewrite of the tokenization and corpus management parts of bookworm into a standalone package called nonconsumptive. https://github.com/bmschmidt/nonconsumptive.

The most important part is a tokenization rewrite that probably doesn't matter for this project. But it also has a reasonably decent corpus ingest function. I thought of it for this project looking at the ingest files @pleonard212 put up for the Yale speeches. The goal there is to pull any of a variety of input formats (mallet-like, ndjson, keyed filenames) into a compressed, mem-mappable single file suitable for all sorts of stuff. I'm using it for scatterplots and bookworms, but I think it would also suit your use case here.

Can't remember how much I've waxed to @duhaime already about the parquet/feather ecosystem, but it's a hugely useful paradigm.

In the immediate term, the prime benefit for you would be that you'd be able to support ingest from a CSV file as well as the JSON format you currently have defined. Great, I guess. But on your side and mine, the real benefit would be that corpora built for this would be no-cost transportable to other systems, and that I could easily drop some intertext hooks into any previously built bookworms. Since the parquet/feather formats are quite portable, for the scale you're working at it might be possible to bundle it into some kind of upload service.

Let me know if you want to talk more.

Citation

Thanks for developing this software! Any suggestions for how to cite this repo (BibTeX-style)?

Correct XML Parsing

The final string extraction method used to store substrings in Mongo doesn't parse XML tags, but it should.
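
One possible approach (only a sketch, not the project's actual fix) is to parse the document and keep just its text nodes before slicing substrings, e.g. with the standard library:

# Sketch: strip XML markup by concatenating only the parsed text nodes.
import xml.etree.ElementTree as ET

def extract_plain_text(xml_string):
    """Parse an XML string and return its text content without tags."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())

print(extract_plain_text("<p>A <hi rend='italic'>marked up</hi> passage.</p>"))
# -> "A marked up passage."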

User Guide Layout

Hello:

Attached you will find the comp of the User Guide page. Once the interface updates from the other issue are implemented, I will need to update the screenshots for this page.

Let me know if you want to meet to talk through the comp.

All best, Monica

[Attachment: DataViz-UserGuide_update1]

Error "File is not a zip file"

When running intertext --infiles "[...]/*.txt", a BadZipFile error occurs. It looks like something is going wrong on the package side, perhaps not downloading the right files?

Terminal output:

(intertext2) [...] % intertext --infiles "analysis/data/*.txt"       
                        
[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/vectorizedMinHash/bias_coef.npy
 * fetching client version 0.0.1a
Traceback (most recent call last):
  File "[...]/opt/anaconda3/envs/intertext2/bin/intertext", line 8, in <module>
    sys.exit(parse())
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/intertext/intertext.py", line 142, in parse
    download_client(**config)
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/site-packages/intertext/intertext.py", line 164, in download_client
    with zipfile.ZipFile(zip_location, 'r') as z:
  File "[...]/opt/anaconda3/envs/intertext2/lib/python3.7/zipfile.py", line 1258, in __init__
    self._RealGetContents()
  File "/[...]/opt/anaconda3/envs/intertext2/lib/python3.7/zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Help is much appreciated!

Blacklist/stoplist support

I believe the experience of ARTFL was that there was utility in having one or more texts (the Bible, lists of commonplace expressions) declared as ignorable, to prevent uninteresting hits from cluttering up the results. Possibly one could declare such status in the metadata JSON:

"ignorethistext":true,
