Giter Club home page Giter Club logo

text-pair's Introduction

TextPAIR (Pairwise Alignment for Intertextual Relations)

TextPAIR is a scalable and high-performance sequence aligner for humanities text analysis designed to identify "similar passages" in large collections of texts. These may include direct quotations, plagiarism and other forms of borrowings, commonplace expressions and the like.

The original text-pair package was written in Perl and had two versions: PhiloLine and Pair. The new TextPAIR is meant to replace both. You can find the old package here: https://code.google.com/archive/p/text-pair/

Built with support from the Mellon Foundation and the Fondation de la Maison des Sciences de l'Homme.

alt text

Installation

Note that TextPair will only run on 64 bit Linux and MacOS. Windows will NOT be supported.

Dependencies

  • Python 3.6 and up
  • Node and NPM
  • PostgreSQL: you will need to create a dedicated database and create a user with read/write permissions on that database. You will also need to create the pg_trgm extension on that database by running the following command in the PostgreSQL shell: CREATE EXTENSION pg_trgm; run as a superuser.
  • A running instance of Apache

Install script

  • Run install.sh script. This should install all needed components
  • Make sure you include /etc/text-pair/apache_wsgi.conf in your main Apache configuration file to enable searching
  • Edit /etc/text-pair/global_settings.ini to provide your PostgreSQL user, database, and password.

Quick start

Before running any alignment, make sure you edit your copy of config.ini. See below for details

The sequence aligner is executed via the textpair command.

textpair takes the following command-line arguments:

  • --config: path to the configuration file where preprocessing, matching, and web application settings are set
  • --source_files: path to source files
  • --source_metadata: path to source metadata. Only define if not using a PhiloLogic database.
  • --target_files: path to target files. Only define if not using a PhiloLogic database.
  • --target_metadata: path to target metadata
  • --is_philo_db: Define if files are from a PhiloLogic database. If set to True metadata will be fetched using the PhiloLogic metadata index. Set to False by default.
  • --output_path: path to results
  • --debug: turn on debugging
  • --workers: Set number of workers/threads to use for parsing, ngram generation, and alignment.
  • --load_web_app: Define whether to load results into a database viewable via a web application. Set to True by default.

Example:

textpair --source_files=/path/to/source/files --target_files=/path/to/target/files --config=config.ini --workers=6 --output_path=/path/to/output

Configuring the alignment

When running an alignment, you need to provide a configuration file to the textpair command. You can find a generic copy of the file in /var/lib/text-pair/config/config.ini. You should copy this file to the directory from which you are starting the alignment. Then you can start editing this file. Note that all parameters have comments explaining their role. While most values are reasonable defaults and don't require any edits, you do have to provide a value for table_name in the Web Application section at the bottom of the file.

Alignments using PhiloLogic databases

This currently uses the dev version of PhiloLogic5 to read PhiloLogic4 databases. So you'll need a version of PhiloLogic5 installed. A future version will have that functionnality baked in.

To leverage a PhiloLogic database, use the --is_philo_db flag, and point to the data/words_and_philo_ids directory of the PhiloLogic DB used. For instance, if the source DB is in /var/www/html/philologic/source_db and the target DB is in /var/www/html/philologic/target_db, run the following:

textpair --is_philo_db --source_files=/var/www/html/philologic/source_db/data/words_and_philo_ids/ --target_files=/var/www/html/philologic/target_db/data/words_and_philo_ids/ --workers=8 --config=config.ini

Note that the --is_philo_db flag assumes both source and target DBs are PhiloLogic databases.

Run comparison between preprocessed files manually

It's possible run a comparison between documents without having to regenerate ngrams. In this case you need to use the --only_align argument with the textpair command. Source files (and target files if doing a cross DB alignment) need to point to the location of generated ngrams. You will also need to point to the metadata.json file which should be found in the metadata directory found in the parent directory of your ngrams.

  • --source_files: path to source ngrams generated by textpair
  • --target_files: path to target ngrams generated by textpair. If this option is not defined, the comparison will be done between source files.
  • --source_metadata: path to source metadata, a required parameter
  • --target_metadata: path to target metadata, a required parameter if target files are defined.

Example: assuming source files are in ./source and target files in ./target:

textpair --only_align --source_files=source/ngrams --source_metadata=source/metadata/metadata.json --target_files=target/ngrams --target_metadata=target/metadata/metadata.json --workers=10 --output_path=results/

Configuring the Web Application

The textpair script automatically generates a Web Application, and does so by relying on the defaults configured in the appConfig.json file which is copied to the directory where the Web Application lives, typically /var/www/html/text-pair/database_name.

In this file, there are a number of fields that can be configured:

  • webServer: should not be changed as only Apache is supported for the foreseeable future.
  • appPath: this should match the WSGI configuration in /etc/text-pair/apache_wsgi.conf. Should not be changed without knowing how to work with mod_wsgi.
  • databaseName: Defines the name of the PostgreSQL database where the data lives.
  • databaseLabel: Name of the Web Application
  • sourceDB and targetDB both define contextual links using PhiloLogic. philoDB defines whether the contextual link should appear in results and link defines the URL of the PhiloLogic database.
  • sourceLabel and targetLabel are the names of source DB and target DB. This field supports HTML tags.
  • metadataTypes: defines the value type of field. Either TEXT or INTEGER.
  • sourceCitation and targetCitation define the bibliography citation in results. field defines the metadata field to use, and style is for CSS styling (using key/value for CSS rules)
  • metadataFields defines the fields available for searching in the search form for source and target. label is the name used in the form and value is the actual name of the metadata field as stored in the SQL database.
  • facetFields works the same way as metadataFields but for defining which fields are available in the faceted browser section.
  • timeSeriesIntervals defines the time intervals available for the time series functionnality.

Once you've edited these fields to your liking, you can regenerate your database by running the npm run build command from the directory where the appConfig.json file is located.

text-pair's People

Contributors

clovis avatar river974 avatar valerie-hanoka avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.