Git-Lit

Scripts to create git repositories for ALTO XML texts, like those from the British Library's scanned documents. These scripts produce the GitHub repositories that can be seen on Git-Lit.

Project Summary

This project aims to make the British Library's corpus of scanned and OCRed ALTO XML texts more accessible to digital humanists, by transforming the texts into useful file formats and publishing them to the Web as corpus repositories. This is intended to have a threefold effect. First, it will make public the heretofore obscure textual holdings of the British Library (with the Library's permission, of course). Second, it will transform the Library's verbose XML data into archival TEI XML and plain text formats that are easier to read and to analyze computationally. Third, it will make this data available to text analysts, editors, and other interested parties by creating a version-controlled git repository for each text and programmatically posting it to GitHub. This will allow for crowdsourced proofreading and collaborative improvement of the texts, as well as archival storage of every subsequent revision of each text.

Project Status

The main script, corresponding to Phase I, currently works with all the test texts. A few minor issues remain; they are listed in the issue tracker.

Project Planning

This project will be divided into these phases:

Phase I. A script will be written to parse each text's JSON metadata and use it to generate a GitHub repository title, description, and README.md file for the text. That script will then interface with the GitHub API to create a new repository, set these properties, and push a newly initialized git repository for each text. Apart from the added README file, each text remains unaltered from its original state. At this point the texts will already be public, and will already be useful to text analysts.
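The metadata step of Phase I might look roughly like the sketch below. It is only an illustration, not the project's actual code: the metadata keys (identifier, title, author), the naming scheme, and the README layout are all assumptions, and the call to the GitHub API is left out.

```python
import json
import re


def repo_properties(metadata_json):
    """Derive a repository name, description, and README from JSON metadata.

    The keys "identifier", "title", and "author" are hypothetical; the real
    metadata may use different field names.
    """
    meta = json.loads(metadata_json)
    title = meta.get("title", "Untitled")
    author = meta.get("author", "Unknown author")
    # Keep only letters and digits, joined by hyphens, and cap the length.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title)[:80].strip("-")
    name = f"{meta['identifier']}-{slug}"
    description = f"{title}, by {author}"
    readme = f"# {title}\n\nAuthor: {author}\n\nIdentifier: {meta['identifier']}\n"
    return {"name": name, "description": description, "readme": readme}


# Made-up sample record, for illustration only.
sample = (
    '{"identifier": "000000001", '
    '"title": "A Sample Novel: In Two Parts", "author": "Jane Doe"}'
)
props = repo_properties(sample)
print(props["name"])
print(props["description"])
```

The returned dictionary holds everything needed to create the repository and write README.md before the first commit.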

Phase II. Indices will be created for these texts in the form of submodule pointers. Parent repositories will be created for certain categories of texts, containing only pointers to subrepository remotes and their commit hashes. These category-based parent repositories might include "17th Century Novels," "18th Century Correspondence," or simply "Poetry," and the categories need not be mutually exclusive. This will allow a literary scholar interested in a particular category to assemble a corpus quickly by cloning the parent repository with git clone and checking out its submodules with git submodule update --init --recursive. An early sketch of this idea is outlined in my blog post, A Proposal for a Corpus Sharing Protocol.
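As a rough sketch of how a category parent repository could be assembled, the function below builds the git submodule add command for each text in a category. The directory names and example.org URLs are placeholders, and the function only builds the commands; it does not run them.

```python
def submodule_commands(texts):
    """Build `git submodule add` commands for a category parent repository.

    `texts` maps a local directory name to the remote URL of a text's
    repository. Both are placeholders in this sketch.
    """
    commands = []
    for path, url in sorted(texts.items()):
        commands.append(["git", "submodule", "add", url, path])
    return commands


# Hypothetical category with two texts.
poetry = {
    "text-b": "https://example.org/corpus/text-b.git",
    "text-a": "https://example.org/corpus/text-a.git",
}
for command in submodule_commands(poetry):
    print(" ".join(command))
```

Each command could then be run with subprocess.run(command, check=True, cwd=parent_repo) inside an initialized parent repository, followed by a commit that records the submodule pointers.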

Phase III. A script will be written to transform the text into a more useful format, by ingesting the verbose ALTO XML and outputting Markdown editions of each text. Markdown was chosen as the plain-text format because it is among the most human-readable formats and uses very little markup syntax, which makes it a reasonable format for computational analysis. GitHub also features an in-browser Markdown editor, which would allow any interested party to submit an edit to a text without leaving the browser. These Markdown editions will be programmatically committed and pushed to each repository.
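A minimal sketch of such a transformation follows, assuming the usual ALTO layout in which TextBlock elements contain TextLine elements, which in turn contain String elements carrying the word in a CONTENT attribute. It uses the standard library's ElementTree rather than lxml so that it runs without extra dependencies, and it treats each TextBlock as one paragraph, which is a simplification. Hyphenated line breaks and Markdown escaping are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without its namespace, since ALTO versions differ."""
    return element.tag.rsplit("}", 1)[-1]


def alto_to_markdown(alto_xml):
    """Convert an ALTO XML page into Markdown paragraphs.

    Each TextBlock becomes one paragraph; the words of each TextLine come
    from the CONTENT attribute of its String elements.
    """
    root = ET.fromstring(alto_xml)
    paragraphs = []
    for block in root.iter():
        if local_name(block) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line) != "TextLine":
                continue
            words = [s.get("CONTENT", "") for s in line.iter()
                     if local_name(s) == "String"]
            if words:
                lines.append(" ".join(words))
        if lines:
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs) + "\n"


# Tiny hand-written ALTO fragment, for illustration only.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock>
      <TextLine><String CONTENT="It"/><SP/><String CONTENT="was"/></TextLine>
      <TextLine><String CONTENT="a"/><SP/><String CONTENT="dark"/><SP/><String CONTENT="night."/></TextLine>
    </TextBlock>
    <TextBlock>
      <TextLine><String CONTENT="Chapter"/><SP/><String CONTENT="II"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""
print(alto_to_markdown(sample))
```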

Phase IV. Another transformation script will be written to ingest the ALTO XML and output TEI Simple XML. TEI Simple was chosen as an archival markup format, as it is a standardized subset of TEI XML, and eliminates many of the semantic ambiguities of the TEI superset. Many XSLT stylesheets and other tools have already been written for TEI XML, and it is the most feature-rich of textual markup languages.
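To give a sense of the target format, the sketch below wraps a title and a list of paragraph strings in a minimal TEI document. It follows the usual minimal TEI skeleton (teiHeader with fileDesc, then text and body), but it has not been validated against the TEI Simple schema, and the header statements are placeholders. The function name and its inputs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def paragraphs_to_tei(title, paragraphs):
    """Wrap a title and a list of paragraph strings in a minimal TEI document."""
    # Serialize TEI elements in the default namespace, without a prefix.
    ET.register_namespace("", TEI_NS)

    def tei(tag):
        return f"{{{TEI_NS}}}{tag}"

    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    # Placeholder text; a real edition would describe its publisher here.
    ET.SubElement(publication, tei("p")).text = "Placeholder publication statement."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Converted from ALTO XML."
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph
    return ET.tostring(root, encoding="unicode")


print(paragraphs_to_tei("A Sample Novel", ["It was a dark night.", "Chapter II"]))
```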

Technical Details

The main script is main.ipynb, an IPython Notebook. It requires some variables set in secrets.py, which is not included here for security reasons. It assumes that all documents live in /data relative to main.ipynb. It runs on Python 3 and requires the third-party libraries lxml, sh, pandas, github3 (distributed as github3.py), and jinja2, and possibly others, along with the standard library's logging module. It has only been tested on Linux, but may work on other platforms.
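Before opening the notebook, a quick check like the following can report which of the listed libraries are not importable in the current environment. This is a convenience sketch, not part of the project; the list covers only the libraries named above.

```python
import importlib.util

# Third-party modules named in this README; the notebook may need others.
REQUIRED = ["lxml", "sh", "pandas", "github3", "jinja2"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are available.")
```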

Contributors

jonathanreeve, prpole, sheesh, tfmorris
