toolbox

ARRAS should not be thought of as a black box into which one inserts a text along with a set of commands and out of which one receives a completed analysis. A better analogy is a toolbox containing a set of tools, each designed for a particular task. The ARRAS design always presumes a human inquirer at the center. Thus ARRAS amplifies, rather than replaces, specific perceptual and cognitive functions. (John B. Smith, "A New Environment for Literary Analysis", Perspectives in Computing 4.2/3, 1984)

What is the toolbox?

The toolbox is a collection of small work-in-progress scripts and code snippets for text processing produced by CLiGS.

Note that all functions are designed for Python 3 and are experimental in nature and quality. Each folder contains one or several Python scripts and some sample texts for testing. Currently, we are transitioning towards the toolbox as a module (see below).

Experimental feature: toolbox as module

This feature lets you use the scripts as a repository-based module. The basic idea is that you clone the toolbox repository from GitHub and add the path of the folder containing the toolbox to your Python sys.path (using the script "activate_toolbox.py", which is included in this repository). You can then import modules and submodules from the toolbox in your own text processing scripts anywhere on your computer and use the functions the toolbox provides. You may want to create your own branch of the toolbox to customize the functions as necessary.
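The path setup described above can be sketched as follows. This is a minimal illustration, not the contents of "activate_toolbox.py": the function name activate_toolbox and the folder "~/repos" are assumptions made for the example. The folder added to sys.path should be the one that contains the cloned toolbox directory, not the toolbox directory itself.

```python
import os
import sys


def activate_toolbox(parent_dir):
    """Add the folder that contains the cloned toolbox to sys.path (only once).

    Hypothetical helper for illustration; the real activate_toolbox.py may differ.
    """
    path = os.path.abspath(os.path.expanduser(parent_dir))
    if path not in sys.path:
        sys.path.append(path)
    return path


# Assumed location: the folder *containing* the "toolbox" clone.
activate_toolbox("~/repos")
```

After this call, a statement such as `from toolbox import extract` should resolve, provided the clone really lives in that folder.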

Requirements

  • pandas
  • numpy
  • requests
  • lxml
  • ...

Module structure

To use the module efficiently, you need to know which submodules it includes and which functions each submodule provides. The following is intended as a quick overview; please see the submodules themselves for details.

  • extract.py
    • read_tei5
    • read_tei4
    • get_metadata
    • get_metadataP4
  • crawl.py
    • crawl_tc
    • convert_encoding
  • annotate
    • annotate_fw.py
      • use_freeling
      • use_wordnet
      • annotate_fw
    • fw2txm.py
      • fw2txm
    • prepare_tei.py
      • prepare_anno
      • postpare_anno
      • prepare
    • use_heideltime.py
      • apply_ht
    • workflow_teifw.py
  • check_quality
    • spellchecking.py
      • check_collection
      • correct_words
    • validate_tei.py
      • validate_tei
    • elements_used.py
  • extract
    • tei2pdf.py
      • convert2pdf
    • tei2pdf.xsl

To get more information about a submodule, especially what each function does and which parameters it takes, use the usual help command in Python, for example:

help(extract)

or

help(extract.read_tei5)
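If you only want a quick list of the functions a submodule defines, the standard inspect module can complement help(). The sketch below uses the standard-library json module as a stand-in so that it runs without the toolbox installed; list_functions is a hypothetical helper, not part of the toolbox.

```python
import inspect
import json  # stand-in for a toolbox submodule such as extract


def list_functions(module):
    """Return the sorted names of public functions defined in the module itself."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_") and obj.__module__ == module.__name__
    )


print(list_functions(json))
```

With the toolbox on your path, passing a toolbox submodule instead of json should list that submodule's functions in the same way.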

Example

If you want to read text from a TEI P5 file, you could use the following import statement and function call in your script:

from toolbox import extract

extract.read_tei5("/folder/with/tei/files/", "/folder/for/text/files", "bodytext")
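To give a rough idea of what extracting text from a TEI P5 file involves, here is a minimal sketch using only the standard library. It is not the implementation of read_tei5: the function body_text and the sample document are assumptions for illustration, and the toolbox itself may well use lxml and handle many more cases.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def body_text(tei_string):
    """Return the whitespace-normalised text inside the first <body> element."""
    root = ET.fromstring(tei_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() yields the text and tail of every descendant element.
    return " ".join("".join(body.itertext()).split())


# Hypothetical sample document, for illustration only.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph.</p>
      <p>Second one.</p>
    </body>
  </text>
</TEI>"""

print(body_text(sample))
```

Note that the header title is not part of the result, because only the content of the body element is collected.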

Contributors

christofs, hennyu, morethanbooks, stefaniepopp
