orfeo-importer

This program imports texts with linguistic annotations and generates outputs based on selected features of the annotated text. It reads files in CoNLL 2007, Macaon and TEI formats, generally merging information from several files (e.g. dependency trees from CoNLL or Macaon, metadata from TEI, time alignment information from TEI or Macaon). It then produces output in three formats: relAnnis 3.2 for importing into ANNIS; HTML as stand-alone pages for each sample; and index values for Apache Solr for text search, best suited for use with the associated Solr-based web search interface.

This program was created within the project ANR ORFEO. (The project is unrelated to a number of similarly named projects such as the Orfeo ToolBox library.)

Dependencies

Metadata is handled by orfeo-metadata, a Ruby gem in a separate repository, which should be installed before running this importer. The gem contains a default metadata model, but new ones can be defined using a simple column-based text file. See the metadata repository for details. Note: The metadata definitions used by the importer must match those used by the text search interface; otherwise the search interface will not function at all.

The directory data/files includes JavaScript components by other authors.

Configuration files

Default values can be defined for all options. A file can be created for each corpus to define extra information to be displayed.

Default settings

Default values for settings can be defined in a YAML file named settings.yaml in the directory where the importer is run. These values can still be overridden on the command line.

It is particularly advisable to store values that seldom change, like the base URLs of ANNIS and the sample pages, in the YAML file, so that they need not be specified every time the script is invoked.
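
The precedence described above (file defaults first, command-line values on top) can be sketched as follows. This is only an illustration, not the importer's actual code: the option names annis_url and samples_url, the matching command-line flags, and the helper functions are all hypothetical, and the real importer may use different keys.

```ruby
require 'yaml'
require 'optparse'

# Hypothetical file name constant; the document states the file is settings.yaml.
DEFAULTS_FILE = 'settings.yaml'

# Load defaults from the YAML file if it exists, otherwise start empty.
def load_defaults(path = DEFAULTS_FILE)
  return {} unless File.exist?(path)
  data = YAML.safe_load(File.read(path))
  data.is_a?(Hash) ? data : {}
end

# Parse command-line arguments into a hash.
# The flags and keys below are assumptions made for this sketch.
def parse_cli(argv)
  opts = {}
  OptionParser.new do |o|
    o.on('--annis-url URL') { |v| opts['annis_url'] = v }
    o.on('--samples-url URL') { |v| opts['samples_url'] = v }
  end.parse(argv)
  opts
end

# Values given on the command line take precedence over file defaults.
def effective_settings(argv, path = DEFAULTS_FILE)
  load_defaults(path).merge(parse_cli(argv))
end
```

With a settings.yaml that defines both (hypothetical) keys, calling effective_settings(['--samples-url', '...']) would keep annis_url from the file and replace only samples_url.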

Corpus information files

Corpus information files are stored in the directory data/corpora. Each file must be named after the directory containing the corresponding input files, e.g. example.txt for information about a corpus read in from a directory named example. The content is read line by line, and each line has a fixed meaning. Currently there are four lines:

  1. Name of the corpus, formatted for readability (e.g. "C-Oral-Rom" rather than "coralrom").
  2. URL to the homepage of the corpus or project
  3. Filename of a logo to be displayed for the corpus
  4. An abstract describing the corpus

Unused lines may be left empty but must not be omitted, so that the line numbering is preserved.

It is not obligatory to define corpus information for every corpus, but in the absence of an information file, the corpus information panel on the sample page will be virtually empty.
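
As an illustration of the four-line layout, the following sketch reads a corpus information file into a small structure. It is written under assumptions and is not taken from the importer: the names CorpusInfo and read_corpus_info are hypothetical, and the real code may handle whitespace or missing files differently.

```ruby
# Fields in file order: readable name, homepage URL, logo filename, abstract.
CorpusInfo = Struct.new(:name, :url, :logo, :abstract)

# Read the information file for a corpus directory.
# Returns nil when no information file exists (the file is optional).
def read_corpus_info(corpus_dir_name, info_dir = 'data/corpora')
  path = File.join(info_dir, "#{File.basename(corpus_dir_name)}.txt")
  return nil unless File.exist?(path)

  lines = File.readlines(path, chomp: true).map(&:strip)
  # Pad to four entries, then treat empty lines as unset fields.
  fields = (lines + [nil] * 4).first(4).map { |l| l.nil? || l.empty? ? nil : l }
  CorpusInfo.new(*fields)
end
```

Because fields are identified by position, an empty second line yields a record with no URL while the logo and abstract are still read from lines 3 and 4.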

License

GPL v3; see file LICENSE for full text of the license.

Contributors

clement-plancq, larilampen
