libfnl

Introduction

libfnl is an API and CLI that facilitates data and text mining by providing a collection of easy-to-use tools. The library is designed to work with Python 3 only. It is tuned towards mining biomedical and scientific texts, but can be used in other contexts, too. It complements the gnamed gene name repository daemon and the medic PubMed mirroring tool collection. In addition, the (orphaned) couchpy repository could provide a document storage facility.

The library contains the following packages:

fnl.nlp

tools to linguistically analyze text (tokenization, PoS tagging, phrase chunking, entity detection), plus modules to segment sentences (based on NLTK) and to map text (strings) to entries in dictionaries; this includes a Python wrapper for the GENIA Tagger, a Python wrapper for the NER Suite, and a handler for the GENIA corpus; furthermore, a Maximum Entropy classifier is available via NLTK's wrapper for MegaM

fnl.stat

a module to evaluate inter-rater Kappa scores and a module to develop text classifiers based on Scikit-Learn

fnl.text

wrappers to work with text data (strings, tokens, segments, annotations, etc.)

fnl.utils

additional utilities and tools (currently, just for handling JSON)

scripts

the CLI scripts to manage data/text, representing the main value provided by this collection
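To illustrate the kind of inter-rater agreement score mentioned for fnl.stat above, here is a minimal sketch of Cohen's kappa for two raters in plain Python. It is not taken from libfnl: the function name `cohen_kappa` and its signature are hypothetical, and the library's own module may compute other or additional statistics.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equally long sequences of labels.

    Hypothetical helper for illustration; not part of libfnl's API.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("need at least one labeled item")

    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement, from each rater's own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["gene", "gene", "other", "other"]
    annotator_2 = ["gene", "other", "other", "other"]
    print(cohen_kappa(annotator_1, annotator_2))
```

In the example, the raters agree on three of four items (0.75 observed) while chance agreement is 0.5, so kappa is 0.5.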

The script directory provides the following command-line interfaces:

  • fnlclassi: generate a classifier for [NER-tagged] text using Scikit-Learn.
  • fnlcorpus: store corpora in JSON format in a CouchDB.
  • fnldgrep: "grep" for tokens using a dictionary.
  • fnldictag: tag semantic tokens from a dictionary in linguistically annotated text.
  • fnlgpcounter: count gene/protein symbols in MEDLINE.
  • fnlkappa: calculate inter-rater agreement scores.
  • fnlsegment: segment text into sentences using NLTK (PunktSentenceTokenizer).
  • fnlsegtrain: train an nltk.punkt.PunktSentenceTokenizer.
  • fnltok: a fast, pure-Python, Unicode-aware string tokenizer.
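As a rough idea of what "Unicode-aware" tokenization can mean, the sketch below splits a string into runs of letters, runs of digits, and single symbols, using Unicode character categories. It is an assumption-laden illustration and not the algorithm used by fnltok: the functions `char_class` and `tokenize` are hypothetical names, and the real tool may define token boundaries differently.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse class based on its Unicode category."""
    category = unicodedata.category(ch)
    # Letters (L*) and combining marks (M*) both count as "letter",
    # so decomposed accents stay attached to their base character.
    if category[0] in ("L", "M"):
        return "letter"
    if category[0] == "N":
        return "digit"
    if ch.isspace():
        return "space"
    return "symbol"


def tokenize(text):
    """Split text into letter runs, digit runs, and single symbols."""
    tokens = []
    current = ""
    current_class = None
    for ch in text:
        cls = char_class(ch)
        # Only letter and digit runs are extended; symbols stay single.
        if cls == current_class and cls in ("letter", "digit"):
            current += ch
            continue
        if current:
            tokens.append(current)
        if cls == "space":
            # Whitespace separates tokens but is not emitted itself.
            current, current_class = "", None
        else:
            current, current_class = ch, cls
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(tokenize("IL-2 binds NF-κB."))
```

Because the classes come from `unicodedata` and not from ASCII ranges, the Greek letter in "NF-κB" stays inside the same token as the Latin "B".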

Warning

This project is under "continuous development"; you are better off taking your own snapshot.

Requirements

  • Python 3.2+
  • Numpy, SciPy, and Scikit-Learn 0.14+ (for fnlclassi)
  • NLTK 3.0+ (for the sentence segmenting tools fnlseg*)
  • DAWG (for fnlgpcounter; see Installation below)

Optional projects that work together with this project:

  • GENIA Tagger (optional, latest version)
  • NER Suite (optional, latest version, in turn requires CRF Suite)
  • MegaM - a MaxEnt classifier for NLTK with a (fast) L-BFGS optimizer
  • gnamed for creating gene/protein name repositories
  • medic for mirroring and handling PubMed citations
  • txtfnnl natural language processing tools based on Apache OpenNLP and UIMA

Installation

Into a Python 3 virtual environment:

pip install virtualenv # if virtualenv is not yet installed
git clone git://github.com/fnl/libfnl.git libfnl
virtualenv libfnl
cd libfnl
. bin/activate
pip install argparse # for python3 < 3.2
pip install numpy # because installing scipy fails if numpy isn't installed already
pip install -e . # installs all other dependencies

# if you prefer to install all other dependencies manually
# and/or prefer to use setup.py instead of pip:
# python setup.py install
pip install sqlalchemy
pip install scikit-learn # the PyPI package is named scikit-learn; it is imported as sklearn
pip install matplotlib
pip install nltk --pre # to get 3.0

# if you want to install the test environment:
pip install pytest

# special steps to install DAWG
git clone git@github.com:fnl/DAWG.git
cd DAWG
python setup.py install
cd ..
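
After these steps, a quick check that the environment matches the Requirements can save some debugging. The following is a minimal sketch, not part of libfnl: the helper names are hypothetical, and the import names in `REQUIRED_MODULES` (in particular `sklearn` for Scikit-Learn and `dawg` for DAWG) are assumptions about how the listed packages are imported. Note that `importlib.util.find_spec` requires Python 3.4 or later.

```python
import importlib.util
import sys

# Assumed import names for the packages listed under Requirements.
REQUIRED_MODULES = ("numpy", "scipy", "sklearn", "nltk", "dawg")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 2)):
    """Check the running interpreter against the minimum (major, minor)."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.2 or later is required")
    for name in missing_modules():
        print("missing module:", name)
```

Run it with the virtual environment activated; no output means every module in the list was found.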

License

All parts of this library are licensed under the GNU Affero GPL v3.

See the attached LICENSE.txt file.

© 2006-2014 Florian Leitner. All rights reserved.
