
Text Pipeline Package


Overview

This package contains several modules/classes that provide a smooth process for cleaning text documents in preparation for NLP models, embedding generation, and other downstream tasks. Its goal is to offer a flexible way to pipeline and clean text while also allowing quick iteration and the ability to add your own functionality. The source code can be found in this repository.

Separate modules are provided to execute individual pipeline steps. Each module requires a parameter called `name` to specify which underlying library to use, along with a set of (sometimes optional) parameters to customize its behavior. Modules are instantiated as objects and passed to the Pipeline constructor in the order you wish the steps to execute.

Quick Start

Here are several code examples to get you started. As a note, the tokenizer module has the following input and output:

  • Input: List of strings representing a list of 'documents'
  • Output: List of list of strings representing a list of documents split into words or tokens

The other modules have the following input and output:

  • Input: List of list of strings representing a list of documents split into words or tokens
  • Output: List of list of strings representing a list of documents split into words or tokens
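
To make these shapes concrete, here is a tiny made-up corpus shown before and after tokenization. The documents and tokens below are purely illustrative; the exact tokens will depend on which tokenizer you choose.

# Tokenizer input: a list of strings, one per document (illustrative data)
docs = ["The cat sat on the mat.", "Dogs bark loudly."]

# Tokenizer output (and the input/output of the other modules):
# a list of lists of strings, one inner list of tokens per document
tokenized = [
    ["the", "cat", "sat", "on", "the", "mat", "."],
    ["dogs", "bark", "loudly", "."],
]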

Please check the documentation below for the specific pipeline module to find out what names and parameters are supported.

Example #1

# Initialize corpus as a list of strings, where each string is a document
import text_pipeline as tp
docs = get_my_corpus()

t = tp.Tokenizer('spacy')
s = tp.Stemmer('nltk', stemmer='snowball')
pipeline = tp.Pipeline(t, s)
cleaned_docs = pipeline.apply(docs)

Example #2

# Initialize corpus as a list of strings, where each string is a document
import text_pipeline as tp
docs = get_my_corpus()

params = {
    'remove_stops': True,
    'remove_nums': True,
    }
t = tp.Tokenizer('spacy')
f = tp.TokenFilter('spacy', **params)
s = tp.Stemmer('spacy')
f_2 = tp.TokenFilter('frequency', threshold=5)
pipeline = tp.Pipeline(t, f, s, f_2)
cleaned_docs = pipeline.apply(docs)

Example #3

# Initialize corpus as a list of strings, where each string is a document
import text_pipeline as tp
from spacy.symbols import ORTH, LEMMA  # symbols used in the special-case rules below
docs = get_my_corpus()

special_cases = [("don't",
                  [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
                  )]
t = tp.Tokenizer('spacy')
f = tp.TokenFilter(
        'spacy',
        add_special_case=special_cases
        )
s = tp.Stemmer('nltk', stemmer='porter')
pipeline = tp.Pipeline(t, f, s)
cleaned_docs = pipeline.apply(docs)

Installation

You can install the package via pip by running `pip install text_pipeline`.

To install the various data needed by the underlying libraries, we recommend you run the following commands inside your virtual environment. If you need help setting up a virtual environment, please check out The Hitchhiker's Guide to Python; we personally recommend virtualenv.

$ python -m spacy download en
$ python3
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> nltk.download('words')
>>> nltk.download('stopwords')

Usage

Pipeline.py

    class  text_pipeline.Pipeline(*args) 

Parameters:

  • args: Instantiated objects to be applied to text

    This is a variable-length argument of objects to apply to the text. The objects must be listed in the order you wish to apply them. The onus is on the user to ensure the inputs and outputs of each class match.

Attributes:

None

Methods:

  • apply(docs): Applies the pipeline to the text.

    Parameters:

    • docs: A list of strings, where each string represents a document.

    Notes:

    The return value is a list of lists of strings, where each string is an individual token or word.
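
Since Pipeline simply invokes each object you hand it, you can also slot in a step of your own. The following is a minimal sketch, assuming (as the built-in modules suggest) that a step only needs an apply method that accepts and returns a list of token lists; MinLengthFilter is a hypothetical class, not part of the package:

import text_pipeline as tp

class MinLengthFilter:
    """Hypothetical custom step: drop tokens shorter than min_len characters."""
    def __init__(self, min_len=3):
        self.min_len = min_len

    def apply(self, docs):
        # docs is a list of lists of tokens; the return value has the same shape
        return [[tok for tok in doc if len(tok) >= self.min_len] for doc in docs]

t = tp.Tokenizer('spacy')
pipeline = tp.Pipeline(t, MinLengthFilter(min_len=3))
cleaned_docs = pipeline.apply(get_my_corpus())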


Tokenizer.py

 class  text_pipeline.Tokenizer(name)

Parameters:

  • name: string

    The name of the tokenizer you wish to use.

Attributes:

None

Methods:

  • apply: Runs the tokenizer as specified by parameter name
  • spacy: This helper method is executed by apply. It should not be accessed from outside the class.
  • nltk: This helper method is executed by apply. It should not be accessed from outside the class.

Supported parameters for each name:

  • spacy
    • None
  • nltk
    • None
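
Each module can also be used on its own, outside a Pipeline. A minimal sketch, assuming apply takes the list of documents directly:

import text_pipeline as tp

t = tp.Tokenizer('nltk')
tokens = t.apply(["This is one document.", "And here is another one."])
# tokens is a list of lists of strings, one inner list per document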

TokenFilter.py

 class  text_pipeline.TokenFilter(name,
                                  to_lower=True,
                                  lemmatize=False,
                                  remove_stops=False,
                                  remove_nums=False,
                                  remove_oov=False,
                                  add_special_case=None,
                                  remove_url=True,
                                  remove_email=True,
                                  remove_punct=True)

Parameters:

  • name: string

    The name of the token filter you wish to use.

  • to_lower: boolean, optional, default True

    If true, convert all text to lowercase.

  • keep_alpha: boolean, optional, default False

    If true, remove all tokens that are not alpha characters.

  • keep_alpha_nums: boolean, optional, default True

    If true, remove all tokens that are not alpha characters or digits.

  • remove_stops: boolean, optional, default False

    If true, remove stop words according to chosen tokenizer's stop word list.

  • remove_nums: boolean, optional, default False

    If true, remove tokens that look like numbers.

  • remove_oov: boolean, optional, default False

    If true, remove out-of-vocabulary words according to the chosen backend's vocabulary.

  • add_special_case: list[tuple(string, list[dict])], optional, default None

    Support for special cases in spaCy. See Example #3 at the beginning of this README or the spaCy documentation for more details.

  • remove_url: boolean, optional, default True

    If true, remove tokens that look like urls.

  • remove_email: boolean, optional, default True

    If true, remove tokens that look like emails.

  • remove_punct: boolean, optional, default False

    If true, remove punctuation.

  • threshold: int, optional, default None

    Removes words with frequency count below threshold. Bound is exclusive, i.e. remove if < threshold.

Attributes:

None

Methods:

  • apply: Runs the token filter as specified by parameter name
  • spacy: This helper method is executed by apply. It should not be accessed from outside the class.
  • nltk: This helper method is executed by apply. It should not be accessed from outside the class.
  • frequency: This helper method is executed by apply. It should not be accessed from outside the class.

Supported parameters for each name:

spacy

  • to_lower
  • keep_alpha
  • keep_alpha_num
  • remove_stops
  • remove_num
  • add_special_case
  • remove_url
  • remove_email
  • remove_punct

nltk

  • to_lower
  • remove_stops
  • remove_oov

frequency

  • threshold
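
As a rough sketch of how these name/parameter combinations fit together (using the keyword form of the signature documented above; the parameter values here are arbitrary examples):

import text_pipeline as tp

# spaCy-backed filter: lowercase, drop stop words, numbers, and punctuation
f_spacy = tp.TokenFilter('spacy',
                         to_lower=True,
                         remove_stops=True,
                         remove_nums=True,
                         remove_punct=True)

# NLTK-backed filter: lowercase and drop out-of-vocabulary words
f_nltk = tp.TokenFilter('nltk', to_lower=True, remove_oov=True)

# Frequency filter: drop tokens that appear fewer than 5 times in the corpus
f_freq = tp.TokenFilter('frequency', threshold=5)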

Stemmer.py

 class  text_pipeline.Stemmer(name,
                              stemmer=None,
                              lemmatizer=None)

Parameters:

  • name: string

    The name of the stemmer you wish to use.

  • stemmer: str, optional, default None

    When 'porter', PorterStemmer is used. When 'snowball', SnowballStemmer is used.

  • lemmatizer: str, optional, default None

    When 'wordnet', WordNetLemmatizer is used.

Attributes:

None

Methods:

  • apply: Runs the stemmer as specified by parameter name
  • spacy: This helper method is executed by apply. It should not be accessed from outside the class.
  • nltk: This helper method is executed by apply. It should not be accessed from outside the class.

Supported parameters for each name:

spacy

  • None
  • The spacy backend defaults to lemmatization.

nltk

  • stemmer
  • lemmatizer

Example Usage

from text_pipeline import Stemmer

stemmer_1 = Stemmer('spacy')

# or
stemmer_2 = Stemmer('nltk', stemmer='snowball')

# or
stemmer_3 = Stemmer('nltk', stemmer='porter')

# or
stemmer_4 = Stemmer('nltk', lemmatizer='wordnet')

Testing

If you wish to test the package after adding functionality or making modifications, you can run the pre-existing unit tests yourself. From the terminal, type:

$ cd text_pipeline_repository
$ ls 
 text_pipeline/    tests/    ...
$ python -m unittest tests.unit_tests

Contributing

Thanks to Sanjana Kapoor for her help in writing this package, and thanks to Blaize Berry and Rachel Brynsvold for their insight and ideas.

Contributions are welcome and encouraged. Emailing ([email protected]) or opening a pull request are both good ways to contribute.
