trivor-nlp

trivor-nlp uses NLP (Natural Language Processing) to detect sentences and tokens, as well as the meaning of each token in a given sentence. After all sentences have been processed, several generators produce insights, such as frequent sentences, vocabulary and sentiment analysis, that can be easily consumed.

Prerequisites

  • Java 8+

Usage

1. Add dependency

<dependency>
  <groupId>org.kalnee</groupId>
  <artifactId>trivor-nlp</artifactId>
  <version>0.0.1-alpha.2</version>
</dependency>

2. Create a Processor

trivor-nlp provides two processors:

  • TranscriptProcessor: general-purpose processor; the content can be supplied either as a URI or as a String.
  • SubtitleProcessor: subtitle-only processor; the content must be supplied as a URI.

Accepted URI schemes: file://, jar:// and s3:// (make sure the AWS credentials are in place).
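As a rough illustration of the schemes listed above, the following sketch builds file:// and s3:// URIs using only the Java standard library. The class name, the local path, the bucket and the key are hypothetical, and the bucket/key layout of the s3:// URI is an assumption rather than something the library documents here.

```java
import java.net.URI;
import java.nio.file.Paths;

// Hypothetical helper, not part of trivor-nlp.
final class UriExamples {

    // Converts a local path into an absolute file:// URI.
    static URI localFile(String path) {
        return Paths.get(path).toAbsolutePath().toUri();
    }

    // Builds an s3:// URI from a bucket and an object key (assumed layout).
    static URI s3Object(String bucket, String key) {
        return URI.create("s3://" + bucket + "/" + key);
    }

    public static void main(String[] args) {
        System.out.println(localFile("transcripts/forrest-gump.txt"));
        System.out.println(s3Object("my-bucket", "subtitles/forrest-gump.srt"));
    }
}
```

Either URI could then be passed to a processor's Builder in place of the uri variable used in the examples.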

Create a TranscriptProcessor from URI or String

// from URI
TranscriptProcessor fromUri = new TranscriptProcessor.Builder(uri).build();
// from String
TranscriptProcessor fromString = new TranscriptProcessor.Builder("This is a sentence.").build();

Customize

Filters and mappers

For each line in the provided content, custom filters and mappers can be used to clean up the text before running the NLP models. Both fields are optional.

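// singletonList is statically imported from java.util.Collections;
// TRANSCRIPT_REGEX and EMPTY are assumed to be String constants defined by the caller.
TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withFilters(singletonList(line -> !line.contains("Name")))
        .withMappers(singletonList(line -> line.replaceAll(TRANSCRIPT_REGEX, EMPTY)))
        .build();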
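To make the effect of filters and mappers more concrete, here is a minimal standalone sketch that does not use trivor-nlp. It assumes that filters behave like Predicate<String>, that mappers behave like Function<String, String>, and that filters run before mappers; the LineCleaner class and the TRANSCRIPT_REGEX value are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration of per-line clean-up, not the library's implementation.
final class LineCleaner {

    // Hypothetical pattern: removes bracketed annotations such as "[music]".
    static final String TRANSCRIPT_REGEX = "\\[.*?\\]\\s*";

    // Keeps a line only if every filter accepts it, then applies each mapper in order.
    static List<String> clean(List<String> lines,
                              List<Predicate<String>> filters,
                              List<Function<String, String>> mappers) {
        return lines.stream()
                .filter(line -> filters.stream().allMatch(f -> f.test(line)))
                .map(line -> {
                    String out = line;
                    for (Function<String, String> mapper : mappers) {
                        out = mapper.apply(out);
                    }
                    return out;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Name: Forrest Gump",
                "[music] Hello there.",
                "Life is like a box of chocolates.");
        List<String> cleaned = clean(
                lines,
                Collections.singletonList(line -> !line.contains("Name")),
                Collections.singletonList(line -> line.replaceAll(TRANSCRIPT_REGEX, "")));
        // prints [Hello there., Life is like a box of chocolates.]
        System.out.println(cleaned);
    }
}
```

The first line is dropped by the filter, and the bracketed annotation is stripped from the second line by the mapper, so only clean text would reach the NLP models.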

Settings

The following values can be overwritten by adding the Config class when building a Processor:

  • Vocabulary probability: Double (default: 0.9), e.g. only verbs with a probability >= 90% are accepted
  • Chunk probability: Double (default: 0.5)
  • Run Sentiment Analysis: Boolean (default: true)

TranscriptProcessor tp = new TranscriptProcessor.Builder(uri)
        .withConfig(new Config.Builder().vocabularyProb(.98).chunkProb(.98).sentimentAnalysis(false).build())
        .build();
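The following standalone sketch shows how a probability threshold of this kind might behave. It assumes the threshold is compared against a per-token probability with >=, as the vocabulary example above suggests; the ProbabilityThreshold and Token classes are hypothetical, and the sample probabilities are taken from the "My name's Forrest." output shown in the Result section.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of a probability cut-off, not the library's implementation.
final class ProbabilityThreshold {

    // Minimal stand-in for a tagged token: its text and the model's probability.
    static final class Token {
        final String text;
        final double prob;

        Token(String text, double prob) {
            this.text = text;
            this.prob = prob;
        }
    }

    // Token probabilities from the README's sample sentence.
    static final List<Token> SAMPLE = Arrays.asList(
            new Token("My", 0.976362822572366),
            new Token("name", 0.98267246788283),
            new Token("'s", 0.933313435543914),
            new Token("Forrest", 0.908174572293974),
            new Token(".", 0.982098322024085));

    // Returns the text of every token whose probability is at least minProb.
    static List<String> accepted(List<Token> tokens, double minProb) {
        return tokens.stream()
                .filter(token -> token.prob >= minProb)
                .map(token -> token.text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [My, name, 's, Forrest, .]
        System.out.println(accepted(SAMPLE, 0.9));
        // prints [name, .]
        System.out.println(accepted(SAMPLE, 0.98));
    }
}
```

Under these assumptions, raising the threshold from the default 0.9 to 0.98 trades coverage for confidence: fewer entries are kept, but each one has a higher model probability.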

Create a SubtitleProcessor from URI

final SubtitleProcessor sp = new SubtitleProcessor.Builder(uri).withDuration(43).build();

All the necessary filters and mappers have already been provided for a Subtitle.

3. Result

After successfully building a processor, the NLP results can be accessed as follows:

processor.getSentences()

This method returns the list of sentences. Each sentence is composed of the identified tokens, tags and chunks:

{
  "sentence" : "My name's Forrest.",
  "tokens" : [
    {
      "token" : "My",
      "tag" : "PRP$",
      "lemma" : "my",
      "prob" : 0.976362822572366
    },
    {
      "token" : "name",
      "tag" : "NN",
      "lemma" : "name",
      "prob" : 0.98267246788283
    },
    {
      "token" : "'s",
      "tag" : "POS",
      "lemma" : "'s",
      "prob" : 0.933313435543914
    },
    {
      "token" : "Forrest",
      "tag" : "NNP",
      "lemma" : "forrest",
      "prob" : 0.908174572293974
    },
    {
      "token" : ".",
      "tag" : ".",
      "lemma" : ".",
      "prob" : 0.982098322024085
    }
  ],
  "chunks" : [
    {
      "tokens" : [ "My", "name" ],
      "tags" : [ "PRP$", "NN" ]
    },
    {
      "tokens" : [ "'s", "Forrest" ],
      "tags" : [ "POS", "NNP" ]
    }
  ]
}

processor.getResult()

This method returns a Result object with several insights, such as:

  • Rate of Speech (only for Subtitles)
  • Frequency Rate
  • Frequent Sentences
  • Frequent Chunks
  • Vocabulary
  • Phrasal Verbs
  • Sentiment Analysis

The full documentation can be accessed here.

License

MIT (c) Kalnee. See LICENSE for details.
