
Jargon

Jargon is a text pipeline, focused on recognizing variations on canonical and synonymous terms.

For example, jargon lemmatizes react, React.js, React JS and REACTJS to a canonical reactjs.

Install

Binaries are available on the Releases page.

If you have Homebrew:

brew install clipperhouse/tap/jargon

If you have a Go installation:

go install github.com/clipperhouse/jargon/cmd/jargon@latest

To display usage, simply type:

jargon

Example:

curl -s https://en.wikipedia.org/wiki/Computer_programming | jargon -html -stack -lemmas -lines

CLI usage and details...

In your code

See GoDoc. Example:

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	stream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)

	// Loop while Scan() returns true. Scan() will return false on error or end of tokens.
	for stream.Scan() {
		token := stream.Token()
		// Do stuff with token
		fmt.Print(token)
	}

	if err := stream.Err(); err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}
}

// As an iterator, a token stream is 'forward-only'; once you consume a token, you can't go back.

// See also the convenience methods String, ToSlice, WriteTo

Token filters

Canonical terms (lemmas) are looked up in token filters. Several are available:

Stack Overflow technology tags

  • Ruby on Rails → ruby-on-rails
  • ObjC → objective-c

Contractions

  • Couldn’t → Could not

ASCII fold

  • café → cafe

Stem

  • Manager|management|manages → manag

To implement your own, see the Filter type.
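
For illustration, here is a minimal sketch of a map-based filter. It assumes the Filter signature and the NewToken/NewTokenStream constructors described in the GoDoc; the variants map and the Canonicalize name are hypothetical.

package myfilter

import "github.com/clipperhouse/jargon"

// variants is a hypothetical lookup of variations → canonical terms.
var variants = map[string]string{
	"Postgres":   "postgresql",
	"PostgreSQL": "postgresql",
}

// Canonicalize replaces known variations with their canonical term,
// passing all other tokens through verbatim.
func Canonicalize(incoming *jargon.TokenStream) *jargon.TokenStream {
	next := func() (*jargon.Token, error) {
		if !incoming.Scan() {
			return nil, incoming.Err()
		}
		token := incoming.Token()
		if canonical, ok := variants[token.String()]; ok {
			return jargon.NewToken(canonical, true), nil
		}
		return token, nil
	}
	return jargon.NewTokenStream(next)
}

It composes like the built-in filters: jargon.TokenizeString(text).Filter(Canonicalize).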

Performance

jargon is designed to work in constant memory, regardless of input size. It buffers input and streams tokens.

Execution time is designed to be O(n) in input size. It is I/O-bound; in your code, you control I/O (and therefore performance) via the Reader you pass to Tokenize.
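
For example, to process a large file in constant memory, pass a buffered Reader; a sketch (the file name is hypothetical):

f, err := os.Open("large-corpus.txt")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

// Tokenize consumes the Reader incrementally, so memory stays flat
// regardless of file size.
stream := jargon.Tokenize(bufio.NewReader(f)).Filter(stackoverflow.Tags)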

Tokenizer

Jargon includes a tokenizer based partially on Unicode text segmentation. It’s good for many common cases.

It preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).
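
A quick check of the round-trip property, as a sketch; it assumes the String convenience method mentioned above returns (string, error):

text := "Hello, world!\tTabs, spaces and punctuation all survive."
roundTripped, err := jargon.TokenizeString(text).String()
if err != nil {
	log.Fatal(err)
}
fmt.Println(roundTripped == text) // true: nothing dropped or altered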

Background

When dealing with technical terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents one technology, but databases naively see two words.

What’s it for?

  • Recognition of domain terms in text
  • NLP on unstructured data, where we wish to ensure a consistent vocabulary for statistical analysis
  • Search applications, where a search for “Ruby on Rails” is understood as a single entity rather than three unrelated words, or where “React”, “reactjs” and “react.js” are handled synonymously


jargon's Issues

Detect HTML for command line

Two situations where jargon should use the HTML tokenizer (instead of plain text):

  • Fetching via the -u flag, where Content-Type starts with text/html
  • Reading a file via the -f flag, and the file extension is .html or .htm
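
A minimal sketch of that detection (a hypothetical helper, not existing jargon code):

import (
	"path/filepath"
	"strings"
)

// useHTMLTokenizer reports whether to use the HTML tokenizer,
// per the two situations above.
func useHTMLTokenizer(contentType, path string) bool {
	if strings.HasPrefix(contentType, "text/html") {
		return true
	}
	switch strings.ToLower(filepath.Ext(path)) {
	case ".html", ".htm":
		return true
	}
	return false
}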

Numbers dictionary

Add the ability to canonicalize numbers that are expressed as words. Keep it simple for common cases.

"three hundred million" => "300000000"

Add a flag to save output to a file

Building upon #6

On some platforms it's a pain to redirect output correctly without significant overhead (for example, .NET's Process isn't super friendly for that). The alternative is having a shell run the actual command, which can also be irritating.

Presumably something like -o <outputfile>, to go along with -f, would solve this.
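
A minimal sketch of the flag handling (hypothetical; -o is the proposal above, not an existing flag):

out := flag.String("o", "", "a file path to write output to")
flag.Parse()

var w io.Writer = os.Stdout
if *out != "" {
	f, err := os.Create(*out)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w = f
}
// ...write lemmatized output to w as before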

Consider bufio for the tokenizer

As it stands now, the lexer loads the full string into memory. This is fine and simple enough for small strings.

If someone were to throw a large file at it, we'd want it to come through buffered, presumably using bufio.

Such a change might fit into the current lexer in a few strategic places, replacing the byte-reading bits with ReadByte. Or, deprecate the existing one and rewrite using bufio.Scanner.

(On the other side, out of scope for this issue, we might emit tokens in a buffered way, using channels perhaps.)
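
A minimal sketch of the buffered input side (the lexer type and its accept method are hypothetical stand-ins for the existing state machine):

func (lex *lexer) run(input io.Reader) error {
	r := bufio.NewReader(input)
	for {
		b, err := r.ReadByte()
		if err == io.EOF {
			return nil // end of input
		}
		if err != nil {
			return err
		}
		lex.accept(b) // feed the byte to the state machine as before
	}
}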

Implement codegen dictionary using Stack Exchange

Implement the first dictionary of tags & synonyms using the Stack Exchange API.

  • Fetch & deserialize JSON
  • Write file
  • Figure out how to do this in a separate package without a circular dependency
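
A minimal sketch of the fetch-and-deserialize step (the endpoint is the public Stack Exchange API; the field names here are assumptions to verify against its docs):

type tagsResponse struct {
	Items []struct {
		Name string `json:"name"`
	} `json:"items"`
	HasMore bool `json:"has_more"`
}

func fetchTags(page int) (*tagsResponse, error) {
	url := fmt.Sprintf("https://api.stackexchange.com/2.3/tags?site=stackoverflow&page=%d", page)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var result tagsResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, err
	}
	return &result, nil
}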

Pipe from Stdin instead of fetching the bytes

Currently, the jargon command line takes its input by specifying the source via flags.

  -f string
    	A file path to lemmatize
  -s string
    	A (quoted) string to lemmatize
  -u string
    	A URL to fetch and lemmatize

It occurs to me that jargon would play better simply by accepting Stdin.

There are already fine tools for reading files (cat) and fetching URLs (curl). jargon should just accept bytes piped from other tools.

Files

cat file.txt | jargon

replaces

jargon -f file.txt

URLs

curl https://example.com | jargon

replaces

jargon -u https://example.com

Strings

echo "I luv Rails" | jargon

replaces

jargon -s "I luv Rails"
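
In code, the change is small; a minimal sketch, assuming Tokenize accepts an io.Reader and using the WriteTo convenience method noted earlier:

stream := jargon.Tokenize(bufio.NewReader(os.Stdin)).Filter(stackoverflow.Tags)
if _, err := stream.WriteTo(os.Stdout); err != nil {
	log.Fatal(err)
}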

What dictionaries should be added?

Currently jargon uses a Dictionary based on Stack Exchange tags & synonyms: https://github.com/clipperhouse/jargon/tree/master/stackexchange

What other types of data would you like to see a Dictionary for? Recall, a Dictionary serves two purposes:

  • To ‘correct’ (lemmatize, canonicalize) concepts that might be expressed in different ways
  • To identify named entities, even if they don’t need to be ‘corrected’

One idea that comes to mind is cities, or perhaps geographies in general. There is a data source here: https://www.geonames.org/export/

tokenizer.go presupposes an older uax29

type tokenizer struct {
        sc *bufio.Scanner
}

Needs to be updated to

import (
	"github.com/clipperhouse/uax29/iterators"
)

type tokenizer struct {
        sc *iterators.Scanner
}

to work with more recent releases of uax29

Contractions dictionary

A suggestion from @kevin-montrose: a Dictionary to canonicalize contractions, e.g.

shouldn’t → should not

The list of English contractions appears short. This would suggest simply encoding that list in a map[string]string, keyed by the contraction.

contractions["shouldn't"] = "should not"
contractions["he's"] = "he is"
// etc

(With tolerance for smart apostrophes, and perhaps missing apostrophes.)
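
That tolerance could be a simple normalization before lookup; a minimal sketch:

// Normalize smart apostrophes to straight ones before the map lookup.
key := strings.ReplaceAll(token.String(), "’", "'")
if canonical, ok := contractions[key]; ok {
	// emit canonical in place of the original token
}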

Alternatively, we could try something rule-based, e.g.:

s/(.+)n't/$1 not/

My instinct is that this is likely overkill and prone to undesirable edge cases, compared to the first idea.
