string-punctuation-tokenizer

A small library that provides functions to tokenize a string into an array of words, with or without punctuation.

Setup

npm install string-punctuation-tokenizer

Usage

var stringTokenizer = require('string-punctuation-tokenizer');

or ES6

import {tokenize} from 'string-punctuation-tokenizer';

Tokenize with punctuation

import {tokenize} from 'string-punctuation-tokenizer';
let words = tokenize({text: 'Hello world, my name is Manny!', includePunctuation: true});
// words = ["Hello", "world", ",", "my", "name", "is", "Manny", "!"]

Tokenize without punctuation

import {tokenize} from 'string-punctuation-tokenizer';
let words = tokenize({text: 'Hello world, my name is Manny!'});
// words = ["Hello", "world", "my", "name", "is", "Manny"]

Documentation

See detailed documentation and live WYSIWYG playground here: https://string-punctuation-tokenizer.netlify.app/#/Tokenize

string-punctuation-tokenizer's People

Contributors

ancienttexts-net, birchamp, bspidel, da1nerd, dependabot-support, klappy, mandolyte, mannycolon, photonomad0, richmahn

Forkers

cdolek

string-punctuation-tokenizer's Issues

Unexpected word split when encountering "3-inch"

The tokenizer correctly keeps words joined by a single dash together (like “goat-hair”, Exodus 26:7 NLT 2007). However, we discovered that it splits the word when the dash joins a number and a word (like “3-inch”, Exodus 25:25 NLT 2007). Is there a way for us to keep that as a single word?
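One possible workaround for the behavior described above is a small standalone pattern that treats letters and digits alike on either side of a single hyphen. This is only a minimal sketch under that assumption, not the library's implementation; `tokenizeKeepingHyphens` and `HYPHENATED_WORD` are hypothetical names.

```javascript
// Hypothetical helper, NOT part of string-punctuation-tokenizer.
// A word is a run of Unicode letters/digits, optionally joined by single hyphens.
const HYPHENATED_WORD = /[\p{L}\p{N}]+(?:-[\p{L}\p{N}]+)*/gu;

function tokenizeKeepingHyphens(text) {
  // String.prototype.match with a global regex returns null when nothing matches.
  return text.match(HYPHENATED_WORD) || [];
}

console.log(tokenizeKeepingHyphens('a 3-inch border and goat-hair cloth'));
// [ 'a', '3-inch', 'border', 'and', 'goat-hair', 'cloth' ]
```

Because digits and letters share one character class, "3-inch" and "goat-hair" follow the same rule and no special case for numbers is needed.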

Add more info to verbose mode

In verbose mode, could you also include each token's position and character position in the sentence?
E.g. given "Hello World", "World" is at token position 1 (counting from 0) and character position 6. This shouldn't impact performance and would help to eliminate some processing in the wordmap lexer.
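The kind of output requested here could look roughly like the following. This is a sketch only: it splits on whitespace for simplicity, it does not reflect the library's actual verbose mode, and `tokenizeWithPositions` and the field names `tokenIndex` and `charIndex` are hypothetical.

```javascript
// Hypothetical helper, NOT the library's verbose mode.
// Records each whitespace-delimited token with its zero-based token index
// and the zero-based character offset where it starts.
function tokenizeWithPositions(text) {
  const tokens = [];
  for (const match of text.matchAll(/\S+/g)) {
    tokens.push({
      token: match[0],
      tokenIndex: tokens.length,
      charIndex: match.index,
    });
  }
  return tokens;
}

console.log(tokenizeWithPositions('Hello World'));
// [ { token: 'Hello', tokenIndex: 0, charIndex: 0 },
//   { token: 'World', tokenIndex: 1, charIndex: 6 } ]
```

Both positions come directly from the regex match, so tracking them adds no extra pass over the string.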

"½" character removed from word split

The tokenizer doesn’t keep fraction characters (I’m guessing it treats them as punctuation, because I chose not to include punctuation). For example, “7½ feet wide” is split into 3 words: “7”, “feet”, “wide”. Is there a way for us to keep the “½” character together with the “7”, like this: “7½”? See Exodus 27:1 NLT 2007.
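For illustration, a word pattern built on the Unicode number category `\p{N}` keeps “½” attached, since that category covers fraction characters as well as decimal digits. This is a hedged sketch of that idea, not how the library classifies characters; `tokenizeWithFractions` is a hypothetical name.

```javascript
// Hypothetical helper, NOT part of string-punctuation-tokenizer.
// \p{N} matches any Unicode number, including vulgar fractions such as "½",
// so "7½" stays a single token.
const WORD_WITH_NUMBERS = /[\p{L}\p{N}]+/gu;

function tokenizeWithFractions(text) {
  return text.match(WORD_WITH_NUMBERS) || [];
}

console.log(tokenizeWithFractions('7½ feet wide'));
// [ '7½', 'feet', 'wide' ]
```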

Update Change Log

This repo is promoted in the Open Components paper. One thing that will be required of OCs is better documentation. The change log does not seem to have been updated in three years.

Make example more interesting

The example 'Hello world, my name is Manny!' in the ReadMe leaves me with a few questions.

Could it possibly be extended to 'Hello world, my name is “bŏt#5”, and I don’t know Jess’ phone-number!'

Just curious. Thanks.

Add a license

Would it be possible to add a LICENSE file to the project so we could decide if we can use the package? Thanks!
