parse-japanese

A Japanese language parser producing NLCST nodes.

  • For the semantics of the nodes, see NLCST.

Installation

npm:

npm install parse-japanese

Usage

var inspect = require('unist-util-inspect')

var ParseJapanese = require('parse-japanese')
var japanese = new ParseJapanese()

var text = '1 これは前段です。これは中段(2文の場合は後段。)です。これは後段です。\n'

japanese.parse(text, (cst) => {
  console.log(inspect(cst))
})
/**
* RootNode[1]
* └─ ParagraphNode[4]
*    ├─ SentenceNode[3]
*    │  ├─ WordNode[1]
*    │  │  └─ TextNode: '1'
*    │  ├─ WhiteSpaceNode: ' '
*    │  └─ WordNode[5]
*    │     ├─ TextNode: 'これ'
*    │     ├─ TextNode: 'は'
*    │     ├─ TextNode: '前段'
*    │     ├─ TextNode: 'です'
*    │     └─ PunctuationNode: '。'
*    ├─ SentenceNode[1]
*    │  └─ WordNode[14]
*    │     ├─ TextNode: 'これ'
*    │     ├─ TextNode: 'は'
*    │     ├─ TextNode: '中段'
*    │     ├─ PunctuationNode: '('
*    │     ├─ TextNode: '2'
*    │     ├─ TextNode: '文'
*    │     ├─ TextNode: 'の'
*    │     ├─ TextNode: '場合'
*    │     ├─ TextNode: 'は'
*    │     ├─ TextNode: '後段'
*    │     ├─ PunctuationNode: '。'
*    │     ├─ PunctuationNode: ')'
*    │     ├─ TextNode: 'です'
*    │     └─ PunctuationNode: '。'
*    ├─ SentenceNode[1]
*    │  └─ WordNode[5]
*    │     ├─ TextNode: 'これ'
*    │     ├─ TextNode: 'は'
*    │     ├─ TextNode: '後段'
*    │     ├─ TextNode: 'です'
*    │     └─ PunctuationNode: '。'
*    └─ WhiteSpaceNode: '
* '
*/


japanese = new ParseJapanese({pos: true})
text = 'すもももももももものうち。'

japanese.parse(text, (cst) => {
  console.log(inspect(cst))
})

/**
* RootNode[1]
* └─ ParagraphNode[2]
*    ├─ SentenceNode[1]
*    │  └─ WordNode[8]
*    │     ├─ TextNode: 'すもも' [data={"word_id":404420,"word_type":"KNOWN","word_position":1,"surface_form":"すもも","pos":"名詞","pos_detail_1":"一般","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"すもも","reading":"スモモ","pronunciation":"スモモ"}]
*    │     ├─ TextNode: 'も' [data={"word_id":2595480,"word_type":"KNOWN","word_position":4,"surface_form":"も","pos":"助詞","pos_detail_1":"係助詞","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"も","reading":"モ","pronunciation":"モ"}]
*    │     ├─ TextNode: 'もも' [data={"word_id":604730,"word_type":"KNOWN","word_position":5,"surface_form":"もも","pos":"名詞","pos_detail_1":"一般","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"もも","reading":"モモ","pronunciation":"モモ"}]
*    │     ├─ TextNode: 'も' [data={"word_id":2595480,"word_type":"KNOWN","word_position":7,"surface_form":"も","pos":"助詞","pos_detail_1":"係助詞","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"も","reading":"モ","pronunciation":"モ"}]
*    │     ├─ TextNode: 'もも' [data={"word_id":604730,"word_type":"KNOWN","word_position":8,"surface_form":"もも","pos":"名詞","pos_detail_1":"一般","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"もも","reading":"モモ","pronunciation":"モモ"}]
*    │     ├─ TextNode: 'の' [data={"word_id":2595360,"word_type":"KNOWN","word_position":10,"surface_form":"の","pos":"助詞","pos_detail_1":"連体化","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"の","reading":"ノ","pronunciation":"ノ"}]
*    │     ├─ TextNode: 'うち' [data={"word_id":1467000,"word_type":"KNOWN","word_position":11,"surface_form":"うち","pos":"名詞","pos_detail_1":"非自立","pos_detail_2":"副詞可能","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"うち","reading":"ウチ","pronunciation":"ウチ"}]
*    │     └─ PunctuationNode: '。' [data={"word_id":2612880,"word_type":"KNOWN","word_position":13,"surface_form":"。","pos":"記号","pos_detail_1":"句点","pos_detail_2":"*","pos_detail_3":"*","conjugated_type":"*","conjugated_form":"*","basic_form":"。","reading":"。","pronunciation":"。"}]
*    └─ WhiteSpaceNode: '
* ' [data={"surface_form":"\n","pos":"記号","pos_detail_1":"空白"}]
*/
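
With `pos: true`, each leaf node carries the kuromoji.js token under `data`, as the output above shows. The following is a minimal sketch of reading that information back out of a tree. The `collectReadings` helper is hypothetical (it is not part of parse-japanese), and the `tree` literal is a hand-trimmed copy of the output above with only a few `data` fields kept, so the snippet runs without loading the dictionary.

```javascript
// Hypothetical helper, not part of parse-japanese: walk an NLCST tree
// depth-first and collect part-of-speech data from leaf nodes.
function collectReadings (node, out) {
  out = out || []
  if (node.children) {
    // Parent nodes (Root, Paragraph, Sentence, Word) only hold children.
    node.children.forEach(function (child) {
      collectReadings(child, out)
    })
  } else if (node.data && node.data.reading) {
    // Leaf nodes without a `reading` (e.g. the trailing newline) are skipped.
    out.push({
      surface: node.value,
      pos: node.data.pos,
      reading: node.data.reading
    })
  }
  return out
}

// Hand-trimmed from the inspected output above.
var tree = {
  type: 'RootNode',
  children: [{
    type: 'ParagraphNode',
    children: [{
      type: 'SentenceNode',
      children: [{
        type: 'WordNode',
        children: [
          {type: 'TextNode', value: 'すもも', data: {pos: '名詞', reading: 'スモモ'}},
          {type: 'TextNode', value: 'も', data: {pos: '助詞', reading: 'モ'}},
          {type: 'TextNode', value: 'もも', data: {pos: '名詞', reading: 'モモ'}}
        ]
      }]
    }, {
      type: 'WhiteSpaceNode', value: '\n', data: {pos: '記号'}
    }]
  }]
}

var readings = collectReadings(tree)
console.log(readings.map(function (r) { return r.reading }).join(' '))
// スモモ モ モモ
```

In real code the `tree` literal would be replaced by the `cst` passed to the `parse` callback.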

API

ParseJapanese(options?)

Exposes the functionality needed to tokenize Japanese natural-language text into a syntax tree.

Parameters:

  • options (Object, optional)

    • position (boolean, default: true) - Whether to add positional information to nodes.
    • pos (boolean, default: false) - Whether to add part-of-speech information (obtained from kuromoji.js) to nodes.
    • dicDir (string, default: node_modules/parse-japanese/node_modules/kuromoji/dist/dict/) - Path to the dictionary directory used by kuromoji.js.

ParseJapanese#parse(value, cb)

Tokenizes Japanese natural-language text into an NLCST tree.

Parameters:

  • value (VFile or string) - Text document to parse.

  • cb (Function) - Callback function.

function cb(cst)

Callback invoked with the generated tree once the document has been processed.

Parameters:

  • cst (Object) - Root node of the generated NLCST tree, as printed by inspect(cst) in the Usage section.
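
Because the callback receives only the tree and no leading error argument, Node's `util.promisify` (which expects error-first callbacks) does not fit directly. A minimal sketch of a Promise adapter follows. `parseAsync` is a hypothetical helper, and `stubParser` is a stand-in that only mimics the documented `parse(value, cb)` shape so the snippet runs without the dictionary; it does not reproduce the real parser's output.

```javascript
// Hypothetical helper: adapt the single-argument callback of
// ParseJapanese#parse(value, cb) to a Promise.
function parseAsync (parser, value) {
  return new Promise(function (resolve) {
    // A synchronous throw inside the executor rejects the Promise.
    parser.parse(value, resolve)
  })
}

// Stand-in with the same parse(value, cb) shape (an assumption for this
// sketch); in real code pass `new ParseJapanese()` instead.
var stubParser = {
  parse: function (value, cb) {
    setTimeout(function () {
      cb({type: 'RootNode', children: []})
    }, 0)
  }
}

var pending = parseAsync(stubParser, 'これは前段です。')
pending.then(function (cst) {
  console.log(cst.type)
})
```

With the adapter in place, callers can use `await parseAsync(parser, text)` inside an `async` function instead of nesting callbacks.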

Related

License

MIT

Contributors

azu, muraken720

parse-japanese's Issues

Out of memory errors

This is the code I'm running, taken from the sample code:

var inspect = require('unist-util-inspect');

var ParseJapanese = require('parse-japanese');
var parser = new ParseJapanese();

var text = '1 これは前段です。これは中段(2文の場合は後段。)です。これは後段です。\n';

parser.parse(text, function(result) {
    console.log(inspect(result));
});

The output:

$ node parse-japanese.js

<--- Last few GCs --->

[6271:0x245fa70]    17400 ms: Mark-sweep 453.1 (979.5) -> 131.6 (655.0) MB, 1231.2 / 0.0 ms  allocation failure GC in old space requested
[6271:0x245fa70]    17639 ms: Mark-sweep 131.6 (1167.0) -> 131.6 (1165.5) MB, 181.2 / 0.0 ms  allocation failure GC in old space requested
[6271:0x245fa70]    17678 ms: Mark-sweep 131.6 (1677.5) -> 131.6 (1673.0) MB, 38.6 / 0.0 ms  last resort
[6271:0x245fa70]    17709 ms: Mark-sweep 131.6 (1673.0) -> 131.6 (1673.0) MB, 30.6 / 0.0 ms  last resort


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x34279dd14239 <JS Object>
    1: shrink [/home/blaine/parse-japanese-js/node_modules/doublearray/doublearray.js:~149] [pc=0x136e059b3321](this=0x35beedd52361 <an Object with map 0x3cae6d73b9e1>)
    2: new constructor(aka DoubleArray) [/home/blaine/parse-japanese-js/node_modules/doublearray/doublearray.js:468] [pc=0x136e059a65e7](this=0x35beedd4...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x12b288c [node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewFixedArray(int, v8::internal::PretenureFlag) [node]
 6: v8::internal::HashTable<v8::internal::StringTable, v8::internal::StringTableShape, v8::internal::HashTableKey*>::New(v8::internal::Isolate*, int, v8::internal::MinimumCapacity, v8::internal::PretenureFlag) [node]
 7: v8::internal::HashTable<v8::internal::StringTable, v8::internal::StringTableShape, v8::internal::HashTableKey*>::EnsureCapacity(v8::internal::Handle<v8::internal::StringTable>, int, v8::internal::HashTableKey*, v8::internal::PretenureFlag) [node]
 8: v8::internal::StringTable::LookupString(v8::internal::Isolate*, v8::internal::Handle<v8::internal::String>) [node]
 9: v8::internal::LookupIterator::PropertyOrElement(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, bool*, v8::internal::LookupIterator::Configuration) [node]
10: v8::internal::Runtime_KeyedGetProperty(int, v8::internal::Object**, v8::internal::Isolate*) [node]
11: 0x136e057843a7
