KWJA: Kyoto-Waseda Japanese Analyzer

[Paper] [Slides]

KWJA is a Japanese language analyzer based on pre-trained language models. KWJA performs many language analysis tasks, including:

  • Typo correction
  • Word segmentation
  • Word normalization
  • Morphological analysis
  • Word feature tagging
  • NER (Named Entity Recognition)
  • Base phrase feature tagging
  • Dependency parsing
  • PAS analysis
  • Bridging reference resolution
  • Coreference resolution
  • Discourse relation analysis

Requirements

Getting Started

Install KWJA with pip:

$ pip install kwja

Perform language analysis with the kwja command (the result is in the KNP format):

# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze a text file and write the result to a file
$ kwja --file path/to/file.txt > path/to/analyzed.knp

# Analyze texts interactively
$ kwja
Please end your input with a new line and type "EOD"
KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。
EOD

The output is in the KNP format, like the following:

# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <rel type="=" target="ツール" sid="202210011918-0-0" id="5"/><体言><NE:ARTIFACT:KWJA>
KWJA KWJA KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
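To make the layout of this output concrete, here is a minimal sketch of a line classifier written against the sample above. It is not part of KWJA or rhoknp: `parse_knp` and `Morpheme` are hypothetical names, and the field order (surface, reading, lemma, POS, sub-POS, conjugation type, conjugation form, each tag followed by a numeric ID) is an assumption read off the sample. It ignores edge cases such as a morpheme whose surface form is itself `*`, `+`, or `#`, so prefer rhoknp for real use.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str
    reading: str
    lemma: str
    pos: str
    subpos: str
    conjtype: str
    conjform: str


def parse_knp(text):
    """Group KNP-format lines by their leading marker (illustrative only)."""
    parsed = {"comments": [], "phrases": [], "base_phrases": [], "morphemes": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "EOS":
            continue  # blank line or end-of-sentence marker
        if line.startswith("# "):
            parsed["comments"].append(line[2:])      # e.g. the S-ID line
        elif line.startswith("* "):
            parsed["phrases"].append(line[2:])       # phrase line
        elif line.startswith("+ "):
            parsed["base_phrases"].append(line[2:])  # base phrase line
        else:
            f = line.split(" ")
            if len(f) < 11:
                continue  # not a morpheme line in the layout assumed here
            # f[4], f[6], f[8], f[10] are the numeric IDs that follow each tag
            parsed["morphemes"].append(
                Morpheme(f[0], f[1], f[2], f[3], f[5], f[7], f[9])
            )
    return parsed


# Shortened copy of the sample output shown above
sample = """\
# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <体言><NE:ARTIFACT:KWJA>
KWJA KWJA KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
"""

parsed = parse_knp(sample)
for m in parsed["morphemes"]:
    print(m.surface, m.pos, m.subpos)
```

Each morpheme line carries its tags in fixed positions, which is why a plain split on spaces is enough for this sample.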

Here are some other options for the kwja command:

--model-size: Model size to be used. Please specify 'tiny', 'base' (default) or 'large'.

--device: Device to be used. Please specify 'cpu' or 'gpu'.

--typo-batch-size: Batch size for typo module.

--seq2seq-batch-size: Batch size for seq2seq module.

--char-batch-size: Batch size for char module.

--word-batch-size: Batch size for word module.

--tasks: Tasks to be performed. Please specify 'typo', 'char', 'seq2seq', 'typo,char', 'typo,seq2seq', 'seq2seq,word', 'char,word', 'typo,char,word', 'typo,seq2seq,word', 'char,word,word_discourse', 'seq2seq,word,word_discourse', 'typo,char,word,word_discourse' or 'typo,seq2seq,word,word_discourse'.

  • typo: Typo correction
  • seq2seq: Word segmentation, word normalization, reading prediction, lemmatization, and canonicalization
  • char: Word segmentation and word normalization
  • word: Morphological analysis, named entity recognition, word feature tagging, dependency parsing, PAS analysis, bridging reference resolution, and coreference resolution
  • word_discourse: Discourse relation analysis
    • If you need the results of discourse relation analysis, please specify this in addition to word.
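When calling the command from a script, the options above can be assembled programmatically. The following is a minimal sketch, not part of KWJA: `build_kwja_command` and its constants are hypothetical helper names, and the validation covers only what the option list states (the known module names, and `word_discourse` requiring `word`). It does not check every allowed task combination.

```python
import shlex

VALID_MODEL_SIZES = {"tiny", "base", "large"}
VALID_DEVICES = {"cpu", "gpu"}
# Module names taken from the --tasks description above
KNOWN_TASKS = {"typo", "char", "seq2seq", "word", "word_discourse"}


def build_kwja_command(text, model_size="base", device=None, tasks=None):
    """Return the argument list for one `kwja --text ...` invocation."""
    if model_size not in VALID_MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    cmd = ["kwja", "--text", text, "--model-size", model_size]
    if device is not None:
        if device not in VALID_DEVICES:
            raise ValueError(f"unknown device: {device!r}")
        cmd += ["--device", device]
    if tasks is not None:
        unknown = [t for t in tasks if t not in KNOWN_TASKS]
        if unknown:
            raise ValueError(f"unknown tasks: {unknown}")
        if "word_discourse" in tasks and "word" not in tasks:
            raise ValueError("word_discourse must be specified together with word")
        cmd += ["--tasks", ",".join(tasks)]
    return cmd


cmd = build_kwja_command("KWJAは日本語の統合解析ツールです。", device="cpu", tasks=["char", "word"])
# Show the command as it would be typed in a shell
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`, whose `stdout` holds the KNP-format result.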

You can read a KNP format file with rhoknp.

from rhoknp import Document

# Parse the KNP-format analysis result into a rhoknp Document
with open("analyzed.knp", encoding="utf-8") as f:
    parsed_document = Document.from_knp(f.read())

For more details about the KNP format, see Reference.

Usage from Python

Make sure the kwja command is in your path:

$ which kwja
/path/to/kwja
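The same check can be done from Python before any analysis is attempted. This is a small sketch using only the standard library; `find_kwja` is a hypothetical helper name, not something KWJA or rhoknp provides.

```python
import shutil


def find_kwja(command="kwja"):
    """Return the full path of the kwja executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_kwja()
if path is None:
    print("kwja not found on PATH; install it with `pip install kwja`")
else:
    print(f"kwja found at {path}")
```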

Install rhoknp:

$ pip install rhoknp

Perform language analysis with a KWJA instance:

from rhoknp import KWJA
kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)

Performance Table

  • The performance on each task except typo correction and discourse relation analysis is the mean over all the corpora (KC, KWDLC, Fuman, and WAC) and over three runs with different random seeds.
  • We set the learning rate of RoBERTa LARGE (word) to 2e-5 because fine-tuning failed with a higher learning rate. The other hyperparameters are the same as those described in configs, which are tuned for DeBERTa BASE.
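
| Task                                       | RoBERTa BASE (char, word) | DeBERTa BASE (char, word) | RoBERTa LARGE (char, word) | DeBERTa LARGE (char, word) |
|--------------------------------------------|---------------------------|---------------------------|----------------------------|----------------------------|
| Typo Correction                            | TBU                       | TBU                       | TBU                        | TBU                        |
| Sentence Segmentation                      | TBU                       | TBU                       | TBU                        | TBU                        |
| Word Segmentation                          | 98.5                      | 98.6                      | 98.7                       | 98.9                       |
| Word Normalization                         | 44.0                      | 39.2                      | 39.8                       | 46.0                       |
| Morphological Analysis: POS                | 99.3                      | 99.4                      | 99.3                       | 99.4                       |
| Morphological Analysis: sub-POS            | 98.1                      | 98.4                      | 98.2                       | 98.4                       |
| Morphological Analysis: conjugation type   | 99.4                      | 99.5                      | 99.2                       | 99.4                       |
| Morphological Analysis: conjugation form   | 99.5                      | 99.6                      | 99.4                       | 99.6                       |
| Morphological Analysis: reading            | 95.5                      | 95.2                      | 90.8                       | 95.1                       |
| Named Entity Recognition                   | 83.0                      | 83.8                      | 82.1                       | 84.6                       |
| Linguistic Feature Tagging: word           | 98.3                      | 98.5                      | 98.5                       | 98.4                       |
| Linguistic Feature Tagging: base phrase    | 86.6                      | 89.5                      | 86.4                       | 89.3                       |
| Dependency Parsing                         | 92.9                      | 93.4                      | 93.8                       | 93.3                       |
| PAS Analysis                               | 74.2                      | 76.7                      | 75.3                       | 76.9                       |
| Bridging Reference Resolution              | 66.5                      | 67.3                      | 65.2                       | 67.0                       |
| Coreference Resolution                     | 74.9                      | 78.4                      | 75.9                       | 78.0                       |
| Discourse Relation Analysis                | 42.2                      | 44.8                      | 41.3                       | 41.0                       |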

Citation

@InProceedings{植田2022,
  author    = {植田 暢大 and 大村 和正 and 児玉 貴志 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {KWJA:汎用言語モデルに基づく日本語解析器},
  booktitle = {第253回自然言語処理研究会},
  year      = {2022},
  address   = {京都},
}
@InProceedings{児玉2023,
  author    = {児玉 貴志 and 植田 暢大 and 大村 和正 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {テキスト生成モデルによる日本語形態素解析},
  booktitle = {言語処理学会 第29回年次大会},
  year      = {2023},
  address   = {沖縄},
}

Reference
