
naruhodo's Introduction

Naruhodo(なるほど)

License: MIT

A Japanese version of this README is also available.

naruhodo is a Python library for automatic semantic graph generation from human-readable text. Graphs generated by naruhodo are networkx objects, so you can conveniently apply further graph analysis and processing.

Supported languages:

  • Japanese
  • English(WIP)
  • Chinese(WIP)

Supported semantic graph types:

  • knowledge-structure-graph(KSG): A directed graph based on entity-predicate model of knowledge representation.
  • dependency-structure-graph(DSG): A directed graph based on dependency structure.

naruhodo provides basic visualization utilities using nxpd. A full-fledged visualization webapp, naruhodo-viewer, is also available. It provides faster, interactive visualization of large graphs.

Knowledge structure graph(KSG)

Knowledge structure graph(KSG) tries to capture meaningful relationships between different entities. It is generated based on the dependency structure of the text.

In KSG, there are primarily two kinds of nodes:

  • nodes that represent entities (entities)
  • nodes that represent properties or actions (predicates)

Entity nodes are primarily connected by predicate nodes. Edges represent the relationship between these nodes. If an entity has a property or performs an action, it has an edge pointing to corresponding predicate nodes. If an entity is part of the property of another entity or the object of an action, it has an edge pointing from the predicate node to it.
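
The edge directions described above can be sketched with a tiny hand-built graph. This is only an illustration, not naruhodo output: the sentence "Taro reads a book", the node names, and the `describe` helper are all assumptions made for the example. The edge types `sub` and `obj` are taken from the edge-type table later in this README.

```python
# Hypothetical hand-built KSG for "Taro reads a book" (not produced by naruhodo).
ENTITY, PREDICATE = "entity", "predicate"

nodes = {
    "Taro": ENTITY,
    "book": ENTITY,
    "read": PREDICATE,
}

# (source, target, attributes) triples.
edges = [
    ("Taro", "read", {"type": "sub"}),  # the entity performs the action
    ("read", "book", {"type": "obj"}),  # the predicate points to its object
]


def describe(edges, nodes):
    """Return one human-readable line per edge."""
    return [
        f"{src} ({nodes[src]}) -[{attrs['type']}]-> {dst} ({nodes[dst]})"
        for src, dst, attrs in edges
    ]


for line in describe(edges, nodes):
    print(line)
```

The triples have the shape accepted by networkx's `DiGraph.add_edges_from`, so the same data could be loaded into a real graph object.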

This kind of graph structure is inspired by the way human brains store knowledge/information. Accurate and comprehensive KSG parsing is the main focus of naruhodo.

Below is an example of KSG generated by naruhodo using the following texts.

"田中一郎は田中次郎が描いた絵を田中三郎に贈った。"
"三郎はこの絵を持って市場に行った。"
"彼はしばらく休む事にした。"

KSG generated from texts

Dependency structure graph (DSG)

Dependency parsing is the analysis of the dependency grammar of a block of text using computer programs. Because dependency grammar links words in one direction, the result of dependency parsing is a directed graph. naruhodo generates dependency structure graphs(DSG) directly from the output of dependency parsing programs.
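
As a rough sketch of why a dependency parse is already a directed graph, the snippet below turns a list of chunks, each carrying the index of its head chunk, into edges. The chunk segmentation and head indices are written by hand for illustration and are not actual parser output. The `dependency_edges` helper and the dependent-to-head edge direction are likewise assumptions, not naruhodo's implementation.

```python
def dependency_edges(chunks):
    """Build (dependent, head) edges.

    chunks: list of (text, head_index); head_index -1 marks the root chunk.
    """
    edges = []
    for text, head in chunks:
        if head >= 0:
            edges.append((text, chunks[head][0]))
    return edges


# Hand-written segmentation of the first example sentence (illustrative only).
chunks = [
    ("田中一郎は", 5),
    ("田中次郎が", 2),
    ("描いた", 3),
    ("絵を", 5),
    ("田中三郎に", 5),
    ("贈った。", -1),
]

edges = dependency_edges(chunks)
for src, dst in edges:
    print(f"{src} -> {dst}")
```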

Below is an example of DSG generated by naruhodo using the following texts.

"田中一郎は田中次郎が描いた絵を田中三郎に贈った。"
"三郎はこの絵を持って市場に行った。"
"彼はしばらく休む事にした。"

DSG generated from texts

Installation

naruhodo supports python version 3.4 and above. You can install the library directly using pip:

pip install naruhodo

This will install the latest release version of naruhodo. The current development version of naruhodo can be installed from github repository directly using the following command:

pip install https://github.com/superkerokero/naruhodo/archive/dev.zip

naruhodo relies on external programs to do Japanese word and dependency parsing, so you need to have corresponding programs installed as well.

naruhodo is designed to support multiple backend parsers, but currently only the support for mecab + cabocha is implemented.

For a guide on installing mecab and cabocha, please refer to this page:

Installing MeCab and CaboCha on Amazon Linux (in Japanese)

Support for other parsers such as KNP is planned in the future.
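
Before using the library, it can help to confirm that the backend executables are reachable. The following is a minimal sketch using only the standard library. The `find_backends` helper is hypothetical and not part of naruhodo, and it assumes the parsers are installed under their usual command names `mecab` and `cabocha`.

```python
import shutil


def find_backends(commands=("mecab", "cabocha")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {cmd: shutil.which(cmd) for cmd in commands}


for cmd, path in find_backends().items():
    status = path if path else "NOT FOUND on PATH"
    print(f"{cmd}: {status}")
```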

Nodes-and-edges-specification

naruhodo stores graph information in a networkx DiGraph object. The properties of nodes and edges provided by naruhodo are listed in the following tables.

  • Node properties

    | Property | Description |
    | --- | --- |
    | name | A string that stores the name of the node in the graph. This is what you use to refer to the node from the graph object. |
    | count | An integer counting how many times this node is referred to. Can be used as an indicator of the node's significance. |
    | type | An integer representing the type of the node. For meanings of integers, refer to the table of node types below. |
    | label | A string that stores the normalized representation of the node. This is what you see in the visualizations. |
    | pro | An integer representing the pronoun type of this node. For meanings of integers, refer to the table of pronoun types below. |
    | NE | An integer representing the named-entity(NE) type of this node. For meanings of integers, refer to the table of NE types below. |
    | negative | 1 if the chunk is negative, -1 if it is double negative (strongly positive), otherwise 0. |
    | question | 1 if the chunk contains "?", otherwise 0. |
    | passive | 1 if the chunk is passive, otherwise 0. |
    | compulsory | 1 if the chunk is compulsory, otherwise 0. |
    | tense | 0 if the chunk has no tense or is in the present tense, -1 if past, 1 if present continuous. |
    | pos[0:n-1] | A list of integers representing the ids of the sentences where this node appears. |
    | lpos[0:n-1] | A list of integers representing the id of the chunk in each sentence where it appears. |
    | surface[0:n-1] | A list of strings that stores the surfaces of this node (the original form as it appears in the text). |
    | yomi[0:n-1] | A list of strings that stores the yomi of the corresponding surface of this node. |
    | sub | A string that stores the subject of this node (an empty string if there is none). |
  • Node types

    | Type ID | Description |
    | --- | --- |
    | -1 | Unknown type |
    | 0 | Noun |
    | 1 | Adjective |
    | 2 | Verb |
    | 3 | Conjective |
    | 4 | Interjection |
    | 5 | Adverb |
    | 6 | Connect |
  • Pronoun types

    | Pronoun ID | Description |
    | --- | --- |
    | -1 | Not a pronoun(or unknown pronoun) |
    | 0 | Demonstrative-location |
    | 1 | Demonstrative-object |
    | 2 | Personal-1st |
    | 3 | Personal-2nd |
    | 4 | Personal-3rd |
    | 5 | Indefinite |
    | 6 | Inclusive |
    | 7 | Omitted subject |
  • Named-entity types

    | NE ID | Description |
    | --- | --- |
    | 0 | Not named-entity(or unknown) |
    | 1 | Person |
    | 2 | Location |
    | 3 | Organization |
    | 4 | Number/Date |
    | 5 | General |
  • Edge properties

    | Property | Description |
    | --- | --- |
    | weight | An integer counting how many times this edge appears. Can be used as an indicator of the edge's significance. |
    | label | A string that stores the label of this edge. |
    | type | A string that stores the type of this edge. For details, refer to the table of edge types below. |
  • Edge types

    | Type | Description |
    | --- | --- |
    | none | Unknown type(also used as DSG edges) |
    | sub | Edge from a subject to predicate |
    | autosub | Edge from a potential subject to predicate |
    | obj | Edge from a predicate to object |
    | aux | Edge from auxiliary to predicate |
    | cause | Edge from potential cause to result |
    | coref | Edge from potential antecedent to pronoun |
    | synonym | Edge from potential synonym to an entity |
    | para | Edge between parallel entities |
    | attr | Edge from potential attribute to an entity |
    | stat | Edge from potential subject to a statement |
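
The properties listed above can be used to filter and rank graph elements. The sketch below works on plain `(name, attrs)` pairs and `(source, target, attrs)` triples, which is the shape that networkx yields from `G.nodes(data=True)` and `G.edges(data=True)`. The sample data and the helpers `top_nodes` and `edge_type_histogram` are hypothetical and made up for illustration. They are not output or API of naruhodo.

```python
from collections import Counter


def top_nodes(node_items, node_type=0, n=3):
    """Return the n most referenced nodes of a given type (0 = Noun)."""
    picked = [
        (name, attrs["count"])
        for name, attrs in node_items
        if attrs.get("type") == node_type
    ]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)[:n]


def edge_type_histogram(edge_items):
    """Count edges per 'type' property; a missing type is counted as 'none'."""
    return Counter(attrs.get("type", "none") for _, _, attrs in edge_items)


# Hypothetical sample data shaped like G.nodes(data=True) / G.edges(data=True).
sample_nodes = [
    ("絵", {"count": 3, "type": 0}),
    ("三郎", {"count": 2, "type": 0}),
    ("贈る", {"count": 1, "type": 2}),
    ("市場", {"count": 1, "type": 0}),
]
sample_edges = [
    ("一郎", "贈る", {"type": "sub", "weight": 1}),
    ("贈る", "絵", {"type": "obj", "weight": 1}),
    ("贈る", "三郎", {"type": "obj", "weight": 1}),
]

print(top_nodes(sample_nodes, n=2))
print(edge_type_histogram(sample_edges))
```

With a real graph `G`, the same functions could be called as `top_nodes(G.nodes(data=True))` and `edge_type_histogram(G.edges(data=True))`.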

Tutorial

The tutorial of naruhodo is provided as ipynb files in the tutorial folder. You can view it directly in your browser.

Tutorial notebook for Japanese text parsing

Python-API

The complete python API document for naruhodo can be found here:

naruhodo Python API Reference.

This document is generated automatically from source code using pdoc, so it is always up-to-date.

Change-Log

You can find the change log of naruhodo here.

Development status and some personal comments

naruhodo is still under development (especially the KSG-related parts), so you may sometimes find that it outputs weird results. If you like the idea and want to help improve the library, feel free to create a pull request on github.

Here are some of my thoughts on the development of naruhodo :

  • Improvement on the quality of generated graph (0.2 ~ 0.5)

    As you can see from the source code, naruhodo relies mostly on a rule-based system to parse the given information. For a subject as large and complex as a language, long-term testing and step-by-step improvement of the program are necessary before it can go anywhere.

    Currently the knowledge structure graph(KSG) generated by naruhodo is below my expectations for large amounts of input text. Improvement will come from further examination of a variety of input texts and corresponding refinement of the parsing logic.

    As a rule-based system, it certainly has its limitations, such as completely resolving coreferences. But I believe that in the realm of NLP, especially in rudimentary information parsing tasks, rule-based systems can be used to build practical applications. Recent advances in statistics-based techniques such as deep learning seem promising, but almost all of these techniques require large amounts of labelled data, which is hard to obtain. The rule-based approach taken here is more or less an ab initio way of looking at some NLP problems (it does not need any training data before making useful predictions). My hope is that applications like this may at least alleviate the pain of collecting large amounts of labelled data by automating some of the tedious tasks. naruhodo is my personal experiment on how far a rule-based system can go in the world of NLP. It may fail to be practically useful, but I am sure it is going to be an interesting journey.

  • Improving coreference resolution performance (0.2.1 ~)

    Coreference resolution is the task of finding all expressions that refer to the same entity in a text. Without proper coreference resolution, the generated KSG does not capture all meaningful information, and its usability is quite limited. naruhodo has had a primitive coreference resolution since 0.2.1, but its performance is quite limited. I am experimenting with some published works on this topic. A method based on word embeddings and reinforcement learning may be added to naruhodo from version 0.5 onward.

  • Support for more backends (0.5 ~)

    There aren't many Japanese parsing programs available on the internet yet. Aside from mecab + cabocha, the most usable parsing program seems to be juman(++) + knp. The output of knp contains extra useful information and can be more accurate than cabocha in some situations, but it lacks a unified scheme, making it difficult to use. Another important fact is that juman(++) + knp parsing can be very time consuming compared to mecab + cabocha, which limits its use cases.

    I am looking into some fast generic libraries like spaCy as well, though Japanese is not an officially supported language there for the moment.

    To summarize, although naruhodo is designed to support multiple backends, its current focus is Japanese only, so adding support for other backends is not a priority.

  • Support for other languages(?~)

    Japanese is the only language naruhodo supports now. Besides my personal interest, I chose Japanese because it has some unique characteristics that make it both challenging and rewarding. In my opinion, the difficulty of Japanese mostly comes from the ambiguity of its expressions (for example, the subject is frequently omitted in Japanese) and its large number of word transformations (the same verb can have as many as 10+ forms).

    From a practical point of view, languages such as English and Chinese are potentially in large demand. So I am thinking about expanding the library to these popular languages in the future, if the rule-based approach taken by naruhodo proves to be usable after all.

  • Adding statistics-based approaches(?~)

    It seems that everybody is excited about machine learning these days. And I do see huge application potential in techniques like reinforcement learning and generative adversarial models. I do have some thoughts about the applications of these techniques to some specific knowledge retrieval problems. For example, the coreference problem is obviously outside the reach of any rule-based systems, and a reinforcement learning based approach seems quite attractive in this case(provided that we have a real-time feedback system from users).

    As my understanding of machine learning techniques improves, some statistics-based approaches may be added in the future.

  • Applications based on DSG and KSG(new projects)

    I think the information in DSG and KSG is especially useful for automating information retrieval processes. This includes, but is not limited to:

    • automatic text summarization
    • knowledge base generation for Q&A system and translation system
    • generic sentiment analysis

    As the quality of KSG generated by naruhodo improves, I will try to apply it to some of these areas.

naruhodo's People

Contributors

superkerokero


naruhodo's Issues

On Windows, utils/communication.py cannot read input from stdin.flush().

C:\Anaconda3\lib\site-packages\naruhodo\utils\communication.py in query(self, inp)
43 pattern = r'EOS'
44 self.proc.stdin.write(inp.encode('utf-8') + six.b('\n'))
---> 45 self.proc.stdin.flush()
46 result = ""
47 while True:

OSError: [Errno 22] Invalid argument

I copied the error log. Would you like to help me out? Thank you very much.

English version example

Hi,
I am trying this code on English text. I tried changing lang from "ja" to "en", but that does not work. Can you tell me how to make it work with English text?

Could you provide a mode switch that lets the user choose whether to merge the same word in the dependency graph?

@superkerokero, I am doing NLP research in Japanese. I really appreciate that naruhodo can store the dependency analysis result in networkx.

However, I find that even if I only want the dependency graph of one sentence, occurrences of the same word are merged. I know the merging operation can reduce the size of the graph when processing multiple sentences at once, but I want the raw dependency network without merging the same word.

Could you add this mode switch to naruhodo, or suggest how I could do this?
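
One way to think about the requested behaviour is to key each node by its position rather than by its text, so that repeated words stay separate. The sketch below is only a standalone illustration under that assumption. It is not an existing naruhodo option, the `unmerged_edges` helper is hypothetical, and the chunk data is hand-written. The position keys play the same role as the `pos` and `lpos` node properties described in the README.

```python
def unmerged_edges(sentences):
    """Build dependency edges whose nodes are (sentence_id, chunk_id, text).

    sentences: list of chunk lists; each chunk is (text, head_index),
    with head_index -1 marking the root chunk.
    """
    edges = []
    for sid, chunks in enumerate(sentences):
        for cid, (text, head) in enumerate(chunks):
            if head >= 0:
                edges.append(((sid, cid, text), (sid, head, chunks[head][0])))
    return edges


# Hand-written chunks; "絵を" occurs in both sentences.
sentences = [
    [("絵を", 1), ("贈った。", -1)],
    [("絵を", 1), ("持って", -1)],
]

for src, dst in unmerged_edges(sentences):
    print(src, "->", dst)
```

Because the sentence and chunk ids are part of each node key, the two occurrences of the same word become two distinct nodes.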

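About the function for saving figures

plotToFile can save a figure, but it also seems to display the figure on screen at the same time.
As a result, an error occurs on servers without a graphical environment.
Is there a function that only saves the figure?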

I would like to use naruhodo in a conda environment; please add support

from naruhodo import parser
fails with NameError: name 'naruhodo' is not defined.
My environment is a MacBook Pro 2019 running macOS Big Sur.

I installed it with pip.
ouka-MacBook-Pro:Takenaka_Data ouka_macbookpro$ pip show naruhodo
Name: naruhodo
Version: 0.2.9
Summary: A python library for automatic semantic graph generation from human-readable text.
Home-page: https://github.com/superkerokero/naruhodo
Author: superkerokero
Author-email: [email protected]
License: MIT
Location: ~/.anyenv/envs/pyenv/versions/3.9.6/lib/python3.9/site-packages
Requires: networkx, nxpd, lxml, beautifulsoup4
Required-by:

I did not install it with conda, which may be the cause, but because of other tools I use, it would help if naruhodo could be used with conda.

BrokenPipeError occurred when calling add function

from naruhodo import parser
DA = parser(lang="ja", gtype="d")
DA.add("一郎は二郎が描いた絵を三郎に贈った。")

and returned

---------------------------------------------------------------------------
BrokenPipeError                           Traceback (most recent call last)
<ipython-input-3-a2f402b9bf2a> in <module>
      1 # Now we can add some text to it
----> 2 DA.add("一郎は二郎が描いた絵を三郎に贈った。")

~/anaconda3/lib/python3.7/site-packages/naruhodo/core/parser.py in add(self, inp)
    266         if inp == "":
    267             return [inp]
--> 268         self.core.add(inp, self.pos)
    269         self.pos += 1
    270         self.G = _mergeGraph(self.G, self.core.G)

~/anaconda3/lib/python3.7/site-packages/naruhodo/core/DependencyCoreJa.py in add(self, inp, pos)
     35         cabo = CabochaClient()
     36         self.pos = pos
---> 37         cabo.add(self.proc.query(inp), self.pos)
     38         root = "" # Initialize root id.
     39         for chunk in cabo.chunks:

~/anaconda3/lib/python3.7/site-packages/naruhodo/utils/communication.py in query(self, inp)
     43         pattern = r'EOS'
     44         self.proc.stdin.write(inp.encode('utf-8') + six.b('\n'))
---> 45         self.proc.stdin.flush()
     46         result = ""
     47         while True:

BrokenPipeError: [Errno 32] Broken pipe
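
The report does not identify the cause. One common reason for a broken pipe when writing to a child process is that the child has already exited, for example because the executable could not start. This is an assumption here, not a confirmed diagnosis. The sketch below shows how such a condition can be detected before writing. The `send_line` helper is hypothetical, and a Python child process stands in for the real backend.

```python
import subprocess
import sys


def send_line(proc, text):
    """Write one line to a child process, failing early if it has already exited."""
    if proc.poll() is not None:
        raise RuntimeError(f"backend exited with code {proc.returncode}")
    proc.stdin.write(text.encode("utf-8") + b"\n")
    proc.stdin.flush()


# Stand-in child process that exits immediately (not the real parser).
proc = subprocess.Popen([sys.executable, "-c", "pass"], stdin=subprocess.PIPE)
proc.wait()

try:
    send_line(proc, "hello")
except RuntimeError as err:
    print(err)
finally:
    proc.stdin.close()
```

Checking `poll()` first turns an opaque pipe error into a message that points at the backend process itself.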
