Giter Club home page Giter Club logo

correction's Introduction

Correction

Overview

A project to correct spelling errors in Chinese texts 中文纠错任务

大致思路

  • 使用语言模型计算句子或序列的合理性
  • bigram, trigram, 4-gram 结合,并对每个字的分数求平均以平滑每个字的得分
  • 根据Median Absolute Deviation算出outlier分数,并结合jieba分词结果确定需要修改的范围
  • 根据形近字、音近字构成的混淆集合列出候选字,并对需要修改的范围逐字改正
  • 句子中的错误会使分词结果更加细碎,结合替换字之后的分词结果确定需要改正的字
  • 探测句末语气词,如有错误直接改正

Idea

  • Make use of language models to calculate the likelihood of a sequence of words or a sentence
  • Combine the scores of bigram, trigram, and 4-gram scores and take the average for each character to smooth the score
  • Determine the outliers by Median Absolute Deviation (MAD) and figure out the range of characters to be corrected
  • Generate the confusion set based on character sets of similar shape/pronunciation with the target character
  • Correct the characters in the range of correction one by one
  • The error characters in the sentence would make the word segementation results have smaller granularity. Determine the character to be replaced considering the results of jieba segementation.
  • Errors in modal particles are common. Correct them directly at the end of sentences.

TODO

  • 使用RNN语言模型推算每个字的合理概率(正反双向),以加强长距离前后文关系
  • 构建更小更贴近现实的混淆集合(形近字和近音字)
  • 从现实中收集更多的有语病或错别字的句子并标注
  • Incorporate RNN language models to capture more context information
  • Collect smaller confusion sets and make the sets more closed to daily life
  • Collect and annotate Chinese sentences with grammatical errors from daily life

文件结构

data/
* sighan/: SIGHAN contests data
* bcmi_data/: 源自生活的语病数据集 Dataset of sentences with grammatical errors
* wikipedia/: 中文维基数据集 xml文件、纯文本、提取工具 Tools to preprocess Chinese Wikipedia texts
* simp.pickle: similar pronunciation characters dictionary
* sims.pickle: similar shape characters dictionary
* simp_simplified.pickle: 过滤掉字频100一下的非常用字的版本 Similar pronuncation characters with less common characters filtered out
* xjz.pickle: 简明形近字dictionary Simple similar shape dictionary
* 
kenlm/: library to generate statistical language models
kenmodels/: trained language models *.klm are binary files
nlm/: various neural language models
* tf_char_rnn: Character-level RNN language model implemented using TensorFlow
* cn_char_rnn: RNN language model for Chinese
* lstm_char_cnn: A CNN language model, not useful for this project
* char_rnn: Character-level RNN language model gists
spells/: useful tools for English spelling check
langconv.py: 简繁转换工具 Tools to convert simplified/traditional Chinese characters
zh_wiki.py: 简繁转换dictionary
tf_char_rnn/:
* checkpoints/: 17 is backward model, 10 is forward model
* logs/: summaries of training runs
* data/: text input to train models
* model.py: script describing the model
* train.py: script to train the model
* sample.py: script to sample texts and calculate per-char probabilities of a sequence
* utils.py: utilities for reading data and generating batches
results.out: 输出log

参考链接 References

correction's People

Contributors

ccheng16 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.