new-words-discovery

A set of Python scripts that implement new-word discovery.

## Usage

### 1. Read a corpus and compute the frequency of every candidate word that is 1 to 5 characters long

python compute_candidate_freq.py [-h] [-r] [-o OUTPUT] corpus_file

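With the -r flag, every sentence in the corpus file is reversed first, and the script then counts the frequencies of all reversed candidate words.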
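The counting step can be sketched as follows. This is a minimal illustration, not the code in compute_candidate_freq.py: the function name `count_candidates` and its parameters are hypothetical, and it assumes the corpus has already been split into sentences.

```python
from collections import Counter


def count_candidates(sentences, max_len=5, reverse=False):
    """Count every substring of 1..max_len characters in each sentence.

    Hypothetical helper for illustration; `reverse=True` mimics the
    documented effect of -r by reversing each sentence before counting.
    """
    freq = Counter()
    for sentence in sentences:
        if reverse:
            sentence = sentence[::-1]
        n = len(sentence)
        for start in range(n):
            for length in range(1, max_len + 1):
                if start + length > n:
                    break  # the candidate would run past the sentence end
                freq[sentence[start:start + length]] += 1
    return freq


forward = count_candidates(["新词发现", "发现新词"])
backward = count_candidates(["新词发现"], reverse=True)
```

Here `forward` maps forward candidates such as "新词" to their counts, while `backward` holds reversed candidates such as "现发", which is the form step 3 expects when it is run with -r.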

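### 2. Read the candidate frequency file and compute the solidification (internal cohesion) of every candidate word that is 2 to 4 characters long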

python compute_solidation.py [-h] [-s SEPARATOR] [-f FREQ_LIMIT] [-o OUTPUT] freq_file

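The -s option sets the field separator of the frequency file (default: \t). The -f option restricts the computation to candidates whose frequency is greater than or equal to the given threshold (default: 1).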
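The README does not state the formula used for solidification. A common definition, assumed here, is the smallest ratio P(w) / (P(left) · P(right)) over every way of splitting the candidate w into two parts. The sketch below follows that assumption; `solidification`, `freq`, and `total` are hypothetical names, and `total` stands for whatever normalizing count the real script uses.

```python
def solidification(word, freq, total):
    """Smallest P(word) / (P(left) * P(right)) over all binary splits.

    Assumed definition, not confirmed by the project. `freq` maps each
    candidate to its count and `total` is the count used to turn
    frequencies into probabilities.
    """
    ratios = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        # Equivalent to (f_w / total) / ((f_l / total) * (f_r / total)).
        ratios.append(freq[word] * total / (freq[left] * freq[right]))
    return min(ratios)


freq = {"电": 4, "影": 2, "电影": 2}
score = solidification("电影", freq, total=10)
```

A high score means the parts occur together far more often than chance would predict, which is why candidates with low solidification are discarded in step 4.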

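### 3. Read the candidate frequency file and compute the right-neighbor character entropy of every candidate word that is 2 to 4 characters long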

python compute_freedegree.py [-h] [-s SEPARATOR] [-f FREQ_LIMIT] [-r] [-o OUTPUT] freq_file

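The -s option sets the field separator of the frequency file (default: \t). The -f option restricts the computation to candidates whose frequency is greater than or equal to the given threshold (default: 1). With -r, the input must be the reversed candidate frequency file, and the output is the left-neighbor character entropy of the forward candidates.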
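One way to derive neighbor entropy from a frequency table alone, sketched here under assumptions: the right neighbors of a candidate w are read off the candidates that are exactly one character longer and start with w, which would explain why step 1 counts up to 5 characters while this step scores 2 to 4. The function names, the natural logarithm, and the handling of candidates with no recorded neighbor are all choices made for this example, not details taken from compute_freedegree.py.

```python
import math
from collections import defaultdict


def right_neighbor_entropy(freq, min_len=2, max_len=4, freq_limit=1):
    """Entropy of the character that follows each candidate (assumed method)."""
    neighbors = defaultdict(list)
    for cand, count in freq.items():
        if len(cand) >= 2:
            # cand extends cand[:-1] by one right-hand character.
            neighbors[cand[:-1]].append(count)
    entropy = {}
    for word, count in freq.items():
        if not (min_len <= len(word) <= max_len) or count < freq_limit:
            continue
        counts = neighbors.get(word, [])
        total = sum(counts)
        entropy[word] = (
            sum(c / total * math.log(total / c) for c in counts) if total else 0.0
        )
    return entropy


def left_neighbor_entropy(reversed_freq, **kwargs):
    """Score reversed candidates, then flip the keys back to forward order."""
    scored = right_neighbor_entropy(reversed_freq, **kwargs)
    return {word[::-1]: h for word, h in scored.items()}


right = right_neighbor_entropy({"电影": 4, "电影院": 2, "电影票": 2, "影院": 2})
left = left_neighbor_entropy({"影电": 4, "影电看": 2, "影电拍": 2})
```

In both toy tables "电影" has two equally likely neighbors, so its entropy on each side is ln 2. A candidate that is always followed by the same character scores 0, which suggests it is a fragment of a longer word.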

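### 4. Merge the frequency file, the solidification file, and the left and right neighbor entropy files, import the result into Excel, and filter out new words by setting thresholds on frequency, solidification, and freedom degree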
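The project performs this step by hand in Excel. For readers who prefer to script it, the merge and filter can be sketched in Python. Everything below is illustrative: `merge_and_filter` and its parameters are hypothetical, and the freedom degree is assumed to be the smaller of the left and right neighbor entropies, a definition the README does not confirm.

```python
def merge_and_filter(freq, solid, left_h, right_h,
                     min_freq, min_solid, min_free):
    """Join the per-candidate scores and keep rows passing every threshold.

    Each argument before the thresholds is a dict keyed by candidate word.
    Freedom degree is assumed to be min(left entropy, right entropy).
    """
    rows = []
    for word, s in solid.items():
        f = freq.get(word, 0)
        left = left_h.get(word, 0.0)
        right = right_h.get(word, 0.0)
        if f >= min_freq and s >= min_solid and min(left, right) >= min_free:
            rows.append((word, f, s, left, right))
    # Most frequent candidates first, as one might sort them in a spreadsheet.
    return sorted(rows, key=lambda row: row[1], reverse=True)


new_words = merge_and_filter(
    freq={"电影": 40, "的电": 50, "影院": 12},
    solid={"电影": 120.0, "的电": 1.2, "影院": 80.0},
    left_h={"电影": 2.1, "的电": 2.5, "影院": 0.3},
    right_h={"电影": 1.8, "的电": 0.1, "影院": 1.9},
    min_freq=10, min_solid=50.0, min_free=1.0,
)
```

With these made-up numbers only "电影" survives: "的电" fails the solidification and freedom thresholds, and "影院" fails the freedom threshold because its left entropy is low.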
