
synocn's Introduction

A "soft" Chinese Synonyms Collection

A Chinese synonym list, crawled mainly from 百度汉语 (Baidu Hanyu) and CWN. It contains 80,966 words and 335,061 synonym relations.

Why call it "soft"

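During my internship I wanted to use synonyms to help match sentences with the same meaning, but after looking around I could not find a good database:

  • Open Multilingual Wordnet contains many phrases rather than words. This may be an unavoidable side effect of translating entries so that they map onto words in the English WordNet. Examples: "删除背景的照片" ("photo with the background removed") and "利物浦市民的" ("of the citizens of Liverpool").
  • Using word vectors directly. There are two problems. First, a suitable threshold is hard to choose. Second, words that are close together are not necessarily synonyms, and words that are far apart are not necessarily non-synonyms (word vectors only describe co-occurrence). For example, in the word vectors we trained, the word nearest to "男" (male) is "女" (female). An article has also discussed this problem.
  • CWN does not have many words, involves conversion between Simplified and Traditional Chinese, and is a little too academic in its choice of synonyms.

I wanted a library that is not strict about synonymy and is suited to judging whether words in colloquial sentences are near-synonyms, that is, a relatively "soft" one. I later found that 百度汉语 provides synonyms, which was exactly what I needed.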

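Data sources

The crawled data comes from three sources. Below are the links to the combined set and to the data from each source. Combined data: 百度网盘 (Baidu Netdisk)

百度汉语 (direct crawl)

A word list built from (almost) all the words on 汉典 (Handian) was used to crawl the synonyms of those words from 百度汉语.

Link: 百度网盘, 35,395 words and 62,002 synonym relations in total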

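CWN

The synonyms of (almost) every word on CWN were crawled, and the synonyms listed under a word's different senses were merged together.

Link: 百度网盘, 6,188 words and 12,541 synonym relations in total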

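百度汉语 (using each word's translation)

The first method cannot handle single-character words: the page for a single character has no synonyms, so a pair such as "吃" and "吃饭" is missed. I therefore crawled the English explanation of every word and treated words with the same English explanation as synonyms. This method applies only some simple processing, so it may have some problems, and it makes the whole library even more "soft".

Link: 百度网盘, 59,074 words and 271,577 synonym relations in total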
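The idea of treating words with the same English explanation as synonyms can be sketched roughly as follows. This is only an illustration, not the script actually used for the crawl: the function name `pairs_from_glosses`, the lowercase/strip normalization, and the sample glosses are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def pairs_from_glosses(glosses: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Group words by their English gloss and emit every pair in a group.

    `glosses` maps a Chinese word to its English explanation. The
    normalization (strip + lowercase) is an assumption for this sketch.
    """
    by_gloss = defaultdict(list)
    for word, gloss in glosses.items():
        key = gloss.strip().lower()
        if key:  # skip words that have no English explanation
            by_gloss[key].append(word)

    pairs = set()
    for words in by_gloss.values():
        # Sort so that each unordered pair is stored in one canonical order.
        for a, b in combinations(sorted(words), 2):
            pairs.add((a, b))
    return pairs


# Illustrative glosses only; they are not taken from the crawled data.
sample = {"吃": "eat", "吃饭": "eat", "喝": "drink"}
print(pairs_from_glosses(sample))
```

Matching on the exact gloss string is deliberately loose, which is why pairs produced this way tend to be "softer" than those crawled directly.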

Data format

A txt file. Each line holds two words separated by a space and represents one synonym relation.
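A minimal sketch of reading this format into a lookup table is shown below. It assumes the relations can be treated as symmetric and that malformed lines can be skipped. The function names `load_synonyms` and `are_synonyms` and the sample lines are hypothetical, not part of the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_synonyms(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build a word -> set-of-synonyms table from "word1 word2" lines."""
    table = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = parts
        # Assumption: a relation holds in both directions.
        table[a].add(b)
        table[b].add(a)
    return dict(table)


def are_synonyms(table: Dict[str, Set[str]], a: str, b: str) -> bool:
    """Return True if the pair (a, b) is listed as a direct relation."""
    return b in table.get(a, set())


# Illustrative lines only; they are not taken from the dataset.
sample_lines = ["高兴 开心", "开心 快乐"]
table = load_synonyms(sample_lines)
print(are_synonyms(table, "高兴", "开心"))
print(are_synonyms(table, "高兴", "快乐"))
```

To read a downloaded file, pass an open file object instead of the list, for example `load_synonyms(open(path, encoding="utf-8"))`, where `path` and the UTF-8 encoding are assumptions about the local copy. The lookup checks only direct relations and does not follow chains of synonyms.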

synocn's Issues

LICENSE?

Under what license is this synonym list released?
