CWSharp

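A cross-platform Chinese word segmentation component for .NET. It handles Chinese, English, symbols, and mixed terms (for example T恤, 卡拉OK, C#), and supports custom dictionaries.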

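Features

  • Several tokenizers supported out of the box
    • StandardTokenizer - the default tokenizer, dictionary-based
    • BigramTokenizer - bigram tokenizer that also recognizes English words and numbers
    • StopwordTokenizer - extension class that filters a custom list of stopwords
    • UnigramTokenizer - unigram tokenizer
  • Extensible custom tokenizer interface
  • Custom dictionary support
  • Lucene.Net support
  • Runs on .NET and .NET Core
  • MIT license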

Other versions

Go version - CWSharp-Go

Python version - CWSharp-Python

Installation & build

  • NuGet
nuget install CWSharp 
  • For .NET Core (project.json)
"dependencies": {
    "CWSharp": "1.1.0"
  }
  • Running on Linux
# dotnet ./test.dll cwsharp.dawg
研究/CJK 生命/CJK 起源/CJK

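Extensions & help

Notes

  • Segmentation is based on forward maximum matching. See: MMSEG algorithm
  • The dictionary is stored as a DAWG (directed acyclic word graph), which uses less memory than a traditional prefix tree. See: DAWG algorithm

Examples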
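
To make the matching step concrete, here is a minimal sketch of plain forward maximum matching. It is not CWSharp's implementation: the `ForwardMaxMatch` class, the `Segment` method, and the `HashSet`-based dictionary are assumptions made for illustration, whereas the library stores its dictionary in a DAWG, and MMSEG adds further disambiguation rules beyond this greedy step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the CWSharp API.
public static class ForwardMaxMatch
{
    // Scans left to right and, at each position, takes the longest
    // dictionary word that starts there. Falls back to a single character
    // when no dictionary word matches.
    public static List<string> Segment(string text, ISet<string> dictionary, int maxWordLength)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            // Start with the longest candidate that fits in the remaining text.
            int length = Math.Min(maxWordLength, text.Length - pos);

            // Shrink the candidate until it is a dictionary word or one character.
            while (length > 1 && !dictionary.Contains(text.Substring(pos, length)))
            {
                length--;
            }

            result.Add(text.Substring(pos, length));
            pos += length;
        }
        return result;
    }
}
```

With a dictionary containing 长春, 长春市, 市长 and 药店 and a maximum word length of 3, `Segment("长春市长春药店", ...)` returns 长春市, 长春, 药店: the greedy scan prefers the longer 长春市 over 长春 followed by 市长.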

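// Load the dictionary-based tokenizer from a compiled DAWG dictionary file.
var tokenizer = new StandardTokenizer("dict.dawg");
// Wrap it so that tokens matching the stopword list are filtered out.
var tokenizer2 = new StopwordTokenizer(tokenizer, new string[] { "" });
foreach (var token in tokenizer2.Traverse("你是我的小苹果"))
{
	// Print each token as text/type, separated by spaces.
	Console.Write(token.Text + "/" + token.Type + " ");
}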
  • StandardTokenizer
研究生命起源 >> 研究/CJK 生命/CJK 起源/CJK
长春市长春药店 >> 长春市/CJK 长春/CJK 药店/CJK
神秘的组织-北京朝阳群众 >> 神秘/CJK 的/CJK 组织/CJK -/PUNC 北京/CJK 朝阳/CJK 群众/CJK
一次性交一百元 >> 一次/CJK 性交/CJK 一/CJK 百/CJK 元/CJK (ambiguous segmentation)
  • BigramTokenizer
研究生命起源 >> 研究/CJK 究生/CJK 生命/CJK 命起/CJK 起源/CJK
长春市长春药店 >> 长春/CJK 春市/CJK 市长/CJK 长春/CJK 春药/CJK 药店/CJK
神秘的组织-北京朝阳群众 >> 神秘/CJK 秘的/CJK 的组/CJK 组织/CJK -/PUNC 北京/CJK 京朝/CJK 朝阳/CJK 阳群/CJK 群众/CJK
一次性交一百元 >> 一次/CJK 次性/CJK 性交/CJK 交一/CJK 一百/CJK 百元/CJK
  • Custom tokenizer interface (implementing a unigram tokenizer)
public class CustomTokenizer : ITokenizer
{
	private ITokenizer _tokenizer;
	public CustomTokenizer(ITokenizer tokenizer)
	{
		_tokenizer = tokenizer;
	}
	public IEnumerable<Token> Traverse(string text)
	{
		foreach (var token in _tokenizer.Traverse(text))
		{
			if (token.Type == TokenType.CJK)
			{
				foreach (var ch in token.Text)
					yield return new Token(ch.ToString(), TokenType.CJK);
			}
			else
				yield return token;
		}
	}
}
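
For comparison with the unigram CustomTokenizer above, the overlapping pairs in the BigramTokenizer examples can be sketched in a few lines. This is a standalone illustration rather than the library's code: `BigramSketch` and `Bigrams` are hypothetical names, the input is assumed to be a single run of CJK characters, and returning a lone character unchanged is an assumption about how a one-character run might be handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the CWSharp API.
public static class BigramSketch
{
    // Emits every pair of adjacent characters in a run of CJK text.
    public static IEnumerable<string> Bigrams(string run)
    {
        // Assumption: a run of one character is returned as-is.
        if (run.Length == 1)
        {
            yield return run;
            yield break;
        }

        // Slide a two-character window over the run, one character at a time.
        for (int i = 0; i + 1 < run.Length; i++)
        {
            yield return run.Substring(i, 2);
        }
    }
}
```

Applied to 研究生命起源, `Bigrams` yields 研究, 究生, 生命, 命起, 起源, which matches the token texts in the BigramTokenizer example.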

Contributors

zhengchun
