ohliming / smallseg
Automatically exported from code.google.com/p/smallseg
As the title says: can this function not handle unicode Chinese strings?
For example, cuttest(u"我喜欢python和c++。")
raises:
Traceback (most recent call last):
  File "D:\bluecat2\Desktop\smallseg_0.5.1\test_fenci.py", line 41, in <module>
    cuttest(u"我喜欢python和c++。")
  File "D:\bluecat2\Desktop\smallseg_0.5.1\test_fenci.py", line 18, in cuttest
    wlist = seg.cut(text)
  File "D:\bluecat2\Desktop\smallseg_0.5.1\smallseg.py", line 56, in cut
    text = text.decode('utf-8','ignore')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Windows, Python 2.7
Original issue reported on code.google.com by [email protected]
on 22 Feb 2012 at 12:50
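In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which matches the UnicodeEncodeError in the traceback above. A minimal sketch of a guard follows, assuming a hypothetical helper named to_text (it is not part of smallseg) that decodes only when it is handed a byte string:

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when `value` is a byte string.

    `to_text` is a hypothetical helper for illustration, not a smallseg API.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding, "ignore")
    # Already text: decoding again would trigger the implicit ASCII encode.
    return value
```

Since the traceback shows cut() decoding its argument as UTF-8, a caller-side workaround that may also help on 0.5.1 is to pass UTF-8 bytes, for example seg.cut(u"我喜欢python和c++。".encode("utf-8")).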
Is there a way to count the words that appear most often in the segmentation output?
Original issue reported on code.google.com by [email protected]
on 21 Oct 2009 at 3:40
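One possible approach to the word-frequency question above is to feed the token list into collections.Counter from the standard library. This is a sketch only: top_words and its min_length parameter are hypothetical names, and it assumes tokens is a list of text strings such as the one returned by seg.cut.

```python
from collections import Counter


def top_words(tokens, n=10, min_length=2):
    """Return the `n` most frequent tokens with at least `min_length` characters.

    Hypothetical helper; `tokens` is assumed to be a list of text strings.
    """
    counts = Counter(t for t in tokens if len(t) >= min_length)
    return counts.most_common(n)
```

Filtering by length is a crude way to drop single-character function words; a stop-word list would be a more precise alternative.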
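When I run test.java directly in the IDE, segmentation works, but when it runs inside a Tomcat
server it fails with javax.servlet.ServletException:
java.lang.NoClassDefFoundError: Could not initialize class fx.sunjoy.SmallSeg
Could you explain in detail whether some additional configuration file is needed? Thanks.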
Original issue reported on code.google.com by [email protected]
on 17 Aug 2010 at 11:59
As follows:
#def cuttest(s):
#    wlist = seg.cut(s)
#    wlist.reverse()
#    tmp = "/".join(wlist)
#    print tmp
#    print "================================================================="
if __name__=="__main__":
    s1 = file("text1.txt").read()
    wlist = seg.cut(s1)
    wlist.reverse()
    res1 = "/".join(wlist)
    print res1
    fl=open("result.txt","w")
    fl.write(res1)  # the post had fl.write(tmp); tmp is undefined once cuttest is commented out
    fl.close()
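With the cuttest function commented out and its body used directly below, reading the contents of text1
and segmenting them both work.
But the last three lines, which save the segmentation result to result.txt, hit an encoding problem:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
I don't know how to fix this. Could someone more experienced take a look at what needs to change?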
Original issue reported on code.google.com by [email protected]
on 1 Jun 2013 at 5:14
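The u'\ufeff' in the error above is a byte order mark carried over from the input file, and the failure comes from writing a unicode string to a file opened without an encoding. A minimal sketch of one way to save the result follows, assuming the tokens are unicode strings (as the error suggests); save_tokens is a hypothetical helper, not part of smallseg.

```python
import io

BOM = u"\ufeff"  # byte order mark that some editors prepend to UTF-8 files


def save_tokens(tokens, path, sep=u"/"):
    """Join `tokens`, drop any BOM, and write the result as UTF-8.

    Hypothetical helper; returns the text that was written.
    """
    text = sep.join(tokens).replace(BOM, u"")
    # io.open takes an explicit encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

In the reporter's script this would stand in for the last three lines, for example save_tokens(wlist, "result.txt").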
What steps will reproduce the problem?
Input: “新生小鼠中肌红蛋白含量较成年鼠高吗?”
What is the expected output?
新生 小鼠 中 肌红蛋白 含量 较 成年 鼠 高吗
What do you see instead?
新生 小鼠 中肌 肌红 蛋白 含量 较 成年 鼠高 高吗
What version of the product are you using? On what operating system?
0.6
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 1 May 2011 at 10:15
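Overlaps like 中肌/肌红 in the output above can be detected mechanically. The following is a rough sketch, assuming the tokens are listed in the same order as they occur in the input; find_overlaps is a hypothetical checker, not part of smallseg, and it locates tokens greedily, so it can be fooled by repeated substrings.

```python
def find_overlaps(text, tokens):
    """Return (previous, current) token pairs whose spans in `text` overlap.

    Hypothetical checker: tokens are located greedily, left to right.
    """
    overlaps = []
    cursor = 0      # where the search for the next token starts
    prev_end = 0    # end offset of the previously located token
    prev = None
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            continue  # token not found verbatim; skip it
        if prev is not None and start < prev_end:
            overlaps.append((prev, token))
        cursor = start + 1
        prev_end = start + len(token)
        prev = token
    return overlaps
```

Run against the reported output it flags the pairs (中肌, 肌红) and (鼠高, 高吗); run against the expected output it returns an empty list.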
"干脆就把那部蒙人的闲法给废了 拉倒!RT @laoship ukong :
27日,全国人大常 委会第三次审议侵
权责任法草案,删除了有关 医疗损害责任“举证
倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此
将陷入万劫不复的境地。"
The segmentation result is:
"干脆 就把 那部 蒙人 的闲 闲法 法给 废了 拉倒 RT @laoship
ukong 27 日 全国人大 常 委会 第三 次 审议 侵 权 责任
法草案 删除 了 有关 医疗 损害 责任 举证 倒置 的 规定 在
医患 纠纷 中 本已 处于 弱势 地位 的 消费者 由此 将 陷入
万劫不复 的 境地"
You can see that 的闲 and 闲法 overlap here.
I modified the following two lines:
http://code.google.com/p/smallseg/source/browse/trunk/smallseg.py#43
http://code.google.com/p/smallseg/source/browse/trunk/smallseg.py#44
Changed to:
for i in xrange(ln,0,-1):
    tmp = s[i-1:i]
    ...
Original issue reported on code.google.com by [email protected]
on 10 Aug 2012 at 10:01
Jun-ge, after searching all over the web I still ended up here at your project, haha.
Original issue reported on code.google.com by [email protected]
on 15 Dec 2010 at 4:35
How can this be used in a Django project?
Original issue reported on code.google.com by [email protected]
on 21 Feb 2012 at 3:13
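I saw your online demo, which is deployed on GAE, and it runs reasonably fast.
But when I deploy it on GAE myself, the dictionary is loaded again on every request,
and that step is very slow. How did you get it to run so quickly?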
Original issue reported on code.google.com by [email protected]
on 13 May 2010 at 3:42
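A common pattern for the repeated dictionary loading described above is to build the segmenter once per process and reuse it across requests. This is a sketch under the assumption that the hosting runtime keeps a process (and its module globals) alive between requests; get_segmenter and the factory argument are hypothetical names, and how the author's own demo handles this is not stated in the source.

```python
_SEGMENTER = None  # process-wide cache, filled on first use


def get_segmenter(factory):
    """Return a cached segmenter, calling `factory` only on the first request.

    `factory` is a hypothetical zero-argument callable that builds the
    segmenter and loads its dictionary (the slow step).
    """
    global _SEGMENTER
    if _SEGMENTER is None:
        _SEGMENTER = factory()
    return _SEGMENTER
```

Each request handler would then call get_segmenter(...) instead of constructing a new segmenter, so only the first request in a fresh process pays the loading cost.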
What does DFA stand for?
I see that you also have another project, smallgfw, which is based on DFA as well?
Original issue reported on code.google.com by [email protected]
on 28 Aug 2012 at 3:36
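DFA commonly abbreviates "deterministic finite automaton". A dictionary trie is one simple example: each node is a state and each character is a transition. The sketch below is illustrative only and makes no claim about how smallseg or smallgfw are implemented; build_trie, longest_match, and the END marker are hypothetical names.

```python
END = ""  # marker key for "a word ends at this state" (never a real character)


def build_trie(words):
    """Build a nested-dict trie; each dict is a state, each key a transition."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start=0):
    """Return the longest dictionary word that begins at text[start]."""
    node = trie
    best = 0
    for offset in range(start, len(text)):
        node = node.get(text[offset])
        if node is None:
            break  # no transition for this character
        if END in node:
            best = offset - start + 1
    return text[start:start + best]
```

With the words 肌红, 肌红蛋白 and 蛋白 in the trie, matching at the start of 肌红蛋白含量 walks four transitions and yields 肌红蛋白.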