python-pinyin's People

Contributors

artoria2e5, ban3, bors-homu, bowowzahoya, dependabot-preview[bot], dependabot[bot], freed-wu, gitter-badger, gumblex, hanabi1224, howl-anderson, mend-bolt-for-github[bot], mingstar, mozillazg, secsilm, snyk-bot, timgates42, tyrbonit, wdscxsj, yangtsesu, yangwe1, zacheryguan

python-pinyin's Issues

Custom phrase dictionary has no effect. Does it need to be enabled, and how?

from pypinyin import load_phrases_dict, lazy_pinyin, TONE2
lazy_pinyin('还没', style = TONE2)
['hua2n', 'me2i']
from pypinyin import load_phrases_dict, lazy_pinyin, TONE
lazy_pinyin('还没', style = TONE)
['huán', 'méi']
load_phrases_dict({'还没': [['hái'], ['méi']]})
lazy_pinyin('还没', style = TONE)
['huán', 'méi']

Remove the dependency on jieba and leave word segmentation to the user

Users can pick whichever segmentation module they prefer and simply pass its output to pypinyin:

hans = seg(u'你好吗')  # the segmentation module returns a list: [u'你好', u'吗']
pypinyin.pinyin(hans)   # pinyin
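
A minimal sketch of the proposed flow, assuming the caller already has a segmenter. Here `seg` is a hypothetical stand-in with a hard-coded answer, and `TOY_PHRASES` / `TOY_CHARS` are made-up tables for illustration only; none of this is pypinyin's actual data or API.

```python
# Hypothetical toy data; a real library would ship far larger tables.
TOY_PHRASES = {'你好': ['nǐ', 'hǎo']}
TOY_CHARS = {'你': 'nǐ', '好': 'hǎo', '吗': 'ma'}


def seg(text):
    """Stand-in for any user-chosen segmenter (hypothetical)."""
    return {'你好吗': ['你好', '吗']}.get(text, list(text))


def words_to_pinyin(words):
    """Convert pre-segmented words: phrase table first, then per character."""
    result = []
    for word in words:
        if word in TOY_PHRASES:
            result.extend(TOY_PHRASES[word])
        else:
            # Characters missing from the toy table are passed through as-is.
            result.extend(TOY_CHARS.get(ch, ch) for ch in word)
    return result
```

With this split, the conversion step never needs to know which segmenter produced the word list.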

The initials Y and W

For a call such as

pinyin(u'中心', style=pypinyin.INITIALS) # set the pinyin style
[['zh'], ['x']]

the initials table in the code

_INITIALS = 'b,p,m,f,d,t,n,l,g,k,h,j,q,x,zh,ch,sh,r,z,c,s,'.split(',')

contains neither y nor w. For a character whose pinyin starts with Y or W, an empty string is returned.
For example:

pinyin(u'火影忍者', style=pypinyin.INITIALS)
[[u'h'], [u''], [u'r'], [u'zh']]

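I looked this up, and some sources say the initials do not include Y and W, so this return value is technically correct. However, it makes applications awkward to build, and the only workaround is to use the first-letter style instead. Could a new interface be added that returns Y and W, or could the documentation at least mention this behavior so that other users do not trip over it?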
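
One possible shape for the requested behavior, as a sketch only: a helper that works on toneless pinyin syllables and treats `y` and `w` as initials. The name `initial_with_yw` and the table `INITIALS_WITH_YW` are hypothetical and not part of pypinyin.

```python
# Initials from the table quoted in this issue, plus 'y' and 'w'
# (the addition being requested). Two-letter initials come first.
INITIALS_WITH_YW = ['zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
                    'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's', 'y', 'w']


def initial_with_yw(syllable):
    """Return the leading initial of a toneless pinyin syllable, or ''."""
    for initial in INITIALS_WITH_YW:
        # 'zh' is checked before 'z', so the longer initial wins.
        if syllable.startswith(initial):
            return initial
    return ''


# Toneless readings of 火影忍者, typed in by hand for the example.
print([initial_with_yw(s) for s in ['huo', 'ying', 'ren', 'zhe']])
```

Syllables that start with a vowel, such as `an`, still yield an empty string under this sketch.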

Converting this sentence produces duplicates, possibly a bug

>>> s=u"两年前七斤喝醉了酒"
>>> pypinyin.lazy_pinyin(s)
[u'liang', u'nian', u'qian', u'qi', u'jin', u'he', u'zui', u',he', u'zui', u'jiu', u'liao', u'jiu']

The result contains the extra items u',he', u'zui', u'jiu', which looks like a bug.

The Node version does not have this problem:

> var pinyin = require("pinyin");
undefined
> s='你好了解了'
'你好了解了'
> pinyin(s)
[ [ 'nǐ' ], [ 'hǎo' ], [ 'liǎo' ], [ 'jiě' ], [ 'liǎo' ] ]
> s='两年前七斤喝醉了酒'
'两年前七斤喝醉了酒'
> pinyin(s)
[ [ 'liǎng' ],
  [ 'nián' ],
  [ 'qián' ],
  [ 'qī' ],
  [ 'jīn' ],
  [ 'hē' ],
  [ 'zuì' ],
  [ 'liǎo' ],
  [ 'jiǔ' ] ]

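TONE2 does not mark the neutral tone

For example, "打量" outputs [['da3'], ['liang']], while the documentation says tones are "represented with the digits [0-4]". This happens in every case except ü. Should the documentation be fixed, or the program behavior?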
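
To see how often this occurs in a given output, one could scan for syllables that carry no tone digit at all. This is a small sketch; `unmarked_syllables` is a hypothetical helper, and it assumes the input is a flat list of TONE2-style strings.

```python
import re

# Matches any of the tone digits the documentation mentions.
_TONE_DIGIT = re.compile(r'[0-4]')


def unmarked_syllables(syllables):
    """Return the syllables that contain no tone digit (0-4)."""
    return [s for s in syllables if not _TONE_DIGIT.search(s)]


print(unmarked_syllables(['da3', 'liang']))
```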

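Allow disabling automatic segmentation with jieba

With the current behavior, if jieba is installed on the system,
pinyin(u'你好') automatically calls jieba for segmentation; the only way to disable this is to call pinyin([u'你好']).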

Built-in simple segmentation

Split the input string according to whether the characters have pinyin:

'你好吗にほんごРусский язык我很好'  -> ['你好吗', 'にほんごРусский язык', '我很好']

The current jieba segmentation result is:

'你好吗にほんごРусский язык我很好'  -> ['你好吗', 'に', 'ほ', 'ん', 'ご', 'Р', 'у', 'с', 'с', 'к', 'и', 'й', ' ', 'я', 'з', 'ы', 'к', '我很好']
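
The proposed split can be sketched with a single regular expression. The assumption here, which is a simplification, is that "has pinyin" means "lies in the CJK basic block U+4E00 to U+9FFF"; the name `split_by_han` is hypothetical.

```python
import re

# Alternating runs of Han characters and of everything else.
# Assumption: only the CJK basic block counts as "has pinyin".
_RUNS = re.compile(r'[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+')


def split_by_han(text):
    """Split text into maximal runs of Han and non-Han characters."""
    return _RUNS.findall(text)


print(split_by_han('你好吗にほんごРусский язык我很好'))
```

Because each non-Han run is kept whole, kana and Cyrillic text are no longer broken into single characters.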

ref:
#16
#17

https://github.com/mozillazg/python-pinyin/blob/master/pypinyin/__init__.py#L256

Heteronym (polyphonic character) recognized incorrectly

In [4]: lazy_pinyin(u'张靓颖')
Out[4]: [u'zhang', u'jing', u'ying']

The second character should be liang.

testing of load_phrases_dict

load_phrases_dict({'几': [['jǐ']]})
load_phrases_dict({'桔子': [['jú'], ['zǐ']]})
load_phrases_dict({'还没': [['hái'], ['méi']]})
load_phrases_dict({'不用谢':[['bú'], ['yòng'],['xiè']]})

Output
不用谢(bú yòng xiè)
桔子(jú zǐ)
还没(huán méi)
几(jǐ)

So 还没 is incorrect. Why?

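Update the pinyin data

Fetch all Chinese characters and their pinyin from zdic.net (汉典) by Unicode code point, keeping only characters that have pinyin.

CJK basic: [4E00-9FFF]
CJK Extension A: [3400-4DBF]
CJK Extension B: [20000-2A6DF]
CJK Extension C: [2A700-2B73F]
CJK Extension D: [2B740-2B81D]
CJK Compatibility Supplement: [2F800-2FA1F]
CJK Compatibility: [F900-FAFF]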

http://www.zdic.net/sousuo/
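
As an illustration of how the ranges above could drive such a crawl, the sketch below classifies a character by code point. The table copies the ranges exactly as listed in this issue; `CJK_RANGES` and `cjk_block` are hypothetical names.

```python
# (label, first code point, last code point), copied from the list above.
CJK_RANGES = [
    ('CJK basic', 0x4E00, 0x9FFF),
    ('CJK Extension A', 0x3400, 0x4DBF),
    ('CJK Extension B', 0x20000, 0x2A6DF),
    ('CJK Extension C', 0x2A700, 0x2B73F),
    ('CJK Extension D', 0x2B740, 0x2B81D),
    ('CJK Compatibility Supplement', 0x2F800, 0x2FA1F),
    ('CJK Compatibility', 0xF900, 0xFAFF),
]


def cjk_block(char):
    """Return the label of the listed range containing char, or None."""
    code_point = ord(char)
    for label, first, last in CJK_RANGES:
        if first <= code_point <= last:
            return label
    return None


print(cjk_block('中'))
```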

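About word segmentation

In the pinyin function in core.py:
for words in hans:
    pys.extend(_pinyin(words, style, heteronym, errors))
extend is used here, so even when jieba segmentation is applied, the result is still a flat list of per-character pinyin.
There is no way to tell which syllables belong to the same word, so it is as if no segmentation had happened.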
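
The difference between the two list operations can be shown with a toy lookup. This is not the code in core.py: `TOY_CHARS` and `_toy_pinyin` are hypothetical stand-ins, and each character has a single reading to keep the example short.

```python
# Hypothetical single-reading table, for illustration only.
TOY_CHARS = {'你': 'ni', '好': 'hao', '吗': 'ma'}


def _toy_pinyin(word):
    """Stand-in converter: one syllable per character."""
    return [TOY_CHARS.get(ch, ch) for ch in word]


def flat(words):
    pys = []
    for word in words:
        pys.extend(_toy_pinyin(word))  # word boundaries are lost
    return pys


def grouped(words):
    pys = []
    for word in words:
        pys.append(_toy_pinyin(word))  # one sub-list per word
    return pys


print(flat(['你好', '吗']))
print(grouped(['你好', '吗']))
```

Switching to `append` would change the shape of the return value, so it would likely need to be opt-in to stay backward compatible.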

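STYLE_BOPOMOFO (Zhuyin / Bopomofo symbols)

The Zhuyin symbols commonly used in Taiwan can be derived from the pinyin data. As for the officially standardized readings, those are outside the scope of this project.

You can refer to the conversion in https://github.com/The-Orizon/nlputils/blob/master/libpinyin_bopomofo.py (I also do not know what format tone2 is; from the documentation I had assumed it was to2ne).

Style                   Description
STYLE_BOPOMOFO          Regular Zhuyin with tone marks. The tone mark always comes last; the first tone (阴平) is left unmarked.
STYLE_BOPOMOFO_NOTONE   No tone marks.

Zhuyin itself arguably has properties similar to Shuangpin (双拼, double pinyin).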

Wrong reading for "妳"

>>> from pypinyin import pinyin
>>> pinyin('你会')
[['nǐ'], ['huì']]
>>> pinyin('妳會')
[['nǎi'], ['huì']]

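Jointly build a character and phrase dictionary for pinyin

The pinyin library depends mainly on its pinyin character and phrase dictionaries (hereafter "the dictionary"). The dictionary is highly reusable, but because it is large, the chance of errors is also high.

I suggest that we build and maintain this dictionary together. What do you think?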
#41 #42

cc @mozillazg

Heteronym problem

I found that some characters that are not heteronyms are treated as heteronyms. How can I avoid this?

pinyin(u'分', heteronym=True)

fēn
fèn
fén
bàn

pinyin(u'平', heteronym=True)

píng
pián
bìng
bēng

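Is there an option to keep ü instead of converting it to v? E.g. 绿色 gives lv se

Although v is easy to convert in Python, I would still like to ask whether there is a way to get lü se (绿色) instead of lv se. I am a Chinese teacher, and using lv would be inconsistent with standard textbooks.

This is one of my favorite Python tools. Thank you very much!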
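
The post-processing the author alludes to could look like the sketch below. It assumes a flat list of toneless pinyin syllables in which `v` only ever stands in for `ü`; `restore_u_umlaut` is a hypothetical helper, not a pypinyin function.

```python
def restore_u_umlaut(syllables):
    """Replace the ASCII stand-in 'v' with 'ü' in toneless pinyin syllables."""
    return [s.replace('v', 'ü') for s in syllables]


print(restore_u_umlaut(['lv', 'se']))
```

If the list can also contain passed-through non-pinyin text such as English words, a blanket replace would corrupt it, so the conversion should be limited to items known to be pinyin.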

Heteronyms are not resolved

I tried a few examples:

  • pypinyin.pinyin(u'长江水长长长长长长长', style=pypinyin.NORMAL)
    [[u'zhang'],
    [u'jiang'],
    [u'shui'],
    [u'zhang'],
    [u'zhang'],
    [u'zhang'],
    [u'zhang'],
    [u'zhang'],
    [u'zhang'],
    [u'zhang']]
  • pypinyin.pinyin(u'重', style=pypinyin.NORMAL)
    [[u'zhong']]

Am I using it incorrectly, or is the heteronym data missing?

Sorry, I have now found the parameter: pypinyin.pinyin(u'长江水长长长长长长长', style=pypinyin.NORMAL, heteronym=True)

use class style instead of function style

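class Pinyin(object):
    def __init__(self, **options):
        # options stands in for the arguments elided as "..." in the proposal
        self.options = options

    def pinyin(self, hans):
        pass
    # ...

# Backward compatibility
def pinyin(hans, **options):
    return Pinyin(**options).pinyin(hans)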

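Some two-character words are converted into a single fused long syllable

First of all, thank you for providing such a convenient Chinese-to-pinyin tool.

I recently ran into this problem while using it: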

>>> from pypinyin import lazy_pinyin
>>> lazy_pinyin('彷徨')
['panghuang']

The expected result would be ['pang', 'huang']?
My environment is Python 3.5.1, pypinyin 0.12.0.

"厦门" is converted to 'shamen'

When using pypinyin.lazy_pinyin or pypinyin.pinyin, '厦门' is transformed to 'shamen', but the correct output is 'xiamen'.

"你明天在上海吗" is recognized incorrectly

res = pinyin(u'你明天在上海吗', style=pypinyin.TONE)
for word in res:
    print word[0]

The result is:

nǐ
míng
tiān
zài
shàng
hǎi
má

The pinyin of the final character "吗" is wrong. It is a common character and should not be misread.

maybe a bug - 苹果 => pin guo

import pypinyin
zi = '苹果'
py = pypinyin.slug(zi, style=pypinyin.NORMAL, separator=' ')
Building Trie..., from /usr/local/lib/python3.4/site-packages/jieba/dict.txt
loading model from cache /var/folders/k9/47fd1ycj2rg19gn7d2g5g16c0000gn/T/jieba.cache
loading model cost 2.451965808868408 seconds.
Trie has been built succesfully.
print(py)
pin guo

What I was looking for is "ping guo".

Do we use a conversion table in python-pinyin? I probably should search for it before asking. If yes, I'm interested in this table.

Is training supported?

In [12]: pinyin("中心")
Out[12]: [['zhōng'], ['xīn']]

In [13]: pinyin("重心")
Out[13]: [['zhòng'], ['xīn']]

In [14]: pinyin("情调来调整风格")
Out[14]: [['qíng'], ['diào'], ['lái'], ['diào'], ['zhěng'], ['fēng'], ['gé']]

In [15]: pinyin("调整风格")
Out[15]: [['diào'], ['zhěng'], ['fēng'], ['gé']]

In [16]: pinyin("调整风格")
Out[16]: [['diào'], ['zhěng'], ['fēng'], ['gé']]

In [17]: pinyin("调整")
Out[17]: [['tiáo'], ['zhěng']]

In [18]: pinyin("调薪")
Out[18]: [['diào'], ['xīn']]

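Recognition is still wrong even after segmentation.
Is there a training feature that could correct this?

"了" has no tone?

I tried the following code: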

print(lazy_pinyin("了",style=1))      #['le']
print(pinyin("了",style=1))           #[['le']]

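Is this a bug? 😉

Segmentation interface

I am building a feature that annotates Chinese characters with pinyin, but with the current interface there is no way to map the output back onto text that mixes Chinese and English.

Could an interface be provided that does the following: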

lazy_pinyin(u'你好abc☆☆')
#[u'ni', u'hao', 'a', 'b', 'c', u'\u2606', u'\u2606']

Alternatively, expose the segmentation interface so that the mapping becomes possible, or return a dict directly.
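
The requested one-to-one mapping might be sketched as follows. `TOY_CHARS` is a made-up two-entry table and `annotate` is a hypothetical helper; the point is only that every input character yields exactly one output item, so the two sequences stay aligned.

```python
# Hypothetical toy table, for illustration only.
TOY_CHARS = {'你': 'ni', '好': 'hao'}


def annotate(text):
    """Pair each character with its pinyin; unknown characters map to themselves."""
    return [(ch, TOY_CHARS.get(ch, ch)) for ch in text]


pairs = annotate('你好abc☆☆')
print([py for _, py in pairs])
```

Returning pairs rather than a dict keeps repeated characters, such as the two stars here, as separate entries.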

磅礴地产,丽水园茶坊(成都市华厦)

The result is [[u'bang'], [u'bo'], [u'de'], [u'chan']]

It should be: pang, bo, di, chan

print pinyin(u'丽水园茶坊(成都市华厦)',pypinyin.NORMAL)
[[u'li'], [u'shui'], [u'yuan'], [u'cha'], [u'fang'], [u'('], [u'cheng'], [u'dou'], [u'shi'], [u'hua'], [u'sha'], [u')']]
The reading of 成都 is wrong.
