
Comments (6)

hightman commented on June 29, 2024

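The documentation states that a mixed Chinese-English word supports at most 2 letter or digit characters. Beyond 2, the parts can simply be segmented separately; there is no need to group them together.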

Best Regards

hightman/海鳗


WeChat/Weibo: hightman
GitHub: https://github.com/hightman

On February 29, 2016, at 10:16 PM, iracheng wrote:

I defined a custom dictionary. Its entries are listed below:

WORD TF IDF ATTR

京基ab 1.00 1.00 @^@
京基1 1.00 1.00 @^@
京基a 1.00 1.00 @^@
京基1ab 1.00 1.00 @^@
京基1a 1.00 1.00 @^@
京基100 1.00 1.00 @^@

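Test code:

<?php
// The opening of this snippet was lost in extraction; scws_new() is assumed here.
$so = scws_new();
$so->set_charset('utf8');   // encoding
$so->set_dict('/home/ira/www/farm.ira.orantrip.com/tmp/article/all.xdb');
$so->set_ignore(false);
$so->set_ignore(true);      // ignore punctuation (overrides the call above)
$so->send_text($text);      // $text is the input string, not shown in the original post
print_r($so->get_words('@'));
?>

Output: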
Array
(
[0] => Array
(
[word] => 京基1
[times] => 1
[weight] => 1
[attr] => @
)

[1] => Array
(
[word] => 京基a
[times] => 1
[weight] => 1
[attr] => @
)

[2] => Array
(
[word] => 京基1a
[times] => 1
[weight] => 1
[attr] => @
)

[3] => Array
(
[word] => 京基ab
[times] => 1
[weight] => 1
[attr] => @
)
)

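京基100, which needs to be segmented as a single word, was not returned, and neither were the entries whose letters and digits total more than 2. Is there a setting that handles this? Thanks.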
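The result above is consistent with the two-character limit mentioned in the reply. As a rough check, the sketch below counts the ASCII letters and digits in each dictionary entry and compares that with whether the entry was returned. The threshold `MAX_ALNUM` and the counting rule are assumptions taken from the discussion, not from the SCWS source.

```python
import re

# Assumption: the limit applies to ASCII letters and digits inside an
# entry that also contains Chinese characters, as described in the reply.
MAX_ALNUM = 2

def alnum_count(word: str) -> int:
    """Count ASCII letters and digits in a dictionary entry."""
    return len(re.findall(r"[A-Za-z0-9]", word))

# Entries from the custom dictionary and the words that get_words('@') returned.
entries = ["京基ab", "京基1", "京基a", "京基1ab", "京基1a", "京基100"]
returned = {"京基1", "京基a", "京基1a", "京基ab"}

for word in entries:
    n = alnum_count(word)
    limit = "within limit" if n <= MAX_ALNUM else "over limit"
    seen = "returned" if word in returned else "missing"
    print(f"{word}\t{n}\t{limit}\t{seen}")
```

Every entry with at most two letters or digits was returned, and the two entries with three (京基1ab and 京基100) were not.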


Reply to this email directly or view it on GitHub #29.

from scws.

iracheng commented on June 29, 2024

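The goal is to analyze the names of places and buildings. If a name is split apart, there is no way to tell whether the target appears in the content. For names such as 「昂坪360」, 「天际100」, and 「京基100」, search matching cannot find the corresponding entry. Is there a setting to extend the number of supported characters? Thanks.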
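One possible workaround, independent of SCWS itself, is to post-process the segmenter output and rejoin adjacent tokens whose concatenation is a known name. The sketch below is only an illustration: `merge_known_names` is a hypothetical helper, and the token list stands in for whatever the segmenter actually returns.

```python
def merge_known_names(tokens, names, max_span=3):
    """Greedily merge adjacent tokens whose concatenation is a known name."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so a full name wins over its prefix.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in names:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token name starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

# Names from the thread; the token list is a hypothetical segmentation.
names = {"昂坪360", "天际100", "京基100"}
tokens = ["我", "在", "京基", "100", "附近"]
print(merge_known_names(tokens, names))
```

With this input, 京基 and 100 are rejoined into 京基100 while the other tokens pass through unchanged.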


hightman commented on June 29, 2024

Not at present.


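On March 1, 2016, at 3:31 PM, iracheng wrote:

The goal is to analyze the names of places and buildings. If a name is split apart, there is no way to tell whether the target appears in the content. For names such as 「昂坪360」, 「天际100」, and 「京基100」, search matching cannot find the corresponding entry. Is there a setting to extend the number of supported characters? Thanks.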



breath-co2 commented on June 29, 2024

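Shouldn't the custom dictionary take priority? Mixed Chinese-English words are very common, for example: 好123, 4399游戏, 300英雄, 163邮箱, 2016传奇, 荣威550, 本田XR-V, 大众Polo, 神仙道2016, 小米note, Wifi万能钥匙, 量贩ktv.
If these words appear in the dictionary, I feel they should be recognized.
Another problem is that spaces are not supported, for example iphone 6s, 小米5s Plus, and so on. I hope support for this can be improved.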


ljx0517 commented on June 29, 2024

+1


hightman commented on June 29, 2024

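Shouldn't the custom dictionary take priority? Mixed Chinese-English words are very common, for example: 好123, 4399游戏, 300英雄, 163邮箱, 2016传奇, 荣威550, 本田XR-V, 大众Polo, 神仙道2016, 小米note, Wifi万能钥匙, 量贩ktv.
If these words appear in the dictionary, I feel they should be recognized.
Another problem is that spaces are not supported, for example iphone 6s, 小米5s Plus, and so on. I hope support for this can be improved.

I don't think this matters much. Splitting 4399游戏 into 4399 + 游戏 does not affect search either.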
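To illustrate the point about search, here is a minimal sketch assuming the query and the document are segmented the same way, so that a split name can still be matched as a run of consecutive tokens. `contains_phrase` is a hypothetical helper and the token lists are made-up segmentations, not actual SCWS output.

```python
def contains_phrase(doc_tokens, query_tokens):
    """Return True if query_tokens occur consecutively in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

# Hypothetical segmentations of a document and of the query "4399游戏".
doc = ["最新", "4399", "游戏", "大全"]
query = ["4399", "游戏"]
print(contains_phrase(doc, query))
```

Requiring the tokens to be adjacent and in order keeps a document that merely mentions 4399 and 游戏 in unrelated places from matching.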
