Can this tool be used to do Chinese word segmentation? Thank you very much!

Chinese word segmentation about subword-nmt HOT 7 CLOSED

rsennrich commented on May 9, 2024

Chinese word segmentation

from subword-nmt.

Comments (7)

rsennrich commented on May 9, 2024 1

we have previously used jieba for word segmentation, then used BPE to learn a subword segmentation: http://www.statmt.org/wmt17/pdf/WMT39.pdf

if you want to try word segmentation with BPE, you can run learn_bpe.py on unsegmented text, but note that the code was not optimized for this use case, and will likely be slower and consume more memory.
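To make the second option concrete, here is a minimal toy sketch of what learning BPE on unsegmented text amounts to. It is not subword-nmt's implementation: the function names `learn_bpe_toy`, `merge_pair` and `apply_bpe_toy` and the three-sentence corpus are made up for illustration. Because there is no whitespace, each whole sentence is treated as a single "word", so the word table has roughly one long entry per sentence, which is plausibly why the real script would be slower and use more memory here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_toy(lines, num_merges):
    """Learn up to `num_merges` BPE merges (toy version, hypothetical helper).

    Whitespace-separated tokens are the "words". For unsegmented Chinese,
    a whole sentence is one word, so the table below holds long entries.
    """
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word)] += 1  # start from single characters

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe_toy(line, merges):
    """Segment `line` by replaying the learned merges in order."""
    segments = []
    for word in line.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        segments.extend(symbols)
    return segments


# Made-up example corpus, no word boundaries.
corpus = ["我喜欢机器翻译", "机器翻译很有趣", "我喜欢机器学习"]
merges = learn_bpe_toy(corpus, num_merges=1)
print(apply_bpe_toy("机器学习", merges))
```

With a single merge, the most frequent character pair in this toy corpus is joined into one unit. Whether the resulting units line up with linguistically sensible words on real data is a separate question that this sketch does not answer.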


liesun1994 commented on May 9, 2024 1

@parahaoer for Chinese word segmentation, you can use a tool such as jieba or the Stanford Parser.


parahaoer commented on May 9, 2024

@liesun1994 Could you please tell me how you do it?


guotong1988 commented on May 9, 2024

THX


kalyangvs commented on May 9, 2024

@rsennrich I trained a Transformer base model in fairseq as specified in the paper, with the same number of merge operations (59500, separate BPE models). The training data came to around 24M sentences, excluding back-translated data.
However, I am getting a low BLEU score of 17.69 (on the 2017 test data) compared to what your paper reports. I use sacreBLEU to compute BLEU, which gives the same score as the official MT eval script.
I could not work out why I cannot at least reach the baseline.


rsennrich commented on May 9, 2024

this question/complaint is very underspecified (you don't even mention the translation direction), but I don't expect your problem to be the BPE segmentation.

You can actually find most of our WMT17 submissions here: http://data.statmt.org/wmt17_systems/
Maybe this will help you find the difference to your training setup, for example in preprocessing.


kalyangvs commented on May 9, 2024

The translation direction is en -> zh.
The problem was in evaluating BLEU: I hadn't properly used tokenizeChinese.py as specified. With that fixed, I got 32.7.
Since the question pertains to Chinese data, I posted it in this issue rather than creating a new one. Apologies.

