
Comments (7)

rsennrich commented on May 9, 2024

We have previously used jieba for word segmentation, and then used BPE to learn a subword segmentation: http://www.statmt.org/wmt17/pdf/WMT39.pdf

If you want to try word segmentation with BPE itself, you can run learn_bpe.py on unsegmented text, but note that the code was not optimized for this use case, and will likely be slower and consume more memory.
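To illustrate why unsegmented input is more expensive, here is a toy sketch of BPE merge learning. It is not the learn_bpe.py implementation (it omits the end-of-word marker and all of its optimizations), and `learn_bpe_toy` is a hypothetical name. With no spaces in the input, every line becomes a single long unit, so there are far more distinct units and pair positions to track.

```python
from collections import Counter


def learn_bpe_toy(lines, num_merges):
    """Learn up to `num_merges` BPE merges from an iterable of text lines.

    Each whitespace-separated token is one unit. On unsegmented text a
    whole line is one unit, so merges can span word boundaries.
    """
    # Represent every unit as a tuple of symbols (initially characters).
    vocab = Counter()
    for line in lines:
        for token in line.split():
            vocab[tuple(token)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by unit frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break

        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)

        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    corpus = ["我爱北京", "我爱上海", "北京欢迎你"]  # unsegmented toy data
    print(learn_bpe_toy(corpus, 2))
```

This sketch recounts all pairs on every iteration, which is fine for a demonstration but far too slow for real corpora.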

from subword-nmt.

liesun1994 commented on May 9, 2024

@parahaoer For Chinese word segmentation, you can use a tool such as jieba or the Stanford parser.
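A minimal sketch of the jieba route, assuming jieba is installed (`pip install jieba`). `segment_lines` and its `cut` parameter are hypothetical names used only for illustration; `jieba.lcut` is the jieba call that returns a list of tokens for a sentence.

```python
def segment_lines(lines, cut=None):
    """Return each line with a single space between segmented words.

    `cut` is any function mapping a string to a list of tokens. It
    defaults to jieba.lcut, imported lazily so that jieba is only
    required when no custom segmenter is supplied.
    """
    if cut is None:
        import jieba  # third-party dependency, assumed to be installed
        cut = jieba.lcut
    segmented = []
    for line in lines:
        # Drop whitespace-only tokens so the output has clean single spaces.
        tokens = [tok for tok in cut(line.strip()) if tok.strip()]
        segmented.append(" ".join(tokens))
    return segmented


if __name__ == "__main__":
    for out in segment_lines(["我来到北京清华大学"]):
        print(out)
```

The space-separated output can then be passed to learn_bpe.py in the same way as tokenized text in any other language.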


parahaoer commented on May 9, 2024

@liesun1994 Could you please tell me how you do it?


guotong1988 commented on May 9, 2024

THX


kalyangvs commented on May 9, 2024

@rsennrich I trained a transformer base model in fairseq as specified in the paper, with the same number of merge operations (59500, separate BPE models), on around 24M sentences, excluding the back-translated data.
However, I am getting a low BLEU score of 17.69 (on the 2017 test data) compared to what your paper reports. I use sacreBLEU to compute BLEU, which gives the same score as the official MT eval script.
I could not find the reason why I cannot reach at least the baseline.


rsennrich commented on May 9, 2024

This question/complaint is very underspecified (you don't even mention the translation direction), but I don't expect your problem to be the BPE segmentation.

You can find most of our WMT17 submissions here: http://data.statmt.org/wmt17_systems/
Maybe this will help you find the differences from your training setup, for example in preprocessing.


kalyangvs commented on May 9, 2024

The translation direction is en -> zh.
The problem was in evaluating BLEU, as I hadn't properly used tokenizeChinese.py as specified. After fixing that, I got 32.7.
Since the question pertains to Chinese data, I posted it in this issue rather than creating a new one. Apologies.

