(This is a question, please redirect me if this is not the right place to ask)

Corpus clean up and normalization (opus-mt-train issue, closed)

helsinki-nlp commented on May 10, 2024

Corpus clean up and normalization

from opus-mt-train.

Comments (4)

jorgtied commented on May 10, 2024

The best way would be to implement Python libraries that do language-specific cleanup. That way, I could call those methods from another script that filters any bitext containing one of those languages through the matching cleanup function. This would then be easy to integrate into Makefile.data, where all kinds of pre-processing happen. There is already a script (bitext-match-lang.py) that does language identification as another pre-processing step. That already helps quite a lot, and I have started training some new models with the improved filtering.
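
As a rough illustration of the idea above, here is a minimal sketch of a per-language cleanup registry and a bitext filter that calls it. Everything in it is an assumption made for illustration: the names `CLEANERS`, `register`, `clean` and `clean_bitext` are hypothetical, the quote-normalization rule for `en` is only a placeholder, and none of this is taken from opus-mt-train itself.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical registry mapping a language code to its cleanup function.
CLEANERS: Dict[str, Callable[[str], str]] = {}

_ZERO_WIDTH = re.compile("[\u200b\ufeff]")  # zero-width space and BOM only
_SPACES = re.compile(r"\s+")
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def register(lang: str):
    """Decorator that registers a cleanup function for one language code."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        CLEANERS[lang] = func
        return func
    return wrap


def generic_cleanup(text: str) -> str:
    """Language-independent step: drop U+200B/U+FEFF, collapse whitespace."""
    return _SPACES.sub(" ", _ZERO_WIDTH.sub("", text)).strip()


@register("en")
def cleanup_en(text: str) -> str:
    """Placeholder rule for illustration: map curly quotes to ASCII quotes."""
    return text.translate(_QUOTES)


def clean(text: str, lang: str) -> str:
    """Apply the generic step, then the language-specific one if registered."""
    text = generic_cleanup(text)
    cleaner = CLEANERS.get(lang)
    return cleaner(text) if cleaner else text


def clean_bitext(
    pairs: Iterable[Tuple[str, str]], src_lang: str, trg_lang: str
) -> Iterator[Tuple[str, str]]:
    """Clean both sides of each pair and drop pairs with an empty side."""
    for src, trg in pairs:
        src, trg = clean(src, src_lang), clean(trg, trg_lang)
        if src and trg:
            yield src, trg
```

A language without a registered function simply falls back to the generic step, so new language modules could be added without touching the calling script.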

We are also working on a package called opus-filter that will make it easier to select proper data from larger noisy data sets. It is already released but needs some more tweaking before it can be applied easily in the general case.

santhoshtr commented on May 10, 2024

Thanks. I spent some time on it, but the workings of Makefile.data and the related files were difficult to understand. Do you mind if I share a cleanup script here and help add it to that codebase?

This is a sed script https://gist.github.com/santhoshtr/1d2143ed5a4987b31c8c1a2c17564263

Ideally, this script needs to run on the raw parallel text before any other processing.

jorgtied commented on May 10, 2024

Yes, those makefiles are very much research-in-progress material. I'll be happy to help with integrating your cleanup scripts. Whatever additional scripts or libraries you create, I can then find ways of integrating them into the data processing pipeline.

jorgtied commented on May 10, 2024

The sed script is included now and the makefile includes some routines to read cleanup scripts if available.
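
The thread does not show how the makefile looks up cleanup scripts, so the following is only a sketch of the general "use a cleanup script if one is available" pattern, written in Python rather than make. The `<lang>.sed` naming scheme, the `script_dir` parameter and both function names are assumptions for illustration, not the layout used in opus-mt-train.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def find_cleanup_script(lang: str, script_dir: Path) -> Optional[Path]:
    """Return the cleanup script for `lang` if it exists, otherwise None."""
    candidate = script_dir / f"{lang}.sed"  # hypothetical naming convention
    return candidate if candidate.is_file() else None


def cleanup_text(text: str, lang: str, script_dir: Path) -> str:
    """Pipe `text` through the language's sed script; pass through if absent."""
    script = find_cleanup_script(lang, script_dir)
    if script is None or shutil.which("sed") is None:
        return text  # no script (or no sed on this system): leave text as-is
    result = subprocess.run(
        ["sed", "-f", str(script)],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Falling back to the unchanged text keeps the pipeline usable for language pairs that have no cleanup script yet.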
