Comments (4)
The best way would be to implement some Python libraries that do language-specific cleanup. That way, I could call those methods from another script that filters any bitext involving one of those languages with the matching cleanup function. This would then be easy to integrate into Makefile.data, where all kinds of pre-processing happen. There is already a script (bitext-match-lang.py) that does language identification as another pre-processing step. That already helps quite a lot, and I have started to train some new models with the improved filtering.
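As a rough illustration of that idea, here is a minimal sketch of a registry of language-specific cleanup functions that a generic filter script could call. All names in it (`CLEANERS`, `register`, `clean_ml`, `clean_bitext`) are hypothetical and do not exist in opus-mt-train, and the single whitespace rule is only a placeholder for real language-specific rules.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

Cleaner = Callable[[str], str]

# Hypothetical registry: language code -> cleanup function.
CLEANERS: Dict[str, Cleaner] = {}


def register(lang: str) -> Callable[[Cleaner], Cleaner]:
    """Decorator that registers a cleanup function for one language code."""
    def decorator(func: Cleaner) -> Cleaner:
        CLEANERS[lang] = func
        return func
    return decorator


@register("ml")
def clean_ml(text: str) -> str:
    # Placeholder rule (assumption): collapse whitespace runs and trim.
    # Real language-specific rules would go here.
    return re.sub(r"\s+", " ", text).strip()


def clean_bitext(
    pairs: Iterable[Tuple[str, str]], src_lang: str, trg_lang: str
) -> Iterator[Tuple[str, str]]:
    """Apply the registered cleaner (if any) to each side of a bitext."""
    identity: Cleaner = lambda s: s
    src_clean = CLEANERS.get(src_lang, identity)
    trg_clean = CLEANERS.get(trg_lang, identity)
    for src, trg in pairs:
        yield src_clean(src), trg_clean(trg)
```

A side whose language has no registered cleaner is passed through unchanged, so the same call would work for any language pair.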
We are also working on a package called opus-filter that will make it easier to select proper data from larger noisy data sets. It is already released but needs some more tweaking to make it applicable in the general case.
from opus-mt-train.
Thanks. I spent some time on it, but the workings of Makefile.data and the related files were difficult to understand. Would you mind if I shared a cleanup script here, and could you help add it to the codebase?
Here is a sed script: https://gist.github.com/santhoshtr/1d2143ed5a4987b31c8c1a2c17564263
Ideally, this script needs to run on the raw parallel text before any other processing.
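For readers who prefer Python over sed, a substitution-based cleanup pass over one side of the raw bitext might look like the sketch below. The rules shown are hypothetical examples and are not taken from the linked gist. The helper names `apply_rules` and `clean_side` are also assumptions.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Rule = Tuple[re.Pattern, str]

# Hypothetical rules in the spirit of sed's s/pattern/replacement/g.
RULES: List[Rule] = [
    (re.compile("\u200b"), ""),     # drop zero-width spaces
    (re.compile(r"[ \t]+"), " "),   # collapse runs of spaces and tabs
]


def apply_rules(line: str, rules: List[Rule] = RULES) -> str:
    """Apply every substitution rule in order, then trim the line."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line.strip()


def clean_side(lines: Iterable[str], rules: List[Rule] = RULES) -> Iterator[str]:
    """Clean one side of a bitext, yielding exactly one line per input line.

    Lines are never dropped, so the output stays aligned with the
    untouched other side of the parallel text.
    """
    for line in lines:
        yield apply_rules(line.rstrip("\n"), rules)
```

Keeping a one-to-one line mapping is the important design choice here: removing a line on one side only would break the sentence alignment that later steps rely on.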
Yes, those makefiles are very much research-in-progress material. I'll be happy to help with integrating your cleanup scripts. Whatever additional scripts or libraries you create, I can then find ways to integrate them into the data processing pipeline.
The sed script is now included, and the makefile includes some routines that read cleanup scripts if they are available.
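The "use a cleanup script if one is available" behaviour can be pictured with the following Python sketch. It is not the makefile's actual logic: the `<lang>.sed` naming scheme, the directory layout, and the function names are all assumptions made for illustration, and it requires a `sed` binary on the PATH when a script is found.

```python
import subprocess
from pathlib import Path
from typing import Optional, Union


def find_cleanup_script(lang: str, script_dir: Union[str, Path]) -> Optional[Path]:
    """Return the cleanup script for `lang`, or None if there is none.

    Assumes a hypothetical layout of one `<lang>.sed` file per language.
    """
    candidate = Path(script_dir) / f"{lang}.sed"
    return candidate if candidate.is_file() else None


def clean_text(text: str, lang: str, script_dir: Union[str, Path]) -> str:
    """Pipe text through the language's sed script, if one exists."""
    script = find_cleanup_script(lang, script_dir)
    if script is None:
        # No script for this language: pass the text through unchanged.
        return text
    result = subprocess.run(
        ["sed", "-f", str(script)],
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise if sed exits with an error
    )
    return result.stdout
```

With this shape, adding support for a new language only means dropping a new script into the directory, with no changes to the calling code.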