achimr / m4loc
Moses for Localization
License: GNU Lesser General Public License v3.0
1. Copy & paste this text into the file test.de (without double quotes)
"das ist ein <g id="1">kleines</g> haus"
2. Copy & paste this text into the file test.en (without double quotes)
"this is |0-1| a |2-2| small |3-3| house |4-4| "
3. run "reinsert.pl test.de < test.en > test.out"
Resulting content of test.out:
"this is a <g id="1"> small </g> house "
Expected content of test.out:
"this is a <g id="1"> small </g> house" (i.e. no trailing space)
Original issue reported on code.google.com by [email protected]
on 31 Mar 2011 at 8:38
1. tikal.bat -xm .\t\Sample_AlmostEverything_1.2_strict.xlf
2. copy .\t\Sample_AlmostEverything_1.2_strict.xlf.en-us
.\t\Sample_AlmostEverything_1.2_strict.xlf.fr
3. tikal.bat -lm .\t\Sample_AlmostEverything_1.2_strict.xlf
4. Open file in editor and goto first <trans-unit> element
<alt-trans> unit has the following content:
Text <g id="350263025">text</g>TEXT<g id="350263964">TEXT</g>TEXT<x
id="350265884"/>TEXT<bx id="350266844"/>TEXT<mrk
mtype="x-test">text</mrk> <x id="350267804"/>TEXT<g
id="350268764">TEXT</g>TEXT.<g id="350268764"></g><
Expected: no escaping of < before the mrk elements:
Text <g id="350263025">text</g>TEXT<g id="350263964">TEXT</g>TEXT<x
id="350265884"/>TEXT<bx id="350266844"/>TEXT<mrk mtype="x-test">text</mrk> <x
id="350267804"/>TEXT<g id="350268764">TEXT</g>TEXT.<g id="350268764"></g><
Original issue reported on code.google.com by [email protected]
on 9 Mar 2011 at 1:05
Tikal can produce g, x, bx and ex tags, but the markup remover handles only g and x tags.
Probably only a small regexp modification is needed, plus a documentation update.
Original issue reported on code.google.com by [email protected]
on 26 Jan 2011 at 12:14
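The "small regexp modification" mentioned in the issue above could look roughly like the following. This is a Python sketch, not the project's Perl code: the function name `remove_markup` and the simplified attribute handling are assumptions made for illustration, and only the tag inventory (g, x, bx, ex) comes from the issue text.

```python
import re

# Paired <g>...</g> tags plus the stand-alone x, bx and ex tags named in the issue.
# Attribute handling is deliberately simplified (anything up to the closing ">").
INLINE_TAG = re.compile(r"</?(?:g|x|bx|ex)(?:\s+[^<>]*?)?/?>")

def remove_markup(line: str) -> str:
    """Replace each inline tag by a space, then collapse runs of whitespace."""
    stripped = INLINE_TAG.sub(" ", line)
    return " ".join(stripped.split())

tokenized = 'Text with <g id="1"> code </g> and <g id="2"> bold </g> .'
plain = remove_markup(tokenized)
```

Other elements such as `<b>` are left untouched by this pattern, and collapsing whitespace also avoids leading or trailing spaces in the output.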
Tokenized source file with line:
Text with <g id="1"> code </g> and <g id="2"> bold </g> .
Pseudo-localized file with translation:
ithway odecay |1-2| . |5-5| exttay |0-0| and oldbay |3-4|
Run
reinsert.pl source_tokenized < pseudo_localized
Output:
<g id="1"> ithway odecay . exttay </g> <g id="2"> and oldbay <g id="2"> </g>
<g id="2"> is duplicated. Expected: only the first instance is emitted.
Original issue reported on code.google.com by [email protected]
on 8 Mar 2011 at 9:26
Merge m4loc_tag.pm into m4loc.pm and add options for
- tag preservation vs. tag removal/reinsertion (default: tag preservation)
- greedy vs. non-greedy tag reinsertion (default: non-greedy)
- retaining semantically relevant stand-alone tags in tag removal/reinsertion
process
- truecasing vs. recasing models
Original issue reported on code.google.com by [email protected]
on 5 Sep 2013 at 8:28
perl wrap_tokenizer.pm -t "perl
/home/larix/moses/scripts/tokenizer/tokenizer.perl" -p "-l cs -threads 2" < try
is not working, but
perl wrap_tokenizer.pm -t "perl
/home/larix/moses/scripts/tokenizer/tokenizer.perl" -p "-l cs -threads 1" < try
Gives correct output.
The problem is the -threads parameter: it does not work with more than one thread
(only with the default -threads 1)
Original issue reported on code.google.com by [email protected]
on 29 Apr 2013 at 3:41
1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us >
languagetool.xlf.tok.en-us
Compare line 77 of
languagetool.xlf.en-us:
<br><b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/></b><br>
languagetool.xlf.tok.en-us:
< br > < b > <x id="1"/> . Line <x id="2"/> , column <x id="3"/> < / b > < br >
Expected output:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>
or
<br> <b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b> <br>
The extra spaces make it hard to distinguish tags from < and > characters in
sentences like "The temperature is < 12 degrees, but > than 6."
As separate tokens "b" and "br" might be translated into other characters which
would break the markup.
Original issue reported on code.google.com by [email protected]
on 28 Feb 2011 at 9:44
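One way to get the second expected output from the issue above is to split the line on complete tags first and tokenize only the text in between. The Python sketch below illustrates that idea under stated assumptions: `tokenize_keep_tags` is a hypothetical name, and the punctuation rule is a naive stand-in for the real Moses tokenizer, not its actual behavior.

```python
import re

# Anything that looks like a complete tag: <br>, </b>, <x id="1"/>, ...
# A "<" followed by a space (as in "< 12 degrees") is not treated as a tag.
TAG = re.compile(r"(</?[A-Za-z][^<>]*>)")
# Naive stand-in for the Moses tokenizer: split off common punctuation.
PUNCT = re.compile(r"([.,!?;:()])")

def tokenize_keep_tags(line: str) -> str:
    tokens = []
    # With one capture group, re.split alternates: text, tag, text, tag, ...
    for i, chunk in enumerate(TAG.split(line)):
        if i % 2 == 1:
            tokens.append(chunk)  # a tag: keep it as a single token
        else:
            tokens.extend(PUNCT.sub(r" \1 ", chunk).split())
    return " ".join(tokens)

line77 = '<br><b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/></b><br>'
result = tokenize_keep_tags(line77)
```

Because tags are never passed to the tokenizer, "b" and "br" cannot end up as separate tokens, while literal < and > characters in running text stay ordinary tokens.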
IPC::Open2 is core module, don't list as prerequisite in INSTALL.txt
Original issue reported on code.google.com by [email protected]
on 3 Oct 2013 at 2:58
1. Open a command window and change into the .\xliff directory
2. Run "xliff2moses.bat .\t\RB-12-Test02.xlf.tok en" (or "./xliff2moses.bat
./t/RB-12-Test02.xlf.tok en" on Unix)
3. Open .\t\RB-12-Test02.xlf.tok.en in a text editor
4. View line 7:
"Text with <g id="2"><x id="1"/></g> and more text ."
Expected:
"Text with <g id="2"> <x id="1"/> </g> and more text ."
Original issue reported on code.google.com by [email protected]
on 3 Mar 2011 at 1:56
On Windows:
1. Produce(or edit) tokenized target language file e.g.
languagetool.xlf.ins.fr-fr
2. Run "perl mod_detokenizer.pl -l fr-fr < languagetool.xlf.ins.fr-fr"
Result:
Error message "The system cannot find the path specified."
Cause:
Line 100 in detokenizer.perl
system("./detokenizer.perl -l $lang < $tmpout 2> /dev/null ");
should be changed to the platform independent
system("perl ./detokenizer.perl -l $lang < $tmpout");
Original issue reported on code.google.com by [email protected]
on 4 Mar 2011 at 3:02
1. Open a command window and change into the ./xliff directory
2. Run "xliff2moses.bat .\t\languagetool.xlf en-us" (or "./xliff2moses.bat
.\t\languagetool.xlf en-us" on Unix)
3. Open .\t\languagetool.xlf.tok.en-us in a text editor
4. View line 72
"Don ' t put a space after the opening parenthesis"
Expected:
"Don 't put a space after the opening parenthesis"
The regular Moses tokenizer tokenizes it as this - can be verified by running
"perl tokenizer.perl -l en < .\t\languagetool.xlf.en-us"
after running the steps above.
Original issue reported on code.google.com by [email protected]
on 3 Mar 2011 at 1:51
1. Produce(or edit) tokenized target language file e.g.
languagetool.xlf.ins.fr-fr
2. Run "perl mod_detokenizer.pl -l fr-fr < languagetool.xlf.ins.fr-fr"
Result:
Error: "No built-in rules for language fr-fr, claim en for default behaviour.
at ./detokenizer.perl line 37."
Expected:
Automatic fallback to en. The modified tokenizer should check the language and
if it is not supported fall back to English (en). Ideally the detokenizer would
do that already, but we are picking up the script unmodified from Moses.
Original issue reported on code.google.com by [email protected]
on 4 Mar 2011 at 3:07
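The fallback requested in the issue above could be sketched as follows (in Python, for illustration only). The `SUPPORTED` set is hypothetical, since the real list depends on the Moses version in use, and trying the primary subtag ("fr" for "fr-fr") before falling back to English is an assumption that goes slightly beyond what the issue asks for.

```python
# Hypothetical set of languages with built-in detokenizer rules.
SUPPORTED = {"en", "fr", "es", "cs", "de", "it"}

def detokenizer_language(code, supported=SUPPORTED, default="en"):
    """Map a locale such as 'fr-fr' to a language the detokenizer accepts."""
    normalized = code.lower().replace("_", "-")
    if normalized in supported:
        return normalized
    primary = normalized.split("-")[0]  # 'fr-fr' -> 'fr'
    return primary if primary in supported else default
```

The wrapper would then pass the returned value, rather than the raw locale, to the unmodified Moses detokenizer.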
What steps will reproduce the problem?
1. Run m4loc.pl
2.
3.
What is the expected output? What do you see instead?
Expected: tokenized inline file without markup
Error message:
Can't locate IO/Pty.pm in @INC (@INC contains: C:/strawberry/perl/site/lib C:/st
rawberry/perl/vendor/lib C:/strawberry/perl/lib .) at C:/strawberry/perl/site/li
b/IPC/Run.pm line 1923.
What version of the product are you using? On what operating system?
Win7--downloaded this week thru Git.
Please provide any additional information below.
Installed all libraries mentioned in Install file. Running Strawberry perl.
Original issue reported on code.google.com by [email protected]
on 28 Sep 2012 at 2:16
Repro steps:
1. Create tokenized source file sourcetok.txt with this content:
Click the search icon <g id="1"></g> to expand the drop-down list .
2. Create word alignment file wa.txt with this content:
0-0 1-1 2-2 3-3 5-4 6-5 8-6 7-7 9-8
3. Create target file target.txt with this content
Clic la búsqueda icono ampliar la lista drop-down .
4. Run
reinsert_wordalign.pl sourcetok.txt wa.txt < target.txt
Resulting output:
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Use of uninitialized value in string eq at
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152,
<STDIN> line 1.
Clic la búsqueda icono ampliar la lista drop-down . <g id="1"> </g>
Expected:
No error messages. Resulting insertion:
Clic la búsqueda icono <g id="1"> </g> ampliar la lista drop-down .
Remark: Behavior is the same if the tokenizer inserted a space inside the empty
paired <g> tag:
Click the search icon <g id="1"> </g> to expand the drop-down list .
Original issue reported on code.google.com by [email protected]
on 25 Sep 2013 at 7:50
input:
-------------------------------------------------
That agreement shall be negotiated with European Union <g id="1"> Minister for
<g id="2"> Foreign </g> Affairs . </g>
-------------------------------------------------
(1 line)
moses output:
-------------------------------------------------
Šo līgumu apspriež |0-4| ar |5-5| Eiropas Savienības |6-7| ārlietu
ministru |8-11|. |12-12|
-------------------------------------------------
command:
perl reinsert.pm 1file < moses_file
result:
-------------------------------------------------
Šo līgumu apspriež ar Eiropas Savienības <g id="1"> <g id="2"> </g>
ārlietu ministru . </g> <g id="2">
-------------------------------------------------
Problem: last tag (<g id="2">) should not have been there at all.
Original issue reported on code.google.com by [email protected]
on 7 Oct 2011 at 7:44
Repro steps:
1. Extract Moses InlineText from languagetool.xlf
tikal -xm -to languagetool.xlf.en-us languagetool.xlf
2. Translate the text to Spanish with the small En-Es MT system provided
perl ~/Documents/work/oss/m4loc/xliff/m4loc.pm -o p -n -s en -t es -m
~/.../binarized_model/moses.ini -c ~/.../data/truecase-model.en <
languagetool.xlf.en-us > languagetool.xlf.pb.tok.es
Observations:
Line 9 Source:
&Check Text
Line 9 Target:
& cheque texto
Expected: Target ampersand also escaped
Line 16 Source:
Word repetition (e.g. 'will will')
Line 16 Target:
Palabra la repetición ' ( por ejemplo , se va a ' )
Expected: Quotes not escaped in target
Line 32 Target:
Punto : " <x id="1"/> " ( <x id="2"/> ) significa <x id="3"/> ( <x
id="4"/> ) .
This last issue does not occur with tag-fixed tag handling
Remark: Try to run deescape-special-chars.perl from Moses scripts/tokenizer
folder
Original issue reported on code.google.com by [email protected]
on 23 Sep 2013 at 7:51
Translating line 2 of Sample_AlmostEverything_1.2_strict.xlf.en-us results in
two different translations:
Word-based:
<x id="1"/> Se interpone para <x id="2"/> .
Tag-fixed:
<x id="1"/> tribuna para <x id="2"/> .
Is this due to just the punctuation being included or not in the phrase
translation?
Check by translating raw text.
Original issue reported on code.google.com by [email protected]
on 23 Sep 2013 at 8:09
Source and target language codes should be passed as options; otherwise Tikal relies
on OS-specific variables. This would also make the whole process more readable.
Currently, only the target language is passed as an option.
Original issue reported on code.google.com by [email protected]
on 11 Mar 2011 at 12:37
Documentation should mention that input and output are tokenized (also in
command line help)
Documentation on wiki does not list the principles that are used for reinsertion
Original issue reported on code.google.com by [email protected]
on 12 Apr 2011 at 1:15
1. tikal.bat -xm .\t\languagetool.xlf
2. copy .\t\languagetool.xlf.en-us .\t\languagetool
3. tikal.bat -lm .\t\languagetool.xlf
Have a look at the resulting languagetool.out.xlf - line 58-60:
<alt-trans match-quality="10" origin="Moses-MT"
xmlns:okp="okapi-framework:xliff-extensions" okp:matchType="MT">
<target xml:lang="fr-fr">Check done, <x id="1"/> potential problems
found</target>
</alt-trans>
Expected:
Check done, <ph id="1">{0}</ph> potential problems found
as in source segment
i.e. the interim <x> tag to be reverted back into a <ph> tag
Original issue reported on code.google.com by [email protected]
on 7 Mar 2011 at 8:43
cat 1 1.mos
asomething something . <g id="1"> </g> something something .
asomething |0-0| something |1-1|. |2-2| something |3-3| something |4-4|. |5-5|
perl reinsert.pm 1 < 1.mos
asomething something . <g id="1"> something </g> something .
However, it should be:
asomething something . <g id="1"> </g> something something .
Original issue reported on code.google.com by [email protected]
on 4 Jan 2012 at 2:20
What steps will reproduce the problem?
1. Running script above
What is the expected output? What do you see instead?
Tikal.sh is present under okapi/okapi-apps_gtk2-linux-x86_0.14. I also copied
Tikal.sh to m4loc folder.
What version of the product are you using? On what operating system?
r106. Ubuntu 11
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 11 Dec 2011 at 11:13
To avoid code duplication we want to switch to the modulino versions of
tokenizer/detokenizer
Original issue reported on code.google.com by [email protected]
on 7 Mar 2012 at 9:27
echo "<p>You are now logged" | mod_tokenizer
result is:
<p>You are now logged
should be:
<p> You are now logged
(space between > and You)
Should be done by some XML entity recognizer?
Original issue reported on code.google.com by [email protected]
on 2 Mar 2011 at 5:52
What steps will reproduce the problem?
1. change to a folder outside of trunk/xliff
2. Copy & paste the following text into a file called test.tok.en:
this is a <g id="1"> small </g> house
3. call "path/to/trunk/xliff/mod_detokenizer.pl < test.tok.en"
Result:
Error message: 'Can't open perl script "detokenizer.perl": No such file or
directory'
Expected:
mod_detokenizer.pl to run w/o error message
Fix:
Detect executing folder with:
use FindBin qw($Bin);
then call detokenizer.pl in system() call with the path in $Bin
Workaround:
Call mod_detokenizer.pl from containing folder
Original issue reported on code.google.com by [email protected]
on 31 Mar 2011 at 8:19
input:
-------------------------------------------------
<g id="1"> <g id="2"> Without . </g> </g>
-------------------------------------------------
moses output:
-------------------------------------------------
Neskarot |0-0| . |1-1|
-------------------------------------------------
command:
perl reinsert.pm 1file < moses_file
result:
-------------------------------------------------
<g id="1"> <g id="2"> Neskarot </g> . </g> <g id="2">
-------------------------------------------------
Problems:
1. last tag (<g id="2">) should not have been there at all
2. both closing tags(</g>) should be placed on the same position
Original issue reported on code.google.com by [email protected]
on 7 Oct 2011 at 8:23
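Stray tags like the trailing <g id="2"> in the result above can be detected mechanically before the output is handed back to Tikal. The following Python sketch is only a checker, not a fix for reinsert.pm; the function name `unbalanced_g_tags` and the message strings are made up for illustration.

```python
import re

# Opening <g id="..."> or closing </g>; group 1 is "/" for closing tags.
G_TAG = re.compile(r'<(/?)g(?:\s+id="([^"]*)")?\s*>')

def unbalanced_g_tags(line):
    """Return a list of problems found with paired <g> tags in one line."""
    problems, open_ids = [], []
    for match in G_TAG.finditer(line):
        closing, tag_id = match.group(1), match.group(2)
        if not closing:
            open_ids.append(tag_id)
        elif open_ids:
            open_ids.pop()
        else:
            problems.append("closing </g> without an opening tag")
    problems.extend(f'<g id="{tag_id}"> is never closed' for tag_id in open_ids)
    return problems

reported = '<g id="1"> <g id="2"> Neskarot </g> . </g> <g id="2">'
issues = unbalanced_g_tags(reported)
```

A check like this could run in a regression test over the reinsertion examples collected in these issues.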
The command:
perl ~/scripts/m4loc/corpus/testset.pl -n 200 -o automotive_test.en.txt -n
automotive_held.en.txt < automotive_min.en.txt > automotive_index.txt
Produced 3 files:
automotive_test.en.txt was empty
automotive_index.txt was empty
test.hld contained the content from automotive_min.en.txt
Original issue reported on code.google.com by [email protected]
on 28 Feb 2013 at 1:35
cat 1
hello <g id="1"> how are you <x id="0"/> </g> ?
cat 1.mos
HELLO |0-0| HOW |1-1| ARE |2-2| YOU |3-3| ? |4-4|
perl reinsert.pm 1 < 1.mos
HELLO <g id="1"> HOW ARE YOU </g> <x id="0"/> ?
Expected result:
HELLO <g id="1"> HOW ARE YOU <x id="0"/> </g> ?
<x /> and </g> tags are re-ordered
Original issue reported on code.google.com by [email protected]
on 1 Feb 2012 at 2:09
It would be nice to have the same interface(way of calling and line processing)
in each script. The same treatment of STD(IN,OUT,ERR) and same names for
processing of lines ...
Original issue reported on code.google.com by [email protected]
on 24 Jan 2012 at 9:30
Environment: Windows 7 64-bit, Strawberry Perl 5.10.1.4 32-bit, Okapi Tools M10
What steps will reproduce the problem?
1. Download XLIFF test file
http://okapi.googlecode.com/svn/trunk/okapi/filters/xliff/src/test/resources/RB-
11-Test01.xlf
2. Open a command window and change directory to where the downloaded file is
located
3. Extract translatable text with "tikal -xm RB-11-Test01.xlf"
4. Try to tokenize extracted text "perl mod_tokenizer.pl -l en <
RB-11-Test01.xlf.en > RB-11-Test01.tok.en"
Result:
Error message: "The system cannot find the path specified."
The contents of the file RB-11-Test01.tok.en are (in 2 lines):
-n "Paragraph. <g id="1">code</g> <g id="2">bold</g> -n ". <x id="1"/> -n " and more text <g id="2"><x id="1"/></g> -n " and more text.
<x id="1"/> -n "
Expected:
No error message and the output file to contain properly tokenized text.
Remark:
Removing XLIFF inline elements from the input file removes the error message,
but does not fix the corrupted output.
Original issue reported on code.google.com by [email protected]
on 25 Jan 2011 at 7:33
if command is mistyped, the usage help is displayed:
Usage: ./reinsert.pl <source InlineText file> < <target plain text file > >
<target InlineText file>
However, it is hard to decide what signs like "<", or "> >" mean. Maybe:
Usage: ./reinsert.pl source_InlineText_file < target_plain_text_file >
target_InlineText_file
would be better, or some other better solution?
Original issue reported on code.google.com by [email protected]
on 7 Mar 2011 at 4:36
1. cat oo
DdNé_Local Corrections
2. ./modtokenizer < oo
Wide character in print at /home/moses/m4loc/xliff/./mod_tokenizer.pl line 138,
<STDIN> line 1.
This was not an issue in the "file-based" approach, so I suppose the problem is
just how the string passed to LibXML::Reader is encoded (probably see
http://perldoc.perl.org/Encode.html). I'll elaborate on this next week.
Tomas
Original issue reported on code.google.com by [email protected]
on 15 Jul 2011 at 12:21
What steps will reproduce the problem?
Running ./tikal.sh -xm Computer Threats FAQ.xlf -sl en-us -tl es -trace
What is the expected output? What do you see instead?
Should get InlineText files. Get error msg below instead.
rubo@rubo-Dell-System-XPS-L702X:~/tools/m4loc/xliff$ ./tikal.sh -xm Computer
Threats FAQ.xlf -sl en-us -tl es -trace
-------------------------------------------------------------------------------
Okapi Tikal - Localization Toolset
Version: 2.0.15
-------------------------------------------------------------------------------
Extraction to Moses InlineText
java.lang.RuntimeException: The input file 'Computer' has no extension to guess
the filter from.
at net.sf.okapi.applications.tikal.Main.guessMissingParameters(Main.java:843)
at net.sf.okapi.applications.tikal.Main.process(Main.java:957)
at net.sf.okapi.applications.tikal.Main.main(Main.java:538)
What version of the product are you using? On what operating system?
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 29 Jan 2012 at 8:15
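The issue above does not state a diagnosis, but the error message ('Computer' has no extension) suggests that the unquoted file name was split at its spaces. The Python snippet below only illustrates how a POSIX shell would split the command line as typed and how quoting keeps the name in one piece; it does not call Tikal.

```python
import shlex

# The command exactly as typed in the report: the file name is not quoted.
unquoted = "./tikal.sh -xm Computer Threats FAQ.xlf -sl en-us -tl es -trace"
unquoted_args = shlex.split(unquoted)  # file name arrives as three arguments

# Quoting the file name keeps it together as a single argument.
filename = "Computer Threats FAQ.xlf"
quoted = f"./tikal.sh -xm {shlex.quote(filename)} -sl en-us -tl es -trace"
quoted_args = shlex.split(quoted)
```

On the command line, the equivalent would be to put the file name in quotes.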
1. tikal.bat -xm .\t\languagetool.xlf
2. copy .\t\languagetool.xlf.en-us .\t\languagetool.xlf.fr-fr
3. tikal.bat -lm .\t\languagetool.xlf
Observation: first line of resulting languagetool.out.xlf is:
<?xml version="1.0" encoding="windows-1252"?>[...]
File encoding is ASCII/Windows-1252 (no characters outside ASCII range in file)
Expected:
File to be UTF-8 encoded - first line:
<?xml version="1.0" encoding="utf-8"?>
The file encoding can be forced by adding the option -oe utf8 to the tikal
command in line 3.
Original issue reported on code.google.com by [email protected]
on 8 Mar 2011 at 8:06
What steps will reproduce the problem?
1. Download the TMX2MosesCorpusTool
2. Install on my Windows XP machine
3. Fill the required fields like target language , export path, etc...
4. Imported the TMX (I tried two TMX files, one small and one large)
I expected the TMX I imported to be converted to a corpus for training
Moses, but I did not find any result in the assigned path
AMT_CleaningTool_20120519.air
Windows XP SP3
Please advise ASAP.
Thanks
Original issue reported on code.google.com by [email protected]
on 3 Jun 2012 at 3:11
Quite common are strings from Tikal like this:
<x id="1"/> <b> <x id="2"/> Uživatelé systému <x id="3"/> </b>
<x id="4"/>
This is a problem with some CAT tools that create valid but poorly structured XLIFF. Trados
is really an expert at creating bad XLIFF :). A correctly created string would
look like:
<x id="1"/> Uživatelé systému <x id="2"/>
We need to preserve the <b> and </b> information, but it should not go into
Moses. I can see one possibility:
mod_tokenizer will try to identify such strings and wrap them into some special
tag (e.g. <hide><b> </hide>)
remove_markup should remove it (but the whole content of the tag)
markup_reinserter would insert the tag (again - the whole tag)
Or, is there some better solution?
Tomas
Original issue reported on code.google.com by [email protected]
on 4 Mar 2011 at 5:00
1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us >
languagetool.xlf.tok.en-us
Compare line 77 of
languagetool.xlf.en-us:
<br><b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/></b><br>
languagetool.xlf.tok.en-us:
< br > < b > <x id="1"/> . Line <x id="2"/> , column <x id="3"/> < / b > < br >
Later the <x> tags will be removed, but the remaining < and > characters around
the b tags will create problems with Moses. See
http://article.gmane.org/gmane.comp.nlp.moses.user/4123
(this is only from Feb-14, so different from what we discussed earlier)
Therefore expected:
<br><b> <x id="1"/>. Line <x id="2"/>, column <x
id="3"/></b><br>
Original issue reported on code.google.com by [email protected]
on 24 Feb 2011 at 10:09
1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us >
languagetool.xlf.tok.en-us
Compare first line of
languagetool.xlf.en-us:
Belarusian
languagetool.xlf.tok.en-us:
Belarusian
Note the extra space before the first token. Expected: no space
Original issue reported on code.google.com by [email protected]
on 24 Feb 2011 at 10:21
The truecase.perl and detruecase.perl scripts in Moses v1.0 buffer input and/or
output and when integrated with pipes using the open2 function hang. Possible
fixes:
1. Use IPC::Run to interface with these scripts, similar to how the Moses
tokenizers/detokenizers are used in wrap_tokenizer.perl/wrap_detokenizer.perl
2. Flush buffers in truecase.perl and detruecase.perl. This can be achieved by
adding the line
$|++;
to the scripts
Original issue reported on code.google.com by [email protected]
on 11 Sep 2013 at 7:29
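The effect of fix 2 above, flushing after every line so that a process on the other end of a pipe never waits on a full buffer, can be illustrated with a small line filter. This is a Python analogue written for illustration, not the Moses truecase scripts; `filter_lines` is a hypothetical name, and the in-memory streams merely stand in for the pipes.

```python
import io
import sys

def filter_lines(transform, instream=None, outstream=None):
    """Apply transform to each input line and flush after every write."""
    instream = sys.stdin if instream is None else instream
    outstream = sys.stdout if outstream is None else outstream
    for line in instream:
        outstream.write(transform(line.rstrip("\n")) + "\n")
        outstream.flush()  # same intent as $|++ in the Perl scripts

# Demo with in-memory streams standing in for the two ends of a pipe.
source = io.StringIO("hello\nworld\n")
out = io.StringIO()
filter_lines(str.upper, source, out)
```

Without the flush, a caller that writes one line and then blocks reading the reply can hang, which matches the behavior described for open2.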
reinsert
1. perl reinsert.pm f.en < f.moses > f.out
file f.en:
This is <g id="1"> bold and italic and then </g><g id="2"> only italic </g>
text .
file f.moses:
Toto je |0-1| tučné písmo a kurzívu |2-4| a poté pouze |5-7| kurzívou .
|8-10|
file f.out:
Toto je <g id="1"> tučné písmo a kurzívu <g id="2"> a poté pouze </g> </g>
kurzívou .
however, the result should be like:
Toto je <g id="1"> tučné písmo a kurzívu a poté </g><g id="2"> pouze
kurzívou</g> .
Original issue reported on code.google.com by [email protected]
on 22 Aug 2011 at 1:37
1. line from mod_tokenizer: "<x id="1"/> The current ultra-portable ."
2. after remove_markup.pl: " The current ultra-portable ."
3. This causes a problem in filter-model-given-input.pl, since the first character on
the line is a space
The check for whether a line starts or ends with a space should probably be
moved from mod_tokenizer to remove_markup. Please add these two lines:
$line =~ s/^ //;
$line =~ s/ $//;
Original issue reported on code.google.com by [email protected]
on 31 Mar 2011 at 4:02
Subject line says it all.
Original issue reported on code.google.com by [email protected]
on 5 Sep 2013 at 8:21
Currently both m4loc.pm/m4loc_tag.pm run through the entire process of tag
removal/reinsertion/preservation even if the segment to be translated does not
contain markup. This is inefficient and error-prone.
The process should recognize when a segment contains no markup and translate it
with the Moses decoder plain text translation. (tokenization, detokenization
and casing still need to happen)
Original issue reported on code.google.com by [email protected]
on 5 Sep 2013 at 8:20
To reinsert markup we have to have Moses output phrase alignment information
with the -t option. Example:
this is a |0-2| small |4-4| house |3-3| . |5-5|
The traces (source phrase information between vertical bars) will affect the
recaser model to reintroduce correct upper/lowercase. The model relies on a
model based on n-grams.
Workarounds:
* use truecaser
Possible fixes:
* remove traces for recasing and reinsert them after
Original issue reported on code.google.com by [email protected]
on 2 Mar 2011 at 12:15
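The "remove traces for recasing and reinsert them after" fix proposed above could be sketched like this. It is a Python illustration under two assumptions: the recaser is any callable that maps a plain string to a recased string, and it preserves the number of tokens. The names `split_traces` and `recase_with_traces` are hypothetical.

```python
import re

# A Moses trace such as |0-2|, with optional surrounding whitespace.
TRACE = re.compile(r"\s*\|(\d+)-(\d+)\|\s*")

def split_traces(line):
    """Split Moses -t output into (phrase, (start, end)) pairs."""
    parts = TRACE.split(line.strip())
    # parts = [phrase, start, end, phrase, start, end, ..., trailing text]
    return [(parts[i], (int(parts[i + 1]), int(parts[i + 2])))
            for i in range(0, len(parts) - 1, 3)]

def recase_with_traces(line, recase):
    """Recase the plain text only, then put the traces back in place."""
    phrases = split_traces(line)
    plain = " ".join(phrase for phrase, _ in phrases)
    recased = recase(plain).split()  # assumes the token count is unchanged
    out, pos = [], 0
    for phrase, (start, end) in phrases:
        n = len(phrase.split())
        out.append(" ".join(recased[pos:pos + n]) + f" |{start}-{end}|")
        pos += n
    return " ".join(out)

example = "this is a |0-2| small |4-4| house |3-3| . |5-5|"
recased_line = recase_with_traces(example, str.capitalize)
```

Here `str.capitalize` stands in for the recaser model; the n-gram model never sees the traces, so they cannot distort its context.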
Tokenized source file with line:
Text with <g id="1"> code </g> and <g id="2"> bold </g> .
Pseudo-localized file with translation:
ithway odecay |1-2| . |5-5| exttay |0-0| and oldbay |3-4|
Run
reinsert.pl source_tokenized < pseudo_localized
Output:
<g id="1"> ithway odecay . exttay </g> <g id="2"> and oldbay <g id="2"> </g>
Expected output:
<g id="1"> ithway odecay </g> . exttay <g id="2"> and oldbay <g id="2"> </g>
Original issue reported on code.google.com by [email protected]
on 8 Mar 2011 at 9:27
Unlike paired formatting tags, stand-alone tags typically serve two functions:
1. As placeholders for a named entity (and thus semantically significant)
2. As an isolated formatting tag spanning two or more segments (less common)
For case 1, the stand-alone tag should be funneled through the decoder
rather than removed and reinserted in the tag removal/reinsertion case.
Example:
Firefox is a good browser.
<x id="1"/> is a good browser.
is a good browser .
Original issue reported on code.google.com by [email protected]
on 5 Sep 2013 at 8:35
wrap_detokenizer.pm uses temporary files stored in the current directory. On
failure temp* files often stay around.
Expected: wrap_detokenizer.pm storing temporary XML in string like
wrap_tokenizer.pm
Original issue reported on code.google.com by [email protected]
on 7 Mar 2012 at 9:17
1. Open a command window and change into the ./xliff directory
2. Run "xliff2moses.bat .\t\languagetool.xlf en-us" (or "./xliff2moses.bat
.\t\languagetool.xlf en-us" on Unix)
3. Open .\t\languagetool.xlf.tok.en-us in a text editor
4. View line 9
"&Check Text"
Expected:
"& Check Text"
The ampersand is not a problematic character for the Moses decoder. This worked
in an earlier version of the tokenizer.
Original issue reported on code.google.com by [email protected]
on 3 Mar 2011 at 1:45
Currently m4loc.pm/m4loc_tag.pm hard-code an expected recaser model.
Original issue reported on code.google.com by [email protected]
on 21 Aug 2013 at 8:10
With reinsert.pm r116:
Tokenizer ir <g id="0"> programma , kas <g id="1"> </g> <g id="2"> sadala <g
id="3"> </g> </g> ievadīto & $ tekstu teikumos , un teikumus vārdos14 . </g>
Tokenizer |0-0| programma |2-2| have |1-1| to be |4-4| , the |3-3| sadala
|5-5| ievadīto |6-6| & |7-7| $ |8-8| tekstu |9-9| teikumos |10-10| ,
|5-5| |11-11|
and the |12-12| teikumus |13-13| vārdos14 |14-14| . |15-15|
Result is:
Tokenizer <g id="0"> programma have to be , the <g id="1"> <g id="2"> sadala
</g> <g id="3"> ievadīto & $ tekstu teikumos , and the teikumus vārdos14 .
</g> </g> </g>
But it should be:
Tokenizer <g id="0"> programma have to be , the <g id="1"> </g> <g id="2">
sadala <g id="3"> </g> </g> ievadīto & $ tekstu teikumos , and the teikumus
vārdos14 </g>
Original issue reported on code.google.com by [email protected]
on 31 Jan 2012 at 11:35