achimr / m4loc

Moses for Localization

License: GNU Lesser General Public License v3.0

ActionScript 16.88% Python 14.90% Makefile 0.37% Common Lisp 0.29% C 38.83% C++ 3.51% Perl 21.80% CSS 0.52% HTML 0.06% PHP 2.46% Shell 0.29% Forth 0.08%

m4loc's Issues

reinsert.pl: Adds trailing space

1. Copy & paste this text into the file test.de (without double quotes)
"das ist ein <g id="1">kleines</g> haus"
2. Copy & paste this text into the file test.en (without double quotes)
"this is |0-1| a |2-2| small |3-3| house |4-4| "
3. run "reinsert.pl test.de < test.en > test.out"

Resulting content of test.out:
"this is a <g id="1"> small </g> house "

Expected content of test.out:
"this is a <g id="1"> small </g> house" (i.e. no trailing space)

Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 8:38
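
For the trailing-space report above, the traced target line can be read as a sequence of phrases, each followed by a `|start-end|` source span. The following is a minimal Python sketch of that reading, not the logic of reinsert.pl itself; the function names `parse_traced` and `plain_text` are hypothetical. Joining the stripped phrases with single spaces avoids a trailing space by construction.

```python
import re

# Matches a Moses phrase trace such as "|0-1|" (source span start-end).
TRACE = re.compile(r"\|(\d+)-(\d+)\|")

def parse_traced(line):
    """Split traced decoder output into (target_phrase, (start, end)) pairs."""
    phrases = []
    pos = 0
    for m in TRACE.finditer(line):
        phrase = line[pos:m.start()].strip()
        phrases.append((phrase, (int(m.group(1)), int(m.group(2)))))
        pos = m.end()
    return phrases

def plain_text(line):
    """Join the target phrases with single spaces, without trailing whitespace."""
    return " ".join(phrase for phrase, _ in parse_traced(line))

traced = "this is |0-1| a |2-2| small |3-3| house |4-4| "
print(parse_traced(traced))
print(repr(plain_text(traced)))
```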

tikal: Escapes < characters of <mrk> tags on reinsertion

1. tikal.bat -xm .\t\Sample_AlmostEverything_1.2_strict.xlf
2. copy .\t\Sample_AlmostEverything_1.2_strict.xlf.en-us 
.\t\Sample_AlmostEverything_1.2_strict.xlf.fr
3. tikal.bat -lm .\t\Sample_AlmostEverything_1.2_strict.xlf
4. Open the file in an editor and go to the first <trans-unit> element

<alt-trans> unit has the following content:
Text <g id="350263025">text</g>TEXT<g id="350263964">TEXT</g>TEXT<x 
id="350265884"/>TEXT<bx id="350266844"/>TEXT&lt;mrk 
mtype="x-test">text&lt;/mrk> <x id="350267804"/>TEXT<g 
id="350268764">TEXT</g>TEXT.<g id="350268764"></g><

Expected - no escaping of < before the mrk elements:
Text <g id="350263025">text</g>TEXT<g id="350263964">TEXT</g>TEXT<x 
id="350265884"/>TEXT<bx id="350266844"/>TEXT<mrk mtype="x-test">text</mrk> <x 
id="350267804"/>TEXT<g id="350268764">TEXT</g>TEXT.<g id="350268764"></g><

Original issue reported on code.google.com by [email protected] on 9 Mar 2011 at 1:05

markup remover should be able to process bx, ex tags

Tikal can produce g, x, bx and ex tags; however, the markup remover can handle 
g and x tags only.

Probably just a small regexp modification is needed + documentation update
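
As an illustration of the kind of regexp change meant here, the following Python sketch removes g, x, bx and ex tags in one pass. It is not the code of the markup remover; the pattern and the function name `remove_markup` are assumptions based on the tag shapes that appear in these issues.

```python
import re

# Inline tags as they appear in the examples in these issues: paired
# <g ...>...</g> and the stand-alone <x/>, <bx/>, <ex/>.
# The attribute pattern is an assumption.
INLINE_TAG = re.compile(r"</?g\b[^<>]*>|<(?:x|bx|ex)\b[^<>]*/>")

def remove_markup(line):
    """Drop inline tags and normalize the remaining whitespace."""
    without_tags = INLINE_TAG.sub(" ", line)
    return " ".join(without_tags.split())

sample = 'Text <g id="1"> bold </g> and <bx id="2"/> more <ex id="3"/> text <x id="4"/> .'
print(remove_markup(sample))
```

Normalizing whitespace after the substitution also avoids leading or trailing spaces when a line starts or ends with a tag.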

Original issue reported on code.google.com by [email protected] on 26 Jan 2011 at 12:14

reinsert.pl: Opening <g> tag emitted twice leading to duplicate

Tokenized source file with line:
Text with <g id="1"> code </g> and <g id="2"> bold </g> .
Pseudo-localized file with translation:
ithway odecay |1-2| . |5-5| exttay |0-0| and oldbay |3-4|

Run 
reinsert.pl source_tokenized < pseudo_localized

Output:
<g id="1"> ithway odecay . exttay </g> <g id="2"> and oldbay <g id="2"> </g>

<g id="2"> is duplicated. Expected is that only first instance gets emitted.

Original issue reported on code.google.com by [email protected] on 8 Mar 2011 at 9:26

Script integrating all different options

Merge m4loc_tag.pm into m4loc.pm and add options for
- tag preservation vs. tag removal/reinsertion (default: tag preservation)
- greedy vs. non-greedy tag reinsertion (default: non-greedy)
- retaining semantically relevant stand-alone tags in tag removal/reinsertion 
process
- truecasing vs. recasing models

Original issue reported on code.google.com by [email protected] on 5 Sep 2013 at 8:28

wrap_tokenizer is not working with multithreaded tokenizer.perl

perl wrap_tokenizer.pm -t "perl 
/home/larix/moses/scripts/tokenizer/tokenizer.perl" -p "-l cs -threads 2" < try

is not working, but

perl wrap_tokenizer.pm -t "perl 
/home/larix/moses/scripts/tokenizer/tokenizer.perl" -p "-l cs -threads 1" < try


Gives correct output. 
The problem is in the -threads parameter: it does not work with more than one 
thread (only with the default -threads 1).

Original issue reported on code.google.com by [email protected] on 29 Apr 2013 at 3:41

Tokenizer introduces extra spaces with unescaped XML character entities

1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us > 
languagetool.xlf.tok.en-us

Compare line 77 of
languagetool.xlf.en-us:
&lt;br>&lt;b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/>&lt;/b>&lt;br>

languagetool.xlf.tok.en-us:
< br > < b > <x id="1"/> . Line <x id="2"/> , column <x id="3"/> < / b > < br >

Expected output:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>
or
<br> <b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b> <br>


The extra spaces make it hard to distinguish tags from < and > characters in 
sentences like "The temperature is < 12 degrees, but > than 6."
As separate tokens "b" and "br" might be translated into other characters which 
would break the markup.

Original issue reported on code.google.com by [email protected] on 28 Feb 2011 at 9:44

IPC::Open2 is core module, don't list as prerequisite in INSTALL.txt

IPC::Open2 is core module, don't list as prerequisite in INSTALL.txt

Original issue reported on code.google.com by [email protected] on 3 Oct 2013 at 2:58

Sequence <g id="2"><x id="1"/></g> does not get tokenized

1. Open a command window and change into the .\xliff directory
2. Run "xliff2moses.bat .\t\RB-12-Test02.xlf.tok en" (or "./xliff2moses.bat 
./t/RB-12-Test02.xlf.tok en" on Unix) 
3. Open .\t\RB-12-Test02.xlf.tok.en in a text editor
4. View line 7:
"Text with <g id="2"><x id="1"/></g> and more text ."

Expected:
"Text with <g id="2"> <x id="1"/> </g> and more text ."

Original issue reported on code.google.com by [email protected] on 3 Mar 2011 at 1:56
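
The expected output of the untokenized tag sequence above amounts to putting a space between directly adjacent tags. A minimal Python sketch of that step follows; the regular expression and the name `separate_adjacent_tags` are assumptions and not taken from the tokenizer.

```python
import re

# A tag immediately followed by another tag, with no whitespace in between.
ADJACENT_TAG = re.compile(r"(<[^<>]+>)(?=<)")

def separate_adjacent_tags(line):
    """Insert one space between tags that directly follow each other."""
    return ADJACENT_TAG.sub(r"\1 ", line)

print(separate_adjacent_tags('Text with <g id="2"><x id="1"/></g> and more text .'))
```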

mod_detokenizer: system command prevents running on Windows

On Windows:
1. Produce (or edit) a tokenized target language file, e.g. 
languagetool.xlf.ins.fr-fr
2. Run "perl mod_detokenizer.pl -l fr-fr < languagetool.xlf.ins.fr-fr"

Result:
Error message "The system cannot find the path specified."

Cause:
Line 100 in detokenizer.perl 
system("./detokenizer.perl -l $lang < $tmpout 2> /dev/null ");
should be changed to the platform independent
system("perl ./detokenizer.perl -l $lang < $tmpout");

Original issue reported on code.google.com by [email protected] on 4 Mar 2011 at 3:02

mod_tokenizer tokenizes "Don't" as "Don ' t"

1. Open a command window and change into the ./xliff directory
2. Run "xliff2moses.bat .\t\languagetool.xlf en-us" (or "./xliff2moses.bat 
.\t\languagetool.xlf en-us" on Unix) 
3. Open .\t\languagetool.xlf.tok.en-us in a text editor
4. View line 72

"Don ' t put a space after the opening parenthesis"

Expected:
"Don 't put a space after the opening parenthesis"

The regular Moses tokenizer tokenizes it this way, which can be verified by running
"perl tokenizer.perl -l en < .\t\languagetool.xlf.en-us" 
after running the steps above.

Original issue reported on code.google.com by [email protected] on 3 Mar 2011 at 1:51

mod_detokenizer: Moses detokenizer only accepts cs|en|fr|it as input languages

1. Produce (or edit) a tokenized target language file, e.g. 
languagetool.xlf.ins.fr-fr
2. Run "perl mod_detokenizer.pl -l fr-fr < languagetool.xlf.ins.fr-fr"

Result:
Error: "No built-in rules for language fr-fr, claim en for default behaviour. 
at ./detokenizer.perl line 37."

Expected:
Automatic fallback to en. The modified tokenizer should check the language and 
if it is not supported fall back to English (en). Ideally the detokenizer would 
do that already, but we are picking up the script unmodified from Moses.
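
The fallback could look roughly like the following Python sketch. The set of supported languages is taken from the issue title. Trying the primary subtag first (fr-fr to fr) is an added assumption; the issue itself only asks for a fallback to en. The function name `detokenizer_language` is hypothetical.

```python
# Languages named in the issue title; the real set depends on the Moses version.
SUPPORTED = {"cs", "en", "fr", "it"}

def detokenizer_language(code, supported=SUPPORTED, default="en"):
    """Map a locale such as "fr-fr" to a supported language, else the default."""
    primary = code.lower().replace("_", "-").split("-")[0]
    return primary if primary in supported else default

print(detokenizer_language("fr-fr"))
print(detokenizer_language("de-de"))
```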

Original issue reported on code.google.com by [email protected] on 4 Mar 2011 at 3:07

Can't locate IO/Pty.pm

What steps will reproduce the problem?
1. Run m4loc.pl
2.
3.

What is the expected output? What do you see instead?
Expected: tokenized inline file without markup
Error message: 
Can't locate IO/Pty.pm in @INC (@INC contains: C:/strawberry/perl/site/lib C:/st
rawberry/perl/vendor/lib C:/strawberry/perl/lib .) at C:/strawberry/perl/site/li
b/IPC/Run.pm line 1923.


What version of the product are you using? On what operating system?
Win7--downloaded this week thru Git. 


Please provide any additional information below.
Installed all libraries mentioned in Install file. Running Strawberry perl.

Original issue reported on code.google.com by [email protected] on 28 Sep 2012 at 2:16

Word-based reinsertion fails on some cases

Repro steps:
1. Create tokenized source file sourcetok.txt with this content:
Click the search icon <g id="1"></g> to expand the drop-down list .
2. Create word alignment file wa.txt with this content:
0-0 1-1 2-2 3-3 5-4 6-5 8-6 7-7 9-8
3. Create target file target.txt with this content
Clic la búsqueda icono ampliar la lista drop-down .
4. Run
reinsert_wordalign.pl sourcetok.txt wa.txt < target.txt

Resulting output:
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Use of uninitialized value in string eq at 
/home/achim/Documents/work/oss/m4loc/xliff/reinsert_wordalign.pm line 152, 
<STDIN> line 1.
Clic la búsqueda icono ampliar la lista drop-down . <g id="1"> </g>

Expected:
No error messages. Resulting insertion:
Clic la búsqueda icono <g id="1"> </g> ampliar la lista drop-down . 

Remark: Behavior is the same if the tokenizer inserted a space inside the empty 
paired <g> tag:
Click the search icon <g id="1"> </g> to expand the drop-down list .
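
One way to obtain the expected placement is sketched below in Python. The heuristic is an assumption and not the algorithm of reinsert_wordalign.pm: an empty tag pair is placed right after the target token aligned to the closest preceding aligned source token. The function names are hypothetical.

```python
def parse_alignment(text):
    """Parse "0-0 1-1 5-4" into a dict mapping source index -> target indices."""
    alignment = {}
    for pair in text.split():
        src, tgt = pair.split("-")
        alignment.setdefault(int(src), []).append(int(tgt))
    return alignment

def insert_empty_pair(target_tokens, alignment, src_index, open_tag, close_tag):
    """Insert an empty tag pair that precedes source token `src_index`.

    Heuristic (assumed): place the pair right after the target token aligned
    to the closest preceding aligned source token, or at the start if none.
    """
    position = 0
    for i in range(src_index - 1, -1, -1):
        if i in alignment:
            position = max(alignment[i]) + 1
            break
    return target_tokens[:position] + [open_tag, close_tag] + target_tokens[position:]

alignment = parse_alignment("0-0 1-1 2-2 3-3 5-4 6-5 8-6 7-7 9-8")
target = "Clic la búsqueda icono ampliar la lista drop-down .".split()
# The empty <g> pair sits before source token 4 ("to"), which is unaligned.
print(" ".join(insert_empty_pair(target, alignment, 4, '<g id="1">', "</g>")))
```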

Original issue reported on code.google.com by [email protected] on 25 Sep 2013 at 7:50

reinserter added extra tag

input:
-------------------------------------------------
That agreement shall be negotiated with European Union <g id="1"> Minister for 
<g id="2"> Foreign </g> Affairs . </g>
-------------------------------------------------
(1 line)

moses output:
-------------------------------------------------
Šo līgumu apspriež |0-4| ar |5-5| Eiropas Savienības |6-7| ārlietu 
ministru |8-11|. |12-12|
-------------------------------------------------

command: 
perl reinsert.pm 1file < moses_file

result:
-------------------------------------------------
Šo līgumu apspriež ar Eiropas Savienības <g id="1"> <g id="2"> </g> 
ārlietu ministru . </g> <g id="2">
-------------------------------------------------

Problem: last tag (<g id="2">) should not have been there at all.

Original issue reported on code.google.com by [email protected] on 7 Oct 2011 at 7:44

Merged into: #28

Escape handling broken for some XML/HTML character entities

Repro steps:
1. Extract Moses InlineText from languagetool.xlf
tikal -xm -to languagetool.xlf.en-us languagetool.xlf
2. Translate the text to Spanish with the small En-Es MT system provided
perl ~/Documents/work/oss/m4loc/xliff/m4loc.pm -o p -n -s en -t es -m 
~/.../binarized_model/moses.ini -c ~/.../data/truecase-model.en < 
languagetool.xlf.en-us > languagetool.xlf.pb.tok.es

Observations: 
Line 9 Source:
&amp;Check Text
Line 9 Target:
& cheque texto
Expected: Target ampersand also escaped

Line 16 Source:
Word repetition (e.g. 'will will')
Line 16 Target:
Palabra la repetición &apos; ( por ejemplo , se va a &apos; )
Expected: Quotes not escaped in target

Line 32 Target:
Punto : &quot; <x id="1"/> &quot; ( <x id="2"/> ) significa <x id="3"/> ( <x 
id="4"/> ) .
This last issue does not occur with the tag-fixed method of tag handling.

Remark: Try to run deescape-special-chars.perl from Moses scripts/tokenizer 
folder

Original issue reported on code.google.com by [email protected] on 23 Sep 2013 at 7:51

Differing translations between word-/phrase-alignment and tag fixed methods even for continuous phrases

Translating line 2 of Sample_AlmostEverything_1.2_strict.xlf.en-us results in 
two different translations:
Word-based:
<x id="1"/> Se interpone para <x id="2"/> .
Tag-fixed:
<x id="1"/> tribuna para <x id="2"/> .

Is this due to just the punctuation being included or not in the phrase 
translation?

Check by translating raw text.

Original issue reported on code.google.com by [email protected] on 23 Sep 2013 at 8:09

xliff2moses.sh source and target language should be identified via options

Source and target language codes should be passed as options; otherwise Tikal 
works with some OS-specific variables. This would also make the whole process 
more readable.

Currently, only the target language is identified via an option.

Original issue reported on code.google.com by [email protected] on 11 Mar 2011 at 12:37

reinsert.pl: Documentation updates necessary

Documentation should mention that input and output are tokenized (also in 
command line help)
Documentation on wiki does not list the principles that are used for reinsertion

Original issue reported on code.google.com by [email protected] on 12 Apr 2011 at 1:15

tikal does not replace generic placeholders with original placeholders

1. tikal.bat -xm .\t\languagetool.xlf
2. copy .\t\languagetool.xlf.en-us .\t\languagetool
3. tikal.bat -lm .\t\languagetool.xlf

Have a look at the resulting languagetool.out.xlf - line 58-60:
<alt-trans match-quality="10" origin="Moses-MT" 
xmlns:okp="okapi-framework:xliff-extensions" okp:matchType="MT">
<target xml:lang="fr-fr">Check done, <x id="1"/> potential problems 
found</target>
</alt-trans>

Expected:
Check done, <ph id="1">{0}</ph> potential problems found
as in source segment
i.e. the interim <x> tag to be reverted back into a <ph> tag

Original issue reported on code.google.com by [email protected] on 7 Mar 2011 at 8:43

reinsert - erroneous reinsertion

cat 1 1.mos 
asomething something . <g id="1"> </g> something something . 

asomething |0-0| something |1-1|. |2-2| something |3-3| something |4-4|. |5-5|


perl reinsert.pm 1 < 1.mos 
asomething something . <g id="1"> something </g> something .

However, it should be:
asomething something . <g id="1"> </g> something  something .

Original issue reported on code.google.com by [email protected] on 4 Jan 2012 at 2:20

./xliff2moses.sh: line 25: tikal.sh: command not found

What steps will reproduce the problem?
1. Running script above

What is the expected output? What do you see instead?
Tikal.sh is present under okapi/okapi-apps_gtk2-linux-x86_0.14. I also copied 
Tikal.sh to m4loc folder.

What version of the product are you using? On what operating system?

r106. Ubuntu 11




Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 11 Dec 2011 at 11:13

wrap_tokenizer/wrap_detokenizer uses tokenizer.perl/detokenizer.perl instead of .pm

To avoid code duplication we want to switch to the modulino versions of 
tokenizer/detokenizer

Original issue reported on code.google.com by [email protected] on 7 Mar 2012 at 9:27

incorrectly created XLIFF; consequence

echo "&lt;p>You are now logged" | mod_tokenizer

result is:
&lt;p&gt;You are now logged

should be:
&lt;p&gt; You are now logged
(space between &gt; and You)

Should be done by some XML entity recognizer?

Original issue reported on code.google.com by [email protected] on 2 Mar 2011 at 5:52

mod_detokenizer.pl cannot be called from outside containing folder

What steps will reproduce the problem?
1. change to a folder outside of trunk/xliff
2. Copy & paste the following text into a file called test.tok.en:
this is a <g id="1"> small </g> house
3. call "path/to/trunk/xliff/mod_detokenizer.pl < test.tok.en"

Result:
Error message: 'Can't open perl script "detokenizer.perl": No such file or 
directory'

Expected:
mod_detokenizer.pl to run w/o error message

Fix: 
Detect executing folder with:
use FindBin qw($Bin);
then call detokenizer.perl in the system() call using the path in $Bin

Workaround:
Call mod_detokenizer.pl from containing folder

Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 8:19

problems with the reinserter

input:
-------------------------------------------------
<g id="1"> <g id="2"> Without . </g> </g>
-------------------------------------------------


moses output:
-------------------------------------------------
Neskarot |0-0| . |1-1|
-------------------------------------------------

command: 
perl reinsert.pm 1file < moses_file

result:
-------------------------------------------------
<g id="1"> <g id="2"> Neskarot </g> . </g> <g id="2"> 
-------------------------------------------------

Problems: 
1. last tag (<g id="2">) should not have been there at all
2. both closing tags(</g>) should be placed on the same position

Original issue reported on code.google.com by [email protected] on 7 Oct 2011 at 8:23

Script testset.pl: when defining -n option twice test set creation fails without error message

The command:
perl ~/scripts/m4loc/corpus/testset.pl -n 200 -o automotive_test.en.txt -n 
automotive_held.en.txt < automotive_min.en.txt > automotive_index.txt

Produced 3 files:
automotive_test.en.txt was empty
automotive_index.txt was empty
test.hld contained the content from automotive_min.en.txt

Original issue reported on code.google.com by [email protected] on 28 Feb 2013 at 1:35

reinsert.pm and tag ordering

cat 1
hello <g id="1">  how are you <x id="0"/> </g>  ?


cat 1.mos
HELLO |0-0| HOW |1-1| ARE |2-2| YOU |3-3| ? |4-4|

perl reinsert.pm 1 < 1.mos
HELLO <g id="1"> HOW ARE YOU </g> <x id="0"/> ?

Expected result:
HELLO <g id="1"> HOW ARE YOU <x id="0"/> </g> ?

<x /> and </g> tags are re-ordered

Original issue reported on code.google.com by [email protected] on 1 Feb 2012 at 2:09

agreement on modulinos interface

It would be nice to have the same interface (way of calling and line processing) 
in each script: the same treatment of STDIN, STDOUT and STDERR and the same 
names for line-processing functions ...

Original issue reported on code.google.com by [email protected] on 24 Jan 2012 at 9:30

mod_tokenizer fails on inline elements on Windows

Environment: Windows 7 64-bit, Strawberry Perl 5.10.1.4 32-bit, Okapi Tools M10

What steps will reproduce the problem?
1. Download XLIFF test file 
http://okapi.googlecode.com/svn/trunk/okapi/filters/xliff/src/test/resources/RB-
11-Test01.xlf 
2. Open a command window and change directory to where the downloaded file is 
located
3. Extract translatable text with "tikal -xm RB-11-Test01.xlf"
4. Try to tokenize extracted text "perl mod_tokenizer.pl -l en < 
RB-11-Test01.xlf.en > RB-11-Test01.tok.en"

Result:
Error message: "The system cannot find the path specified."
The contents of the file RB-11-Test01.tok.en are (in 2 lines):
 -n "Paragraph. <g id="1">code</g>  <g id="2">bold</g> -n ". <x id="1"/> -n " and more text <g id="2"><x id="1"/></g> -n " and more text. 
 <x id="1"/> -n "

Expected:
No error message and the output file to contain properly tokenized text.

Remark:
Removing XLIFF inline elements from the input file removes the error message, 
but does not fix the corrupted output.

Original issue reported on code.google.com by [email protected] on 25 Jan 2011 at 7:33

reinserter.pl - improve usage help

If the command is mistyped, the usage help is displayed:

Usage: ./reinsert.pl <source InlineText file> < <target plain text file > > 
<target InlineText file>


However, it is hard to tell what symbols like "<" or "> >" mean. Maybe:

Usage: ./reinsert.pl source_InlineText_file  < target_plain_text_file  > 
target_InlineText_file

would be better, or is there another solution?

Original issue reported on code.google.com by [email protected] on 7 Mar 2011 at 4:36

mod_tokenizer.pl - problem with 'wide' utf-8 characters

1. cat oo
DdNé_Local Corrections
2. ./modtokenizer < oo
Wide character in print at /home/moses/m4loc/xliff/./mod_tokenizer.pl line 138, 
<STDIN> line 1.

This was not an issue in the "file-based" approach, so I suppose the problem is 
just how the string passed to LibXML::Reader is encoded (probably 
http://perldoc.perl.org/Encode.html). I will elaborate on this next week.

Tomas

Original issue reported on code.google.com by [email protected] on 15 Jul 2011 at 12:21

The input file 'Computer' has no extension to guess the filter from.

What steps will reproduce the problem?
Running ./tikal.sh -xm Computer Threats FAQ.xlf -sl en-us -tl es -trace

What is the expected output? What do you see instead?
Should get InlineText files. Get error msg below instead. 

rubo@rubo-Dell-System-XPS-L702X:~/tools/m4loc/xliff$ ./tikal.sh -xm Computer 
Threats FAQ.xlf -sl en-us -tl es -trace
-------------------------------------------------------------------------------
Okapi Tikal - Localization Toolset
Version: 2.0.15
-------------------------------------------------------------------------------
Extraction to Moses InlineText
java.lang.RuntimeException: The input file 'Computer' has no extension to guess 
the filter from.
    at net.sf.okapi.applications.tikal.Main.guessMissingParameters(Main.java:843)
    at net.sf.okapi.applications.tikal.Main.process(Main.java:957)
    at net.sf.okapi.applications.tikal.Main.main(Main.java:538)
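
The error message names 'Computer' as the input file, which suggests that the unquoted file name was split at its spaces before it reached Tikal. The Python sketch below uses `shlex.split`, which follows POSIX shell splitting rules, to show the difference quoting makes. The quoted command line is an assumption about what was intended, not a confirmed fix.

```python
import shlex

unquoted = "./tikal.sh -xm Computer Threats FAQ.xlf -sl en-us -tl es -trace"
quoted = './tikal.sh -xm "Computer Threats FAQ.xlf" -sl en-us -tl es -trace'

# Without quotes the file name arrives as three separate arguments.
print(shlex.split(unquoted)[2:5])
# With quotes it arrives as a single argument.
print(shlex.split(quoted)[2])
```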


What version of the product are you using? On what operating system?


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 29 Jan 2012 at 8:15

tikal leveraging with ASCII input leads to Windows-1252 encoded XLIFF file

1. tikal.bat -xm .\t\languagetool.xlf
2. copy .\t\languagetool.xlf.en-us .\t\languagetool.xlf.fr-fr
3. tikal.bat -lm .\t\languagetool.xlf

Observation: first line of resulting languagetool.out.xlf is:
<?xml version="1.0" encoding="windows-1252"?>[...]
File encoding is ASCII/Windows-1252 (no characters outside ASCII range in file)

Expected:
File to be UTF-8 encoded - first line:
<?xml version="1.0" encoding="utf-8"?>

The file encoding can be forced by adding the option -oe utf8 to the tikal 
command in line 3.

Original issue reported on code.google.com by [email protected] on 8 Mar 2011 at 8:06

No results found when running TMX2MosesCorpusTool on windows XP

What steps will reproduce the problem?
1. Download the TMX2MosesCorpusTool
2. Install on my Windows XP machine
3. Fill in the required fields such as target language, export path, etc.
4. Import the TMX (I tried two TMX files, one small and one large)

I expected the imported TMX to be converted into a corpus for training 
Moses, but I did not find any result in the assigned path.


AMT_CleaningTool_20120519.air
Windows XP SP3


Please advise ASAP.
Thanks

Original issue reported on code.google.com by [email protected] on 3 Jun 2012 at 3:11

handling of special non-textual strings from wrongly created XLIFF

Quite common are strings from Tikal like this:
<x id="1"/> &lt;b&gt; <x id="2"/> Uživatelé systému <x id="3"/> &lt;/b&gt; 
<x id="4"/> 

This is a problem with some CAT tools that create valid but poor XLIFF. Trados 
is a real expert at creating bad XLIFF :). A correctly created string would 
look like:
<x id="1"/> Uživatelé systému <x id="2"/>

We need to preserve the information in &lt;b&gt; and &lt;/b&gt;, but it should 
not go into Moses. I can see one possibility: 
mod_tokenizer will try to identify such strings and wrap them in a special 
tag (e.g. <hide>&lt;b&gt; </hide>)

remove_markup should remove it (the tag together with its whole content)

markup_reinserter would reinsert it (again, the whole tag)


Or, is there some better solution?
Tomas

Original issue reported on code.google.com by [email protected] on 4 Mar 2011 at 5:00

No <,>,[,] or | or non-printing characters should be output by the tokenizer

1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us > 
languagetool.xlf.tok.en-us

Compare line 77 of
languagetool.xlf.en-us:
&lt;br>&lt;b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/>&lt;/b>&lt;br>

languagetool.xlf.tok.en-us:
< br > < b > <x id="1"/> . Line <x id="2"/> , column <x id="3"/> < / b > < br >

Later the <x> tags will be removed, but the remaining < and > characters around 
the b tags will create problems with Moses. See
http://article.gmane.org/gmane.comp.nlp.moses.user/4123
(this is only from Feb-14, so different from what we discussed earlier)

Therefore expected:
&lt;br&gt;&lt;b&gt; <x id="1"/>. Line <x id="2"/>, column <x 
id="3"/>&lt;/b&gt;&lt;br&gt;
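
A possible shape for this is sketched below in Python: inline tags are left untouched and the listed characters are escaped everywhere else. The tag pattern and the function name `escape_outside_tags` are assumptions, and the numeric entities for [, ] and | follow the convention of the Moses escape-special-chars.perl script rather than anything stated in this issue. Non-printing characters are not handled here.

```python
import re

# Inline tags to protect; one capturing group so re.split keeps the tags.
INLINE_TAG = re.compile(r"(</?g\b[^<>]*>|<(?:x|bx|ex)\b[^<>]*/>)")

# Replacements for the characters named in the issue title (assumed mapping).
ESCAPES = {"<": "&lt;", ">": "&gt;", "[": "&#91;", "]": "&#93;", "|": "&#124;"}

def escape_outside_tags(line):
    """Escape <, >, [, ] and | everywhere except inside inline tags."""
    parts = INLINE_TAG.split(line)
    # With one capturing group, even-numbered parts are text, odd ones are tags.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"[<>\[\]|]", lambda m: ESCAPES[m.group(0)], parts[i])
    return "".join(parts)

print(escape_outside_tags('< br > <x id="1"/> . Line <x id="2"/> , column'))
```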

Original issue reported on code.google.com by [email protected] on 24 Feb 2011 at 10:09

Space added at beginning of first line of tokenized file

1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us > 
languagetool.xlf.tok.en-us

Compare first line of 
languagetool.xlf.en-us:
Belarusian

languagetool.xlf.tok.en-us:
 Belarusian

Note the extra space before the first token. Expected: no space

Original issue reported on code.google.com by [email protected] on 24 Feb 2011 at 10:21

Truecasing scripts in Moses v1.0 buffer input/output without flushing

The truecase.perl and detruecase.perl scripts in Moses v1.0 buffer input and/or 
output and when integrated with pipes using the open2 function hang. Possible 
fixes:

1. Use IPC::Run to interface with these scripts, similar to how the Moses 
tokenizers/detokenizers are used in wrap_tokenizer.perl/wrap_detokenizer.perl

2. Flush buffers in truecase.perl and detruecase.perl. This can be achieved by 
adding the line 
$|++;
to the scripts

Original issue reported on code.google.com by [email protected] on 11 Sep 2013 at 7:29

reinserter fails

reinsert
1. perl reinsert.pm f.en < f.moses > f.out

file f.en:
This is <g id="1"> bold and italic and then </g><g id="2"> only italic </g> 
text .

file f.moses:
Toto je |0-1| tučné písmo a kurzívu |2-4| a poté pouze |5-7| kurzívou . 
|8-10|

file f.out:
Toto je <g id="1"> tučné písmo a kurzívu <g id="2"> a poté pouze </g> </g> 
kurzívou .


however, the result should be like:
Toto je <g id="1"> tučné písmo a kurzívu a poté </g><g id="2">  pouze 
kurzívou</g> .

Original issue reported on code.google.com by [email protected] on 22 Aug 2011 at 1:37

markup remover - check empty space on beginning and end of lines

1. line from mod_tokenizer: "<x id="1"/> The current ultra-portable ."
2. after remove_markup.pl: " The current ultra-portable ."
3. This causes a problem in filter-model-given-input.pl, since the first 
character on the line is a space


Checking whether a line starts or ends with a space should probably be 
moved from mod_tokenizer to remove_markup. Please add these two lines:
$line =~ s/^ //;
$line =~ s/ $//;

Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 4:02

Add reinsert_greedy.pm as option to m4loc.pm

Subject line says it all.

Original issue reported on code.google.com by [email protected] on 5 Sep 2013 at 8:21

Recognize input with no markup

Currently both m4loc.pm/m4loc_tag.pm run through the entire process of tag 
removal/reinsertion/preservation even if the segment to be translated does not 
contain markup. This is inefficient and error-prone.

The process should recognize when a segment contains no markup and translate it 
with the Moses decoder plain text translation. (tokenization, detokenization 
and casing still need to happen)
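
The check itself can be small. The Python sketch below shows one way to detect inline markup in a segment; the tag pattern and the names `has_markup` and `choose_pipeline` are assumptions for illustration and do not come from m4loc.pm.

```python
import re

# Inline tags as they appear in these issues (assumed pattern).
INLINE_TAG = re.compile(r"</?g\b[^<>]*>|<(?:x|bx|ex)\b[^<>]*/>")

def has_markup(segment):
    """True if the segment contains at least one inline tag."""
    return INLINE_TAG.search(segment) is not None

def choose_pipeline(segment):
    """Return a label for the route; the labels are illustrative only."""
    return "markup" if has_markup(segment) else "plain"

print(choose_pipeline('<x id="1"/> is a good browser.'))
print(choose_pipeline("The temperature is < 12 degrees, but > than 6."))
```

Literal < and > characters in running text do not count as markup under this pattern, so such segments take the plain route.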

Original issue reported on code.google.com by [email protected] on 5 Sep 2013 at 8:20

Casing correctness affected when using recaser with traced Moses output

To reinsert markup we have to have Moses output phrase alignment information 
with the -t option. Example:
this is a |0-2| small |4-4| house |3-3| . |5-5|

The traces (source phrase information between vertical bars) will interfere 
with the recaser's ability to restore correct upper/lowercase, because the 
recaser relies on an n-gram based model.

Workarounds:
*  use truecaser
Possible fixes:
*  remove traces for recasing and reinsert them after
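
The possible fix can be sketched as follows in Python. The sketch assumes the recaser keeps the number of tokens unchanged, which is what makes it safe to put each trace back by token count; the function names are hypothetical and the capitalization step is only a stand-in for a real recaser.

```python
import re

TRACE = re.compile(r"\|\d+-\d+\|")

def split_traces(line):
    """Return (tokens, traces): traces are (token_count_before, trace) pairs."""
    tokens, traces = [], []
    for item in line.split():
        if TRACE.fullmatch(item):
            traces.append((len(tokens), item))
        else:
            tokens.append(item)
    return tokens, traces

def merge_traces(tokens, traces):
    """Put each trace back after the same number of tokens as before."""
    out, next_trace = [], 0
    for i, token in enumerate(tokens, start=1):
        out.append(token)
        while next_trace < len(traces) and traces[next_trace][0] == i:
            out.append(traces[next_trace][1])
            next_trace += 1
    return " ".join(out)

traced = "this is a |0-2| small |4-4| house |3-3| . |5-5|"
tokens, traces = split_traces(traced)
recased = [tokens[0].capitalize()] + tokens[1:]  # stand-in for a real recaser
print(merge_traces(recased, traces))
```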

Original issue reported on code.google.com by [email protected] on 2 Mar 2011 at 12:15

reinsert.pl: Closing </g> tag placed too late

Tokenized source file with line:
Text with <g id="1"> code </g> and <g id="2"> bold </g> .
Pseudo-localized file with translation:
ithway odecay |1-2| . |5-5| exttay |0-0| and oldbay |3-4|

Run 
reinsert.pl source_tokenized < pseudo_localized

Output:
<g id="1"> ithway odecay . exttay </g> <g id="2"> and oldbay <g id="2"> </g>

Expected output:
<g id="1"> ithway odecay </g> . exttay <g id="2"> and oldbay <g id="2"> </g>

Original issue reported on code.google.com by [email protected] on 8 Mar 2011 at 9:27

Retain semantically relevant stand-alone tags in tag removal/reinsertion process

Unlike paired formatting tags, stand-alone tags typically serve two functions:
1. As placeholders for a named entity (as such having semantic significance)
2. As an isolated formatting tag spanning two or more segments (less common)

In case 1, the stand-alone tag should be funneled through the decoder 
rather than removed and reinserted in the tag removal/reinsertion case.

Example:
Firefox is a good browser.
<x id="1"/> is a good browser.
is a good browser .

Original issue reported on code.google.com by [email protected] on 5 Sep 2013 at 8:35

wrap_detokenizer.pm uses temporary files

wrap_detokenizer.pm uses temporary files stored in the current directory. On 
failure temp* files often stay around.

Expected: wrap_detokenizer.pm storing temporary XML in string like 
wrap_tokenizer.pm

Original issue reported on code.google.com by [email protected] on 7 Mar 2012 at 9:17

mod_tokenizer is not unescaping & (not a problematic character for Moses)

1. Open a command window and change into the ./xliff directory
2. Run "xliff2moses.bat .\t\languagetool.xlf en-us" (or "./xliff2moses.bat 
.\t\languagetool.xlf en-us" on Unix) 
3. Open .\t\languagetool.xlf.tok.en-us in a text editor
4. View line 9

"&amp;Check Text"

Expected:
"& Check Text"

The ampersand is not a problematic character for the Moses decoder. This worked 
in an earlier version of the tokenizer.

Original issue reported on code.google.com by [email protected] on 3 Mar 2011 at 1:45

For recasing allow true-case/no case models as well

Currently m4loc.pm/m4loc_tag.pm hard-code an expected recaser model.

Original issue reported on code.google.com by [email protected] on 21 Aug 2013 at 8:10

Wrong insertion of closing </g> tags for <g> tag pairs that span zero tokens

With reinsert.pm r116:

Tokenizer ir <g id="0"> programma , kas <g id="1"> </g> <g id="2"> sadala <g 
id="3"> </g> </g> ievadīto & $ tekstu teikumos , un teikumus vārdos14 . </g>

Tokenizer |0-0| programma |2-2| have |1-1| to be |4-4| , the |3-3| sadala
|5-5| ievadīto |6-6| & |7-7| $ |8-8| tekstu |9-9| teikumos |10-10| ,
|5-5| |11-11|
and the |12-12| teikumus |13-13| vārdos14 |14-14| . |15-15|

Result is:
Tokenizer <g id="0"> programma have to be , the <g id="1"> <g id="2"> sadala 
</g> <g id="3"> ievadīto & $ tekstu teikumos , and the teikumus vārdos14 .
</g> </g> </g>

But it should be:
Tokenizer <g id="0"> programma have to be , the <g id="1"> </g>  <g id="2"> 
sadala <g id="3"> </g> </g> ievadīto & $ tekstu teikumos , and the teikumus
vārdos14 </g>

Original issue reported on code.google.com by [email protected] on 31 Jan 2012 at 11:35