mw10104587 / berkeleyaligner

Automatically exported from code.google.com/p/berkeleyaligner

Eiffel 22.41% Forth 4.32% Fortran 22.51% Shell 0.02% Java 50.71% Ruby 0.04%

berkeleyaligner's Introduction

#############################
# The Berkeley Word Aligner #
#############################

The Berkeley Word Aligner is a statistical machine translation tool that automatically aligns words in a sentence-aligned parallel corpus.

-------------
Install & Run
-------------

The version you have downloaded is primarily intended for people interested in extending or helping develop the aligner. To generate the distribution meant for end users (who just want to align words), you need to compile the package by running ant (http://ant.apache.org/):

  ant

The directory called "distribution" will then contain further instructions on running the tool.

-----------
Information
-----------

For more information about the package as a whole, please visit:

  http://nlp.cs.berkeley.edu/pages/wordaligner.html

Information related to the development of this package resides online:

  http://code.google.com/p/berkeleyaligner

berkeleyaligner's Issues

ArrayIndexOutOfBoundsException?

java -server -mx1000m -jar berkeleyaligner.jar ++example.conf
main() {
  Execution directory: output_emille
  Preparing Training Data {
    ERROR: java.lang.ArrayIndexOutOfBoundsException: 1:
edu.berkeley.nlp.mt.Alignment.parseAlignments(Alignment.java:818)
edu.berkeley.nlp.mt.SentencePairReader$PairIterator.addAlignmentToPair(SentencePairReader.java:386)
edu.berkeley.nlp.mt.SentencePairReader$PairIterator.loadNext(SentencePairReader.java:472)
edu.berkeley.nlp.mt.SentencePairReader$PairIterator.<init>(SentencePairReader.java:441)
edu.berkeley.nlp.mt.SentencePairReader$1.newInstance(SentencePairReader.java:601)
edu.berkeley.nlp.mt.SentencePairReader$1.newInstance(SentencePairReader.java:598)
edu.berkeley.nlp.util.Iterators$IteratorIterator.getNextIterator(Iterators.java:202)
edu.berkeley.nlp.util.Iterators$IteratorIterator.<init>(Iterators.java:195)
edu.berkeley.nlp.mt.SentencePairReader.getSentencePairIteratorFromSource(SentencePairReader.java:605)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot$1.newInstance(SentencePairReader.java:92)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot$1.newInstance(SentencePairReader.java:89)
edu.berkeley.nlp.util.Iterators$IteratorIterator.getNextIterator(Iterators.java:202)
edu.berkeley.nlp.util.Iterators$IteratorIterator.<init>(Iterators.java:195)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.iterator(SentencePairReader.java:98)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.asList(SentencePairReader.java:120)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.loadSentenceCache(SentencePairReader.java:137)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.<init>(SentencePairReader.java:65)
edu.berkeley.nlp.mt.SentencePairReader.pairDepotFromSources(SentencePairReader.java:616)
edu.berkeley.nlp.wordAlignment.Data.loadData(Data.java:77)
edu.berkeley.nlp.wordAlignment.Main.run(Main.java:101)
fig.exec.Execution.runWithObjArray(Execution.java:466)
fig.exec.Execution.run(Execution.java:419)
edu.berkeley.nlp.wordAlignment.Main.main(Main.java:95)
    Execution directory: output_emille
1 errors, 0 warnings
  }


Note:
I wrote a Python script to convert the EMILLE corpus into the
example/train and example/test layout, so that each sentence line matches
the <s snum=3>asdasd</s> style.

I am running the latest unsupervised version on Fedora 10.

Original issue reported on code.google.com by [email protected] on 21 Jan 2010 at 5:15
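
The exception above is raised in Alignment.parseAlignments with index 1, which would be consistent with an alignment token that does not split into two parts. That is a guess from the trace, not a confirmed diagnosis. A minimal Python sketch for checking converted files before running the aligner, assuming sentence lines follow the <s snum=N>...</s> style mentioned in the note and alignment lines are whitespace-separated i-j pairs (the format shown in the training.en-fr.align excerpts of a later issue); the function names are hypothetical:

```python
import re

# Assumed formats, inferred from this page only:
#   sentence line:  <s snum=3>some words</s>
#   alignment line: whitespace-separated "i-j" index pairs
SENTENCE_RE = re.compile(r"^<s snum=(\d+)>(.*)</s>\s*$")
PAIR_RE = re.compile(r"^\d+-\d+$")


def bad_sentence_lines(lines):
    """Return 1-based numbers of lines that do not match <s snum=N>...</s>."""
    return [n for n, line in enumerate(lines, start=1)
            if not SENTENCE_RE.match(line)]


def bad_alignment_tokens(lines):
    """Return (line_number, token) for every token that is not 'i-j'."""
    bad = []
    for n, line in enumerate(lines, start=1):
        for token in line.split():
            if not PAIR_RE.match(token):
                bad.append((n, token))
    return bad
```

Running both functions over the converted files should point at any line the reader might choke on; if the real alignment format differs from the assumption, PAIR_RE needs adjusting.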

No alignment score

Currently, the output alignment has no alignment score associated with it.
Would you please add this feature to the Berkeley Aligner?

Original issue reported on code.google.com by [email protected] on 18 Sep 2008 at 12:50

Stage#.#params.txt files might miss content

What is the expected output? What do you see instead?
Sometimes the files that store the word translation probabilities end
before the whole model has been written.
The likely cause is that the close method of the output stream is never called.

Example of the problem:
francisco   entropy 0   nTrans 8    sum 1.000000
  san_francisco: 1.000000
list    entropy 0   nTrans 3    sum 1.000000
  all: 1.000000
the entropy 0.444   nTrans 134  sum 1.000000
  answer: 0.840075
  all: 0.159301
  loc_2: 0.000624
massachusetts   entropy 0.500   nTrans 13   sum 1.000000
  massachusetts: 0.800001
  ma: 0.199999
in  entropy 0.288   nTrans 113  sum 1.000000
  loc_2: 0.916384
  any: 0.083493
  stateid: 0.000123
  answe


What version of the product are you using? On what operating system?
I am using the Unsupervised aligner version 2.1.
The source code for this version is not available.
Would you be able to release it?

Original issue reported on code.google.com by [email protected] on 29 Jul 2015 at 8:27
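
Until the writer is fixed, a truncated parameter file can at least be detected. The following is a minimal Python sketch, assuming the file layout is exactly what the excerpt above shows (a header line per word followed by indented "translation: probability" lines with fixed-point numbers); the regular expressions and the function name are hypothetical and inferred only from that excerpt:

```python
import re

# Assumed layout, inferred only from the excerpt in this issue:
#   <word> entropy <x> nTrans <n> sum <s>
#     <translation>: <probability>
HEADER_RE = re.compile(
    r"^(\S+)\s+entropy\s+(\S+)\s+nTrans\s+(\d+)\s+sum\s+(\S+)$")
ENTRY_RE = re.compile(r"^\s+(\S+):\s+(\d+\.\d+)$")


def first_malformed_line(lines):
    """Return the 1-based number of the first line that is neither a header
    nor a complete translation entry, or None if every line parses."""
    for n, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # ignore blank lines
        if not (HEADER_RE.match(text) or ENTRY_RE.match(text)):
            return n
    return None
```

On the excerpt above, the final line ("  answe") is the one reported. The check cannot detect a file that happens to be cut off exactly at a line boundary or in the middle of a number's decimal digits.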

java.lang.AssertionError

I'm using the unsupervised version 2.1 available from the repository. Here is
my command line:

/usr/bin/java -server \
  -Xms1024m \
  -Xmx2048m \
  -Xss768k \
  -ea \
  -jar /usr/local/bin/berkeleyaligner.jar \
  -EMWordAligner.numThreads 6 \
  -Data.trainSources /opt/library/BUILDS/tm/demo_tm/bitext.list \
  -Data.foreignSuffix nl \
  -Data.englishSuffix en \
  -Data.testSources \
  -exec.execDir /opt/library/TRAININGS/alignments/align-demo_tm-en-nl/berk.classes \
  -exec.create True \
  -Evaluator.writeGIZA True \
  -Main.SaveParams True \
  -Main.alignTraining True \
  -Main.forwardModels MODEL1 HMM \
  -Main.reverseModels MODEL1 HMM \
  -Main.iters 5 5 \
  -Main.mode JOINT JOINT

This is a small 40,000 phrase-pair corpus for testing and development. The
machine is a server with a 6-core AMD Opteron, 16 GB RAM, and 1 TB of available
hard drive space. The Java/OS version is as follows:

  user@moses0:~$ java -version
  java version "1.6.0_20"
  OpenJDK Runtime Environment (IcedTea6 1.9.10) (6b20-1.9.10-0ubuntu1~10.04.3)
  OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

I'm testing a workflow. So, I run the same command on the same corpus multiple 
times. Each time, the previous output folder is deleted. 

Most times, this command completes training successfully. Sometimes, however, 
it fails with an AssertionError. The location of the failure is a different 
phrase-pair each time. It always, however, fails during the first iteration of 
model 1.

main() {
  Execution directory: /opt/library/TRAININGS/alignments/align-demo_tm-en-nl/berk.classes
  Preparing Training Data [2.3s, cum. 2.4s]
  41410 training, 0 test
  Training models: 2 stages {
    Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
      Initializing forward model [9.1s, cum. 9.1s]
      Initializing reverse model [7.9s, cum. 17s]
      Joint Train: 41410 sentences, jointly {
        Iteration 1/5 {
          Sentence 2/41410
          Sentence 1/41410
          Sentence 5/41410
          Sentence 13/41410
          WARNING: Translation model update concurrency error
          Sentence 54/41410
          WARNING: Translation model update concurrency error
          Sentence 207/41410
          WARNING: Translation model update concurrency error
          WARNING: Translation model update concurrency error
          ERROR: java.lang.AssertionError:
fig.basic.StringDoubleMap.find(StringDoubleMap.java:397)
fig.basic.StringDoubleMap.incr(StringDoubleMap.java:78)
fig.basic.String2DoubleMap.incr(String2DoubleMap.java:51)
edu.berkeley.nlp.wordAlignment.SentencePairState.updateTransProbs(SentencePairState.java:79)
edu.berkeley.nlp.wordAlignment.distortion.Model1or2SentencePairState.updateNewParams(Model1or2SentencePairState.java:91)
edu.berkeley.nlp.wordAlignment.EMWordAligner$1.run(EMWordAligner.java:231)
edu.berkeley.nlp.concurrent.WorkQueue$1.run(WorkQueue.java:70)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:636)
1 errors, 4 warnings
          ... 585 lines omitted ...
        }

Do you have any suggestions to solve this intermittent problem?

Thanks,
Tom

Original issue reported on code.google.com by [email protected] on 19 Feb 2012 at 12:44
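
The "Translation model update concurrency error" warnings, together with a trace that ends in StringDoubleMap.incr on a worker thread, suggest that several threads may be updating a shared count table at the same time. This is an inference from the log, not a confirmed cause. The sketch below is not the aligner's code; it only illustrates, in Python with hypothetical class and function names, the general pattern of serializing increments to a shared table with a lock:

```python
import threading
from collections import defaultdict


class LockedCounts:
    """Shared count table whose increments are serialized by a lock."""

    def __init__(self):
        self._counts = defaultdict(float)
        self._lock = threading.Lock()

    def incr(self, key, amount):
        # The read-modify-write happens under the lock, so no update is lost.
        with self._lock:
            self._counts[key] += amount

    def get(self, key):
        with self._lock:
            return self._counts.get(key, 0.0)


def accumulate(num_threads, updates_per_thread):
    """Have several threads add 1.0 to the same key; return the total."""
    counts = LockedCounts()

    def worker():
        for _ in range(updates_per_thread):
            counts.incr(("the", "answer"), 1.0)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts.get(("the", "answer"))
```

For the aligner itself, one untested way to check whether the failure is thread-related would be to rerun the command above with -EMWordAligner.numThreads set to 1 and see whether the error still appears.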

incorrect number of training data read while training syntactic HMM

Hello,
I was trying to train syntactic HMM on my data. My training data contains 10050 
parallel sentences with parsed target trees. 

wc output of my training data
-------------------------------
   10050   284765  1599230 corpus.en
   10050   804959  4284275 corpus.entrees
   10050   228873  5058993 corpus.ta
   30150  1318597 10942498 total


When I run the alignment, the logfile indicates that only 9811 sentences
were read instead of 10050. Here is what I see in the logfile.
After training, I get alignments for only 9811 sentences.

PS: I don't have any testing data. My test data directories are empty. I have 
attached my config file too.

main() {
  Execution directory: en-ta/alignment_models/berkeley/lc_tok_10000_S
  Preparing Training Data
  Unknown number of training, 0 test
  Training models: 2 stages {
    Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
      Initializing forward model [7.9s, cum. 7.9s]
      Initializing reverse model [5.2s, cum. 13s]
      Joint Train: 9811 sentences, jointly {
        Iteration 1/5 {
          Sentence 1/9811
          Sentence 2/9811
          Sentence 3/9811
          Sentence 169/9811
          Sentence 3304/9811
          Sentence 7650/9811
          Log-likelihood 1 = -1337616.882
          Log-likelihood 2 = -1336443.902
          ... 9805 lines omitted ...
        } [20s, cum. 20s]

Please let me know if I am missing something.

Original issue reported on code.google.com by [email protected] on 2 Aug 2013 at 10:10

Attachments:

lc_tok_10000_S.conf
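
The log does not say why 239 of the 10050 sentences were skipped. One way to narrow this down is to scan the three files in parallel and flag triples that look unusable. A minimal Python sketch follows; the criteria (an empty side, a tree line with unbalanced brackets, a length limit) are guesses rather than documented aligner behavior, and max_len is a hypothetical parameter:

```python
def paren_balanced(tree_line):
    """True if the line contains brackets and they are balanced."""
    depth = 0
    seen = False
    for ch in tree_line:
        if ch == "(":
            depth += 1
            seen = True
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return seen and depth == 0


def suspicious_triples(english, trees, foreign, max_len=None):
    """Return (line_number, reason) for sentence triples that look unusable.

    max_len is a hypothetical token limit; None disables the length check.
    """
    flagged = []
    for n, (en, tree, fo) in enumerate(zip(english, trees, foreign), start=1):
        en_toks, fo_toks = en.split(), fo.split()
        if not en_toks or not fo_toks:
            flagged.append((n, "empty sentence"))
        elif not paren_balanced(tree):
            flagged.append((n, "unbalanced or missing tree"))
        elif max_len is not None and max(len(en_toks), len(fo_toks)) > max_len:
            flagged.append((n, "longer than max_len"))
    return flagged
```

If the number of flagged triples for corpus.en, corpus.entrees, and corpus.ta comes out near 239 for some setting, that hints at which filter is responsible; if not, the cause lies elsewhere.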

License confusion

The website states that the license of this package is GPLv2, but the download
says MIT
(http://code.google.com/p/berkeleyaligner/source/browse/trunk/resources/distribution_readme.txt).

Which is the license?

Original issue reported on code.google.com by [email protected] on 26 Jul 2012 at 5:55

Run time error/ infinite loop

What steps will reproduce the problem?
1. 500,000 line parallel corpus
2. executing 
     java -server -Xmx16000m -ea -jar /opt/AO/utils/berkeleyaligner/berkeleyaligner.jar -execDir $1/align.1/ -englishSuffix $2 -foreignSuffix $3 -exec.create true -Main.saveParams true -Main.alignTraining true -Data.testSources -Main.iters 5 5 -EMWordAligner.numThreads 4 -Data.trainSources $1 

What is the expected output? What do you see instead?
...
          WARNING: Translation model update concurrency error
...
          Awaited executor termination for 36370 seconds
          Awaited executor termination for 36500 seconds
...

What version of the product are you using? On what operating system?
 berkeleyaligner.jar from september 2009. Running on Linux 64-bit (x86_64 Redhat 4.1.2)

Please provide any additional information below.
  It sometimes works, but out of 10 corpora, at least 1 will fail.

Original issue reported on code.google.com by [email protected] on 16 Jun 2011 at 6:04

Different Alignment Result on the same input and setting

What steps will reproduce the problem?

1. I ran the aligner twice on the same data but got slightly
different results. I am not sure whether this is a bug in the program.
Here is my command:

java -server -Xmx1000m -ea -jar
/home/paisarn/mt-util/berkeleyaligner-1.1/berkeleyaligner.jar -execDir
/home/paisarn/aligner/out-enfr20-2-berkeley1.1 -englishSuffix en
-foreignSuffix fr -exec.create true -Main.saveParams true
-Main.alignTraining true -Main.testSources -Main.iters 5 5
-EMWordAligner.numThreads 4 -Main.trainSources
/home/paisarn/aligner/enfr20k/

For the second run, I changed only the -execDir parameter. I then
compared training.en-fr.align from both folders, and they are slightly
different. Could you please explain whether this is normal or whether
something is wrong?

2. The result in training.en-fr.align generated by aligner version
1.1 seems to have the source and target word indices swapped.

For example, in the training.en-fr.align

Generated by Aligner 1.0: 6-8 3-2 4-3 2-1 7-10 5-5
Generated by Aligner 1.1: 7-5 1-2 2-3 0-1 9-6 4-4

Expected for Aligner 1.1: 5-7 2-1 3-2 1-0 6-9 4-4

From what I understand, the result from Aligner 1.1 should be
compatible with GIZA++, so the word indices should simply be one less
than those from Aligner 1.0.

What version of the product are you using? On what operating system?
Currently I use Aligner 1.1 on Linux Fedora Core 9 (64bit)

Original issue reported on code.google.com by [email protected] on 1 Jun 2009 at 1:41
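
The index conventions described in this report can be checked mechanically. A minimal Python sketch, assuming each line of training.en-fr.align is a whitespace-separated list of i-j pairs as in the examples above; the helper names are hypothetical:

```python
def parse_pairs(line):
    """Parse 'i-j i-j ...' into a list of (i, j) integer tuples."""
    return [tuple(int(x) for x in tok.split("-")) for tok in line.split()]


def format_pairs(pairs):
    """Format (i, j) tuples back into 'i-j i-j ...'."""
    return " ".join(f"{i}-{j}" for i, j in pairs)


def to_zero_based(line):
    """Subtract one from both indices of every pair."""
    return format_pairs([(i - 1, j - 1) for i, j in parse_pairs(line)])


def swap_sides(line):
    """Exchange the two indices of every pair."""
    return format_pairs([(j, i) for i, j in parse_pairs(line)])


def differing_lines(run_a, run_b):
    """1-based numbers of lines whose alignment sets differ between two runs."""
    return [n for n, (a, b) in enumerate(zip(run_a, run_b), start=1)
            if set(parse_pairs(a)) != set(parse_pairs(b))]
```

Applying to_zero_based to the Aligner 1.0 line gives the expected 1.1 line, and applying swap_sides to that gives the line Aligner 1.1 actually produced, which matches the reporter's description of a swap. differing_lines can be used to count how many sentences differ between the two output folders from the first question.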
