mw10104587 / berkeleyaligner
Automatically exported from code.google.com/p/berkeleyaligner
#############################
# The Berkeley Word Aligner #
#############################

The Berkeley Word Aligner is a statistical machine translation tool that
automatically aligns words in a sentence-aligned parallel corpus.

-------------
Install & Run
-------------

The version you have downloaded is primarily intended for people interested in
extending or helping develop the aligner. To generate the distribution meant for
end users (who just want to align words), you need to compile the package by
running ant (http://ant.apache.org/):

  ant

The directory called "distribution" will then contain further instructions on
running the tool.

-----------
Information
-----------

For more information about the package as a whole, please visit:

  http://nlp.cs.berkeley.edu/pages/wordaligner.html

Information related to the development of this package resides online:

  http://code.google.com/p/berkeleyaligner
java -server -mx1000m -jar berkeleyaligner.jar ++example.conf
main() {
Execution directory: output_emille
Preparing Training Data {
ERROR: java.lang.ArrayIndexOutOfBoundsException: 1:
edu.berkeley.nlp.mt.Alignment.parseAlignments(Alignment.java:818)
edu.berkeley.nlp.mt.SentencePairReader$PairIterator.addAlignmentToPair(SentencePairReader.java:386)
edu.berkeley.nlp.mt.SentencePairReader$PairIterator.loadNext(SentencePairReader.java:472)
edu.berkeley.nlp.mt.SentencePairReader$PairIterator.<init>(SentencePairReader.java:441)
edu.berkeley.nlp.mt.SentencePairReader$1.newInstance(SentencePairReader.java:601)
edu.berkeley.nlp.mt.SentencePairReader$1.newInstance(SentencePairReader.java:598)
edu.berkeley.nlp.util.Iterators$IteratorIterator.getNextIterator(Iterators.java:202)
edu.berkeley.nlp.util.Iterators$IteratorIterator.<init>(Iterators.java:195)
edu.berkeley.nlp.mt.SentencePairReader.getSentencePairIteratorFromSource(SentencePairReader.java:605)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot$1.newInstance(SentencePairReader.java:92)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot$1.newInstance(SentencePairReader.java:89)
edu.berkeley.nlp.util.Iterators$IteratorIterator.getNextIterator(Iterators.java:202)
edu.berkeley.nlp.util.Iterators$IteratorIterator.<init>(Iterators.java:195)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.iterator(SentencePairReader.java:98)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.asList(SentencePairReader.java:120)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.loadSentenceCache(SentencePairReader.java:137)
edu.berkeley.nlp.mt.SentencePairReader$PairDepot.<init>(SentencePairReader.java:65)
edu.berkeley.nlp.mt.SentencePairReader.pairDepotFromSources(SentencePairReader.java:616)
edu.berkeley.nlp.wordAlignment.Data.loadData(Data.java:77)
edu.berkeley.nlp.wordAlignment.Main.run(Main.java:101)
fig.exec.Execution.runWithObjArray(Execution.java:466)
fig.exec.Execution.run(Execution.java:419)
edu.berkeley.nlp.wordAlignment.Main.main(Main.java:95)
Execution directory: output_emille
1 errors, 0 warnings
}
Note:
I wrote a Python script to convert the EMILLE corpus into the
example/train and example/test layout so that it matches the <s
snum=3>asdasd</s> style.
I am running the latest unsupervised version on Fedora 10.
Original issue reported on code.google.com by [email protected]
on 21 Jan 2010 at 5:15
Currently, the output alignment has no alignment score associated with it.
Would you please add this feature to the Berkeley Aligner?
Original issue reported on code.google.com by [email protected]
on 18 Sep 2008 at 12:50
What is the expected output? What do you see instead?
Sometimes the files that store the word translation probabilities end
before the whole model has been written.
The likely cause is that the close method of the output stream is never called.
Example of the problem:
francisco entropy 0 nTrans 8 sum 1.000000
san_francisco: 1.000000
list entropy 0 nTrans 3 sum 1.000000
all: 1.000000
the entropy 0.444 nTrans 134 sum 1.000000
answer: 0.840075
all: 0.159301
loc_2: 0.000624
massachusetts entropy 0.500 nTrans 13 sum 1.000000
massachusetts: 0.800001
ma: 0.199999
in entropy 0.288 nTrans 113 sum 1.000000
loc_2: 0.916384
any: 0.083493
stateid: 0.000123
answe
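A truncated tail like the one above is the usual symptom of a buffered writer that is never flushed or closed. The following is only a minimal sketch of the suggested fix, not the aligner's actual writer code: the class name `TransProbWriterSketch`, the method `writeTable`, and the simplified output layout are all hypothetical. It relies on try-with-resources so that the writer is closed, and therefore flushed, even if an exception is thrown while writing.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not code from the Berkeley Aligner.
class TransProbWriterSketch {

    // Writes each source word followed by its translations and probabilities.
    // The layout is a simplified stand-in for the real parameter file format.
    static void writeTable(Path out, Map<String, Map<String, Double>> table) throws IOException {
        // try-with-resources closes the writer on every exit path, which
        // flushes any buffered text that has not yet reached the file.
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Map<String, Double>> source : table.entrySet()) {
                writer.write(source.getKey());
                writer.newLine();
                for (Map.Entry<String, Double> target : source.getValue().entrySet()) {
                    writer.write(String.format(Locale.ROOT, "  %s: %.6f",
                            target.getKey(), target.getValue()));
                    writer.newLine();
                }
            }
        }
    }
}
```

On Java versions older than 7, the same effect would need an explicit `close()` call in a `finally` block.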
What version of the product are you using? On what operating system?
I am using the Unsupervised aligner version 2.1.
The source code for this version is not available.
Would you be able to release it?
Original issue reported on code.google.com by [email protected]
on 29 Jul 2015 at 8:27
I'm using the unsupervised version 2.1 available from the repository. Here is
my command line:
/usr/bin/java -server \
-Xms1024m \
-Xmx2048m \
-Xss768k \
-ea \
-jar /usr/local/bin/berkeleyaligner.jar \
-EMWordAligner.numThreads 6 \
-Data.trainSources /opt/library/BUILDS/tm/demo_tm/bitext.list \
-Data.foreignSuffix nl \
-Data.englishSuffix en \
-Data.testSources \
-exec.execDir /opt/library/TRAININGS/alignments/align-demo_tm-en-nl/berk.classes \
-exec.create True \
-Evaluator.writeGIZA True \
-Main.SaveParams True \
-Main.alignTraining True \
-Main.forwardModels MODEL1 HMM \
-Main.reverseModels MODEL1 HMM \
-Main.iters 5 5 \
-Main.mode JOINT JOINT
This is a small corpus of about 40,000 phrase pairs for testing and development.
The machine is a server with a 6-core AMD Opteron, 16 GB RAM, and 1 TB of
available hard drive space. Java/OS versions are as follows:
user@moses0:~$ java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.10) (6b20-1.9.10-0ubuntu1~10.04.3)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
I'm testing a workflow, so I run the same command on the same corpus multiple
times. Each time, the previous output folder is deleted.
Most times, this command completes training successfully. Sometimes, however,
it fails with an AssertionError. The failure occurs at a different phrase pair
each time, but always during the first iteration of Model 1.
main() {
Execution directory: /opt/library/TRAININGS/alignments/align-demo_tm-en-nl/berk.classes
Preparing Training Data [2.3s, cum. 2.4s]
41410 training, 0 test
Training models: 2 stages {
Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
Initializing forward model [9.1s, cum. 9.1s]
Initializing reverse model [7.9s, cum. 17s]
Joint Train: 41410 sentences, jointly {
Iteration 1/5 {
Sentence 2/41410
Sentence 1/41410
Sentence 5/41410
Sentence 13/41410
WARNING: Translation model update concurrency error
Sentence 54/41410
WARNING: Translation model update concurrency error
Sentence 207/41410
WARNING: Translation model update concurrency error
WARNING: Translation model update concurrency error
ERROR: java.lang.AssertionError:
fig.basic.StringDoubleMap.find(StringDoubleMap.java:397)
fig.basic.StringDoubleMap.incr(StringDoubleMap.java:78)
fig.basic.String2DoubleMap.incr(String2DoubleMap.java:51)
edu.berkeley.nlp.wordAlignment.SentencePairState.updateTransProbs(SentencePairState.java:79)
edu.berkeley.nlp.wordAlignment.distortion.Model1or2SentencePairState.updateNewParams(Model1or2SentencePairState.java:91)
edu.berkeley.nlp.wordAlignment.EMWordAligner$1.run(EMWordAligner.java:231)
edu.berkeley.nlp.concurrent.WorkQueue$1.run(WorkQueue.java:70)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:636)
1 errors, 4 warnings
... 585 lines omitted ...
}
Do you have any suggestions to solve this intermittent problem?
Thanks,
Tom
Original issue reported on code.google.com by [email protected]
on 19 Feb 2012 at 12:44
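The "Translation model update concurrency error" warnings followed by an AssertionError inside fig.basic.StringDoubleMap.find suggest, though the report does not confirm it, that several worker threads update one shared count table at the same time. The sketch below is not the aligner's code: `SharedCountsSketch` and its methods are hypothetical names, and it swaps in the standard `ConcurrentHashMap.merge` (Java 8 or later) to show one way counts can be accumulated safely from multiple threads. An untested assumption is that running with `-EMWordAligner.numThreads 1` would also sidestep the problem, at the cost of speed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is not fig.basic.StringDoubleMap.
class SharedCountsSketch {

    private final ConcurrentHashMap<String, Double> counts = new ConcurrentHashMap<>();

    // merge() performs the read-add-write step atomically per key, so
    // concurrent increments of the same key are never lost.
    void incr(String key, double delta) {
        counts.merge(key, delta, Double::sum);
    }

    double get(String key) {
        return counts.getOrDefault(key, 0.0);
    }

    // Starts numThreads workers that each add 1.0 to the same key
    // updatesPerThread times, then waits for all of them to finish.
    static SharedCountsSketch accumulate(int numThreads, int updatesPerThread)
            throws InterruptedException {
        SharedCountsSketch table = new SharedCountsSketch();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    table.incr("the|answer", 1.0);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return table;
    }
}
```

With a plain `HashMap` in place of the concurrent map, the same workload could lose updates or corrupt the table, which is the kind of intermittent failure described in this report.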
Hello,
I am trying to train the syntactic HMM on my data. My training data contains 10050
parallel sentences with parsed target trees.
wc output of my training data
-------------------------------
10050 284765 1599230 corpus.en
10050 804959 4284275 corpus.entrees
10050 228873 5058993 corpus.ta
30150 1318597 10942498 total
When I run the alignment, the log file indicates that only 9811 sentences
were read instead of 10050. Here is what I see in the log file.
After training, I also get alignments for only 9811 sentences.
PS: I don't have any test data; my test data directories are empty. I have
attached my config file too.
main() {
Execution directory: en-ta/alignment_models/berkeley/lc_tok_10000_S
Preparing Training Data
Unknown number of training, 0 test
Training models: 2 stages {
Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
Initializing forward model [7.9s, cum. 7.9s]
Initializing reverse model [5.2s, cum. 13s]
Joint Train: 9811 sentences, jointly {
Iteration 1/5 {
Sentence 1/9811
Sentence 2/9811
Sentence 3/9811
Sentence 169/9811
Sentence 3304/9811
Sentence 7650/9811
Log-likelihood 1 = -1337616.882
Log-likelihood 2 = -1336443.902
... 9805 lines omitted ...
} [20s, cum. 20s]
Please let me know if I am missing something.
Original issue reported on code.google.com by [email protected]
on 2 Aug 2013 at 10:10
The website states that the license of this package is GPLv2, but the download
says MIT
(http://code.google.com/p/berkeleyaligner/source/browse/trunk/resources/distribution_readme.txt).
Which license applies?
Original issue reported on code.google.com by [email protected]
on 26 Jul 2012 at 5:55
What steps will reproduce the problem?
1. Take a 500,000-line parallel corpus.
2. Execute:
java -server -Xmx16000m -ea -jar /opt/AO/utils/berkeleyaligner/berkeleyaligner.jar -execDir $1/align.1/ -englishSuffix $2 -foreignSuffix $3 -exec.create true -Main.saveParams true -Main.alignTraining true -Data.testSources -Main.iters 5 5 -EMWordAligner.numThreads 4 -Data.trainSources $1
What is the expected output? What do you see instead?
...
WARNING: Translation model update concurrency error
...
Awaited executor termination for 36370 seconds
Awaited executor termination for 36500 seconds
...
What version of the product are you using? On what operating system?
berkeleyaligner.jar from September 2009. Running on Linux 64-bit (x86_64 Redhat 4.1.2)
Please provide any additional information below.
It sometimes works, but out of 10 corpora, at least 1 will fail.
Original issue reported on code.google.com by [email protected]
on 16 Jun 2011 at 6:04
What steps will reproduce the problem?
1. I ran the aligner twice on the same data but got slightly
different results. I am not sure whether this is a bug in the program.
Here is my command:
java -server -Xmx1000m -ea -jar
/home/paisarn/mt-util/berkeleyaligner-1.1/berkeleyaligner.jar -execDir
/home/paisarn/aligner/out-enfr20-2-berkeley1.1 -englishSuffix en
-foreignSuffix fr -exec.create true -Main.saveParams true
-Main.alignTraining true -Main.testSources -Main.iters 5 5
-EMWordAligner.numThreads 4 -Main.trainSources
/home/paisarn/aligner/enfr20k/
When I ran this command the second time, I changed only the -execDir
parameter. I then compared training.en-fr.align from both
folders, and they differ slightly. Could you please explain
whether this is normal or whether something is wrong?
2. The result in training.en-fr.align generated by aligner version
1.1 seems to have the source and target word indices swapped.
For example, in training.en-fr.align:
Generated by Aligner 1.0: 6-8 3-2 4-3 2-1 7-10 5-5
Generated by Aligner 1.1: 7-5 1-2 2-3 0-1 9-6 4-4
Expected from Aligner 1.1: 5-7 2-1 3-2 1-0 6-9 4-4
From what I understand, the output of Aligner 1.1 should be
compatible with GIZA++, so each word index should simply be one less
than in the Aligner 1.0 output.
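The relationship between the three example lines can be checked mechanically. The sketch below is a hypothetical helper, not part of the aligner: `AlignmentIndexSketch`, `shiftIndices`, and `swapSides` are illustrative names. It assumes each alignment is written as `i-j` pairs separated by whitespace, as in the example. Shifting the 1.0 line by -1, or swapping the two sides of the 1.1 line, both yield the line the reporter expected from 1.1.

```java
// Hypothetical illustration only; this is not code from the Berkeley Aligner.
class AlignmentIndexSketch {

    // Adds offset to both indices of every "i-j" pair, e.g. -1 to convert
    // 1-based indices to 0-based ones.
    static String shiftIndices(String line, int offset) {
        return transform(line, false, offset);
    }

    // Swaps the two sides of every pair: "7-5" becomes "5-7".
    static String swapSides(String line) {
        return transform(line, true, 0);
    }

    private static String transform(String line, boolean swap, int offset) {
        StringBuilder out = new StringBuilder();
        for (String pair : line.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // an empty line yields no pairs
            }
            String[] sides = pair.split("-");
            if (sides.length != 2) {
                throw new IllegalArgumentException("Malformed alignment pair: " + pair);
            }
            int left = Integer.parseInt(sides[0]) + offset;
            int right = Integer.parseInt(sides[1]) + offset;
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(swap ? right : left).append('-').append(swap ? left : right);
        }
        return out.toString();
    }
}
```

For instance, `AlignmentIndexSketch.swapSides("7-5 1-2 2-3 0-1 9-6 4-4")` returns `5-7 2-1 3-2 1-0 6-9 4-4`, which is the same string that `shiftIndices("6-8 3-2 4-3 2-1 7-10 5-5", -1)` produces.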
What version of the product are you using? On what operating system?
Currently I use Aligner 1.1 on Linux Fedora Core 9 (64bit)
Original issue reported on code.google.com by [email protected]
on 1 Jun 2009 at 1:41