alex-berard / multivec
A Multilingual and Multilevel Representation Learning Toolkit for NLP
License: Apache License 2.0
Multivec finishes training without an error exit code and stores an unusable model when the input files can only be read once, e.g. when the input is generated on the fly by another tool. There should be an error message instead.
The following shell command illustrates the point:
multivec-bi --verbose --train-src <(zcat UNv1.0.en-es.en.gz | tr -d '[:punct:]') --train-trg <(zcat UNv1.0.en-es.es.gz | tr -d '[:punct:]')
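A plausible explanation (an assumption here, not confirmed from the multivec source) is that training reads the corpus more than once, e.g. one pass to build the vocabulary and further passes to train; a regular file can be rewound between passes, while the pipe behind bash's `<(...)` process substitution cannot, so later passes see empty input. A minimal Python sketch of the difference:

```python
import io
import os
import tempfile

# A regular file: rewinding with seek() works, so a second pass
# over the data sees the same content as the first.
with tempfile.NamedTemporaryFile(mode="w+") as f:
    f.write("hello world\n")
    f.seek(0)
    first = f.read()
    f.seek(0)
    second = f.read()
assert first == second == "hello world\n"

# A pipe: once drained, it cannot be rewound.
r, w = os.pipe()
os.write(w, b"hello world\n")
os.close(w)
pipe = os.fdopen(r, "rb")
pipe.read()                     # first (and only possible) pass
try:
    pipe.seek(0)                # a second pass is impossible
    rewindable = True
except (io.UnsupportedOperation, OSError):
    rewindable = False
pipe.close()
print("pipe can be rewound:", rewindable)
```

Materializing the decompressed corpus into temporary files before calling multivec-bi avoids the problem; detecting a non-seekable input and refusing it with an error message would make the failure visible.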
Using the test commands from the documentation, I get the following numbers on a Debian 8 x86_64 machine. It might be helpful to have this as reference information.
Vocabulary size: 25298
Embeddings size: 100
capital-common-countries:
accuracy: -nan%
capital-world:
accuracy: -nan%
city-in-state:
accuracy: -nan%
currency:
accuracy: -nan%
family:
accuracy: 33.3%
gram1-adjective-to-adverb:
accuracy: 1.85%
gram2-opposite:
accuracy: 4.99%
gram3-comparative:
accuracy: 27.5%
gram4-superlative:
accuracy: 10.5%
gram5-present-participle:
accuracy: 15.4%
gram6-nationality-adjective:
accuracy: -nan%
gram7-past-tense:
accuracy: 9.46%
gram8-plural:
accuracy: 4.15%
gram9-plural-verbs:
accuracy: 18.9%
Total accuracy: 13.2%
Syntactic accuracy: 12.8%, Semantic accuracy: 33.3%
Questions seen: 6792/19544, 34.8%
This should be an enhancement, not a bug issue.
The sample commands
wget http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz -P data
tar xzf data/de-en.tgz -C data
are inconsistent: http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz is the News Commentary corpus, not the Europarl corpus, and the tar command extracts a different file than the one downloaded.
The script
scripts/prepare.py
seems to have been renamed to prepare-data.py, but it no longer accepts the same arguments.
Hi,
I am trying to perform Sanskrit-Tamil morphological analysis using a parallel corpus. What are the steps to follow to import a new parallel corpus for use with the multivec tool? I have tried importing from data/.. and I get a segmentation fault after running the multivec bilingual model. I just need the steps to import a new corpus into multivec. Please reply ASAP. Thanks in advance.
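One common cause of crashes on a new corpus (an assumption here, since no stack trace is shown) is input that is not sentence-aligned: a bilingual model expects line i of the source file to be the translation of line i of the target file, with one tokenized sentence per line. A quick sanity check before training, using hypothetical file names `corpus.sa` / `corpus.ta`:

```python
import os
import tempfile

def count_lines(path):
    # One sentence per line, so line counts of the two sides must match.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Tiny stand-in files; in practice these would be the tokenized
# Sanskrit and Tamil sides of the corpus.
d = tempfile.mkdtemp()
src = os.path.join(d, "corpus.sa")
trg = os.path.join(d, "corpus.ta")
with open(src, "w", encoding="utf-8") as f:
    f.write("sentence one\nsentence two\n")
with open(trg, "w", encoding="utf-8") as f:
    f.write("line one\nline two\n")

n_src, n_trg = count_lines(src), count_lines(trg)
assert n_src == n_trg, "source and target are not line-aligned"
print("aligned:", n_src, "sentence pairs")
```

If the counts match, the two files can be passed to the trainer with --train-src and --train-trg as in the documented examples.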
Python 2.7.14 (default, Sep 23 2017, 22:06:14)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multivec import BilingualModel
>>> model = BilingualModel("/home/gurol/Documents/Bil496/Code/multivec/models/OpenSubtitles2018.en-tr.bin")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "multivec.pyx", line 357, in multivec.BilingualModel.__cinit__
self.model.load(name)
MemoryError: std::bad_alloc
I get the above error message when loading a trained model on Ubuntu 17.10. The model was trained on Ubuntu 17.10. But I don't get the same error when I use a model that was trained on OS X.
On my ubuntu:
g++ version:
g++ --version
g++ (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
cmake version:
cmake version 3.9.1
Hi,
after I installed multivec and the Python wrapper, I was trying to import multivec into Python. However, it exits with Segmentation fault: 11.
Any ideas what went wrong?
I am using MacOS 10.10.4.
thanks!
Yulia
Hi,
Can we use monolingual text in the two languages to further improve the quality of the bilingual embeddings? In the BilBOWA paper the authors claim they can take advantage of monolingual text in addition to parallel text when learning bilingual embeddings, but I didn't see that option in the toolkit.
Thanks
Hi,
I am running a bash file from a Java program using Runtime.getRuntime().exec() (I have already tried ProcessBuilder as well). The bash file calls a Python program which uses multivec functions.
If I run the bash file from a terminal it executes perfectly, but when I run it from the Java program it is not able to execute the import statement for the multivec modules.
The first lines of my python file (fc.py) are these:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import string, sys, codecs, os, re, subprocess
from multivec import MonolingualModel, BilingualModel # The problem seems to be here!
modelobilingue = sys.argv[1]
model = BilingualModel(modelobilingue)
This is my bash file:
#!/bin/bash
MODELO="fapesp-bitexts.pt-en.bin"
DIRPT="/home/files/pt/"
DIREN="/home/files/en/"
cd scripts
ls -f $DIRPT | grep ".txt" | sort |
while read -r arq;
do
cp $DIRPT/$arq entrada.pt
cp $DIREN/$arq entrada.en
perl normalize-punctuation.perl -l pt < entrada.pt | perl tokenizer.perl -l pt > par.pt
perl normalize-punctuation.perl -l en < entrada.en | perl tokenizer.perl -l en > par.en
python fc.py $MODELO par.pt par.en
done
Do you have any idea how to solve my problem?
Thanks,
Helena
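When a script imports a module fine from a terminal but fails under a spawned process, the two environments usually differ: the spawned process may resolve a different python binary, run from a different working directory, or lack the PYTHONPATH entry where multivec was installed. A small diagnostic sketch (not a fix) that could be added at the top of fc.py, run once from the terminal and once via Runtime.exec(), to compare the two runs:

```python
import os
import sys

# Print where the interpreter and module search path point; any
# difference between the two runs is a likely cause of the failed import.
print("interpreter:", sys.executable)
print("cwd        :", os.getcwd())
print("PYTHONPATH :", os.environ.get("PYTHONPATH", "<unset>"))
for p in sys.path:
    print("sys.path   :", p)
```

If the paths differ, setting the working directory and the PYTHONPATH variable explicitly on the Java side (ProcessBuilder exposes both) should make the spawned run match the terminal run.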
The Python example uses BilingualModel and MonolingualModel, but it should be BiModel and MonoModel.
Hi,
I am looking at the C++ MonolingualModel public API and would like to confirm the steps to follow to perform sentence similarity.
Question: given one existing sentence, how do I get the closest ones?
I assume I need to compute the similarity of the given sentence's vector with all the other sentence vectors, and that there is no method for this. Is this correct?
I see there is a similaritySentence method for strings, but maybe you should add a method that works with vectors, or a closestSentence method with parameters (sentence vector, all sentence vectors).
Thanks
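Assuming you can already obtain one vector per sentence (for instance by averaging its word vectors; the question suggests there is no built-in closest-sentence method in multivec), the missing step is just ranking all stored sentence vectors by cosine similarity to the query. A dependency-free sketch:

```python
from math import sqrt

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_sentences(query, vectors, k=3):
    # Rank every sentence vector by similarity to the query,
    # highest first, and return the indices of the top k.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-d vectors standing in for sentence embeddings (hypothetical).
sents = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(closest_sentences(sents[0], sents, k=2))  # [0, 1]
```

The linear scan is O(n) per query; for large corpora an approximate nearest-neighbor index would be the usual replacement, but the interface above matches the closestSentence(sentence vector, all sentence vectors) shape suggested in the question.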
Hi, eske
I found that the syntax_weights for each POS tag are hard-coded. How can I compute these POS-tag weights for a new language?
May I know what kind of alignment is used when building bilingual data? Thank you.