alex-berard / multivec
A Multilingual and Multilevel Representation Learning Toolkit for NLP
License: Apache License 2.0
Multivec finishes training without an error exit code and stores an unusable model when the input files can only be read once, e.g. when the input is generated on the fly by another tool. There should be an error message instead.
The following shell command illustrates the point:
multivec-bi --verbose --train-src <(zcat UNv1.0.en-es.en.gz | tr -d '[:punct:]') --train-trg <(zcat UNv1.0.en-es.es.gz | tr -d '[:punct:]')
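A plausible explanation (an assumption here, not confirmed from the multivec source) is that training reads the corpus more than once, e.g. one pass to build the vocabulary and further passes to train; a regular file can be rewound between passes, while the pipe behind bash's `<(...)` process substitution cannot, so later passes see empty input. A minimal Python sketch of the difference:

```python
import io
import os
import tempfile

# A regular file: rewinding with seek() works, so a second pass
# over the data sees the same content as the first.
with tempfile.NamedTemporaryFile(mode="w+") as f:
    f.write("hello world\n")
    f.seek(0)
    first = f.read()
    f.seek(0)
    second = f.read()
assert first == second == "hello world\n"

# A pipe: once drained, it cannot be rewound.
r, w = os.pipe()
os.write(w, b"hello world\n")
os.close(w)
pipe = os.fdopen(r, "rb")
pipe.read()                     # first (and only possible) pass
try:
    pipe.seek(0)                # a second pass is impossible
    rewindable = True
except (io.UnsupportedOperation, OSError):
    rewindable = False
pipe.close()
print("pipe can be rewound:", rewindable)
```

Materializing the decompressed corpus into temporary files before calling multivec-bi avoids the problem; detecting a non-seekable input and refusing it with an error message would make the failure visible.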
Using the test commands from the documentation, I get the following numbers on a Debian 8 x86_64 machine. It might be helpful to have this as reference information.
Vocabulary size: 25298
Embeddings size: 100
capital-common-countries:
accuracy: -nan%
capital-world:
accuracy: -nan%
city-in-state:
accuracy: -nan%
currency:
accuracy: -nan%
family:
accuracy: 33.3%
gram1-adjective-to-adverb:
accuracy: 1.85%
gram2-opposite:
accuracy: 4.99%
gram3-comparative:
accuracy: 27.5%
gram4-superlative:
accuracy: 10.5%
gram5-present-participle:
accuracy: 15.4%
gram6-nationality-adjective:
accuracy: -nan%
gram7-past-tense:
accuracy: 9.46%
gram8-plural:
accuracy: 4.15%
gram9-plural-verbs:
accuracy: 18.9%
Total accuracy: 13.2%
Syntactic accuracy: 12.8%, Semantic accuracy: 33.3%
Questions seen: 6792/19544, 34.8%
This should be an enhancement, not a bug issue.
The sample commands
wget http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz -P data
tar xzf data/de-en.tgz -C data
are inconsistent: http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz is the News Commentary corpus, not the Europarl corpus, and the tar command extracts a different file than the one downloaded.
The script
scripts/prepare.py
seems to have been renamed to prepare-data.py, but it no longer accepts the same arguments.
Hi,
I am trying to perform Sanskrit-Tamil morphological analysis using a parallel corpus. What are the steps to follow to import a new parallel corpus for use with the multivec tool? I have tried importing from data/.. and I get a segmentation fault after running the multivec bilingual model. I just need the steps to import a new corpus into multivec. Please reply ASAP. Thanks in advance.
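One common cause of crashes on a new corpus (an assumption here, since no stack trace is shown) is input that is not sentence-aligned: a bilingual model expects line i of the source file to be the translation of line i of the target file, with one tokenized sentence per line. A quick sanity check before training, using hypothetical file names `corpus.sa` / `corpus.ta`:

```python
import os
import tempfile

def count_lines(path):
    # One sentence per line, so line counts of the two sides must match.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Tiny stand-in files; in practice these would be the tokenized
# Sanskrit and Tamil sides of the corpus.
d = tempfile.mkdtemp()
src = os.path.join(d, "corpus.sa")
trg = os.path.join(d, "corpus.ta")
with open(src, "w", encoding="utf-8") as f:
    f.write("sentence one\nsentence two\n")
with open(trg, "w", encoding="utf-8") as f:
    f.write("line one\nline two\n")

n_src, n_trg = count_lines(src), count_lines(trg)
assert n_src == n_trg, "source and target are not line-aligned"
print("aligned:", n_src, "sentence pairs")
```

If the counts match, the two files can be passed to the trainer with --train-src and --train-trg as in the documented examples.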
Python 2.7.14 (default, Sep 23 2017, 22:06:14)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multivec import BilingualModel
>>> model = BilingualModel("/home/gurol/Documents/Bil496/Code/multivec/models/OpenSubtitles2018.en-tr.bin")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "multivec.pyx", line 357, in multivec.BilingualModel.__cinit__
self.model.load(name)
MemoryError: std::bad_alloc
I get the above error message when loading a trained model on Ubuntu 17.10. The model was trained on Ubuntu 17.10. But I don't get the same error when I use a model that was trained on OS X.
On my ubuntu:
g++ version:
g++ --version
g++ (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
cmake version:
cmake version 3.9.1
Hi,
after I installed multivec and the Python wrapper, I was trying to import multivec into Python. However, it exits with Segmentation fault: 11.
Any ideas what went wrong?
I am using MacOS 10.10.4.
thanks!
Yulia
Hi,
Can we use monolingual text in the two languages to further improve the quality of the bilingual embeddings? In the BilBOWA paper the authors claim they can take advantage of monolingual text in addition to parallel text when learning bilingual embeddings, but I didn't see that option in the toolkit.
Thanks
Hi,
I am running a bash file from a Java program using Runtime.getRuntime().exec() (I have already tried ProcessBuilder as well). The bash file calls a Python program which uses multivec functions.
If I run the bash file from a terminal it executes perfectly, but when I run it from the Java program it is not able to execute the import statement for the multivec modules.
The first lines of my python file (fc.py) are these:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import string, sys, codecs, os, re, subprocess
from multivec import MonolingualModel, BilingualModel # The problem seems to be here!
modelobilingue = sys.argv[1]
model = BilingualModel(modelobilingue)
This is my bash file:
#!/bin/bash
MODELO="fapesp-bitexts.pt-en.bin"
DIRPT="/home/files/pt/"
DIREN="/home/files/en/"
cd scripts
ls -f $DIRPT | grep ".txt" | sort |
while read -r arq;
do
cp $DIRPT/$arq entrada.pt
cp $DIREN/$arq entrada.en
perl normalize-punctuation.perl -l pt < entrada.pt | perl tokenizer.perl -l pt > par.pt
perl normalize-punctuation.perl -l en < entrada.en | perl tokenizer.perl -l en > par.en
python fc.py $MODELO par.pt par.en
done
Do you have any idea how to solve my problem?
Thanks,
Helena
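When a script imports a module fine from a terminal but fails under a spawned process, the two environments usually differ: the spawned process may resolve a different python binary, run from a different working directory, or lack the PYTHONPATH entry where multivec was installed. A small diagnostic sketch (not a fix) that could be added at the top of fc.py, run once from the terminal and once via Runtime.exec(), to compare the two runs:

```python
import os
import sys

# Print where the interpreter and module search path point; any
# difference between the two runs is a likely cause of the failed import.
print("interpreter:", sys.executable)
print("cwd        :", os.getcwd())
print("PYTHONPATH :", os.environ.get("PYTHONPATH", "<unset>"))
for p in sys.path:
    print("sys.path   :", p)
```

If the paths differ, setting the working directory and the PYTHONPATH variable explicitly on the Java side (ProcessBuilder exposes both) should make the spawned run match the terminal run.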
The Python example uses BilingualModel and MonolingualModel, but it should be BiModel and MonoModel.
Hi,
I am looking at the C++ MonolingualModel public API and would like to confirm the steps to follow to perform sentence similarity.
Question: given one existing sentence, how do I get the closest ones?
I assume I need to compute the similarity of the given sentence's vector with all the other sentence vectors, and that there is no method for this. Is this correct?
I see there is a similaritySentence method for strings, but maybe you should add a method that works with vectors, or a closestSentence method with parameters (sentence vector, all sentence vectors).
Thanks
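Assuming you can already obtain one vector per sentence (for instance by averaging its word vectors; the question suggests there is no built-in closest-sentence method in multivec), the missing step is just ranking all stored sentence vectors by cosine similarity to the query. A dependency-free sketch:

```python
from math import sqrt

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_sentences(query, vectors, k=3):
    # Rank every sentence vector by similarity to the query,
    # highest first, and return the indices of the top k.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-d vectors standing in for sentence embeddings (hypothetical).
sents = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(closest_sentences(sents[0], sents, k=2))  # [0, 1]
```

The linear scan is O(n) per query; for large corpora an approximate nearest-neighbor index would be the usual replacement, but the interface above matches the closestSentence(sentence vector, all sentence vectors) shape suggested in the question.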
Hi, eske
I found that the syntax_weights for each POS tag are hard-coded. How can I compute these POS-tag weights for a new language?
May I know what kind of alignment is used when building bilingual data? Thank you.