
multivec's People

Contributors

cservan, draperunner, ferrerojeremy

multivec's Issues

Training and test files have to be real files

Multivec finishes training without an error exit code and stores an unusable model if the input files can only be read once, e.g. when the input is generated on the fly by another tool. There should be an error message.
The following shell command illustrates the point:

multivec-bi --verbose --train-src <(zcat UNv1.0.en-es.en.gz | tr -d '“[:punct:]”' ) --train-trg <(zcat UNv1.0.en-es.es.gz | tr -d '[:punct:]')
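
A possible workaround, as a hedged sketch: decompress the corpora to ordinary files before training, so that multivec-bi can re-read its inputs. The file names below are examples only and a minimal invocation is assumed:

#!/usr/bin/env python
# Hedged workaround sketch: materialize the gzipped corpora as real, seekable
# files before training, since multivec-bi appears to read its training files
# more than once. File names are examples only.
import gzip, shutil, subprocess

for gz, plain in [("UNv1.0.en-es.en.gz", "train.en"),
                  ("UNv1.0.en-es.es.gz", "train.es")]:
    with gzip.open(gz, "rb") as src, open(plain, "wb") as dst:
        shutil.copyfileobj(src, dst)   # write the decompressed corpus to disk

subprocess.check_call(["multivec-bi", "--verbose",
                       "--train-src", "train.en",
                       "--train-trg", "train.es"])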

Reference Performance

Using all the test commands from the documentation, I get the following numbers on a Debian 8 x86_64 machine. It might be helpful to have this as reference information.

Vocabulary size: 25298
Embeddings size: 100
capital-common-countries:
    accuracy: -nan%
capital-world:
    accuracy: -nan%
city-in-state:
    accuracy: -nan%
currency:
    accuracy: -nan%
family:
    accuracy: 33.3%
gram1-adjective-to-adverb:
    accuracy: 1.85%
gram2-opposite:
    accuracy: 4.99%
gram3-comparative:
    accuracy: 27.5%
gram4-superlative:
    accuracy: 10.5%
gram5-present-participle:
    accuracy: 15.4%
gram6-nationality-adjective:
    accuracy: -nan%
gram7-past-tense:
    accuracy: 9.46%
gram8-plural:
    accuracy: 4.15%
gram9-plural-verbs:
    accuracy: 18.9%
Total accuracy: 13.2%
Syntactic accuracy: 12.8%, Semantic accuracy: 33.3%
Questions seen: 6792/19544, 34.8%

This should be an enhancement, not a bug issue.

How to use multivec with a different language corpus?

Hi,
I am trying to perform Sanskrit-Tamil morphological analysis using a parallel corpus. What steps should be followed to import a new parallel corpus into this multivec tool? I have tried importing from data/.. and I get a segmentation fault after running the multivec bilingual model. I just need the steps to import a new corpus into multivec. Please reply ASAP. Thanks in advance.
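
As a hedged sketch of the usual workflow (not an official answer): the corpus should be tokenized plain text, one sentence per line, with line i of the source file parallel to line i of the target file; multivec-bi is then pointed at the two files. The file names below are hypothetical and the --save flag name should be checked against multivec-bi --help:

# Hedged sketch: train a bilingual model on a new parallel corpus
# (e.g. Sanskrit-Tamil) and load it from the Python wrapper.
# Assumes tokenized, line-aligned plain-text files; names are examples.
import subprocess
from multivec import BilingualModel

subprocess.check_call(["multivec-bi", "--verbose",
                       "--train-src", "corpus.sa",   # hypothetical source file
                       "--train-trg", "corpus.ta",   # hypothetical target file
                       "--save", "sa-ta.bin"])       # flag name: check --help

model = BilingualModel("sa-ta.bin")   # load the trained bilingual model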

MemoryError: std::bad_alloc when importing model

Python 2.7.14 (default, Sep 23 2017, 22:06:14) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multivec import BilingualModel
>>> model = BilingualModel("/home/gurol/Documents/Bil496/Code/multivec/models/OpenSubtitles2018.en-tr.bin")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "multivec.pyx", line 357, in multivec.BilingualModel.__cinit__
    self.model.load(name)
MemoryError: std::bad_alloc

I get the above error message when importing a trained model on Ubuntu 17.10. The model was also trained on Ubuntu 17.10. However, I do not get the same message when I use a model that was trained on OSX.

On my Ubuntu machine:
g++ version:

g++ --version
g++ (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cmake version:

cmake version 3.9.1
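
One hedged diagnostic (an assumption, not a confirmed cause): a truncated or corrupt model file, or too little free RAM for the vocabulary, can surface as std::bad_alloc when the loader allocates its buffers. A quick check from Python, using the path from the report:

# Hedged diagnostic sketch: compare the model file size between machines and
# catch the failure explicitly; the cause named above is only an assumption.
import os
from multivec import BilingualModel

path = "/home/gurol/Documents/Bil496/Code/multivec/models/OpenSubtitles2018.en-tr.bin"
print("model size: %d bytes" % os.path.getsize(path))  # compare with the OSX-trained file

try:
    model = BilingualModel(path)
except MemoryError as e:
    print("load failed: %s" % e)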

import multivec: Segmentation fault: 11

Hi,
after installing multivec and the Python wrapper, I tried to import multivec into Python; however, it exits with Segmentation fault: 11.
Any ideas what went wrong?
I am using MacOS 10.10.4.
Thanks!
Yulia

Can Monolingual Text be Used for Bilingual Training?

Hi,
Can we use monolingual text in the two languages to further improve the quality of the bilingual embeddings? In the BilBOWA paper the authors claim they can take advantage of monolingual text in addition to parallel text when learning bilingual embeddings, but I did not see that option in the toolkit.
Thanks

Problem when I try to run my Python file (multivec wrapper) through a bash file called from a Java program

Hi,

I am running a bash script from a Java program using Runtime.getRuntime().exec() (I have already tried ProcessBuilder as well). The bash script calls a Python program which uses multivec functions.

If I run the bash script from a terminal it executes perfectly, but when I run it from the Java program it fails on the import statement for the multivec module.

The first lines of my python file (fc.py) are these:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import string, sys, codecs, os, re, subprocess
from multivec import MonolingualModel, BilingualModel # The problem seems to be here!
modelobilingue = sys.argv[1]
model = BilingualModel(modelobilingue)

This is my bash file:

#!/bin/bash
MODELO="fapesp-bitexts.pt-en.bin"
DIRPT="/home/files/pt/"
DIREN="/home/files/en/"
cd scripts
ls -f $DIRPT | grep ".txt" | sort |
while read -r arq;
do
cp $DIRPT/$arq entrada.pt
cp $DIREN/$arq entrada.en
perl normalize-punctuation.perl -l pt < entrada.pt | perl tokenizer.perl -l pt > par.pt
perl normalize-punctuation.perl -l en < entrada.en | perl tokenizer.perl -l en > par.en
python fc.py $MODELO par.pt par.en
done

Do you have any idea how to solve my problem?

Thanks,

Helena
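
One common cause (an assumption, not confirmed from the report) is that the process launched from Java runs with a different interpreter, working directory, or PYTHONPATH than the interactive terminal, so the compiled multivec extension is not on the import path. A hedged diagnostic sketch that can be placed at the top of fc.py, before the failing import:

# Hedged diagnostic: print the interpreter and import path seen at runtime,
# so the terminal run and the Java-launched run can be compared.
import os, sys
sys.stderr.write("python: %s\n" % sys.executable)
sys.stderr.write("cwd:    %s\n" % os.getcwd())
sys.stderr.write("sys.path:\n  %s\n" % "\n  ".join(sys.path))
sys.stderr.write("PYTHONPATH=%s\n" % os.environ.get("PYTHONPATH", ""))

from multivec import MonolingualModel, BilingualModel  # fails if the paths differ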

sentence similarity

Hi,
I am looking at the public interface of the C++ MonolingualModel class and would like to confirm the steps to follow to perform sentence similarity:

  1. load: load the word model
  2. sentVec(file): compute sentence vectors (one sentence per line)
  3. saveSentVectors: save the sentence vectors in text format

Question: given one existing sentence, how do I get the closest ones?

I assume that I need to compute the similarity of the given vector with all the other vectors, and that there is no method for this. Is this correct?

I see there is a similaritySentence method for strings, but maybe you could add a method that works with vectors, or a closestSentence method with parameters (sentence vector, all sentence vectors).

Thanks
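
A hedged brute-force sketch of the "closest sentences" step, done outside the toolkit: load the vectors written by saveSentVectors and rank them by cosine similarity. It assumes the text format is one vector per line of whitespace-separated floats (check the actual output before relying on this), and the file name is an example:

# Hedged sketch: nearest sentences by cosine similarity over saved vectors.
# Format assumption: one vector per line, whitespace-separated floats.
import numpy as np

vecs = np.loadtxt("sentence-vectors.txt")             # shape: (n_sentences, dim)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit-normalize each row

query_idx = 0                                         # index of the query sentence
sims = vecs.dot(vecs[query_idx])                      # cosine similarity to all sentences
best = np.argsort(-sims)[1:6]                         # top 5, skipping the query itself
print(best, sims[best])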

Alignment

May I know what kind of alignment is used when building bilingual data? Thank you.
