semeval

Semantic Textual Similarity (STS) system created by the MathLingBudapest team to participate in Tasks 1 and 2 of SemEval 2015.

Dependencies

First, install the SciPy stack on your machine by following the instructions for your system on this page.

Then run

sudo python setup.py install

Our pipeline relies on the hunpos tool for part-of-speech tagging, which can be downloaded from this page. Download the binary hunpos-tag and the English model en_wsj.model and place them in the directory semeval/hunpos (or change the value of hunpos_dir in the configuration file to point to a different location).

To run the precompiled binaries on a 64-bit machine, you will also need to install the 32-bit C library:

sudo dpkg --add-architecture i386 
sudo apt-get update
sudo apt-get install libc6:i386 

The machine similarity component also requires the 4lang module. To download and install it, follow these instructions. Then configure it by editing the configuration file configs/sts_machine.cfg based on the instructions in the 4lang README.

Some of our configurations also make use of word embeddings. To get the pre-trained models, download this archive and extract it in your semeval directory.

To download the required NLTK corpora, use:

python -m nltk.downloader stopwords wordnet

Usage

The STS system can be invoked from the repo's base directory using:

    cat sts_test.txt | python semeval/paraphrases.py -c configs/sts.cfg > out
    cat twitter_test.txt | python semeval/paraphrases.py -c configs/twitter.cfg > out

These test files follow the format of the Semeval 2015 Tasks 1 and 2, respectively.
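For reference, the STS input files contain one sentence pair per line, with the two sentences separated by a tab. A minimal reader for that layout might look like the following sketch (the sample sentence pair is illustrative, not taken from the task data):

```python
# Sketch: reading SemEval STS-style input, assuming one tab-separated
# sentence pair per line.
def read_pairs(lines):
    """Yield (sentence1, sentence2) tuples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        sen1, sen2 = line.split("\t")[:2]
        yield sen1, sen2

pairs = list(read_pairs(["A man plays a guitar.\tA man is playing guitar.\n"]))
```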

To use the machine similarity component, run

cat sts_test.txt | python semeval/paraphrases.py -c configs/sts_machine.cfg > out

Regression

Regression used for Twitter data

Specifying regression mode in the final_score section enables regression (see configs/twimash.cfg). This mode needs to know the locations of the train and test files, which are specified in the regression section:

[regression]
train: data/train.data
train_labels: data/train.labels
test: data/test.data
gold: data/test.label
binary_labels: true
outfile: data/predicted.labels

Specifying a gold file is optional; the rest of the options are mandatory. If you specify a gold file, precision, recall and F-score are computed and printed to stdout.
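The evaluation step described above can be sketched as follows, assuming binary 0/1 labels; this is an illustration of the metric computation, not the repository's actual implementation:

```python
# Sketch: precision, recall and F-score over binary 0/1 labels.
def prf(gold, predicted):
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```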

Regression used for Task 2 STS data

Sample uses of regression.py:

 python scripts/regression.py regression_train all.model semeval_data/sts_trial/201213_all data/1213_all/ngram data/1213_all/lsa data/1213_all/machine

 for f in data/2014-test/nosim_4gr_d/STS.input.*; do
     topic=`basename $f | cut -d'.' -f3`
     echo $topic
     python scripts/regression.py regression_predict all.model \
         data/2014-test/nosim_4gr_d/STS.input.$topic.txt.out \
         data/2014-test/lsa_sim_bp/STS.input.$topic.txt.out \
         data/2014-test/machine_sim_nodes2/STS.input.$topic.txt.out \
         > data/2014-test/regr/STS.input.$topic.txt.out
 done
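As a rough illustration of what regression_train and regression_predict do, the sketch below fits a least-squares combination of per-system similarity scores (e.g. ngram, lsa, machine) and applies it to new pairs. The function names, file handling, and model format of the real regression.py are not reproduced here; this is only the underlying idea under those assumptions:

```python
import numpy as np

def train(features, gold):
    """Fit weights combining per-system scores via least squares.

    features: (n_pairs, n_systems) similarity scores; gold: (n_pairs,) labels.
    """
    X = np.column_stack([features, np.ones(len(gold))])  # add a bias column
    weights, *_ = np.linalg.lstsq(X, np.asarray(gold), rcond=None)
    return weights

def predict(weights, features):
    X = np.column_stack([features, np.ones(len(features))])
    return X @ weights

# Toy demonstration: one system whose score is half the gold label.
weights = train([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])
scores = predict(weights, [[4.0]])
```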

Reverting to the Semeval-2015 submission version

This code and its dependencies are under constant development. To run the version of the code that was used to create the MathLingBudapest team's submissions (although many bugs have been fixed since), use the following revisions:

Task 1: https://github.com/juditacs/semeval/tree/submitted

Task 2: https://github.com/juditacs/semeval/tree/15863ba5bc7f857291322c707a899c7c802a7c88

If you'd also like to reproduce the machine similarity component as it was at the time of the submission, you'll need the following revision of the pymachine repository:

https://github.com/kornai/pymachine/tree/3d936067e775fc8aa56c06388eb328ef2c6efe75

Contact

This repository is maintained by Judit Ács and Gábor Recski. Questions, suggestions, bug reports, etc. are very welcome and can be sent by email to recski at mokk bme hu.

Publications

If you use the semeval module, please cite:

@InProceedings{Recski:2015a,
  author    = {Recski, G\'{a}bor  and  \'{A}cs, Judit},
  title     = {MathLingBudapest: Concept Networks for Semantic Similarity},
  booktitle = {Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)},
  month     = {June},
  year      = {2015},
  address   = {Denver, Colorado},
  publisher = {Association for Computational Linguistics},
  pages     = {543--547},
  url       = {http://www.aclweb.org/anthology/S15-2086}
}


Issues

LSA embedding alternatives

The currently used embedding (the word2vec default) assigns high similarity to dissimilar words. We should try to find an embedding that is stricter with dissimilar words.

Acronyms and head

Simple logic first; something based on external resources later if the need arises.

Stopword filtering filters pronouns

Stopword filtering removes all pronouns, so subject pronouns and their possessive counterparts are never aligned (they would get the score 1). I am creating a PR to fix this.
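A minimal sketch of such a fix, keeping pronouns while filtering other stopwords; the pronoun list and function are illustrative and do not reproduce the actual PR:

```python
# Pronouns to exempt from stopword filtering (hand-listed, illustrative).
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "my", "your", "his", "her", "its", "our", "their"}

def filter_stopwords(tokens, stopwords):
    """Drop stopwords, but keep pronouns so they can still be aligned."""
    return [t for t in tokens
            if t.lower() not in stopwords or t.lower() in PRONOUNS]

kept = filter_stopwords(["he", "is", "a", "doctor"], {"he", "is", "a"})
```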

Twitter embedding

There is no large Twitter corpus freely available (or it is very hard to find). Smaller annotated ones are available but they are too small for building an embedding.

However, the Rovereto Twitter N-Gram Corpus is available here: http://clic.cimec.unitn.it/amac/twitter_ngram/

I downloaded it and am now trying to build an embedding based on 6-grams.
Currently all 6-grams with a frequency count below 50 are discarded, although this threshold may be too strict.
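The cutoff step can be sketched as below; the count-then-ngram, tab-separated input layout is an assumption about the corpus format, and the threshold of 50 comes from the issue text:

```python
THRESHOLD = 50  # minimum frequency count, per the issue; may be too strict

def filter_ngrams(lines, threshold=THRESHOLD):
    """Yield n-grams whose frequency count meets the threshold.

    Assumes each line looks like 'count<TAB>ngram'.
    """
    for line in lines:
        count, ngram = line.rstrip("\n").split("\t", 1)
        if int(count) >= threshold:
            yield ngram

kept = list(filter_ngrams(["120\tthe cat sat on the mat\n",
                           "3\ta very rare six gram\n"]))
```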

Sentiment analysis on Twitter

Many mistakes on the Twitter dataset are due to aligning opposite sentiment tweets. Most words are the same (or very similar) in these pairs, so the simple aligning method is not capable of distinguishing these pairs.

I'm looking into Twitter sentiment analysis solutions, such as: http://sentistrength.wlv.ac.uk/

HunPos results depend on previous sentence

$ cat semeval_data/sts_trial/sts-en-test-gs-2014/STS.input.headlines.txt | head -292 | tail -1 | python src/align_and_penalize.py lsa 2>err1
0.629661449331
$ cat semeval_data/sts_trial/sts-en-test-gs-2014/STS.input.headlines.txt | head -292 | python src/align_and_penalize.py lsa 2>err2 | tail -1
0.652330724666

According to the logs, HunPos tags the words differently if the input is 292 lines:

$ tail -10 err1
2014-11-17 17:22:03,928 : align_and_penalize (324) - INFO - penalty for (u'end', 'NN'): wf: 0.102864932964, wp: 1
2014-11-17 17:22:03,928 : align_and_penalize (293) - INFO - preferred pos: NN
2014-11-17 17:22:03,928 : align_and_penalize (293) - INFO - preferred pos: NN
2014-11-17 17:22:03,928 : align_and_penalize (324) - INFO - penalty for (u'season', 'NN'): wf: 0.130057389931, wp: 1
2014-11-17 17:22:03,928 : align_and_penalize (293) - INFO - preferred pos: NNP
2014-11-17 17:22:03,928 : align_and_penalize (293) - INFO - preferred pos: NNP
2014-11-17 17:22:03,928 : align_and_penalize (333) - INFO - penalty for (u'Football', 'NNP'): wf: 0.176370547033, wp: 1
2014-11-17 17:22:03,928 : align_and_penalize (345) - INFO - P1A: 0.0232922322896 P2A: 0.0220463183792 P1B: 0.0 P2B: 0.0
2014-11-17 17:22:03,928 : align_and_penalize (283) - INFO - T=0.675, P=0.0453385506687
2014-11-17 17:22:03,929 : align_and_penalize (812) - WARNING - 0...

When the input is only the 292nd line:

$ tail -10 err2
2014-11-17 17:26:36,237 : align_and_penalize (295) - INFO - not preferred pos: None
2014-11-17 17:26:36,237 : align_and_penalize (324) - INFO - penalty for (u'end', None): wf: 0.102864932964, wp: 0.5
2014-11-17 17:26:36,237 : align_and_penalize (295) - INFO - not preferred pos: JJ
2014-11-17 17:26:36,237 : align_and_penalize (295) - INFO - not preferred pos: JJ
2014-11-17 17:26:36,237 : align_and_penalize (324) - INFO - penalty for (u'season', 'JJ'): wf: 0.130057389931, wp: 0.5
2014-11-17 17:26:36,238 : align_and_penalize (295) - INFO - not preferred pos: None
2014-11-17 17:26:36,238 : align_and_penalize (295) - INFO - not preferred pos: None
2014-11-17 17:26:36,238 : align_and_penalize (333) - INFO - penalty for (u'Football', None): wf: 0.176370547033, wp: 0.5
2014-11-17 17:26:36,238 : align_and_penalize (345) - INFO - P1A: 0.0116461161448 P2A: 0.0110231591896 P1B: 0.0 P2B: 0.0
2014-11-17 17:26:36,238 : align_and_penalize (283) - INFO - T=0.675, P=0.0226692753344

I copied the logs in question to /home/judit/projects/semeval/log/hunpos_issue

is_num_equivalent returns False for 1000 and 1,000

Commas are replaced by periods and then treated as decimal points, so I guess 1,000 becomes 1.
My suggestion is simply to discard commas: AFAIK they are never used as decimal points in the U.S.
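The suggested fix might look like this; the function name mirrors the issue title, but the real implementation in the repository may differ:

```python
def is_num_equivalent(a, b):
    """Compare two number strings, stripping thousands-separator commas
    instead of turning them into decimal points."""
    try:
        return float(a.replace(",", "")) == float(b.replace(",", ""))
    except ValueError:
        return False  # at least one token is not a number
```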
