General motivation
Computational lexical semantics is a subfield of Natural Language Processing that studies computational models of lexical items, such as words, noun phrases and multiword expressions. Modelling semantic relations between words (e.g. synonyms) and word senses (e.g. “python” as a programming language vs. “python” as a snake) is of practical interest in the context of various language processing and information retrieval applications.
During the last 20 years, several accurate computational models of lexical semantics have emerged, such as distributional semantics (Biemann, 2013; Baroni, 2010) and word embeddings (Mikolov, 2013). In this thesis, you will deal with one of the state-of-the-art approaches to lexical semantics, developed at TU Darmstadt, called JoBimText: http://jobimtext.org. According to multiple evaluations, the JoBimText approach yields cutting-edge accuracy on tasks such as semantic relatedness (Biemann, 2013). In addition, it enables features missing from other frameworks, such as automatic word sense discovery.
The current implementation of JoBimText lets us process text corpora of up to 50 GB on a mid-sized Hadoop cluster with 400 cores and 50 TB of HDFS. Your goal will be to re-engineer the system so that it is able to process text corpora of up to 5 TB (100 times larger) on the same cluster. This goal will be achieved by using the modern Apache Spark framework for distributed computation, which allows intermediate results to be kept in memory instead of being dumped to temporary files on disk after every step, and thus makes iterative and incremental algorithms considerably more efficient.
The ultimate goal of the project will be to develop a system that is able to compute a distributional thesaurus from the Common Crawl corpus (a 541 TB dataset on Amazon AWS). This would be the biggest experiment in distributional semantics conducted so far and would be in line with this initiative: http://www.webdatacommons.org/. Read this thesis for reference on a similar project: thesis.pdf
Motivation of the initial experiment
The initial experiment is needed as a proof of concept and to show the feasibility of the results. In this experiment, you will work with the trigram holing variant of the JoBimText (JBT) approach to constructing a distributional thesaurus (DT). The goals of the experiment are to:
- Ensure by extensive testing that the new (Spark) implementation provides the same outputs as the original (MapReduce) implementation (see the example comparison commands after this list).
- Measure and compare the performance of the original and the new implementations.
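A minimal sketch of such an output comparison, assuming both pipelines produce tab-separated DT files (the file names dt-mapreduce.csv and dt-spark.csv are placeholders):
sort dt-mapreduce.csv > dt-mr.sorted
sort dt-spark.csv > dt-spark.sorted
diff dt-mr.sorted dt-spark.sorted | head   # empty output means the two DTs are identical
If the similarity scores differ only in floating-point formatting, compare the word pairs alone, e.g. by applying cut -f 1,2 to both files before sorting.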
Implementation of the initial experiment
- Download the corpus. Download the Wikipedia corpus: http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv. For testing purposes, make subcorpora of 50 MB and 500 MB (see the sketch after this item). First do all the experiments on these smaller chunks, then proceed with the entire 5 GB dataset. All experiments are conducted locally on your machine.
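One possible way to create the subcorpora with GNU coreutils (the output file names are assumptions):
head -c 50M wacky-surface.csv > wacky-surface-50mb.csv    # first 50 MB of the corpus
head -c 500M wacky-surface.csv > wacky-surface-500mb.csv  # first 500 MB of the corpus
Note that cutting at a byte boundary may truncate the last line; drop that line if the pipeline is sensitive to malformed input.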
- Compute a trigram DT with the original JBT implementation:
python generateHadoopScript.py -q shortrunning -hl trigram -nb corpora/en/wikipedia_eugen -f 5 -w 5 -wf 2
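Note that generateHadoopScript.py only generates the job script; the generated shell script (its exact name depends on the corpus name and the holing parameters) then has to be executed to actually compute the DT, e.g.:
bash <generated-script>.sh   # placeholder; use the file actually produced by generateHadoopScript.py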
- Get the DT from the outputs of the original pipeline. A description of the output formats is available here: http://panchenko.me/jbt/
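For orientation, a DT file is expected to be a tab-separated list of word pairs with a similarity score, one pair per line, roughly of the following form (illustrative values; verify the exact columns against the format description linked above):
python	perl	42
python	ruby	38
python	snake	17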
- Compute the same DT with the new pipeline (also using trigram holing). Make sure to use exactly the same parameters! Follow the instructions here: https://github.com/tudarmstadt-lt/noun-sense-induction-scala. Use this script to get the parameters of the trigram holing without lemmatization: https://github.com/tudarmstadt-lt/noun-sense-induction-scala/blob/master/scripts/run-nsi-trigram-nolemma.sh
- Create a table in Google Docs comparing the original and the new DT outputs. Rows are runs; columns are the following measurements (example commands for several of them are sketched after this list):
- size of the input corpus, MB
- number of words in DT:
cat dt.csv | cut -f 1 | sort | uniq | wc -l
- number of relations in DT
- overlap of relations, percent
- size of DT in MB
- DT computation time in seconds on one core:
time
- output size of all files in MB
- memory consumed in MB
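Example commands for several of these measurements (dt-mr.csv and dt-spark.csv are placeholders for the two DT files; the memory figure assumes GNU time is installed as /usr/bin/time):
wc -l < dt.csv                        # number of relations in the DT
comm -12 <(cut -f 1,2 dt-mr.csv | sort) <(cut -f 1,2 dt-spark.csv | sort) | wc -l
                                      # relations present in both DTs; divide by the relation count
                                      # of one of the DTs to get the overlap in percent
du -m dt.csv                          # size of the DT in MB
/usr/bin/time -v <pipeline command>   # reports elapsed time and maximum resident set size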
- Put the results of the experiments with both pipelines online, e.g. on Google Drive.
- Write a report including the table above.
- Write an outline of the thesis. Add references, e.g. Spark books and the master's theses listed below.
References
- Biemann, Chris, and Martin Riedl. "Text: Now in 2D! a framework for lexical expansion with contextual similarity." Journal of Language Modelling 1.1 (2013): 55-95.
- Baroni, Marco, and Alessandro Lenci. "Distributional memory: A general framework for corpus-based semantics." Computational Linguistics 36.4 (2010): 673-721.
- Ruppert, Eugen, Manuel Kaufmann, Martin Riedl, and Chris Biemann. "JoBimViz: A Web-based Visualization for Graph-based Distributional Semantic Models." ACL-IJCNLP 2015 (2015): 103.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Julian Felix Maria Seitner. "The Web Tuples Database: A Large-scale Resource of Hyponymy Relations." Master's thesis.
- Johannes Simon. "Word Sense Induction and Disambiguation using Distributional Semantics." Master's thesis.