sfu-natlang / glm-parser Goto Github PK

Tree-adjoining grammar based statistical dependency parser using a general linear model (glm).

License: Other

Shell 1.15% Python 96.84% C 1.44% Perl 0.56%

glm-parser's Introduction

glm-parser

Installation

Install a version of Python 2.x that includes Cython such as the anaconda Python distribution (or install Python 2.x and then Cython).

Set up the Cython libraries and classes:

cd src
echo "Compile Cython classes ..."
python setup.py build_ext --inplace


echo "Compile hvector ..."
cd hvector
python setup.py install --install-lib .
cd ../..

Or if you are on a RCG machine such as linux.cs.sfu.ca or queen.rcg.sfu.ca then do:

sh scripts/setup_env.sh

Before You Get Started

If you are trying to use the config files under src/config/, now you have to specify the location of glm-parser-data by environment variable NATLANG_DATA. For example, if your glm-parser-data is here ~/data/glm-parser-data:

$ export NATLANG_DATA=~/data

If you are using Sequential Run or Spark Standalone mode to do the testing, all the file paths by default will be considered local paths. However, when using the Glm Parser in yarn mode, it will be from HDFS(Hadoop File System) by default, if you wish to use local directories, please add file:// before the path. For example, for /Data/A_Sample_Weight_Vector.db in local directory will have to be file:///Data/A_Sample_Weight_Vector.db

Sequential run

Here is a sample training run of the parser:

####Command Line:

python glm_parser.py -i 5 -p ~/data/glm-parser-data/penn-wsj-deps/ --train="wsj_0[0-2][0-9][0-9].mrg.3.pa.gs.tab" --test="wsj_2[3-4][0-9][0-9].mrg.3.pa.gs.tab" -d `date "+%d-%m-%y"` -a --learner=average_perceptron --fgen=english_1st_fgen --parser=ceisner --format=format/penn2malt.format config/default.config

In this example we are doing 5 iterations of training -i 5 and using section 02 for training and testing on section 23 and section 24. -a turns on time accounting. -d prefix dumps the weight vector for each iteration as prefix_Iter_i.db for each iteration i. The data for training is in the directory after -p and the data must be in the CoNLL format. The directory structure (if any) is the usual Penn Treebank directory structure. The format file parameter --format indicates the column structure of the CoNLL data. The rest of the arguments load the actual filenames in learn and feature and parser respectively in order to configure the learning method, the feature generator and the parser which is used to find the argmax tree for each sentence.

####Using Config Files:

Format:

$ python glm_parser.py CONFIG_FILE [options]

Example:

$ python glm_parser.py config/default.config

Note that this will load the default settings from config/default.config which is almost identical to the command line commands above. Under default settings, glm-parser-data are stored at ~/data/glm-parser-data. If your data is not stored under that folder, please add an option to specify the location:

$ python glm_parser.py config/default.config -p YOUR_GLM_PARSER_DATA_LOCATION

Of course, you can also add other options here, for example iterations.

The training progress and the result on the testing section is saved to glm_parser.log

Parallel run

The parallel run comes in two modes: spark standalone mode and spark yarn mode.

####Spark Standalone Mode Please see scripts/run_spark_linearb.sh for example run. The command will need to use an additional option -s NUMBER or --spark NUMBER:

$ spark-submit --driver-memory 4g --executor-memory 4g --master 'local[*]' glm_parser.py -s 4 config/default.config

####Spark Yarn Mode When running on yarn mode, by default the glm_parser reads data, config, and format from HDFS(Hadoop file system). Reading these files from local directories will require the using of proper path, for example:

file:///Users/jetic/Data/penn-wsj-deps

Please note that when running on yarn mode, environment variables could not be read in config files. Also, running the programme on yarn mode would require the driver datanode to possess access to the glm-parser-data, and the config file in local directories. Support for reading data directly from HDFS has not been implemented yet. Also, dumping weight vector for yarn mode has not been implemented either.

The difference with command line options is that you need to use a --hadoop option in addition to --spark NUMBER. This will tell the glm_parser to upload the data to HDFS instead of preparing it in a local directory.

######NOTE: Please do not use the same data for testing and training, HDFS doesn't support overwriting existing data by default.

For the command of launching the glm_parser in yarn mode, please refer to scripts/run_spark_hadoop.sh for detail.

$ python setup_module.py bdist_egg
$ mv dist/module-0.1-py2.7.egg module.egg

$ spark-submit --master yarn-cluster --num-executors 9 --driver-memory 7g --executor-memory 7g --executor-cores 3 --py-files module.egg glm_parser.py -s 8 -i 1 --hadoop config/default.config

Training with Part of Speech Tagger (POS Tagger)

You can use the same config file for the GLM Parser for training the POS Tagger, for example:

$ python pos_tagger.py config/default.config

Additional options could be found by using --help option.

The POS Tagger now can only be run in sequential mode, parallel training with spark is still in development.

Using GLM Parser with Part of Speech Tagger

You can use our Part of Speech Tagger(POS Tagger) while running evaluations for the GLM Parser, but first you need to train the tagger, and retrieve a weight vector file dumped by the POS Tagger, then use --tagger-w-vector option to load it while executing glm_parser.py. For example:

$ python glm_parser.py --tagger-w-vector config/default.config

Notice that a tag_file containing all the tags must also be specified, a sample one is providied in src/tagset.txt

Running Tests for SFU NatLang Lab

Please use configs to load the default setting for each language you are testing. After testing, use scripts/proc_log.sh to process the log file.

When submitting the log file, you must follow our format:

Machine: MACHINE-NAME
\n
Branch: BRANCH-NAME
\n
Command:
COMMAND_OR_SCRIPT_YOU_USED
\n
Result:
COPY THE PROCESSED LOG CONTENTS HERE

Getting started on development

Read the file getting-started.md in the docs directory.

Contributing

Before contributing to this project, please read docs/development.md which explains the development model we use. TL;DR, it's a somewhat simplified git flow model, using the regular git commit message convention, and flake8 for checking your code.

glm-parser's People

Contributors

Stargazers

Watchers

Forkers

rahulravindren hanhanwu

glm-parser's Issues

Refactor directories in preparation for renaming repo

We wish to rename the repo to be sfu-nlp-toolkit but we need to refactor the directories to make sure the parser directories and ner, tagger directories are not in conflict.

Margin infused updates

Implement the MIRA algorithm for weight learning.

See the description and pseudo code provided in David Chiang's paper Hope and fear for discriminative training of statistical translation models.

ner_fgen under yarn mode

The problem is, the ner feature generator needs to read some data from two txt files, and in yarn mode, it's just difficult without adding additional parameters to the universal_tagger.

data_pool & fgen design

This is my design:
-----------------------------------Sean's part-----------------------------------
DataPool read two files when initiated:
config file: n lines, each line a field name
data file: n columns, each column corresponds to the field name specified in config file
They are both specified in command line arguments.

DataPool read the config file and create a field_name_list. In data file, each time an empty line is read, DataPool pack the columns of data into a list (a list of lists) and give it to Sentence along with field_name_list as arguments of init function.

Sentence should have another function (e.g. fetch_column(self, field_name)) to fetch a column of data according to the field_name.

-----------------------------------Kingston's part------------------------------

The field_names of the columns needed for a feature generator is hard-coded (in care_list) in its init function.
Sentence reads f_gen.care_list and knows what columns does f_gen care about.
Sentence fetches the data columns and pack them with a list.
Sentence calls f_gen.init_resources function to give the f_gen the data columns.
Thus f_gen gets all it wants from Sentence without having Sentence as its init function's argument.

----------------------------------Branch merge order change------------
In this design, my branch should be merged with Sean's branch, or it doesn't work. After our two branches are merged and debugged, we can merge Vivian's branch.

So what do you think about this design? If you get a better idea, please comment on this issue : )
I have changed code on my branch (data_pool_fgen_clean). Now the english_1st_fgen works fine.

Dumping trained weight vector to HDFS

Weight vector could only be dumped to a local directory now. We need to find a way to dump it to HDFS.

reorganize files

Let us create a more hierarchical structure to the repo. Perhaps have a src directory, and a expts directory to separate the main source files from the experiment helper scripts.

sparse arrays for the chart

The chart is currently allocated and de-allocated for each sentence. Initialization also takes time proportional to the number of cells in the chart. We should use a sparse array trick where a single chart is allocated with a max size and for each sentence we use a sub-part of the entire chart. Also we should implement the O(1) clear operation for sparse arrays.

http://research.swtch.com/sparse

cube-pruning introduced in paper is different from Yizhou's implementation

There’s difference between the cube-pruning algorithm introduced in the papers and the implementation by Yizhou.

In the paper, the cube-pruning goes like this:

let q loop through [s, t], and separate [s, t] into [s, q], [q+1, t]
for each separation, score the k cells at top-left, and get the highest score
keep the top k scored separation and q is the feature signature

But Yizhou’s implementation is like this:

let q loop through [s, t], and separate [s, t] into [s, q], [q+1, t]
for each separation, score the most top-left cell and push it into a heap
explore: repeatedly pop the heap, score the neighbouring cells of the cell popped, and push them into the heap
the first k popped cells and the scores is kept as the k best scores of [s, t]

The major differences are:

In the paper, only one score could be kept for one feature signature (separation), but in Yizhou’s implementation, it is possible for multiple scores being kept with the same feature signature.
In the paper, up to (t-s)*k cells are scored, but in Yizhou’s implementation, t - s + 4*k cells are scored

Add 2nd order features

Add 2nd order features and augment the Eisner algorithm to handle these extra features.

feature hashing

implement feature hashing using murmur hash.

have a flag to ignore collisions to implement the feature hashing trick where multiple features share the same weight if hashed to the same integer.

what if the number of features is larger than sys.maxint?

use Tree Adjoining Grammar derivations

add TAG elementary trees (etrees) in the edge features and also implement the method for building the derived tree described in the Carreras et al paper.

Weightvector loading does not work in glm_parser

Loading the weight vector with -l option is not working, I tested it on debug.config and got an accuracy of 20%, which should have been over 80%

Wiki Homepage

@anoopsarkar Hi Anoop, what do you expect the wiki homepage to be like? I found that it seems you have pushed an empty *.md file two years ago and I don't know if this is done by mistake.

cube pruning for higher-order features

Parsing will occur in multiple rounds (at least two). First round is done using first-order features. Subsequent rounds use the cube pruning approach described in the following papers:

Hao Zhang; Ryan McDonald. Enforcing Structural Diversity in Cube-pruned Dependency Parsing. EMNLP 2014. (this one has the clearest examples of the cube pruning algorithm).
Hao Zhang; Ryan McDonald. Generalized Higher-Order Dependency Parsing with Cube Pruning
Hao Zhang; Liang Huang; Kai Zhao; Ryan McDonald. Online Learning for Inexact Hypergraph Search. attachment

POS tag the dev and test data

Use automatically assigned POS tags (use the Stanford POS tagger) to the dev and test sets.

spark_submit issues

problem with file permissions are causing some issues with spark_submit.

use hvector for perceptron updates

use the hvector package from http://acl.cs.qc.edu/~lhuang/ to implement the weight vector and updates.

this will also make it easier to implement averaged perceptron and MIRA (see Issue #8 )

Design: Tagger parser merge step 2

For 1st order features

Eisner state C[s][t][d][c] is a dict, of which the keys and values are both tuples:

{(POS_of_head, POS_of_dependent): (score, mid_index), ....}
Keep every combination of POSTAG of head and dependent

for q in range(s, t):
    for pos_head in possible_POSTAG_of_head:
        for pos_dept in possible_POSTAG_of_dependent:
            for pos_mid in possible_POSTAG_of_q:
                # complete right triangle
                score = C[s][q][->][1](pos_head, pos_mid)][0] + C[q][t][->][0][(pos_mid, pos_dept)][0]
                if (C[s][t][->][0][(pos_head, pos_dept)][0] < score)
                    C[s][t][->][0][(pos_head, pos_dept)] = (score, q)
                # incomplete right trapezoidal
                score = C[s][q][->][0](pos_head, pos_mid)][0] + C[q][t][<-][0][(pos_mid, pos_dept)][0] + get_score(s, t, pos_head, pos_dept)
                if (C[s][t][->][1][(pos_head, pos_dept)][0] < score)
                    C[s][t][->][0][(pos_head, pos_dept)] = (score, q)
                # complete left triangle
                ....
                # incomplete left trapezoidal
                ....

For 2nd order features (with cube pruning)

Eisner state C[s][t][d][c] is a list of tuples:

[(feature_signature , POS_of_head, POS_of_dependent, score, mid_index), ....]
Keep only top k tuples

for q in range(s, t):
    for pos_head in possible_POSTAG_of_head:
        for pos_mid in possible_POSTAG_of_q:
            for pos_dept in possible_POSTAG_of_dependent:

                # complete right triangle
                list_left = SELECT tuples FROM C[s][q][->][1] WHERE tuple[1] == pos_head AND tuple[2] == pos_mid ORDER BY score
                list_right = SELECT tuples FROM C[q][t][->][0] WHERE tuple[1] == pos_mid AND tuple[2] == pos_dept ORDER BY score
                C[s][t][->][0].push((q, pos_head, pos_mid, explore(list_left, list_right), q))

                # incomplete right trapezoidal
                list_left = SELECT tuples FROM C[s][q][->][0] WHERE tuple[1] == pos_head AND tuple[2] == pos_mid ORDER BY score
                list_right = SELECT tuples FROM C[q][t][<-][0] WHERE tuple[1] == pos_dept AND tuple[2] == pos_mid ORDER BY score
                C[s][t][->][1].push((t, pos_head, pos_dept, explore(list_left, list_right), q))

                # complete left triangle
                ....
                # incomplete left trapezoidal
                ....
            keep_highest_score(C[s][t][->][0], q, pos_head, pos_mid)
            keep_highest_score(C[s][t][<-][0], q, pos_head, pos_mid)
keep_k_scores(C, s, t)

This can also implemented by dict. In this case, the function keep_k_scores would extract items from the dict, sort them, and delete items other than k best from the dict.

NER on CoNLL 2003 data

Get the NER code working on the CoNLL 2003 data.

This paper contains an exhaustive list of features for NER.

There may be some organization work to be done. Keep the NER feature generator as a separate file that is loaded (just like how glm_parser.py does it).

glm_parser.py may try to read config from unintended file

If the user accidentally* created a file called (e.g.) -i, the code will try to read from that (probably empty) file instead of parsing it as a flag.

I think the safest solution would be to get the config file path passed in using a parameter like --config because then we can warn the user that the file doesn't exist and there's also no ambiguity about what they meant.

* it's very unlikely, but still a possibility if you do stuff like :w -i because you didn't properly hit ENTER after the w

provide hooks for alternative weight update procedures

for instance we should have hooks for the following improvements to the perceptron:

averaged perceptron
MIRA (margin infused perceptron updates) or PA updates
AdaGrad (Duchi et al)
SAG (Schmidt et al)

ValueError: DATAPREP [ERROR]: source directory do not exist

I'm trying to run glm_parser.py using UD_Czech.config (no other parameters), and getting the exception in the title.

Shouldn't that directory get created if it's non-existent?

Circular use of feature generator and data pool generator

A data pool generator is constructed using feature generator: fgen is passed in data_pool constructor
A feature generator is constructed using sentences: sentence is passed in feature_generator constructor
Need to get rid of this circular dependency, so that data_pool can be created without using fgen.

convert old config names to dataformat

we switched from config to format for different formats of data files such as CoNLL-X and CoNLL-U or MALT-tab. the code still refers to the data read from the new format files as config.

implement MST algorithm

provide a choice between Eisner and the Ryan McDonald MST parsing algorithm.

particularly interesting for when we add the TAG features.

unknown words and vocabulary cutoff

We should try computing accuracy on the dev set with different count cutoffs for the vocabulary only (i.e. not a count cutoff on features but a count cutoff based on an initial pruning of the vocabulary based on unigram frequency). This will mean more compact models and faster training as well.

A separate idea (which we can come back to later on) is replacing uncommon words is to replace them with the word shape, so that "reprehensible" would get replaced by the token "xxxx", the word "Boeing" would get replaced by "Xxxx", a hyphenated word "ex-accomplice" would get replaced by "Xx-xx", a sequence of 4 digits would get replaced by "YYYY" and a longer number would become "Dddd" and a number with a decimal point becomes "Dd.dd". Some of these heuristics are not that useful since the information is already captured by the POS tag, but it does help in debugging these unknown word features later on.

performance of the first order model using words only

Track the performance of first order features using words on dev set to check if we are getting similar performance on English UAS as others.

learnerValue is not always defined in glm_parser.py

In glm_parser.py, if sys.argv[1] is not a file, and --learner is not used, then learnerValue never gets defined, but it still gets used on line 401 or 403 in

learner = get_class_from_module('sequential_learn', 'learn', learnerValue)

So you get a NameError.

prune edges based on first-order dependency model

implement the algorithm in sec 3.2 of the Carreras et al paper in order to effectively prune edges using beam search based on marginal scores from a 1st order (purely word-based) dependency model.

Memory consumption in parallel mode

While running the tagger in parallel, one couldn't help to discover that the first iteration usually takes significantly less time than the rest of the iterations. To be more specific, on my own tiny cluster, first iteration of training on penn2malt usually takes 3.4 hours, while the rest takes about 6-7 hours. Such difference exhibits signs of memory leaking.

The issue was even more severe with glm_parser. Unlike tagger which at least was able to successfully complete the test, multiple attempts to train on penn2malt on linearb all lead to outOfMemory error(with only 4 shards). When running it on my cluster it was even worse, all the dataNodes gave out outOfMemory error and crushed short after. I checked the memory usage and found all 32GB of memory on one of the nodes are all used up by the parser.

I suspect that this issue might be caused by our cython written parser and WeightVector(hvector). I haven't checked the memory management part of these implementations but I think it might also have something to do with spark.

Currently I have two ideas on addressing this issue:

We fix the cython implementation; or
instead of using cython for ceisener parser and hvector, we switch to native python for better memory management compatibility with spark and hadoop.

conll format

Hi
could you please tell me what does each column in conll dataset show? specially the last columns which are of names (ARGM-TMP*) , (ARG0*) , thanks, I will share a part of one sample here:

bc/cctv/00/cctv_0000 0 0 In IN (TOP(S(PP* - - - Speaker#1 * * * * (ARGM-TMP* * -
bc/cctv/00/cctv_0000 0 1 the DT (NP(NP* - - - Speaker#1 (DATE* * * * * * -
bc/cctv/00/cctv_0000 0 2 summer NN ) summer - - Speaker#1 * * * * * * -
bc/cctv/00/cctv_0000 0 3 of IN (PP - - - Speaker#1 * * * * * * -

Spark 2.0.2

I just noticed that Spark 2.0.2 is already released.
Previously all our testings are done under spark version 1, I think we'll need to look into this newer version. The only problem is that the school linearb computer and Hadoop cluster doesn't have the newest version of spark installed yet. I'll try to install it on my own cluster first.

feature frequency cutoff analysis

Do an analysis using dev set accuracy of using feature frequency as a cutoff, so ignoring features that occur fewer than k times. Produce some graphs to enable us to speed up the parser by reducing the number of features.

Ontonotes data and experiments

Task: Convert the Ontonotes data into the CoNLL format.

The instructions for conversion are given here: http://cemantix.org/data/ontonotes.html

It also contains the script to convert to CoNLL format for all three languages: English, Chinese and Arabic.

This issue has been fixed in 3rd_order_feature branch

Alternative to Cython

pybind11 is an interesting alternative to Cython

https://github.com/pybind/pybind11

Discuss.