Follow the [data/train_and_test] structure:
- For `name` in `{train, test}`, create files `{name}.words.txt` and `{name}.tags.txt` that contain one sentence per line, with words/tags separated by spaces, using the IOBES tagging scheme.
- Create files `vocab.words.txt`, `vocab.tags.txt` and `vocab.chars.txt` that contain one token per line. This can be automated using the corresponding function in [src/preprocessing.py], altering the field `DATASET_DIR` to point to the location of the files from Step 1 and `DATA_DIR` to point to the output directory.
- Create a `glove.*.npz` file containing one array `embeddings` of shape `(size_vocab_words, 300)` using the GloVe 840B vectors. This can be built with the corresponding function in [src/preprocessing.py] after completing Step 2, altering the field `VECTOR_DIR` to point to the desired output directory.
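The three preparation steps above can be sketched as follows. This is a hedged stand-in, not the repo's actual `src/preprocessing.py` (whose function names are not shown here); the random vectors stand in for real GloVe 840B lookups, and the output name `glove.trimmed.npz` is only an example of the `glove.*.npz` pattern:

```python
# Stand-in sketch for the data-preparation steps above (not the repo's
# src/preprocessing.py): build the vocab files and an embeddings .npz.
from pathlib import Path
import numpy as np

def build_vocabs(words_files, tags_files):
    """Collect the unique words, tags and characters from the data files."""
    words, tags, chars = set(), set(), set()
    for path in words_files:
        for line in Path(path).read_text().splitlines():
            for w in line.split():
                words.add(w)
                chars.update(w)
    for path in tags_files:
        for line in Path(path).read_text().splitlines():
            tags.update(line.split())
    return sorted(words), sorted(tags), sorted(chars)

def write_vocab(tokens, path):
    """Write one token per line, as the vocab files require."""
    Path(path).write_text("\n".join(tokens) + "\n")

def pack_embeddings(vocab_words, out_path, dim=300):
    """Random vectors here stand in for real GloVe 840B (300-d) lookups."""
    emb = np.random.randn(len(vocab_words), dim).astype(np.float32)
    np.savez_compressed(out_path, embeddings=emb)

# Tiny demo: one sentence per line, IOBES tags aligned with the words.
Path("train.words.txt").write_text("John lives in New York\n")
Path("train.tags.txt").write_text("S-PER O O B-LOC E-LOC\n")
words, tags, chars = build_vocabs(["train.words.txt"], ["train.tags.txt"])
write_vocab(words, "vocab.words.txt")
write_vocab(tags, "vocab.tags.txt")
write_vocab(chars, "vocab.chars.txt")
pack_embeddings(words, "glove.trimmed.npz")
```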
TensorFlow 1.15 should be used; other versions are untested. The remaining packages should work without pinning a specific version.
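A possible environment setup (assumed, not taken from the repo; note that TensorFlow 1.x ships wheels only up to Python 3.7):

```shell
# Assumed setup commands (not from the repo). Pin TF explicitly so pip
# does not resolve to TF 2.x; use a Python <= 3.7 interpreter.
pip install "tensorflow==1.15.*"
pip install numpy
```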
Once all the required data files are produced, run `main.py` with the correct parameters:
- model (`-m`): three base architectures; `lc` for LSTM-CRF, `llc` for LSTM-LSTM-CRF and `lcc` for LSTM-CRF-CRF.
- embeddings (`-e`): three embeddings to pair with a base architecture; `glove` for GloVe, `m2v` for Morph2Vec and `hybrid` for their combination. Make sure the correct `.npz` files are present in the `data` folder, and rerun preprocessing each time a different embedding is used.
- preprocessing (`-p`): optional flag to run the preprocessing scripts; must be passed when switching to a new embedding.
- mode (`-a`): four modes, with directories used as stated in [src/data.json]; `train` to train a model from scratch, `k_fold` to perform cross-validation (default=5), `test` to test and validate a specific input (default=None) and `use` to generate output from a specified file with a trained model (default=None).
```
main.py -m <model> -e <embed> -p (preprocess flag) -a <mode>
```
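As an illustration only (the real argument handling lives in `main.py` and may differ), the four flags above could be wired up with `argparse` like this:

```python
# Illustrative parser for the -m/-e/-p/-a flags described above; the
# actual main.py may define them differently.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="LSTM-CRF tagger runner")
    p.add_argument("-m", "--model", choices=["lc", "llc", "lcc"], required=True,
                   help="lc=LSTM-CRF, llc=LSTM-LSTM-CRF, lcc=LSTM-CRF-CRF")
    p.add_argument("-e", "--embeddings", choices=["glove", "m2v", "hybrid"],
                   required=True, help="embedding to pair with the model")
    p.add_argument("-p", "--preprocess", action="store_true",
                   help="run the preprocessing scripts first")
    p.add_argument("-a", "--mode", choices=["train", "k_fold", "test", "use"],
                   default="train", help="run mode")
    return p

# Example invocation, parsed from an explicit argument list:
args = build_parser().parse_args(["-m", "lc", "-e", "glove", "-p", "-a", "train"])
```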
To run multiple tests with slight changes to the parameters, check out the [src/multiple_run.sh] script.
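A sweep in the spirit of `src/multiple_run.sh` (its actual contents are not shown here, so this is a hypothetical sketch) might loop over every model/embedding pair:

```shell
# Hypothetical sweep sketch, not the repo's src/multiple_run.sh.
# Prints each command as a dry run; remove "echo" to launch for real.
n=0
for m in lc llc lcc; do
  for e in glove m2v hybrid; do
    echo "python main.py -m $m -e $e -p -a k_fold"
    n=$((n+1))
  done
done
```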