re2nn-seq's Introduction

RE2NN-SEQ

Source code for our EMNLP 2021 paper: "Neuralizing Regular Expressions for Slot Filling".

As the title suggests, we turn regular expressions (REs) for slot filling into a trainable neural network.

[Figure: slot filling pipeline]

Prepare environment

cd src_seq && pip install -r requirements.txt

Download data/rule/transducer files

Please first download the data files (data.zip) from one of these links: Google Drive, Tencent Drive. Then extract the archive under the /data folder.

unzip data.zip

Your directory structure should be like:

/RE2NN-SEQ
    /data
        /ATIS-BIO
        /ATIS-ZH-BIO
        /SNIPS-BIO
    /src_seq
    /model_seq
    ...
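
After extracting, a quick check can confirm that the layout matches the tree above. This is a minimal sketch: missing_dirs and EXPECTED_DIRS are hypothetical helpers (not part of the repository), and only the directories named in the tree are checked.

```python
from pathlib import Path

# Directory names copied from the tree above; whatever "..." stands for is not checked.
EXPECTED_DIRS = [
    "data/ATIS-BIO",
    "data/ATIS-ZH-BIO",
    "data/SNIPS-BIO",
    "src_seq",
    "model_seq",
]


def missing_dirs(repo_root):
    """Return the expected sub-directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the repository root (the RE2NN-SEQ folder).
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```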

We provide the following:

  • raw dataset files
  • preprocessed dataset files
  • embedding files (e.g. glove.300.emb)
  • rule files (e.g. bio.rules.config)
  • created and decomposed transducer files (e.g. /ATIS-BIO/automata/xxx.pkl)

Run the code

Make sure you have downloaded the data files. The model_seq folder contains example configs. From the src_seq directory, you can run the code with the hyper-parameters stored in a config and get the results using:

python main.py --args_path ../model_seq/config_file_path.res

key parameters

  • dataset (ATIS/ATIS-ZH/SNIPS)
  • automata_path
  • farnn (0/2)
  • use_crf (0/1)
  • method (decompose/onehot)
  • use_bert (0/1)
  • bert_finetune (0/1)
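
If you want to launch several runs with different settings, the parameters above can be collected in a dictionary and turned into a command line. This is only a sketch: build_command is a hypothetical helper, and it assumes each parameter name is accepted as a --name flag by main.py, which this README shows only for some of them (for example --dataset, --method and --automata_path).

```python
import shlex


def build_command(params, script="main.py"):
    """Turn a dict of parameters into an argument list for `python main.py`."""
    cmd = ["python", script]
    for name, value in params.items():
        # Assumption: every parameter maps to a "--name value" pair.
        cmd += [f"--{name}", str(value)]
    return cmd


# Illustrative values only; the automata path is the one used later in this README.
params = {
    "dataset": "SNIPS-BIO",
    "method": "onehot",
    "automata_path": "../data/SNIPS-BIO/automata/my_rule.ID2",
    "use_crf": 0,
}
print(shlex.join(build_command(params)))
```

The resulting list can be passed to subprocess.run from inside the src_seq directory.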

Guidelines on using your own data and rules

dataset

To preprocess your dataset and create the vocab files and pretrained static embeddings, modify the function create_slot_dataset in data.py and run it. Make sure this step is correct!
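
One way to gain confidence in this step is to validate the BIO tags of your own data before or after preprocessing. The sketch below is not taken from data.py; check_bio is a hypothetical helper that assumes each sentence is available as a list of tokens plus a list of tags of the form O, B-slot or I-slot.

```python
def check_bio(tokens, tags):
    """Return a list of human-readable problems found in one tagged sentence."""
    problems = []
    if len(tokens) != len(tags):
        problems.append(f"{len(tokens)} tokens but {len(tags)} tags")
    prev = "O"
    for i, tag in enumerate(tags):
        if tag != "O" and not tag.startswith(("B-", "I-")):
            problems.append(f"position {i}: unexpected tag {tag!r}")
        elif tag.startswith("I-"):
            # An I- tag must continue a B- or I- tag of the same slot type.
            if prev == "O" or prev[2:] != tag[2:]:
                problems.append(f"position {i}: {tag} does not continue {prev}")
        prev = tag
    return problems
```

An empty list means the sentence passed both checks; anything else points at the position to inspect.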

rules for slot filling

Our rules for slot filling are regular expressions with capturing groups. The provided rule files (data/xxx-BIO/bio.rules.config) are a useful reference when writing your own rules.

The basic syntax is Sub-expression<:>Label: the content matched by the sub-expression is tagged with the given label.

The syntax symbols differ slightly from standard regular expressions: $ is the wildcard word and OO is the wildcard label. We also support some handy shortcuts, such as sub-expression variables and comments. For example:

// class_type
@class_type@=(first class | coach class | coach | thrift)
$<:>OO * @class_type<:>class_type@ $<:>OO *

This RE tags "first class" and "coach class" as B-class_type I-class_type, and "coach" and "thrift" as B-class_type, regardless of the preceding and following context.
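
To make the expected tagging concrete, here is a plain-Python illustration of the output described above. It does not use the repository's rule parser or transducers; tag_phrases is a hypothetical helper, preferring the longest phrase is an assumption, and it tags every occurrence in the sentence, which is a simplification of the rule.

```python
CLASS_TYPE = ["first class", "coach class", "coach", "thrift"]


def tag_phrases(tokens, phrases, label):
    """Tag each occurrence of a phrase with B-/I- labels and everything else with O."""
    # Try longer phrases first so "coach class" wins over "coach" (an assumption).
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for words in candidates:
            if tokens[i:i + len(words)] == words:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(words)):
                    tags[j] = f"I-{label}"
                i += len(words)
                break
        else:
            # No phrase starts at this token; move on.
            i += 1
    return tags


print(tag_phrases("show me first class fares".split(), CLASS_TYPE, "class_type"))
# ['O', 'O', 'B-class_type', 'I-class_type', 'O']
```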

You can sanity-check your rules by running:

cd src_seq/rule_utils
python rule_pre_parser.py --rule_path ../../data/ATIS-BIO/bio.rules.config

Then inspect the parsed rule files. If some rules are missing, they probably contain syntax errors.
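
To know how many rules to expect in the parsed output, you can count the lines of your rule file by type. This is a rough sketch based only on the syntax described above (// comments, @name@= variable definitions, <:> inside rules); classify_rule_line and summarize are hypothetical helpers, and the real parser may accept more syntax than this.

```python
import re
from collections import Counter

# A variable definition looks like "@class_type@=(...)" in the example above.
VARIABLE_DEF = re.compile(r"^@(\w+)@\s*=")


def classify_rule_line(line):
    """Classify a line as 'blank', 'comment', 'variable', 'rule' or 'unknown'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("//"):
        return "comment"
    if VARIABLE_DEF.match(stripped):
        return "variable"
    if "<:>" in stripped:
        return "rule"
    return "unknown"


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_rule_line(line) for line in lines)
```

Lines reported as 'unknown' are good candidates to inspect first when a rule is missing from the parsed file.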

transducer

To evaluate the rules, first create a transducer (a transducer can be viewed as an automaton whose vocabulary is the Cartesian product of the transducer's input vocabulary and output vocabulary):

cd src_seq/wfa
python create_dataset_automata.py --independent 2 --decompose 0 \
--rule_name bio.rules.config --dataset SNIPS-BIO --k_best 1 \
--automata_name my_rule

The --independent flag selects the transducer type:

  • FST: --independent 0
  • i-FST: --independent 2

This produces the automata files (my_rule.ID2) in data/SNIPS-BIO/automata, along with a drawing of the transducer graph.
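
The remark that a transducer can be viewed as an automaton over (input, output) pairs can be illustrated with a toy example. The arcs below are hand-written for illustration and are not produced by create_dataset_automata.py; ARCS and accepts are hypothetical names, "$" is used as the wildcard word, and "O" is assumed to be the outside tag.

```python
# Each arc is keyed by (state, (word, label)) and leads to the next state.
ARCS = {
    (0, ("$", "O")): 0,                  # any word before the slot, tagged O
    (0, ("coach", "B-class_type")): 1,   # start of the slot
    (1, ("class", "I-class_type")): 2,   # optional continuation of the slot
    (1, ("$", "O")): 2,                  # any word after the slot, tagged O
    (2, ("$", "O")): 2,
}


def accepts(pairs, arcs=ARCS, start=0, final=frozenset({1, 2})):
    """Run the transducer as a plain automaton over (word, label) symbols."""
    state = start
    for word, label in pairs:
        # Try the exact arc first, then the wildcard-word arc with the same label.
        nxt = arcs.get((state, (word, label)))
        if nxt is None:
            nxt = arcs.get((state, ("$", label)))
        if nxt is None:
            return False
        state = nxt
    return state in final


print(accepts([("a", "O"), ("coach", "B-class_type"),
               ("class", "I-class_type"), ("seat", "O")]))  # True
print(accepts([("a", "O"), ("class", "I-class_type")]))     # False
```

Because each symbol is a (word, label) pair, checking whether a tagging is allowed for a sentence becomes ordinary automaton acceptance.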

Then run the automata/rules:

cd src_seq
python main.py --dataset SNIPS-BIO --method onehot \
--automata_path ../data/SNIPS-BIO/automata/my_rule.ID2 \
--normalize_automata none --rand_constant 0

To decompose the transducer:

cd src_seq/wfa
python create_dataset_automata.py --independent 2 --decompose 1 \
--rule_name bio.rules.config --dataset SNIPS-BIO --k_best 3 \
--automata_name my_rule_decomposed

Then run and train the decomposed transducer:

cd src_seq/
python main.py --independent 2 --dataset SNIPS-BIO --method decompose \
--automata_path your_decomposed_automata.pkl --train_portion 1 --beta 0.1 \
--update_nonlinear tanh --lr 0.001

re2nn-seq's People

Contributors

jeffchy

re2nn-seq's Issues

run the code

When I run python main.py --arg_path ../model_seq/config_file_path.res, I get the following error:
G:\Anaconda3\envs\emnlp21\python.exe: can't open file 'main.py': [Errno 2] No such file or directory
How can I fix this?

Run and train the decomposed transducer

Hello, when I run the last command, I get the following error:
File "../src_seq/farnn/model_decompose_single.py", line 76, in init_forward_parameters
self.embed_r_generalized = nn.Parameter(torch.matmul(_pinv, _V), requires_grad=True) # D x R
RuntimeError: size mismatch, m1: [100 x 8], m2: [47 x 150] at /pytorch/aten/src/TH/generic/THTensorMath.cpp:197

For --automata_path I used the path of the IIID.automata.....pkl file generated by running "python create_dataset_automata.py --independent 2 --decompose 1 --rule_name bio.rules.config --dataset SNIPS-BIO --k_best 3 --automata_name my_rule_decomposed". I replaced the data with a different dataset. What is causing this error?
@jeffchy

Problem when running data.py

Hello, when I run data.py I get the following error: ValueError: ../data/emb/fasttext/cc.zh.100.bin cannot be opened for loading!
I also could not find the cc.zh.100.bin file. How can I obtain it?

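How was bio.rules.config written?

Taking the SNIPS-BIO dataset as an example, how were the corresponding regular expressions written? Do they correspond to train.txt? In other words, is there a matching regular expression for every tag and text in train.txt?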
