ETypeClus

This repository contains the code and data for the EMNLP 2021 paper "Corpus-based Open-Domain Event Type Induction".

Datasets and Resources

Please download the datasets and related resources at: https://drive.google.com/drive/folders/1_QVv9XwN6PjZGdeMJWqW5D75NJmaD6F1?usp=sharing

  • Each dataset has its own subfolder, e.g., ./covid19/ and ./pandemic/.
  • The verb sense dictionary and background corpus statistics are placed under the ./resources/ subfolder.

Please put the downloaded folders under the root directory.
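
Before running the pipeline, it may help to confirm that the downloaded folders ended up where the commands expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_resource_dirs` is hypothetical, and the folder list only covers the names mentioned above.

```python
from pathlib import Path

# Folder names taken from the README text; the helper itself is hypothetical.
EXPECTED_DIRS = ("covid19", "pandemic", "resources")


def missing_resource_dirs(root="."):
    """Return the expected subfolders that do not exist under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_DIRS if not (root_path / name).is_dir()]
```

Calling `missing_resource_dirs(".")` from the repository root returns an empty list when all three folders are present.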

Running ETypeClus

Parse Corpus and Extract Subject-Verb-Object Triplets

python3 parse_corpus_and_extract_svo.py \
    --is_sentence 1 \
    --input_file ./covid19/corpus.txt \
    --save_path ./covid19/corpus_parsed_svo.pk

Select Salient Verb Lemmas and Object Heads

python3 select_salient_terms.py \
    --corpus_w_svo_pickle ./covid19/corpus_parsed_svo.pk \
    --min_verb_freq 3 \
    --min_obj_freq 3
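
The two `--min_*_freq` flags suggest a frequency cutoff on verb lemmas and object heads. The sketch below illustrates only that cutoff, under the assumption that the parsed corpus can be viewed as (subject, verb lemma, object head) tuples. The function name and data layout are hypothetical, and the actual script may apply further salience scoring (for example, with the background corpus statistics under ./resources/).

```python
from collections import Counter


def select_frequent_terms(svo_triplets, min_verb_freq=3, min_obj_freq=3):
    """Keep verb lemmas and object heads that meet a minimum frequency.

    `svo_triplets` is assumed to be an iterable of
    (subject, verb_lemma, object_head) tuples; the real pickle layout may differ.
    """
    triplets = list(svo_triplets)
    verb_counts = Counter(verb for _, verb, _ in triplets)
    # Triplets without an object contribute to verb counts only.
    obj_counts = Counter(obj for _, _, obj in triplets if obj is not None)
    verbs = {v for v, c in verb_counts.items() if c >= min_verb_freq}
    objs = {o for o, c in obj_counts.items() if c >= min_obj_freq}
    return verbs, objs
```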

Generate Features for Each Salient <Predicate Lemma, Object Head> Mention

python3 generate_po_mention_features.py \
    --corpus_w_svo_pickle ./covid19/corpus_parsed_svo.pk \
    --top_k 50 \
    --gpu_id 5

Disambiguate Predicate Senses

python3 disambiguate_verb_sense.py \
    --mention_file ./covid19/corpus_parsed_svo_salient_po_mention_features.pk \
    --save_path ./covid19/po_mention_disambiguated.pk

Generate Features for Each Salient <Predicate Sense, Object Head> Tuple

python3 generate_po_tuple_features.py \
    --mention_file ./covid19/corpus_parsed_svo_salient_po_mention_features.pk \
    --sense_mapping ./covid19/po_mention_disambiguated.pk \
    --save_file ./covid19/po_tuple_features_all_svos.pk \
    --use_all_svos

Latent Space Clustering

CUDA_VISIBLE_DEVICES=0 python3 latent_space_clustering.py \
    --dataset_path ./pandemic \
    --input_emb_name po_tuple_features_all_svos.pk

Running Baselines

First, follow the previous section to generate the features for each salient <Predicate Sense, Object Head> tuple.

Then use the following command (with the corresponding baseline code file) to run Kmeans, sp-Kmeans, AggClus, and JCSC. Note that the spherecluster package requires an older version of scikit-learn; we recommend version 0.20.0.

python ./baselines/baseline-{agglo/kmeans/spkmeans/jcsc}.py \
    --input ./covid19/po_tuple_features_all_svos.pk \
    --output kmeans_result.json \
    --k 30
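
The brace notation in the command above stands for "pick one of these four scripts". As an illustration, the sketch below expands it into one full command per baseline. The helper name is hypothetical, and the per-baseline output file names are an assumption modeled on `kmeans_result.json`.

```python
import shlex

# Baseline names come from the brace notation in the README;
# the "<name>_result.json" output naming is an assumption.
BASELINES = ("agglo", "kmeans", "spkmeans", "jcsc")


def baseline_commands(feature_file, k=30):
    """Build one argv list per baseline script."""
    return [
        [
            "python", f"./baselines/baseline-{name}.py",
            "--input", feature_file,
            "--output", f"{name}_result.json",
            "--k", str(k),
        ]
        for name in BASELINES
    ]


# Print the commands so they can be reviewed before running them.
for argv in baseline_commands("./covid19/po_tuple_features_all_svos.pk"):
    print(shlex.join(argv))
```

Each argv list can also be passed to `subprocess.run` directly instead of being printed.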

For Triframes, first follow the instructions in this link to install it and set up the environment, and put it under the root directory. Then run the following command:

python ./baselines/baseline-triframes.py \
    --input ./covid19/po_tuple_features_all_svos.pk \
    --output triframes_result.json \
    --N 100 \
    --min_size 100

Reference

If you find this repository useful, please consider citing our paper using the BibTeX entry below. Thanks.

@inproceedings{Shen2021ETypeClus,
  title={Corpus-based Open-Domain Event Type Induction},
  author={Jiaming Shen and Yunyi Zhang and Heng Ji and Jiawei Han},
  booktitle={EMNLP},
  year={2021}
}

etypeclus's Issues

How to connect the induced events with the sentences?

Distinguished authors,
How can I connect the induced events with their sentences, as in Tables 3 and 5 of your paper? I am a novice researcher in the humanities, and I want to run your code to categorize a new corpus. The results are as expected, but they are just clustered <p,o> pairs. How can I get the corresponding sentences?
Thanks.
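
One possible way to approach this question is to keep an index from each <p,o> pair back to the sentences it was extracted from, then look up every pair in a cluster. The sketch below is only an illustration: the function names and the (sentence index, predicate, object) mention layout are hypothetical and would need to be adapted to whatever the repository's pickle files actually contain.

```python
from collections import defaultdict


def build_pair_to_sentences(mentions, sentences):
    """Map each (predicate, object) pair to the sentences it came from.

    `mentions` is assumed to be an iterable of
    (sentence_index, predicate, object) tuples and `sentences` a list of
    raw sentences; both layouts are hypothetical.
    """
    index = defaultdict(list)
    for sent_idx, predicate, obj in mentions:
        sentence = sentences[sent_idx]
        # Avoid listing the same sentence twice for one pair.
        if sentence not in index[(predicate, obj)]:
            index[(predicate, obj)].append(sentence)
    return dict(index)


def sentences_for_cluster(cluster_pairs, pair_to_sentences):
    """Collect the sentences for every <p,o> pair in one induced cluster."""
    return {pair: pair_to_sentences.get(pair, []) for pair in cluster_pairs}
```

With such an index, each induced cluster (a list of <p,o> pairs) can be expanded into example sentences for manual inspection.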

KeyError in spaCy, and how do I run the code on the other two datasets?

  1. When I run the command "python3 parse_corpus_and_extract_svo.py --is_sentence 1 --input_file ./covid19/corpus.txt --save_path ./covid19/corpus_parsed_svo.pk", I see the following error:

print(tok, tok.pos_, tok.dep_)
File "token.pyx", line 864, in spacy.tokens.token.Token.pos_.get
KeyError: 84

It seems that spaCy fails to get the POS tag of the word "false" when parsing the sentence "Institut Pasteur warns against false information circulating on social media". Could you give some suggestions?

  2. How do I run the code on the ACE and ERE datasets?

Generating clustering results

Hi

I'm not sure whether I had a package version difference or something else, but to run the latent_space_clustering.py step I had to add "--input_dim1 716 --input_dim2 210" to the command, as the prior commands appear to have produced embeddings with a different size from what it expected by default. Could you let us know whether this was expected?

Note that the README also points to pandemic for that step, whereas all the preceding steps use covid19.

Thanks

Tony

Could you provide the code to evaluate the clustering results?

How can I get the results reported in Table 4 of the paper? The output of the command "CUDA_VISIBLE_DEVICES=0 python3 latent_space_clustering.py --dataset_path ./pandemic --input_emb_name po_tuple_features_all_svos.pk" is only a file recording the clustering results. Could you provide the code to evaluate them?
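
Until official evaluation code is available, a standard clustering metric can be computed by hand once gold labels exist for the clustered items. The sketch below implements normalized mutual information (NMI) with the arithmetic-mean normalization. It is not the authors' evaluation script, and which metrics and gold labels lie behind Table 4 should be checked against the paper.

```python
import math
from collections import Counter


def normalized_mutual_info(gold, pred):
    """NMI between two label sequences, normalized by the mean entropy."""
    if len(gold) != len(pred):
        raise ValueError("label sequences must have the same length")
    n = len(gold)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    joint_counts = Counter(zip(gold, pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(G; P) = sum over joint cells of p(g, p) * log(p(g, p) / (p(g) * p(p)))
    mutual_info = sum(
        (c / n) * math.log(c * n / (gold_counts[g] * pred_counts[p]))
        for (g, p), c in joint_counts.items()
    )
    denom = (entropy(gold_counts) + entropy(pred_counts)) / 2
    # Both partitions trivial (a single cluster each): treat as a perfect match.
    return 1.0 if denom == 0 else mutual_info / denom
```

The score does not depend on how clusters are numbered, so predicted cluster ids need not be aligned with gold type ids.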
