This repository holds the code for “Politics, BERTed: Automatic Attribution of Speech Events in German Parliamentary Debates”, submitted to the KONVENS 2023 Shared Task on Speaker Attribution, Task 1.
The accompanying paper can be found here: PDF (Full Proceedings).
The goal of the shared task is to automatically identify speech events in political debates and attribute them to their respective speakers, essentially identifying who says what to whom in parliamentary debates.
The task is divided into two subtasks:
- Task 1a is the full task: predicting both cue spans and their associated role spans.
- Task 1b is the role prediction task only, where gold cue spans are already given.
If you use the software and/or models, please consider citing the accompanying publication:
Ehrmanntraut, Anton. 2023. “Politics, BERTed: Automatic Attribution of Speech Events in German Parliamentary Debates.” In Proceedings of the GermEval 2023 Shared Task on Speaker Attribution in Newswire and Parliamentary Debates (SpkAtt-2023), edited by Ines Rehbein, Fynn Petersen-Frey, Annelen Brunner, Josef Ruppenhofer, Chris Biemann, and Simone Paolo Ponzetto, 22–30. Ingolstadt, Germany.
@inproceedings{ehrmanntraut_politics_2023,
location = {Ingolstadt, Germany},
title = {Politics, {BERTed}: Automatic Attribution of Speech Events in German Parliamentary Debates},
pages = {22--30},
booktitle = {Proceedings of the {GermEval} 2023 Shared Task on Speaker Attribution in Newswire and Parliamentary Debates ({SpkAtt}-2023)},
author = {Ehrmanntraut, Anton},
editor = {Rehbein, Ines and Petersen-Frey, Fynn and Brunner, Annelen and Ruppenhofer, Josef and Biemann, Chris and Ponzetto, Simone Paolo},
date = {2023-09-18}
}
| Used Base Model | SpkAtt-F1 (test set) | Match-F1 (dev set) | Download |
|---|---|---|---|
| aehrm/gepabert | 82.8 | 84.8 | Link |
| deepset/gbert-large | (not evaluated) | 84.4 | Link |
| deepset/gbert-base | (not evaluated) | 81.2 | Link |
The project uses poetry for dependency management. To install all dependencies, run:

```
poetry install
```

You can open a shell with all required Python packages and the interpreter via `poetry shell`. Alternatively, you can run scripts with the project's Python interpreter using `poetry run python <script.py>`.
Before inference, you either need to download the published models and place them into the `models/` folder, or train the models yourself (see below). After the `models/` folder has been populated, you can run the full inference (Task 1a) like this:
```
# e.g., download GePaBERT models
(cd models; wget https://github.com/aehrm/spkatt_gepade/releases/download/konvens/gepabert_models.tar; tar xf gepabert_models.tar;)

# adjust if needed
#export PEFT_MODEL_DIR=./models
poetry run ./predict.sh 1a input_dir [output_dir]
```
The `input_dir` should hold tokenized speeches as JSON files, like in the GePaDe test dataset (the one provided for the shared task).
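Since inference expects well-formed JSON input, an optional sanity check is to verify that every file in the input directory parses before running the pipeline. The following is a sketch; `./my_input_dir` is a placeholder path:

```shell
# Optional sanity check: report any file in the input directory that does
# not parse as JSON. The directory path below is a placeholder; adjust it.
input_dir=./my_input_dir
for f in "$input_dir"/*.json; do
    python -m json.tool "$f" > /dev/null 2>&1 || echo "invalid JSON: $f"
done
```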
E.g., to reproduce the results, run:

```
wget https://github.com/umanlp/SpkAtt-2023/archive/refs/heads/master.zip -O gepade.zip
unzip gepade.zip
poetry run ./predict.sh 1a SpkAtt-2023-master/data/dev/task1 [output_dir]
```
Alternatively, you can run subtask 1b (role prediction from gold cues) as follows, e.g., on the GePaDe dev dataset. Make sure the `input_dir` JSON files contain annotation objects with cue spans.

```
poetry run ./predict.sh 1b path/to/spkatt_data/dev/task1 [output_dir]
```
After downloading the full GePaDe dataset into the `data` folder, you can run the training like this:
```
# adjust if needed
#export BASE_MODEL_NAME=aehrm/gepabert
#export PEFT_MODEL_DIR=./models
#export TRAIN_FILES='./data/train/task1'
#export DEV_FILES='./data/dev/task1'
poetry run python ./train_cue_detector.py
poetry run python ./train_cue_joiner.py
poetry run python ./train_role_detector.py
```
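For example, to train all three components with `deepset/gbert-large` (one of the base models evaluated above) instead of the default, you could set the environment variables explicitly and run the scripts in sequence. This is a sketch; the output folder name is hypothetical:

```shell
# Sketch: train cue detector, cue joiner, and role detector in sequence
# with an alternative base model. The environment variables are the ones
# listed (commented out) above; PEFT_MODEL_DIR here is a hypothetical name.
export BASE_MODEL_NAME=deepset/gbert-large
export PEFT_MODEL_DIR=./models_gbert_large
export TRAIN_FILES='./data/train/task1'
export DEV_FILES='./data/dev/task1'

for script in train_cue_detector.py train_cue_joiner.py train_role_detector.py; do
    poetry run python "./$script"
done
```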