
Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective

Repository for the paper's source code and data
Paper · Data · Contact




Table of Contents
  1. Paper Abstract
  2. About This Repo
  3. Data
  4. Usage
  5. Contact
  6. Acknowledgments
  7. Paper Citation

Paper Abstract

Recent advancements in Named Entity Recognition (NER) have significantly improved the identification of entities in textual data. However, spoken NER, a specialized field of spoken document retrieval, lags behind due to its limited research and scarce datasets. Moreover, cross-lingual transfer learning in spoken NER has remained unexplored. This paper utilizes transfer learning across Dutch, English, and German using pipeline and End-to-End (E2E) schemes. We employ Wav2Vec2-XLS-R models on custom pseudo-annotated datasets and investigate several architectures for the adaptability of cross-lingual systems. Our results demonstrate that End-to-End spoken NER outperforms pipeline-based alternatives over our limited annotations. Notably, transfer learning from German to Dutch surpasses the Dutch E2E system by 7% and the Dutch pipeline system by 4%. This study not only underscores the feasibility of transfer learning in spoken NER but also sets promising outcomes for future evaluations, hinting at the need for comprehensive data collection to augment the results.

About This Repo

This repository includes the following:

  • The source code used to run the experiments in this work
  • Links to the datasets and a description of the data format
  • A usage guide to re-run the experiments

Data

  • For all experiments in this work, we rely on data from Common Voice Corpus v12.0 (2022-12-07) (https://commonvoice.mozilla.org/en/datasets).
  • For every language (English, German, Dutch), we apply the following pre-processing:
    • Remove all duplicated utterances.
    • Remove all utterances whose audio standard deviation is at or below a threshold of 0.001 (empty or nearly inaudible audio).
    • Remove all punctuation from the transcription; for English only, we keep the apostrophe (').
    • Some transcriptions mix languages, for example an English sentence containing Russian or Chinese names. Such names are converted to the Latin alphabet; this does not apply to the special German letters (ä, ö, ü, ß).
  • After pre-processing, we use our best NER model to generate pseudo-annotations from the transcriptions; all transcriptions are then converted to lowercase.
  • We save the output of these steps for each language in a JSON file with the following format:
          {
              "meta": {
                  "Contains meta information about the dataset"
              },
              "data": [
                  [
                      ["token_1","token_2","token_3",...,"token_n"],
                      ["label_1","label_2","label_3",...,"label_n"],
                      "Original Transcription",
                      "file_name.mp3",
                      "sampling_rate",
                      "audio_length_in_seconds"
                  ],
                  ...
              ]
          }
      
  • Our source code depends heavily on this format.
  • The JSON files are available at: https://doi.org/10.5281/zenodo.8104278
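The filtering rules and the JSON layout above can be sketched in Python. This is an illustrative sketch, not the repository's actual implementation; the helper names and the exact field handling are assumptions:

```python
import json
import statistics
import string

STD_THRESHOLD = 0.001  # filtering rule from the list above


def keep_audio(samples):
    """True if the waveform's standard deviation exceeds the threshold,
    i.e. the clip is not empty or nearly inaudible."""
    return statistics.pstdev(samples) > STD_THRESHOLD


def clean_transcription(text, language):
    """Remove punctuation from a transcription; for English, keep the apostrophe."""
    keep = "'" if language == "en" else ""
    drop = "".join(c for c in string.punctuation if c not in keep)
    return text.translate(str.maketrans("", "", drop))


def load_records(path):
    """Iterate over the records of a dataset file in the format shown above."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for tokens, labels, transcription, file_name, rate, length in dataset["data"]:
        yield {
            "tokens": tokens,
            "labels": labels,
            "transcription": transcription,
            "audio_file": file_name,
            "sampling_rate": rate,
            "audio_length": length,
        }
```

Note that each record stores tokens and labels as two parallel lists, so downstream code can rely on `len(tokens) == len(labels)`.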

Usage

Training

#!/bin/bash

DATA_BASE_PATH="/data/cv-corpus-12.0-2022-12-07"
MODEL="facebook/wav2vec2-xls-r-300m"
OUTPUT_BASE_PATH="/data/output"
LANGUAGE="en"

python run.py \
  --task "End2end Training ${LANGUAGE} XLS-R 300" \
  --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
  --vocab_path $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_vocab_with_tags.json" \
  --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
  --output_path $OUTPUT_BASE_PATH"/${LANGUAGE}/" \
  train \
    --pretrained_model $MODEL
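The training script expects a `*_vocab_with_tags.json` file. One way such a vocabulary might be built is sketched below; the tag tokens `[PER]`, `[LOC]`, `[ORG]`, `[MISC]`, the special tokens, and the helper name are all assumptions — the repository's actual tag set and token layout may differ:

```python
import json

# Hypothetical entity-tag tokens; the repository's actual set may differ.
TAG_TOKENS = ["[PER]", "[LOC]", "[ORG]", "[MISC]"]


def build_vocab_with_tags(transcriptions, path):
    """Build a character vocabulary plus entity-tag tokens, in the
    {token: id} layout that Wav2Vec2-style CTC tokenizers expect."""
    chars = sorted({c for text in transcriptions for c in text})
    if " " in chars:
        chars.remove(" ")
        chars.append("|")  # Wav2Vec2 convention: '|' marks word boundaries
    tokens = chars + TAG_TOKENS + ["[UNK]", "[PAD]"]
    vocab = {tok: i for i, tok in enumerate(tokens)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    return vocab
```

In an E2E spoken NER setup, adding the tag tokens to the CTC vocabulary is what lets the acoustic model emit entity boundaries directly in its transcription output.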

Transfer Learning

#!/bin/bash

DATA_BASE_PATH="/data/cv-corpus-12.0-2022-12-07"
MODEL="/data/output/de/my_best_model_checkpoint"   # checkpoint of the German source model
OUTPUT_BASE_PATH="/data/output/de"
PORTION=40        # This refers to 40% of target language train data

LANGUAGE="nl"

python run.py \
  --task "Transfer From German to NL XLS-R 300" \
  --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
  --vocab_path $DATA_BASE_PATH"/de/de_vocab_with_tags.json" \
  --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
  --output_path $OUTPUT_BASE_PATH"/transfer_to_${LANGUAGE}_p${PORTION}/" \
  transfer \
    --source_model $MODEL \
    --data_proportion ${PORTION} \
    --epochs 30
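The `--data_proportion` flag limits fine-tuning to a fraction of the target-language training data (40% here). A minimal sketch of how such subsampling could work; `sample_portion` is a hypothetical helper, not the project's code:

```python
import random


def sample_portion(records, portion_percent, seed=0):
    """Randomly sample portion_percent of the target-language training records.

    A fixed seed keeps the subset reproducible across runs, which matters
    when comparing transfer results at different data proportions.
    """
    records = list(records)
    k = max(1, round(len(records) * portion_percent / 100))
    rng = random.Random(seed)
    return rng.sample(records, k)
```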

Evaluation with and without a Language Model

#!/bin/bash

DATA_BASE_PATH="/data/cv-corpus-12.0-2022-12-07"
MODEL="/data/output/en/my_best_model_checkpoint"
OUTPUT_BASE_PATH="/data/output"
LANGUAGE="en"

python run.py \
 --task "End2end Evaluation ${LANGUAGE} XLS-R 300" \
 --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
 --vocab_path $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_vocab_with_tags.json" \
 --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
 --output_path $OUTPUT_BASE_PATH"/evaluation/${LANGUAGE}/no_lm" \
 --batch_size 8 \
 evaluate \
   --asr_model $MODEL

python run.py \
 --task "End2end Evaluation ${LANGUAGE} XLS-R 300 with LM" \
 --data_path $DATA_BASE_PATH"/${LANGUAGE}/cv_${LANGUAGE}_dataset.json" \
 --vocab_path $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_vocab_with_tags.json" \
 --clips_path $DATA_BASE_PATH"/${LANGUAGE}/clips" \
 --output_path $OUTPUT_BASE_PATH"/evaluation/${LANGUAGE}/lm" \
 --batch_size 8 \
 evaluate \
   --asr_model $MODEL \
   --language_model $DATA_BASE_PATH"/${LANGUAGE}/${LANGUAGE}_lm.arpa" \
   --lm_alpha 1.0 \
   --lm_beta 3.3 \
   --lm_beam_size 2000

Contact

If you have any questions or inquiries, feel free to send an email to Moncef or Tuğtekin.

Acknowledgments

This work was done during my time at Fraunhofer IAIS and is supported by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 957017, SELMA (https://selma-project.eu).

Paper Citation

@misc{benaicha2023exploring,
      title={Exploring Spoken Named Entity Recognition: A Cross-Lingual Perspective}, 
      author={Moncef Benaicha and David Thulke and M. A. Tuğtekin Turan},
      year={2023},
      eprint={2307.01310},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

