This project forked from ltgoslo/talk-of-norway

talk-of-norway

This repository makes available the v1.0.1 release of the Talk of Norway (TON) dataset, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. Every speech is richly annotated with metadata pulled from different sources, and augmented with sentence, token, lemma, part-of-speech and morphological feature annotations.

This work is inspired by the Talk of Europe CLARIN campus, and aims primarily at facilitating experimentation at the crossroads between quantitative Political Science and Natural Language Processing. The dataset is currently the core object of study of an interdisciplinary project involving the departments of Political Science and Informatics of the University of Oslo.

For more information on the Talk of Norway project and its participants, please see the UiO project pages at https://www.mn.uio.no/ifi/english/research/projects/ton/index.html

Dataset v1.0.1

The data is split into two main parts: the ./data/ton.csv file, which contains the metadata (see Data.md for a description of the available variables) along with the raw text of the speeches, and the ./data/annotations/ folder, which contains the linguistic annotations of the speeches. Each annotation file is linked to its metadata row in the csv file by its file name, which is the same as the id variable.
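As a rough illustration of that link, the sketch below pairs each metadata row with its annotation file. It assumes that ton.csv is comma-separated with a header row containing an id column, and that an annotation file is named either exactly after the id or after the id plus an extension; neither detail is confirmed here, so check Data.md and the unpacked archive first. The function names annotation_path and iter_speeches are made up for this example.

```python
import csv
import glob
from pathlib import Path


def annotation_path(speech_id, annotations_dir="data/annotations"):
    """Return the annotation file for a speech id, or None if none is found."""
    directory = Path(annotations_dir)
    exact = directory / speech_id
    if exact.is_file():
        return exact
    # Assumption: the file may carry an extension after the id.
    pattern = glob.escape(speech_id) + ".*"
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    return matches[0] if matches else None


def iter_speeches(csv_path="data/ton.csv", annotations_dir="data/annotations"):
    """Yield (metadata_row, annotation_path) pairs, one per row of the csv."""
    # Assumption: comma-separated UTF-8 file with a header row and an "id" column.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row, annotation_path(row["id"], annotations_dir)
```

Iterating over iter_speeches() from the top-level directory of the repository would then give access to both the metadata and the annotation file of every speech.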

The linguistic annotations themselves loosely follow the CoNLL format, with newline-separated tokens and double newline-separated sentences. Every line contains tab-separated token-level annotations, following this pattern:

index token lemma part-of-speech features

For instance:

1    Ærede                ære                adj      fl|<perf-part>|tr1
2    medrepresentanter    medrepresentant    subst    appell|mask|ub|fl
3    !                    $!                 clb      <<<|<utrop>|<<<

Note that the morphological features in the fifth column are themselves separated with the pipe (|) character.
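A minimal sketch of a reader for this format is shown below. It relies only on what is described above: one token per line, five tab-separated columns, pipe-separated features, and a blank line between sentences. The Token type and the parse_annotations function are illustrative names, not part of the repository, and lines with fewer than five columns are padded rather than rejected, which is a choice made for this example.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int
    form: str
    lemma: str
    pos: str
    features: List[str]


def parse_annotations(text: str) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for raw_line in text.split("\n"):
        line = raw_line.rstrip("\r")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        fields += [""] * (5 - len(fields))  # pad short lines (assumption)
        index, form, lemma, pos, features = fields[:5]
        sentence.append(
            Token(int(index), form, lemma, pos,
                  features.split("|") if features else [])
        )
    if sentence:
        yield sentence


# The three example tokens from above, written with real tab characters.
SAMPLE = (
    "1\tÆrede\tære\tadj\tfl|<perf-part>|tr1\n"
    "2\tmedrepresentanter\tmedrepresentant\tsubst\tappell|mask|ub|fl\n"
    "3\t!\t$!\tclb\t<<<|<utrop>|<<<\n"
)

for sent in parse_annotations(SAMPLE):
    print([(tok.form, tok.pos) for tok in sent])
```

To read an annotation file, pass its decoded contents to parse_annotations; the file encoding is not stated here, so UTF-8 would be an assumption.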

Sources

Linguistic annotations are automatically obtained using langid.py for language identification and the Oslo-Bergen tagger for morphological analysis as implemented in the Language Analysis Portal (LAP).

Metadata was pulled from several sources, utilizing a dump of the holder-de-ord database as a starting point and adding further information from the Storting API, scraping the Storting web pages and integrating data from Søyland (forthcoming). See Data.md for more information on the variables.

Get the data

You can download the data from http://ltr.uio.no/ton/ton.data.101.tgz. The recommended way to stay up to date with this repository is to clone it and unpack the downloaded archive to its top-level directory.

On most UNIX systems, you can type the following in your terminal:

git clone https://github.com/ltgoslo/talk-of-norway
cd talk-of-norway
wget http://ltr.uio.no/ton/ton.data.101.tgz
tar -xzf ton.data.101.tgz
rm ton.data.101.tgz

How to cite

Publications connected to this dataset are forthcoming. For the time being, please use the following BibTeX entry to cite this work:

@online{Lap:Soy:16,
  author = {Lapponi, Emanuele and S{\o}yland, Martin G.},
  title = {Talk of Norway},
  year = 2016,
  url = {https://github.com/ltgoslo/talk-of-norway},
  urldate = {2016-10-29}
}

License

Norwegian License for Open Government Data (NLOD)
