wikipedia-codeswitching-data

Wikipedia talk page code-switching features and editor success scores data. Accompanying paper:

Michael Miller Yoder, Shruti Rijhwani, Carolyn Penstein Rose, Lori Levin. (2017). Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages. Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science (NLP+CSS).

Contact [email protected] for scripts for dataset construction or other questions.

Data columns

Columns in arabic_talk_cs_features.csv are explained here. The above paper has more detail on these features.

  • article: the name of the Wikipedia article from which the talk page conversations are drawn and from which editor scores are calculated.
  • thread_title: the name of the talk page discussion thread
  • editor_anonym: an anonymized ID identifying the editor who made the talk contribution
  • editor_talk: that editor's contributions to the discussion thread, concatenated
  • other_talk: other editors' contributions to the discussion thread, concatenated
  • #editor_turns: the number of discussion turns contributed by that editor
  • #other_turns: the number of discussion turns contributed by other editors
  • editor_score: a measure of that editor's success in editing the article page. This is the proportion of modifications the editor made that are still present 24 hours after the last contribution in the discussion thread. Details can be found in the accompanying paper, where this measure is the outcome to be predicted from features in the talk.
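As a minimal sketch of how these columns might be read, the snippet below parses CSV text with Python's standard library and converts the turn counts and the score to numbers. The two rows are made up for illustration and are not taken from the dataset. The comma delimiter, the numeric formats, and the helper name `load_rows` are assumptions rather than anything documented for the file.

```python
import csv
import io

# Two made-up rows, for illustration only; they are not taken from the dataset.
SAMPLE_CSV = """article,thread_title,editor_anonym,#editor_turns,#other_turns,editor_score
Example article,Example thread,1,2,3,0.75
Example article,Example thread,2,1,4,0.5
"""


def load_rows(handle):
    """Read rows and convert the count and score columns to numbers."""
    rows = []
    for row in csv.DictReader(handle):
        # Column names beginning with '#' are ordinary dictionary keys here.
        row["#editor_turns"] = int(row["#editor_turns"])
        row["#other_turns"] = int(row["#other_turns"])
        row["editor_score"] = float(row["editor_score"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
mean_score = sum(r["editor_score"] for r in rows) / len(rows)
print(mean_score)
```

To try this on the real file, the in-memory sample could be swapped for `open("arabic_talk_cs_features.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.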

Code-switching features

  • latin_cs: binary variable indicating whether the editor talk contains code-switching to languages with Latin script.
  • other_latin_cs: binary variable indicating whether the talk from other editors in the thread contains code-switching to languages with Latin script.
  • editor_prop_latin: proportion of tokens in the editor talk that contain Latin script.
  • other_prop_latin: proportion of tokens in the other editors' talk that contain Latin script.
  • editor_prop_switches: proportion of word boundaries in the editor text which switch between languages in Arabic and Latin script.
  • other_prop_switches: proportion of word boundaries in the others' text which switch between languages in Arabic and Latin script.
  • editor_talk_latin: extracted text from the editor in this thread that contains Latin script.
  • editor_talk_arabic: extracted text from the editor in this thread that contains Arabic script.
  • editor_talk_non_arabic: extracted text from the editor in this thread that contains anything other than tokens with Arabic script.
  • editor_two_quotes: binary variable indicating whether the editor talk in this thread contains at least 2 quote marks (and likely is quoting).
  • editor_latin_named_entities: whether the Latin-script text in the editor talk likely consists of named entities.
  • {language}_prop: proportion of the editor talk that is identified by a language identification system as being from that language.
  • {language}_cs: binary variable that is 'true' if there is a nonzero proportion of this language based on output from a language identification system.
  • categories: the categories that the Wikipedia article page belongs to.
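The two proportion features can be illustrated with a rough sketch. This is not the authors' implementation: it assumes whitespace tokenization, treats a token as Latin if any of its letters has a Unicode name starting with "LATIN", and counts a switch whenever two adjacent tokens differ between Arabic and Latin script. The function names `token_script`, `prop_latin`, and `prop_switches` are hypothetical, and the paper's actual tokenization and script detection may differ.

```python
import unicodedata


def token_script(token: str) -> str:
    """Classify a token as 'latin', 'arabic', or 'other' by its letters."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                scripts.add("latin")
            elif name.startswith("ARABIC"):
                scripts.add("arabic")
    # Assumption: any Latin letter makes the token count as Latin.
    if "latin" in scripts:
        return "latin"
    if "arabic" in scripts:
        return "arabic"
    return "other"


def prop_latin(text: str) -> float:
    """Proportion of whitespace-separated tokens that contain Latin script."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token_script(t) == "latin" for t in tokens) / len(tokens)


def prop_switches(text: str) -> float:
    """Proportion of word boundaries that switch between Arabic and Latin."""
    scripts = [token_script(t) for t in text.split()]
    boundaries = list(zip(scripts, scripts[1:]))
    if not boundaries:
        return 0.0
    switches = sum({a, b} == {"latin", "arabic"} for a, b in boundaries)
    return switches / len(boundaries)


# A made-up sentence mixing Arabic and Latin script, for illustration only.
sample = "هذا المقال عن Wikipedia و Google"
print(prop_latin(sample), prop_switches(sample))
```

Under these assumptions a binary flag in the spirit of latin_cs would simply be `prop_latin(text) > 0`, although the dataset's own definition may apply further filtering.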
