Hurtlex

HurtLex is a lexicon of offensive, aggressive, and hateful words in over 50 languages. The words are divided into 17 categories, plus a macro-category indicating whether a stereotype is involved. The 17 categories are:

Label Description
PS negative stereotypes ethnic slurs
RCI locations and demonyms
PA professions and occupations
DDF physical disabilities and diversity
DDP cognitive disabilities and diversity
DMC moral and behavioral defects
IS words related to social and economic disadvantage
OR plants
AN animals
ASM male genitalia
ASF female genitalia
PR words related to prostitution
OM words related to homosexuality
QAS words with potential negative connotations
CDS derogatory words
RE felonies and words related to crime and immoral behavior
SVP words related to the seven deadly sins of the Christian tradition

Hurtlex has a 2-level structure. Lemmas belong to one of these levels:

  • conservative: obtained by translating offensive senses of the words in the original lexicon.
  • inclusive: obtained by translating all the potentially relevant senses of the words in the original lexicon.
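As a minimal sketch of how the two levels might be used, the Python snippet below reads a tab-separated lexicon and keeps only the conservative entries. The column names (`id`, `pos`, `category`, `stereotype`, `lemma`, `level`), the `yes`/`no` stereotype values, and the sample rows are assumptions made for illustration; check them against the header of the file you actually download.

```python
import csv
import io

# Hypothetical sample in the assumed TSV layout; the real files may differ.
SAMPLE_TSV = (
    "id\tpos\tcategory\tstereotype\tlemma\tlevel\n"
    "EX0001\tn\tAN\tno\tlemma_a\tconservative\n"
    "EX0002\tn\tPS\tyes\tlemma_b\tinclusive\n"
    "EX0003\ta\tCDS\tno\tlemma_c\tconservative\n"
)


def load_lexicon(handle):
    """Read a tab-separated lexicon into a list of dicts, one per lemma."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_level(entries, level="conservative"):
    """Keep only the entries whose 'level' column equals the given level."""
    return [entry for entry in entries if entry["level"] == level]


entries = load_lexicon(io.StringIO(SAMPLE_TSV))
conservative = filter_by_level(entries)
print([entry["lemma"] for entry in conservative])
```

To read a real file instead of the sample, pass `open(path, encoding="utf-8", newline="")` to `load_lexicon`, where `path` points at the word list of your language.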

Lexica

The table below lists the available versions of the Hurtlex word lists for each language.

Language Available versions
AF Afrikaans 1.0 1.1 1.2
AR Arabic 1.0 1.1 1.2
BG Bulgarian 1.0 1.1 1.2
BN Bengali 1.0 1.1 1.2
CA Catalan 1.0 1.1 1.2
CS Czech 1.0 1.1 1.2
CY Welsh 1.0 1.1 1.2
DA Danish 1.0 1.1 1.2
DE German 1.0 1.1 1.2
EL Greek 1.0 1.1 1.2
EN English 1.0 1.1 1.2
EO Esperanto 1.0 1.1 1.2
ES Spanish 1.0 1.1 1.2
ET Estonian 1.0 1.1 1.2
EU Basque 1.0 1.1 1.2
FA Persian 1.0 1.1 1.2
FI Finnish 1.0 1.1 1.2
FR French 1.0 1.1 1.2
GA Irish 1.0 1.1 1.2
GL Galician 1.0 1.1 1.2
HE Hebrew 1.0 1.1 1.2
HI Hindi 1.0 1.1 1.2
HR Croatian 1.0 1.1 1.2
HU Hungarian 1.0 1.1 1.2
ID Indonesian 1.0 1.1 1.2
IS Icelandic 1.0 1.1 1.2
IT Italian 1.0 1.1 1.2
JA Japanese 1.0 1.1 1.2
KO Korean 1.0 1.1 1.2
LT Lithuanian 1.0 1.1 1.2
LV Latvian 1.0 1.1 1.2
MK Macedonian 1.0 1.1 1.2
MS Malay 1.0 1.1 1.2
MT Maltese 1.0 1.1 1.2
NL Dutch 1.0 1.1 1.2
NO Norwegian 1.0 1.1 1.2
PL Polish 1.0 1.1 1.2
PT Portuguese 1.0 1.1 1.2
RO Romanian 1.0 1.1 1.2
RU Russian 1.0 1.1 1.2
SIMPLE Simple English 1.0 1.1 1.2
SK Slovak 1.0 1.1 1.2
SL Slovenian 1.0 1.1 1.2
SQ Albanian 1.0 1.1 1.2
SR Serbian 1.0 1.1 1.2
SV Swedish 1.0 1.1 1.2
SW Swahili 1.0 1.1 1.2
TH Thai 1.0 1.1 1.2
TL Tagalog 1.0 1.1 1.2
TR Turkish 1.0 1.1 1.2
UK Ukrainian 1.0 1.1 1.2
VI Vietnamese 1.0 1.1 1.2
ZH Chinese 1.0 1.1 1.2

New in version 1.2: a table with the alignment between lemmas across languages is available.

Publications

Reference

Hurtlex is described in this paper:

Elisa Bassignana, Valerio Basile, Viviana Patti. Hurtlex: A Multilingual Lexicon of Words to Hurt. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-It 2018)

http://ceur-ws.org/Vol-2253/paper49.pdf

Further Publications

Vivian Stamou, Iakovi Alexiou, Antigone Klimi, Eleftheria Molou, Alexandra Saivanidou, Stella Markantonatou. Cleansing & expanding the HURTLEX (el) with a multidimensional categorization of offensive words. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH 2022)

https://aclanthology.org/2022.woah-1.10.pdf

Revised Hurtlex (IT)

The Revised HurtLex is a lexicon in which every headword is annotated with an offensiveness score. Focusing on the Italian entries, we revised the terms in HurtLex and derived an offensiveness score for each lexical item by applying an Item Response Theory model to the ratings provided by a large number of annotators.

Revised Hurtlex is described in this paper:

Alice Tontodimamma, Lara Fontanella, Stefano Anzani & Valerio Basile. An Italian lexical resource for incivility detection in online discourses. In Quality & Quantity (2022).

https://link.springer.com/article/10.1007/s11135-022-01494-7

Contribute

Contributions are welcome, in the form of revised lexica. Everyone who is a native speaker of a language is invited to fork the repository and file a pull request.

Please try to limit your modifications to the following operations:

  • add: add a new item to a lexicon, by creating a new line. Fill in all the column values, including category and stereotype, set level="conservative", and add a new unique ID for the lemma.
  • remove: remove an item considered wrong for a lexicon, by removing the corresponding line.
  • update: change the lemma or the category of an item, e.g. because of a misspelling.
  • add offensiveness score: create a new column with a real value between 0 and 1 to indicate a score for the offensiveness of an item in a lexicon.
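Before filing a pull request, it may help to check new or updated rows against the rules above. The following Python sketch is one possible way to do that, not an official tool of the project: the column names and the `score` column name are assumptions, and the category and level values are taken from the lists in this README.

```python
# The 17 category labels and the two levels listed in this README.
CATEGORIES = {
    "PS", "RCI", "PA", "DDF", "DDP", "DMC", "IS", "OR", "AN",
    "ASM", "ASF", "PR", "OM", "QAS", "CDS", "RE", "SVP",
}
LEVELS = {"conservative", "inclusive"}

# Assumed column names; adjust them to the header of the real file.
REQUIRED_COLUMNS = ("id", "category", "stereotype", "lemma", "level")


def validate_entry(entry, score_column="score"):
    """Return a list of problems found in one lexicon row (a dict)."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if not entry.get(column):
            problems.append(f"missing value for '{column}'")
    category = entry.get("category")
    if category and category not in CATEGORIES:
        problems.append(f"unknown category '{category}'")
    level = entry.get("level")
    if level and level not in LEVELS:
        problems.append(f"unknown level '{level}'")
    # The offensiveness score is optional, but must lie in [0, 1] if present.
    raw_score = entry.get(score_column)
    if raw_score not in (None, ""):
        try:
            score = float(raw_score)
        except ValueError:
            problems.append(f"score '{raw_score}' is not a number")
        else:
            if not 0.0 <= score <= 1.0:
                problems.append(f"score {score} is outside [0, 1]")
    return problems


# Hypothetical new row, as it might be added by a contributor.
new_entry = {
    "id": "EX0004", "pos": "n", "category": "AN", "stereotype": "no",
    "lemma": "lemma_d", "level": "conservative", "score": "0.4",
}
print(validate_entry(new_entry))
```

The sketch deliberately does not enforce ID uniqueness across rows, because the same ID can legitimately be shared by spellings of one lemma in different scripts.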

Frequent issues:

  • Some languages are written in more than one script (e.g. Hindi, Bangla, Bulgarian, Russian): in these cases it is good practice to harmonize the lexicon by adding the missing spelling and keeping the same ID for the same lemma written in different scripts.
  • Some lexicons contain inflected forms instead of lemmas. These are mistakes introduced by the automatic processing. It is safe to remove such words if the corresponding lemma is already in the lexicon, or to modify them if it is not.

Please create a new version directory for the lexicon you submit. If yours is the first manually corrected version of a lexicon (that is, the last version is 1.*), please create the directory for version 2.0. Otherwise, proceed incrementally (2.0 -> 2.1, 2.1 -> 2.2, ...).
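The versioning rule can be sketched in a few lines of Python. The function name `next_version` is hypothetical, and the sketch assumes version strings of the form `major.minor`; what should follow a version such as 2.9 is not specified here, so the sketch simply increments the minor number.

```python
def next_version(latest):
    """Return the version directory to create, given the latest existing one."""
    major_text, _, minor_text = latest.partition(".")
    major, minor = int(major_text), int(minor_text)
    if major < 2:
        # No manually corrected version exists yet: start the 2.x series.
        return "2.0"
    # Otherwise proceed incrementally within the current major version.
    return f"{major}.{minor + 1}"


for latest in ("1.2", "2.0", "2.1"):
    print(latest, "->", next_version(latest))
```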

Finally, do not forget to add a README.md file in your newly created directory, indicating what has changed, and your contact details for due credit.

LICENSE

Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:

  • You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
  • No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

https://creativecommons.org/licenses/by-nc-sa/4.0/
