osdg-ai / osdg-data

The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, validated by OSDG Community Platform (OSDG-CP) citizen scientists with respect to the Sustainable Development Goals (SDGs). The dataset is updated every quarter and published on Zenodo.

License: GNU General Public License v3.0

citizen-science crowdsourcing dataset machine-learning sdgs sustainability sustainable-development-goals citsci digital-public-goods open-data public-good public-goods sdg sdg-data united-nations

osdg-data's Introduction

Dataset Information

The OSDG Community Dataset (OSDG-CD) is the direct result of the work of hundreds of volunteers who have contributed to our understanding of Sustainable Development Goals (SDGs) via the OSDG Community Platform (OSDG-CP). It contains thousands of text excerpts which were labelled by the community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches. The OSDG Community Dataset will be updated on a quarterly basis.

Please note that all versions of the dataset are hosted on Zenodo. This repository is only intended to provide examples of how the dataset can be used in practice. You can access different versions of the dataset using the DOI handles below. The Most Recent handle always resolves to the latest version.

Version | DOI Handle
Most Recent | DOI
Version 2023.10 | DOI
Version 2023.07 | DOI
Version 2023.04 | DOI
Version 2023.01 | DOI
Version 2022.10 | DOI
Version 2022.07 | DOI
Version 2022.04 | DOI
Version 2022.01 | DOI
Version 2021.09 | DOI

Methodology

The OSDG Community Platform is an ambitious attempt to bring together volunteers and subject matter experts from all around the world to create a large and accurate source of textual information on SDGs. It uses publicly available texts such as publications, reports and other written data sources. Each text is broken down into paragraph-length pieces, which are then labelled by the Community volunteers. The texts we collect come with suggested labels, which usually originate from the data source and do not necessarily reflect the content of a particular paragraph. Each volunteer is therefore presented with a single simple question: is the suggested label indeed relevant to the short text at hand? Texts are labelled by multiple volunteers to ensure a high degree of quality.

Documentation

The OSDG-CD dataset is provided in .csv format on Zenodo. It is a flat tabular dataset that contains the following columns:

  • doi - Digital Object Identifier of the original document;
  • text_id - unique text identifier;
  • text - text excerpt from the document;
  • sdg - the SDG the text is validated against;
  • labels_negative - the number of volunteers who rejected the suggested SDG label;
  • labels_positive - the number of volunteers who accepted the suggested SDG label;
  • agreement - agreement score based on the formula $\text{agreement} = \frac{|labels_{positive} - labels_{negative}|}{labels_{positive} + labels_{negative}}$.
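
As a rough illustration of the columns above, the following sketch reads a couple of rows and recomputes the agreement score from the vote counts. The sample rows, the DOIs and text identifiers in them, the helper names `agreement` and `read_rows`, and the tab delimiter are all assumptions made for this example; check the delimiter of the file you download from Zenodo and adjust it if needed. Returning 0.0 for a text with no votes is also a choice made here, not something the dataset documentation specifies.

```python
import csv
import io

# Hypothetical sample rows for illustration only; real rows come from the
# file downloaded from Zenodo. The tab delimiter is an assumption.
SAMPLE = (
    "doi\ttext_id\ttext\tsdg\tlabels_negative\tlabels_positive\tagreement\n"
    "10.0000/example.1\tid-001\tAccess to clean water remains limited.\t6\t1\t8\t0.777778\n"
    "10.0000/example.2\tid-002\tThe report discusses trade tariffs.\t5\t3\t3\t0.0\n"
)


def agreement(labels_positive: int, labels_negative: int) -> float:
    """Return |positive - negative| / (positive + negative).

    Returns 0.0 when there are no votes (an assumption of this sketch).
    """
    total = labels_positive + labels_negative
    if total == 0:
        return 0.0
    return abs(labels_positive - labels_negative) / total


def read_rows(handle, delimiter="\t"):
    """Parse rows into dicts and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["sdg"] = int(row["sdg"])
        row["labels_negative"] = int(row["labels_negative"])
        row["labels_positive"] = int(row["labels_positive"])
        row["agreement"] = float(row["agreement"])
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    recomputed = agreement(row["labels_positive"], row["labels_negative"])
    # Print the stored score next to the recomputed one for comparison.
    print(row["text_id"], row["agreement"], round(recomputed, 6))
```

To read a downloaded file instead of the in-memory sample, pass an open file handle to `read_rows`, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the file.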

Relevant Papers

Pukelis, L., Bautista-Puig, N., Statulevičiūtė, G., Stančiauskas, V., Dikmener, G., & Akylbekova, D. (2022, November 21). OSDG 2.0: A multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs). arXiv.org. https://doi.org/10.48550/arXiv.2211.11252

Pukelis, L., Puig, N. B., Skrynik, M., & Stanciauskas, V. (2020, May 29). OSDG -- Open-source approach to classify text data by UN Sustainable Development Goals (SDGs). arXiv.org. https://arxiv.org/abs/2005.14569

Usage Examples

Examples of text classification using OSDG-CD can be found under the examples directory.
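
Before training a classifier, it can help to keep only the texts on which volunteers clearly agreed. The sketch below is one possible way to do that and is not taken from the examples directory: the function name `select_training_pairs`, the thresholds `min_agreement` and `min_votes`, and the sample rows are all hypothetical. It keeps a text only when enough volunteers voted, the agreement score is high enough, and most volunteers accepted the suggested SDG.

```python
from typing import Dict, Iterable, List, Tuple


def select_training_pairs(
    rows: Iterable[Dict],
    min_agreement: float = 0.6,  # hypothetical threshold, tune for your use case
    min_votes: int = 3,          # hypothetical threshold, tune for your use case
) -> List[Tuple[str, int]]:
    """Return (text, sdg) pairs for texts whose suggested SDG was clearly accepted."""
    pairs = []
    for row in rows:
        positive = int(row["labels_positive"])
        negative = int(row["labels_negative"])
        if positive + negative < min_votes:
            continue  # too few volunteers to trust the label
        if float(row["agreement"]) < min_agreement:
            continue  # volunteers were split
        if positive <= negative:
            continue  # high agreement, but on rejecting the suggested SDG
        pairs.append((row["text"], int(row["sdg"])))
    return pairs


# Hypothetical rows using the column names documented above.
sample_rows = [
    {"text": "Access to clean water remains limited.", "sdg": 6,
     "labels_negative": 1, "labels_positive": 8, "agreement": 0.777778},
    {"text": "The report discusses trade tariffs.", "sdg": 5,
     "labels_negative": 3, "labels_positive": 3, "agreement": 0.0},
    {"text": "Quarterly earnings rose sharply.", "sdg": 2,
     "labels_negative": 5, "labels_positive": 0, "agreement": 1.0},
    {"text": "Schools reopened after the floods.", "sdg": 4,
     "labels_negative": 0, "labels_positive": 2, "agreement": 1.0},
]

training_pairs = select_training_pairs(sample_rows)
print(training_pairs)
```

Note that an agreement score of 1.0 can also mean that every volunteer rejected the suggested label, which is why the sketch checks the direction of the majority as well as the score.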

Share Your Work

The OSDG Community Dataset (OSDG-CD) is made available for research purposes. We are making the data open in the hope of enabling researchers to discover new insights into, and meaningful connections among, the Sustainable Development Goals.

We would like to know what you discover in the data. So do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. Send us an email at [email protected].

If you are using the dataset in a research paper, please cite the original version as follows:

OSDG, UNDP IICPSD SDG AI Lab, & PPMI. (2021). OSDG Community Dataset (OSDG-CD) (2021.09) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5550238.

To cite a specific version, use the template provided on Zenodo.

Contribute to OSDG

This dataset is made possible because of a large community effort. We would be glad to see your contribution to the project too. You can join our Community Platform to help us collect more labelled data. If you have a more technical background, you can also contribute to the OSDG Labelling Tool here. If you want to contribute to the project in some other way, do let us know via this contact form.

To learn more about the OSDG project, visit osdg.ai.

osdg-data's Issues

Recently added SDG 16 labels dominate dataset

Dear all,

I've noticed that now (since a couple of versions) there are examples labeled with the SDG 16 label. These examples now comprise 18% of the entire dataset.

Is this expected?

Only 15 out of 17 SDGs labeled?

Hi everyone!

Glad to see this work moving forward! Is there a reason why the dataset contains only labels for the first 15 SDGs?
