
climate-negotiations's People

Contributors

pauloborges, rufuspollock

climate-negotiations's Issues

Database of Events

Create a database of events as a CSV document in /data/events.csv

I imagine this can come directly from: https://github.com/medialab/climateDebateExplorer/blob/master/ENB-data/metadata_overview/events_metadata.json

Fields:

id:       # fine for this to be meaningful (e.g. durban-2011) rather than something like 001, but up to you
title:    # e.g. Durban Climate Change Conference - COP17/CMP7
country:
city:     # e.g. Durban - question: do we need country and city separate, or could we combine them into place?
date:
year:     # e.g. 2011

Note:

  • date could be an actual (full) date, or we could even have separate start and end dates.
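
A minimal sketch of writing such a file, assuming the events have already been loaded as a list of dictionaries keyed by the field names above. The function name `write_events_csv` and the sample record are illustrative assumptions; the layout of events_metadata.json is not described here, so mapping its keys onto these fields is left out.

```python
import csv
import io

# Column order taken from the field list in this issue.
FIELDS = ["id", "title", "country", "city", "date", "year"]


def write_events_csv(events, out):
    """Write event dicts to the file-like object `out` as CSV.

    Keys outside FIELDS are ignored; missing fields become empty cells.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for event in events:
        writer.writerow(event)


# Illustrative record only; the id and values are not taken from the real metadata.
sample = [
    {
        "id": "durban-cop17",
        "title": "Durban Climate Change Conference - COP17/CMP7",
        "city": "Durban",
    }
]
buf = io.StringIO()
write_events_csv(sample, buf)
csv_text = buf.getvalue()
```

To write the real file, the same function could be called with `open("data/events.csv", "w", newline="")` in place of the in-memory buffer.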

Semi-Automated "Reporting Format" Tagging

  1. Manual tagging of the most frequently occurring titles.
    In the MACRO_SUBTYPE column of the tab “Corrected levels” of this Gdoc https://docs.google.com/spreadsheets/d/1BMgXAVYArvr_nOmysb31Ao-j0ShOBEjtuM9vGC6ZEjU/edit?usp=sharing we manually tagged:
    • all H0;
    • all H1 occurring at least twice (and some occurring just once);
    • all H2 occurring at least 6 times (and some occurring less).

Tommaso: I am pretty confident of this manual tagging

  2. Tagging propagation.
    Tags have been propagated downward using the following rule:
    when the format of the title of the section is “agenda”, “analysis”, “corridors”, or “history”, tag all sections nested below (at any level, not just the next one) with the same tag.

Tommaso: not sure that propagation has been done correctly

  3. Manual checking and tagging completion.

Tommaso: this has not been done yet, but should be
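
Since the propagation step is the one flagged as uncertain, here is a small sketch of the rule that could be used to re-check it. It assumes sections are available as a flat list of `(level, tag)` pairs in document order (0 for H0, 1 for H1, and so on) and that an inherited tag overwrites whatever the nested section already had. Both are assumptions, not a description of how the existing propagation was done.

```python
# Formats whose tag is pushed down to every nested section.
PROPAGATING = {"agenda", "analysis", "corridors", "history"}


def propagate_formats(sections):
    """Return the list of format tags after downward propagation.

    `sections` is a list of (level, tag) pairs in document order.
    """
    result = []
    active = None  # (level, tag) of the propagating ancestor currently in force
    for level, tag in sections:
        if active is not None and level <= active[0]:
            active = None  # we have left that ancestor's subtree
        if active is not None:
            tag = active[1]  # nested at any depth: inherit the ancestor's tag
        elif tag in PROPAGATING:
            active = (level, tag)
        result.append(tag)
    return result


example = [
    (0, "plenary"),
    (1, "analysis"),
    (2, "blank"),
    (3, "groups"),
    (1, "groups"),
    (2, "blank"),
]
propagated = propagate_formats(example)
```

In this example the two sections nested under the "analysis" H1 both become "analysis", while the sections after the next H1 are left untouched.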

Key Concepts and Data Model

Data Model

  • Actor
    • countries (e.g. France, Saudi Arabia, China - we have a dictionary for this)
    • negotiating groups (e.g. G77 and China, AOSIS - we have a dictionary for this)
  • Document - the ENB published documents
    • Date, Title
    • => event (each document is only ever associated with one event)
    • URL on the ENB website
  • Event
    • Date (generally the month and the year are enough to distinguish an event)
    • Place (the city where the meeting took place)
  • Sections - essentially correspond to the meeting of a particular track during an event (e.g. the meeting of SBI in Bonn in June 2014) or to a particular reporting bit on an event (e.g. the analysis of one day or the feelings in the corridors)
    • Tracks (generally 1, but there may be joint meetings such as SBSTA + SBI)
    • Reporting Format (1)
    • Document - note we may inline certain document and document->event properties e.g.
      • Date
      • Place (of meeting)
  • Text chunks - essentially correspond to a particular discussion during a session at an event
    • Section (1)
    • Actors (0 ≤ actors ≤ as many as mentioned)
    • Topics (0 ≤ topics ≤ 3 most prominent)

Classifications:

  • Type - aka Negotiation Track
  • Subtype - aka Reporting Formats
  • Topics (e.g. Forest, Mitigation, Adaptation)
For reference, the existing CSV "database" on the climatenegotiations.org site has these columns:
id,title,actors,countries,topics,url,event_id,track,format,date,year,fulltext

Enumerations

Type (negotiation tracks)

  • COP
  • IPCC
  • SBI
  • SBI/SBSTA (these are joint meetings; it may be more sensible to tag them with both SBI and SBSTA)
  • SBSTA
  • Special workshop
  • Working groups
    • BUT it may be better to separate the different working groups
  • undeducible
  • blank

Subtypes (reporting formats)

  • Agenda
  • Analysis
  • Corridors
  • History
  • Groups (contact groups, informal groups, non-groups...)
  • Plenary
  • undeducible
  • blank

Semi-Automated "Negotiation Track" Tagging

  1. Manual tagging of the most frequently occurring titles.
    In the MACRO_TYPE column of the tab “Corrected levels” of this Gdoc https://docs.google.com/spreadsheets/d/1BMgXAVYArvr_nOmysb31Ao-j0ShOBEjtuM9vGC6ZEjU/edit?usp=sharing we manually tagged:
    • all H0;
    • all H1 occurring at least twice (and some occurring just once);
    • all H2 occurring at least 6 times (and some occurring less).

Tommaso: I am pretty confident of this manual tagging

  2. Tagging propagation.
    Tags have been propagated downward using the following rule:
    when the track of the title of the section is “undeducible” or “blank” (not tagged), take the tag available at the closest higher level.

Tommaso: not sure that propagation has been done correctly

  3. Manual checking and tagging completion.

Tommaso: This has not been done yet, but should be
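
The track rule runs in the opposite direction from the format rule: an untagged section looks upward for a tag. A rough sketch for re-checking it, again assuming a flat list of `(level, track)` pairs in document order, and reading "closest higher level" as the nearest ancestor that ends up with a tag:

```python
# Values that count as "not tagged" for the purposes of inheritance.
UNTAGGED = {"undeducible", "blank", ""}


def inherit_tracks(sections):
    """Return the list of track tags after filling gaps from ancestors.

    `sections` is a list of (level, track) pairs in document order.
    """
    resolved = []
    stack = []  # (level, resolved_track) for the current chain of ancestors
    for level, track in sections:
        # Drop entries that are siblings or deeper: they are not ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if track in UNTAGGED and stack:
            # The parent's entry already holds whatever it inherited itself.
            track = stack[-1][1]
        stack.append((level, track))
        resolved.append(track)
    return resolved


example = [
    (0, "COP"),
    (1, "blank"),
    (2, "SBI"),
    (3, "undeducible"),
    (1, "undeducible"),
]
tracks = inherit_tracks(example)
```

A section with no tagged ancestor keeps its "undeducible" or "blank" value, which makes the remaining gaps easy to list for the manual checking step.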

First pass on storing the scraped raw text

As a first pass I'd suggest we store the raw text in a decent form in this repo.

SciPo already have semi-structured raw text based on a scrape of ENB (perhaps with some corrections?).

I suggest we do not store this SciPo text as-is, but transform it a bit into nice markdown and then store that.

Why?

  • More standardized markup so easier for others to edit
  • We get nice html pages for free when we jekyll-ize

Document Structure


---
title:
id:    # e.g. 1205000e where file was 1205000e.txt
abstract: 
date:
url:     # source url of ENB from which text came

---

text goes here in markdown form

I suggest we therefore get rid of the odd quasi-HTML structure shown below (where is this from?) and replace it with markdown:

  ::H1::§ § WORKING GROUP I§
  ::BODY::§ § Working Group I
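
A rough sketch of what that conversion might look like, guessing at the format from the two sample lines above: each line starts with a `::TAG::` marker, and `§` is treated as a separator to be stripped. The heading mapping (TITLE to `#`, H1 to `##`, and so on) is an assumption. Note that if `§` actually encodes newlines, this version throws away information that may matter later for chunking.

```python
import re

# Assumed mapping from source tags to markdown heading prefixes.
HEADING_PREFIX = {"TITLE": "# ", "H1": "## ", "H2": "### ", "H3": "#### "}
LINE_RE = re.compile(r"^\s*::(\w+)::(.*)$")


def to_markdown(raw):
    """Convert '::TAG::text' lines into markdown blocks separated by blank lines."""
    blocks = []
    for line in raw.splitlines():
        match = LINE_RE.match(line)
        # Lines without a marker are treated as body text.
        tag, text = match.groups() if match else ("BODY", line)
        # Strip the section signs and collapse runs of whitespace.
        text = " ".join(text.replace("§", " ").split())
        if text:
            blocks.append(HEADING_PREFIX.get(tag, "") + text)
    return "\n\n".join(blocks) + "\n"


raw_sample = "  ::H1::§ § WORKING GROUP I§\n  ::BODY::§ § Working Group I"
markdown = to_markdown(raw_sample)
```

The front matter block would then be prepended to this output before writing the `.md` file.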

Info Architecture

/enb/{id}.md/

Where {id} is the name of the original txt file minus the .txt extension.

Asides

Question: but does this make things harder later e.g. when we want to extract sections for tagging? Not sure it really does - we can parse markdown to html and then do the sectioning (the current txt structure does not really give us sections anyway ...)

Chunk text

Do text chunking in a sensible way.

Do we want to store chunks, and if so where?

Structural tagging / markup / annotation

Will use "annotation" following Textus terminology - http://okfnlabs.org/textus/

By structural I mean that it is not about typography but about structure: sections, themes, tracks etc.

How do we want to implement?

  • Inline structural markup into the files
  • Have pointers into the "text stream"

The latter is Textus style and preferable IMO, but it does require us to have a reasonably stable source text (or some way to update the pointers as we change the source text).
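
To make the second option concrete, here is a small sketch of a pointer-style annotation: character offsets into the text plus a digest of the source, so that a changed source is at least detected. This is not Textus's actual data format; the class, field names, and the use of SHA-1 are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


def text_digest(text):
    """Fingerprint of the source text an annotation was made against."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Annotation:
    start: int          # inclusive character offset into the source text
    end: int            # exclusive character offset
    kind: str           # e.g. "track", "format", "topic"
    value: str
    source_digest: str  # digest of the text the offsets refer to


def resolve(annotation, text):
    """Return the annotated span, refusing to guess if the text has changed."""
    if text_digest(text) != annotation.source_digest:
        raise ValueError("source text changed since the annotation was made")
    return text[annotation.start:annotation.end]


source = "WORKING GROUP I\n\nDelegates discussed mitigation."
note = Annotation(0, 15, "track", "Working groups", text_digest(source))
span = resolve(note, source)
```

The digest only detects drift; actually re-anchoring annotations after an edit to the source would need something more, such as storing the quoted text alongside the offsets.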

Text Chunking and Newlines

This is important, but difficult to explain - ask me if I am not clear.

Two 'granularities' are relevant when chunking the ENB reports

  1. The first and coarser level is that of 'sections' corresponding to different tracks or formats
  2. The second and finer level is that of 'paragraphs' corresponding to different topics

Generally, ENB writers

  • Use a title followed by a newline to indicate a change of meeting and therefore a change in 1. (track or format). This is the case for all TITLE and H1, for most H2 and for some H3.
  • Use a title not followed by a newline to indicate a change of topic and therefore a change in 2.

While tagging the tracks and formats (more urgent), we should therefore take into consideration only the titles followed by a new line.

When we move to the topic tagging (less urgent), we should take into consideration all titles.

Hope this helps

Semi-Automated "Actor" Tagging

(This is easy)
A tag referring to a country or a negotiating group has been added to all sections that contained any version of the name of the country or negotiating group as present in a dictionary that I will upload in this Gith
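
A sketch of dictionary-based tagging along these lines. The variant lists below are made up for illustration and are not the real dictionary. Matching the longest variant first is an added design choice, so that "G77 and China" is tagged as the group rather than also as China.

```python
import re

# Hypothetical excerpt; the real dictionary is the one mentioned in the issue.
ACTOR_VARIANTS = {
    "France": ["France", "French"],
    "China": ["China"],
    "G77 and China": ["G77 and China", "G-77/China", "G77/China"],
    "AOSIS": ["AOSIS", "Alliance of Small Island States"],
}


def build_matcher(variants):
    """Return (compiled pattern, lookup from lowercased variant to actor)."""
    lookup = {
        name.lower(): actor
        for actor, names in variants.items()
        for name in names
    }
    # Longest variants first so "G77 and China" wins over plain "China".
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern, lookup


def tag_actors(text, pattern, lookup):
    """Return the sorted list of actors mentioned in `text`."""
    return sorted({lookup[m.group(0).lower()] for m in pattern.finditer(text)})


PATTERN, LOOKUP = build_matcher(ACTOR_VARIANTS)
actors = tag_actors(
    "The G77 and China, supported by AOSIS, opposed the French proposal.",
    PATTERN,
    LOOKUP,
)
```

Whether a mention of a group should also count as a mention of its named member is a tagging decision the sketch does not settle; it simply picks the group.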

[plan] Iteration 1

  • #2 - store raw-ish scraped text

Then three parallel directions

  • Make it into a website:
    • #11 - jekyll-ize it and theme it
    • TODO - make each ENB annotatable
    • #15 - front page listing all ENBs
  • Make the database - cf the data model #4
    • #1 - DB of documents
    • #14 DB of events
  • Chunk the text up
    • #12 - chunk text
Raw ENB HTML => Clean Markdown / Txt => Chunks => Annotated Chunks => Database
