rufuspollock / climate-negotiations

Information on the UNFCCC climate negotiations using the Earth Negotiations Bulletin from the IISD

Home Page: https://rufuspollock.github.io/climate-negotiations/

Python 6.03% HTML 10.19% CSS 81.05% JavaScript 2.72%

climate-change climate-crisis

climate-negotiations's Issues

Typographic tidying of text (questions)

This issue is for discussing ways the texts can be cleaned up typographically.

§ character

The texts have a lot of § characters, e.g. our sample has:

::H1::§ § WORKING GROUP I§

Can § be deleted?

/cc @tommv

Basic website via jekyll

Turn this repo into a website via GitHub Pages + Jekyll

Theme - use https://github.com/okfn/handbook-theme
Special layout for ENB (?)
Switch to gh-pages (by default?)

Database of Events

Create a database of events as a CSV document in /data/events.csv

I imagine this can come directly from: https://github.com/medialab/climateDebateExplorer/blob/master/ENB-data/metadata_overview/events_metadata.json

Fields:

id:       # id - I do not mind whether this is meaningful (e.g. durban-2012) or just a number (e.g. 001), up to you
title:    # e.g. Durban Climate Change Conference - COP17/CMP7 
country:
city:     # e.g. Durban  - question do we need country and city separate - we could combine into place ??
date:
year:   # 2012

Note:

We could make date an actual date, or even have separate start and end dates.
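
A minimal sketch of writing the CSV with the fields listed above. It assumes the events are already available as Python dicts; the layout of events_metadata.json is not described in this issue, so loading it is left out. The helper name `write_events_csv` and the sample row are hypothetical.

```python
import csv
import io

# Column order follows the field list in this issue.
EVENT_FIELDS = ["id", "title", "country", "city", "date", "year"]


def write_events_csv(events, out):
    """Write an iterable of event dicts to `out` as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=EVENT_FIELDS, extrasaction="ignore")
    writer.writeheader()
    for event in events:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({name: event.get(name, "") for name in EVENT_FIELDS})


# Hypothetical sample row, for illustration only.
sample = [
    {
        "id": "001",
        "title": "Durban Climate Change Conference - COP17/CMP7",
        "city": "Durban",
    }
]
buffer = io.StringIO()
write_events_csv(sample, buffer)
print(buffer.getvalue())
```

To write the real file, the same function could be called with `open("data/events.csv", "w", newline="")` in place of the in-memory buffer.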

Semi-Automated "Reporting Format" Tagging

Manual tagging of the most occurring titles.
In the MACRO_SUBTYPE column of the tab “Corrected levels” of this Gdoc https://docs.google.com/spreadsheets/d/1BMgXAVYArvr_nOmysb31Ao-j0ShOBEjtuM9vGC6ZEjU/edit?usp=sharing we manually tagged:
- all H0;
- all H1 occurring at least twice (and some occurring just once);
- all H2 occurring at least 6 times (and some occurring less).

Tommaso: I am pretty confident of this manual tagging

Tagging propagation.
Tags have been propagated downward using the following rule:
when the format of the title of the section is “agenda”, “analysis”, “corridors”, or “history”, tag all sections nested below (at any level, not just the next one) with the same tag.

Tommaso: not sure that propagation has been done correctly
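
To make the propagation rule easier to check, here is a sketch of it in Python. The nested-dict representation of sections (`title`, `format`, `children`) is an assumption made for illustration, not the format actually used in the Gdoc, and the sketch assumes that an inherited tag overwrites whatever tag a nested section already had.

```python
# The four formats that the rule says should be pushed downward.
PROPAGATING_FORMATS = {"agenda", "analysis", "corridors", "history"}


def propagate_format(section, inherited=None):
    """Push a propagating format tag down to every nested section."""
    if inherited is not None:
        # Assumption: the inherited tag replaces any existing tag.
        section["format"] = inherited
    fmt = section.get("format")
    # Other formats (e.g. "plenary") are not passed to nested sections.
    passed_down = fmt if fmt in PROPAGATING_FORMATS else None
    for child in section.get("children", []):
        propagate_format(child, passed_down)
    return section


# Hypothetical example tree.
tree = {
    "title": "A BRIEF ANALYSIS",
    "format": "analysis",
    "children": [
        {
            "title": "Looking ahead",
            "format": "blank",
            "children": [{"title": "Finance", "format": "plenary", "children": []}],
        }
    ],
}
propagate_format(tree)
```

After the call, both nested sections in the example carry the format "analysis". Comparing output like this against the spreadsheet would be one way to verify that the propagation was done correctly.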

Manual checking and tagging completion.

Tommaso: this has not been done yet, but should be

Key Concepts and Data Model

Data Model

Actor
- countries (e.g. France, Saudi Arabia, China - we have a dictionary for this)
- negotiating groups (e.g. G77 and China, AOSIS, - we have a dictionary for this)
Document - the ENB published documents
- Date, Title
- => event (each document is only ever associated with one event)
- URL on the ENB website
Event
- Date (generally the month and the year are enough to distinguish an event)
- Place (the city where the meeting took place)
Sections - essentially correspond to the meeting of a particular track during an event (e.g. the meeting of SBI in Bonn in June 2014) or to a particular reporting bit on an event (e.g. the analysis of one day or the feelings in the corridors)
- Tracks (generally 1, but there may be joint meetings such as SBSTA + SBI)
- Reporting Format (1)
- Document - note we may inline certain document and document->event properties e.g.
  - Date
  - Place (of meeting)
Text chunks - essentially correspond to a particular discussion during a session at an event
- Section (1)
- Actors (0 ≤ actors ≤ as many as mentioned)
- Topics (0 ≤ topics ≤ 3 most prominent)
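
A rough sketch of the data model above as Python dataclasses, mainly to make the cardinalities explicit. The class and field names are illustrative assumptions, not an agreed schema; actors are kept as plain strings that would come from the country and negotiating-group dictionaries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    id: str
    date: str   # month and year are generally enough to identify an event
    place: str  # city where the meeting took place


@dataclass
class Document:
    id: str
    title: str
    date: str
    url: str       # URL on the ENB website
    event_id: str  # each document belongs to exactly one event


@dataclass
class Section:
    document_id: str
    tracks: List[str]      # generally one, more for joint meetings
    reporting_format: str  # exactly one


@dataclass
class TextChunk:
    section: Section
    text: str
    actors: List[str] = field(default_factory=list)  # zero or more
    topics: List[str] = field(default_factory=list)  # zero to three

    def __post_init__(self):
        if len(self.topics) > 3:
            raise ValueError("a chunk keeps at most the 3 most prominent topics")
```

Date and place are reachable from a chunk through its section and document, which is why inlining them on the section is optional rather than required.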

Classifications:

Type - aka Negotiation Track
Subtype - aka Reporting Formats
Topics (e.g. Forest, Mitigation, Adaptation)
For reference, the existing CSV "database" on the climatenegotiations.org site has:

id,title,actors,countries,topics,url,event_id,track,format,date,year,fulltext

Enumerations

Type (negotiation tracks)

COP
IPCC
SBI
SBI/SBSTA (these are joint meetings; it may be more sensible to tag them with both SBI and SBSTA)
SBSTA
Special workshop
Working groups
- BUT it may be better to separate the different working groups
undeducible
blank

Subtypes (reporting formats)

Agenda
Analysis
Corridors
History
Groups (contact groups, informal groups, non-groups...)
Plenary
undeducible
blank

Semi-Automated "Negotiation Track" Tagging

Manual tagging of the most occurring titles.
In the MACRO_TYPE column of the tab “Corrected levels” of this Gdoc https://docs.google.com/spreadsheets/d/1BMgXAVYArvr_nOmysb31Ao-j0ShOBEjtuM9vGC6ZEjU/edit?usp=sharing we manually tagged:
- all H0;
- all H1 occurring at least twice (and some occurring just once);
- all H2 occurring at least 6 times (and some occurring less).

Tommaso: I am pretty confident of this manual tagging

Tagging propagation.
Tags have been propagated downward using the following rule:
When the track of the title of the section is “undeducible” or “blank” (not tagged), take the tag available at the closest higher level.

Tommaso: not sure that propagation has been done correctly
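
A sketch of this inheritance rule, again to help with checking. It assumes the sections are available as a flat list in document order, each with an integer `level` (0 for H0, 1 for H1, and so on) and a `track`; that representation and the function name are assumptions made for illustration.

```python
UNTAGGED = {"undeducible", "blank", ""}


def inherit_tracks(sections):
    """Fill untagged tracks from the closest tagged ancestor heading."""
    ancestors = {}  # level -> resolved track of the latest heading at that level
    for section in sections:
        level = section["level"]
        # Headings at this level or deeper are siblings, not ancestors.
        for stale in [lvl for lvl in ancestors if lvl >= level]:
            del ancestors[stale]
        if section["track"] in UNTAGGED:
            # Walk upward from the nearest ancestor until a tag is found.
            for lvl in sorted(ancestors, reverse=True):
                if ancestors[lvl] not in UNTAGGED:
                    section["track"] = ancestors[lvl]
                    break
        ancestors[level] = section["track"]
    return sections


# Hypothetical example.
sections = [
    {"level": 0, "track": "COP"},
    {"level": 1, "track": "blank"},
    {"level": 2, "track": "SBI"},
    {"level": 2, "track": "undeducible"},
    {"level": 1, "track": "SBSTA"},
    {"level": 2, "track": "blank"},
]
inherit_tracks(sections)
```

A section with no tagged ancestor at all keeps its "undeducible" or "blank" value, so those cases remain visible for the manual checking step.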

Manual checking and tagging completion.

Tommaso: This has not been done yet, but should be

First pass on storing the scraped raw text

As a first pass I'd suggest we store the raw text in a decent form in this repo.

SciPo already have semi-structured raw text based on scrape of ENB (perhaps with some corrections?).

I suggest we do not store this SciPo text as-is, but transform it a bit into nice markdown and then store that.

Why?

More standardized markup so easier for others to edit
We get nice html pages for free when we jekyll-ize

Document Structure


---
title:
id:    # e.g. 1205000e where file was 1205000e.txt
abstract: 
date:
url:     # source url of ENB from which text came

---

text goes here in markdown form

I suggest we therefore get rid of the odd quasi-html structure (where is this from?) and replace with markdown:

  ::H1::§ § WORKING GROUP I§
  ::BODY::§ § Working Group I
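
A sketch of what the conversion could look like for lines such as the two above. It assumes that § carries no meaning and can simply be dropped (still an open question in the typography issue), and that `::H<n>::` markers map to markdown headings with H0 as the top level; both are assumptions, and `to_markdown` is a hypothetical helper.

```python
import re

# Matches lines of the form ::TAG::text
MARKER = re.compile(r"^\s*::(\w+)::(.*)$")


def to_markdown(line):
    """Convert one `::TAG::text` line to a markdown line."""
    match = MARKER.match(line)
    if not match:
        return line.strip()
    tag, text = match.groups()
    # Assumption: "§" can be dropped; repeated whitespace is collapsed.
    text = " ".join(text.replace("§", " ").split())
    heading = re.fullmatch(r"H(\d)", tag)
    if heading:
        # H0 becomes "#", H1 becomes "##", and so on.
        return "#" * (int(heading.group(1)) + 1) + " " + text
    # BODY and any other tag are emitted as plain paragraph text.
    return text


print(to_markdown("::H1::§ § WORKING GROUP I§"))
print(to_markdown("::BODY::§ § Working Group I"))
```

If § turns out to encode line breaks that matter for chunking, the replacement step would need to keep that information instead of discarding it.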

Info Architecture

/enb/{id}.md/

Where {id} is the name of the original txt file minus the .txt extension.
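
A sketch of how a file could be named and written out with the front matter from the document structure above. The function names are hypothetical. Values are written as JSON strings, which are also valid YAML double-quoted strings, so that titles containing colons do not break the front matter.

```python
import json
from pathlib import PurePosixPath

FRONT_MATTER_KEYS = ["title", "id", "abstract", "date", "url"]


def target_path(txt_name):
    """Map an original file name such as 1205000e.txt to enb/1205000e.md."""
    return str(PurePosixPath("enb") / (PurePosixPath(txt_name).stem + ".md"))


def render_document(meta, body):
    """Return the front matter block followed by the markdown body."""
    lines = ["---"]
    # Missing keys are written as empty strings.
    lines += [f"{key}: {json.dumps(meta.get(key, ''))}" for key in FRONT_MATTER_KEYS]
    lines += ["---", "", body]
    return "\n".join(lines) + "\n"


print(target_path("1205000e.txt"))
print(render_document({"id": "1205000e", "title": "Example: a title"}, "text goes here"))
```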

Asides

Question: but does this make things harder later e.g. when we want to extract sections for tagging? Not sure it really does - we can parse markdown to html and then do the sectioning (the current txt structure does not really give us sections anyway ...)

Chunk text

Do text chunking in a sensible way.

Do we want to store chunks, and if so where?

Database of the source documents (i.e. ENB publications)

Store at: data/documents.csv

What is wanted:

id
source url
date
event - event id from event table #14

Do we already have this?

If not we can sort of get it from the txt files we have.

/cc @tommv

ENB Archive index page

List all ENB documents.

Semi-Automated "Topic" Tagging

TODO: detail the algorithm.

Structural tagging / markup / annotation

Will use "annotation" following Textus terminology - http://okfnlabs.org/textus/

By structural I mean that it is not about typography but about structure: sections, themes, tracks etc.

How do we want to implement?

Inline structural markup into the files
Have pointers into the "text stream"

The latter is Textus style and preferable IMO, but it does require us to have a reasonably stable source text (or some way to update the pointers as we change the source text).
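
A sketch of the "pointers into the text stream" option, to show why it depends on a stable source text. This is not the Textus data format; the `Annotation` and `AnnotationSet` classes and the fingerprint check are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def fingerprint(text):
    """Hash of the source text, used to detect that offsets may be stale."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    kind: str   # e.g. "section", "track", "topic"
    value: str


@dataclass
class AnnotationSet:
    source_fingerprint: str
    annotations: List[Annotation] = field(default_factory=list)

    def is_stale(self, text):
        # Offsets are only trustworthy for the exact text they were made on.
        return fingerprint(text) != self.source_fingerprint


text = "WORKING GROUP I\nDelegates discussed mitigation."
heading = Annotation(0, 15, "section", "Working Group I")
notes = AnnotationSet(fingerprint(text), [heading])
print(text[heading.start:heading.end])
print(notes.is_stale(text + " An edit."))
```

The source files stay free of markup, at the cost that any edit to a file invalidates its offsets until they are recomputed.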

Text Chunking and Newlines

This is important, but difficult to explain - ask me if I am not clear.

Two 'granularities' are relevant when chunking the ENB reports

The first and coarser level is that of 'sections' corresponding to different tracks or formats
The second and finer level is that of 'paragraphs' corresponding to different topics

Generally, ENB writers

Use a title followed by a newline to indicate a change of meeting and therefore a change at the first level (track or format). This is the case for all the TITLE and H1, for most of the H2 and for some of the H3.
Use a title not followed by a newline to indicate a change of topic and therefore a change at the second level.

While tagging the tracks and formats (more urgent), we should therefore take into consideration only the titles followed by a new line.

When we move to the topic tagging (less urgent), we should take all titles into consideration.
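
A sketch of the two granularities in code. It assumes the text has already been split into blocks, each a dict with `kind` ("title" or "body"), `text`, and for titles a boolean `newline_after`; how that flag is recovered from the raw files is not covered here, and the block format is hypothetical.

```python
def chunk(blocks):
    """Group blocks into sections (coarse) that hold paragraphs (fine)."""
    sections = []
    for block in blocks:
        if block["kind"] == "title" and block["newline_after"]:
            # A title followed by a newline opens a new section.
            sections.append({"title": block["text"], "paragraphs": []})
            continue
        if not sections:
            # Text before the first section title gets an untitled section.
            sections.append({"title": None, "paragraphs": []})
        paragraphs = sections[-1]["paragraphs"]
        if block["kind"] == "title":
            # A run-in title opens a new paragraph (change of topic).
            paragraphs.append({"title": block["text"], "text": ""})
        elif paragraphs and not paragraphs[-1]["text"]:
            # Body text directly after a run-in title belongs to it.
            paragraphs[-1]["text"] = block["text"]
        else:
            paragraphs.append({"title": None, "text": block["text"]})
    return sections


# Hypothetical example blocks.
blocks = [
    {"kind": "title", "text": "SBI", "newline_after": True},
    {"kind": "title", "text": "Finance:", "newline_after": False},
    {"kind": "body", "text": "Parties discussed funding."},
    {"kind": "body", "text": "A second paragraph."},
    {"kind": "title", "text": "COP PLENARY", "newline_after": True},
    {"kind": "body", "text": "The President opened the session."},
]
result = chunk(blocks)
```

Track and format tagging would then only look at `result[i]["title"]`, while topic tagging would also use the paragraph titles.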

Hope this helps

Semi-Automated "Actor" Tagging

(This is easy)
A tag referring to a country or a negotiating group has been added to all sections that contained any version of the name of the country or negotiating group as present in a dictionary that I will upload in this Gith
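
A sketch of dictionary-based tagging along these lines. The dictionary contents below are made up for illustration (the real dictionary is the one mentioned above), and matching case-insensitively on whole words is an assumption.

```python
import re


def build_matcher(variants):
    """Compile one regex that matches any name variant as a whole word."""
    # Longest variants first, so "G77 and China" wins over "China".
    ordered = sorted(variants, key=len, reverse=True)
    pattern = "|".join(re.escape(variant) for variant in ordered)
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)", re.IGNORECASE)


def tag_actors(text, dictionary):
    """Return the sorted canonical names of all actors mentioned in text."""
    lookup = {
        variant.lower(): actor
        for actor, variants in dictionary.items()
        for variant in variants
    }
    matcher = build_matcher(lookup.keys())
    return sorted({lookup[m.group(0).lower()] for m in matcher.finditer(text)})


# Hypothetical dictionary: canonical name -> known spellings.
dictionary = {
    "China": ["China"],
    "France": ["France"],
    "G77 and China": ["G77 and China", "G-77/China"],
}
tags = tag_actors("The G-77/China, supported by France, called for finance.", dictionary)
print(tags)
```

With this approach a mention of the G-77/China tags the group only, not China as a country; whether that is the desired behaviour is a decision for the dictionary owners.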

[plan] Iteration 1

#2 - store raw-ish scraped text

Then three parallel directions

Make it into a website:
- #11 - jekyll-ize it and theme it
- TODO - make each ENB annotatable
- #15 - front page listing all ENBs
Make the database - cf the data model #4
- #1 - DB of documents
- #14 DB of events
Chunk the text up
- #12 - chunk text

Raw ENB HTML => Clean Markdown / Txt => Chunks => Annotated Chunks => Database