gorcenski / women-streets-berlin

Exploring the women's history hidden in the street names of Berlin

License: Other
The public-facing part of this project will be a static site capable of displaying content about the streets of Berlin, showing some qualitative and quantitative analysis, and providing means for users to contribute content of their own.
The static site model is chosen because hosting is inexpensive and portable. Any user can replicate the site simply by downloading the content, and no additional tooling or configuration is required to host the site anywhere. This is particularly well-suited for archival tools, wherein dependencies on software severely hinder the longevity of the archive.
A suitable Static Site Generator will need to be selected. Some of the requirements:
For the purposes of this milestone, configuring hosting will be considered separate. The desired MVP is a working, semi-styled static site capable of displaying content and providing tools for contributors. It is acceptable to have a working site running in dev.
For the approach, the following steps will be taken:
With respect to the last two points, those may become epics or milestones themselves.
Given a suitable choice of SSG, we should select a template that will suit the project. Desired features include:
The external data sources should be documented so others can replicate the work. This ticket captures the work needed to document the sources of the external data and to put it into a place where it can be processed into a usable form.
The files are large, so it makes little sense to keep them in this repository. A brief data-pipeline shell script is probably a good idea instead, along with documentation of the data sources and the tools needed for extraction.
Once we select an SSG tool (#9) and a template (#10), we can set up a landing page and begin to craft how we want the menu and page taxonomies to operate. We'll probably want a blog component, but the main content -- information about streets -- should reside in pages. We'll therefore also need content indices, etc. This ticket captures that work, and will be further expanded once #10 is complete.
Ideally I'll want the data to be pushed to the web, but still have a data and exploration pipeline available. Using a static site generator is probably ideal, so how does that fit into the repo? I don't know but I'll figure it out.
My vision for this project is to create a lightweight, hostable knowledge repository regarding the history of women in Berlin and the places named for them. With this in mind, I have a few key goals:
Given this, the first objective is to figure out how to distill the 8000+ public place names in Berlin into a workable form and have a flexible data model off which this can be built. Some of the key factors regarding the aforementioned points are discussed here.
My vision is to be able to use a static site generator (SSG) to create a lightweight, easily clonable solution. SSG hosting is inexpensive and easy to set up. The general SSG workflow is something like this: Generate source material > compile source material using SSG templates > host compiled material in basic HTML format.
I have built SSG sites that can be hosted for roughly 1 euro a month on Amazon S3. I would like to replicate this here. Most SSG sites use Markdown (md) files as source material. These often contain some YAML frontmatter and the SSG tool uses templates to take the md file and convert it into HTML, building all the necessary path references for images, etc. as necessary.
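A street page's source file might look something like the following (the field names here are illustrative placeholders, not a fixed schema):

```markdown
---
title: Hannah-Karminski-Straße
person: Hannah Karminski
tags: [street, friedrichshain-kreuzberg]
---

Biographical content about Hannah Karminski goes here.
```

The SSG tool reads the YAML frontmatter between the `---` markers for metadata (titles, taxonomies, index pages) and compiles the Markdown body into HTML.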
For this project, we will need steps to the left of generating the source material. Namely, we need to extract street names and sanitize data before source material can be generated. This will be discussed below. One of the drawbacks to static sites is that they are not dynamic (vacuously), which means that a build pipeline will have to be integrated. This is not so difficult to accomplish using AWS Lambda and webhooks.
Ideally, this knowledge base will be contributed to by many people in a peer-reviewable way. The process of extracting place names and sanitizing data will be partly manual. I estimate that 80% or more of the place names can be extracted automatically using basic pattern-matching techniques. The remaining names will require either manual intervention or more complex tooling. Because the data is of a human-digestible size, manual intervention makes more sense than a complex scaffold of statistical tools.
By way of example, many streets named after people take the form of, e.g., Hannah-Karminski-Straße. However, this is not universal, and it is simple enough to write a more comprehensive set of rules to capture, e.g., Kopernikusstraße, but at some point the process of identifying all the heuristics becomes the task of just doing the work manually to begin with.
In any case, extraction of place names is only one step. The data will need to be sanitized and this will be a largely human-centered process. For instance, Melli-Beese-Straße and Amelie-Beese-Zeile are (probably?) both named for the same person, but this would not be clear unless we further incorporate a nickname correlation dictionary. This is not worth it for a few hundred data points. Instead, we should rely on human-centered knowledge to be able to merge these entries as two place names referencing the same historical person.
The ideal model will allow users to merge the two entries for these two places into one entry representing a person with two place names linked to that identity. This may change over time, as not every such relationship may be immediately clear.
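One way to model this merge operation is a `Person` record that carries aliases and a set of linked place names; the structure below is a hypothetical sketch, not a committed data model:

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    canonical_name: str
    aliases: set[str] = field(default_factory=set)
    places: set[str] = field(default_factory=set)

def merge(a: Person, b: Person) -> Person:
    """Fold two entries into one once a human confirms they are the same person."""
    merged = Person(canonical_name=a.canonical_name)
    merged.aliases = a.aliases | b.aliases | {b.canonical_name}
    merged.places = a.places | b.places
    return merged

# The Melli-Beese example from above:
melli = Person("Melli Beese", places={"Melli-Beese-Straße"})
amelie = Person("Amelie Beese", places={"Amelie-Beese-Zeile"})
beese = merge(melli, amelie)
```

Because the merge is a plain data transformation, it can also be reversed later if the identification turns out to be wrong.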
Beyond the data extraction/sanitization issues, my vision for this project includes the ability to contribute additional material and knowledge and to edit information as necessary in the future. The ideal flow will use a Github-like (or even Github-actual) PR model, where peer review is necessary to publish content. An overarching vision I have is to make Github-style PR flows more familiar to academicians outside of the technology industry. This has a number of benefits over the Wiki model (see: Why not just use a wiki? below).
Upon a merged PR, the content generation pipeline should kick off automatically.
Ideally, this project will be able to expose how women are treated in the historical record vs men. (Because we are talking about modern European history, I am using a false gender binary dichotomy, as it is more likely that binary gender will overwhelmingly arise in the historical record). Therefore, our data should be in a place where we can quantify the history of women memorialized in the names of Berlin's public spaces, including numbers of people, numbers of places per person, the centrality of those locations, the lifespans and lifetimes of those people, the searchable record (e.g. number of words per wikipedia article), and so on. As such, the underlying data should make this easy and replicable.
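With records in a simple, consistent shape, those quantitative questions reduce to one-liners. The records below are fabricated placeholders purely to show the shape of the computation:

```python
from collections import Counter

# Illustrative records only -- real data comes from the extraction pipeline.
records = [
    {"person": "Melli Beese", "gender": "f",
     "places": ["Melli-Beese-Straße", "Amelie-Beese-Zeile"]},
    {"person": "Nikolaus Kopernikus", "gender": "m",
     "places": ["Kopernikusstraße"]},
]

# Number of memorialized people by gender
people_by_gender = Counter(r["gender"] for r in records)

# Number of places per person
places_per_person = {r["person"]: len(r["places"]) for r in records}
```

Metrics like location centrality or Wikipedia article length would hang off the same records as additional fields.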
Mapping data is often highly specific and not always easy to present concurrently with other data (such as extensive text). Therefore, we should be sure that our data includes ways to incorporate mapping information in a way that can be easily extracted or converted to use with various mapping software (e.g. Open Street Maps).
This epic will capture tickets to discuss the approach using the four key points as described above. Roughly outlined, the work will proceed according to the following vague roadmap:
I retain the right to change this last step somewhat, as it appears to have the potential for being a self-defeating rabbit hole.
The Wiki model has a number of flaws, in my opinion. First, it is much more difficult to host on a static site. Second, the edit-by-anyone model has led to quite a bit of corruption of information, trolling, and academic dishonesty. Wikipedia isn't considered a reliable source for many reasons, but publish-before-review is a major one. Also, Wikis are terribly old-school in terms of basic formatting and usability. I dislike the model, and I believe that the peer review/pull request model is superior and will allow for better content moderation and accuracy.
For this ticket, we'll evaluate static site generator tools and select an appropriate one based on the availability of desired features.
These features necessarily include:
Get a few things set up for the repo:
The data processing pipeline should perform the following steps:
This can probably be accomplished with a shell script and a basic Python script. I haven't decided on the output format yet, so that remains to be determined.
The pipeline is basically:
Download source Geo and name-gender data > process Geo data > merge with name-gender data > place-gender correlation data.
For this last step, a python script will do the place-gender correlation and will output the data in a more generalized data model, in this case, a JSON file that can be used to further generate the data in a more suitable markdown file or something similar.
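A minimal sketch of that last step might look like the following. The inline name-gender table is a stand-in for the real external data source, and the record fields are assumptions, not a settled schema:

```python
import json

# Stand-in for the external name-gender dataset.
NAME_GENDER = {"Hannah": "f", "Melli": "f", "Nikolaus": "m"}

def correlate(streets: dict[str, str]) -> list[dict]:
    """streets maps street name -> extracted person name.

    Returns generalized records suitable for JSON serialization.
    """
    out = []
    for street, person in streets.items():
        first = person.split()[0]
        out.append({
            "street": street,
            "person": person,
            "gender": NAME_GENDER.get(first, "unknown"),
        })
    return out

streets = {"Hannah-Karminski-Straße": "Hannah Karminski"}
print(json.dumps(correlate(streets), ensure_ascii=False, indent=2))
```

The JSON output then feeds the downstream step that generates Markdown source material for the SSG.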
This will be the final step in the automated extraction pipeline.