gorcenski / women-streets-berlin

Exploring the women's history hidden in the street names of Berlin

License: Other
The public-facing part of this project will be a static site capable of displaying content about the streets of Berlin, showing some qualitative and quantitative analysis, and providing means for users to contribute content of their own.
The static site model is chosen because hosting is inexpensive and portable. Any user can replicate the site simply by downloading the content, and no additional tooling or configuration is required to host the site anywhere. This is particularly well-suited for archival tools, wherein dependencies on software severely hinder the longevity of the archive.
A suitable Static Site Generator will need to be selected. Some of the requirements:
For the purposes of this milestone, configuring hosting will be considered separate. The desired MVP is a working, semi-styled static site capable of displaying content and providing tools for contributors. It is acceptable to have a working site running in dev.
For the approach, the following steps will be taken:
With respect to the last two points, those may become epics or milestones themselves.
Given a suitable choice of SSG, we should select a template that will suit the project. Desired features include:
The external data sources should be documented so others can replicate the work. This ticket captures the work needed to document the sources of the external data and to put it into a place where it can be processed into a usable form.
The files are large, so it makes little sense to keep them in this repository. A brief data-pipeline shell script is probably a good idea instead, along with documentation of the data sources and the tools needed for extraction.
Once we select an SSG tool (#9) and a template (#10), we can set up a landing page and begin to craft how we want the menu and page taxonomies to operate. We'll probably want a blog component, but the main content -- information about streets -- should reside in pages. We'll therefore also need content indices, etc. This ticket captures that work, and will be further expanded once #10 is complete.
Ideally I'll want the data to be pushed to the web, but still have a data and exploration pipeline available. Using a static site generator is probably ideal, so how does that fit into the repo? I don't know but I'll figure it out.
My vision for this project is to create a lightweight, hostable knowledge repository regarding the history of women in Berlin and the places named for them. With this in mind, I have a few key goals:
Given this, the first objective is to figure out how to distill the 8000+ public place names in Berlin into a workable form and have a flexible data model off which this can be built. Some of the key factors regarding the aforementioned points are discussed here.
My vision is to be able to use a static site generator (SSG) to create a lightweight, easily clonable solution. SSG hosting is inexpensive and easy to set up. The general SSG workflow is something like this: Generate source material > compile source material using SSG templates > host compiled material in basic HTML format.
I have built SSG sites that can be hosted for roughly 1 euro a month on Amazon S3. I would like to replicate this here. Most SSG sites use Markdown (md) files as source material. These often contain some YAML frontmatter and the SSG tool uses templates to take the md file and convert it into HTML, building all the necessary path references for images, etc. as necessary.
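A street page's source file might look something like the following (the field names here are illustrative placeholders, not a fixed schema):

```markdown
---
title: Hannah-Karminski-Straße
person: Hannah Karminski
tags: [street, friedrichshain-kreuzberg]
---

Biographical content about Hannah Karminski goes here.
```

The SSG tool reads the YAML frontmatter between the `---` markers for metadata (titles, taxonomies, index pages) and compiles the Markdown body into HTML.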
For this project, we will need steps to the left of generating the source material. Namely, we need to extract street names and sanitize data before source material can be generated. This will be discussed below. One of the drawbacks to static sites is that they are not dynamic (vacuously), which means that a build pipeline will have to be integrated. This is not so difficult to accomplish using AWS Lambda and webhooks.
Ideally, this knowledge base will be contributed to by many people in a peer-reviewable way. The process of extracting place names and sanitizing data will be partly manual. I estimate that 80% or more of the place names can be extracted automatically using basic pattern-matching techniques. The remaining names will require either manual intervention or more complex tooling. Because the data is of a human-digestible size, manual intervention makes more sense than a complex scaffold of statistical tools.
By way of example, many streets named after people take the form of, e.g., Hannah-Karminski-Straße. However, this is not universal, and it is simple enough to write a more comprehensive set of rules to capture, e.g., Kopernikusstraße, but at some point the process of identifying all the heuristics becomes the task of just doing the work manually to begin with.
In any case, extraction of place names is only one step. The data will need to be sanitized and this will be a largely human-centered process. For instance, Melli-Beese-Straße and Amelie-Beese-Zeile are (probably?) both named for the same person, but this would not be clear unless we further incorporate a nickname correlation dictionary. This is not worth it for a few hundred data points. Instead, we should rely on human-centered knowledge to be able to merge these entries as two place names referencing the same historical person.
The ideal model will allow users to merge the two entries for these two places into one entry representing a person with two place names linked to that identity. This may change over time, as not every such relationship may be immediately clear.
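One way to model this merge operation is a `Person` record that carries aliases and a set of linked place names; the structure below is a hypothetical sketch, not a committed data model:

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    canonical_name: str
    aliases: set[str] = field(default_factory=set)
    places: set[str] = field(default_factory=set)

def merge(a: Person, b: Person) -> Person:
    """Fold two entries into one once a human confirms they are the same person."""
    merged = Person(canonical_name=a.canonical_name)
    merged.aliases = a.aliases | b.aliases | {b.canonical_name}
    merged.places = a.places | b.places
    return merged

# The Melli-Beese example from above:
melli = Person("Melli Beese", places={"Melli-Beese-Straße"})
amelie = Person("Amelie Beese", places={"Amelie-Beese-Zeile"})
beese = merge(melli, amelie)
```

Because the merge is a plain data transformation, it can also be reversed later if the identification turns out to be wrong.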
Beyond the data extraction/sanitization issues, my vision for this project includes the ability to contribute additional material and knowledge and to edit information as necessary in the future. The ideal flow will use a Github-like (or even Github-actual) PR model, where peer review is necessary to publish content. An overarching vision I have is to make Github-style PR flows more familiar to academicians outside of the technology industry. This has a number of benefits over the Wiki model (see: Why not just use a wiki? below).
Upon a merged PR, the content generation pipeline should kick off automatically.
Ideally, this project will be able to expose how women are treated in the historical record vs men. (Because we are talking about modern European history, I am using a false gender binary dichotomy, as it is more likely that binary gender will overwhelmingly arise in the historical record). Therefore, our data should be in a place where we can quantify the history of women memorialized in the names of Berlin's public spaces, including numbers of people, numbers of places per person, the centrality of those locations, the lifespans and lifetimes of those people, the searchable record (e.g. number of words per wikipedia article), and so on. As such, the underlying data should make this easy and replicable.
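With records in a simple, consistent shape, those quantitative questions reduce to one-liners. The records below are fabricated placeholders purely to show the shape of the computation:

```python
from collections import Counter

# Illustrative records only -- real data comes from the extraction pipeline.
records = [
    {"person": "Melli Beese", "gender": "f",
     "places": ["Melli-Beese-Straße", "Amelie-Beese-Zeile"]},
    {"person": "Nikolaus Kopernikus", "gender": "m",
     "places": ["Kopernikusstraße"]},
]

# Number of memorialized people by gender
people_by_gender = Counter(r["gender"] for r in records)

# Number of places per person
places_per_person = {r["person"]: len(r["places"]) for r in records}
```

Metrics like location centrality or Wikipedia article length would hang off the same records as additional fields.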
Mapping data is often highly specific and not always easy to present concurrently with other data (such as extensive text). Therefore, we should be sure that our data includes ways to incorporate mapping information in a way that can be easily extracted or converted to use with various mapping software (e.g. Open Street Maps).
This epic will capture tickets to discuss the approach using the four key points as described above. Roughly outlined, the work will proceed according to the following vague roadmap:
I retain the right to change this last step somewhat, as it appears to have the potential for being a self-defeating rabbit hole.
The Wiki model has a number of flaws, in my opinion. First, it is much more difficult to host on a static site. Second, the edit-by-anyone model has led to quite a bit of corruption of information, trolling, and academic dishonesty. Wikipedia isn't considered a reliable source for many reasons, but publish-before-review is a major one. Also, Wikis are terribly old-school in terms of basic formatting and usability. I dislike the model, and I believe that the peer review/pull request model is superior and will allow for better content moderation and accuracy.
For this ticket, we'll evaluate static site generator tools and select an appropriate one based on the availability of desired features.
These features necessarily include:
Get a few things set up for the repo:
The data processing pipeline should perform the following steps:
This can probably be accomplished with a shell script and a basic Python script. I haven't decided on the output format yet, so that remains to be determined.
The pipeline is basically:
Download source Geo and name-gender data > process Geo data > merge with name-gender data > place-gender correlation data.
For this last step, a python script will do the place-gender correlation and will output the data in a more generalized data model, in this case, a JSON file that can be used to further generate the data in a more suitable markdown file or something similar.
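A minimal sketch of that last step might look like the following. The inline name-gender table is a stand-in for the real external data source, and the record fields are assumptions, not a settled schema:

```python
import json

# Stand-in for the external name-gender dataset.
NAME_GENDER = {"Hannah": "f", "Melli": "f", "Nikolaus": "m"}

def correlate(streets: dict[str, str]) -> list[dict]:
    """streets maps street name -> extracted person name.

    Returns generalized records suitable for JSON serialization.
    """
    out = []
    for street, person in streets.items():
        first = person.split()[0]
        out.append({
            "street": street,
            "person": person,
            "gender": NAME_GENDER.get(first, "unknown"),
        })
    return out

streets = {"Hannah-Karminski-Straße": "Hannah Karminski"}
print(json.dumps(correlate(streets), ensure_ascii=False, indent=2))
```

The JSON output then feeds the downstream step that generates Markdown source material for the SSG.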
This will be the final step in the automated extraction pipeline.