data4democracy / project-ideas


A place for project ideas to live, be discussed, and be brought to life!


project-ideas's Issues

Project Idea: Visualizations illustrating the impact of bad data ethics

Just what percentage of continuing wealth disparity is due to continuing discriminatory practices based on algorithms? If these algorithms are fixed, what are the anticipated economic and social benefits?

If ___ % of these 20 global corporations or these government or academic institutions adhered to the following set of data standards, what would the impact be on ____ % of transactions? Of academic papers? Of health or privacy of the new generations?

Looking for help generating the concepts, background data wrangling, javascript and data artistry, inspired primarily through the work of The Pudding (https://pudding.cool/)

Anti-Trust Data

Gathering data on the impact of tech companies: how their platforms can be used to create false perceptions, and how being cut off from a tech company's platform can create major problems!

Many projects only use GitHub and Slack.
If you take those platforms away, it's very hard to interact with people on those platforms. This means that if a person in the abuse department disagrees with something you're saying, they can completely remove you from the community! This data is hidden, so it's hard to see how widespread it actually is.
Try not using Gmail, Slack, or other big-company services, and it may become clearer how much control those companies have.

This would likely also get attention and feedback from lots of people. Look at TikTok and the latest news around how tech companies are handling the election.


Tenant Assistance/Eviction Fighter Chatbot

If you're interested in chatbots and/or tenant rights issues, check out #p-eviction-chatbot! Inspired by the "DoNotPay" bot, whose inventor was featured on Partially Derivative earlier this year, this project was prototyped during the Sketch City Houston Hackathon. An initial version for Houston tenants is operational, and the eventual aim is to make a chatbot that can provide advice and, in some cases, legal document generation, to users throughout the US.

There are a few well-defined issues already spelled out in the GitHub repo, including:

evaluation of different chatbot frameworks (the prototype was built using motion.ai, but that may change for the next iteration)
front end development help (the prototype is accessed through a simple GH page)
automatic document generation
and more!

Plotting traffic tickets in Washington

VA Open Data

Since public money is used to both create conditions for a variety of mental health conditions (e.g. PTSD, TBI, suicide) to develop and pay for their subsequent research and treatment, it's in the public interest to understand how our own defense spending is creating a mental health crisis in the military and veteran communities.

Right now I'd like to gather the available datasets, including meta-analyses of what's already been published on the topic. But ultimately I'd like to create some engaging and interactive visualizations for public awareness.

I'd be willing to take the lead, but my skills are limited to Python and MySQL. It might be nice to have R and JS folks aboard too.

Example dataset from the VA below
https://catalog.data.gov/dataset?collection_package_id=2f1fe591-ca2d-455a-aede-cc6a5ef2c5cd

D4D Slack Channel Viz

The D4D Slack is a rich ecosystem, but it's a bit beastly to identify which of the nearly 200 channels one may want to join! A roadmap to help members know where to find the location/interest/language/package/project channels they want to follow would improve the onboarding experience, as well as help existing members keep up with what's new.

Build a phone app to detect factual probability of statements read or heard

This would be based on machine learning, which I know about. I need help with phone app development, data sources/munging, and creating a Kickstarter project.

Media Citation Crawler and Tree Generator

Proposed itinerary at bottom :)

I realized my last description on the Slack left a bit to be desired, so I wanted to flesh it out:

What I'm proposing is a media citation and reference crawler which can produce reference trees for analysis and determining the strength of a source (with respect to how well it backs itself up with citations, at least).

Let's say, for instance, that you take a Washington Post article...

You would then grab only the body content of the article itself with a web scraper and grab every <a href="..."> tag from it. You could also save the context of the tag--all the paragraph text surrounding it--tagging the words and content used to frame the reference as the source/description/lead-up/etc. This could be done with something like Python's lxml package and a little tree traversal, but let's forget those implementation details for now.
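As a rough sketch of the link-plus-context idea above: the paragraph mentions lxml, but this example swaps in Python's standard-library `html.parser` so it runs without extra installs. The class name `LinkContextParser`, the function `extract_links`, and the sample HTML are all hypothetical, and it assumes references sit inside `<p>` tags.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect each <a href> inside a <p>, with the paragraph text around it."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished records: href, anchor, context
        self._in_p = False
        self._p_text = []      # text chunks of the current paragraph
        self._p_links = []     # links seen in the current paragraph
        self._current = None   # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._p_text, self._p_links = True, [], []
        elif tag == "a" and self._in_p:
            href = dict(attrs).get("href")
            if href:
                self._current = {"href": href, "anchor": ""}

    def handle_data(self, data):
        if self._in_p:
            self._p_text.append(data)
            if self._current is not None:
                self._current["anchor"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._p_links.append(self._current)
            self._current = None
        elif tag == "p" and self._in_p:
            # Normalize whitespace, then attach the paragraph to each link.
            context = " ".join("".join(self._p_text).split())
            for link in self._p_links:
                link["context"] = context
                self.links.append(link)
            self._in_p = False


def extract_links(html):
    parser = LinkContextParser()
    parser.feed(html)
    return parser.links


# Made-up sample markup, not a real article.
sample = (
    '<div><p>The agency <a href="https://example.gov/report">published '
    'a report</a> on Monday.</p><p>No links here.</p></div>'
)
links = extract_links(sample)
```

Each record keeps the anchor text and the full surrounding paragraph, which is the raw material for the framing analysis described above. Real article pages would still need site-specific logic to isolate the body content first.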

Imagine that this article itself is a node in a larger n-ary tree. Its children and parents are tweets, articles on other sites, government releases, comments and text posts on Reddit, and maybe some other internal articles from Washington Post--all the way down to collected transcripts. Let's call these media nodes for reference. All the articles and their references out there are just hanging out in one big, generally acyclic graph.

You could start from any of these potential media nodes and build a tree of sources from a given root media node. You could even allow a user to submit an article, tweet, post, or whatever on a web frontend to generate a tree of a certain (probably limited) depth which they can have visualized.
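A depth-limited tree like the one described here could be built with a breadth-first walk. This is only a sketch: `build_reference_tree`, the `get_links` callback, and the `FAKE_WEB` dictionary are hypothetical stand-ins for a real scraper, and each URL is expanded at most once so that citation loops cannot grow the tree forever.

```python
from collections import deque


def build_reference_tree(root, get_links, max_depth=2):
    """Breadth-first expansion of citations into a nested dict tree."""
    tree = {"url": root, "children": []}
    seen = {root}
    queue = deque([(tree, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the node, but do not follow its links
        for url in get_links(node["url"]):
            if url in seen:
                continue  # already in the tree; skip to avoid cycles
            seen.add(url)
            child = {"url": url, "children": []}
            node["children"].append(child)
            queue.append((child, depth + 1))
    return tree


# Hypothetical link graph standing in for real scraped pages.
FAKE_WEB = {
    "article": ["tweet", "press-release"],
    "tweet": ["article", "transcript"],  # cites the article back: a cycle
    "press-release": ["transcript"],
    "transcript": ["archive"],
}

tree = build_reference_tree("article", lambda u: FAKE_WEB.get(u, []), max_depth=2)
```

With `max_depth=2`, the transcript appears under the tweet but its own links are not followed. Dropping already-seen URLs keeps the result a tree; a graph database would be the place to keep the repeated edges if they matter for analysis.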

Scaled up, this could also permit analysis of citations from bodies of sources themselves. How often does WaPo (or other media entities) cite external sources? How often does WaPo cite themselves as an entity? What can be said for the authors of articles and comments? What can be said for users on Twitter? Do sources from certain entities tend to fall back on government documents and corroborated sources, tweets from the horse's mouth, or just good-old-fashioned hot air two levels down? How deep does a certain rabbit hole go?
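The self-versus-external question could start from something as simple as comparing host names. In this sketch, `site_of` and `citation_profile` are hypothetical helpers, the URLs are made up, and all links are assumed to be absolute (a relative link has no host and would be counted as external here).

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Return the host name without a leading 'www.' (a crude site key)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def citation_profile(article_url, cited_urls):
    """Count self-citations versus external citations for one article."""
    home = site_of(article_url)
    counts = Counter()
    for url in cited_urls:
        counts["self" if site_of(url) == home else "external"] += 1
    return counts


# Made-up URLs for illustration only.
profile = citation_profile(
    "https://www.example-news.com/politics/story",
    [
        "https://example-news.com/2017/earlier-story",
        "https://www.example.gov/release.pdf",
        "https://twitter.com/someone/status/1",
    ],
)
```

Summing these counters across many articles from one outlet would give a first answer to how often that outlet cites itself.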

The context of references can be used for both pruning and natural language analysis in the context of research as well.

You could store trees in a growing document database as well as specialized graph databases like Neo4j and text-search databases like ElasticSearch.

The main catch I see with this is how to work with each specific site. HTML traversal logic can be generalized to some extent, but utilities and crawlers for each variety of media node will likely be necessary at some level. Wrappers for Reddit and Twitter could be useful as well. The silver lining with respect to non-API sites is that citations/references seem to just hang out in <a> tags in the middle of <p>, <span>, <em>, and other tags with text content.

I'd like this to grow to include even media such as Breitbart, Sputnik, and other intensely polarized sites and sources. Scrutiny doesn't need to see party lines or extremes [unless you want to prune or tag those branches (; ].

I imagine this project could also impact existing D4D projects such as assemble and the collaboration with ProPublica if implemented with a well-documented web API.

Proposed Itinerary for Base API:

Build base web scraper
Write logic for generic HTML tree traversal and <a> tag farming (this will likely evolve through each phase with the media node varieties)
Write logic for scraping mainstream media nodes
Write logic for working with Reddit, Twitter, and other APIs as media nodes
Write logic for progressively less and less mainstream media nodes--opening the floor to each as an issue and eventually PR that can be integrated

A frontend and database solution can begin happening once the API and node structure are reasonably solidified, respectively. This will also likely evolve as the project grows.

The Texas Beer Project

Howdy, folks! I'm starting to do some data and policy analysis work around state beer laws, with the hopes of putting together something substantive to support the Texas Craft Brewers Guild as they ramp up to push for beer law reform in the 2019 Texas Legislative Session. The exact end-product is still TBD---maybe a policy playbook, compilation of interviews with key players in beer policy around the country, and/or quantitative analysis of state-by-state laws vs. their economic impact.

Two questions:

Is there anyone here who's done work in the state beer policy space before that'd be open to chatting with me about what types of research have been done so far?

Anyone wanna join this crusade with me and the 200+ craft breweries in Texas (plus all the beer-loving folk outside of Texas who are pushing for this)?

I've started compiling data on data.world in a private data project, and am happy to add anyone who is interested. If you're interested in connecting/contributing, please DM me (@gabriela) on the D4D slack channel or email me ([email protected]) to let me know.

Here’s a good article that outlines the issues, for reference: http://www.austinbeerguide.com/post/170157943040/texas-craft-brewers-guilds-charles-vallhonrat

Maternal Healthcare Project: Prenatal Mortality Statistics

Are you looking to help examine healthcare trends in the United States of America? Looking to help alleviate disparities in healthcare, especially for women and minority populations? According to the WHO, the maternal mortality rate in the United States of America is equivalent to that of Papua New Guinea, a developing country. The United States is the only country considered a developed nation with such a high mortality rate. Women of color and from minority ethnicities have double the rate of maternal mortality in comparison to women who are not from minority populations. The Maternal Healthcare Project: Prenatal Care and Pregnancy Related Deaths will be looking at women who are impacted by health issues such as pre-eclampsia, endometriosis, cardiovascular diseases, and other data points discussed during the meeting Sunday. The scope of the project is as follows:
Examine trends in healthcare provided for women of African descent, Caucasian descent, different levels of income, and quality of prenatal healthcare provided.
Examine extenuating factors such as prior medical coverage, type of healthcare used (PPO, HMO, ACA, etc.).
Look at establishing trends between policies by state (i.e. prenatal care is not required to be covered by insurance companies in all 50 states), health factors (i.e. women from this ethnic group have a tendency to have pre-eclampsia).
Find comparisons between maternal mortality rates in different countries and how different health policies/genetic backgrounds can impact health.
Come up with policy suggestions based on research that can help with alleviating the problem.
This project will be extremely fascinating because we will be looking at genetics, epidemiological data, political science, gender and ethnic studies, and economic policies. We will be looking at data ranging from endometriosis in women to how surgical studies are carried out. The great part of the project is that there is a little bit for everyone. Even Melinda Gates has outlined an initiative to look at alleviating maternal mortality rates.

Are you interested? Join our Google Hangouts meeting at 8 a.m. CST on Monday January 21. Relevant links:

Google Calendar: https://calendar.google.com/calendar/embed?src=alu32dpe86i8n0sd3ub5otn9ho%40group.calendar.google.com&ctz=America%2FChicago
Exploratory notebook: https://colab.research.google.com/drive/1Je6-L3EM6NPgvgs1S0pKW9W1_u_7t1fk
Project proposal/brainstorming document: https://docs.google.com/document/d/11rKo9r3q0akyz1jkfm3dhlklFzlnpjIQdTemubXjPBQ/edit?usp=sharing
Pertinent dataset: https://pregnancycolab.tghn.org/articles/collect-database/

My email is [email protected]. Please please please, if you are interested, @ me on Slack. I will receive an email if you do! Thanks everyone! I look forward to seeing you Sunday! (Psst… I need to have a good meeting time, and I would like to know who is interested to schedule the meeting!)

Add CPEDS working groups' work to GDEP (ethics-resources)

Nationwide Polling Place App?

I find it hard to believe that this doesn't exist or that someone isn't feverishly working on it right now, but my searching has turned up nothing at the moment. Happy to take pointers to anything I've missed, but I'm specifically not looking for voter registration or education functionality. I just want to help get people to the polls.

The problem: one huge barrier to getting young people to vote is that it's not top of mind, and IF they do remember they want to vote on election day they don't know where to go. Making that information easily available in a way they're used to getting it could have significant effects on voter turnout.

THE PROPOSITION: a simple smartphone app that tells users where their polling place is based on the address they supply, and sends them an alert on election day (and/or the day before) to vote. One interface for anywhere in the country, so the trick is scraping/linking all the various state elections websites for the official information.
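The alert half of the proposal needs little more than date arithmetic. A minimal sketch, covering only the US federal general election (held on the Tuesday after the first Monday in November); state and local elections follow other calendars and would need their own data. The function names are hypothetical.

```python
from datetime import date, timedelta


def general_election_day(year):
    """Tuesday after the first Monday in November of the given year."""
    nov1 = date(year, 11, 1)
    days_to_monday = (0 - nov1.weekday()) % 7  # Monday is weekday 0
    first_monday = nov1 + timedelta(days=days_to_monday)
    return first_monday + timedelta(days=1)


def reminder_dates(year):
    """Alert dates: the day before the election and election day itself."""
    election = general_election_day(year)
    return [election - timedelta(days=1), election]
```

For example, `reminder_dates(2020)` yields November 2 and November 3, 2020. The polling place lookup itself is the harder part, since it depends on each state's official data.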

I emphasize that the interface should be simple, both for usability and for speed in development. Adding other functions like candidate or issue look-ups/information is nice but I want to focus on solving one big problem and then fit others in later if it's possible.

I have expertise in scraping data in Python and some familiarity with Java, but I have never developed an app myself, so I would be looking to partner with anyone who does have app development expertise. Thoughts?

Drug Spending: Polypharmacy in at-risk populations.

The project idea I am pitching is studying trends in polypharmacy with at-risk populations such as children in foster care and the elderly. The idea is to understand the trends between the number of drugs taken and the overall quality of life of those who are impacted by polypharmacy. Let me know if you are interested in the project.

Data Warehouse For Democracy

Motivation

There are TONS of publicly available datasets in various formats made available by local, state and federal agencies. BUT, there's quite a bit of labor involved in massaging their data into a usable format. In enterprise environments, it's common to have a "data warehouse" which enforces a common data schema that outside data is molded into.

It could be valuable to create a shared data warehouse resource (I have some hardware that I might be able to donate to this cause). We could then predigest/prejoin data that relates to various common areas (or other facts/dimensions that may not yet be obvious to me). This data could support the exploratory phase of data science work or help to model the impacts of various policies in an open source way.

Example Data sources that could be pre-merged

Open Street Map
Census Tigerline Dataset
Housing Data From HUD (https://data.hud.gov/data_sets.html)

An Example Schema

Zipcode Fact

(Integer) zcta5
(Polygon) boundaries

Zipcode Year Dimension

(Varchar 255) Congressional Representative
(Varchar 255) Congressional Representative Political Party
(Varchar 255) Senate Representative
... Other Elected Official Information ...
(Integer) Public Housing Units
(Integer) Public Housing Facilities
(Decimal) Median Income
... Other summary statistics about the area's population ...
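To make the example schema concrete, here is a sketch using Python's built-in `sqlite3` so it runs anywhere; the proposal itself suggests Postgres. Several things are assumptions not spelled out above: the snake_case table and column names, the `year` column and the `zcta5` key on the dimension table, and storing the polygon as WKT text because SQLite has no geometry type. The inserted row is placeholder data, not real statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE zipcode_fact (
        zcta5      INTEGER PRIMARY KEY,
        boundaries TEXT  -- polygon as WKT; stand-in for a geometry type
    );
    CREATE TABLE zipcode_year_dimension (
        zcta5                     INTEGER NOT NULL REFERENCES zipcode_fact(zcta5),
        year                      INTEGER NOT NULL,  -- assumed from the table name
        congressional_rep         VARCHAR(255),
        congressional_rep_party   VARCHAR(255),
        senate_rep                VARCHAR(255),
        public_housing_units      INTEGER,
        public_housing_facilities INTEGER,
        median_income             DECIMAL,
        PRIMARY KEY (zcta5, year)
    );
    """
)

# Placeholder values for illustration only.
conn.execute(
    "INSERT INTO zipcode_fact VALUES (?, ?)",
    (78701, "POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))"),
)
conn.execute(
    "INSERT INTO zipcode_year_dimension "
    "(zcta5, year, public_housing_units, median_income) VALUES (?, ?, ?, ?)",
    (78701, 2017, 120, 50000.5),
)

# Join the two tables the way an analyst might when exploring one area.
rows = conn.execute(
    "SELECT f.zcta5, d.year, d.public_housing_units "
    "FROM zipcode_fact f JOIN zipcode_year_dimension d ON d.zcta5 = f.zcta5 "
    "ORDER BY d.year"
).fetchall()
```

Keeping one row per zip code per year in the dimension table lets new data sources add columns without touching the boundary data.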

Tools We Could Use

Postgres (https://www.postgresql.org/)

I've used PostgreSQL for warehouse projects before with great success. It's open source and supports geospatial querying, which could be useful for querying regional data. It also supports foreign data wrappers and database followers, which could be useful for mirroring if the dataset were to grow in size or usage to the point that additional resources were needed.

Airflow (https://airflow.incubator.apache.org/)

Airflow is an open source ETL tool created by Airbnb and now being incubated by the Apache Foundation. We could use this tool to fetch and transform the data before it is placed in the warehouse. I'm proposing that Postgres/Airflow be hosted somewhere and that their configuration/code be publicly available (GitHub?) so that adding data sources to the warehouse would be as simple as merging a pull request.

DBT (https://dbt.readme.io/docs/overview)

DBT is a tool that makes it easy to use SQL to transform data in its raw format into "models" which are tables that are more useful for analysis. I believe this would make it easier to shape the data coming from various sources into a common fact/dimension table format that would facilitate easier analysis.

Mobile app to collect and visualize hyperlocal infrastructure information for a community in the Amazon

This project is a request from outside D4D looking for support. If you're interested in being the D4D lead or participating in this project, please ping @gecky in the Slack.

As described by the project's requestor:

The project aims to encourage the citizenship of the residents and workers of the community through connected digital mobile technologies. With the help of a mobile application created for this purpose, users of the system can insert geolocated data about the infrastructure of the neighborhood, feeding a database with information about the infrastructure deficit in the following areas: Drinking Water; Collection and Treatment of Sewage; Street lighting; Sidewalks; Asphalt; Urban Cleaning.
The district was chosen by the project for being the second largest in population in the city of Macapá, according to current estimates the number is around 45 thousand inhabitants. Novo Horizonte has a history in community communication, since it has its community radio and has already had its community print newspaper.
Based on the contemporary concepts of hyperlocal communication, the project promotes conversation among community members with the intention of structuring data and, through them, mapping the real situation of the infrastructure provided by the public authorities to the residents of the neighborhood.
As the community itself is responsible for supplying the application with relevant data about the environment where it lives, it will commit itself to actions that pressure public entities to transform the reality of the local infrastructure, according to the demands identified as priorities.
The Lupa NH is an initiative of Professor Walter Lima, with the participation of journalism and computer science students of the Federal University of Amapá (UNIFAP), whose task is to implement elements of journalism to the hyperlocal project. The current phase of the project contemplates the modeling of the application.
The project has the support of two important transforming agents in the Novo Horizonte Community: a teacher who runs the projects in the communication area at Raimunda dos Passos School, and a journalism student who is also responsible for the community radio of Novo Horizonte.
To broadly disseminate the project and win participants, profiles are being opened on the following social networks: Facebook, Twitter and Instagram. This action will allow greater proximity between all those involved and greater visibility of the neighborhood Novo Horizonte, giving voice to its community to expose the reality in which it lives and in search of significant improvements in the quality of life.

Use federal payroll records for author profiling systems

This project would seek to build a wage/job profiling system for people on the federal payroll in the US.
The project would seek to map writing style/linguistic information to salary/work.

It would require identifying social media accounts (probably on Twitter) for people working for the federal government.
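As a hedged illustration of what "writing style/linguistic information" might mean in practice, the sketch below computes three common stylometric features from a text sample. The feature choice and the function name `style_features` are assumptions for illustration; the proposal does not specify which features or model would be used.

```python
import re


def style_features(text):
    """A few simple stylometric features for one author's text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
            "avg_sentence_length": 0.0,
        }
    return {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words (vocabulary variety).
        "type_token_ratio": len(set(words)) / len(words),
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
    }


features = style_features("The cat sat. The cat ran!")
```

Feature vectors like this, computed per account, could then be paired with job and salary fields from the payroll records to train a profiling model.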