Giter Club home page Giter Club logo

congress_work's Introduction

Congressional Bill Modeling

Samuel Sherman

The topics of congressional bills would have a fair indication of what our congressional leaders are talking about and how consistent our leaders have been in living up to their campaigns. These topics were derived using machine learning and later used as features to predict potential outcomes of bills. The results were highly intriguing and will be implemented in future work.

Data

Web scraped bill text from congress.gov and retrieved bill and votes data using the Sunlight Foundation API. Bill and vote information dates back to congress 103.

Non-Negative Matrix Factorization

Applied NNMF to derive latent topics in bill text and examine any distinguishable characteristics between time periods in congress. The cosine distance between different topics (H) and within topics (W/tfidf) was explored.

Cosine distance

Examining the graph above it can be seen that the cosine distance between topics increases and levels off. The cosine distance within topics decreases and also levels off. A value of 300 topics was chosen to achieve a good balance between both metrics.

Word clouds were developed from the latent topics as a representation of how the algorithm is performing with 300 topics. An example is displayed below.

Internet WC

After deriving the topics with NNMF, the prevalence of a particular topic can be examined over time by grouping by the congress year. The prevalence of the internet is displayed below.

Internet Over Time

Here, the prevalence of "internet" coincides with what could have been expected. Essentially this displays how the internet importance has risen and become a relevant aspect of people's lives.

Regression

With the hypothesis that there is a high level of polarization in government and that these characteristics are prophetic, applied gradient boosted and random forest ensemble methods to predict the percent of yes votes for a given party.

Multiple models were explored, but ultimately the gradient boosted regression ensemble method was chosen for the best results. A plot of the residuals for the Democratic party is displayed below.

GB Dem Residuals

This is not an ideal scenario for a residual plot. However, it does have some indication about the polarization in government. The algorithm is much better at predicting when there is a dominant party voting yes. This has clear indications of what drives a particular party to vote yes on a given bill.

Classification

With less than 6% of bills introduced in congress since 1993 being brought to a vote, applied random forest, gradient boosted, neural network, and logistic regression models to predict this rare event. Oversampling (SMOTE) and under-sampling techniques were used to balance the classes within the training data. Recall, precision, and area under the precision recall curve were used as the priority metrics to measure the performance in classifying bills that will be voted on.

Python Tools

pymongo, mongodb, sci-kit learn, scipy, numpy, nltk, AWS, BeautifulSoup

congress_work's People

Contributors

scsherm avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.