political-ads - WIP

Political Advertising

Slack Channel: #political-ads

Project Description: Extracting and parsing data on political advertising from FCC Public Files and related sources.

Project Lead: @pratheekrebala

Project Goals

This project aims to develop a pipeline for fetching, cleaning and parsing the data on political ad spending available from the FCC's Public Files.

Fetching: Designing a pipeline to fetch data from the FCC API. This involves retrieving older data from the FCC's OPIF API and monitoring new filings through the RSS feed.

Archiving: Some TV stations delete public files from the FCC system once the minimum retention period (about two years) has passed. A process for archiving these documents will allow for historical research.

Pre-Processing: All the files available on the Public File system are in PDF format. A significant number of these files are image-based PDFs, which need to be run through OCR before they can be processed.

Processing: A processing pipeline to detect and extract relevant metadata from filings (e.g. Flight Dates, Invoice Amount, Number of Spots).

More TK.

About Public Files

The Federal Communications Commission (FCC), per 47 CFR 73.3526, requires that commercial television stations keep a political file that documents the details of political candidates' advertising activity on their channel. The public files extensively detail the transactions that political campaigns and ad buyers have with the TV station, including price per spot, detailed schedules and program information. Some stations also choose to provide additional information about the targeted demographic and Cost Per-Engagement (CPE) information.

A few sample PDF files have been provided in the samples directory.

Related Projects

Sunlight AdSleuth

FCC Political Ads

Contributors: pratheekrebala, chrisdick14

political-ads's Issues

Archive public files

Some TV stations delete public files from the FCC system once the minimum retention period (about two years) has passed.

The IDs of deleted files can be fetched from the “File History” endpoint on the FCC API by looking for files with a status of “delete”: https://publicfiles.fcc.gov/api/manager/file/history.json?startDate=2017-01-01&status=delete

These files, although deleted, can be fetched by using the file_manager_id value from FCC’s File Manager: https://files.fcc.gov/download/{file_manager_id}.pdf

A pipeline that fetches these files and flags deletions would be helpful for reporting purposes. Additionally, we need to identify a place to store these files long term (e.g. Archive.org).
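As a starting point, the two URLs above can be built with a few small helpers. This is only a sketch: the function names are hypothetical, and the shape of the history response (a list of records each carrying a file_manager_id key) is an assumption that needs to be checked against the real API output.

```python
from urllib.parse import urlencode

# Both URLs are taken from the issue text above.
HISTORY_ENDPOINT = "https://publicfiles.fcc.gov/api/manager/file/history.json"
DOWNLOAD_TEMPLATE = "https://files.fcc.gov/download/{file_manager_id}.pdf"


def history_url(start_date: str, status: str = "delete") -> str:
    """Build the File History query URL for a start date (YYYY-MM-DD)."""
    query = urlencode({"startDate": start_date, "status": status})
    return f"{HISTORY_ENDPOINT}?{query}"


def download_url(file_manager_id: str) -> str:
    """Build the File Manager download URL for a single file."""
    return DOWNLOAD_TEMPLATE.format(file_manager_id=file_manager_id)


def deleted_file_urls(records):
    """Yield download URLs for history records that carry a file_manager_id.

    ASSUMPTION: `records` is an iterable of dicts already parsed from the
    history response. The real response layout may nest these differently.
    """
    for record in records:
        file_id = record.get("file_manager_id")
        if file_id:
            yield download_url(file_id)
```

The helpers deliberately do no network I/O, so the fetching and storage layers (for example an upload to Archive.org) can be chosen separately.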

Tracking order versioning

TV stations often produce successive versions of the same invoice/receipt when the candidate updates their ad buy.

A good number of these can be identified by searching for keywords such as “REV” in the file name through the search API. A more thorough approach is to look for the Contract or Order ID number inside the PDF: when successive filings share the same ID, the later ones are likely revisions.

Based on this detection, we could offer notifications that alert users when certain candidates/races have revised orders.
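Both heuristics can be sketched in a few lines. The regular expression, the function names and the (file_name, order_id) input shape are all assumptions made for illustration; real file names will need a broader pattern, and extracting the order ID from the PDF is not covered here.

```python
import re
from collections import defaultdict

# Matches "REV", "REVISED" or "REVISION" as a standalone token, so words
# such as "review" or "prev" are not flagged. Hypothetical pattern.
REVISION_PATTERN = re.compile(r"(?<![a-z])rev(?:ised|ision)?(?![a-z])", re.IGNORECASE)


def looks_like_revision(file_name: str) -> bool:
    """Return True if the file name contains a revision keyword."""
    return REVISION_PATTERN.search(file_name) is not None


def group_revisions(filings):
    """Group file names by order ID and keep IDs that appear more than once.

    ASSUMPTION: `filings` is an iterable of (file_name, order_id) pairs,
    where order_id was already extracted from the PDF by some other step.
    """
    by_order = defaultdict(list)
    for file_name, order_id in filings:
        by_order[order_id].append(file_name)
    return {oid: names for oid, names in by_order.items() if len(names) > 1}
```

Any order ID returned by group_revisions is a candidate for a revision alert, whether or not its file names contain a keyword.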

Station based parsers

By and large, stations seem to upload their documents in a small number of layouts (roughly 4-5). If we can classify documents into groups based on layout, it should be easier to implement layout-specific parsers.

Canonicalize candidate names

The file paths for public files in the FCC API are largely structured in the following format:

{StationCallsign}/political-files/{Year}/{OfficeLevel}/{OfficeName}/{Candidate/PAC Name}

For instance, the following is the file path for Bernie Sanders's filings for WBTV:

wbtv/political-files/2016/federal/president/bernie-sanders/

However, the names aren't always consistent. Some entities use PAC names (e.g. Hillary For America), some use last names (Clinton) and some use full names (Bernie Sanders). The format is largely station specific. While it's easy to write rule-based lookups for federal candidates by matching names against data from GovTrack or ProPublica, the problem gets more challenging when handling local candidates.

An approach to canonicalize these candidate names across TV Stations and States would greatly help streamline how data is aggregated later.
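A rule-based first pass might look like the sketch below: split the path into its components, then look the last component up in an alias table. The FilingPath class, the function names and the ALIASES table are hypothetical; the three alias entries simply mirror the examples above, and a real table would be built from a source such as GovTrack or ProPublica.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FilingPath:
    station: str
    year: int
    office_level: str
    office_name: str
    entity: str  # candidate or PAC name, exactly as the station wrote it


def parse_filing_path(path: str) -> Optional[FilingPath]:
    """Split a public-file path into its parts, or return None if it does
    not follow the {station}/political-files/{year}/... layout."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 6 or parts[1] != "political-files" or not parts[2].isdigit():
        return None
    station, _, year, level, office, entity = parts[:6]
    return FilingPath(station, int(year), level, office, entity)


# ASSUMPTION: illustrative alias table keyed by lower-case, hyphenated name.
ALIASES = {
    "hillary-for-america": "Hillary Clinton",
    "clinton": "Hillary Clinton",
    "bernie-sanders": "Bernie Sanders",
}


def canonical_name(entity: str) -> Optional[str]:
    """Return the canonical candidate name, or None if the alias is unknown."""
    key = entity.strip().lower().replace(" ", "-")
    return ALIASES.get(key)
```

Returning None for unknown aliases keeps unmatched local candidates visible, so they can be reviewed by hand instead of being silently mislabeled.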
