The Public Utility Data Liberation Project (PUDL)

Schedule a 1-on-1 chat with us about PUDL.

What is PUDL?

The PUDL Project is an open source data processing pipeline that makes US energy data easier to access and use programmatically.

Hundreds of gigabytes of valuable data are published by US government agencies, but it's often difficult to work with. PUDL takes the original spreadsheets, CSV files, and databases and turns them into a unified resource. This allows users to spend more time on novel analysis and less time on data preparation.

The project is focused on serving researchers, activists, journalists, policy makers, and small businesses that might not otherwise be able to afford access to this data from commercial sources and who may not have the time or expertise to do all the data processing themselves from scratch.

We want to make this data accessible and easy to work with for as wide an audience as possible: anyone from grassroots youth climate organizers working with Google Sheets to university researchers with access to scalable cloud computing resources, and everyone in between!

PUDL comprises three core components:

Raw Data Archives

PUDL archives all of its raw inputs on Zenodo to ensure permanent, versioned access to the data. If an agency changes how it publishes data or deletes old files, the data processing pipeline will still have access to the original inputs. Each data input may have several versions archived, and each version is assigned a unique DOI (digital object identifier) and made available through Zenodo's REST API. You can read more about the Raw Data Archives in the docs.
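As a rough illustration of DOI-based access, the sketch below turns a Zenodo DOI into a REST API URL and fetches the record's JSON metadata. It assumes the common `10.5281/zenodo.<id>` DOI pattern and the `https://zenodo.org/api/records/<id>` endpoint. The function names are hypothetical and are not part of PUDL, and the sketch does not check whether a particular DOI resolves to a specific version.

```python
import json
import urllib.request

# Assumed base URL of Zenodo's public records endpoint.
ZENODO_API = "https://zenodo.org/api/records"
ZENODO_DOI_PREFIX = "10.5281/zenodo."


def record_url_from_doi(doi: str) -> str:
    """Turn a Zenodo DOI such as 10.5281/zenodo.3653158 into a REST API URL."""
    if not doi.startswith(ZENODO_DOI_PREFIX):
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    record_id = doi[len(ZENODO_DOI_PREFIX):]
    if not record_id.isdigit():
        raise ValueError(f"unexpected record id in DOI: {doi!r}")
    return f"{ZENODO_API}/{record_id}"


def fetch_record(doi: str) -> dict:
    """Download the JSON metadata for a record (requires network access)."""
    with urllib.request.urlopen(record_url_from_doi(doi), timeout=30) as resp:
        return json.load(resp)
```

For example, `record_url_from_doi("10.5281/zenodo.3653158")` builds the URL for the data release DOI cited later in this README. `fetch_record` is only a thin wrapper around the HTTP request, and the layout of the returned JSON should be checked against Zenodo's own API documentation.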

Data Pipeline

The data pipeline (this repo) ingests raw data from the archives, cleans and integrates it, and writes the resulting tables to SQLite and Apache Parquet files, with some accompanying metadata stored as JSON. Each release of the PUDL software contains a set of DOIs indicating which versions of the raw inputs it processes. This helps ensure that the outputs are replicable. You can read more about our ETL (extract, transform, load) process in the PUDL documentation.

Data Warehouse

The SQLite, Parquet, and JSON outputs from the data pipeline, sometimes called "PUDL outputs", are updated each night by an automated build process, and periodically archived so that users can access the data without having to install and run our data processing system. These outputs contain hundreds of tables and comprise a small file-based data warehouse that can be used for a variety of energy system analyses. Learn more about how to access the PUDL data.
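Because the outputs are ordinary SQLite files, they can be explored with Python's standard library alone. The following is a minimal sketch, assuming you have already downloaded one of the SQLite outputs to a local path. The helper names `list_tables` and `preview` are hypothetical and are not part of the PUDL package.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database, sorted by name."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


def preview(db_path: str, table: str, limit: int = 5) -> list[tuple]:
    """Return the first few rows of a table as a list of tuples."""
    # Only accept table names that actually exist in the database, because
    # identifiers cannot be passed as bound SQL parameters.
    if table not in list_tables(db_path):
        raise KeyError(table)
    con = sqlite3.connect(db_path)
    try:
        return con.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()
    finally:
        con.close()
```

Calling `list_tables` on a downloaded output file is a quick way to see which tables it contains before loading any of them into a dataframe. Note that `sqlite3.connect` creates an empty database if the path does not exist, so an empty result may simply mean the path is wrong.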

What data is available?

PUDL currently integrates data from:

EIA Form 860: 2001-2022 - Source Docs - PUDL Docs
EIA Form 860m: 2023-12 - Source Docs
EIA Form 861: 2001-2022 - Source Docs - PUDL Docs
EIA Form 923: 2001-2023 - Source Docs - PUDL Docs
EPA Continuous Emissions Monitoring System (CEMS): 1995Q1-2023Q4 - Source Docs - PUDL Docs
FERC Form 1: 1994-2022 - Source Docs - PUDL Docs
FERC Form 714: 2006-2022 (mostly raw) - Source Docs - PUDL Docs
FERC Form 2: 1996-2022 (raw only) - Source Docs
FERC Form 6: 2000-2022 (raw only) - Source Docs
FERC Form 60: 2006-2022 (raw only) - Source Docs
US Census Demographic Profile 1 Geodatabase: 2010 - Source Docs

Thanks to support from the Alfred P. Sloan Foundation Energy & Environment Program, from 2021 to 2024 we will be cleaning and integrating the following data as well:

EIA Form 176 (The Annual Report of Natural Gas Supply and Disposition)
FERC Electric Quarterly Reports (EQR)
FERC Form 2 (Annual Report of Major Natural Gas Companies)
PHMSA Natural Gas Annual Report
Machine Readable Specifications of State Clean Energy Standards

How do I access the data?

For details on how to access PUDL data, see the data access documentation. A quick summary:

Datasette provides browsable and queryable data from our nightly builds on the web: https://data.catalyst.coop
Kaggle provides easy Jupyter notebook access to the PUDL data, updated weekly: https://www.kaggle.com/datasets/catalystcooperative/pudl-project
Zenodo provides stable long-term access to our versioned data releases with a citeable DOI: https://doi.org/10.5281/zenodo.3653158
Nightly Data Builds push their outputs to the AWS Open Data Registry: https://registry.opendata.aws/catalyst-cooperative-pudl/ See the nightly build docs for direct download links.
The PUDL Development Environment lets you run the PUDL data processing pipeline locally.

Contributing to PUDL

Find PUDL useful? Want to help make it better? There are lots of ways to help!

Check out our contribution guide including our Code of Conduct.
You can file a bug report, make a feature request, or ask questions in the GitHub issue tracker.
Feel free to fork the project and make a pull request with new code, better documentation, or example notebooks.
Make a recurring financial contribution to support our work liberating public energy data.
Hire us to do some custom analysis and allow us to integrate the resulting code into PUDL.

Licensing

In general, our code, data, and other work are permissively licensed for use by anybody, for any purpose, so long as you give us credit for the work we've done.

The PUDL software is released under the MIT License.
The PUDL data and documentation are published under the Creative Commons Attribution License v4.0 (CC-BY-4.0).

Contact Us

For bug reports, feature requests, and other software or data issues please make a GitHub Issue.
For more general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions.
If you'd like to get occasional updates about the project, sign up for our email list.
Want to schedule a time to chat with us one-on-one about your PUDL use case or ideas for improvement, or to get some personalized support? Join us for Office Hours.
Follow us here on GitHub
Follow us on Mastodon: @[email protected]
Follow us on BlueSky: @catalyst.coop
Follow us on LinkedIn
Follow us on HuggingFace
Follow us on Twitter: @CatalystCoop
Follow us on Kaggle
More info on our website: https://catalyst.coop
Email us if you'd like to hire us to provide customized data extraction and analysis: [email protected]

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
