
Open Grid Emissions Initiative

Project status: Active – the project has reached a stable, usable state and is being actively developed. Code style: black.

The Open Grid Emissions Initiative seeks to fill a critical need for high-quality, publicly-accessible, hourly grid emissions data that can be used for GHG accounting, policymaking, academic research, and energy attribute certificate markets. The initiative includes this repository of open-source grid emissions data processing tools that use peer-reviewed, well-documented, and validated methodologies to create the accompanying public dataset of hourly, monthly, and annual U.S. electric grid generation, GHG, and air pollution data.

Please check out our documentation for more details about the Open Grid Emissions methodology.

The Open Grid Emissions Dataset can be downloaded here. An archive of previous versions of the dataset and intermediate data outputs (for research and validation purposes) can be found on Zenodo.

Installing and running the data pipeline

To manage the code environment necessary to run the OGE data pipeline, either pipenv or conda may be used. We currently use pipenv as our preferred environment manager for the pipeline runs used for data releases, but conda will also work if you are more familiar with it.

First, navigate to the folder where you want to save the repository and run the following commands:

If you are using pipenv

Note that this option requires that Python and git be installed on your machine.

pip install pipenv
git clone https://github.com/singularity-energy/open-grid-emissions.git
cd open-grid-emissions
pipenv sync
pipenv shell
pip install build
python -m build
pip install .

If you are using conda

conda install git
git clone https://github.com/singularity-energy/open-grid-emissions.git
conda update conda
cd open-grid-emissions
conda env create -f environment.yml
conda activate open_grid_emissions
pip install build
python -m build
pip install .

The pipeline can then be run as follows, regardless of the installation method you chose:

cd src/oge
python data_pipeline.py --year 2022

A more detailed walkthrough of these steps can be found below in the "Development Setup" section.

Data Availability and Release Schedule

The latest release includes data for years 2019-2022, covering the contiguous United States, Alaska, and Hawaii. In future releases, we plan to expand the geographic coverage to additional U.S. territories (dependent on data availability), and to expand the historical coverage of the data.

Parts of the input data used for the Open Grid Emissions dataset are released by the U.S. Energy Information Administration in the autumn following the end of each year (2022 data was published in September 2023). Each release will include the most recent year of available data as well as updates of all previously available years based on any updates to the OGE methodology. All previous versions of the data will be archived on Zenodo.

Updated datasets will also be published whenever a new version of the open-grid-emissions repository is released.

Contribute

There are many ways that you can contribute!

  • Tell us how you are using the dataset or python tools
  • Request new features or data outputs by submitting a feature request or emailing us at <>
  • Tell us how we can make the datasets even easier to use
  • Ask a question about the data or methods in our discussion forum
  • Submit an issue if you've identified a way the methods or assumptions could be improved
  • Contribute your subject matter expertise to the discussion about open issues and questions
  • Submit a pull request to help us fix open issues

Repository Structure

Modules

  • anomaly_screening: classes used to flag timeseries anomalies as proposed in Tyler H. Ruggles et al., "Developing reliable hourly electricity demand data through screening and imputation" (2020)
  • column_checks: functions that check that all data outputs have the correct column names
  • constants: specifies conversion factors and constants used across all modules
  • data_pipeline: main script for running the data pipeline from start to finish
  • download_data: functions that download data from the internet
  • data_cleaning: functions that clean loaded data
  • eia930: functions for cleaning and formatting EIA-930 data
  • emissions: functions used for imputing emissions data
  • filepaths: used to identify where repository files are located on the user's computer
  • gross_to_net_generation: functions for identifying subplants and gross to net generation conversion factors
  • helpers: functions that are used across modules
  • impute_hourly_profiles: functions related to assigning an hourly profile to monthly data
  • load_data: functions for loading data from downloaded files
  • output_data: functions for writing intermediate and final data to csvs
  • subplant_identification: functions for identifying subplant IDs
  • validation: functions for testing and validating data outputs
  • visualization: functions for visualizing data in notebooks

Notebooks

Notebooks are organized into six directories based on their purpose:

  • explore_data: notebooks used for exploring data outputs and results
  • explore_methods: notebooks that can be used to explore specific methods step-by-step
  • manual_data: notebooks that are used to create/update certain files in data/manual
  • validation: notebooks related to validating results
  • visualization: notebooks used to visualize data
  • work_in_progress: temporary notebooks being used for development purposes on specific branches

Data Structure

All manual reference tables are stored in src/oge/reference_tables.

All files downloaded/created as part of the pipeline are stored in your HOME directory (e.g. users/user.name/):

  • HOME/open_grid_emissions_data/downloads contains all files that are downloaded by functions in load_data
  • HOME/open_grid_emissions_data/outputs contains intermediate outputs from the data pipeline (i.e. any files created by our code that are not final results)
  • HOME/open_grid_emissions_data/results contains all final output files that will be published
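The layout above can be resolved programmatically. A minimal sketch, assuming the directory names listed in this README (the helper function here is hypothetical; in the actual codebase, path resolution is handled by the filepaths module):

```python
from pathlib import Path

# Hypothetical helper mirroring the layout described above; this is
# illustrative, not the actual oge.filepaths implementation.
def oge_data_dir(subfolder: str) -> Path:
    """Return HOME/open_grid_emissions_data/<subfolder>."""
    if subfolder not in {"downloads", "outputs", "results"}:
        raise ValueError(f"unknown OGE data subfolder: {subfolder}")
    return Path.home() / "open_grid_emissions_data" / subfolder

downloads_dir = oge_data_dir("downloads")
```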

Importing OGE as a Package in your Project

OGE is not yet available on PyPI but can be installed from GitHub. For example, if you are using pipenv for your project, this can be done by adding oge = {git="https://github.com/singularity-energy/open-grid-emissions.git"} to your Pipfile.

Note that you don't need to run the pipeline to generate the output data, as these are available on Amazon Simple Storage Service (S3). Simply set the OGE_DATA_STORE environment variable to s3 in the __init__.py file of your project to fetch OGE data from Amazon S3. To summarize, your __init__.py file would then look like this:

import os

os.environ["OGE_DATA_STORE"] = "s3"

Development Setup

If you would like to run the code on your own computer and/or contribute updates to the code, the following steps can help get you started.

Setup with conda

This installation is recommended if you are unfamiliar with git and Python.

Install conda and python

We suggest using Miniconda or Anaconda to manage the packages needed to run the Open Grid Emissions code. Anaconda and Miniconda install a similar environment, but Anaconda installs more packages by default while Miniconda installs them as needed. These can be downloaded from the Miniconda or Anaconda websites.

Install and setup git software manager

In order to download the repository, you will need to use git. You can either install Git Bash from https://git-scm.com/downloads, or you can install it using conda. To do so, after installing Anaconda or Miniconda, open an Anaconda Command Prompt (Windows) or Terminal.app (Mac) and type the following command:

conda install git

Then you will need to set up git following these instructions: https://docs.github.com/en/get-started/quickstart/set-up-git

Download the codebase to a local repository

Using Anaconda command prompt or Git Bash, use the cd and mkdir commands to create and/or enter the directory where you would like to download the code (e.g. "Users/myusername/GitHub"). Then run:

git clone https://github.com/singularity-energy/open-grid-emissions.git

Setup the conda environment

Open anaconda prompt, use cd to navigate to the directory where your local files are stored (e.g. "GitHub/open-grid-emissions"), and then run:

conda update conda
conda env create -f environment.yml

Installation requires that the conda channel-priority be set to "flexible". This is the default behavior, so if you've never manually changed it, you shouldn't have to worry about this. However, if you receive an error message like "Found conflicts!" when trying to install the environment, try setting your channel priority to flexible by running conda config --set channel_priority flexible, and then re-run the above commands.

The final step is to install the oge package itself in the conda environment. To do so, run:

conda activate open_grid_emissions
pip install build
python -m build
pip install --editable .

The open_grid_emissions conda environment should now be set up and ready to run.

Setup with pipenv

Install python and git

We recommend that you use Python 3.11. If you don't have Python installed, we recommend pyenv, which lets you easily switch between multiple versions of Python. You will also need git to clone the repository; it can be installed from https://git-scm.com/downloads.

Install pipenv

This can be done via:

pip install pipenv

Download the codebase

As mentioned previously, clone the repository with:

git clone https://github.com/singularity-energy/open-grid-emissions.git

and navigate to the root of the directory:

cd open-grid-emissions

Setup the environment

In the root of the directory, create and activate the environment with:

# set up virtual environment (use whichever version of python 3.11 you have installed)
pipenv --python 3.11.4

# if you have updated the pipfile and need to update pipfile.lock, run
pipenv install
# Otherwise, if you just want to install packages from the pipfile.lock, run
pipenv sync

# activate virtual environment
pipenv shell

# install an editable version of the oge package
pip install build
python -m build
pip install --editable .

If you ever need to remove and reinstall the environment, run pipenv --rm from the root directory then follow the directions above.

Running the complete data pipeline

If you would like to run the full data pipeline to generate all intermediate outputs and results files, navigate to open-grid-emissions/src/oge, and run the following (replacing 2022 with whichever year you want to run):

python data_pipeline.py --year 2022

Keeping the code updated

From time to time, the code will be updated on GitHub. To ensure that you are keeping your local version of the code up to date, open git bash and follow these steps:

# change the directory to where ever your local git repository is saved
# after hitting enter, it should show the name of the git branch (e.g. "(main)")
cd GitHub/open-grid-emissions  

# save any changes that you might have made locally to your copy of the code
git add .

# fetch and merge the updated code from github
git pull origin main

Install a code editor

If you want to edit the code and do not already have an integrated development environment (IDE) installed, one good option is Visual Studio Code (download: https://code.visualstudio.com/).

Contribution Guidelines

If you plan on contributing edits to the codebase that will be merged into the main branch, please follow these best practices:

  1. Please do not make edits directly to the main branch. Any new features or edits should be completed in a new branch. To do so, open git bash, navigate to your local repo (e.g. cd GitHub/open-grid-emissions), and create a new branch, giving it a descriptive name related to the edit you will be doing:

    git checkout -b branch_name

  2. As you code, it is a good practice to 'save' your work frequently by opening git bash, navigating to your local repo (cd GitHub/open-grid-emissions), making sure that your current feature branch is active (you should see the feature name in parentheses next to the command line), and running

    git add .

  3. You should commit your work to the branch whenever you have working code or whenever you stop working on it using:

    git add .
    git commit -m "short message about updates"

  4. Once you are done with your edits, save and commit your code using step #3 and then push your changes:

    git push

  5. Now open the GitHub repo web page. You should see the branch you pushed up in a yellow bar at the top of the page with a button to "Compare & pull request".

    • Click "Compare & pull request". This will take you to the "Open a pull request" page.
    • From here, you should write a brief description of what you actually changed.
    • Click "Create pull request"
    • The changes will be reviewed and discussed. Once any edits have been made, the code will be merged into the main branch.

Conventions and standards

  • We generally follow the naming conventions used by the Public Utility Data Liberation Project: https://catalystcoop-pudl.readthedocs.io/en/latest/dev/naming_conventions.html
  • Functions should include descriptive docstrings (using the Google style guide https://google.github.io/styleguide/pyguide.html#383-functions-and-methods), inline comments should be used to describe individual steps, and variable names should be made descriptive (e.g. cems_plants_with_missing_co2_data not cems_missing or cpmco2)
  • All pandas merge operations should include the validate parameter to ensure that unintentional duplicate entries are not created (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
  • All pandas groupby operations should include the dropna=False parameter so that data with missing groupby keys are not unintentionally dropped from the data.
  • All code should be formatted using black
  • Clear all outputs from notebooks before committing your work.
  • Any manual changes to reported categorical data, conversion factors, or manual data mappings should be loaded from a .csv file in data/manual rather than stored in a dictionary or variable in the code.
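As a concrete illustration of the pandas merge and groupby conventions above (the toy data here is hypothetical):

```python
import pandas as pd

plants = pd.DataFrame({"plant_id": [1, 2, 3], "ba_code": ["CISO", "ERCO", None]})
gen = pd.DataFrame({"plant_id": [1, 2, 3], "net_generation_mwh": [10.0, 20.0, 5.0]})

# validate guards against unintentional duplicate entries: this raises
# a MergeError if either side has repeated plant_id values.
merged = gen.merge(plants, how="left", on="plant_id", validate="1:1")

# dropna=False keeps the row whose ba_code is missing; the default
# (dropna=True) would silently drop plant 3 from the aggregation.
by_ba = merged.groupby("ba_code", dropna=False)["net_generation_mwh"].sum()
```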

open-grid-emissions's Issues

Move manual update dictionaries to data/manual

Any time we use a hard-coded dictionary to update/fix values from one of our data input sources, we should instead be using a csv table located in data/manual. This will help:

  1. Be more transparent about what values are being manually changed
  2. Allow for more convenient updating of these values if needed.
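The pattern proposed here might look like the following sketch (the file contents and column names are hypothetical, loosely modeled on the workaround dictionary shown in a later issue):

```python
import csv
import io

# In practice this would live as a file under data/manual/; it is
# inlined here only for illustration.
MANUAL_FIXES_CSV = """plant_id,generator_id,energy_source_code
6058,2,NG
54224,GEN6,BIT
"""

def load_manual_fixes(f) -> dict:
    """Read manual value overrides from a csv into a lookup keyed by
    (plant_id, generator_id)."""
    return {
        (row["plant_id"], row["generator_id"]): row["energy_source_code"]
        for row in csv.DictReader(f)
    }

fixes = load_manual_fixes(io.StringIO(MANUAL_FIXES_CSV))
```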

Refactor data pipeline

  • Move data pipeline from notebook to script
  • Move exploratory validation code to separate notebooks
  • Export standardized files from data pipeline for each cleaned data product
  • Create documentation of each column in data files, potentially also with a check to enforce it

Package the repository

Packaging open-grid-emissions will allow us (and others) to use code snippets in other projects.

Specific tasks:

  • Currently all data paths are hardcoded to assume a working directory within src (eg ../data/downloads/x. Eventually, we should move this into a variable that can be set, allowing us to treat the code in src as a package and move data_pipeline.py outside of src, to treat it as a script calling a package.
  • Rename the repo from hourly-egrid

Some generators are missing from the EIA-923 primary fuel table

Context: Currently, the clean_eia923 function returns (1) gen_fuel_allocated and (2) primary_fuel_table. For each plant/generator/month row in gen_fuel_allocated, I've been using primary_fuel_table to look up the energy source code for that plant and generator.

Problem: For a small number of generators in gen_fuel_allocated, there is no corresponding data in primary_fuel_table. Usually, that generator "appears" in the primary_fuel_table in a later year, which has allowed me to manually fix those generators.

Any idea why this might be @grgmiller ? Maybe this is a problem you've already found? I can either create a manual data file to assign fuel types to these generators, or we can dig deeper into why this is happening.

For reference, here is my manual workaround:

MANUAL_PLANT_GENERATOR_ENERGY_SOURCE_CODE = {
  '6058': {
    '2': 'NG'
  },
  '54224': {
    'GEN6': 'BIT'
  },
  '6190': {
    '3': 'PC'
  },
  '7790': {
    '2': 'BIT' # Default to the plant primary fuel type.
  },
  '10612': {
    'GEN2': 'NG'
  },
  '54690': {
    '6000': 'SUB'
  },
  '55821': {
    'BCT': 'NG',
    'BST': 'NG'
  },
  '645': {
    'GT4': 'NG'
  },
  '7652': {
    '1': 'BIT'
  },
  '54408': {
    '2': 'WDS'
  },
  '1904': {
    '6': 'NG'
  },
  '10562': {
    'GEN5': 'WDS',
  },
  '54851': {
    'SOL1': 'LFG',
    'SOL2': 'LFG',
    'SOL3': 'LFG',
  },
  # IC1/IC2 generators are both DFO, so assume the same for the rest.
  '676': {
    'IC3': 'DFO',
    'IC4': 'DFO',
    'IC5': 'DFO',
    'IC6': 'DFO',
    'IC7': 'DFO'
  }
}

Nonbaseload emission rates

We do not currently calculate nonbaseload emission rates, which are a type of marginal emission factor estimate.

In eGRID, nonbaseload emission rates are calculated based on the plant-level capacity factor.

All generation and emissions at plants with a low capacity factor (less than 0.2) are considered nonbaseload and are assigned a nonbaseload factor of 1. Plants with a capacity factor greater than 0.8 are considered baseload and are assigned a nonbaseload factor of 0. For plants with a capacity factor between 0.2 and 0.8, we use a linear relationship to determine the percent of generation and emissions that is nonbaseload:
Nonbaseload_Factor = -5/3 * (Capacity_Factor) + 4/3
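The piecewise rule above can be written directly as a small function (a sketch for illustration, not part of the current OGE codebase):

```python
def nonbaseload_factor(capacity_factor: float) -> float:
    """Fraction of a plant's generation and emissions considered
    nonbaseload, following the eGRID piecewise-linear rule quoted above:
    1 below a capacity factor of 0.2, 0 above 0.8, linear in between."""
    if capacity_factor < 0.2:
        return 1.0
    if capacity_factor > 0.8:
        return 0.0
    return -5.0 / 3.0 * capacity_factor + 4.0 / 3.0
```

Note that the linear segment meets the two plateaus exactly at the 0.2 and 0.8 breakpoints.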

It is unclear whether nonbaseload factors would make sense at the hourly resolution or if an alternate methodology would need to be developed. We could consider publishing monthly and annual resolution nonbaseload factors even if hourly factors do not make sense.

This could also be an opportunity to consider whether the nonbaseload methodology could be improved.

If we publish these, they should probably be separated as a different use case in results, such as results/marginal emissions

Improve heat rate validation test

In validation.test_for_outlier_heat_rates() we currently test for outliers within each fuel type, although we may want to consider refining this to also filter by prime mover, since this may significantly impact the heat rate of a generator.

Identify outlier values in reported CEMS data

We should implement some sort of outlier detection and screening for the hourly values reported in CEMS. This outlier detection could use a combination of statistical methods and physics-based methods (e.g. gross generation should not exceed nameplate capacity).

This should probably be implemented after loading the CEMS data but before any missing data imputation steps.
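A minimal sketch of the physics-based screen mentioned above (the frame and column names here are hypothetical):

```python
import pandas as pd

def flag_physical_outliers(cems: pd.DataFrame, capacity: pd.DataFrame) -> pd.Series:
    """Flag hourly CEMS rows whose gross generation exceeds the unit's
    nameplate capacity (an hourly MWh value cannot physically exceed
    the MW rating over one hour)."""
    merged = cems.merge(capacity, how="left", on="unit_id", validate="m:1")
    return merged["gross_generation_mwh"] > merged["nameplate_capacity_mw"]

cems = pd.DataFrame({"unit_id": [1, 1, 2], "gross_generation_mwh": [50.0, 120.0, 10.0]})
capacity = pd.DataFrame({"unit_id": [1, 2], "nameplate_capacity_mw": [100.0, 20.0]})
outliers = flag_physical_outliers(cems, capacity)
```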

Estimating energy storage hourly profiles

We do not currently have a method for identifying hourly charging/discharging profiles for energy storage, which primarily consists of battery energy storage (energy_source_code == MWH and prime_mover_code == BA) or pumped storage hydro (energy_source_code == WAT and prime_mover_code == PS).

sources of data for charging and discharging

EIA-923 only reports net generation (net discharge) for energy storage technologies, but we do not have any information about the total charging and total discharging from these storage resources.

Furthermore, EIA-930 does not include energy storage as one of the fuel types for net generation; instead this data is theoretically spread between reported demand, hydro generation, and other generation. The EIA-930 instructions note:

Pumped storage: Pumped storage is included in net generation only when there is net output to the system during the hour. During hours when electricity from the system is used on net to store energy, this electricity is to be included in actual demand.

The EIA-930 instructions do not include any instructions related to other energy storage, but if energy storage is reported consistently with the rules for pumped storage, then discharge would likely be reported as "other" net generation, whereas charging would be reported as increased net demand.

Pumped-storage specific considerations

See #37

Potential approaches to estimating storage profiles

As of 2020, there were 230 utility-scale batteries reported in EIA-860, a majority of which are located within the territories of the major RTOs/ISOs in the US. If each of these ISOs reports timeseries data for energy storage dispatch separately from the data it reports to EIA-930, we could use that data to assign a profile. This means that we would need to ingest data from these sources separately. To do this, we could potentially pull data from the Singularity API, pyiso, or potentially ElectricityMap.

If our only option were to interpolate storage profiles, EIA-860, schedule 3-4 also reports the various applications that an energy storage plant served (e.g. load following, excess wind and solar generation, system peak shaving, arbitrage). Using this information, we could develop synthetic storage dispatch profiles based on how we assume these batteries would operate.

Ensure complete `subplant_id` mapping

Currently, subplant IDs are only created for units that exist both in CEMS and EIA-923, meaning that there are certain generators/units that have a subplant ID of NaN.

  • Ensure that all merge and groupby functions that use subplant_id as one of the keys are not dropping observations with missing subplant values.
  • Although the primary purpose of the subplant ID is to group CEMS units with EIA generators and boilers, it could also be useful for grouping EIA boilers and generators that do not exist in CEMS. We should update the pudl.analysis.epa_crosswalk code to generate subplant IDs for all boilers/generators that exist in the EIA data, regardless of whether data exists in CEMS.
  • If there are any remaining missing subplant values, we should perhaps fill these missing values with a code of 99 so that there is a non-missing code that would not overlap with any subplant ids already assigned during the crosswalk process.
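The sentinel-fill idea in the last bullet might look like the following sketch (the column names and toy data are hypothetical, not the actual crosswalk code):

```python
import pandas as pd

crosswalk = pd.DataFrame({
    "plant_id": [1, 1, 2],
    "generator_id": ["G1", "G2", "G3"],
    "subplant_id": [1.0, None, 1.0],
})

# Give unmatched generators a sentinel code (99) that cannot collide
# with any subplant id already assigned during the crosswalk process,
# so downstream merge/groupby keys are never missing.
crosswalk["subplant_id"] = crosswalk["subplant_id"].fillna(99).astype(int)
```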

Package gridemissions

We are currently using gridemissions for physics-based data cleaning https://github.com/jdechalendar/gridemissions
The current process is to manually export a file from that package and copy it to our data folder, but it should instead be run as part of our data pipeline.

  • Add package structure to gridemissions
  • Add package to our conda environment and call it from our code

Add Mercury Emissions

eGRID reports mercury (Hg) emissions, although the technical guide notes that:

However, while electric generating units started to report mercury data to CAMD’s Power Sector Emissions Data in 2015, the data are incomplete. We have included the unit-level emissions, but since only a subset of the units at one plant may list mercury emissions, we have not summed these emissions to the plant-level. Therefore, we have retained these fields in anticipation of being able to report plant-level mercury emissions and emission rates in a future edition of eGRID

Currently eGRID only reports those mercury emissions that are reported in CAMD, and does not attempt to calculate mercury emissions based on reported EIA-923 data.

To do:

  • Investigate whether there are emissions factors for mercury emissions
  • Research what factors affect mercury emissions from power generation
  • Incorporate reported Hg data from CAMD
  • Impute missing Hg emissions from EIA-923 data using emissions factors if available

Improve Code Readability

We want to make it as easy as possible for people to understand and contribute to the code, so we want to make the code as readable and easy to follow as possible.

  • Arrange order of functions in data_cleaning and other modules in the order they are called.
  • Ensure that functions are in the appropriate module (e.g. any function that reads data from a csv should be in load_data)
  • Consider splitting existing modules into multiple more specific modules. For example, split load_data into download_data and load_data, or split data_cleaning into multiple modules based on the functions that are used for cleaning, and the functions that are calculating new data.
  • Ensure variable names are descriptive and understandable
  • Make sure every function contains a docstring that explains the function, parameters, and outputs.

Dealing with Generation-Only Balancing Authorities

Generation-only and limited-generation BAs

As the EIA-930 data about page notes,

Generation-only BAs consist of a power plant or group of power plants and do not directly serve retail customers. Therefore, they only report net generation and interchange and do not report demand or demand forecasts.
Eleven active BAs are generation-only:

  • Avangrid Renewables, LLC (AVRN)
  • Arlington Valley, LLC – AVBA (DEAA)
  • GridLiance (GLHB)
  • Gridforce Energy Management, LLC (GRID)
  • Griffith Energy, LLC (GRIF)
  • Gila River Power, LLC (GRMA)
  • NaturEner Power Watch, LLC (GWA)
  • New Harquahala Generating Company, LLC – HGBA (HGMA)
  • Southeastern Power Administration (SEPA)
  • NaturEner Wind Watch, LLC (WWA)
  • Alcoa Power Generating, Inc. – Yadkin Division (YAD)

The EIA also notes that there are "limited generation balancing authorities":

Most BAs produce electricity within their BA area. However, the following active BA has a small number of local generators that do not always produce electricity, therefore it will not always have net generation to report

EIA notes that these BAs (as well as HST, CPLW, and NSB) may have zero or even negative net generation during some hours because they might not be running their generators during all hours.

We should:

  • Add a flag to data/manual/ba_reference.csv to indicate those BAs which the EIA reports as generation-only or limited-generation BAs
  • Not calculate consumption-based emission factors for these BAs because they do not have any direct retail load.

Improve coverage of `ba_reference` table

The table data/manual/ba_reference.csv lists all of the ba codes and relevant metadata about each BA for the pipeline. This file was created based on the BAs that exist in the EIA-930 reference tables (https://www.eia.gov/electricity/930-content/EIA930_Reference_Tables.xlsx). However, my understanding is that this data starts as of 2015, so BAs that retired before then might not be represented.

However, FERC maintains a spreadsheet of "Allowable Entries for Balancing Authorities and Hubs" on its EQR website which seems to be a more complete list. We should use this to improve the coverage of our ba_reference spreadsheet. We will have to manually identify the local time zone for each BA in this spreadsheet.

Investigate need to impute missing hourly net generation in CEMS

In certain limited cases, some CEMS generators report heat input, but no gross generation or steam load in an hour, which seems to suggest that the gross generation data might be missing (if there is fuel consumption, there should in theory be some gross output).

We had previously implemented a function data_cleaning.impute_missing_hourly_net_generation(cems, eia923_allocated), but this is currently removed from the data pipeline.

To determine if we should address this we need to:

  • attempt to understand what might be going on in practice if there is heat input but no gross generation
  • Understand how common this is. If this only affects a small amount of heat input at a small amount of plants, it might not be worth spending a lot of time on (at least in the short run).

If/when we focus on this issue, one of the first changes we would need to implement to the existing function would be to perform the matching on the unit or subplant level rather than the plant level. One example of why this is important is the Ivanpah concentrating solar plant (plant id 57075), which primarily consumes solar energy, but also runs some fossil generation at night to keep the thermal storage warm.

Landfill Gas (LFG) Emission Adjustments

The eGRID 2020 technical guide notes:

Emissions adjustments for NOx , SO2 , CH4 , and N2O emissions are only conducted for landfill gas in eGRID. This adjustment is based on the assumption that in many cases landfills would flare the gas if they did not combust it for electricity generation. Therefore, we assume that, at a minimum, the gas would have been combusted in a flare and would have produced some emissions of NOx , SO2 , CH4 , and N2O anyway.

Potential Methodological issues

  • Is this a good assumption? Is there data about landfill flaring?
  • Even if this is an appropriate assumption, should these emissions be adjusted to zero? As opposed to other biomass fuels, where the argument for adjusting them is that there is no net addition of GHG to the atmosphere because of the carbon that was sequestered in the biomass in the first place, landfill gas energy production still leads to net emissions into the atmosphere. This starts to border on a consequential accounting approach of emissions, since it uses a baseline to make this adjustment. If we are taking a consequential approach, what about emissions from LFG that would have otherwise been vented instead of flared? Relative to the baseline of venting, this reduces emissions, but we are still adding emissions into the atmosphere. This could be a slippery slope. Also, selectively applying a consequential emissions approach to landfill gas emissions is not necessarily appropriate for inventorying attributional emissions from the power sector.

Emission factor assumption

It also notes:

For NOx emissions from landfill gas, an emission factor for flaring of landfill gas, 0.02 tons per MMBtu,
is used (EPA, 1995). Note that this factor was converted from units of lb/standard cubic foot (scf) to tons/MMBtu based on a value
of 500 Btu/scf (EPA, 2016).

  • Need to add this 0.02 emission factor into the data pipeline (we are currently using 0)
  • Look into whether there is a more recent emission factor available, and whether the heating value used is reasonable.
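The unit conversion described in the quote can be written out explicitly. A sketch (the 0.02 lb/scf input used for illustration is inferred from the quoted 0.02 tons/MMBtu output, not taken from the source, and should be verified against AP-42):

```python
def lb_per_scf_to_tons_per_mmbtu(
    ef_lb_per_scf: float, heating_value_btu_per_scf: float = 500.0
) -> float:
    """Convert an emission factor from lb per standard cubic foot of gas
    to tons per MMBtu of heat input, given the gas heating value.

    lb/scf -> lb/Btu (divide by Btu/scf) -> lb/MMBtu (x 1e6)
           -> tons/MMBtu (divide by 2000 lb/ton).
    """
    return ef_lb_per_scf / heating_value_btu_per_scf * 1e6 / 2000.0
```

Note that with the quoted 500 Btu/scf heating value, the scaling factor 1e6 / (500 * 2000) equals exactly 1, so the numeric value is unchanged by the conversion.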

Update Pipfile to include all dependencies

Very low priority issue, but at some point we should do a pipenv install for all of the dependencies so that they get saved to Pipfile. Some users (like myself) might be using pip instead of conda, so we should support that type of environment.

Improve specificity of NOx and SO2 emissions factors used for imputing emissions

Although hourly measured NOx and SO2 emissions are reported for a majority of generation in CEMS, in certain cases we must impute missing hourly NOx/SO2 emissions or calculate these emissions based on reported EIA-923 fuel consumption.

NOx and SO2 emissions depend on not only the fuel being combusted, but also the prime mover, boiler firing type, and air emissions control equipment. Currently our data pipeline only includes information about the generator fuel type and prime mover, so when performing these imputations, we average the NOx or SO2 emissions factors by fuel and prime mover type. To improve NOx and SO2 emissions calculations, we should incorporate information about the boiler firing type and emissions control equipment into the data pipeline.

Information about these characteristics exists in EIA-860 and EIA-923, but is currently not included in the PUDL ETL pipeline. See this issue in the PUDL repository. The preferred method to fix this would be to integrate this into the PUDL data pipeline, although as a temporary fix we could consider loading this data directly from the raw EIA-860 and 923 files.

Boiler Firing Type

Information about the boiler firing type is located in EIA-860 Schedule 6C, 'Boiler Information - Design Parameters'. Once we have this information, it could be merged into our intermediate data files and be used as a merge key for the NOx and SO2 emissions factors.

Season-specific NOx emission rates

EIA-923 Schedule 8C, 'Air Emissions Control Info' also reports ozone season and non-ozone season-specific NOx emissions factors for each unit based on the NOx control equipment used. While this data does not specifically identify a unit or boiler number, it reports these emissions factors for each "NOx control ID" at each plant, which appear to line up with boilers or units. If these unit and season specific NOx emissions factors are reported, they should probably be used in place of the generic NOx emissions factors.
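If we do use these reported rates, the selection logic might look something like the following sketch, which assumes a May-September ozone season and hypothetical column names:

```python
import pandas as pd

OZONE_MONTHS = {5, 6, 7, 8, 9}  # assumed ozone season definition

def pick_nox_rate(row: pd.Series) -> float:
    """Use the ozone-season rate in ozone-season months when reported,
    otherwise fall back to the non-ozone-season rate."""
    month = row["report_date"].month
    if month in OZONE_MONTHS and pd.notna(row["ozone_season_nox_rate"]):
        return row["ozone_season_nox_rate"]
    return row["non_ozone_season_nox_rate"]

units = pd.DataFrame({
    "report_date": pd.to_datetime(["2020-07-01", "2020-01-01"]),
    "ozone_season_nox_rate": [0.08, 0.08],
    "non_ozone_season_nox_rate": [0.15, 0.15],
})
units["nox_rate"] = units.apply(pick_nox_rate, axis=1)
```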

The eGRID2020 Technical Guide notes that:

For some units, EIA reports unit-level NOx emission rates (lb/MMBtu) for both annual and ozone season
emissions, from EIA Form 923, Schedule 8C. These unit-level emissions rates are multiplied by the
unit-level heat input used to estimate annual and ozone season NOx emissions. For all other units that
report to EIA but are not included in CAMD’s Power Sector Emissions Data, the unit-level heat input
is multiplied by a prime mover- and fuel-specific emission factor from EPA’s AP-42 Compilation of
Air Pollutant Emission Factors or the EIA Electric Power Annual (EPA, 1995; EIA, 2021f, Table A-2)

SO2 emissions control efficiencies

The eGRID2020 Technical Guide notes that:

For some units for which we calculated SO2 emissions with an emission factor, EIA reports SO2
control efficiencies. For these units the estimated SO2 emissions are multiplied by (1 – control
efficiency) to estimate the controlled emissions. Units that do not have unit-level control efficiency
data are assumed to be uncontrolled. The control efficiencies are not used for units where the
emissions data are from CAMD’s Power Sector Emissions Data, because these emissions already take
controls into account.

These SO2 control efficiencies are reported in EIA-923 Schedule 8C, 'Air Emissions Control Info'.
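Applying the rule quoted above could be sketched as follows (column names are hypothetical):

```python
import pandas as pd

def apply_so2_controls(df: pd.DataFrame) -> pd.DataFrame:
    """Multiply estimated SO2 by (1 - control efficiency); units with no
    reported efficiency are assumed uncontrolled, per the eGRID rule."""
    efficiency = df["so2_control_efficiency"].fillna(0.0)
    df["so2_mass_lb"] = df["so2_mass_lb_uncontrolled"] * (1 - efficiency)
    return df

df = apply_so2_controls(pd.DataFrame({
    "so2_mass_lb_uncontrolled": [100.0, 100.0],
    "so2_control_efficiency": [0.95, None],
}))
```

As in eGRID, this adjustment should be skipped for units whose emissions come from CAMD data, since those already reflect controls.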

Check methodologies/outputs related to combined cycle plants

For some combined cycle plants, it is possible that fuel input or net generation data is allocated to one portion of a combined cycle but not another, meaning that unit or generator-level statistics might appear to be outliers. Most of these issues should be addressed by aggregating the data at the subplant level (which should aggregate all parts of a combined cycle plant together), but this is something that we should look into.

Estimating missing hourly profiles for hydro

When assigning an hourly profile to monthly hydroelectric data, the current method is to use the cleaned hydro profiles from EIA-930 if available, and assign a flat hourly profile to each month otherwise. In exploring the data, there are a couple of ways that this methodology should be improved.

Pumped storage hydro

Currently conventional hydroelectric and pumped storage hydroelectric (PSH) are grouped together, both in our cleaned EIA-923 values and in the EIA-930 data. It appears that at least in some cases, certain BAs are reporting net negative hydro generation in certain hours, which would reasonably represent PSH charging. Things to investigate:

  • consider separating conventional hydroelectric from PSH in the EIA-923 data, since one is primarily a storage technology. This would be relatively easy to do because conventional hydroelectric is identified with the HY prime mover code, and PSH is identified with the PS prime mover code.
  • Currently, I believe that negative generation values reported in EIA-930 are treated as anomalous in our data cleaning process and filtered out of the data. We may not want to filter these out for hydroelectric, though this raises a much deeper conversation about how to treat energy storage data more broadly.

Avoiding use of flat profiles for hydroelectric

Although hydroelectric generation often displays significant seasonal variation, many hydro generators (especially reservoirs/dams) also exhibit significant variation in generation across the hours of a day. We might want to consider how to estimate this hourly variation when we do not have direct data for the hydro facilities operating in a BA. Several options:

  • consider using hydro data from directly interconnected balancing authorities. The reasoning here is that regionally, hydro may be operated in similar patterns, especially if it represents a large portion of electricity interchange.
  • Perhaps investigate a month-hour fixed effects model for hydro generation based on national data from EIA-930
  • Consider whether data for reservoir hydro and run of river hydro can be separated, since these likely have different dispatch patterns. However, there does not seem to be any directly-reported data that could be used to directly identify hydro facilities in this way (although it is possible that there is another database that contains this information).
  • It is possible that daily variation in hydro dispatch is load following, and thus could be modeled using regional demand data.
  • Another method, used in this paper, is to use stream gauge data downstream of each hydroelectric facility to estimate production. However, this sounds like a labor-intensive process and may not be very accurate (e.g., how would you distinguish flow through the turbine from spill-over?).

Refine emission factors used for imputing missing CEMS data

When imputing missing CEMS emissions data, we use a fuel-specific emission rate based on the fuel type identified in the power sector data crosswalk. However, there is a chance that these assigned values are incorrect, or only represent the primary fuel type for a multi-fuel plant. A more robust approach could be to calculate a month-specific weighted average emission factor based on the proportion of each fuel actually burned in that unit in a given month (as reported in EIA-923), assuming that there is a 1:1 or 1:m mapping between the EPA unit and the EIA boiler/generator.
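The proposed weighted average could be sketched as follows (column names and factor values are illustrative):

```python
import pandas as pd

# Toy EIA-923 monthly fuel consumption for one unit; values are illustrative.
fuels = pd.DataFrame({
    "report_date": ["2020-01", "2020-01"],
    "energy_source_code": ["NG", "DFO"],
    "fuel_consumed_mmbtu": [900.0, 100.0],
    "co2_lb_per_mmbtu": [117.0, 161.3],
})

# Weight each fuel's factor by the heat input actually burned that month.
fuels["co2_lb"] = fuels["co2_lb_per_mmbtu"] * fuels["fuel_consumed_mmbtu"]
monthly = fuels.groupby("report_date").agg(
    co2_lb=("co2_lb", "sum"),
    fuel_mmbtu=("fuel_consumed_mmbtu", "sum"),
)
monthly["weighted_ef"] = monthly["co2_lb"] / monthly["fuel_mmbtu"]
```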

Add `--clobber` option in data pipeline

Certain functions in data_pipeline, specifically those which download data from the internet or take a long time to run (like the gross to net generation calculations), are implemented such that they will not be re-run if the data already exists in the directory. However, in certain cases (e.g., new source data is released, or GTN calculations need to be regenerated), the user currently has to manually delete these directories. Instead, we should include an option like --clobber in these functions which, if used, would overwrite the existing data. This could also be exposed as a command line argument if it's a common use case.
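A sketch of what this might look like, with a hypothetical download_data function standing in for the real pipeline steps:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--clobber",
    action="store_true",
    help="Re-download or re-compute outputs even if they already exist.",
)
args = parser.parse_args([])  # from the CLI this would read sys.argv

def download_data(output_path: Path, clobber: bool = False) -> None:
    """Hypothetical pipeline step: skip work if output exists, unless clobber."""
    if output_path.exists() and not clobber:
        print(f"{output_path} exists, skipping (use --clobber to regenerate)")
        return
    ...  # download or recompute, overwriting any existing file
```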

Refine assumed gross to net generation ratio

Currently, in our gross to net generation methodology hierarchy, if all other methods fail, the final method is to assume a gross to net generation ratio of 0.85, which was based on an approximate average of fleetwide GTN ratios. However, this assumption is not necessarily very robust and should be improved. For the 2020 data, this method is currently used for about 2.2% of total net generation.

One simple fix could be to at least calculate prime mover and fuel specific gross to net ratios.
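That fix could be sketched as follows, using hypothetical column names and toy numbers, with the 0.85 fleetwide ratio retained as the final fallback:

```python
import pandas as pd

DEFAULT_GTN_RATIO = 0.85  # current fleetwide fallback

# Derive prime mover- and fuel-specific gross-to-net ratios from plants
# where both values are known; numbers here are illustrative only.
known = pd.DataFrame({
    "prime_mover_code":     ["ST", "ST", "CT"],
    "energy_source_code":   ["BIT", "BIT", "NG"],
    "gross_generation_mwh": [100.0, 200.0, 50.0],
    "net_generation_mwh":   [90.0, 170.0, 48.0],
})
totals = known.groupby(["prime_mover_code", "energy_source_code"]).sum()
totals["gtn_ratio"] = totals["net_generation_mwh"] / totals["gross_generation_mwh"]

def gtn_ratio(prime_mover: str, fuel: str) -> float:
    """Look up the specific ratio, falling back to the fleetwide default."""
    try:
        return totals.loc[(prime_mover, fuel), "gtn_ratio"]
    except KeyError:
        return DEFAULT_GTN_RATIO
```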

Add monthly and annual aggregations to results

Although the primary purpose of this data is to provide accurate hourly emissions data, some users may wish to still use monthly or annual averages for their purposes. Thus, we should include outputs at these aggregations for users as well.

Look into updated default emission factors

The eGRID2020 technical guide notes that:

The emission factors are primarily from the default CO2 emission factors from the EPA Mandatory Reporting of Greenhouse Gases Final Rule (EPA, 2009, Table C-1). For fuel types that are included in eGRID2020 but are not in the EPA Mandatory Reporting of Greenhouse Gases Final Rule, additional emission factors are used from the 2006 Intergovernmental Panel on Climate Change (IPCC) Guidelines for National Greenhouse Gas Inventories and the EPA Inventory of U.S. Greenhouse Gas Emissions and Sinks: 1990-2015 (IPCC, 2007a; EPA, 2017).

However, it is unclear whether there might be more up to date emissions factors that should be used:

  • Look into whether there are newer default GHG emission factors than those in the 2009 EPA Final Rule
  • Look into whether IPCC emission factors have been updated since 2006
  • Look into whether there are updates to the 2017 EPA Inventory of U.S. Greenhouse Gas Emissions and Sinks

The technical guide also notes:

Several fuel types do not have direct reported emission factors, so emission factors from similar fuel types are used:
• The emission factor for natural gas is used to estimate emissions from process gas and other gas;
• The emission factor for anthracite, bituminous, and lignite coal are used to estimate emissions from refined coal and waste coal; and
• The emission factor for other biomass liquids is used to estimate emissions from sludge waste and liquid wood waste

  • Research whether there are now specific emission factors for process gas (PRG)
  • Research whether there are now specific emission factors for other gas (OG)
  • Research whether there are now specific emission factors for refined coal (RC)
  • Research whether there are now specific emission factors for waste coal (WC)
  • Research whether there are now specific emission factors for sludge waste (SLW)
  • Research whether there are now specific emission factors for liquid wood waste (WDL)

Improve functionality of `--small` argument for testing data pipeline

As noted in this PR, data_pipeline.py contains the argument --small, which filters out 95% of plants so that the pipeline runs faster for testing.

A few follow up ideas to improve the functionality of this would be to:

  • As currently implemented, the filter occurs after data_cleaning.clean_cems(year), so the pipeline still takes 10+ minutes. If it were faster, we could run it in a commit hook to guarantee that data_cleaning is always functional. We should enable filtering in data_cleaning.clean_cems(year) right after loading the CEMS data from parquet, but before cleaning it.
  • We might want to specify a static seed parameter for selecting the random data, so that we can consistently access the same subset of data each time
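A sketch of seeded sampling so the same subset is selected on every run (function and column names are hypothetical):

```python
import pandas as pd

SMALL_FRACTION = 0.05  # keep 5% of plants
SEED = 42              # static seed so the subset is reproducible

def filter_small(df: pd.DataFrame) -> pd.DataFrame:
    """Keep a deterministic 5% sample of plants across runs."""
    plants = pd.Series(df["plant_id_eia"].unique())
    keep = plants.sample(frac=SMALL_FRACTION, random_state=SEED)
    return df[df["plant_id_eia"].isin(keep)]
```

Because the seed is fixed, repeated runs (and different developers) operate on an identical subset, which makes test failures reproducible.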

Fix biomass adjustment for CEMS waste (MSW) data

Historically, any generators that burned municipal solid waste (MSW) reported this fuel consumption under a single fuel code. In recent years, however, EIA-923 began reporting these data under two separate codes for the biogenic portion (MSB) and non-biogenic portion (MSN). This is important because each portion has different emission rates. We should ensure that when this data is available, the more specific fuel codes are being used instead of MSW.

Allocate state-level (99999) data from EIA-923

As the eGRID technical guide notes:

some generator-level net generation data are missing or not reported for various generators in the 2020 EIA-923. EIA aggregates these missing data to the state level by fuel type, but it is not possible to distribute them back to the generator level accurately.

This imputed state-level data is reported in EIA-923 as a "State-Level Fuel Increment" using plant ID 99999. We currently exclude this imputed state-level data from our calculations, and it is unclear if/how eGRID incorporates these data.

Thus we should

  • investigate if/how eGRID currently uses these data
  • Better understand EIA's methodology for imputing these data
  • Consider integrating these data into our pipeline.

Validate approach to imputing missing hourly profiles for wind generation

In some BAs, wind generation is reported in EIA-923, but there is no hourly wind generation reported in EIA-930. In those cases, we have implemented a method that imputes this missing generation data by averaging together the reported wind generation profiles for directly-interconnected BAs located in the same time zone and using that as a proxy. However, we have not yet validated whether this is a reasonable approach.

To do so, we can evaluate the correlation between regional wind generation profiles, and also cross-validate by estimating profiles using this method for regions that do not have missing wind generation data and comparing the estimate to the actual reported value.

Validate gross to net generation conversion method

To convert hourly gross generation in CEMS to hourly net generation, we use a combination of five different methods applied in hierarchical order to the data based on the quality of the conversion factors. However, the order in which we apply these methods could affect the final outcome. Thus, we should test how the ordering of these methods affects the results, and set the method hierarchy based on those results.

One validation test we could do is to compare the annual sum of our calculated net generation values to the annual sum of the reported EIA-923 generation for each plant. Whichever approach minimizes the residual between the two should be the one we use.

Digester gas (DG) fuel type

The eGRID technical guide mentions digester gas (DG) several times, although this energy source code does not seem to appear in any static tables or EIA data.

  • Look into whether digester gas needs to be added as an energy source code to all static tables
  • Look into emissions factors for digester gas

Identify and replace all "OTH" energy source codes

Some generators have a reported energy_source_code of "OTH" or other. The challenge with this code is that there are no emissions factors for "OTH" fuels, so any emissions imputation results in missing values. To address this, we will need to manually replace the energy_source_code for any generators with OTH as the code with another fuel code that best matches the fuel actually burned at the generator.

We currently do this using the function data_cleaning.update_energy_source_codes, which manually replaces these values for three plants. However, to be more systematic about this we should:

  • Move any updates to a static table in data/manual, such as updated_energy_source_codes.csv
  • Identify whether there are any existing generators with OTH codes that we have not yet caught
  • Determine whether there is a programmatic way to update these fuel codes instead of manually updating. For example, many generators with OTH fuel types are at oil refineries and are likely burning some sort of gas byproduct. We might be able to identify whether "refinery" is in the plant name and update based on that

Calculate CO2 equivalent values using global warming potentials

Our data pipeline currently calculates CO2, CH4, and N2O emissions but does not combine these into a single CO2e value. Part of the reason for this is that the global warming potentials (GWPs) used to calculate CO2e vary based on the IPCC assessment report from which they come, and whether they are for a 20-year horizon or 100-year horizon.

Functionality to do this was added in #25, but there remain some outstanding questions:

  • should CO2e values be included in our results files? If so, which CO2e value should we use?

AR6 GWPs

The AR6 GWPs can be found here. According to this summary, there are now different GWPs for methane depending on whether the methane is of fossil or non-fossil origin.

  • Add the AR6 GWPs to data/manual/ipcc_gwp.csv
  • Reorganize the static table in long format rather than wide format
  • Figure out how to indicate biogenic and nonbiogenic methane
  • Add report years

It is my understanding that GWPs change over time: as the atmospheric concentrations of GHGs change, the GWP of newly emitted GHGs changes as well. Thus, each time the GWPs are updated, the new values should be used for all emissions starting in that year, but values for previous years should not be retroactively changed.
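A minimal sketch of the CO2e calculation itself; the default GWPs shown are the AR5 100-year values (CH4 = 28, N2O = 265) purely for illustration, and in practice the values should be read from data/manual/ipcc_gwp.csv for the appropriate report year, horizon, and (for AR6) fossil vs. non-fossil methane:

```python
def co2_equivalent(
    co2_lb: float,
    ch4_lb: float,
    n2o_lb: float,
    gwp_ch4: float = 28.0,   # AR5 100-year GWP, illustrative default
    gwp_n2o: float = 265.0,  # AR5 100-year GWP, illustrative default
) -> float:
    """Combine individual GHG masses into a single CO2e mass."""
    return co2_lb + ch4_lb * gwp_ch4 + n2o_lb * gwp_n2o
```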

Dealing with Balancing Authorities that retire

Retired BAs

As the EIA-930 data about page notes,

Entities occasionally stop performing the BA role because their electric system is incorporated into another BA's system or they have made other arrangements. Five BAs retired after July 1, 2015, the first date of EIA-930 data availability:

  • Gila River Power, LLC (GRMA) – retired May 3, 2018
  • Ohio Valley Electric Corporation (OVEC) – retired December 1, 2018
  • Utilities Commission of New Smyrna Beach (NSB) – retired January 8, 2020
  • Electric Energy, Inc. (EEI) – retired February 29, 2020
  • PowerSouth Energy Cooperative (AEC) – retired September 1, 2021

We need to implement a filter to remove retired BAs from final outputs if they retired prior to the reporting year. This leads to two additional questions:

  • Once a BA retires, how do we determine which new BA to assign its generation to?
  • If the BA retires mid-year, how should we treat that generation? We could 1) continue to assign it to the retiring BA for the entire year, 2) switch the generation to the new BA after the retirement date, 3) Assign the generation to the new BA for the entire year.
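A sketch of the retirement filter, using the dates listed above from the EIA-930 about page (the BA column name is hypothetical):

```python
import pandas as pd

# Retirement dates from the EIA-930 data about page.
BA_RETIREMENT_DATES = {
    "GRMA": "2018-05-03",
    "OVEC": "2018-12-01",
    "NSB": "2020-01-08",
    "EEI": "2020-02-29",
    "AEC": "2021-09-01",
}

def filter_retired_bas(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Drop BAs that retired before the start of the reporting year."""
    retired = {
        ba for ba, date in BA_RETIREMENT_DATES.items()
        if pd.Timestamp(date) < pd.Timestamp(year=year, month=1, day=1)
    }
    return df[~df["ba_code"].isin(retired)]
```

Note that this only handles the pre-reporting-year case; the mid-year retirement question above still needs a decision.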

Does emission adjustment order matter?

When adjusting emissions for CHP and biomass, eGRID first makes the biomass adjustment, then adjusts for CHP.

However, in our data pipeline, we first adjust for CHP, then for biomass. We use a slightly different method for these adjustments because we are working with hourly unit-level data rather than monthly plant-level data (like eGRID), but the question remains:

  • does it matter the order in which we apply these adjustments?
  • If it does, what makes sense as the right order in which to apply these?

Improve gross to net generation regression

Generally, our linear equation for regressing gross to net generation seems to fit the data very well (in some cases almost exactly). However, in the future we may want to consider refining our gross to net generation regression to see if any additional factors help explain some of the variation in the data. These factors could include:

  • Investigate annual fixed effects
  • Determine if/when major change in equipment (repower, new environmental controls) affects gross to net generation
  • Investigate monthly fixed effects
  • Capacity factor
  • Binary variable for months where unit is operating and months when it is not

Emissions from fuel cells

The eGRID2020 technical guide notes:

The CO2 emissions for units with a fuel cell prime mover are also assumed to be zero.

However, as noted by several sources, fuel cells do not necessarily have zero emissions if they use natural gas as a fuel.

  • This reporting on Bloom fuel cells indicates an emission rate of 773-884 lbCO2/MWh
  • This EPA report indicates fuel cell emissions between 734 and 1131 lbCO2/MWh for uncontrolled emissions, and that certain types of fuel cells have non-negligible NOx and SOx emissions (see table 6-5)

Currently our data pipeline does not set fuel cell emissions to zero, but uses the default natural gas combustion emission factor to calculate emissions - it is unclear if this is appropriate.

To do:

  • Identify appropriate default emission factors for natural gas fuel cells.
  • Integrate these emission factors into the pipeline

Method for assigning hourly profile to negative net generation

When calculating residual profiles, negative data might not always represent “bad” data that needs to be scaled. There might actually be negative net generation if all of the power plants are idling and consuming more electricity than they generate. In this case, we might want to check the monthly data that we plan to distribute and see if there is any negative net generation represented there.

If a plant has reported negative net generation for an entire month, we cannot simply multiply this by a profile, since multiplying the profile by a negative number will invert the shape of the profile.

If a plant has negative net generation and no fuel consumption, it likely didn’t generate at all - we can probably assign a flat profile to this to represent a flat house load.

If a plant has negative net generation but some fuel consumption, it might have generated in certain hours. Thus, we might want to shift the residual profile such that some hourly values are greater than zero, and some are less than zero, and the sum of all these values adds up to the total negative net generation. One way to do this could be how this was implemented here. Instead of using a scaling factor, use a shift factor to shift the profile up or down.
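The shift-factor idea could be sketched as follows: pick the constant shift that makes the hourly profile sum to the reported (negative) monthly total, preserving the profile's shape rather than inverting it.

```python
import numpy as np

def shift_profile_to_total(profile: np.ndarray, monthly_total: float) -> np.ndarray:
    """Add a constant shift so the profile sums to monthly_total.

    Unlike a scaling factor, a negative total does not invert the shape:
    hours with higher gross activity stay higher than hours with less.
    """
    shift = (monthly_total - profile.sum()) / len(profile)
    return profile + shift

# Toy example: a 3-hour profile shifted to a -6 MWh monthly total.
hourly = shift_profile_to_total(np.array([1.0, 2.0, 3.0]), -6.0)
```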

Consumption-based emissions

  • Generate consumption-based BA-level emission factors from BA-level hourly generation (from data_pipeline) and interchange (from EIA-930).

Add checks for column dtypes

Certain named columns should always have a certain data type in order to work properly in the code. For example, plant_id_eia should always be of dtype int, and report_date should always be a datetime. We should probably enforce this when loading data from csvs (by explicitly setting the dtype) or immediately after loading the data.

See the apply_pudl_dtype function in the pudl repository as a potential example of this
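A minimal sketch of such a check (the column lists here are illustrative; the pudl example referenced above covers many more columns):

```python
import pandas as pd

EXPECTED_DTYPES = {"plant_id_eia": "int64"}
DATE_COLUMNS = ["report_date"]

def enforce_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce known columns to their expected dtypes after loading a csv."""
    df = df.astype({c: t for c, t in EXPECTED_DTYPES.items() if c in df.columns})
    for col in DATE_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col])
    return df
```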

Output metric files

In addition to results files in U.S. units, we should include outputs in metric units.

| Data type | US unit | Metric unit |
| --- | --- | --- |
| Emissions mass | lb | kg |
| Electricity | MWh | MWh |
| Heat content | mmbtu | GJ |
| Percentages | decimal between 0 and 1 | decimal between 0 and 1 |
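The mass and heat-content conversions use exact definitions (1 lb = 0.45359237 kg; 1 MMBtu = 1.05505585262 GJ, using the international table Btu):

```python
LB_TO_KG = 0.45359237           # exact, by definition of the pound
MMBTU_TO_GJ = 1.05505585262     # exact, from 1 Btu (IT) = 1055.05585262 J

def to_metric(emissions_lb: float, heat_content_mmbtu: float) -> tuple[float, float]:
    """Convert emissions mass to kg and heat content to GJ.

    Electricity (MWh) and percentages are identical in both unit systems.
    """
    return emissions_lb * LB_TO_KG, heat_content_mmbtu * MMBTU_TO_GJ
```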

Treatment of energy storage in hourly emission intensities

Although we do not yet have hourly storage profiles integrated into this pipeline (see #59), once we do have hourly charging and discharging profiles, the question becomes how we should treat energy storage and stored emissions in our output emission factor calculations.

Although energy storage does not generally have any direct emissions, if it charges using electricity with associated emissions, the discharged electricity could be assigned an emissions intensity based on the carbon intensity of the stored electricity. I am not sure whether standardized rules for how to do this exist yet.

Add support for NOx, SO2, and CO2e in clean_eia923

Using eGRID table C-2 and C-3, we should be able to compute NOx and SO2 emissions from the EIA-923 fuel_consumed_units and fuel_consumed_for_electricity_units columns. I thought it would make more sense to implement this within Hourly eGRID so that I can use the cleaning steps you've already established for 923.

Refine emissions adjustments for CHP

There are several ways that the CHP adjustment (used to calculate emission_mass_lb_for_electricity) could be improved.

Assumed values

  • When calculating useful_thermal_output, data_cleaning.calculate_electric_allocation_factor() uses an assumed efficiency factor of 0.8, because this is what is used in the eGRID methodology. We should investigate whether this assumption can be improved.
  • When calculating the electric_allocation_factor, data_cleaning.calculate_electric_allocation_factor() uses an additional assumed efficiency factor of 0.75, because this is what is used in the eGRID methodology. We should investigate whether this assumption can be improved.

Adjust calculation for bottoming cycles

The eGRID technical support document notes, regarding their CHP adjustment methodology, that:

This assumes that the CHP units generate electricity first and use the waste heat for other purposes, also
known as “topping.” While there are some units that generate and use heat first and then use the waste heat
to generate electricity, also known as “bottoming,” data from the EIA shows that the vast majority of CHP
facilities are topping facilities

However, the EIA-860 generator table contains information about whether each plant uses a topping or bottoming cycle, so we could incorporate this information to create a different calculation for bottoming cycle plants.

According to the data for 2020, of the 72,337 MW of operable capacity for CHP generators, 67,571 MW (93%) uses a topping cycle, while the remaining 7% uses a bottoming cycle.

Refine methodology for adjusting CEMS data

Because CEMS reports data by the unit, our current understanding is that each unit either only produces steam (heat), or only produces electricity, but not both. If this is the case, it simplifies the calculation because we can simply exclude steam-only units from the calculation of emissions for electricity production. However, we need to investigate this further to understand whether this is the case, and whether there would be any reason that any emissions from these plants should be allocated to electricity generation.

Check negative net generation methods and outputs

In certain cases, net generation values can be negative if a generator (or an entire fleet) was consuming more electricity than it was generating. However, these negative values have the potential to produce counterintuitive or strange results. Thus, we should ensure that we have a consistent approach to handling negative values throughout our pipeline.

  • when loading inputs and cleaning data, make sure that we are not automatically assuming that negative values are bad or outlier data and removing them from our data. We should have a consistent approach among all of our input datasets
  • Whenever there is a multiplication or division by a negative number, it can potentially lead to results that we do not intend. For example multiplying an hourly profile by a negative net generation value would invert the shape of the hourly profile, which may not be what we want.
  • Whenever we are calculating output emission factors, it may not make sense to have negative emissions factors. Thus in these cases we should consider whether it makes sense to replace negative factors with zeros.

Add annotations to all functions

Also low priority for now, but it would be nice to include types on all arguments and return types.

I propose that we wait until everything is refactored and in a more stable state, and then go through and add the types. I imagine you'll be doing a pass through the code to add more documentation/cleanup before the release anyways.
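For example, a fully annotated helper might look like this (the function is hypothetical, shown only to illustrate the annotation style):

```python
import pandas as pd

def calculate_co2_mass(
    fuel_consumed_mmbtu: pd.Series, ef_lb_per_mmbtu: float
) -> pd.Series:
    """Return CO2 mass in lb given heat input and an emission factor."""
    return fuel_consumed_mmbtu * ef_lb_per_mmbtu
```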

Assign hourly profile to EIA-923 data for `partial_cems` subplants

There is a set of subplants for which we have incomplete hourly data from CEMS. Instead of using EIA-930 residual profiles to assign an hourly profile to the EIA-923 data for these subplants, we can use the hourly profiles for the subplant units that do report hourly data as the hourly profile for the entire subplant.

Once we assign an hourly profile to these subplants, we will need to concatenate these profiles with the CEMS hourly data and the EIA-923 data that was distributed using EIA-930 data.

Impute missing NOx and SO2 data in CEMS

When cleaning hourly CO2 data in CEMS, we decided that in any hours where a unit reported zero CO2 emissions but non-zero fuel consumption, the reported zero should be treated as a missing value and imputed using the fuel consumption and a fuel-specific emission factor. This only affects a small number of hours/units, and in many cases likely occurred because CO2 mass is reported in tons in CEMS, so the zero values may be rounding errors. Because our pipeline works in lb instead of tons, this imputation should result in non-zero CO2 values.

This may be even less of an issue for reported NOx/SO2 emissions, but we should implement the same imputation method used for CO2 for these emissions as well.
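The CO2 rule above, generalized to any pollutant column, could be sketched as follows (column names and the factor value are illustrative):

```python
import pandas as pd

def impute_zero_emissions(
    df: pd.DataFrame, col: str, ef_lb_per_mmbtu: float
) -> pd.DataFrame:
    """Where a unit reports zero emissions but non-zero fuel consumption,
    treat the zero as missing and impute from fuel * emission factor."""
    mask = (df[col] == 0) & (df["fuel_consumed_mmbtu"] > 0)
    df.loc[mask, col] = df.loc[mask, "fuel_consumed_mmbtu"] * ef_lb_per_mmbtu
    return df

df = impute_zero_emissions(
    pd.DataFrame({
        "so2_mass_lb": [0.0, 0.0],
        "fuel_consumed_mmbtu": [10.0, 0.0],  # second row: no fuel, keep zero
    }),
    "so2_mass_lb",
    0.5,
)
```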
