
openaq-fetch's Introduction

OpenAQ Data Ingest Pipeline


Overview

This is the main data ingest pipeline for the OpenAQ project.

Starting with index.js, there is an ingest mechanism to gather global air quality measurements from a variety of sources. This is currently run every 10 minutes and saves all unique measurements to a database.

openaq-api-v2 powers the API; more information on the data format can be found in openaq-data-format.

For more info see the OpenAQ-Fetch documentation index.

Installing & Running

To run the pipeline locally, you will need Node.js installed.

Install necessary Node.js packages by running

npm install

Now you can get started with:

node index.js --help

For production deployment, you will need to set certain environment variables, as listed in the table below:

| Name | Description | Default |
|---|---|---|
| API_URL | URL of openaq-api | http://localhost:3004/v1/webhooks |
| WEBHOOK_KEY | Secret key to interact with openaq-api | '123' |
| EEA_TOKEN | API token for EEA API | not set |
| DATA_GOV_IN_TOKEN | API token for data.gov.in | not set |
| EPA_VICTORIA_TOKEN | API token for portal.api.epa.vic.gov.au | not set |
| EEA_GLOBAL_TIMEOUT | How long to check for EEA async results before quitting, in seconds | 360 |
| EEA_ASYNC_RECHECK | How long to wait before rechecking for EEA async results, in seconds | 60 |
| SAVE_TO_S3 | Whether the process saves the measurements to an AWS S3 bucket | not set |

For the full list of environment variables and process arguments, see the environment documentation.
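As a rough illustration of how the variables in the table might be read with their listed defaults, here is a minimal sketch. The `loadConfig` helper and the property names it returns are hypothetical and are not taken from the openaq-fetch source.

```javascript
// Sketch only: `loadConfig` is a hypothetical helper, not part of openaq-fetch.
// It reads the variables from the table above and applies the listed defaults.
function loadConfig(env = process.env) {
  return {
    apiUrl: env.API_URL || 'http://localhost:3004/v1/webhooks',
    webhookKey: env.WEBHOOK_KEY || '123',
    eeaToken: env.EEA_TOKEN, // undefined when not set
    dataGovInToken: env.DATA_GOV_IN_TOKEN, // undefined when not set
    epaVictoriaToken: env.EPA_VICTORIA_TOKEN, // undefined when not set
    eeaGlobalTimeout: Number(env.EEA_GLOBAL_TIMEOUT || 360), // seconds
    eeaAsyncRecheck: Number(env.EEA_ASYNC_RECHECK || 60), // seconds
    saveToS3: env.SAVE_TO_S3 === 'true' // only the literal value true enables S3
  };
}
```

Passing the environment in as an argument keeps the helper easy to exercise without touching the real process environment.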

Pushing to AWS S3

If you want to push results to an S3 bucket as well for further processing, the environment variable SAVE_TO_S3 should be set to the value true. Additionally, you have to set the following environment variables (or be running in a process with a suitable IAM role):

| Name | Description | Default |
|---|---|---|
| AWS_BUCKET_NAME | AWS bucket to store the results | not set |
| AWS_ACCESS_KEY_ID | AWS credentials key ID | not set |
| AWS_SECRET_ACCESS_KEY | AWS credentials secret key | not set |

The measurements will be stored using the structure bucket_name/fetches/yyyy-mm-dd/unixtime.ndjson for each fetch.
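The key layout above could be produced along these lines. This is a sketch under assumptions: the date is taken in UTC, `unixtime` is in whole seconds, and `buildS3Key` and `toNdjson` are hypothetical names rather than functions from the repository.

```javascript
// Sketch only: hypothetical helpers illustrating the documented key layout.
// Assumes the date part is UTC and unixtime is expressed in whole seconds.
function buildS3Key(fetchDate = new Date()) {
  const day = fetchDate.toISOString().slice(0, 10); // yyyy-mm-dd
  const unixtime = Math.floor(fetchDate.getTime() / 1000);
  return `fetches/${day}/${unixtime}.ndjson`;
}

// Newline-delimited JSON: one measurement per line.
function toNdjson(measurements) {
  return measurements.map((m) => JSON.stringify(m)).join('\n') + '\n';
}
```

The bucket name itself (AWS_BUCKET_NAME) is supplied separately when uploading, so it is not part of the returned key.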

Tests

To confirm that everything is working as expected, you can run the tests with

npm test

To test an individual adapter, you can use something like:

node index.js --dryrun --source 'Beijing US Embassy'

For a more detailed description of the command line options available, use: node index.js --help

Deployment

Deployment is being built from the lambda-deployment branch. Any development for openaq-fetch should be branched from and merged to the lambda-deployment branch until further notice.

Deployments rely on a JSON object that contains the different deployments. The scheduler loops through that object and posts a message for each deployment, which triggers a lambda to run it. A deployment consists of a set of arguments that are passed to the fetch script to limit the sources that are run.

You can test the deployments with the following:

Show all deployments, but don't submit and don't run the fetcher:

node index.js --dryrun --deployments all --nofetch

Only the japan deployment, but don't run the fetcher:

node index.js --dryrun --deployments japan --nofetch

Only the japan deployment; don't submit a file, but run the fetcher:

node index.js --dryrun --deployments japan
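To make the scheduler loop more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the shape of the deployments object, the `country` argument, and the `selectDeployments`, `schedule`, and `postMessage` names are hypothetical and do not come from the repository.

```javascript
// Sketch only: a hypothetical deployments object. The real names and
// arguments live in the repository and may look quite different.
const deployments = [
  { name: 'all' }, // no filter arguments: run every source
  { name: 'japan', country: 'JP' } // hypothetical argument limiting the sources
];

// Pick the deployments requested on the command line ("all" or a name list).
function selectDeployments(all, requested) {
  if (requested === 'all') return all;
  const names = requested.split(',');
  return all.filter((d) => names.includes(d.name));
}

// Loop through the selected deployments and post one message per deployment.
// `postMessage` is injected so that a dry run can skip the submission.
async function schedule(all, requested, postMessage, { dryrun = false } = {}) {
  const selected = selectDeployments(all, requested);
  for (const deployment of selected) {
    if (dryrun) {
      console.log('would submit', JSON.stringify(deployment));
    } else {
      await postMessage(deployment);
    }
  }
  return selected;
}
```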

Data Source Criteria

This section lists the key criteria for air quality data aggregated onto the platform. A full explanation can be accessed here. OpenAQ is an ever-evolving process that is shaped by its community: your feedback and questions are actively invited on the criteria listed in this section.

  1. Data must be of one of these pollutant types: PM10, PM2.5, sulfur dioxide (SO2), carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and black carbon (BC).

  2. Data must be from an official-level outdoor air quality source, defined as data produced by a government entity or international organization. We do not, at this stage, include data from low-cost, temporary, and/or indoor sensors.

  3. Data must be ‘raw’ and reported in physical concentrations on their originating site. Data cannot be shared in an 'Air Quality Index' or equivalent (e.g. AQI, PSI, API) format.

  4. Data must be at the ‘station-level,’ associable with geographic coordinates, not aggregated into a higher (e.g. city) level.

  5. Data must be from measurements averaged between 10 minutes and 24 hours.
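Criteria 1, 4, and 5 lend themselves to a mechanical check. The sketch below assumes a measurement object with `parameter`, `coordinates`, and `averagingPeriod` fields; the pollutant identifiers and the `criteriaProblems` name are assumptions for illustration, not the project's actual validation code.

```javascript
// Sketch only: hypothetical check for criteria 1, 4 and 5.
// The identifiers below are assumed spellings of the listed pollutants.
const ALLOWED_PARAMETERS = ['pm10', 'pm25', 'so2', 'co', 'no2', 'o3', 'bc'];

function averagingPeriodInMinutes(period) {
  const factors = { minutes: 1, hours: 60 };
  return period.value * factors[period.unit]; // NaN for unknown units
}

// Returns the list of criteria a measurement fails (empty when it passes).
function criteriaProblems(m) {
  const problems = [];
  if (!ALLOWED_PARAMETERS.includes(m.parameter)) problems.push('parameter');
  const c = m.coordinates;
  if (!c || typeof c.latitude !== 'number' || typeof c.longitude !== 'number') {
    problems.push('coordinates');
  }
  const minutes = m.averagingPeriod ? averagingPeriodInMinutes(m.averagingPeriod) : NaN;
  // Between 10 minutes and 24 hours (1440 minutes), inclusive.
  if (!(minutes >= 10 && minutes <= 1440)) problems.push('averagingPeriod');
  return problems;
}
```

Criteria 2 and 3 (official source, raw physical concentrations) depend on how the source publishes its data and still need a human review.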

Contributing

There are a lot of ways to contribute to this project; more details can be found in the contributing guide.

openaq-fetch's People

Contributors

abarciauskas-bgse, alexdunncs, andrewharvey, aretey, caparker, dabutvin, dolugen, esonica, jflasher, kamicut, magsyg, majesticio, maxgrossman, michalcz, mondorescue, nishadhka, olafveerman, rub21, russbiggs, sethvincent, sourabhtk37, sruti, tverilytt, ymhuang0808, yunica


openaq-fetch's Issues

Peru - Data Sources

-Country of data source: Peru
-Specific coverage within country: Lima

URL: (A) http://peruclima.pe/?p=calidad-de-aire OR (B) http://calidaddelaire.minam.gob.pe/estaciones.php [These appear to be the same sources, separate confirmation would be good though]
-Comments on URL: [Ex. You have to click on x, then y to see values]
For (B), you will need to click on the red dots to go to table like this: http://www.senamhi.gob.pe/?p=0412&txt=112192

-Pollutants measured: PM2.5, PM10, CO, SO2, NO2

-Are units of measurement clearly indicated (e.g. do you see 'AQI', 'ppm' or 'ug/m^3'?) Note: We cannot input AQI, API, AQHI, etc. values directly into our system.
Yes, physical measurements (e.g. ppm, ug/m^3) are clearly labelled in tables. Though note for (A), you will see the map shows a presumed AQI, but as soon as you hover over the location, you see a nice table with great labels including timestamp, coords, and units of measurements.

-Comments about units: [ex. 'They are viewable on {this page}, see the table]
None.

-Are there date+timestamps for pollutant measurements?
Yes.

-Are time intervals otherwise indicated on the site (e.g. 'Measurements are 15 minute averages'):
Hourly. It's pretty clear from the tables.

-Are coordinates for the stations clearly indicated or discernible? Comment. (e.g. do you see a map or a list of coordinates): [<--I can't get the bold function to work here! Weird.]
For (A), they are clearly listed when you hover over a spot. For (B), when you click on a station (little red dot), they are also listed.

-Are you with the host agency and may we contact you?: No.

-Any other comments about this source:
This source looks awesome.
Another source in Lima: #99

[OPTIONAL]Your name/contact info/interest in open air quality data: n/a

Belgian sources

@olafveerman 's issue transported to this repository:

Measurements for a lot of Belgian measuring stations: http://www.ircel.be/nl/luchtkwaliteit/metingen
This page shows rolling averages for most parameters.

When you drill down, it's possible to get the actual hourly measurements and not the rolling averages. For example:

go to this page: http://www.ircel.be/en/air-quality/measurements/particulate-matter/pm10?set_language=en
click on table with detailed info per monitoring site
Have not found a programmatic way to access this data, it might need to be scraped.

Add better downtime monitoring

We had an outage of a few days due to a silly error in the Dockerfile, however there weren't any alerts triggered or anything. So we should figure out how to monitor that we're consistently getting new data added to the system correctly.

We should probably have something that runs periodically and checks that new data has been added, but it should also be aware of how often a data source is expected to add data. We don't want to fire off any alarms because we haven't gotten data for 1.5 hours if a source only reports every hour, or every 24 hours. This could be tied to the better email handling in #24.
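One possible shape for a frequency-aware check is sketched below. The `isStale` name, the per-source expected interval, and the tolerance factor of 2 are all assumptions made for illustration.

```javascript
// Sketch only: hypothetical staleness check that respects how often a
// source is expected to report. A source is flagged only once its newest
// measurement is older than `tolerance` times its expected interval.
function isStale(lastMeasurementAt, expectedIntervalMinutes, now = new Date(), tolerance = 2) {
  const ageMinutes = (now.getTime() - lastMeasurementAt.getTime()) / 60000;
  return ageMinutes > expectedIntervalMinutes * tolerance;
}
```

With a tolerance of 2, an hourly source that last reported 1.5 hours ago is not flagged, while the same source silent for 3 hours is.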

Backcalculating AQI to concentration

For Iran (#51) and possibly other sources, we may have to back-calculate AQI values to pollutant concentration. It seems that Iran uses the EPA's methodology. Should we write a generic tool in lib that does this calculation?

Data

This tool would accept an object, or array of objects with a measurement. Each measurement needs at least the parameter and the value. We can store a reference to the original value and an annotation about the conversion.

{ 
  city: 'Tehran',
  location: 'emam khomeyni',
  parameter: 'so',
  value: 150
}

would become something like:

{ 
  city: 'Tehran',
  location: 'emam khomeyni',
  parameter: 'so',
  unit: 'ppm',
  value: 0.22,
  original: {
    value: 150,
    methodology: 'EPA AQI to concentration'
  }
}

Methodology

On page 16 of this EPA publication:

image

@RocketD0g Can we use this?

cc @jflasher
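A generic tool of the kind proposed here could invert the AQI's piecewise-linear interpolation: find the breakpoint row containing the AQI value, then map it back linearly onto that row's concentration range. The sketch below is not a vetted implementation. The `aqiToConcentration` name is hypothetical, and the breakpoint rows are placeholder values recalled from an older EPA SO2 table, so they must be replaced with the table from the publication before any real use.

```javascript
// Sketch only. PLACEHOLDER breakpoints (ppm), keyed 'so' to match the
// example measurement above. Verify every row against the EPA publication.
const BREAKPOINTS = {
  so: [
    { aqiLow: 0, aqiHigh: 50, concLow: 0.0, concHigh: 0.034 },
    { aqiLow: 51, aqiHigh: 100, concLow: 0.035, concHigh: 0.144 },
    { aqiLow: 101, aqiHigh: 150, concLow: 0.145, concHigh: 0.224 }
  ]
};

// Back-calculate a concentration from an AQI value by linear interpolation
// within the matching breakpoint row. Measurements without a matching row
// are returned unchanged.
function aqiToConcentration(measurement, breakpoints, unit) {
  const rows = breakpoints[measurement.parameter] || [];
  const row = rows.find((r) => measurement.value >= r.aqiLow && measurement.value <= r.aqiHigh);
  if (!row) return measurement;
  const fraction = (measurement.value - row.aqiLow) / (row.aqiHigh - row.aqiLow);
  const concentration = row.concLow + fraction * (row.concHigh - row.concLow);
  return Object.assign({}, measurement, {
    unit: unit,
    value: Number(concentration.toFixed(3)),
    // Keep a reference to the original value and note the conversion.
    original: { value: measurement.value, methodology: 'EPA AQI to concentration' }
  });
}
```

Called with the Tehran example and the unit 'ppm', this returns a value close to the 0.22 shown above, with the AQI of 150 preserved under `original`.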

India - additional sources

UPDATE: 20 Mar 2016: Please see my comment below (also timestamped 20 Mar 2016) for updated information and formatting of this issue.

Note: While several sources are listed, none are straightforward. If you'd like a straightforward new data issue, visit this issue for adding data sources from Tehran, Iran: #51

Below are listed potential other sources in India, contributed by the OpenAQ Community:
These are listed as high priority because members of the community suggested them.

(1) http://www.cpcb.gov.in/CAAQM/frmUserAvgReportCriteria.aspx
Note: Several Sites around India. I have not been able to access any data yet from this link with any combination of places I have put in. I'm also uncertain if it yields AQI or physical concentrations yet.

(2) http://164.100.160.234:9000/
Note: Several sites around India. These values are in AQI.

(3) http://www.cpcb.gov.in/CAAQM/mapPage/frmindiamap.aspx (same presumed sources as (1), just different view)
Note: Several sites around India. To date, we've had issues programmatically accessing these data, possibly something to do with a timeout. Do you recall, @jflasher?

(4) http://mpcb.gov.in/envtdata/demoPage1.php
Note: Several sites around India. Looks like daily-averaged data. Unclear if data are available in real-time.

Iran (Tehran) - Data Sources

This is listed as high priority b/c Tehran, Iran is experiencing very bad AQ. Media outlets report that schools are shut or will be shutting down due to AQ. Plus it is in a region where we currently have minimal to no coverage in our system.

Most useful info:
List of stations and coordinates: http://31.24.238.89/home/station.aspx
Hourly physical concentration data: http://31.24.238.89/home/DataArchive.aspx
(also downloadable via csv)

Other info:
General map (in AQI format): http://air.tehran.ir/Default.aspx?tabid=193
General map with hourly data (in AQI format): http://31.24.238.89/home/OnlineAQI.aspx
It appears they are using the US EPA scale. I assume this means they are doing a calculation similar to the US EPA's.

Better email handling

@jflasher originally created this issue:

Too many emails are getting sent, figure out a more sane way to handle this. They are currently disabled via Heroku scheduler task until this is fixed.

Vancouver/British Columbia, Canada - Data Sources

These are labelled as a priority because we have gotten a request for Vancouver, though coverage over all of BC would be great.

US National Parks Ozone Data - Sources

Hourly/ real-time data reported on airnow.gov is aggregated into geographical areas and reported as an AQI using the NowCast formula (a weighted 12-hour average) so we are unable to use it for direct actual pollution measurements.

In search of sites that display Washington DC AQ data, found this:
http://www.nature.nps.gov/air/data/current/index.cfm
This lists hourly ozone data in national parks around the US.

Malaysia - Source (not yet useable, seeking help in contacting correct person for data)

Listed as a priority given recent AQ issues in region, but NOTE: There is an issue with these data as they are currently presented: They are shown as an index value of combined pollutants into a unitless value. The FAQ's indicate to contact an individual for historical concentrations in non-index units, though currently unsure how to do that.

After discussions with WRI/GFW, they are particularly interested in AQ data from countries affected by the fires in Indonesia.

AQ data for Malaysia:
http://apims.doe.gov.my/v2/
(click 'apims' when given choice at portal)

FAQ's (and 5th one down is the one on hourly concentration data): http://apims.doe.gov.my/v2/faq.html

Seeking help in identifying who to contact about access to hourly AQ data.

China - Sources (not useable yet)

There are a ton of China air quality data sources. Here are some:
aepb.gov.cn
bjmemc.com.cn
cdemc.cn
cfhb.gov.cn
cepb.gov.cn
dl.gov.cn
dyhb.gov.cn
nbemc.gov.cn
sdein.gov.cn
fjepb.gov.cn
gsep.gansu.gov.cn
qhepb.gov.cn
gdep.gov.cn
gxepb.gov.cn
ghb.gov.cn
dloer.gov.cn
hebei.gov.cn
hljdep.gov.cn
hnep.gov.cn
hbepb.gov.cn
hbt.hunan.gov.cn
nmgepb.gov.cn
jshb.gov.cn
jxepb.gov.cn
shbj.klmy.gov.cn
lnemc.cn
lzhb.gov.cn
nnems.gov.cn
nxep.gov.cn
ordoshb.gov.cn
semc.gov.cn
sxhjjcz.com.cn
szhec.gov.cn
tjemc.org.cn
xzep.gov.cn
wlmqhb.gov.cn
whepb.gov.cn
xianemc.gov.cn
xnepb.gov.cn
xjepb.gov.cn
ynepb.gov.cn
zjepb.gov.cn
Sources from: aqicn.org/sources

There is also a site that provides an API to Chinese AQ data:
http://pm25.in/

However, the issue is that, from what I can tell, these data are not shared in their raw format, only as AQI (I have not clicked on each and every one, however).

Does anyone know differently, or has anyone found sources above that are shared in raw format?

Indonesia - Sources

Of course listed as a priority due to recent AQ issues in the region, but there are issues accessing data at an appropriate time resolution (@olafveerman, correct/edit if I'm mistaken).

After discussions with WRI/GFW, they are particularly interested in AQ data from countries affected by the fires in Indonesia.

Here is the general AQ data page for Indonesia:
http://www.bmkg.go.id/BMKG_Pusat/Kualitas_Udara/

Here is an example of one of the pages (PM10). I am unsure if we can programmatically get data from this page, either from the graph (hover over a point) or by hitting 'view data':
http://www.bmkg.go.id/BMKG_Pusat/Kualitas_Udara/Informasi_Partikulat.bmkg

Add timeouts to the requests in adapters

@jflasher 's original issue:
If we don't set timeouts on the requests, Heroku may time us out which feels worse. Makes me wonder if we should have a system-wide request object that gets passed around so we can set defaults in one place?
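One way to set the default in a single place is a small factory that wraps whatever request function the adapters use. This is a sketch under assumptions: `createRequester` and `timeoutMs` are hypothetical names, the 60-second default is arbitrary, and the wrapper only stops waiting (it does not abort the underlying request).

```javascript
// Sketch only: hypothetical system-wide request wrapper with a default timeout.
// `doRequest` is whatever promise-returning request function the adapters use.
function createRequester(doRequest, defaults = { timeoutMs: 60000 }) {
  return function request(url, options = {}) {
    const timeoutMs = options.timeoutMs || defaults.timeoutMs;
    let timer;
    const timeout = new Promise((resolve, reject) => {
      timer = setTimeout(
        () => reject(new Error(`Request to ${url} timed out after ${timeoutMs} ms`)),
        timeoutMs
      );
    });
    // Whichever settles first wins; the timer is always cleaned up afterwards.
    return Promise.race([doRequest(url, options), timeout]).finally(() => clearTimeout(timer));
  };
}
```

Adapters would then receive the shared `request` function instead of creating their own, and could still override `timeoutMs` per call for slow sources.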

Washington DC - Sources

Listed as medium priority b/c it was requested, but frankly there's not much there, so didn't list it as high priority:

This is the only site in the DC area where I'm finding hourly real-time air quality measurements rather than AQI. And it is only for one location (McMillan Reservoir).

http://www.nature.nps.gov/air/webcams/parks/nacccam/washcam.cfm

This site gives a nice synopsis of how AQ data in the region is handled via EPA, DC govt, and NPS:
http://doee.dc.gov/service/ambient-air-quality-monitoring

Please comment if other sites with actual measurements in the DC area are available.

Netherlands adapter sometimes returns invalid attribution array

I am going to be making this invalid to avoid DynamoDB insert errors. There shouldn't be an empty item in the attribution array.

{ date: 
   { utc: '2016-01-07T00:00:00.000Z',
     local: '2016-01-07T00:00:00+00:00' },
  parameter: 'co',
  location: 'Amsterdam-Einsteinweg',
  value: 861.5,
  unit: 'µg/m³',
  city: 'Amsterdam',
  attribution: 
   [ { name: 'RIVM', url: 'http://www.lml.rivm.nl/' },
     { name: '' } ],
  averagingPeriod: { value: 1, unit: 'hours' },
  coordinates: { latitude: 52.3813, longitude: 4.84523 },
  country: 'NL',
  sourceName: 'Netherlands',
  Timestamp: '2016-01-07T00:00:00.000Z',
  LocationParameter: 'co-Amsterdam-Einsteinweg' }
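On the adapter side, the empty entry could also be dropped before the measurement is emitted. The `cleanAttribution` helper below is a hypothetical sketch, not the fix that was actually applied.

```javascript
// Sketch only: hypothetical helper that drops attribution entries
// without a usable name, such as { name: '' } in the example above.
function cleanAttribution(attribution) {
  if (!Array.isArray(attribution)) return [];
  return attribution.filter(
    (item) => item && typeof item.name === 'string' && item.name.trim() !== ''
  );
}
```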

Set up staging environment

Haven't been avoiding this for any particular reason, but should definitely add this in the near future.

Turkey - Sources

Map of stations with coordinates and current readings and stations' current data are here (though not on unique urls):

http://www.havaizleme.gov.tr/Default.ltr.aspx

'TÜM İSTASYONLAR' = all stations

Clicking on the stations reveals site coordinates (click 'station description') and pollutant types measured.

(Sidenote: They use an AQI system with breakpoints same as the US EPA)

Switch to using DynamoDB

Related to openaq/openaq-api#141, trying to first get writing to DynamoDB working. I'm writing to a table (yay!), but there are some weird behaviors (boo). Specifically, using Texas and Sao Paulo data sources as examples, if I write all the Texas data, it saves fine, but then when I go to write Sao Paulo data into the same table, I seem to be losing some of the Texas data. I have no idea how writing more data into the database can make other data disappear. I thought I could be accidentally overwriting Texas data with Sao Paulo data, but I have done a few things to convince myself that this isn't the case, so I'm not sure what it is at the moment.

Example output before and after a Sao Paulo data source looks like below

Brizo: ~
$ aws --profile openaq dynamodb scan --table-name measurements --filter-expression 'country = :country' --expression-attribute-values {\":country\":{\"S\":\"US\"}} --select COUNT
{
    "Count": 1366, 
    "Items": [], 
    "ScannedCount": 1366, 
    "ConsumedCapacity": null
}

Brizo: ~
$ aws --profile openaq dynamodb scan --table-name measurements --filter-expression 'country = :country' --expression-attribute-values {\":country\":{\"S\":\"US\"}} --select COUNT
{
    "Count": 899, 
    "Items": [], 
    "ScannedCount": 3045, 
    "ConsumedCapacity": null
}

The total number of records does increase, but doesn't seem to get to the right amount. And now even weirder, when I try to insert records in batches after each other, the total number of records in the database decreases? Craziness!

Singapore - Sources (not useable yet)

Listed as a priority given recent AQ issues in the area, but NOTE: There is an issue with these data as they are currently presented: They are not listed as individual stations, but rather as aggregated stations that represent different regions of the city. WRI/GFW is contacting NEA on our behalf to see if the station-level data are accessible. We will not proceed with these data until we receive the station-level information.

After discussions with WRI/GFW, they are particularly interested in AQ data from countries affected by the fires in Indonesia.

AQ data for Singapore:
http://www.nea.gov.sg/anti-pollution-radiation-protection/air-pollution-control/psi/pollutant-concentrations/type/PM25-1Hr#pollutant

If anyone knows differently, please comment.

Data Sources for Vietnam (Hanoi), India (Mumbai, New Delhi, Kolkata, Chennai, Hyderabad), Ulaanbaatar (and potential future sources)

These sources are available via RSS feed at:
http://stateair.net/dos/
RSS Feeds:
Chennai: http://stateair.net/dos/RSS/Chennai/Chennai-PM2.5.xml
Hanoi: http://stateair.net/dos/RSS/Hanoi/Hanoi-PM2.5.xml
Hyderabad: http://stateair.net/dos/RSS/Hyderabad/Hyderabad-PM2.5.xml
Kolkata: http://stateair.net/dos/RSS/Kolkata/Kolkata-PM2.5.xml
Mumbai: http://stateair.net/dos/RSS/Mumbai/Mumbai-PM2.5.xml
New Delhi: http://stateair.net/dos/RSS/NewDelhi/NewDelhi-PM2.5.xml
Ulaanbaatar: http://stateair.net/dos/RSS/Ulaanbaatar/Ulaanbaatar-PM2.5.xml

Geographic Coordinates:
Chennai: 13.052371, 80.251932
Hanoi: 21.021770, 105.819002 (source: Google Maps, US Embassy in Hanoi, need to verify this is the location of the monitor)
Hyderabad: 17.443464, 78.474890 (source: Google Maps)
Kolkata: 22.547142, 88.351048 (source: Google Maps)
Mumbai: 19.066023, 72.868702 (source: Google Maps)
New Delhi: 28.598096, 77.189066 (source: Google Maps)
Ulaanbaatar: 47.928444, 106.930189 (source: Google Maps)
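The feeds and coordinates listed above could be captured as a small site table for an adapter. The `STATEAIR_SITES` and `feedUrl` names and the object layout are hypothetical; the coordinates are copied from the list above and carry the same caveats about verification.

```javascript
// Sketch only: hypothetical site table built from the lists above.
// `feed` is the path segment used in the RSS URLs (no space in NewDelhi).
const STATEAIR_SITES = [
  { city: 'Chennai', feed: 'Chennai', coordinates: { latitude: 13.052371, longitude: 80.251932 } },
  { city: 'Hanoi', feed: 'Hanoi', coordinates: { latitude: 21.02177, longitude: 105.819002 } },
  { city: 'Hyderabad', feed: 'Hyderabad', coordinates: { latitude: 17.443464, longitude: 78.47489 } },
  { city: 'Kolkata', feed: 'Kolkata', coordinates: { latitude: 22.547142, longitude: 88.351048 } },
  { city: 'Mumbai', feed: 'Mumbai', coordinates: { latitude: 19.066023, longitude: 72.868702 } },
  { city: 'New Delhi', feed: 'NewDelhi', coordinates: { latitude: 28.598096, longitude: 77.189066 } },
  { city: 'Ulaanbaatar', feed: 'Ulaanbaatar', coordinates: { latitude: 47.928444, longitude: 106.930189 } }
];

// Build the PM2.5 RSS feed URL for a site, following the pattern listed above.
function feedUrl(site) {
  return `http://stateair.net/dos/RSS/${site.feed}/${site.feed}-PM2.5.xml`;
}
```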

We could also add the Mission China data, which is similarly available via RSS. Currently we use Kimono for it.

Japanese Sources

Japanese data was listed as a priority as we got a request to have this in our database longterm:

Sources for Yokohama:

http://cgi.city.yokohama.lg.jp/kankyou/saigai/data/taiki/all/all_0000_00_001.html
http://www.ihe.pref.miyagi.jp/telem/dayreportitem/?itemSelect=10&day=2015%E5%B9%B410%E6%9C%8804%E6%97%A5

Appears to be hourly but unsure. Joe is contacting Miyagi Prefecture regarding details, potentially existing API, station coordinates.

Tokyo:
Just oxides? Need help with translation:
http://www.ox.kankyo.metro.tokyo.jp/index.php?chiku=1
http://www.ox.kankyo.metro.tokyo.jp/

Main page: http://www.kankyo.metro.tokyo.jp/nature/index.html
