ncdc-analysis

Data analysis with MapReduce

National Oceanic and Atmospheric Administration's (NOAA) National Climatic Data Center (NCDC)

The data is available from NOAA.


The commands used are in the commands.txt file.

The output files are excluded from git commit.

Task #0: Filtering

We are only interested in weather stations that were operable, meaning they reported at least 2 observations every month during a given year. If a particular weather station was not operable in a particular year, all of its data (if any) for that year should be filtered out. Keep the years in which it was operable! All analysis tasks will operate on this subset of the data, so be sure to save it into HDFS in a format usable as input to later tasks!

Step 1

Mapper 1: The goal was to extract from the original data only the information needed for later tasks. The mapper reads the original data line by line and extracts the values from each line. For each line, it emits the station id, year, and month as the key and the temperature as the value for Reducer 1.

Reducer 1: The task was to pass on the data only if 2 or more observations existed. After counting how many temperatures each station measured in each month, only the months with 2 or more observations were passed on to the next reducer.

Step 2

Reducer 2: As the final reducer, if the number of pairs for a station and year was equal to 12, as there are twelve months in a year, the 'id' was emitted as the key and 'id, year, month, temperature' was emitted as the value. The 'id' was included again in the value because it makes the data easier to work with in later tasks.
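The two filtering steps above can be sketched in plain Python, without any MapReduce framework. This is a minimal illustration and not the code from this repository: the function name `filter_operable` and the `(station_id, year, month, temperature)` record layout are assumptions made for the example.

```python
from collections import defaultdict


def filter_operable(records, min_obs=2):
    """Keep only station-years with >= min_obs observations in all 12 months.

    `records` is an iterable of (station_id, year, month, temperature)
    tuples; this layout is an assumption made for the sketch.
    """
    # Mapper 1 equivalent: key each temperature by (station, year, month).
    by_month = defaultdict(list)
    for station_id, year, month, temp in records:
        by_month[(station_id, year, month)].append(temp)

    # Reducer 1 equivalent: drop months with fewer than `min_obs` observations.
    by_year = defaultdict(dict)
    for (station_id, year, month), temps in by_month.items():
        if len(temps) >= min_obs:
            by_year[(station_id, year)][month] = temps

    # Reducer 2 equivalent: keep a station-year only if all 12 months survived.
    kept = []
    for (station_id, year), months in by_year.items():
        if len(months) == 12:
            for month in sorted(months):
                for temp in months[month]:
                    kept.append((station_id, year, month, temp))
    return kept
```

A station that misses the two-observation minimum in even a single month is dropped for that whole year, while its other years are unaffected.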

Task #1: Analysis

If any stations were operable all years in that period, generate a report of their station IDs. If none were operable for all years in that period, rank the stations by which station was operable the greatest number of years and list the top 50 stations.

Step 1

Mapper 1: The mapper takes the output of task00 as input. The values were extracted again, and the id and year were yielded only if the year was greater than 1920 and less than 1940.

Reducer 1: The years yielded by the mapper were gathered into a list for each station. The number of years in the period was calculated as 1941 - 1920 = 21, and multiplying it by 0.8 gave the 80% threshold of 16.8 years. Each station was emitted together with the years in which it was operable. The first if statement emitted the stations that were operable for 80% or more of the years from 1920 to 1940. The second if statement emitted the stations that were operable in every year from 1920 to 1940.

Step 2

Reducer 2: The final reducer emitted the top 50 stations by number of years operable, for the case where none were operable for all years in the period, as instructed.
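The threshold arithmetic and the ranking described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's reducer: the function name `rank_stations`, the returned dictionary keys, and the tie-breaking by station id are hypothetical, and the sketch treats 1920 and 1940 as inclusive bounds (21 years), whereas the mapper described above uses strict comparisons.

```python
from collections import defaultdict

START_YEAR, END_YEAR = 1920, 1940  # period from the task; inclusive bounds are an assumption


def rank_stations(station_years, top_n=50):
    """Summarise how many years each station was operable in the period.

    `station_years` is an iterable of (station_id, year) pairs, such as the
    filtered output of Task #0 reduced to id and year.
    """
    years_by_station = defaultdict(set)
    for station_id, year in station_years:
        if START_YEAR <= year <= END_YEAR:
            years_by_station[station_id].add(year)

    total_years = END_YEAR - START_YEAR + 1  # 21, i.e. 1941 - 1920
    threshold = 0.8 * total_years            # 16.8 years

    all_years = sorted(s for s, ys in years_by_station.items()
                       if len(ys) == total_years)
    eighty_percent = sorted(s for s, ys in years_by_station.items()
                            if len(ys) >= threshold)

    # Rank by number of operable years (descending), then by station id.
    ranked = sorted(years_by_station.items(),
                    key=lambda kv: (-len(kv[1]), kv[0]))
    top = [(s, sorted(ys)) for s, ys in ranked[:top_n]]

    return {"all_years": all_years,
            "eighty_percent": eighty_percent,
            "top": top}
```

Because 16.8 is not a whole number, a station needs at least 17 operable years to pass the 80% check in this sketch.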

What (if any) weather stations were operable for 80% or more of the entire time period 1920-1940?

Yes. Top 50 station list: 108650, 103380, 29750, 29110, 106370, 104190, 267020, 103840, 101270, 101200, 228020, 124250, 124000, 122050, 121160, 121140, 116430, 115180, 112310, 111200, 110350, 109350, 108660, 107760, 107630, 107280, 107270, 106850, 105780, 105770, 105540, 105130, 105010, 104880, 104690, 104680, 104530, 104270, 104100, 103610, 101700, 101470, 101310, 100910, 100670, 100190, 28360, 333930, 330190, 265090

Task #2: More Analysis

For each year, find descriptive statistics such as the min/max/mean/median temperatures for the year including data from all operable stations for that year. Graph all results.

Mapper: The year and temperatures corresponding to the year were extracted and yielded to the reducer.

Reducer: Using Python's built-in statistics module, the minimum, maximum, mean, and median temperatures for each year were calculated and emitted.
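The per-year statistics can be sketched with the standard library alone. The function name `yearly_stats` and the `(year, temperature)` input layout are assumptions for this example; the repository's actual mapper and reducer may be structured differently.

```python
import statistics
from collections import defaultdict


def yearly_stats(year_temps):
    """Compute min/max/mean/median temperature per year.

    `year_temps` is an iterable of (year, temperature) pairs, which is
    roughly what the mapper described above yields.
    """
    # Group the temperatures by year (the shuffle step in MapReduce terms).
    temps_by_year = defaultdict(list)
    for year, temp in year_temps:
        temps_by_year[year].append(temp)

    # Reducer equivalent: one summary per year.
    return {
        year: {
            "min": min(temps),
            "max": max(temps),
            "mean": statistics.mean(temps),
            "median": statistics.median(temps),
        }
        for year, temps in sorted(temps_by_year.items())
    }
```

The resulting dictionary is keyed by year, which makes it straightforward to plot each statistic as a line over time.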

[Figure: Task02]

Task #3: Analysis

The main goal was to answer the question, 'Are SKC values affected by other values?' I concluded that in 1901, SCT was most strongly correlated with TEMP, BKN with SPD, and OVC with SPD.

The details are included in the task03 Jupyter notebook file.

[Figure: Task03]

Task #4: Analysis

The main goal was to find the stations that reported the highest and lowest temperatures in each year from 1901 to 1940.

The details are included in the task04 Jupyter notebook file.
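The notebook itself is not reproduced here. As a rough sketch of one way such a lookup could be done, the function below tracks the hottest and coldest reading per year along with the station that reported it. The name `yearly_extremes`, the `(station_id, year, temperature)` layout, and the rule that the first station seen wins a tie are all assumptions for illustration.

```python
def yearly_extremes(records, first_year=1901, last_year=1940):
    """Find the station with the highest and lowest temperature per year.

    `records` is an iterable of (station_id, year, temperature) tuples.
    On ties, the first station encountered is kept (an arbitrary choice).
    """
    extremes = {}
    for station_id, year, temp in records:
        if not first_year <= year <= last_year:
            continue  # outside the period of interest
        if year not in extremes:
            extremes[year] = {"max": (temp, station_id),
                              "min": (temp, station_id)}
            continue
        if temp > extremes[year]["max"][0]:
            extremes[year]["max"] = (temp, station_id)
        if temp < extremes[year]["min"][0]:
            extremes[year]["min"] = (temp, station_id)
    return extremes
```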

[Figure: Task04]
