Managing Time Series Data - Lab

Introduction

In the previous lecture, you learned that time series data are everywhere and that understanding them is an important skill for data scientists.

In this lab, you'll practice your previously learned techniques to import/load, clean and manipulate time series data.

The lab will cover how to perform time series analysis while working with large datasets. The dataset can be memory intensive so your computer will need at least 2GB of memory to perform some of the calculations.

Objectives

You will be able to:

  • Load time series data using Pandas and perform time series indexing
  • Perform index based slicing to create subsets of a time series
  • Change the granularity of a time series
  • Perform basic data cleaning operations on time series data

Let's get started!

We will start the lab by loading the required libraries:

  • pandas for data wrangling and manipulations
  • matplotlib for visualising timeseries data
  • statsmodels primarily for bundled datasets
# Load required libraries

Loading time series data

The StatsModels library comes bundled with built-in datasets for experimentation and practice. A detailed description of these datasets can be found here. Using StatsModels, the time series datasets can be loaded straight into memory.

In this lab, we'll use the "Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.", containing CO2 samples from March 1958 to December 2001. Further details on this dataset are available here.

We can bring in this data using the load_pandas() method, which returns a dataset object whose data attribute (dataset.data) holds the values as a pandas DataFrame.

# Load the "co2" dataset from sm.datasets

As a first exploratory step, let's check the type of CO2 and look at the first 15 entries of the DataFrame.

# Print the datatype of CO2 and check first 15 values

# datatype of CO2 is <class 'pandas.core.frame.DataFrame'>

#               co2
# 1958-03-29  316.1
# 1958-04-05  317.3
# 1958-04-12  317.6
# 1958-04-19  317.5
# 1958-04-26  316.4
# 1958-05-03  316.9
# 1958-05-10    NaN
# 1958-05-17  317.5
# 1958-05-24  317.9
# 1958-05-31    NaN
# 1958-06-07    NaN
# 1958-06-14    NaN
# 1958-06-21    NaN
# 1958-06-28    NaN
# 1958-07-05  315.8
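
The inspection step could look roughly like the sketch below. To stay self-contained, it rebuilds the 15 rows shown above by hand instead of loading the dataset; in the lab, `CO2` would be the DataFrame loaded from StatsModels.

```python
import numpy as np
import pandas as pd

# The first 15 weekly readings, copied from the output shown above
values = [316.1, 317.3, 317.6, 317.5, 316.4, 316.9, np.nan, 317.5,
          317.9, np.nan, np.nan, np.nan, np.nan, np.nan, 315.8]
index = pd.date_range(start="1958-03-29", periods=len(values), freq="W-SAT")
CO2 = pd.DataFrame({"co2": values}, index=index)

print(type(CO2))     # the class of the object
print(CO2.head(15))  # the first 15 rows
```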

With all the required packages imported and the CO2 dataset ready to go as a DataFrame, we can move on to indexing our data.

Data Indexing

You may have noticed that, by default, the dates have been set as the index of our pandas DataFrame. When working with time series data in Python, it's important to ensure that dates are used as index values and are stored as timestamp objects. Timestamp is the pandas equivalent of Python's datetime.datetime and is interchangeable with it in most cases. It's the type used for the entries that make up a DatetimeIndex and other time series oriented data structures in pandas. Further details can be found here.

We can confirm this assumption in Python by checking the index values of a pandas DataFrame with DataFrame.index.

# Confirm that date values are used for indexing purpose in the CO2 dataset 

# DatetimeIndex(['1958-03-29', '1958-04-05', '1958-04-12', '1958-04-19',
#                '1958-04-26', '1958-05-03', '1958-05-10', '1958-05-17',
#                '1958-05-24', '1958-05-31',
#                ...
#                '2001-10-27', '2001-11-03', '2001-11-10', '2001-11-17',
#                '2001-11-24', '2001-12-01', '2001-12-08', '2001-12-15',
#                '2001-12-22', '2001-12-29'],
#               dtype='datetime64[ns]', length=2284, freq='W-SAT')

The output above shows that our dataset clearly fulfills the indexing requirements. Look at the last line:

dtype='datetime64[ns]', length=2284, freq='W-SAT'

  • dtype='datetime64[ns]' confirms that the index is made of timestamp objects.
  • length=2284 shows the total number of entries in our time series data.
  • freq='W-SAT' tells us that we have 2,284 weekly (W) date stamps, each falling on a Saturday (SAT).
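
The three fields discussed above can also be inspected one at a time. The sketch below uses a hand-built five-row stand-in for `CO2` (the first five readings from the lab output) so that it runs on its own; the same attribute lookups would apply to the full dataset.

```python
import pandas as pd

# Small stand-in for the CO2 DataFrame: the first five weekly readings
index = pd.date_range(start="1958-03-29", periods=5, freq="W-SAT")
CO2 = pd.DataFrame({"co2": [316.1, 317.3, 317.6, 317.5, 316.4]}, index=index)

print(CO2.index)                           # the full DatetimeIndex
print(isinstance(CO2.index, pd.DatetimeIndex))
print(CO2.index.dtype)                     # datetime dtype of the entries
print(len(CO2.index))                      # number of entries
print(CO2.index.freqstr)                   # frequency string of the index
print(CO2.index.day_name().unique())       # weekday(s) the stamps fall on
```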

Resampling

Remember that, depending on the nature of the analytical question, the resolution of the timestamps can be changed to another frequency. For this dataset, we can resample the weekly readings to monthly average CO2 values using the resample() function. Let's:

  • Group the time series into buckets representing 1 month using the resample() function.
  • Apply the mean() function to each group (i.e. get the monthly average).
  • Combine the result as one row per monthly group.
# Group the timeseries into monthly buckets
# Take the mean of each group 
# get the first 10 elements of resulting timeseries


# 1958-03-01    316.100000
# 1958-04-01    317.200000
# 1958-05-01    317.433333
# 1958-06-01           NaN
# 1958-07-01    315.625000
# 1958-08-01    314.950000
# 1958-09-01    313.500000
# 1958-10-01           NaN
# 1958-11-01    313.425000
# 1958-12-01    314.700000
# Freq: MS, Name: co2, dtype: float64
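
A sketch of the resampling step, under the assumption that the monthly series is built from the `co2` column. It uses only the 15 weekly readings printed earlier in the lab, so March to June match the output above, while July is based on a single reading and therefore differs from the full-dataset value.

```python
import numpy as np
import pandas as pd

# First 15 weekly readings from the lab output (stand-in for the full dataset)
values = [316.1, 317.3, 317.6, 317.5, 316.4, 316.9, np.nan, 317.5,
          317.9, np.nan, np.nan, np.nan, np.nan, np.nan, 315.8]
index = pd.date_range(start="1958-03-29", periods=len(values), freq="W-SAT")
CO2 = pd.DataFrame({"co2": values}, index=index)

# "MS" buckets by calendar month and labels each bucket with the month start;
# mean() skips NaN readings and returns NaN only if a whole month is missing
CO2_monthly = CO2["co2"].resample("MS").mean()

print(CO2_monthly.head(10))
```

The name `CO2_monthly` is an illustrative choice, not a name the lab requires.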

Looking at the index values, we can see that our time series now carries data aggregated by month, shown as Freq: MS (month start).

Time-series Index Slicing for Data Selection

Slice our dataset to retrieve only the data points from January 1990 onwards.

# Slice the timeseries to contain data from 1990 onwards

# 1990-01-01    353.650
# 1990-02-01    354.650
#                ...   
# 2001-11-01    369.375
# 2001-12-01    371.020
# Freq: MS, Name: co2, Length: 144, dtype: float64
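
Open-ended slicing on a DatetimeIndex can be sketched as follows. The series here is a hypothetical stand-in filled with placeholder integers, not CO2 readings; only the monthly index mirrors the lab.

```python
import pandas as pd

# Hypothetical monthly series: placeholder integers, NOT real CO2 values
index = pd.date_range(start="1988-01-01", end="2001-12-01", freq="MS")
CO2_monthly = pd.Series(range(len(index)), index=index, name="co2")

# A partial date string such as "1990" selects from the start of that year
from_1990 = CO2_monthly.loc["1990":]

print(from_1990.head())
print(len(from_1990))
```

Twelve years of monthly stamps (1990 through 2001) give 144 entries, which agrees with the Length: 144 reported above.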

Next, slice the time series for a given time interval. Let's retrieve the data from January 1990 to January 1991.

# Retrieve the data from 1st Jan 1990 to 1st Jan 1991

# 1990-01-01    353.650
# 1990-02-01    354.650
# 1990-03-01    355.480
# 1990-04-01    356.175
# 1990-05-01    357.075
# 1990-06-01    356.080
# 1990-07-01    354.675
# 1990-08-01    352.900
# 1990-09-01    350.940
# 1990-10-01    351.225
# 1990-11-01    352.700
# 1990-12-01    354.140
# 1991-01-01    354.675
# Freq: MS, Name: co2, dtype: float64
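
A sketch of interval slicing, again on a hypothetical series of placeholder integers rather than real CO2 values. The point to notice is that label-based slicing on a DatetimeIndex includes both endpoints, unlike positional slicing of a Python list.

```python
import pandas as pd

# Hypothetical monthly series: placeholder integers, NOT real CO2 values
index = pd.date_range(start="1989-01-01", end="1992-12-01", freq="MS")
CO2_monthly = pd.Series(range(len(index)), index=index, name="co2")

# Both the start and the end label are included in the result
interval = CO2_monthly.loc["1990-01-01":"1991-01-01"]

print(interval)
print(len(interval))
```

Because the end label is inclusive, the result holds 13 monthly entries, matching the 13 rows listed above.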

Missing Values

Check if there are missing values in the data set.

# Get the total number of missing values in the time series

# 5
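
Counting missing values might look like the sketch below. It rebuilds only the first ten monthly values shown earlier in the lab, so it finds 2 gaps rather than the 5 present in the full monthly series.

```python
import numpy as np
import pandas as pd

# First ten monthly means, copied from the resampling output above
monthly_values = [316.1, 317.2, 317.433333, np.nan, 315.625,
                  314.95, 313.5, np.nan, 313.425, 314.7]
index = pd.date_range(start="1958-03-01", periods=len(monthly_values), freq="MS")
CO2_monthly = pd.Series(monthly_values, index=index, name="co2")

# isnull() flags each NaN as True; sum() counts the True values
n_missing = CO2_monthly.isnull().sum()
print(n_missing)
```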

Remember that missing values can be filled in a multitude of ways. Here, use backward filling: look for the next valid entry in the time series and fill each gap with that value. Next, check whether your attempt was successful by counting the missing values again.

# perform backward filling of missing values
# check again for missing values

# 0
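
Backward filling can be sketched with `Series.bfill()`, shown here on the first ten monthly values from the lab output rather than the full series. The name `CO2_filled` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# First ten monthly means, copied from the resampling output above
monthly_values = [316.1, 317.2, 317.433333, np.nan, 315.625,
                  314.95, 313.5, np.nan, 313.425, 314.7]
index = pd.date_range(start="1958-03-01", periods=len(monthly_values), freq="MS")
CO2_monthly = pd.Series(monthly_values, index=index, name="co2")

# bfill() copies the next valid observation backwards into each gap
CO2_filled = CO2_monthly.bfill()

print(CO2_filled)
print(CO2_filled.isnull().sum())
```

In this sample, June 1958 takes the July value and October 1958 takes the November value. Note that backward filling cannot fill a gap at the very end of a series, since no later valid entry exists there.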

Great! Now your time series is ready for visualization and further analysis.

Summary

In this introductory lab, you learned how to load and manage a time series object in Python using pandas. You checked that timestamp values serve as the data index, and you practiced basic data handling techniques for getting time series data ready for further analysis.
