The purpose of this project is to demonstrate how to prepare a tidy dataset from a given dataset.
For this project, an existing dataset was used, which was collected from the accelerometers of the Samsung Galaxy S smartphone. A full description of this data is available here.
Dataset used for the project: getdata_projectfiles_UCI HAR Dataset.zip
The following files are created for this project:
- CodeBook.md (a code book that describes the variables, the data, and any transformations or work that was performed to clean up the data)
- run_analysis.R ( script to create a tidy data set )
- tidy_aggregated_UCI_HAR_Dataset.txt ( tidy dataset created by the script )
Use the following code to read this tidy dataset in R:
tidy_data <- read.table("tidy_aggregated_UCI_HAR_Dataset.txt", header = TRUE)
Several files come with the source data. The following files are used for this project:
Code list files:
- 'features.txt': It contains the list of all features.
- 'activity_labels.txt': It links the activity labels with their activity name.
Data files:
There are two sets of data, one for training and one for testing. The following files make up the training data set; equivalent files are present for the test data set.
- 'X_train.txt' : Measurements collected for the training data. It is a 561-feature vector with time and frequency domain variables. All the values are normalized and bounded within [-1,1].
- 'y_train.txt' : Activity labels for the training data.
- 'subject_train.txt' : Each row of this file identifies the subject who performed the activity. Its range is from 1 to 30.
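As an illustration of how the three files of one data set line up row by row, here is a minimal sketch in base R. The helper name read_har_set and the column names activity_id and subject_id are assumptions made for this example, and the sketch assumes each set lives in its own subfolder (train or test) of the extracted archive.

```r
# Hypothetical helper: read the three files of one data set ("train" or "test").
# base_dir is the extracted dataset folder, e.g. "./UCI HAR Dataset".
read_har_set <- function(base_dir, set_name) {
  set_dir <- file.path(base_dir, set_name)

  # One row per observation, one column per feature
  measurements <- read.table(
    file.path(set_dir, paste0("X_", set_name, ".txt")), header = FALSE)

  # One activity label per observation
  activity <- read.table(
    file.path(set_dir, paste0("y_", set_name, ".txt")),
    header = FALSE, col.names = "activity_id")

  # One subject identifier per observation
  subject <- read.table(
    file.path(set_dir, paste0("subject_", set_name, ".txt")),
    header = FALSE, col.names = "subject_id")

  # The three files must describe the same observations
  stopifnot(nrow(measurements) == nrow(activity),
            nrow(activity) == nrow(subject))

  list(measurements = measurements, activity = activity, subject = subject)
}
```

A call such as read_har_set("./UCI HAR Dataset", "train") would return the three data frames for the training set, ready to be combined column-wise.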
The primary data is collected from the raw accelerometer and gyroscope signals of the Samsung device. Several other features are derived from this raw data. Finally, there are aggregated features that give measures such as mean, min, max and standard deviation. See the features_info.txt file for more information about the source data.
An R script extracts the data to a local folder and then executes a sequence of steps to create the tidy data.
Commands used for extracting the source files (the variables sourceUrl, sourceFile and targetFolder must be set beforehand):
#to download the file from sourceUrl to a local file, sourceFile
download.file(sourceUrl, sourceFile)
#to unzip the file contents to a local folder
unzip(sourceFile, exdir = targetFolder)
#to read the data into data frames
features <- read.table("./UCI HAR Dataset/features.txt",header=FALSE)
The following steps are performed to transform the data into a tidy dataset:
1. The training and test datasets are extracted individually and combined into a single dataset using the cbind() and rbind() functions. After merging, column names are assigned to all measurements using features.txt as a reference. The make.names() function is used to derive valid and unique column names.
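The merging step can be sketched on toy data as follows. The data frames below are small stand-ins invented for the example (two measurements instead of 561), and the column names subject_id and activity_id are assumptions; only cbind(), rbind() and make.names() come from the description above.

```r
# Toy stand-in for features.txt: feature index and feature name
features <- data.frame(id = 1:2,
                       name = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X"))

# cbind() puts subject, activity and measurements side by side for each set
train <- cbind(
  data.frame(subject_id = c(1L, 3L)),
  data.frame(activity_id = c(5L, 4L)),
  data.frame(V1 = c(0.28, 0.27), V2 = c(-0.99, -0.98))
)
test <- cbind(
  data.frame(subject_id = 2L),
  data.frame(activity_id = 1L),
  data.frame(V1 = 0.25, V2 = -0.93)
)

# rbind() stacks the two sets into a single dataset
merged <- rbind(train, test)

# make.names() replaces characters such as "-", "(" and ")" with "."
names(merged) <- c("subject_id", "activity_id",
                   make.names(features$name, unique = TRUE))

print(names(merged))
# "subject_id" "activity_id" "tBodyAcc.mean...X" "tBodyAcc.std...X"
```

The three consecutive dots in names such as tBodyAcc.mean...X come from the "()-" sequence in the original feature names.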
2. The source dataset contains 561 features for each record. For this project, only the measurements of mean and standard deviation are used. The required columns are selected using the select() function, with the restriction expressed through the contains() helper: all columns that contain the word "mean" or "std" are selected. This gives 66 measurements containing the mean and the standard deviation.
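The selection can be illustrated in base R with a regular expression instead of select() and contains(). The pattern below is an assumption about one way to isolate the mean and standard deviation columns after make.names() has been applied; it is not taken from run_analysis.R. A bare match on "mean" would also catch names such as fBodyAcc.meanFreq...X, so the sketch matches ".mean" or ".std" followed by two dots.

```r
# A few column names as they look after make.names() (illustrative subset)
column_names <- c("subject_id", "activity_id",
                  "tBodyAcc.mean...X", "tBodyAcc.std...X",
                  "fBodyAcc.meanFreq...X", "tBodyAccMag.mean..",
                  "angle.tBodyAccMean.gravity.")

# TRUE for names that came from a "mean()" or "std()" feature
is_mean_or_std <- grepl("\\.(mean|std)\\.\\.", column_names)

# Always keep the identifier columns
keep <- column_names %in% c("subject_id", "activity_id") | is_mean_or_std

selected_names <- column_names[keep]
print(selected_names)
```

With a data frame in hand, the same logical vector can be used as a column index, for example merged[, keep].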
3. Activity names are stored in a separate file. The dataset from step 2 is merged with the activity_labels data to get the activity name, using a join on activity_id.
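A join on activity_id can be sketched with base R's merge(), used here as a stand-in for whichever join function the script calls. The toy data frames and the column name activity are assumptions for the example. Note that merge() sorts the result by the join key, so the row order changes.

```r
# Toy stand-in for activity_labels.txt (first three of the six activities)
activity_labels <- data.frame(
  activity_id = 1:3,
  activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS")
)

# Toy stand-in for the dataset produced by the selection step
selected <- data.frame(
  subject_id = c(1L, 1L, 2L),
  activity_id = c(3L, 1L, 3L),
  tBodyAcc.mean...X = c(0.28, 0.27, 0.25)
)

# Inner join: every row of `selected` gains the matching activity name
labelled <- merge(selected, activity_labels, by = "activity_id")
print(labelled)
```

Because every activity_id in the measurements has a matching label, the joined dataset has the same number of rows as the input.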
4. The following substitutions are made to the column names to give the features more descriptive names:
- In column names starting with "t", the "t" is replaced with "Time."
- In column names starting with "f", the "f" is replaced with "Frequency."
- In column names containing "...X", it is replaced with ".X" (same rule for "...Y" and "...Z")
- In column names containing ".mean", it is replaced with ".Mean"
- In column names containing ".std", it is replaced with ".Std"
The modified column names are applied to the final data set.
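The substitutions listed above can be sketched with sub() and regular expressions. The exact patterns are assumptions made for this example; in particular, the leading "t" or "f" is only replaced when an uppercase letter follows, so that identifier columns are left alone.

```r
old_names <- c("subject_id", "activity",
               "tBodyAcc.mean...X", "fBodyGyro.std...Z")

new_names <- old_names
# Leading "t" / "f" become "Time." / "Frequency."
new_names <- sub("^t([A-Z])", "Time.\\1", new_names)
new_names <- sub("^f([A-Z])", "Frequency.\\1", new_names)
# "...X", "...Y", "...Z" collapse to ".X", ".Y", ".Z"
new_names <- sub("\\.\\.\\.([XYZ])", ".\\1", new_names)
# Capitalise the statistic
new_names <- sub("\\.mean", ".Mean", new_names)
new_names <- sub("\\.std", ".Std", new_names)

print(new_names)
# "subject_id" "activity" "Time.BodyAcc.Mean.X" "Frequency.BodyGyro.Std.Z"
```

The result would then be assigned back with names(dataset) <- new_names.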
5. Create an independent tidy data set with the average of each variable for each activity and each subject.
The dataset from the previous step is transformed into another tidy dataset containing the average of each variable, grouped by subject and activity.
The ddply() and numcolwise() functions from the plyr package are used to generate the new tidy dataset:
new_tidy_data <- ddply(new_selected_data, .(subject_id,activity), numcolwise(mean))
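For readers without plyr installed, roughly the same grouping can be sketched in base R with aggregate(). This is an alternative shown for illustration, not the call used in run_analysis.R, and the toy data frame is invented for the example.

```r
# Toy stand-in for the renamed dataset: several rows per subject/activity pair
new_selected_data <- data.frame(
  subject_id = c(1L, 1L, 1L, 2L),
  activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  Time.BodyAcc.Mean.X = c(0.2, 0.4, 0.1, 0.5),
  Time.BodyAcc.Std.X = c(-0.9, -0.7, -0.8, -0.6)
)

# Average every remaining column within each subject/activity group
new_tidy_data <- aggregate(. ~ subject_id + activity,
                           data = new_selected_data, FUN = mean)

# Sort by subject, then activity, for a predictable row order
new_tidy_data <- new_tidy_data[order(new_tidy_data$subject_id,
                                     new_tidy_data$activity), ]
print(new_tidy_data)
```

The result has one row per subject and activity combination, with each measurement column holding the group mean.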
The steps above are implemented in R. The source code is available in the repository as run_analysis.R. At the end of the process, the script creates a tidy dataset and saves it in the working directory as tidy_aggregated_UCI_HAR_Dataset.txt.
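Saving and re-reading the file can be sketched as a round trip. The write.table() arguments are an assumption (the exact call in run_analysis.R is not shown here), and the sketch writes to a temporary folder rather than the working directory.

```r
# Toy stand-in for the aggregated tidy dataset
tidy <- data.frame(
  subject_id = c(1L, 2L),
  activity = c("WALKING", "SITTING"),
  Time.BodyAcc.Mean.X = c(0.3, 0.5)
)

# Written to a temporary folder so the example leaves no files behind
out_file <- file.path(tempdir(), "tidy_aggregated_UCI_HAR_Dataset.txt")

# row.names = FALSE keeps the file to a header line plus data rows
write.table(tidy, out_file, row.names = FALSE)

# Reading it back with header = TRUE restores the column names
tidy_data <- read.table(out_file, header = TRUE)
print(names(tidy_data))
```

Writing without row names is what lets read.table(..., header = TRUE) map the header line onto the columns one to one.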
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.