'run_analysis.R' script does the following action sequence:
- Downloading and extracting raw data
- Reading dictionary entities - Activity and Feature
- Reading and cleaning test dataset in the following way:
- Reads file 'subject_test.txt' with subjects and 'y_test.txt' with activities
- Merging observations set (X_test.txt file) with Subject and Activity which were loaded in the previous step. Activity name is used as activity value, not ID
- Each column is a feature. Sets column names accordingly
- Extract only columns 'activity', 'subject' and columns containing 'mean()' and 'std()' in names. 'Sqldf' package was used to achieve this.
- Reading and cleaning train dataset in the same way as test dataset (see item 3). Dev note: reading and cleaning dataset code was extracted to separate function which was reused in the 3th and 4th items.
- Union both test and train datasets. Sorting result dataset by Subject and Activity.
- Transforming column names in readable form.
- During the cleaning data, raw symbols '-', '(', ')' was replaced with underline symbol. For example, 'mean()' was replaced with 'mean__'. So, firstly scripts replaces several underlined symbols with one: '__' is going to be '_'
- Then scripts removes underlined symbols from the end of a label
- Observations contain values for time and frequency domains. The labels start with 't' and 'f' accordingly. Scripts expands them to 'Time' and 'Frequency'. For example, 'tBodyAcc' is transformed to 'TimeBodyAcc'
- 'Mag' in the labels was expanded to 'Magnitude'
- 'BodyBody' typo in the labels was repladed with 'Body'
- Creating the second dataset from the first one. The new dataset contains only average values of all observations of the 1st dataset grouping by subject and it's activity.