Overview

This is the final project for the Data Science Specialization Course: Practical Machine Learning.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment.

Reading in the Data

In this data set, NA values are represented in three ways: the string "NA", an empty string, and "#DIV/0!". Account for all three in read_csv().

library(tidyverse)
library(caret)
pml_train <- read_csv("./data/pml-training.csv", na=c("NA", "", "#DIV/0!"))
pml_test <- read_csv("./data/pml-testing.csv", na=c("NA", "", "#DIV/0!"))
print(paste("Train", paste(dim(pml_train), collapse=" ")))
## [1] "Train 19622 160"
print(paste("Test", paste(dim(pml_test), collapse=" ")))
## [1] "Test 20 160"

Cleaning the Data

Many columns are almost entirely NAs. Examine the distribution of NAs.

# get the counts of NA in each column
na_count <- sapply(pml_train, function(y) sum(is.na(y)))

# look at the distribution of counts
table(na_count)
## na_count
##     0     1 19216 19217 19218 19220 19221 19225 19226 19227 19248 19293 
##    57     3    67     1     1     1     4     1     4     2     2     1 
## 19294 19296 19299 19300 19301 19622 
##     1     2     1     4     2     6

We see 57 columns with no NAs and 3 columns with a single NA. The remaining 100 columns are at least 97% NA (19216 or more of the 19622 rows).
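
As a quick sanity check (output not shown), the columns in each group can be listed by indexing the counts by name:

# columns with exactly one NA each
names(na_count[na_count == 1])

# columns that are almost entirely NA
names(na_count[na_count >= 19216])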

Also, a visual examination of the data shows that the first 6 columns (row index, user name, and timestamp/window fields) are identifiers rather than sensor measurements, so they are not useful as predictors. Remove these as well.

# remove the columns with 19216 or more NAs
cols_to_remove <- which(na_count >= 19216)

# also remove the first 6 columns
cols_to_remove <- union(cols_to_remove, 1:6)

# whatever is done to one data set must be done to the other
train_data <- subset(pml_train, select = -cols_to_remove)
test_data <- subset(pml_test, select = -cols_to_remove)
print(paste("train_data", paste(dim(train_data), collapse=" ")))
## [1] "train_data 19622 54"
print(paste("test_data", paste(dim(test_data), collapse=" ")))
## [1] "test_data 20 54"

So we are down to 19622 rows and 54 columns on train_data.

Keep only the complete cases (rows with no NAs).

# only keep records with no NAs
train_data <- train_data[complete.cases(train_data),]
print(paste("train_data", paste(dim(train_data), collapse=" ")))
## [1] "train_data 19621 54"
print(paste("test_data", paste(dim(test_data), collapse=" ")))
## [1] "test_data 20 54"

We see one row was removed from train_data.
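
If you are curious which row was dropped, it can be located on the pre-filtered data (a quick check only; output not shown):

# the incomplete row in the reduced, but not yet filtered, data
which(!complete.cases(subset(pml_train, select = -cols_to_remove)))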

Normally, data would be split into something like 70% train and 30% test, or perhaps 60% train, 20% validation, and 20% test.

Here we see that train_data (pml_train with fewer columns) has 19621 records, but test_data (pml_test with fewer columns) has only 20 records. 20 records is much too small to be useful as a test set for estimating model accuracy, so pml_test will only be used for making the predictions required by the final quiz in this course.

We need to create a train and test set from train_data.

set.seed(314159) # so createDataPartition is reproducible
indexes <- createDataPartition(train_data$classe, p=0.70, list=F)
train <- train_data[indexes, ]
test <- train_data[-indexes, ]
print(paste("train", paste(dim(train), collapse=" ")))
## [1] "train 13737 54"
print(paste("test", paste(dim(test), collapse=" ")))
## [1] "test 5884 54"

Building the Model

For Performance Only

My Linux computer has 4 cores. Use all 4 cores to create the Random Forest.

library(doParallel)

# be sure to stop an existing cluster before starting a new one
if (exists("cl")) {
    stopCluster(cl)
}

# FORK clusters only work on Unix-like systems (Linux, macOS), not Windows
cl <- makeCluster(4, type="FORK")
registerDoParallel(cl)
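
On Windows (where FORK is unavailable), or if you prefer not to hard-code the core count, a PSOCK cluster with detectCores() should work; a minimal sketch of that alternative (not run here):

# PSOCK works on all platforms; leave one core free for the operating system
cl <- makeCluster(detectCores() - 1, type="PSOCK")
registerDoParallel(cl)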

Create the Random Forest

Use 5-fold cross-validation and allow parallel processing.

controlRF <- trainControl(method = "cv",
                          number = 5,
                          allowParallel = TRUE)
modelRF <- train(classe ~ ., data=train, method="rf", trControl=controlRF, ntree=250)
modelRF
## Random Forest 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10991, 10990, 10989, 10989, 10989 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9922110  0.9901461
##   27    0.9965785  0.9956722
##   53    0.9942492  0.9927258
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

Estimate the Accuracy

The caret package automatically tuned the mtry hyperparameter to 27. Because the train data was used to find the best model, we cannot also use it to estimate the accuracy of that model without being overly optimistic; we use the held-out test set instead.
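
The selected value can also be read directly off the fitted object rather than from the printout (output not shown):

# the tuning parameter values caret selected
modelRF$bestTune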

# use the best RF model (mtry=27) on the test data and compute the accuracy
predictRf <- predict(modelRF, test)
confusionMatrix(test$classe, predictRf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    0    0    0    0
##          B    3 1135    1    0    0
##          C    0    3 1023    0    0
##          D    0    0    6  957    1
##          E    0    2    0    1 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9971          
##                  95% CI : (0.9954, 0.9983)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9963          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9956   0.9932   0.9990   0.9991
## Specificity            1.0000   0.9992   0.9994   0.9986   0.9994
## Pos Pred Value         1.0000   0.9965   0.9971   0.9927   0.9972
## Neg Pred Value         0.9993   0.9989   0.9986   0.9998   0.9998
## Prevalence             0.2848   0.1937   0.1751   0.1628   0.1835
## Detection Rate         0.2843   0.1929   0.1739   0.1626   0.1834
## Detection Prevalence   0.2843   0.1936   0.1744   0.1638   0.1839
## Balanced Accuracy      0.9991   0.9974   0.9963   0.9988   0.9992

We see from above that the estimated accuracy is 99.71%, with a 95% confidence interval of 99.54% to 99.83%.

As this accuracy was measured on data not used to train the model, 99.71% is the out-of-sample accuracy. The out-of-sample error rate is the complement of this, which is 0.29%.
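
These numbers can also be pulled out of the confusionMatrix object directly rather than read off the printout; a minimal sketch:

# estimated out-of-sample error = 1 - accuracy on the held-out test set
cm <- confusionMatrix(test$classe, predictRf)
1 - cm$overall["Accuracy"]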

Predict on test_data for Quiz

# test_data is pml_test with appropriate columns removed
predict(modelRF, test_data)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Housekeeping for Parallel Processing Only

It's a good idea to shut down the parallel cluster explicitly so the worker processes are released and do not linger.

stopCluster(cl)
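
If more models will be trained later in the same R session, it also helps to re-register the sequential backend so foreach does not try to use the stopped cluster:

# return foreach (and therefore caret) to sequential processing
registerDoSEQ()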