
Predicting Human Activity Using PCA Features

Pradeep Adhokshaja
7 min read · May 30, 2019


Motivation

When working on a regression/classification problem involving data with a large number of features, we are often susceptible to the “Curse of Dimensionality”. As we keep increasing the number of features, we will reach a point where our regressor/classifier fails to perform well on test data.

This is due to the increase in data sparsity brought about by the larger number of features. Increased data sparsity makes it easier for our models to overfit, causing them to fail on future data.

This “Curse” can be avoided by reducing the number of features. This practice is called Dimensionality Reduction. Broadly speaking, there are two types of dimensionality reduction techniques.

  1. Feature Selection
  2. Feature Extraction

Feature Selection

Feature Selection involves discarding features that are irrelevant to the explanation of the underlying data.

Feature Extraction

Instead of discarding features, feature extraction techniques create a new set of features from combinations of the original ones. A small subset of these new features is often enough to explain the data effectively. One such method is Principal Component Analysis (PCA), which allows the data to be represented in fewer dimensions.

In this post, I will predict human activity by applying PCA to the Human Activity Detection data found on Kaggle. The scripts for the subsequent data analysis are written in R.

Principal Component Analysis Procedure

In PCA, the original features are linearly combined and transformed into principal components that are uncorrelated with each other.

The procedure to find principal components is as follows

Let X be a matrix that represents the set of features that we are running PCA on. Each column in this matrix represents a feature.

  1. Subtract each column's mean from it and divide by its standard deviation.
  2. Let the new normalized matrix be Z. Using this normalized matrix, calculate the covariance matrix, i.e. transpose(Z)*Z (up to a constant factor of 1/(n-1), which does not affect the eigenvectors). Let this covariance matrix be C.
  3. Decompose the covariance matrix C into eigenvalues and eigenvectors. Here the eigen() function is used. The eigenvector corresponding to the largest eigenvalue is the direction of maximum variance.
  4. To acquire the principal components, arrange the eigenvalues in descending order and reorder the eigenvectors accordingly. This reordering is taken care of by the eigen() function.
  5. Now that we have the orthogonal directions of maximum variance, we project the data onto them to create the principal components. This is achieved by multiplying the scaled feature matrix Z by the reordered eigenvector matrix. The variance explained by each component is its eigenvalue divided by the sum of all eigenvalues.

The first principal component accounts for the largest possible variability in the data.
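
Before implementing the full function, here is a minimal sketch of the five steps on a tiny two-feature matrix (the numbers are made up purely for illustration):

## Toy two-feature matrix (illustrative numbers only)
X <- matrix(c(2.5, 0.5, 2.2, 1.9, 3.1,
              2.4, 0.7, 2.9, 2.2, 3.0), ncol = 2)

Z <- scale(X)                       # step 1: center and scale each column
C <- t(Z) %*% Z                     # step 2: covariance matrix (up to a constant)
decomp <- eigen(C)                  # steps 3 & 4: eigenvalues in decreasing order
PCs <- Z %*% decomp$vectors         # step 5: project the scaled data
decomp$values / sum(decomp$values)  # proportion of variance explained by each component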

Implementing the PCA function in R

The steps mentioned in the previous section are implemented in the pca_function() below. The function returns the following:

  1. The projected vectors
  2. The eigenvalues
  3. The directions along which the data is projected (the eigenvectors)
pca_function <- function(x, k){

  ## x is the input data frame; k is the number of components to keep

  ## Step 1: normalize each column (subtract the mean, divide by the standard deviation)
  x_scaled <- scale(x)

  ## Convert the data frame to a matrix
  x_mat <- as.matrix(x_scaled)

  ## Step 2: calculate the covariance matrix
  cov_x <- t(x_mat) %*% x_mat

  ## Steps 3 & 4: eigenvalue decomposition; eigen() returns the eigenvalues
  ## in decreasing order together with the matching eigenvectors
  decomp_eigen <- eigen(cov_x)
  eigen_values <- decomp_eigen$values
  eigen_vectors <- decomp_eigen$vectors

  ## Keep the top k eigenvalues
  eigen_values <- eigen_values[1:k]

  ## Step 5: project the scaled data onto the eigenvectors and keep the first k columns
  projected_vectors <- x_mat %*% eigen_vectors
  projected_vectors <- projected_vectors[, 1:k]

  ## Return the k projections, the top k eigenvalues and the full set of eigenvectors
  return(list(projected_vectors = projected_vectors,
              eigen_values = eigen_values,
              basis_vectors = eigen_vectors))
}

Returning the results as a named list lets us access each element simply by using the $ operator.
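
As a quick sanity check (not part of the original analysis), we can run pca_function() on the four measurement columns of R's built-in iris data and compare the proportions of explained variance with prcomp(). The component signs may differ, but the variance proportions should match.

## Illustrative check on the built-in iris measurements
toy <- iris[, 1:4]

## Keep all four components so the eigenvalues sum to the total variance
pca_toy <- pca_function(toy, k = 4)
pca_toy$eigen_values / sum(pca_toy$eigen_values)   # proportion of variance per component

## prcomp() with centering and scaling should report the same proportions
summary(prcomp(toy, center = TRUE, scale. = TRUE))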

Data Read and Feature Extraction

As mentioned earlier, I will be using the Human Activity Detection dataset on Kaggle. This data comprises 561 features acquired through sensors for each type of human activity.

These activities are classified as STANDING, SITTING, LAYING, WALKING, WALKING DOWNSTAIRS and WALKING UPSTAIRS.

The idea is to use PCA to reduce the number of features without much loss of information. PCA gives us a new set of features, called principal components, which are arranged in descending order of how much variability they explain in the data.

The code for this feature extraction procedure is as follows

set.seed(42)

libraries_needed <- c('tidyr','dplyr','ggplot2','caret','purrr','rlang')
lapply(libraries_needed, require, character.only = TRUE)

data_raw <- read.csv('train.csv', stringsAsFactors = FALSE)
data_raw_pca <- data_raw %>% select(-rn, -activity)

train_index <- sample(1:nrow(data_raw_pca), 0.75 * nrow(data_raw_pca))
data_raw_pca_train <- data_raw_pca[train_index, ]
data_raw_pca_test <- data_raw_pca[-train_index, ]

pca_decomposed_data <- pca_function(data_raw_pca_train,561)

The dataset is divided into a train and test set. The PCA is run on the train set.

The principal components that are created can be accessed through the following command.

pca_decomposed_data$projected_vectors

Choosing the Principal Components

Now that we have the principal components ready, how do we choose which ones to use?

The principal components are arranged in descending order of how well they explain the data. Each component’s ability to explain the data is given by the eigenvalue of that principal component divided by the sum of all eigenvalues.

To choose the principal components, we plot the cumulative variance explained by the principal components.

eigen_values_and_vectors <- data.frame(principal_component = 1:ncol(data_raw_pca_train),
                                       eigen_values = pca_decomposed_data$eigen_values)

eigen_values_and_vectors %>%
  mutate(var = eigen_values / sum(eigen_values)) %>%
  mutate(cumulative_var = cumsum(var)) %>%
  ggplot(aes(x = principal_component, y = cumulative_var)) +
  geom_line() +
  geom_point()

The first 100 principal components explain about 95% of the variability in the data, which is pretty good!

We started off with 561 features and reduced them to 100. That is an 82% reduction in the number of features!

Fitting a multinomial logistic model

Multinomial logistic regression models the log odds of each class relative to a base class. Here, I have chosen SITTING as the base class. The model uses the derived principal components to estimate the human activity.

library(nnet)
## Build the training frame from the first 100 principal components (this step is implied in the original post)
data_pca_train <- as.data.frame(pca_decomposed_data$projected_vectors[, 1:100])
data_pca_train$activity <- as.factor(data_raw[train_index, ]$activity)
data_pca_train$activity <- relevel(data_pca_train$activity, ref = 'SITTING')
model_logistic_pca <- multinom(activity ~ ., data = data_pca_train)
# broom::tidy(model_logistic_pca)
test <- predict(model_logistic_pca, newdata = data_pca_train %>% select(-activity))
real_pred <- data.frame(real = data_pca_train$activity, test = as.character(test))

Looking at training performance

One of the best ways to look at classifier performance is through a confusion matrix. A confusion matrix is a summary of the prediction results of a classifier. This shows where the classifier got “confused” in making a classification.

It is a cross table with actual and predicted values.
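
As a minimal illustration (this snippet is not in the original post), a plain cross table of the actual and predicted training labels already shows the raw counts:

## Raw counts of actual vs. predicted labels on the training set
table(actual = real_pred$real, predicted = real_pred$test)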

The confusion matrix can be displayed using the confusionMatrix() function from the caret package.

caret::confusionMatrix(real_pred$real,real_pred$test)

The above command gives the following output

Output of the confusionMatrix() command of caret
  • 98% of the data points were classified properly
  • Sensitivity and specificity rates range from 95% to 100%

The training accuracy of our model seems impressive. But how does this perform on test data?

Testing the model

The issue with the test data is that it has not been transformed into principal components. To do that, we just have to project the normalized test data along the directions of maximum variance. These directions are represented by the eigenvectors returned by pca_function().

The following code does just that and uses the model built in the previous section to estimate the classes of the test set.

## Project the scaled test features onto the eigenvectors found on the training set
data_pca_test <- as.matrix(scale(data_raw_pca_test)) %*% pca_decomposed_data$basis_vectors
data_pca_test <- as.data.frame(data_pca_test)
data_pca_test <- data_pca_test[, 1:100]
data_pca_test$activity <- data_raw[-train_index, ]$activity

test_prediction <- predict(model_logistic_pca,newdata = data_pca_test %>% select(-activity))
real_pred_test <- data.frame(real=data_pca_test$activity,test=as.character(test_prediction))
caret::confusionMatrix(real_pred_test$real,real_pred_test$test)

The output of the test data confusion matrix is as follows

Model test output
  • 94% of test data is classified correctly.
  • There is some decrease in sensitivity when it comes to classifying SITTING correctly. 87% of SITTING data points were classified correctly.

Cross-Validation

A primary characteristic of a good model is its ability to perform well on unseen data. To test how good a classification model is in predicting unseen data, a technique called cross-validation comes to the rescue.

There are many types of cross-validation techniques, the most popular one being K-folds cross-validation.

K Folds Cross-Validation

In K folds cross-validation, the data is split into K equal parts. Each of these parts is used as a test set while the rest is used as a train set. The K test and train scores are observed for stability.
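
As an aside, caret also provides createFolds() to generate fold indices; the sketch below is only an alternative illustration and is not what this post uses. The hand-rolled split used here follows.

## Alternative (illustrative only): stratified fold indices from caret
folds <- caret::createFolds(as.factor(data_raw$activity), k = 10)
str(folds)  # a list of 10 vectors of test-row indices, one per fold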

The following code implements K Folds Cross Validation with K=10.

random_data <- dplyr::slice(data_raw,sample(1:n()))
test_idx <- round(seq(1,nrow(data_raw),by=nrow(data_raw)/11))
accuracy_df <- data.frame(iteration= as.numeric(),training_score= as.numeric(),testing_score= as.numeric())
for(i in 1:10){


## Training

test_data <- slice(random_data,test_idx[i]:test_idx[i+1])
train_data <- slice(random_data,-test_idx[i]:-test_idx[i+1])
pca_final_train <- pca_function(train_data %>% select(-rn,-activity),100)

data_pca_train <- as.data.frame(pca_final_train$projected_vectors)
data_pca_train$activity <- as.character(train_data$activity)
#data_pca_train$activity <- relevel(data_pca_train$activity,ref='SITTING')
model_cv <- multinom(activity~.,data=data_pca_train)

train_predict <- as.character(predict(model_cv,newdata=data_pca_train %>% select(-activity)))

training_score <- sum(as.character(train_predict)==as.character(train_data$activity))/nrow(train_data)

## Testing
data_pca_test <- as.matrix(scale(test_data %>% select(-rn,-activity))) %*% pca_final_train$basis_vectors
data_pca_test <- as.data.frame(data_pca_test)
data_pca_test <- data_pca_test[,1:100]
data_pca_test$activity <- test_data$activity

predict_cv <- as.character(predict(model_cv,newdata=data_pca_test %>% select(-activity)))
testing_score <- sum(as.character(predict_cv)==as.character(test_data$activity))/nrow(test_data)
accuracy_df <- rbind(accuracy_df,data.frame(iteration=i,testing_score=testing_score,training_score =training_score ))
}
scaleFUN <- function(x) sprintf("%.f", x)
accuracy_df %>%
  tidyr::gather(type_of_score, value, 2:3) %>%
  mutate(type_of_score = ifelse(grepl('training', type_of_score), 'Training Score', 'Testing Score')) %>%
  ggplot(aes(x = iteration, y = value, color = type_of_score)) +
  geom_line() +
  scale_x_continuous(labels = scaleFUN) +
  labs(x = 'Iteration Number', y = 'Score')

Looking at the test and train scores of each iteration of the 10-fold cross-validation procedure
  • Training scores are stable, and testing scores range from 0.94 to 0.97, which is acceptable for a good multi-class classifier.
  • The acceptable test scores depend on the type of problem the classifier is trying to solve. In cases such as detecting benign/malignant tumors, the threshold of acceptance will be higher.

We were able to get impressive classifier performance even though we reduced the number of features by 82%.

Closing Remarks

  • PCA enables us to reduce the number of features by projecting data into a different space.
  • The principal components are not correlated with each other.
  • A subset of the principal components is sufficient to explain the data.

Thanks for taking the time to read! Please feel free to comment with pointers to improve the article.

Useful References

  1. https://medium.com/@cxu24/why-dimensionality-reduction-is-important-dd60b5611543
  2. https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
  3. https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
  4. http://www.math.union.edu/~jaureguj/PCA.pdf
  5. Youtube Video Series on PCA
