
Predicting Human Activity Using PCA Features

Pradeep Adhokshaja
7 min read · May 30, 2019


Motivation

When working on a regression/classification problem involving data with a large number of features, we are often susceptible to the “Curse of Dimensionality”. As we keep increasing the number of features, we will reach a point where our regressor/classifier fails to perform well on test data.

This is due to the increase in data sparsity brought about by the larger number of features. Increased data sparsity makes it easier for our models to overfit, causing them to fail on future data.

This “Curse” can be avoided by reducing the number of features. This practice is called Dimensionality Reduction. Broadly speaking, there are two types of dimensionality reduction techniques.

  1. Feature Selection
  2. Feature Extraction

Feature Selection

Feature Selection involves discarding features that are irrelevant to the explanation of the underlying data.

Feature Extraction

Instead of discarding features, feature extraction techniques create a new set of features from combinations of the original ones. A small subset of these new features is often enough to explain the data effectively. One such method is Principal Component Analysis (PCA), which allows the data to be represented in fewer dimensions.

In this post, I will predict human activity by applying PCA to the Human Activity Detection data found on Kaggle. The scripts for the subsequent data analysis are written in R.

Principal Component Analysis Procedure

In PCA, the original features are linearly combined and transformed into principal components that are uncorrelated with each other.

The procedure to find principal components is as follows

Let X be a matrix that represents the set of features that we are running PCA on. Each column in this matrix represents a feature.

  1. Subtract each column's mean from it and divide by its standard deviation.
  2. Let the new normalized matrix be Z. Using this normalized matrix, calculate the covariance matrix, i.e. transpose(Z)*Z (up to a constant factor of 1/(n-1), which does not affect the eigenvectors). Let this covariance matrix be C.
  3. Decompose the covariance matrix C into eigenvalues and eigenvectors. Here the eigen() function is used. The eigenvector corresponding to the largest eigenvalue is the direction of maximum variance.
  4. To acquire the principal components, arrange the eigenvalues in descending order and reorder the eigenvectors accordingly. This reordering is taken care of by the eigen() function.
  5. Now that we have the orthogonal directions of maximum variance, we project the data onto them to create the principal components. This is achieved by multiplying the scaled feature matrix Z by the reordered eigenvector matrix. The variance explained by each component is its eigenvalue divided by the sum of all eigenvalues.

The first principal component accounts for the largest possible variability in the data.
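
Before implementing the full function, here is a minimal sketch of the five steps on a tiny two-feature matrix (the numbers are made up purely for illustration):

## Toy two-feature matrix (illustrative numbers only)
X <- matrix(c(2.5, 0.5, 2.2, 1.9, 3.1,
              2.4, 0.7, 2.9, 2.2, 3.0), ncol = 2)

Z <- scale(X)                       # step 1: center and scale each column
C <- t(Z) %*% Z                     # step 2: covariance matrix (up to a constant)
decomp <- eigen(C)                  # steps 3 & 4: eigenvalues in decreasing order
PCs <- Z %*% decomp$vectors         # step 5: project the scaled data
decomp$values / sum(decomp$values)  # proportion of variance explained by each component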

Implementing the PCA function in R

The steps mentioned in the previous section are implemented in the pca_function() below. The function returns the following:

  1. The projected vectors
  2. The eigenvalues
  3. The directions along which the data is projected (the eigenvectors)
pca_function <- function(x, k){

  ## x is the input data frame; k is the number of components to keep

  ## Step 1: normalize each column (subtract the mean, divide by the standard deviation)
  x_scaled <- scale(x)

  ## Convert the data frame to a matrix
  x_mat <- as.matrix(x_scaled)

  ## Step 2: calculate the covariance matrix
  cov_x <- t(x_mat) %*% x_mat

  ## Steps 3 & 4: eigenvalue decomposition; eigen() returns the eigenvalues
  ## in decreasing order together with the matching eigenvectors
  decomp_eigen <- eigen(cov_x)
  eigen_values <- decomp_eigen$values
  eigen_vectors <- decomp_eigen$vectors

  ## Keep the top k eigenvalues
  eigen_values <- eigen_values[1:k]

  ## Step 5: project the scaled data onto the eigenvectors and keep the first k columns
  projected_vectors <- x_mat %*% eigen_vectors
  projected_vectors <- projected_vectors[, 1:k]

  ## Return the k projections, the top k eigenvalues and the full set of eigenvectors
  return(list(projected_vectors = projected_vectors,
              eigen_values = eigen_values,
              basis_vectors = eigen_vectors))
}

Returning the results as a named list lets us access each element simply by using the $ operator.
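
As a quick sanity check (not part of the original analysis), we can run pca_function() on the four measurement columns of R's built-in iris data and compare the proportions of explained variance with prcomp(). The component signs may differ, but the variance proportions should match.

## Illustrative check on the built-in iris measurements
toy <- iris[, 1:4]

## Keep all four components so the eigenvalues sum to the total variance
pca_toy <- pca_function(toy, k = 4)
pca_toy$eigen_values / sum(pca_toy$eigen_values)   # proportion of variance per component

## prcomp() with centering and scaling should report the same proportions
summary(prcomp(toy, center = TRUE, scale. = TRUE))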

Data Read and Feature Extraction

As mentioned earlier, I will be using the Human Activity Detection dataset on Kaggle. This data comprises 561 features acquired through sensors for each type of human activity.

These activities are classified as STANDING, SITTING, LAYING, WALKING, WALKING DOWNSTAIRS and WALKING UPSTAIRS.

The idea is to use PCA to reduce the number of features without much loss of information. PCA gives us a new set of features, called principal components, which are arranged in descending order of how much variability they explain in the data.

The code for this feature extraction procedure is as follows

set.seed(42)

libraries_needed <- c('tidyr','dplyr','ggplot2','caret','purrr','rlang')
lapply(libraries_needed, require, character.only = TRUE)

data_raw <- read.csv('train.csv', stringsAsFactors = FALSE)
data_raw_pca <- data_raw %>% select(-rn, -activity)

train_index <- sample(1:nrow(data_raw_pca), 0.75 * nrow(data_raw_pca))
data_raw_pca_train <- data_raw_pca[train_index, ]
data_raw_pca_test <- data_raw_pca[-train_index, ]

pca_decomposed_data <- pca_function(data_raw_pca_train,561)

The dataset is divided into a train and test set. The PCA is run on the train set.

The principal components that are created can be accessed through the following command.

pca_decomposed_data$projected_vectors

Choosing the Principal Components

Now that we have the principal components ready, how do we choose which ones to use?

The principal components are arranged in descending order of how well they explain the data. Each component’s ability to explain the data is given by the eigenvalue of that principal component divided by the sum of all eigenvalues.

To choose the principal components, we plot the cumulative variance explained by the principal components.

eigen_values_and_vectors <- data.frame(principal_component = 1:ncol(data_raw_pca_train),
                                       eigen_values = pca_decomposed_data$eigen_values)

eigen_values_and_vectors %>%
  mutate(var = eigen_values / sum(eigen_values)) %>%
  mutate(cumulative_var = cumsum(var)) %>%
  ggplot(aes(x = principal_component, y = cumulative_var)) +
  geom_line() +
  geom_point()

The first 100 principal components explain about 95% of the variability in the data, which is pretty good!

We started off with 561 features and reduced them to 100. That is an 82% reduction in the number of features!

Fitting a multinomial logistic model

Multinomial logistic regression models the log odds of each class relative to a base class. Here, I have chosen SITTING as the base class. The model uses the derived principal components to estimate the human activity.

library(nnet)
## Build the training frame from the first 100 principal components (this step is implied in the original post)
data_pca_train <- as.data.frame(pca_decomposed_data$projected_vectors[, 1:100])
data_pca_train$activity <- as.factor(data_raw[train_index, ]$activity)
data_pca_train$activity <- relevel(data_pca_train$activity, ref = 'SITTING')
model_logistic_pca <- multinom(activity ~ ., data = data_pca_train)
# broom::tidy(model_logistic_pca)
test <- predict(model_logistic_pca, newdata = data_pca_train %>% select(-activity))
real_pred <- data.frame(real = data_pca_train$activity, test = as.character(test))

Looking at training performance

One of the best ways to look at classifier performance is through a confusion matrix. A confusion matrix is a summary of the prediction results of a classifier. This shows where the classifier got “confused” in making a classification.

It is a cross table with actual and predicted values.
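
As a minimal illustration (this snippet is not in the original post), a plain cross table of the actual and predicted training labels already shows the raw counts:

## Raw counts of actual vs. predicted labels on the training set
table(actual = real_pred$real, predicted = real_pred$test)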

The confusion matrix can be displayed using the confusionMatrix() function from the caret package.

caret::confusionMatrix(real_pred$real,real_pred$test)

The above command gives the following output

Output of the confusionMatrix() command of caret
  • 98% of the data points were classified properly
  • Sensitivity and specificity rates range from 95% to 100%

The training accuracy of our model seems impressive. But how does this perform on test data?

Testing the model

The issue with the test data is that it has not been transformed into principal components. To do that, we just have to project the normalized test data along the directions of maximum variance. These directions are represented by the eigenvectors returned by pca_function().

The following code does just that and uses the model built in the previous section to estimate the classes of the test set.

## Project the scaled test features onto the eigenvectors found on the training set
data_pca_test <- as.matrix(scale(data_raw_pca_test)) %*% pca_decomposed_data$basis_vectors
data_pca_test <- as.data.frame(data_pca_test)
data_pca_test <- data_pca_test[, 1:100]
data_pca_test$activity <- data_raw[-train_index, ]$activity

test_prediction <- predict(model_logistic_pca,newdata = data_pca_test %>% select(-activity))
real_pred_test <- data.frame(real=data_pca_test$activity,test=as.character(test_prediction))
caret::confusionMatrix(real_pred_test$real,real_pred_test$test)

The output of the test data confusion matrix is as follows

Model test output
  • 94% of test data is classified correctly.
  • There is some decrease in sensitivity when it comes to classifying SITTING correctly. 87% of SITTING data points were classified correctly.

Cross-Validation

A primary characteristic of a good model is its ability to perform well on unseen data. To test how good a classification model is in predicting unseen data, a technique called cross-validation comes to the rescue.

There are many types of cross-validation techniques, the most popular one being K-folds cross-validation.

K Folds Cross-Validation

In K folds cross-validation, the data is split into K equal parts. Each of these parts is used as a test set while the rest is used as a train set. The K test and train scores are observed for stability.
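
As an aside, caret also provides createFolds() to generate fold indices; the sketch below is only an alternative illustration and is not what this post uses. The hand-rolled split used here follows.

## Alternative (illustrative only): stratified fold indices from caret
folds <- caret::createFolds(as.factor(data_raw$activity), k = 10)
str(folds)  # a list of 10 vectors of test-row indices, one per fold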

The following code implements K Folds Cross Validation with K=10.

random_data <- dplyr::slice(data_raw,sample(1:n()))
test_idx <- round(seq(1,nrow(data_raw),by=nrow(data_raw)/11))
accuracy_df <- data.frame(iteration= as.numeric(),training_score= as.numeric(),testing_score= as.numeric())
for(i in 1:10){


## Training

test_data <- slice(random_data,test_idx[i]:test_idx[i+1])
train_data <- slice(random_data,-test_idx[i]:-test_idx[i+1])
pca_final_train <- pca_function(train_data %>% select(-rn,-activity),100)

data_pca_train <- as.data.frame(pca_final_train$projected_vectors)
data_pca_train$activity <- as.character(train_data$activity)
#data_pca_train$activity <- relevel(data_pca_train$activity,ref='SITTING')
model_cv <- multinom(activity~.,data=data_pca_train)

train_predict <- as.character(predict(model_cv,newdata=data_pca_train %>% select(-activity)))

training_score <- sum(as.character(train_predict)==as.character(train_data$activity))/nrow(train_data)

## Testing
data_pca_test <- as.matrix(scale(test_data %>% select(-rn,-activity))) %*% pca_final_train$basis_vectors
data_pca_test <- as.data.frame(data_pca_test)
data_pca_test <- data_pca_test[,1:100]
data_pca_test$activity <- test_data$activity

predict_cv <- as.character(predict(model_cv,newdata=data_pca_test %>% select(-activity)))
testing_score <- sum(as.character(predict_cv)==as.character(test_data$activity))/nrow(test_data)
accuracy_df <- rbind(accuracy_df,data.frame(iteration=i,testing_score=testing_score,training_score =training_score ))
}
scaleFUN <- function(x) sprintf("%.f", x)
accuracy_df %>%
  tidyr::gather(type_of_score, value, 2:3) %>%
  mutate(type_of_score = ifelse(grepl('training', type_of_score), 'Training Score', 'Testing Score')) %>%
  ggplot(aes(x = iteration, y = value, color = type_of_score)) +
  geom_line() +
  scale_x_continuous(labels = scaleFUN) +
  labs(x = 'Iteration Number', y = 'Score')

Looking at the test and train scores of each iteration of the 10-fold cross-validation procedure
  • Training scores are stable, and testing scores range from 0.94 to 0.97, which is acceptable for a good multi-class classifier.
  • The acceptable test scores depend on the type of problem the classifier is trying to solve. In cases such as detecting benign/malignant tumors, the threshold of acceptance will be higher.

We were able to get impressive classifier performance even though we reduced the number of features by 82%.

Closing Remarks

  • PCA enables us to reduce the number of features by projecting data into a different space.
  • The principal components are not correlated with each other.
  • A subset of the principal components is sufficient to explain the data.

Thanks for taking the time to read! Please feel free to comment with pointers to improve the article.

Useful References

  1. https://medium.com/@cxu24/why-dimensionality-reduction-is-important-dd60b5611543
  2. https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
  3. https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
  4. http://www.math.union.edu/~jaureguj/PCA.pdf
  5. Youtube Video Series on PCA
