Action Recognition on Kinect data

In the action recognition problem, we have a set of $N$ training examples $D = \{(a_{i},c_{i})\}_{i=1}^{N}$ consisting of actions $a_{i}$ and their associated classes $c_{i}$. There are three types of actions in the training set, so $c_{i} \in \{1,2,3\}$, where 1 is "clap", 2 is "high kick", and 3 is "low kick". The goal is to build a model that classifies unseen actions.

The Kinect data was collected by performing actions and extracting the human pose data associated with each action. I have collected data for the following three actions: "clap", "high kick", and "low kick".


Figure 1: Clapping action captured by Kinect

Figure 2: High-kicking action captured by Kinect

Figure 3: Low-kicking action captured by Kinect

Figure 4: Clapping action captured by Kinect

Figure 5: High-kicking action captured by Kinect

Figure 6: Low-kicking action captured by Kinect

To classify unseen actions, we will use the general framework of a Bayesian classifier: we model the joint distribution over an action $A$ and its class $C$ as

$P(A,C) = P(A|C)P(C)$

The intuition behind this model is that each action has a set of distinct characteristics that we can capture using the class conditional action model $P(A|C)$.

We can classify an action $a$ by picking the class assignment $c^{*}$ with the highest posterior probability $P(c^{*}|a)$ given the action:

$c^{*} = \arg\max_{c} \: P(c|a) = \arg\max_{c} \: \frac{P(a|c)P(c)}{P(a)} = \arg\max_{c} \: P(a|c)$

The last equality holds because the class prior $P(c)$ is uniform in the training set and the denominator $P(a)$ does not depend on the class. Thus, given an unseen action, we can classify it by computing $P(a|c)$ for each class and picking the class whose model yields the highest likelihood.
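The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used here: the function name `classify`, the `CLASS_NAMES` table, and the example log-likelihood values are all hypothetical. It works with log-likelihoods, since the likelihood of a long pose sequence is usually too small to represent directly.

```python
import numpy as np

# Class labels as defined in the text.
CLASS_NAMES = {1: "clap", 2: "high kick", 3: "low kick"}

def classify(log_likelihoods, log_prior=None):
    """Return the 1-based class label with the highest posterior score.

    log_likelihoods[c - 1] holds log P(a | c). If log_prior is None the
    class prior is taken to be uniform, so it does not affect the argmax.
    """
    scores = np.asarray(log_likelihoods, dtype=float)
    if log_prior is not None:
        scores = scores + np.asarray(log_prior, dtype=float)
    return int(np.argmax(scores)) + 1

# Hypothetical log-likelihoods of one unseen action under the three models.
example = [-310.2, -295.7, -301.4]
print(CLASS_NAMES[classify(example)])  # prints "high kick"
```

Passing `log_prior` only matters when the classes are not equally frequent; with the uniform prior assumed above it can be omitted.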

We have yet to specify a key component of our classifier: the class conditional action model $P(A|C)$. What model should we use? One possibility would be to fit a Bayesian clustering model (a mixture of poses model) for each action to model the types of poses that appear in each action class. However, this will probably not work well, as many of these actions share poses that look similar. The class conditional action models would then be similar to one another, and the classifier would perform poorly.

A key observation that will help us build a better action model is that although the poses comprising an action may look similar, or even be the same, the sequence in which these poses occur defines the action. For example, in the "low kick" action, we would expect a sequence of poses in which the foot is lifted, kicked in a direction, and then returned to its original position. Thus, we should try to leverage the temporal nature of action poses in our model. The mixture of poses model does not account for this temporal structure, so we turn to a different model that does.

HMM action models

Using a Hidden Markov Model (HMM) for the action model will allow us to capture the sequential nature of the data. Given an action $a_{i}$ consisting of a sequence of $m$ poses $p_{1}^{i},\ldots,p_{m}^{i}$, we can construct an HMM of length $m$ with hidden state variables $S_{1},\ldots,S_{m}$, one per pose, that correspond to the unknown pose classes. The HMM action model defines a joint distribution over the hidden state variables and the poses comprising an action of length $m$:

$P(S_{1},...,S_{m},P_{1},...,P_{m}) = P(S_{1})P(P_{1}|S_{1})\prod_{i=2}^{m}P(S_{i}|S_{i-1})P(P_{i}|S_{i})$

Since the HMM is a template model, it is parameterized by three CPDs: the prior distribution over the initial state $P(S_{1})$, the transition model $P(S^{'}|S)$, and the emission model $P(P|S)$. The first two CPDs, $P(S_{1})$ and $P(S^{'}|S)$, are table CPDs, while the emission model for pose $P_{j}$ with parts $\{O_{k}\}_{k=1}^{10}$ is

$P(P_{j}|S) = \prod_{k=1}^{10}P(O_{k}|S,O_{pa(k)})$

where $pa(k)$ denotes the parent of node $k$.
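To classify with this model we need the likelihood of the observed poses alone, that is, the joint distribution above with the hidden states summed out. A standard way to compute it is the forward algorithm. The sketch below assumes the three CPDs have already been evaluated in log space; the function name `hmm_log_likelihood` and the array layout are choices made for this illustration, and the numbers in the example are made up.

```python
import numpy as np

def hmm_log_likelihood(log_init, log_trans, log_emit):
    """Log-likelihood of one pose sequence, summing over hidden states.

    log_init:  shape (K,),   log P(S_1 = k)
    log_trans: shape (K, K), log P(S' = j | S = i) stored at [i, j]
    log_emit:  shape (m, K), log P(p_t | S_t = k) stored at [t, k]
    """
    log_init = np.asarray(log_init, dtype=float)
    log_trans = np.asarray(log_trans, dtype=float)
    log_emit = np.asarray(log_emit, dtype=float)

    # alpha[k] = log P(p_1, ..., p_t, S_t = k), updated one pose at a time.
    alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        # Log-sum-exp over the previous state i for every next state j.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
        alpha = alpha + log_emit[t]
    # Sum out the final hidden state.
    return float(np.logaddexp.reduce(alpha))

# Made-up example: K = 2 pose classes, a sequence of m = 3 poses.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3],
                    [0.2, 0.8]])
log_emit = np.log([[0.5, 0.1],
                   [0.4, 0.3],
                   [0.1, 0.6]])
print(hmm_log_likelihood(log_init, log_trans, log_emit))
```

Working in log space avoids numerical underflow, which would otherwise occur quickly because each step multiplies in the probability of ten body parts.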

Figure 7 shows an example HMM for an action consisting of a sequence of 3 poses, where we have explicitly shown the first 6 body parts that comprise a pose.


Figure 7: In this HMM, each state variable represents the class of the underlying pose. The emission probabilities for each pose are computed using the learned parameters.

Learning the HMM action model using EM

Since the state variables are hidden, we will use the EM algorithm to learn the parameters of the HMM.

EM is only guaranteed to converge to a local maximum of the likelihood, and in practice it often ends at a local maximum rather than the global one.
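One common way to reduce the impact of poor local maxima is to run EM from several random initializations and keep the run with the highest training log-likelihood. The sketch below illustrates this with Baum-Welch (EM for HMMs) under two simplifying assumptions that are not part of the model described here: observations are discrete symbols rather than poses with the tree-structured emission model, and a tiny pseudo-count `eps` is added to the expected counts to avoid division by zero. All names (`forward_backward`, `fit_hmm`, `n_restarts`) are hypothetical.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """E-step for one sequence of discrete symbols (scaled forward-backward).

    Returns gamma[t, k] = P(S_t = k | obs), xi[t, i, j] =
    P(S_t = i, S_{t+1} = j | obs), and the sequence log-likelihood.
    """
    m, K = len(obs), len(pi)
    alpha, beta, scale = np.zeros((m, K)), np.ones((m, K)), np.zeros(m)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, m):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(m - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    emit_beta = B[:, obs[1:]].T * beta[1:]            # shape (m - 1, K)
    xi = alpha[:-1, :, None] * A[None] * emit_beta[:, None, :]
    xi /= scale[1:, None, None]
    return gamma, xi, float(np.log(scale).sum())

def fit_hmm(sequences, K, V, n_iter=50, n_restarts=5, seed=0, eps=1e-6):
    """Baum-Welch with random restarts; sequences need length >= 2.

    Returns (log_likelihood, pi, A, B) for the best restart.
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Random initial parameters; each row is a valid distribution.
        pi = rng.dirichlet(np.ones(K))
        A = rng.dirichlet(np.ones(K), size=K)
        B = rng.dirichlet(np.ones(V), size=K)
        for _ in range(n_iter):
            # E-step: accumulate expected counts (eps is a pseudo-count).
            pi_c = np.full(K, eps)
            A_c = np.full((K, K), eps)
            E_c = np.full((V, K), eps)   # E_c[v, k]: symbol v emitted in state k
            for obs in sequences:
                gamma, xi, _ = forward_backward(pi, A, B, obs)
                pi_c += gamma[0]
                A_c += xi.sum(axis=0)
                np.add.at(E_c, obs, gamma)
            # M-step: normalize the expected counts.
            pi = pi_c / pi_c.sum()
            A = A_c / A_c.sum(axis=1, keepdims=True)
            B = (E_c / E_c.sum(axis=0, keepdims=True)).T
        ll = sum(forward_backward(pi, A, B, obs)[2] for obs in sequences)
        if best is None or ll > best[0]:
            best = (ll, pi, A, B)
    return best

# Toy symbol sequences (made up) standing in for pose sequences.
toy = [np.array([0, 1, 2, 2, 1, 0, 0, 1]), np.array([2, 2, 1, 0, 1, 2])]
best_ll, pi_hat, A_hat, B_hat = fit_hmm(toy, K=2, V=3, n_iter=20)
print(best_ll)
```

Random restarts do not guarantee the global maximum; they only make a poor solution less likely, at the cost of `n_restarts` times the training time.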

Once we can learn an HMM with EM, we plug it into the Bayesian classifier: we train a separate HMM for each action type, then classify an action according to which HMM assigns it the highest likelihood.

The resulting action recognition system achieved a classification accuracy of 82.22%.