Deep learning spatio-temporal descriptors for videos

The main goal is to incorporate both the spatial and the temporal information of videos into the representation.

1. 3D CNN

C3D [project][paper] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015.

[paper] S. Ji, W. Xu, M. Yang and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, TPAMI 2013.

[project] [paper] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei, Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014.

2. Two streams: one CNN for spatial, one for temporal (usually optical flow)

[Paper] Karen Simonyan, Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014.

[Project] [Paper] G. Gkioxari and J. Malik, Finding Action Tubes, CVPR 2015.

3. CNN+LSTM: LRCN [project] [paper]

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell, Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015.

Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, Yushi Chen, Convolutional Recurrent Neural Networks: Learning Spatial Dependencies for Image Representation, CVPR 2015.

Ming Liang and Xiaolin Hu, Recurrent Convolutional Neural Network for Object Recognition, CVPR 2015.

Pedro O. Pinheiro and Ronan Collobert, Recurrent Convolutional Neural Networks for Scene Labeling, ICML 2014.


4. CNN+GRU: GRU-RCN [paper]

Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, Delving Deeper into Convolutional Networks for Learning Video Representations, ICLR 2016.

5. Grid LSTM:

Nal Kalchbrenner, Ivo Danihelka, Alex Graves, Grid Long Short-Term Memory, ICLR 2016.


Considering the large number of video applications in the SAIVT lab (action recognition, abnormal event detection, facial expression, VQA), it is worthwhile to investigate an effective spatio-temporal video descriptor. In the literature, researchers have tried different approaches to capture temporal information in the representation:

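  1. 3D CNN: temporal information is captured by the third (temporal) dimension of the filter kernels.
  2. Two parallel CNNs: one for spatial information and one for temporal information (optical flow).
  3. CNNs followed by RNNs (LSTM, GRU): the CNN extracts per-frame features and the RNN models how they evolve over time.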
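
To make item 1 concrete, here is a minimal sketch of a 3D convolution in plain Python, assuming a single-channel clip stored as nested lists of shape T x H x W. The helper names (`conv3d_valid`, `make_clip`, `energy`) and the toy clip are hypothetical and only for illustration; this is not the C3D architecture. The kernel spans two frames, so its response depends on change over time and not on appearance alone.

```python
def conv3d_valid(video, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation over a T x H x W clip."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel slides over time (dt) as well as space (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip(num_frames, size, moving):
    """Toy clip (assumption): one bright pixel on row 1, static or moving right."""
    clip = []
    for t in range(num_frames):
        frame = [[0.0] * size for _ in range(size)]
        frame[1][t if moving else 0] = 1.0
        clip.append(frame)
    return clip


def energy(volume):
    """Sum of absolute responses over the whole output volume."""
    return sum(abs(v) for plane in volume for row in plane for v in row)


# Temporal-difference kernel of shape 2 x 1 x 1: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]

static_out = conv3d_valid(make_clip(4, 4, moving=False), temporal_diff)
moving_out = conv3d_valid(make_clip(4, 4, moving=True), temporal_diff)

print(len(moving_out), len(moving_out[0]), len(moving_out[0][0]))
print(energy(static_out), energy(moving_out))
```

The static clip produces a zero response while the moving clip does not, which is the sense in which the third kernel dimension carries temporal information. A learned 3D CNN would stack many such kernels with spatial extent and nonlinearities.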

I have 2 ideas to explore here:

  1. Similar to our idea in CNN+CRF, the initial idea is to combine the CNN and the RNN in a single equation. This would avoid the cost of three separate optimisation procedures and allow better constraints to be incorporated into the representation.
  2. Pure RNN network: modify the RNN to incorporate both spatial and temporal information. RNNs are inherently good at representing temporal information, so the open question is how to represent spatial information. A recent paper claims that a 2D LSTM is equivalent to a CNN, which may be the clue.

Another way to look at the above approaches: 3D CNNs encode temporal information locally, whereas RNNs encode it globally. The choice between a local and a global approach depends on the application. For abnormal event detection of an individual, the local approach is critical; for a group, the global approach is the key.
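
The local versus global distinction can be illustrated with a small sketch, assuming one scalar feature per frame. The feature values, the filter weights and the scalar Elman-style recurrence (a simplified stand-in for an LSTM or GRU) are all assumptions made for illustration. The idea is to perturb only the first frame and check which model's output at the last time step changes.

```python
import math


def temporal_conv(features, weights):
    """Valid 1D convolution over time: each output sees only len(weights) frames."""
    k = len(weights)
    return [
        sum(w * features[t + i] for i, w in enumerate(weights))
        for t in range(len(features) - k + 1)
    ]


def simple_rnn(features, w_in=0.5, w_rec=0.9):
    """Scalar recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}), with h_{-1} = 0."""
    h, states = 0.0, []
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


# Hypothetical per-frame features, e.g. one CNN activation per frame.
base = [0.2, 0.1, 0.4, 0.3, 0.5, 0.2]
perturbed = [1.0] + base[1:]  # change only the first frame

weights = [0.25, 0.5, 0.25]  # temporal window of 3 frames

# Does the output at the final time step notice the change in frame 0?
conv_last_changed = temporal_conv(base, weights)[-1] != temporal_conv(perturbed, weights)[-1]
rnn_last_changed = simple_rnn(base)[-1] != simple_rnn(perturbed)[-1]

print(conv_last_changed, rnn_last_changed)
```

The convolution's last output only covers the final three frames, so it is unaffected, while the recurrent state still carries a trace of the first frame. Stacking convolutions widens the temporal window but keeps it bounded, and the recurrent trace fades with distance, so in practice the contrast is one of degree.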