NVIDIA NVAIL Partners Present their Research at CVPR 2019

Many of our NVAIL partners are at CVPR this week presenting their top-tier research.

Sandra Skaff
9 min read · intermediate

Overview

The article discusses the presentations made by NVIDIA NVAIL partners at CVPR 2019, highlighting innovative research from Stanford, CASIA, and the Max Planck Institute. Key topics include advancements in 4D convolutional networks, attention-enhanced graph convolutional LSTM networks for human action recognition, and novel approaches to object representation using superquadrics.

What You'll Learn

1. How to implement 4D convolutional networks for processing 3D images
2. Why attention mechanisms enhance human action recognition in deep learning models
3. How to utilize superquadrics for effective 3D object representation

Prerequisites & Requirements

  • Understanding of convolutional neural networks and deep learning concepts
  • Familiarity with PyTorch for implementing deep learning models

Key Questions Answered

What is the Minkowski network and its significance?
The Minkowski network is a large-scale 3D/4D convolutional network designed to process continuous streams of 3D images. It utilizes generalized sparse convolution to effectively handle high-dimensional data, outperforming traditional methods in semantic segmentation tasks across various benchmarks.
How does the AGC-LSTM model improve human action recognition?
The AGC-LSTM model enhances human action recognition by integrating graph convolutional layers with LSTMs to capture both spatial configurations and temporal dynamics. This approach allows for better modeling of the correlation between these dynamics, leading to higher classification accuracies on datasets like NTU RGB+D and Northwestern-UCLA.
What are the advantages of using superquadrics in 3D object representation?
Superquadrics offer a flexible way to represent a diverse range of shapes, such as cylinders and spheres, in a continuous parameter space. This method allows for effective high-level 3D scene understanding without relying on primitive annotations, thus enabling unsupervised learning from unstructured point clouds.
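To make the "continuous parameter space" point concrete, here is a minimal sketch of the standard superquadric inside-outside function from the general superquadrics literature. It is not the training objective or parameterization of the work summarized above, and the function name and argument layout are assumptions chosen for illustration. Two shape exponents are enough to move smoothly between a sphere and a box-like solid.

```python
def superquadric_inside_outside(point, size, shape):
    """Evaluate the implicit superquadric function F at a point.

    F < 1 means the point is inside, F == 1 on the surface, F > 1 outside.
    `size` holds the half-extents (a1, a2, a3); `shape` holds the two
    exponents (e1, e2). All names here are illustrative assumptions.
    """
    x, y, z = point
    a1, a2, a3 = size
    e1, e2 = shape
    # Absolute values keep fractional powers well defined for negative coordinates.
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / a3) ** (2.0 / e1)
    return xy_term + z_term


unit = (1.0, 1.0, 1.0)
corner = (0.9, 0.9, 0.9)

# With e1 = e2 = 1 the function reduces to x^2 + y^2 + z^2, a unit sphere,
# so a point near the corner of the bounding cube lies outside.
sphere_value = superquadric_inside_outside(corner, unit, (1.0, 1.0))

# Small exponents flatten the sides toward a cube, so the same point is inside.
box_value = superquadric_inside_outside(corner, unit, (0.1, 0.1))
```

Changing only the two exponents moves the same corner point from outside (sphere) to inside (box-like), which is the flexibility the answer above refers to.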

Key Statistics & Figures

Training time for MinkowskiUNet42
0.987 sec on Titan Xp and 0.913 sec on Titan RTX
This is the average training time for processing a 5m x 5m x 3m room with 2cm resolution input.
Inference time for AGC-LSTM (Joint)
~59 msec for NTU dataset video
This is the time taken to process one video from the NTU RGB+D dataset.
Inference time for AGC-LSTM (Part)
~33 msec for Northwestern-UCLA dataset video
This is the time taken to process one video from the Northwestern-UCLA dataset, which reflects the model's suitability for real-time action recognition.
Training time for AGC-LSTM (Joint)
~20 hours
This is the time required to train the model on 2 NVIDIA Titan Xp GPUs.
Training time for AGC-LSTM (Part)
~10 hours
This is the time required to train the model on 2 NVIDIA Titan Xp GPUs.
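For a sense of scale, the room-sized input quoted for MinkowskiUNet42 above can be converted into a voxel grid size. The sketch below is plain arithmetic with a hypothetical helper name. It counts the cells of a dense grid, whereas a sparse network only stores the occupied ones; the source does not state what fraction of cells is occupied, so that number is deliberately left out.

```python
def dense_voxel_grid(extent_m, voxel_size_m):
    """Return (cells per axis, total cells) for a dense grid over a box.

    `extent_m` is the box size in meters and `voxel_size_m` the edge length
    of one voxel in meters. Hypothetical helper, for illustration only.
    """
    # round() absorbs floating-point error from dividing by 0.02.
    dims = tuple(round(e / voxel_size_m) for e in extent_m)
    total = 1
    for d in dims:
        total *= d
    return dims, total


# A 5 m x 5 m x 3 m room at 2 cm resolution.
dims, total = dense_voxel_grid((5.0, 5.0, 3.0), 0.02)
print(dims, total)  # (250, 250, 150) 9375000
```

Roughly 9.4 million dense cells for a single room is the motivation for storing and convolving only the occupied coordinates.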

Technologies & Tools

Library
Minkowski Engine
An open-source auto-differentiation library for sparse tensors used to build 3D and 4D convolutional networks.
Framework
PyTorch
Used for implementing the AGC-LSTM and superquadric models.
Hardware
NVIDIA Titan Xp
Used for training and inference of the models discussed in the article.
Hardware
NVIDIA Titan RTX
Also used for training and inference of the models discussed in the article.
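The idea behind convolution on sparse tensors, as mentioned in the Minkowski Engine entry above, can be sketched in pure Python. This is an illustration only: it does not use the Minkowski Engine API, it handles a single scalar channel, and the function name and dictionary layout are assumptions. Features live in a dictionary keyed by 4D integer coordinates (x, y, z, t), and the kernel is a dictionary of coordinate offsets, so empty space costs nothing.

```python
def sparse_conv(features, kernel, out_coords=None):
    """Convolve a sparse scalar field stored as {coordinate: value}.

    `kernel` maps coordinate offsets to weights. Outputs are computed only
    at `out_coords` (default: the occupied input coordinates), and missing
    neighbors are treated as zero by skipping them.
    """
    if out_coords is None:
        out_coords = list(features.keys())
    output = {}
    for coord in out_coords:
        total = 0.0
        for offset, weight in kernel.items():
            neighbor = tuple(c + o for c, o in zip(coord, offset))
            if neighbor in features:  # only occupied cells contribute
                total += weight * features[neighbor]
        output[coord] = total
    return output


# Three occupied cells in (x, y, z, t); the last one is a later time step.
features = {(0, 0, 0, 0): 1.0, (1, 0, 0, 0): 2.0, (0, 0, 0, 1): 3.0}
# A tiny kernel that looks at the cell itself, +1 in x, and +1 in time.
kernel = {(0, 0, 0, 0): 1.0, (1, 0, 0, 0): 0.5, (0, 0, 0, 1): 0.25}

result = sparse_conv(features, kernel)
```

Because the time axis is just a fourth coordinate, the same loop covers 3D and 4D inputs; a real library would add multiple channels, learned weights, and GPU kernels on top of this lookup pattern.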

Key Actionable Insights

1. Incorporating 4D convolutional networks can significantly enhance the processing of 3D image streams in applications like robotics and autonomous vehicles.
   As the demand for real-time processing of 3D data increases, adopting advanced architectures like the Minkowski network can lead to improved performance in semantic segmentation tasks.
2. Utilizing attention mechanisms in deep learning models can lead to better feature extraction and improved accuracy in tasks such as human action recognition.
   By focusing on key joints and their importance, models like AGC-LSTM can achieve state-of-the-art results, making them suitable for applications in surveillance and human-computer interaction.
3. Exploring different shape representations, such as superquadrics, can provide more efficient and effective solutions for 3D object understanding.
   This approach not only simplifies the modeling process but also enhances the ability to infer shapes from complex data, which is crucial for applications in robotics and augmented reality.
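The "focusing on key joints" idea in the second insight can be illustrated with a generic softmax attention step over per-joint features. This is a simplified stand-in, not the attention module of AGC-LSTM itself, whose exact formulation is not given in this summary; the function names and the toy scores are assumptions.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(joint_features, scores):
    """Weight each joint's feature vector by its attention weight and sum.

    `joint_features` is a list of equal-length vectors, one per joint;
    `scores` holds one relevance score per joint (learned in a real model,
    supplied by hand here).
    """
    weights = softmax(scores)
    dim = len(joint_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, joint_features))
        for d in range(dim)
    ]
    return weights, pooled


# Two joints with 2-D features; the first joint is scored as more relevant.
weights, pooled = attend([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

The pooled vector leans toward the higher-scored joint, which is how an attention layer lets informative joints dominate the representation passed to later layers.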

Common Pitfalls

1. Neglecting the importance of spatial and temporal dynamics in human action recognition can lead to suboptimal model performance.
   Many existing models focus primarily on either spatial or temporal aspects, missing the crucial interplay between them. Incorporating both into the model design is essential for achieving higher accuracy.

Related Concepts

Deep Learning Architectures
Convolutional Neural Networks
Human Action Recognition
3D Object Representation