Human Pose Estimation and Extended Research


Category : computer_vision


Introduction

As recent progress in computational capabilities enables machines to move from lab settings into the real world, it becomes important to understand and study nearby human activity to address safety concerns. Human pose estimation is already an active research problem for machine perception systems in self-driving cars, search and rescue systems, automated surveillance, and other Human-Robot Interaction (HRI) applications. Accurate and efficient human pose estimation is critical to high-level tasks such as pedestrian avoidance, automated robotic lifting and moving of victims in search and rescue applications, and human behavior recognition.

Summary of Contributions

To address the trade-off between estimation accuracy and processing speed, this work makes two contributions:

  1. Parallel architecture, taking advantage of distributed computing when using multiple GPUs.
  2. Portable, independent limb detection branches, each serving as a lightweight process for tasks that focus on different sets of limbs.

Proposed Method


Fig. 1 Architecture of the 2-stage 13-parallel CNN ensemble network.

As shown in Fig. 1, the proposed neural network architecture can be divided into three stages: a preprocessing stage followed by two predicting stages. In the preprocessing stage, the color input image, \(I\), is fed into a pre-trained VGG19 network to obtain the feature map \(F\). Each predicting stage contains 13 CNN ensembles that independently predict the link fields and joint confidence maps, one ensemble for each of the 13 skeletal model linkages.
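Because the 13 branches of a predicting stage do not depend on each other, they can be spread over the available devices. The sketch below only illustrates that idea. The round-robin policy, the device names, and the `run_branch` stand-in are assumptions made for the example, not the scheduling used in the original work.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LIMBS = 13  # from the text: one CNN ensemble per skeletal linkage


def assign_branches(num_branches, devices):
    """Round-robin assignment of branch indices to device names (assumed policy)."""
    plan = {device: [] for device in devices}
    for branch in range(num_branches):
        plan[devices[branch % len(devices)]].append(branch)
    return plan


def run_branch(branch, feature_map):
    """Hypothetical stand-in for one CNN ensemble.

    A real branch would return a joint confidence map and a link field.
    """
    return branch, sum(feature_map) * (branch + 1)


def predict_stage(feature_map, devices):
    """Run all branches with one worker thread per device; collect results by branch."""
    plan = assign_branches(NUM_LIMBS, devices)

    def run_on_device(device):
        return [run_branch(b, feature_map) for b in plan[device]]

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        per_device = list(pool.map(run_on_device, devices))
    return dict(pair for chunk in per_device for pair in chunk)


plan = assign_branches(NUM_LIMBS, ["gpu0", "gpu1"])
outputs = predict_stage([0.5, 1.5], ["gpu0", "gpu1"])
```

With two devices, 13 branches cannot be split evenly, so one device handles seven branches and the other six. A task that needs only a few limbs could pass a shorter list of branch indices instead of all 13.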


Fig. 2 Joint confidence and PAF for right knee-right ankle body linkage.

The human model is represented with 13 limbs, and each limb consists of one proximal joint, one distal joint, and one linkage that connects the paired joints. For example, the right knee-right ankle limb consists of the right knee as the proximal joint, the right ankle as the distal joint, and the linkage connecting them, as shown in Fig. 2. A single heatmap represents the positions of both joints, with positive values for the proximal joint and negative values for the distal joint. A part affinity field (PAF) represents the orientation of the linkage, directed from the proximal joint to the distal joint.
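One way to picture this encoding is to build the two targets for a single limb by hand. The following is a minimal sketch, assuming Gaussian peaks for the joints and a fixed-width band of unit vectors for the PAF. The function names, `sigma`, `half_width`, and the pixel coordinates are all illustrative choices rather than values from the original work.

```python
import math


def signed_joint_map(height, width, proximal, distal, sigma=1.0):
    """Heatmap with a positive peak at the proximal joint and a negative peak at the distal joint."""
    def peak(x, y, center):
        cx, cy = center
        return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

    return [[peak(x, y, proximal) - peak(x, y, distal) for x in range(width)]
            for y in range(height)]


def limb_paf(height, width, proximal, distal, half_width=1.0):
    """Unit vector pointing proximal -> distal on pixels near the limb, (0, 0) elsewhere."""
    (px, py), (dx, dy) = proximal, distal
    length = math.hypot(dx - px, dy - py)
    ux, uy = (dx - px) / length, (dy - py) / length
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            along = (x - px) * ux + (y - py) * uy        # projection onto the limb axis
            across = abs((x - px) * uy - (y - py) * ux)  # perpendicular distance to the axis
            if 0 <= along <= length and across <= half_width:
                field[y][x] = (ux, uy)
    return field


knee, ankle = (2, 1), (2, 7)  # (x, y) pixel coordinates, made up for illustration
heat = signed_joint_map(9, 5, knee, ankle)
paf = limb_paf(9, 5, knee, ankle)
```

Here `heat[1][2]` is close to +1 (the knee), `heat[7][2]` is close to -1 (the ankle), and `paf[4][2]` is `(0.0, 1.0)`, a unit vector pointing from knee to ankle. Packing both joints into one signed map means each branch needs only one joint channel per limb.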


Fig. 3 Body part parsing and individual parsing.

After the network outputs the limb predictions, body part parsing associates the proximal joints, distal joints, and linkages into individual limbs. Individual parsing then assembles these body parts into individual skeletons according to a set of assembly rules. Once individual parsing is done, the position of each joint shared by two connected limbs is refined by weighting the positions predicted for that joint by the two limbs.
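The refinement step can be sketched as a weighted mean of the two estimates of a shared joint. The weights below are assumed to be per-limb confidence scores, and the function name and coordinates are hypothetical. The text does not state how the weights are actually chosen.

```python
def refine_shared_joint(estimate_a, estimate_b, weight_a, weight_b):
    """Weighted mean of two (x, y) estimates of the same joint."""
    total = weight_a + weight_b
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return tuple((weight_a * a + weight_b * b) / total
                 for a, b in zip(estimate_a, estimate_b))


# Illustrative case: the right knee is the distal joint of one limb
# and the proximal joint of the right knee-right ankle limb.
knee_from_upper_limb = (120.0, 200.0)
knee_from_lower_limb = (124.0, 204.0)
refined = refine_shared_joint(knee_from_upper_limb, knee_from_lower_limb,
                              weight_a=0.75, weight_b=0.25)
print(refined)  # (121.0, 201.0)
```

The refined position lies on the segment between the two estimates, closer to the one with the larger weight.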


Fig. 4 Mean average precision of specific joints.

The overall detection accuracy for body parts is greater than 98%. On a single GTX 1080 GPU, the prediction process takes around 103.5 ms. With two GTX 1080 GPUs, the processing time drops to 76.8 ms. The mean average precision for specific joints is presented in Fig. 4.
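The two timings can be turned into a speedup and an approximate frame rate with a few lines of arithmetic. This uses only the 103.5 ms and 76.8 ms figures reported above. The frame rates assume that prediction is the only per-frame cost, which is a simplification.

```python
single_gpu_ms = 103.5  # one GTX 1080, from the text
dual_gpu_ms = 76.8     # two GTX 1080s, from the text

speedup = single_gpu_ms / dual_gpu_ms
time_saved_pct = 100 * (single_gpu_ms - dual_gpu_ms) / single_gpu_ms
fps_single = 1000 / single_gpu_ms
fps_dual = 1000 / dual_gpu_ms

print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
print(f"throughput: {fps_single:.1f} -> {fps_dual:.1f} frames/s")
```

This gives a speedup of about 1.35x (roughly 26% less time per frame), or about 9.7 versus 13.0 frames per second. The gain is below 2x, which is expected when part of the pipeline, such as the shared preprocessing stage, is not split across devices.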

  1. [C1] Ren, H., Kumar, A., Wang, X., Ben-Tzvi, P., "Parallel Deep Learning Ensembles for Human Pose Estimation", Proceedings of the ASME 2018 Dynamic Systems and Control Conf. (DSCC 2018), Atlanta, GA, Sep. 30 - Oct. 3, 2018.

About Hailin Ren

Hello, my name is Hailin Ren. I obtained my Ph.D. degree from the Robotics and Mechatronics Lab (RML) in the Mechanical Engineering Department at Virginia Tech. My research interests include Reinforcement Learning, Computer Vision, and Mechatronics System Design.
