Notes for Paper “Compositional human pose regression”

Paper:

Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV). Vol. 2. 2017.

Key: Structure-aware

      • Performance:
        • 48.3mm on H3.6M Protocol 1 (Avg joint error)
        • 59.1mm on H3.6M Protocol 2 (Avg joint error)
        • PCKh@0.5 86.4% on MPII
      • Evaluation
        • Metrics:
          • Absolute
            • 3D: Procrustes Analysis + MPJPE
            • 2D: PCK
          • Relative:
            • 2D: Mean per bone position error
            • 3D pose: bone length standard deviation and the percentage of illegal joint angles.
        • MPII, H3.6M
      • Basics
        • Structure-aware approach
        • Use bones instead of joints as pose representation.
        • Use joint connection structure to define a compositional loss function.
        • Only re-parameterizes the pose representation, so it is compatible with any other algorithm design.
        • Both 3D and 2D
      • Main method
        • Use the L1 norm for joint regression (instead of squared distance).
        • Bone based representation.
          • Bones are easier to learn than joints, and they express constraints more easily than joints.
          • Many pose-driven applications only need local bones, not global joints.
        • Use the L1 norm for the bone loss function.
        • A bone is a vector from one joint to another. The relative position between two joints is then the (signed) sum of the bones along the path connecting them.
        • Network
          • ResNet-50 pre-trained on ImageNet
          • Last FC layer outputs 3D (or 2D) coordinates
          • Fine-tuned on the task
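
The bone representation and the path sum noted above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the paper's implementation: the 5-joint skeleton, the `PARENT` table, the child-minus-parent sign convention, and all function names are assumptions made for the example.

```python
# Toy skeleton (hypothetical): joint index -> parent index, joint 0 is the root.
PARENT = {0: None, 1: 0, 2: 1, 3: 0, 4: 3}

def joints_to_bones(joints):
    """Bone k = joint k minus its parent (sign convention assumed here)."""
    return {k: tuple(c - p for c, p in zip(joints[k], joints[PARENT[k]]))
            for k in PARENT if PARENT[k] is not None}

def root_relative(bones, k):
    """Position of joint k relative to the root: sum of bones on the path."""
    dim = len(next(iter(bones.values())))
    pos = [0.0] * dim
    while PARENT[k] is not None:
        pos = [a + b for a, b in zip(pos, bones[k])]
        k = PARENT[k]
    return tuple(pos)

def relative_position(bones, u, v):
    """J_u - J_v recovered from bones only (signed sum along the u-v path)."""
    ru, rv = root_relative(bones, u), root_relative(bones, v)
    return tuple(a - b for a, b in zip(ru, rv))

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def bone_l1_loss(pred_bones, gt_bones):
    """Per-bone L1 loss: each bone is penalised independently."""
    return sum(l1(pred_bones[k], gt_bones[k]) for k in gt_bones)

def compositional_l1_loss(pred_bones, gt_bones, pairs):
    """L1 loss on relative joint positions for the given joint pairs."""
    return sum(l1(relative_position(pred_bones, u, v),
                  relative_position(gt_bones, u, v)) for u, v in pairs)

gt_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
             (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
gt_bones = joints_to_bones(gt_joints)
pred_bones = dict(gt_bones)
pred_bones[1] = (1.5, 0.0, 0.0)  # a single wrong bone

bone_loss = bone_l1_loss(pred_bones, gt_bones)                          # 0.5
comp_loss = compositional_l1_loss(pred_bones, gt_bones, [(2, 0), (2, 4)])  # 1.0
```

In this toy case a single bone error of 0.5 is counted once by the per-bone loss, but it shows up in every joint pair whose connecting path crosses that bone. That is the sense in which a loss over joint pairs accounts for error accumulation along the skeleton.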

      • Other methods mentioned
        • Detection-based and regression-based
          • The heatmaps (detection-based) are usually noisy and multi-modal
        • Problem: existing methods simply minimize the per-joint location errors independently and ignore the internal structure of the pose.
        • 3D pose estimation
          • Without prior knowledge of a 3D model
            • Use two separate steps: first do 2D joint prediction, then reconstruct the 3D pose via optimization or search.
            • [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joint locations and a sparsity-driven 3D geometric prior to infer the 3D joint locations via an EM (expectation maximization) algorithm
            • Represent the 3D pose with an over-complete dictionary; use a high-dimensional latent pose representation
            • Extend Hourglass from 2D to 3D
          • With prior knowledge of a 3D model
            • Embed a kinematic model layer into deep neural networks and estimate model parameters instead of joints.
              • The kinematic model parameterization is highly non-linear, and its optimization in deep networks is hard.
        • 2D pose estimation
          • Pure graphical models, inference models.
            • Pictorial structure (PS) model
          • Graphical model with CNN
      • Evaluation
        • Datasets: H3.6M, MPII
        • Metrics:
          • 59.1 mm average joint error (H3.6M Protocol 2)
          • 86.4% PCKh@0.5 (MPII)
      • Coding
        • Caffe
        • Two GPUs
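
The two headline metrics in these notes, average joint error and PCK, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the `ref_length` parameter are hypothetical, the Procrustes alignment used in one H3.6M protocol is omitted, and for the MPII-style PCKh the reference length is assumed to be the head segment length.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.
    The result is in the same unit as the inputs (e.g. mm)."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of joints within alpha * ref_length of the ground truth.
    For PCKh@0.5, ref_length is assumed to be the head segment length."""
    threshold = alpha * ref_length
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(gt)

# Two 3D joints: one exact, one off by a 3-4-5 triangle (distance 5).
error_3d = mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])   # 2.5

# Two 2D joints: one exact, one 10 px off, threshold 0.5 * 10 = 5 px.
score_2d = pck([(0, 0), (10, 0)], [(0, 0), (0, 0)], ref_length=10)  # 0.5
```

Both functions work for 2D or 3D points because `math.dist` accepts points of any matching dimension.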
