Notes for Paper “Compositional human pose regression”

Paper:

Sun, Xiao, et al. “Compositional human pose regression.” The IEEE International Conference on Computer Vision (ICCV). Vol. 2. 2017.

Key: Structure-aware

- - Performance:
    - 48.3mm on H3.6M Protocol 1 (Avg joint error)
    - 59.1mm on H3.6M Protocol 2 (Avg joint error)
    - PCKh@0.5 of 86.4% on MPII
  - Evaluation
    - Metrics:
      - Absolute
        
        3D: Procrustes Analysis + MPJPE
        
        2D: PCK
      - Relative:
        
        2D: Mean per bone position error
        
        3D pose: bone length standard deviation and the percentage of illegal joint angles.
    - Datasets: MPII, H3.6M
  - Basics
    - Structure-aware approach
    - Use bones instead of joints as pose representation.
    - Use joint connection structure to define a compositional loss function.
    - Only re-parameterizes the pose representation, so it is compatible with any other algorithm design.
    - Applies to both 2D and 3D pose estimation.
  - Main method
    - Use the L1 norm (instead of squared distance) for joint regression.
    - Bone-based representation.
      - Bones are easier to learn than joints, and they express constraints more easily than joints do.
      - Many pose-driven applications need only the local bones, not the global joint locations.
    - Use the L1 norm for the bone loss function.
    - A bone is a vector from one joint to another. The relative position of two joints is then the signed sum of the bones along the path between them.
    - Network
      - ResNet-50 pre-trained on ImageNet
      - The last FC layer outputs 3D (or 2D) coordinates
      - Fine-tuned on the task
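
The bone representation and the compositional loss noted above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the 4-joint toy skeleton, the `PARENT` table, the helper names, and the convention that a bone points from a joint to its parent are all chosen here for the example. The set of joint pairs the loss sums over is left as a parameter.

```python
# Hypothetical toy skeleton: joint 0 is the root; PARENT[k] is the parent of joint k.
PARENT = {1: 0, 2: 1, 3: 1}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def joints_to_bones(joints):
    """Assumed convention: bone k points from joint k to its parent."""
    return {k: sub(joints[p], joints[k]) for k, p in PARENT.items()}

def to_root(bones, k):
    """J_root - J_k, obtained by summing the bones on the way up from k."""
    dim = len(next(iter(bones.values())))
    total = (0.0,) * dim
    while k in PARENT:
        total = add(total, bones[k])
        k = PARENT[k]
    return total

def relative_position(bones, u, v):
    """J_u - J_v from bones alone; bones shared by both paths cancel out."""
    return sub(to_root(bones, v), to_root(bones, u))

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def compositional_loss(pred_bones, gt_joints, pairs):
    """Sum of L1 errors on relative joint positions over the given pairs."""
    return sum(
        l1(relative_position(pred_bones, u, v), sub(gt_joints[u], gt_joints[v]))
        for u, v in pairs
    )

# Usage: 2D toy pose; the same code works unchanged for 3D tuples.
gt = {0: (0.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 2.0), 3: (-1.0, 2.0)}
bones = joints_to_bones(gt)
noisy = dict(bones)
noisy[1] = (0.0, -1.5)  # perturb a single bone
print(relative_position(bones, 2, 3))
print(compositional_loss(noisy, gt, [(1, 0), (2, 0), (2, 3)]))
```

Because joints 2 and 3 share bone 1 on their way to the root, an error on that bone affects the pairs `(1, 0)` and `(2, 0)` but cancels for the pair `(2, 3)`. This is how a loss over long-range pairs ties the bones together instead of treating them independently.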

- - Other methods mentioned
    - Detection-based and regression-based approaches
      - The heatmaps used by detection-based methods are usually noisy and multi-modal
    - Problem: existing methods simply minimize the per-joint location errors independently and ignore the internal structure of the pose.
    - 3D pose estimation
      - Without prior knowledge of a 3D model
        
        Use two separate steps: First do 2D joint prediction, then re-construct the 3D pose via optimization or search.
        
        [[20] Sparseness Meets Deepness] combines uncertainty maps of the 2D joint locations with a sparsity-driven 3D geometric prior to infer the 3D joint locations via an EM (expectation maximization) algorithm
        
        Represent the 3D pose with an over-complete dictionary; use a high-dimensional latent pose representation
        
        Extend Hourglass from 2D to 3D
      - With prior knowledge of a 3D model
        
        Embed a kinematic model layer into a deep neural network and estimate the model parameters instead of the joints.
        
        The kinematic model parameterization is highly non-linear and its optimization in deep networks is hard.
    - 2D pose estimation
      - Pure graphical models and inference models
        
        Pictorial structure (PS) model
      - Graphical model with CNN
  - Evaluation
    - Dataset: H3.6M
    - Metrics:
      - 59.1 mm average joint error
      - 86.4% PCKh@0.5
  - Coding
    - Caffe
    - Two GPUs
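
The two headline metrics in these notes, average joint error (MPJPE) and PCKh@0.5, can be sketched as below. This is a minimal illustration, not the official evaluation code: the function names, the toy coordinates, and the `head_size` argument (the head segment length, assumed to be given) are assumptions made here. The Procrustes alignment step used for the aligned 3D metric is not included.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints whose error is at most alpha * head_size."""
    threshold = alpha * head_size
    hits = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return hits / len(gt)

# Usage with toy values (millimetres for 3D, pixels for 2D).
gt_3d = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
pred_3d = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0)]
print(mpjpe(pred_3d, gt_3d))

gt_2d = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
pred_2d = [(0.0, 3.0), (10.0, 0.0), (20.0, 8.0)]
print(pckh(pred_2d, gt_2d, head_size=10.0))
```

In the toy 2D case the threshold is 5 pixels, so the joint that is off by 8 pixels counts as a miss and the other two count as hits.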
