Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.
Intro
The design of the model is motivated by two factors:
- bottom-up end-to-end learning
- multi-person articulated tracking
Leverage available image information
Learn, end-to-end with a convolutional network, a model that associates each body joint with a specific person
Facilitate efficient inference in video
Fast inference method:
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR’17.
Contribution:
- Articulated tracking model that operates by bottom-up assembly of part detections within each frame and over time
- Single-frame pose estimation relying on a sparse graph between body parts and generating body-part proposals conditioned on a person’s location
Related
U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In ECCVw’16.
uses a graph partitioning approach closely related to:
S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In CVPR, 2015.
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR’16.
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV’16.
the work is also closely related to (similar formulation):
U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. In CVPR’17.
differ in the type of body-part proposals and the structure of the spatio-temporal graph
Overview
- a convolutional network for generating body part proposals
- an approach to group the proposals into spatio-temporal clusters
Tracking by Spatio-temporal Grouping
Part detection:
- index of the video frame
- spatial location
- probability of correct detection
- type of body joint
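As a reading aid, the four fields above can be bundled into a small record. This is only a minimal sketch: the class name `PartDetection` and all field names are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PartDetection:
    """One body-part proposal (all names here are illustrative assumptions)."""
    frame: int                # index of the video frame
    pos: Tuple[float, float]  # spatial location (x, y) in the image
    prob: float               # probability that the detection is correct
    joint: str                # body joint type, e.g. "head" or "ankle"


d = PartDetection(frame=0, pos=(120.5, 64.0), prob=0.93, joint="head")
print(d.joint, d.frame)
```

Making the record immutable (`frozen=True`) lets detections be used as graph nodes, for example as dictionary keys.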
Graph:
by
Tracking solution:
Given image observations:
- Node and edge features:
- x and y are given by maximizing
- Integer-Linear Program
where
- Constraints on Z:
- minimum cost subgraph multicut problem
- (3) and (4) ensure assignment of node and edge variables is consistent
- (5) ensures that for every two nodes either all or none of the paths between these nodes in graph G are contained in one of the connected components of subgraph
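The role of the constraints can be illustrated with a small feasibility check. The sketch below encodes only the verbal descriptions above, not the exact inequalities: it assumes binary node labels `x` (1 = detection kept) and binary edge labels `y` (1 = both endpoints belong to the same component), and the function name `is_feasible` is hypothetical.

```python
from typing import Dict, Hashable, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def is_feasible(x: Dict[Node, int], y: Dict[Edge, int]) -> bool:
    """Check a 0/1 labeling of nodes (x) and edges (y) for consistency."""
    # In the spirit of (3) and (4): an edge may only be selected
    # if both of its endpoints are selected.
    for (v, w), y_vw in y.items():
        if y_vw == 1 and (x[v] == 0 or x[w] == 0):
            return False

    # Union-find over the selected edges yields the connected components.
    parent = {v: v for v in x}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (v, w), y_vw in y.items():
        if y_vw == 1:
            parent[find(v)] = find(w)

    # In the spirit of (5): an unselected edge must not connect two nodes
    # that already lie in the same component.
    for (v, w), y_vw in y.items():
        if y_vw == 0 and find(v) == find(w):
            return False
    return True


x = {"a": 1, "b": 1, "c": 1}
ok = is_feasible(x, {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1})
bad = is_feasible(x, {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 0})
print(ok, bad)  # True False
```

The second labeling fails because a and c are already joined through b, so cutting the direct edge a-c would contradict the "all or none of the paths" requirement.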
Articulated Multi-person Tracking
Three types of edges
- cross-type
- same-type
- temporal
- Bottom-Up Model
- Top-Down/Bottom-Up Model
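The three edge types listed above can be told apart from the frame index and joint type of the two detections an edge connects. The sketch below assumes that same-type and cross-type edges link detections within one frame, while temporal edges link detections in different frames. That reading, and the names `Det` and `edge_type`, are assumptions made for illustration.

```python
from typing import NamedTuple


class Det(NamedTuple):
    frame: int  # video frame index
    joint: str  # body joint type


def edge_type(a: Det, b: Det) -> str:
    """Classify the edge between two detections (assumed convention)."""
    if a.frame != b.frame:
        return "temporal"
    return "same-type" if a.joint == b.joint else "cross-type"


print(edge_type(Det(0, "head"), Det(0, "head")))
print(edge_type(Det(0, "head"), Det(0, "knee")))
print(edge_type(Det(0, "head"), Det(1, "head")))
```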
Solver
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR’17.
Bottom-Up (BU) Model
In the BU model, the body part proposals are generated with the authors’ publicly available convolutional part detector (DeeperCut)
Two connectivity patterns
- fully connected graph
- sparse version: a simpler and faster variant of the model that omits edges between parts that carry little information about each other’s image location
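The difference between the two connectivity patterns can be sketched over part types. The part list and the choice of which pairs to keep in the sparse variant are toy assumptions for illustration; the notes do not say which edges the model actually retains.

```python
from itertools import combinations

# Toy part set, chosen for illustration only.
PARTS = ["head", "neck", "shoulder", "elbow", "wrist"]

# Fully connected: every unordered pair of part types gets an edge.
full_edges = set(combinations(PARTS, 2))

# Sparse: keep only pairs assumed to be informative about each other
# (here a simple chain of neighbouring parts).
sparse_edges = {(PARTS[i], PARTS[i + 1]) for i in range(len(PARTS) - 1)}

print(len(full_edges), len(sparse_edges))  # 10 4
```

The fully connected pattern grows quadratically with the number of part types, which is why dropping uninformative pairs makes the model faster.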
Edge costs
- depending on detection types
- computed by logistic regression given the features computed from offset and angle
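A minimal sketch of how such an edge cost could be computed follows. The feature set (offset and angle only), the weights, and the conversion from probability to cost via the log-odds are all assumptions for illustration. In practice one would presumably fit a separate set of weights for each pair of detection types.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pair_features(p: Point, q: Point) -> Tuple[float, float, float]:
    """Offset (dx, dy) and angle of the vector from p to q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return dx, dy, math.atan2(dy, dx)


def link_probability(feats: Sequence[float], w: Sequence[float], b: float) -> float:
    """Logistic regression: sigmoid of a linear function of the features."""
    z = b + sum(wi * fi for wi, fi in zip(w, feats))
    return 1.0 / (1.0 + math.exp(-z))


def edge_cost(p_link: float) -> float:
    """Assumed conversion: the cost is negative when linking is more likely than not."""
    return math.log((1.0 - p_link) / p_link)


# Made-up weights and bias, for illustration only.
w, b = (-0.02, -0.02, 0.1), 1.0
p = link_probability(pair_features((10.0, 10.0), (20.0, 30.0)), w, b)
print(round(p, 3), round(edge_cost(p), 3))
```

With this sign convention, a minimizer is rewarded for linking pairs that the regressor considers likely to belong together.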
Top-Down/Bottom-Up (TD/BU) Model
Compared to BU, the generic part detectors are replaced with a new convolutional body-part detector that is trained to output consistent body configurations conditioned on the person location
- generating body part proposals conditioned on the locations of people in the image (TD)
- performing joint reasoning to group these proposals into spatio-temporal clusters (BU)

select the person’s head as the root part that represents the person’s location (spatial propagation)
For root part set,
explicitly set (“must-not-link” constraint)
in combination with the cycle inequality (5), this means each proposal can be connected to only one of the “person nodes”
cost for an edge connecting a detection proposal and a “person node”:
generated by the convolutional network
augment graph G with attractive/repulsive and temporal terms
Attractive/Repulsive Edges
- defined between two proposals of the same type within the same image
- decision to group two nodes is made based on the evidence from the entire image
- not just non-maximum suppression (NMS) based on the state of only two detections
- cost of edges is inversely proportional to distance
Spatial Propagation
