Paper Reading: ArtTrack

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.


The design of the model is motivated by two factors

  • bottom-up end-to-end learning
  • multi-person articulated tracking

Leverage available image information

Learn a model for associating a body joint to a specific person end-to-end relying on a conv. network

Facilitate efficient inference in video

Fast inference method:

E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR’17.


  • Articulated tracking model operating by bottom-up assembly of part detections within each frame and over time
  • Single-frame pose estimation relying on a sparse graph between body parts and generating body-part proposals conditioned on a person’s location

U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In ECCVw’16.

using graph partitioning approach closely related to:

S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In CVPR, 2015.

L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR’16.

E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV’16.

work is closely related to (similar formulation):

U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. In CVPR’17.

differs in the type of body-part proposals and the structure of the spatio-temporal graph


  1. a convolutional network for generating body part proposals
  2. an approach to group the proposals into spatio-temporal clusters

Tracking by Spatio-temporal Grouping

Part detection: $D = \{ \mathbf{d}_i \}$, $\mathbf{d}_i = (t_i, d_i^{pos}, \pi_i, \tau_i)$

  • $t_i$: index of the video frame
  • $d_i^{pos}$: spatial location in image coordinates
  • $\pi_i$: probability of a correct detection
  • $\tau_i$: body joint type
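To make the tuple concrete, here is a minimal sketch of how one part detection could be stored. The class name, field names, and sample values are my own assumptions for illustration and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    """One part detection d_i = (t_i, d_i^pos, pi_i, tau_i). Hypothetical layout."""
    t: int                     # t_i: index of the video frame
    pos: Tuple[float, float]   # d_i^pos: (x, y) location in the image
    score: float               # pi_i: probability that the detection is correct
    joint: str                 # tau_i: body joint type


# Made-up detections spanning two frames.
detections = [
    Detection(t=0, pos=(120.0, 45.0), score=0.93, joint="head"),
    Detection(t=0, pos=(118.0, 90.0), score=0.81, joint="neck"),
    Detection(t=1, pos=(123.0, 47.0), score=0.90, joint="head"),
]

# Group detections by frame, e.g. as a first step before building
# within-frame and across-frame edges.
by_frame: Dict[int, List[Detection]] = {}
for d in detections:
    by_frame.setdefault(d.t, []).append(d)
```

Grouping by frame is only one possible organisation; it is convenient here because the edges of the graph are defined either within a frame or between frames.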

Graph: $G = (D, E)$

Subgraph $G' = (D', E')$, defined by $x \in \{0,1\}^D$ and $y \in \{0,1\}^E$

Tracking solution: $Z \subseteq \{0,1\}^{D \cup E}$

Given image observations:

  • Node and edge features: $f$ and $g$
  • $x$ and $y$ are given by maximizing $p(x,y \mid f,g,Z) \propto p(Z \mid x,y) \prod_{d\in D} p(x_d \mid f^d) \prod_{e\in E} p(y_e \mid g^e)$
  • Integer-Linear Program: $\min_{(x,y) \in Z} \sum_{d\in D} c_d x_d + \sum_{e \in E} d_e y_e$
    where $c_d = \log \frac{p(x_d=1 \mid f^d)}{p(x_d=0 \mid f^d)}$ and $d_e = \log \frac{p(y_e=1 \mid g^e)}{p(y_e=0 \mid g^e)}$
  • Constraints on Z:
    • minimum cost subgraph multicut problem
    • (3) and (4) ensure assignment of node and edge variables is consistent
    • (5) ensures that for every two nodes either all or none of the paths between these nodes in graph $G$ are contained in one of the connected components of subgraph $G'$
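As a sanity check on the formulation, the objective can be minimised by brute force on a toy graph. This is only a sketch for building intuition, not the fast inference method of Levinkov et al.: it enumerates every way of dropping nodes and partitioning the kept ones, which satisfies the consistency and path constraints by construction. The function name and all cost values are made up, and the costs are passed in directly as numbers, so a negative cost is what makes keeping a node or linking an edge attractive under minimisation.

```python
from itertools import product


def solve_subgraph_multicut(node_cost, edge_cost):
    """Exhaustively minimise sum_d c_d x_d + sum_e d_e y_e on a tiny graph.

    node_cost: list of c_d, one entry per node.
    edge_cost: dict mapping an edge (i, j) to its cost d_e.
    Returns (best_cost, x, y).
    """
    n = len(node_cost)
    best = (float("inf"), None, None)
    # Label 0 means the node is dropped; labels 1..n are component ids.
    for labels in product(range(n + 1), repeat=n):
        x = [1 if lab > 0 else 0 for lab in labels]
        # An edge is in G' only if both endpoints are kept and share a
        # component, which makes node and edge variables consistent.
        y = {
            (i, j): int(labels[i] > 0 and labels[i] == labels[j])
            for (i, j) in edge_cost
        }
        cost = sum(c * xi for c, xi in zip(node_cost, x))
        cost += sum(edge_cost[e] * y[e] for e in edge_cost)
        if cost < best[0]:
            best = (cost, x, y)
    return best


# Toy instance: nodes 0 and 1 attract each other, node 2 repels both,
# node 3 is a likely false positive (positive node cost).
node_cost = [-2.0, -2.0, -2.0, 1.0]
edge_cost = {(0, 1): -1.5, (1, 2): 2.0, (0, 2): 1.0, (2, 3): -0.5}
best_cost, x, y = solve_subgraph_multicut(node_cost, edge_cost)
```

On this instance the minimiser keeps nodes 0, 1 and 2, drops node 3, and links only the edge (0, 1), so the result has two connected components. The enumeration grows exponentially with the number of nodes, which is why real instances need a dedicated solver.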

Articulated Multi-person Tracking

Three types of edges

  • cross-type
  • same-type
  • temporal

Two model variants

  • Bottom-Up Model
  • Top-Down/Bottom-Up Model


E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR’17.

Bottom-Up (BU) Model

In this model, the body part proposals are generated with the authors' publicly available convolutional part detector (DeeperCut)

Two connectivity patterns

  • fully connected graph
  • sparse version: simpler and faster version of the model by omitting edges between parts that carry little information about each other’s image location

Edge costs

  • depending on detection types
  • computed by logistic regression given the features computed from offset and angle
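A rough sketch of the idea behind these edge costs: compute geometric features for a pair of detections and pass them through a logistic model selected by the pair of joint types. The exact feature set and the learned weights of the paper are not reproduced here; the features, the `WEIGHTS` table, and its numbers are assumptions made up for illustration.

```python
import math


def pairwise_features(pos_a, pos_b):
    """Offset and angle features for a pair of detections (assumed feature set)."""
    dx = pos_b[0] - pos_a[0]
    dy = pos_b[1] - pos_a[1]
    dist = math.hypot(dx, dy)      # length of the offset
    angle = math.atan2(dy, dx)     # direction of the offset in radians
    return [dx, dy, dist, angle]


def link_probability(features, weights, bias):
    """Logistic regression: probability that both detections belong together."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters, one (weights, bias) entry per pair of joint types.
# Here only the distance feature carries weight.
WEIGHTS = {
    ("head", "neck"): ([0.0, 0.0, -0.05, 0.0], 3.0),
}

weights, bias = WEIGHTS[("head", "neck")]
p_near = link_probability(pairwise_features((100.0, 50.0), (100.0, 60.0)), weights, bias)
p_far = link_probability(pairwise_features((100.0, 50.0), (300.0, 50.0)), weights, bias)
```

With these made-up weights a head and a neck 10 pixels apart are likely to be linked, while a pair 200 pixels apart is not. In a trained model the weights would be fit per type pair from annotated poses.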

Top-Down/Bottom-Up (TD/BU) Model

Compared to BU, the generic part detectors are replaced with a new convolutional body-part detector that is trained to output consistent body configurations conditioned on the person location

  1. generating body part proposals conditioned on the locations of people in the image (TD)
  2. performing joint reasoning to group these proposals into spatio-temporal clusters (BU)

select person’s head as a root part that is responsible for representing the person location (spatial propagation)

For root part set,

$D^{root} = \{ d_i^{root} \}$

explicitly set (“must-not-link” constraint)

$y_{d_j^{root}, d_k^{root}} = 0$

in combination with the cycle inequality (5) - each proposal can be connected to one of the “person nodes” only

cost for the edge connecting detection proposal $\mathbf{d}_k$ and a “person node” $d_i^{root}$:
$p_{d_k^c}(d_k^{pos} \mid d_i^{root})$, generated by the convolutional network
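To see what the must-not-link constraint buys, consider a simplified setting: every person node is kept, and the only edges are those between proposals and person nodes. Since two person nodes can never share a component, each proposal can attach to at most one of them, and the problem decouples into an independent choice per proposal. The sketch below implements just this special case. The function name and all costs are hypothetical, negative costs are treated as favourable under minimisation, and the attractive/repulsive and temporal edges of the full model are ignored.

```python
def assign_to_person_nodes(unary_cost, link_cost):
    """Pick the cheapest option for each proposal independently.

    unary_cost[k]: cost of keeping proposal k.
    link_cost[k][i]: cost of linking proposal k to person node i.
    Returns one (state, person_index) tuple per proposal, where state is
    "drop" or "keep" and person_index is None if the proposal is unlinked.
    """
    assignments = []
    for c_k, links in zip(unary_cost, link_cost):
        # Option 1: drop the proposal, which costs nothing.
        best_cost, best_choice = 0.0, ("drop", None)
        # Option 2: keep it without linking it to any person node.
        if c_k < best_cost:
            best_cost, best_choice = c_k, ("keep", None)
        # Option 3: keep it and link it to exactly one person node.
        for i, l_ki in enumerate(links):
            if c_k + l_ki < best_cost:
                best_cost, best_choice = c_k + l_ki, ("keep", i)
        assignments.append(best_choice)
    return assignments


# Three proposals, two person nodes, made-up costs.
unary_cost = [-1.0, -0.5, 2.0]
link_cost = [[-2.0, 0.5], [1.0, -3.0], [-0.5, -1.0]]
assignments = assign_to_person_nodes(unary_cost, link_cost)
```

Here the first proposal joins person 0, the second joins person 1, and the third is dropped because no link is cheap enough to offset its positive unary cost. Once same-type and temporal edges are added, the choices are no longer independent and the joint optimisation is needed.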

augment graph G with attractive/repulsive and temporal terms

Attractive/Repulsive Edges

  • defined between two proposals of the same type within the same image
  • decision to group two nodes is made based on the evidence from the entire image
    • not just NMS based on the state of just two detections
  • the cost of an edge is inversely proportional to the distance between the two proposals

Spatial Propagation

Author: Texot
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: Texot.