Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

## Intro

The design of the model is motivated by two factors:

- bottom-up end-to-end learning
- multi-person articulated tracking

**Leverage available image information**

Learn a model that associates each body joint with a specific person, trained end-to-end with a convolutional network

**Facilitate efficient inference in video**

Fast inference method:

E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR’17.

Contributions:

- An articulated tracking model that operates by bottom-up assembly of part detections within each frame and over time
- A single-frame pose estimation model that relies on a sparse graph between body parts and generates body-part proposals conditioned on a person’s location

## Related

U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In ECCVw’16.

Uses a graph partitioning approach closely related to:

S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In CVPR, 2015.

L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR’16.

E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV’16.

The work is closely related to (similar formulation):

U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and tracking. In CVPR’17.

The two differ in the type of body-part proposals and the structure of the spatio-temporal graph

## Overview

- a convolutional network for generating body part proposals
- an approach to group the proposals into spatio-temporal clusters

## Tracking by Spatio-temporal Grouping

Part detection: $D = \{ \mathbf{d}_i \}, \mathbf{d}_i = (t_i, d_i^{pos}, \pi_i, \tau_i)$

- $t_i$: index of the video frame
- $d_i^{pos}$: spatial location in the image
- $\pi_i$: probability of correct detection
- $\tau_i$: type of the body joint
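
As a rough illustration of the detection tuple above, here is a minimal sketch of how one detection could be stored. The class name `Detection`, its field names, and the example values are assumptions made for this note, not names from the paper or its code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Detection:
    """One body-part detection d_i = (t_i, d_i^pos, pi_i, tau_i)."""
    frame: int                # t_i: index of the video frame
    pos: Tuple[float, float]  # d_i^pos: (x, y) location in the image
    prob: float               # pi_i: probability that the detection is correct
    joint: str                # tau_i: body joint type, e.g. "head"


# Hypothetical example detections, not taken from any dataset.
detections = [
    Detection(frame=0, pos=(120.0, 45.0), prob=0.93, joint="head"),
    Detection(frame=0, pos=(118.0, 90.0), prob=0.81, joint="neck"),
    Detection(frame=1, pos=(123.0, 47.0), prob=0.90, joint="head"),
]
```

The set $D$ is then just the collection of such records over all frames of the video.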

Graph: $G = (D, E)$

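Subgraph $G' = (D', E')$ of $G$, defined by indicator vectors $x \in \{0,1\}^D\text{ and }y \in \{0,1\}^E$ ($x_d = 1$ selects node $d$, $y_e = 1$ selects edge $e$)

Set of feasible tracking solutions: $Z \subseteq \{0,1\}^{D \cup E}$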

Given image observations:

- Node and edge features: $f\text{ and }g$
- $x$ and $y$ are given by maximizing $p(x,y \mid f,g,Z) \propto p(Z \mid x,y) \prod_{d\in D} p(x_d \mid f^d) \prod_{e\in E} p(y_e \mid g^e)$
- Integer-Linear Program $\min_{(x,y) \in Z} \sum_{d\in D} c_d x_d + \sum_{e \in E} d_e y_e$

where $c_d = \log \frac{p(x_d=0 \mid f^d)}{p(x_d=1 \mid f^d)}, d_e = \log \frac{p(y_e=0 \mid g^e)}{p(y_e=1 \mid g^e)}$. These costs follow from taking the negative logarithm of the posterior, which turns the maximization into a minimization.

Constraints on Z:

- Z defines a minimum cost subgraph multicut problem
- (3) and (4) in the paper ensure that the assignment of node and edge variables is consistent
- (5) in the paper ensures that for every two nodes either all or none of the paths between these nodes in graph $G$ are contained in one of the connected components of subgraph $G'$
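
The objective and the constraints can be sketched in a few lines. The numbered constraints are not reproduced in these notes, so this sketch assumes their usual subgraph multicut form: an edge may only be selected if both endpoints are selected ($y_{uv} \le x_u$, $y_{uv} \le x_v$), and two nodes that are connected through selected edges must also be directly linked whenever an edge between them exists. Costs use the negative log-odds $\log \frac{p(0)}{p(1)}$, which is what the negative logarithm of the posterior gives under minimization. All function names and the toy instance are hypothetical; this only evaluates a given labeling and is not a solver.

```python
import math
from typing import Dict, Tuple

Edge = Tuple[int, int]


def log_odds_cost(p_one: float, eps: float = 1e-6) -> float:
    """Cost log(p(0)/p(1)); negative when the variable is likely to be 1."""
    p = min(max(p_one, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log((1.0 - p) / p)


def objective(x: Dict[int, int], y: Dict[Edge, int],
              node_prob: Dict[int, float], edge_prob: Dict[Edge, float]) -> float:
    """Evaluate sum_d c_d x_d + sum_e d_e y_e for a given labeling."""
    total = sum(log_odds_cost(node_prob[d]) * x[d] for d in x)
    total += sum(log_odds_cost(edge_prob[e]) * y[e] for e in y)
    return total


def is_feasible(x: Dict[int, int], y: Dict[Edge, int]) -> bool:
    """Check the assumed consistency and cycle constraints on (x, y)."""
    # Consistency: a selected edge needs both endpoints selected.
    for (u, v), y_uv in y.items():
        if y_uv > x[u] or y_uv > x[v]:
            return False

    # Union-find over selected edges gives the connected components of G'.
    parent = {d: d for d in x}

    def find(d: int) -> int:
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for (u, v), y_uv in y.items():
        if y_uv == 1:
            parent[find(u)] = find(v)

    # Cycle condition: an unselected edge must not lie inside one component.
    return all(y_uv == 1 or find(u) != find(v) for (u, v), y_uv in y.items())


# Hypothetical toy instance: three detections forming a triangle.
x = {0: 1, 1: 1, 2: 1}
y_ok = {(0, 1): 1, (1, 2): 1, (0, 2): 1}
y_bad = {(0, 1): 1, (1, 2): 1, (0, 2): 0}  # 0 and 2 joined via 1 but not directly
node_prob = {0: 0.9, 1: 0.8, 2: 0.7}
edge_prob = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.6}

print(is_feasible(x, y_ok), is_feasible(x, y_bad))
print(objective(x, y_ok, node_prob, edge_prob))
```

In `y_bad` the labeling joins 0 with 1 and 1 with 2 but cuts 0 from 2, which is exactly the kind of inconsistent grouping the cycle constraint rules out.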

## Articulated Multi-person Tracking

Three edge types

- cross-type
- same-type
- temporal

Two models

- Bottom-Up Model
- Top-Down/Bottom-Up Model
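
The three edge types listed above can be told apart from the frame index and joint type of the two detections. The sketch below assumes that temporal edges are those between detections in different frames, and that within a frame the joint types decide between same-type and cross-type. The `Detection` class and `edge_kind` function are hypothetical names for this note.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Detection:
    frame: int                # index of the video frame
    pos: Tuple[float, float]  # (x, y) location in the image
    joint: str                # body joint type


def edge_kind(a: Detection, b: Detection) -> str:
    """Classify the edge between two detections (assumed definitions)."""
    if a.frame != b.frame:
        return "temporal"    # detections in different frames
    if a.joint == b.joint:
        return "same-type"   # same joint type within one frame
    return "cross-type"      # different joint types within one frame


# Hypothetical detections for illustration.
head_0 = Detection(frame=0, pos=(120.0, 45.0), joint="head")
neck_0 = Detection(frame=0, pos=(118.0, 90.0), joint="neck")
head_0b = Detection(frame=0, pos=(300.0, 50.0), joint="head")
head_1 = Detection(frame=1, pos=(123.0, 47.0), joint="head")

print(edge_kind(head_0, neck_0))   # cross-type
print(edge_kind(head_0, head_0b))  # same-type
print(edge_kind(head_0, head_1))   # temporal
```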

Solver

E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR’17.

### Bottom-Up (BU) Model

In this model, the body part proposals are generated with the authors' publicly available convolutional part detector (Deepercut)

Two connectivity patterns

- fully connected graph
- **sparse version**: a simpler and faster version of the model that omits edges between parts that carry little information about each other’s image location

**Edge costs**

- depend on the types of the two detections
- computed by logistic regression on features derived from the offset and angle between the two detections
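
A minimal sketch of how such an edge cost could be computed, assuming the features are the offset, distance, and angle between the two detections, and that the regressor outputs the probability that both detections belong to the same person. The feature set, weights, and bias here are made up for illustration; in practice one set of weights would be learned per pair of detection types.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pairwise_features(p: Point, q: Point) -> List[float]:
    """Offset, distance, and angle from detection p to detection q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return [dx, dy, math.hypot(dx, dy), math.atan2(dy, dx)]


def logistic(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def edge_cost(p: Point, q: Point, weights: Sequence[float], bias: float,
              eps: float = 1e-6) -> float:
    """Cost log((1 - prob) / prob), where prob = P(same person | features)."""
    feats = pairwise_features(p, q)
    z = bias + sum(w * f for w, f in zip(weights, feats))
    prob = min(max(logistic(z), eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log((1.0 - prob) / prob)  # analytically equal to -z


# Hypothetical weights: only the distance feature matters, closer is better.
weights = [0.0, 0.0, -0.05, 0.0]
bias = 2.0

near = edge_cost((100.0, 100.0), (100.0, 110.0), weights, bias)
far = edge_cost((100.0, 100.0), (100.0, 200.0), weights, bias)
print(near, far)
```

With these made-up weights the nearby pair gets a negative (attractive) cost and the distant pair a positive (repulsive) one.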

### Top-Down/Bottom-Up (TD/BU) Model

Compared to BU, the generic part detectors are replaced with a new convolutional body-part detector that is trained to output consistent body configurations **conditioned on the person location**

- generating body part proposals conditioned on the locations of people in the image (TD)
- performing joint reasoning to group these proposals into spatio-temporal clusters (BU)

The person’s **head** is selected as the root part that represents the person location (spatial propagation)

For the set of root parts,

$D^{root} = \{ d_i^{root} \}$

explicitly set a “must-not-link” constraint between every pair of root nodes:

$y_{d_j^{root}, d_k^{root}} = 0$

In combination with the cycle inequality (5), this means each proposal can be connected to only one of the “person nodes”
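
A short worked step shows why. It assumes the cycle inequality (5) has the standard multicut form $(1 - y_e) \le \sum_{e' \in C \setminus \{e\}} (1 - y_{e'})$ for every cycle $C$ of $G$ and every edge $e \in C$ (the inequality itself is not reproduced in these notes). Take the triangle formed by a proposal $\mathbf{d}_m$ (the index $m$ is only illustrative) and two root nodes $d_j^{root}$, $d_k^{root}$, and apply the inequality to the root-root edge, for which $y = 0$:

$$
\begin{aligned}
1 - y_{d_j^{root}, d_k^{root}} &\le (1 - y_{d_m, d_j^{root}}) + (1 - y_{d_m, d_k^{root}}) \\
1 &\le 2 - y_{d_m, d_j^{root}} - y_{d_m, d_k^{root}} \\
y_{d_m, d_j^{root}} + y_{d_m, d_k^{root}} &\le 1
\end{aligned}
$$

So a proposal cannot be linked to two different person nodes at the same time.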

cost for edge connecting detection proposal $\mathbf{d}_k$ and a “person node” $d_i^{root}$:

$p_{d_k^c} (d_k^{pos} \mid d_i^{root})$ generated by the convolutional network

Graph $G$ is then augmented with attractive/repulsive and temporal terms

### Attractive/Repulsive Edges

- defined between two proposals of the same type within the same image
- decision to group two nodes is made based on the evidence from the entire image
- not just NMS based on the state of just two detections
- cost of edges is inversely proportional to distance