Paper Reading: PersonLab

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Title: PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model

George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy



Hough voting

$$h_k(x) = \frac{1}{\pi R^2} \sum_{i=1:N} p_k(x_i)\, B(x_i + S_k(x_i) - x)$$

$B$ is the bilinear interpolation kernel, and $S_k$ is the short-range offset field for keypoint type $k$.
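The Hough-voting equation can be sketched as a vote-splatting loop: every pixel casts its keypoint probability at the position predicted by its short-range offset, spread bilinearly over the four neighbouring pixels. The array layout (offsets stored as `(dy, dx)`) and function name are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def hough_score_map(p_k, S_k, R=3.0):
    """Accumulate a Hough score map h_k from a keypoint heatmap p_k (H, W)
    and short-range offsets S_k (H, W, 2), stored as (dy, dx).
    Each pixel votes at x_i + S_k(x_i); B is the bilinear kernel."""
    H, W = p_k.shape
    h_k = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            ty = y + S_k[y, x, 0]              # vote target x_i + S_k(x_i)
            tx = x + S_k[y, x, 1]
            y0, x0 = int(np.floor(ty)), int(np.floor(tx))
            dy, dx = ty - y0, tx - x0
            # bilinear kernel B spreads the vote over 4 neighbours
            for yy, xx, w in [(y0, x0, (1 - dy) * (1 - dx)),
                              (y0, x0 + 1, (1 - dy) * dx),
                              (y0 + 1, x0, dy * (1 - dx)),
                              (y0 + 1, x0 + 1, dy * dx)]:
                if 0 <= yy < H and 0 <= xx < W:
                    h_k[yy, xx] += p_k[y, x] * w
    return h_k / (np.pi * R**2)
```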


We add to our network a separate pairwise mid-range 2-D offset field output M_{k,l}(x) designed to connect pairs of keypoints. We compute 2(K − 1) such offset fields, one for each directed edge connecting pairs (k, l) of keypoints that are adjacent in a tree-structured kinematic graph of the person.

Mid-range pairwise offsets


$$M_{k,l}(x) = (y_{j,l} - x)\, [x \in \mathcal{D}_R(y_{j,k})]$$

Recurrent offset refinement

$$M_{k,l}(x) \leftarrow x' + S_l(x') - x, \quad \text{where } x' = x + M_{k,l}(x)$$
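The refinement step can be sketched as follows: each pixel's mid-range offset is snapped towards the target keypoint by resampling the more accurate short-range field at the predicted position, i.e. M(x) ← x' + S_l(x') − x with x' = x + M(x). Nearest-neighbour sampling is a simplification of this sketch (the paper resamples bilinearly).

```python
import numpy as np

def refine_midrange(M, S_l, steps=2):
    """Recurrently refine mid-range offsets M (H, W, 2) using the
    short-range offsets S_l (H, W, 2) of the target keypoint type l.
    Offsets are stored as (dy, dx); sampling is nearest-neighbour."""
    H, W = M.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    pos = np.stack([ys, xs], axis=-1)          # pixel positions x
    for _ in range(steps):
        tgt = pos + M                          # x' = x + M(x)
        iy = np.clip(np.rint(tgt[..., 0]).astype(int), 0, H - 1)
        ix = np.clip(np.rint(tgt[..., 1]).astype(int), 0, W - 1)
        M = tgt + S_l[iy, ix] - pos            # x' + S_l(x') - x
    return M
```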

Fast greedy decoding

  1. Create a priority queue, shared across all K keypoint types, into which we insert the position x_i and keypoint type k of all local maxima in the Hough score maps h_k(x) (these serve as candidate seeds for starting a detection instance).
  2. Pop elements in descending score order.
  3. If the position x_i of the current candidate detection seed of type k is within a disk D_r(y_{j',k}) of the corresponding keypoint of a previously detected person instance j', reject it.
  4. Otherwise, start a new detection instance j with the k-th keypoint at position y_{j,k} = x_i serving as seed.
  5. Follow the mid-range displacement vectors to connect keypoint pairs along the kinematic tree, setting y_{j,l} = y_{j,k} + M_{k,l}(y_{j,k}).
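The steps above can be sketched with a shared max-heap. The data layout here is an assumption: `peaks` is a pre-extracted list of Hough local maxima, and `M[(k, l)]` is a callable returning the mid-range offset for the directed edge (k, l), whereas the paper works on dense offset fields.

```python
import heapq
import numpy as np

def greedy_decode(peaks, M, r=1.0):
    """peaks: list of (score, keypoint_type k, position y) local maxima.
    M: {(k, l): offset_fn} for directed edges of the kinematic tree.
    Returns a list of instances, each a dict {keypoint_type: position}."""
    # step 1: one priority queue shared across all keypoint types
    heap = [(-s, k, tuple(y)) for s, k, y in peaks]
    heapq.heapify(heap)
    instances = []
    while heap:
        s, k, y = heapq.heappop(heap)          # step 2: pop by score
        y = np.asarray(y, dtype=float)
        # step 3: reject seeds within r of the same-type keypoint
        # of an already-decoded instance
        if any(np.linalg.norm(y - inst[k]) <= r
               for inst in instances if k in inst):
            continue
        inst = {k: y}                          # step 4: new instance seed
        frontier = [k]
        while frontier:                        # step 5: follow the tree
            a = frontier.pop()
            for (src, dst), offset in M.items():
                if src == a and dst not in inst:
                    inst[dst] = inst[src] + offset(inst[src])
                    frontier.append(dst)
        instances.append(inst)
    return instances
```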


Keypoint score


$$s_{j,k} = \mathrm{E}\{\mathrm{OKS}_{j,k}\} = p_k(y_{j,k}) \int_{x \in \mathcal{D}_R(y_{j,k})} \hat{h}_k(x) \exp\left(-\frac{\lVert x - y_{j,k}\rVert^2}{2\lambda_j^2 \kappa_k^2}\right) dx$$
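A discrete sketch of the expected-OKS score: the integral over the disk D_R is approximated by a pixel sum over the normalized Hough map. The function name and argument layout are assumptions of this sketch.

```python
import numpy as np

def keypoint_score(h_hat, p_yk, y_k, lam_j, kappa_k, R=4.0):
    """Expected-OKS score for one detected keypoint.
    h_hat: (H, W) normalized Hough score map; p_yk: keypoint probability
    at y_k; y_k: (y, x) position; lam_j: instance scale; kappa_k:
    per-type OKS falloff constant."""
    H, W = h_hat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    d2 = (ys - y_k[0])**2 + (xs - y_k[1])**2
    in_disk = d2 <= R**2                       # restrict to D_R(y_k)
    oks = np.exp(-d2 / (2.0 * lam_j**2 * kappa_k**2))
    return p_yk * np.sum(h_hat[in_disk] * oks[in_disk])
```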

Instance-level score


We use as the final instance-level score the average of the scores of the keypoints that have not already been claimed by higher-scoring instances:

$$s_j = \frac{1}{K} \sum_{k=1:K} s_{j,k} \left[\lVert y_{j,k} - y_{j',k}\rVert > r, \text{ for every } j' < j\right]$$
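This keypoint-level soft-NMS can be sketched directly: a keypoint contributes to its instance's score only if it lies farther than r from the same-type keypoint of every higher-ranked instance. The array layout is an assumption of this sketch.

```python
import numpy as np

def instance_scores(keypoints, kp_scores, r=1.0):
    """keypoints: (J, K, 2) detected positions, rows ordered by
    decreasing instance rank; kp_scores: (J, K) per-keypoint scores
    s_{j,k}. Returns the (J,) instance-level scores s_j."""
    J, K = kp_scores.shape
    scores = np.zeros(J)
    for j in range(J):
        keep = np.ones(K, dtype=bool)
        for jp in range(j):                 # higher-ranked instances j' < j
            d = np.linalg.norm(keypoints[j] - keypoints[jp], axis=-1)
            keep &= d > r                   # drop already-claimed keypoints
        scores[j] = kp_scores[j][keep].sum() / K
    return scores
```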

Instance-level person segmentation

Given keypoint-level person instance detections, identify pixels that belong to people (recognition) and associate them with the detected person instances (grouping).

  1. Semantic person segmentation in standard fully-convolutional fashion
  2. Associating segments with instances via geometric embeddings

Long-range offset

$$L_k(x) = y_{j,k} - x$$


$$L_k(x) \leftarrow x' + L_k(x') - x,\ x' = x + L_k(x) \quad \text{and} \quad L_k(x) \leftarrow x' + S_k(x') - x,\ x' = x + L_k(x)$$

Geometric embedding

$$G(x) = (G_k(x))_{k=1,\dots,K}, \quad G_k(x) = x + L_k(x)$$

Embedding distance

To decide if the person pixel x_i belongs to the j-th person instance, we compute the embedding distance metric

$$D_{i,j} = \frac{1}{\sum_k p_k(y_{j,k})} \sum_{k=1}^{K} p_k(y_{j,k}) \frac{1}{\lambda_j} \lVert G_k(x_i) - y_{j,k} \rVert$$

We set $\lambda_j$ equal to the square root of the area of the bounding box tightly containing all detected keypoints of the j-th person instance.
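The embedding distance for one pixel is a weighted average of its per-keypoint disagreements with the instance, normalized by the instance scale. A minimal sketch, with array shapes assumed:

```python
import numpy as np

def embedding_distance(G_x, y_j, p_j, lam_j):
    """D_{i,j} for one pixel. G_x: (K, 2) embedding G_k(x_i) at the
    pixel; y_j: (K, 2) detected keypoints of instance j; p_j: (K,)
    keypoint probabilities used as weights; lam_j: instance scale."""
    d = np.linalg.norm(G_x - y_j, axis=-1) / lam_j
    return np.sum(p_j * d) / np.sum(p_j)
```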

With $N_s$ person pixels and $M$ person instances, the complexity is $O(N_s M)$, instead of the $O(N_s^2)$ of standard embedding-based segmentation techniques, which, at least in principle, require computing embedding-vector distances for all pixel pairs.


  1. Find all positions x_i marked as person in the semantic segmentation map, i.e. those pixels with semantic segmentation probability p_S(x_i) >= 0.5.
  2. Associate each person pixel x_i with every detected person instance j for which the embedding distance metric satisfies D_{i,j} <= t.
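The two steps above reduce to a pair of thresholded comparisons once the embedding distances are precomputed. The flat pixel layout and the threshold value in the test are assumptions of this sketch.

```python
import numpy as np

def assign_pixels(seg_prob, D, t=0.25):
    """Associate person pixels with instances.
    seg_prob: (N,) semantic person probability per pixel;
    D: (N, M) embedding distances to the M instances.
    Returns a boolean (N, M) assignment mask; a pixel may in
    principle be claimed by several overlapping instances."""
    person = seg_prob >= 0.5                   # step 1: person pixels
    return (D <= t) & person[:, None]          # step 2: distance gate
```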

Author: Texot
Reprint policy: All articles in this blog are licensed under CC BY 4.0 unless otherwise stated. If reproduced, please indicate the source: Texot!