Paper Reading: Self-supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Title: Self-supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos

Authors: Ömer Sümer, Tobias Dencker, Björn Ommer

Heidelberg Collaboratory for Image Processing IWR, Heidelberg University, Germany


To avoid expensive labelling, the paper exploits spatiotemporal relations in training videos for self-supervised learning of pose embeddings

Key Idea: combine temporal ordering and spatial placement estimation as auxiliary tasks for learning pose similarities

Avoid ambiguous and incorrect training labels: start training with the most reliable data samples => gradually increase the difficulty

Further refine: Mine repetitive poses in individual videos (provide reliable labels while removing inconsistencies)


Finding similar postures enables applications like action recognition or video content retrieval

Human joints as surrogate for describing similarity

  1. measuring distances in pose space accurately and coming up with a non-ambiguous Euclidean embedding is a challenging problem
  2. manually annotating human joints in larger datasets is expensive

A solution: unsupervised training

  • switch to a related auxiliary task for which label information is available

Several well-known sources of weak supervision

  • spatial configuration of natural scenes
  • inpainting
  • super-resolution
  • image colorization
  • tracking
  • ego-motion
  • audio

Proposed method:

  • Learning spatiotemporal relations in videos by two auxiliary tasks:
    • temporal ordering task - whether two given person images are temporally close (similar)
    • spatial placement task - extract random crops from the spatial neighborhood of persons, and learn whether a given patch contains a person or not
    • Learning spatial and temporal relations provides
      • “what” we are looking at (person/not person)
      • “how” the instances differ (similar/dissimilar poses)
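The temporal ordering task only needs frame indices to generate its labels. A minimal sketch of how such self-supervised pairs could be produced; the window sizes `close_window` and `far_gap` are illustrative choices, not the paper's exact values:

```python
import random

def temporal_ordering_pair(num_frames, close_window=5, far_gap=30):
    """Sample a frame pair and label it temporally close (1) or distant (0).

    Illustrative sketch: positives come from a small temporal window
    around an anchor frame, negatives from far away in the same video.
    """
    anchor = random.randrange(num_frames)
    if random.random() < 0.5:
        # positive: second frame within a small temporal window
        other = min(num_frames - 1, anchor + random.randint(1, close_window))
        label = 1
    else:
        # negative: second frame far from the anchor (wraps around the video)
        other = (anchor + far_gap + random.randint(0, close_window)) % num_frames
        label = 0
    return anchor, other, label
```

The spatial placement task would analogously crop patches around and away from the person bounding box, labeling them person/not-person.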

Using Curriculum-based learning and repetition mining

  • arrange the training set: only the easy samples => iteratively extend to harder ones
  • eliminating inactive video parts

Pose Estimation

X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014

learned pairwise part relations combining CNN with graphical models

J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, pages 1799–1807, 2014.

exploited CNNs for relationship between body parts with a cascade refinement

Similarity learning

Siamese-type architecture

J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, pages 737–744, 1994.

Similarity learning in human pose analysis

G. Mori, C. Pantofaru, N. Kothari, T. Leung, G. Toderici, A. Toshev, and W. Yang. Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302, 2015.

S. Kwak, M. Cho, and I. Laptev. Thin-slicing for pose: Learning to understand pose without explicit pose estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4938–4947, June 2016.

body joint locations are used to create similar and dissimilar pairs of instances from annotated human pose datasets

Self-supervised learning

Alternative sources of supervision

  • ego-motion
  • colorization
  • image generation
  • spatial or temporal clues

C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1422–1430, Dec 2015.

take image patches from a 3 × 3 grid and classify the relative location of 8 patches with respect to a center patch
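This context-prediction setup can be sketched as a simple mapping from grid offsets to one of 8 class labels; grid coordinates are (row, col), and the particular class ordering here is an arbitrary choice:

```python
def relative_position_label(center, neighbor):
    """Map a neighbor patch's grid offset relative to the center patch of a
    3x3 grid to one of 8 class labels (context-prediction style).

    Sketch only: the class ordering is an arbitrary convention.
    """
    # the 8 possible (row, col) offsets around the center cell
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    dr = neighbor[0] - center[0]
    dc = neighbor[1] - center[1]
    return offsets.index((dr, dc))
```

The network then sees the two patches and must predict this label, forcing it to learn spatial structure without any manual annotation.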

M. Noroozi and P. Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, pages 69–84. Springer, Cham, 2016.

proposed a localization problem given all 9 patches at once. Also, they used 100 relative locations as class labels out of 9! permutations using a Hamming distance-based selection
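The permutation subset can be approximated with a greedy farthest-point selection under Hamming distance. The sketch below samples a candidate pool instead of enumerating all 9! permutations to stay tractable; pool size and seed are illustrative:

```python
import random

def hamming(p, q):
    # number of positions where two permutations differ
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_select=100, n_candidates=2000, seed=0):
    """Greedily pick permutations of 9 patch positions that are mutually
    far apart in Hamming distance (approximating the jigsaw-puzzle
    label-set selection over a sampled candidate pool)."""
    rng = random.Random(seed)
    base = list(range(9))
    pool = set()
    while len(pool) < n_candidates:
        pool.add(tuple(rng.sample(base, 9)))
    pool = list(pool)
    chosen = [pool.pop(0)]
    while len(chosen) < n_select:
        # farthest-point step: maximize the minimum distance to chosen set
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen
```

Spreading the permutations apart makes the classification task discriminative: no two class labels correspond to near-identical patch arrangements.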

X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, pages 2794–2802, 2015.

exploited videos by detecting interesting regions with SURF keypoints and tracking them

I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification, pages 527–544. Springer, Cham, 2016.

defined a temporal order verification task, which classifies whether given 3-frame sequences are temporally ordered or not by altering the middle frame
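A sketch of how such order-verification samples might be generated from frame indices. The `gap` value is an illustrative choice, and negatives here simply replace the middle frame with one outside the triplet's span:

```python
import random

def order_verification_sample(num_frames, gap=5, seed=None):
    """Build a 3-frame index tuple labeled 1 if temporally ordered, 0 if
    the middle frame is swapped with a frame outside the triplet
    (Shuffle-and-Learn style; gap size is illustrative)."""
    rng = random.Random(seed)
    a = rng.randrange(num_frames - 2 * gap)
    b, c = a + gap, a + 2 * gap
    if rng.random() < 0.5:
        return (a, b, c), 1          # correctly ordered triplet
    # negative: middle frame taken from after the triplet's span
    outside = (c + rng.randint(1, gap)) % num_frames
    return (a, outside, c), 0
```

The point of contrast with this paper: order verification asks "is this sequence ordered?", while the proposed temporal task asks "are these frames close in time?", which the authors consider a better fit for pose similarity.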

To learn better representations, the authors argue that temporal cues which classify whether given inputs come from temporally close windows are a more effective approach

Temporally close windows or not

S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161–163, 1992.

Local proximity in data (slow feature analysis, SFA)

R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4086–4093, Dec 2015.

inspired from SFA, created a connection between slowness and metric learning by temporal coherence

D. Jayaraman and K. Grauman. Slow and steady feature analysis: Higher order temporal coherence in video. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3852–3861, June 2016.

motivated by temporal smoothness in feature space, exploited higher order coherence, which they referred to as steadiness, in various tasks.

However, slowness or steadiness performs poorly for human actions because of their limited motion and repetitive nature


  • learn auxiliary tasks in relatively small temporal windows, without more than a single cycle of action
  • curriculum learning [1] and repetition mining refine and guide the self-supervised tasks to learn stronger temporal features

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. ACM.


Insight: spatiotemporal relations in videos provide sufficient information for learning

Propose self-supervised pipeline that creates training data for two auxiliary tasks:

  • temporal ordering
  • spatial placement

Raw self-supervised output needs refinement:

  • curriculum learning
  • repetition mining

Two auxiliary tasks are trained in a Siamese CNN

Learned features are eventually used as pose embeddings

Self-supervised Pose Embeddings: Temporal Ordering and Spatial Placement

Creating a Curriculum for Training

Difficulty measure: motion in videos - optical flow based, as the ratio of foreground motion to background motion

Acts as a signal-to-noise ratio

Sort the training samples according to difficulty and split them into discrete blocks. Train with increasing difficulty (decreasing flow ratio)
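Assuming per-sample foreground and background flow magnitudes are precomputed, the curriculum construction reduces to sorting by their ratio and splitting into blocks. A minimal sketch (block count and the epsilon guard are illustrative):

```python
import numpy as np

def build_curriculum(fg_flow, bg_flow, n_blocks=4):
    """Sort samples by foreground/background optical-flow ratio (a
    signal-to-noise proxy for difficulty) and split them into blocks
    of increasing difficulty. Flow magnitudes are assumed precomputed."""
    ratio = np.asarray(fg_flow, dtype=float) / (np.asarray(bg_flow) + 1e-8)
    order = np.argsort(-ratio)          # high ratio = easy, trained first
    return np.array_split(order, n_blocks)
```

Training then iterates over the blocks in order, starting from the easiest (highest-ratio) block and progressively adding harder ones.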

Mining Repetitive Poses

Use the learned pose embeddings to detect repetitive poses in the training data

Obtain a self-similarity matrix by computing all pairwise distances between frames; the distance is the Euclidean norm between normalized pool5 features.

Convolve the self-similarity matrix with a 5x5 circulant filter matrix, then threshold to suppress potential outliers that are not aligned with the off-diagonals.
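A rough sketch of the repetition-mining step: L2-normalize the features, compute the pairwise-distance self-similarity matrix, smooth it with a diagonal filter (standing in for the paper's circulant filter), and threshold. All parameter values here are illustrative:

```python
import numpy as np

def mine_repetitions(features, offset_min=2, size=5, thresh=0.2):
    """Detect repetitive poses via a self-similarity matrix.

    `features` is an (n_frames, dim) array of frame descriptors
    (e.g. pool5 activations). Returns (i, j) frame pairs lying on
    low-distance off-diagonal stripes, i.e. candidate repetitions.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    n = len(f)
    # self-similarity matrix: all pairwise Euclidean distances
    d = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=2)
    # diagonal filter: responds to diagonal stripes of low distance
    kern = np.eye(size)
    half = size // 2
    smoothed = np.full_like(d, np.inf)
    for i in range(half, n - half):
        for j in range(half, n - half):
            patch = d[i - half:i + half + 1, j - half:j + half + 1]
            smoothed[i, j] = (patch * kern).sum() / size
    # keep low-distance responses away from the trivial main diagonal
    return [(i, j) for i, j in np.argwhere(smoothed < thresh)
            if abs(i - j) >= offset_min]
```

On a video where a pose recurs with period T, the self-similarity matrix shows dark stripes at offset T from the main diagonal, which the diagonal filter picks out while the threshold discards isolated low-distance outliers.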

Author: Texot
Reprint policy: Unless otherwise stated, all articles in this blog follow the CC BY 4.0 reprint policy. If reproduced, please indicate the source Texot!