Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.
Title: Self-supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos Ömer Sümer, Tobias Dencker, Björn Ommer
Heidelberg Collaboratory for Image Processing IWR, Heidelberg University, Germany
To avoid expensive labelling, exploit spatiotemporal relations in training videos for self-supervised learning of pose embeddings
Key Idea: combine temporal ordering and spatial placement estimation as auxiliary tasks for learning pose similarities
Avoid ambiguous and incorrect training labels: start training with the most reliable data samples => gradually increase the difficulty
Further refinement: mine repetitive poses in individual videos (provides reliable labels while removing inconsistencies)
Finding similar postures enables applications like action recognition or video content retrieval
Human joints as a surrogate for describing similarity
- measuring distances in pose space accurately and coming up with a non-ambiguous Euclidean embedding are challenging problems
- manually annotating human joints in larger datasets is expensive
A solution: unsupervised training
- switch to a related auxiliary task for which label information is available
Several well-known sources of weak supervision
- spatial configuration of natural scenes
- image colorization
- Learning spatiotemporal relations in videos by two auxiliary tasks:
- temporal ordering task - whether two given person images are temporally close (similar)
- spatial placement task - given randomly extracted crops from the spatial neighborhood of a person, learn whether each patch shows a person or not (see the sketch after this list)
- Learning spatial and temporal relations provides
- “what” we are looking at (person/not person)
- “how” the instances differ (similar/dissimilar poses)
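A minimal sketch (not the authors' code) of how training labels for the two auxiliary tasks could be generated. The window sizes, the (x, y, w, h) box format, and the helper names `temporal_ordering_pairs` / `spatial_placement_crops` are my assumptions:

```python
import random

def temporal_ordering_pairs(num_frames, close_window=5, far_gap=30):
    """Sample a 'temporally close' (label 1) and a 'temporally far' (label 0)
    frame-index pair from one person track. Assumes num_frames > far_gap."""
    t = random.randrange(num_frames - far_gap)            # anchor frame
    pos = (t, t + random.randint(1, close_window), 1)     # close pair -> similar pose
    neg = (t, random.randint(t + far_gap, num_frames - 1), 0)  # far pair
    return pos, neg

def spatial_placement_crops(person_box, image_size, num_neg=4):
    """Label the detected person crop as positive and random same-size crops
    from its spatial neighborhood as negatives (person / not person).
    Overlap filtering between negatives and the person box is omitted here."""
    img_w, img_h = image_size
    x, y, w, h = person_box
    crops = [((x, y, w, h), 1)]                           # the person itself
    for _ in range(num_neg):
        nx = random.randrange(max(1, img_w - w))
        ny = random.randrange(max(1, img_h - h))
        crops.append(((nx, ny, w, h), 0))                 # background candidate
    return crops
```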
Using curriculum-based learning and repetition mining
- arrange the training set: start with only the easy samples => iteratively extend to harder ones
- eliminating inactive video parts
X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
learned pairwise part relations combining CNN with graphical models
J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
exploited CNNs to model relationships between body parts with cascaded refinement
J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a “siamese” time delay neural network. In NIPS, 1994.
Similarity learning in human pose analysis
G. Mori, C. Pantofaru, N. Kothari, T. Leung, G. Toderici, A. Toshev, and W. Yang. Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302, 2015.
S. Kwak, M. Cho, and I. Laptev. Thin-slicing for pose: Learning to understand pose without explicit pose estimation. In CVPR, 2016.
body joint locations are used to create similar and dissimilar pairs of instances from annotated human pose datasets
Alternative sources of supervision
- image generation
- spatial or temporal cues
C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
takes image patches from a 3 × 3 grid and classifies the relative location of the 8 outer patches with respect to the center patch
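A hedged sketch of this context-prediction sampling; the original work also inserts gaps and jitter between patches, which this simplified version omits, and the patch size is a placeholder:

```python
import numpy as np

def context_prediction_sample(image, patch=96):
    """Return (center_patch, neighbor_patch, label) with label in {0, ..., 7}.
    Assumes the image is at least 3 * patch pixels in each dimension."""
    h, w = image.shape[:2]
    y0 = np.random.randint(h - 3 * patch + 1)   # top-left of the 3x3 grid
    x0 = np.random.randint(w - 3 * patch + 1)
    cells = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    label = np.random.randint(8)                # which of the 8 outer cells
    r, c = cells[label]
    center = image[y0 + patch:y0 + 2 * patch, x0 + patch:x0 + 2 * patch]
    neighbor = image[y0 + r * patch:y0 + (r + 1) * patch,
                     x0 + c * patch:x0 + (c + 1) * patch]
    return center, neighbor, label
```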
M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
proposed a localization problem given all 9 patches at once; they used 100 permutations as class labels, selected from the 9! possible orderings by a Hamming distance-based criterion
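A small sketch of such a Hamming-distance-based selection, implemented as greedy farthest-point sampling over all 9! permutations (the seed permutation and the exact greedy criterion are my choices, not necessarily the paper's):

```python
import itertools
import numpy as np

def select_permutations(num_classes=100, n=9):
    """Greedily pick permutations that are maximally Hamming-distant
    from the ones already chosen."""
    all_perms = np.array(list(itertools.permutations(range(n))), dtype=np.int8)
    chosen = [0]                                  # arbitrary seed permutation
    # Hamming distance of every permutation to its nearest chosen one so far
    min_d = (all_perms != all_perms[0]).sum(axis=1)
    for _ in range(num_classes - 1):
        i = int(min_d.argmax())                   # farthest from current set
        chosen.append(i)
        min_d = np.minimum(min_d, (all_perms != all_perms[i]).sum(axis=1))
    return all_perms[chosen]
```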
X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
exploited videos by detecting interesting regions with SURF keypoints and tracking them
I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
defined a temporal order verification task, which classifies whether a given 3-frame sequence is temporally ordered or not by altering the middle frame
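A minimal sketch of how such order-verification tuples could be sampled; the gap size and the negative-sampling scheme are simplified assumptions:

```python
import random

def order_verification_tuple(num_frames, gap=5):
    """Positive: (a, b, c) in temporal order; negative: middle frame replaced
    by a frame from outside [a, c]. Assumes num_frames > 2 * gap + 1."""
    a = random.randrange(num_frames - 2 * gap)
    b, c = a + gap, a + 2 * gap
    if random.random() < 0.5:
        return (a, b, c), 1                      # correctly ordered
    wrong = random.choice([f for f in range(num_frames) if f < a or f > c])
    return (a, wrong, c), 0                      # middle frame out of order
```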
To learn better representations, the authors argue that temporal cues asking whether given inputs come from temporally close windows are a more effective approach
Temporal proximity in prior work
S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161–163, 1992.
Local proximity in data (slow feature analysis, SFA)
R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In ICCV, 2015.
inspired by SFA, drew a connection between slowness and metric learning via temporal coherence
D. Jayaraman and K. Grauman. Slow and steady feature analysis: Higher order temporal coherence in video. In CVPR, 2016.
motivated by temporal smoothness in feature space, exploited higher order coherence, which they referred to as steadiness, in various tasks.
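For intuition, a toy version of slowness (first-order) and steadiness (second-order) penalties on frame features; this is illustrative only, not the exact losses of the cited papers:

```python
import torch

def coherence_losses(feats):
    """feats: (T, D) features of T consecutive frames."""
    first = feats[1:] - feats[:-1]                        # frame-to-frame change
    slowness = first.abs().mean()                         # features should move slowly
    steadiness = (first[1:] - first[:-1]).abs().mean()    # ...and steadily
    return slowness, steadiness
```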
However, slowness or steadiness copes poorly with the limited motion and repetitive nature of human actions
- learn auxiliary tasks in relatively small temporal windows that contain no more than a single cycle of an action
- curriculum learning and repetition mining refine and guide the self-supervised tasks to learn stronger temporal features
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
Insight: spatiotemporal relations in videos provide sufficient information for learning
Propose self-supervised pipeline that creates training data for two auxiliary tasks:
- temporal ordering
- spatial placement
Raw self-supervised output needs refinement:
- curriculum learning
- repetition mining
Two auxiliary tasks are trained in a Siamese CNN
Learned features are eventually used as pose embeddings
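A minimal PyTorch sketch of such a Siamese setup as I understand it: a shared trunk (the paper uses an AlexNet-like network; a toy conv stack here) with one head per auxiliary task. All layer sizes are placeholders:

```python
import torch
import torch.nn as nn

class PoseEmbeddingNet(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(                    # shared by both branches
            nn.Conv2d(3, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # temporal ordering head: are the two crops temporally close?
        self.temporal_head = nn.Linear(2 * feat_dim, 2)
        # spatial placement head: does a single crop show a person?
        self.spatial_head = nn.Linear(feat_dim, 2)

    def forward(self, crop_a, crop_b):
        fa, fb = self.trunk(crop_a), self.trunk(crop_b)        # weight sharing
        order_logits = self.temporal_head(torch.cat([fa, fb], dim=1))
        person_logits = self.spatial_head(fa)
        return order_logits, person_logits, fa                 # fa: pose embedding
```

After training on both tasks, `trunk` outputs serve as the pose embedding, e.g. `_, _, emb = net(crops, crops)`.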
Self-supervised Pose Embeddings: Temporal Ordering and Spatial Placement
Creating a Curriculum for Training
Difficulty measure: motion in videos - optical-flow based, foreground motion vs. background motion
The foreground/background flow ratio acts as a signal-to-noise ratio
Sort the training samples by difficulty and split them into discrete blocks. Train with increasing difficulty (decreasing flow ratio); see the sketch below
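A sketch of this curriculum construction under my assumptions: per-sample optical-flow magnitude maps and person-foreground masks are given, and the block count is arbitrary:

```python
import numpy as np

def build_curriculum(flow_mags, fg_masks, num_blocks=4):
    """flow_mags: per-sample (H, W) optical-flow magnitude maps;
    fg_masks: boolean person-foreground masks of the same shape."""
    ratios = np.array([m[fg].mean() / (m[~fg].mean() + 1e-8)
                       for m, fg in zip(flow_mags, fg_masks)])
    order = np.argsort(-ratios)                # highest ratio = easiest first
    return np.array_split(order, num_blocks)   # blocks of increasing difficulty
```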
Mining Repetitive Poses
Use the learned pose embeddings to detect repetitive poses in the training data
Obtain a self-similarity matrix per video by computing all pairwise distances between frames; the distance is the Euclidean norm between normalized pool5 features.
Convolve the self-similarity matrix with a 5×5 circulant filter and threshold the result to suppress outliers that are not aligned with the off-diagonals; see the sketch below.
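A sketch of this repetition-mining step; the similarity normalization, the diagonal kernel, and the threshold value are my choices, not the paper's exact values:

```python
import numpy as np
from scipy.signal import convolve2d

def mine_repetitions(feats, k=5, threshold=0.5):
    """feats: (T, D) per-frame features, e.g. pool5 activations."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)   # L2-normalize
    d = np.linalg.norm(f[:, None] - f[None, :], axis=2)        # (T, T) pairwise distances
    sim = 1.0 - d / d.max()                                    # similarity in [0, 1]
    kernel = np.eye(k) / k                 # diagonal (circulant-style) 5x5 filter
    smoothed = convolve2d(sim, kernel, mode="same")            # keep off-diagonal runs
    return smoothed > threshold            # True where repeated pose sequences persist
```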