Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Abstract

• Presents non-local operations as a generic family of building blocks for capturing long-range dependencies
• Inspired by the classical non-local means of [1]
• A non-local operation computes the response at a position as a weighted sum of the features at all positions

[1] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005.

Introduction

Capturing long-range dependencies

• Sequential data - recurrent operations
• Image data - large receptive fields formed by deep stacks of conv. op.

Limitations of repeating local op.

• computationally inefficient
• optimization difficulties
• makes multi-hop dependency modeling difficult, e.g., when messages need to be delivered back and forth between distant positions

The proposed non-local op. is a generalization of the classical non-local means op. in [1]. Intuitively, a non-local op. computes the response at a position as a weighted sum of the features at all positions in the input feature maps.

• directly captures long-range dependencies, instead of relying on progressive (local) propagation
• best results with only a few layers (e.g. 5)
• can be easily combined with other op. (e.g. conv.)

In video classification

• interactions between distant pixels in space and time
• better than 2D and 3D conv. networks

In obj. det./seg. and pose estimation
• on top of Mask R-CNN baseline
• increase accuracy at a small extra comp. cost

Non-local image processing

Later developed into BM3D (block-matching and 3D filtering)

Perform filtering on a group of similar, but non-local, patches

Used for image denoising, texture synthesis, super-resolution and inpainting

K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. Transactions on Image Processing (TIP), 2007.

Graphical models

Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF).

• A CRF can be exploited to post-process semantic segmentation predictions of a network
• The iterative mean-field inference of CRF can be turned into a recurrent network and trained

In contrast, this work is a simpler feedforward block, and is used for classification and detection (instead of segmentation as in the CRF methods)

This work and the CRF methods are both related to a more abstract model: graph neural networks.

Feedforward modeling for sequences

Recent trend of using feedforward networks for modeling sequences in speech and language

Long-term dependencies captured by deep 1-D conv.

amenable to parallelized implementation and can be more efficient than recurrent models

Self-attention

Computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space

can be viewed as a form of non-local mean

Interaction networks (IN)

IN were proposed recently for modeling physical systems

operating on graphs of objects involved in pairwise interactions

Variants

• Vertex Attention IN (VAIN): more efficient in the context of multi-agent predictive modeling
• Relation Networks: compute a function on the feature embeddings at all pairs of positions in the input

This work also processes all pairs, but it indicates that the non-locality of the model is the key to success, orthogonal to the ideas of attention/interaction/relation (e.g., a network can attend to a local region).

Video classification architectures

Feedforward models’ use of optical flow and trajectories also leverages long-range, non-local dependencies.

Non-local NN

Formulation

$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j) g(\mathbf{x}_j)$

i - the index of an output position
j - the index that enumerates all possible positions
x - input signal (image, sequence, video; often their features)
y - output signal (same size as x)
f - computes a scalar (representing the relationship) between positions i and j
g - unary function, computes a representation of the input signal at position j
C(x) - the factor by which the response is normalized
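The generic formulation above can be sketched in a few lines of numpy. This is an illustrative toy (function and variable names are my own, not from the paper), with the pairwise term `f` and unary term `g` passed in as plain callables:

```python
import numpy as np

def nonlocal_op(x, f, g):
    """Generic non-local operation:
    y_i = (1 / C(x)) * sum_j f(x_i, x_j) g(x_j), with C(x) = sum_j f(x_i, x_j)."""
    n = x.shape[0]                                   # number of positions
    gx = np.stack([g(x[j]) for j in range(n)])       # unary representations g(x_j)
    y = np.empty_like(gx)
    for i in range(n):
        w = np.array([f(x[i], x[j]) for j in range(n)])  # pairwise weights over all j
        y[i] = (w @ gx) / w.sum()                    # weighted sum, normalized by C(x)
    return y

# Toy example: Gaussian pairwise function f = exp(x_i^T x_j), identity g
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
y = nonlocal_op(x, f=lambda a, b: np.exp(a @ b), g=lambda v: v)
```

Because the Gaussian weights are positive and normalized, each output here is a convex combination of the inputs over all positions, which is exactly the "weighted sum of the features at all positions" intuition.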

Compare with conv.

• Conv. op. can be viewed as summing up the weighted input in a local neighborhood (i-1 <= j <= i+1 in a 1D case with kernel size 3)

Compare with recurrent

• Recurrent op. at time i typically considers only j = i or j = i-1

Compare with FC

• this method computes responses based on relationships between different locations, whereas FC uses learned weights.
• this method supports variable sizes and maintains the corresponding size in the output, while FC requires fixed-size input/output and loses positional correspondence

Instantiations

Non-local models are not sensitive to the choice of pairwise function, indicating that the generic non-local behavior is the main reason for the observed improvements.

Gaussian

$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\mathbf{x}_i^T \mathbf{x}_j}$

Embedded Gaussian

$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}$

where $\theta(\mathbf{x}_i) = W_\theta \mathbf{x}_i$ and $\phi(\mathbf{x}_j) = W_\phi \mathbf{x}_j$

can be viewed as a form of the self-attention model

$\mathbf{y} = softmax(\mathbf{x}^T W_{\theta}^{T} W_\phi \mathbf{x}) g(\mathbf{x})$
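Under this reading, the embedded Gaussian version is ordinary dot-product self-attention: the softmax over j plays the role of the normalization C(x). A minimal numpy sketch (the weight-matrix shapes are my assumptions for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def embedded_gaussian(x, W_theta, W_phi, W_g):
    """y = softmax(theta(x) phi(x)^T) g(x): the self-attention form of the
    embedded-Gaussian non-local operation."""
    theta, phi, gx = x @ W_theta, x @ W_phi, x @ W_g
    return softmax(theta @ phi.T) @ gx      # each softmax row sums to 1

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 8))              # 6 positions, 8 channels
W = [rng.standard_normal((8, 4)) for _ in range(3)]
y = embedded_gaussian(x, *W)
```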

To show that the attentional behavior (due to the softmax) is not essential, the following two alternatives are described

Dot product

$f(\mathbf{x}_i, \mathbf{x}_j) = \theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$

The main difference between this and the embedded Gaussian version is the presence of the softmax.

Concatenation

Concatenation is used by the pairwise function in Relation Networks for visual reasoning.

$f(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{ReLU}(\mathbf{w}_f^T \left[ \theta(\mathbf{x}_i), \phi(\mathbf{x}_j) \right])$
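A sketch of this concatenation-based pairwise term, where [.,.] denotes channel-wise concatenation and `w_f` is the projection vector from the formula (shapes chosen for illustration; for this variant the paper normalizes by C(x) = N, the number of positions):

```python
import numpy as np

def concat_pairwise(xi, xj, W_theta, W_phi, w_f):
    """f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)])."""
    cat = np.concatenate([W_theta @ xi, W_phi @ xj])  # concatenated embeddings
    return max(0.0, float(w_f @ cat))                  # ReLU of the projection

rng = np.random.default_rng(2)
xi, xj = rng.standard_normal(8), rng.standard_normal(8)
W_theta, W_phi = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
w_f = rng.standard_normal(8)
score = concat_pairwise(xi, xj, W_theta, W_phi, w_f)
```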

The above four variants demonstrate the flexibility of the generic non-local operation

Non-local Block

$\mathbf{z}_i = W_z\mathbf{y}_i + \mathbf{x}_i$

$+\mathbf{x}_i"$ denotes a residual connection, allowing non-local block inserted in to any pre-trained model

The pairwise computation of a non-local block is lightweight when it is used in high-level, sub-sampled feature maps.
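A sketch of the full block using the embedded Gaussian instantiation. Initializing `W_z` to zero makes the block an identity mapping when first inserted, which is what lets it drop into a pre-trained model without disturbing its behavior (shapes here are illustrative assumptions):

```python
import numpy as np

def nonlocal_block(x, W_theta, W_phi, W_g, W_z):
    """z_i = W_z y_i + x_i: non-local operation plus a residual connection."""
    theta, phi, gx = x @ W_theta, x @ W_phi, x @ W_g
    s = theta @ phi.T
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    attn = a / a.sum(axis=-1, keepdims=True)   # softmax over all positions j
    y = attn @ gx                              # non-local response
    return y @ W_z + x                         # project back and add the residual

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 8))
W_theta, W_phi, W_g = (rng.standard_normal((8, 4)) for _ in range(3))
# With W_z = 0 the block reduces to the identity: z == x
z = nonlocal_block(x, W_theta, W_phi, W_g, W_z=np.zeros((4, 8)))
```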

Implementation

Following the bottleneck design, the number of channels of $W_g$, $W_\theta$ and $W_\phi$ is half of that of $\mathbf{x}$.

Subsampling trick to further reduce computation

$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{\hat{x}})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{\hat{x}}_j) g(\mathbf{\hat{x}}_j)$

where $\mathbf{\hat{x}}$ is a subsampled version of $\mathbf{x}$
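A sketch of the trick: only $\hat{\mathbf{x}}$ (the "keys/values") is subsampled, so the output keeps the full number of positions. The paper does the subsampling with max pooling; plain striding is used below for brevity, and the shapes are illustrative:

```python
import numpy as np

def subsampled_nonlocal(x, W_theta, W_phi, W_g, stride=2):
    """Embedded-Gaussian non-local op with the pairwise sum taken over a
    subsampled x-hat; the output resolution is unchanged."""
    xh = x[::stride]                            # subsampled version of x
    theta = x @ W_theta                         # full resolution (queries)
    phi, gx = xh @ W_phi, xh @ W_g              # subsampled (keys, values)
    s = theta @ phi.T
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    attn = a / a.sum(axis=-1, keepdims=True)    # normalize over subsampled j
    return attn @ gx

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 6))                  # 8 positions, 6 channels
W_theta, W_phi, W_g = (rng.standard_normal((6, 3)) for _ in range(3))
y = subsampled_nonlocal(x, W_theta, W_phi, W_g)  # pairwise sum over 4 positions
```

The pairwise matrix shrinks from 8x8 to 8x4 here; with 2x2 spatial pooling on feature maps, the paper reports the pairwise computation drops to roughly a quarter.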

Video Classification Models

• baseline network
• extended into 3D ConvNets
• proposed non-local nets

2D ConvNet baseline (C2D)

To isolate the temporal effects of non-local nets vs. 3D ConvNets, a simple 2D baseline architecture is constructed. The only operations involving the temporal domain are the pooling layers.

Inflated 3D ConvNets (I3D)

• one variant inflates the 3x3 kernel to 3x3x3
• another inflates the 1x1 kernel to 3x1x1
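Inflation lets the 3D kernels be initialized from pre-trained 2D weights. A common recipe (following the I3D idea; a sketch, not the paper's exact code) is to repeat the 2D kernel t times along the temporal axis and divide by t, so the inflated convolution reproduces the 2D response on a clip of identical frames:

```python
import numpy as np

def inflate_kernel(k2d, t):
    """Inflate a (kh, kw) 2D kernel into a (t, kh, kw) 3D kernel,
    rescaled by 1/t so the pre-trained 2D response is preserved."""
    return np.repeat(k2d[None, :, :], t, axis=0) / t

k2d = np.arange(9.0).reshape(3, 3)
k3d = inflate_kernel(k2d, t=3)   # a 3x3 kernel inflated to 3x3x3
```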

Non-local network

Insert non-local blocks into C2D or I3D

Investigate adding 1, 5 or 10 non-local blocks
