Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

## Abstract

- Presents non-local operations as a generic family of building blocks for capturing long-range dependencies
- Inspired by the classical non-local means method [1]
- A non-local operation computes the response at a position as a weighted sum of the features at all positions

[1] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005.

## Introduction

Capturing long-range dependencies

- Sequential data - recurrent operations
- Image data - large receptive fields formed by deep stacks of conv. op.

Limitations of repeating local op.

- computationally inefficient
- optimization difficulties
- makes multi-hop dependency modeling difficult, e.g., when messages need to be delivered back and forth between distant positions

The proposed non-local op. is a generalization of the classical non-local means op. in [1] - intuitively, a non-local op. computes the response at a position as a weighted sum of the features at all positions in the input feature maps

Advantages

- directly captures long-range dependencies, instead of relying on progressive, local propagation
- best results with only a few layers (e.g. 5)
- can be easily combined with other op. (e.g. conv.)

In video classification

- interactions between distant pixels in space and time
- better than 2D and 3D conv. networks

Other tasks

- obj. det./seg. and pose estimation
- on top of Mask R-CNN baseline
- increase accuracy at a small extra comp. cost

## Related

### Non-local image processing

Non-local means [1] was later developed into **BM3D (Block-Matching and 3D filtering)**

Perform filtering on a group of similar, but non-local, patches

Used for image denoising, texture synthesis, super-resolution and inpainting

K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. Transactions on Image Processing (TIP), 2007.

### Graphical models

Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF).

- A CRF can be exploited to post-process semantic segmentation predictions of a network
- The iterative mean-field inference of CRF can be turned into a recurrent network and trained

In contrast, **this work** is a simpler feedforward block, and it targets classification and detection (instead of segmentation as in CRF methods)

Both **this work** and CRF methods are related to a more abstract model - **graph neural networks**.

### Feedforward modeling for sequences

Recent trend of using feedforward networks for modeling sequences in speech and language

Long-term dependencies are captured by the large receptive fields of deep 1-D conv.

Such models are amenable to parallelized implementations and can be more efficient than recurrent models

### Self-attention

Computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space

can be viewed as a form of non-local mean

### Interaction networks (IN)

IN were proposed recently for modeling physical systems

operating on graphs of objects involved in pairwise interactions

Variants

- **Vertex Attention IN (VAIN)**: more efficient in the context of multi-agent predictive modeling
- **Relation Networks**: computes a function on the feature embeddings at all pairs of positions in its input

**this work** also processes all pairs. But **this work** indicates that the *non-locality* of the model is the key to its success, orthogonal to the ideas of attention/interaction/relation (e.g., a network can attend to a local region)

### Video classification architectures

Feedforward models' use of optical flow and trajectories also leverages long-range, non-local dependencies.

## Non-local NN

### Formulation

$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j) g(\mathbf{x}_j)$

i - the index of an output position

j - the index that enum. all possible positions

x - input signal (image, seq., video; often their features)

y - output signal (same size as x)

f - computes a scalar (representing relationship) between i and j

g - unary function, computes representation of the input signal at j

C - normalization factor for the response
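The formulation above can be sketched in a few lines of NumPy. This is a toy sketch, not the paper's implementation: `f` and `g` are passed as plain Python callables, and positions are assumed to be flattened into the first axis of `x`.

```python
import numpy as np

def nonlocal_op(x, f, g):
    # Generic non-local operation over a flattened list of positions:
    #   y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j),  with C(x) = sum_j f(x_i, x_j)
    n = x.shape[0]
    gx = np.stack([g(x[j]) for j in range(n)])           # unary term g(x_j)
    y = np.zeros_like(gx, dtype=float)
    for i in range(n):
        w = np.array([f(x[i], x[j]) for j in range(n)])  # pairwise scalars f(x_i, x_j)
        y[i] = (w[:, None] * gx).sum(axis=0) / w.sum()   # normalize by C(x)
    return y
```

With the Gaussian choice `f(a, b) = exp(a . b)` and identity `g`, identical inputs produce uniform weights, so each output equals the mean of all positions.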

Compare with conv.

- Conv. op. can be viewed as summing up the weighted input in a local neighborhood (i-1 <= j <= i+1 in a 1D case with kernel size 3)

Compare with recurrent

- Recurrent op. at time i considers only j = i or j = i-1

Compare with FC

**this method** computes responses based on relationships between different locations, whereas FC uses learned weights. **this method** supports variable input sizes and maintains the corresponding size in the output, while FC requires fixed-size input/output and loses positional correspondence.

### Instantiations

Non-local models are not sensitive to kernel choices, indicating that generic non-local behavior is the main reason for the observed improvements.

#### Gaussian

$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\mathbf{x}_i^T \mathbf{x}_j}$

#### Embedded Gaussian

$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}$

where $\theta(\mathbf{x}_i) = W_\theta \mathbf{x}_i$ and $\phi(\mathbf{x}_j) = W_\phi \mathbf{x}_j$

can be viewed as a form of the self-attention model

$\mathbf{y} = \mathrm{softmax}(\mathbf{x}^T W_{\theta}^{T} W_\phi \mathbf{x}) \, g(\mathbf{x})$
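The embedded-Gaussian instantiation can be written as the matrix form above; a minimal NumPy sketch (all names hypothetical), where the row-wise softmax plays the role of the normalization $\mathcal{C}(\mathbf{x}) = \sum_j f(\mathbf{x}_i, \mathbf{x}_j)$:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def embedded_gaussian_nonlocal(x, W_theta, W_phi, W_g):
    # y = softmax(theta(x) phi(x)^T) g(x), softmax taken over positions j
    theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T
    return softmax(theta @ phi.T) @ g
```

When all positions carry the same feature, the attention weights become uniform and every output row equals the mean of `g(x)`.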

*To show attentional behavior (due to softmax) is not essential, describe the following 2 alternatives*

#### Dot product

$f(\mathbf{x}_i, \mathbf{x}_j) = \theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$

Main difference between this and **embedded Gaussian** is presence of softmax.

#### Concatenation

Concatenation is used by the pairwise function in Relation Networks for visual reasoning.

$f(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{ReLU}(\mathbf{w}_f^T \left[ \theta(\mathbf{x}_i), \phi(\mathbf{x}_j) \right])$
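A minimal sketch of this pairwise function for a single pair (names hypothetical). Note that since this `f` is not a normalized exponential, the paper normalizes by the number of positions N for this variant rather than by the sum of `f`, if I read it correctly:

```python
import numpy as np

def concat_pairwise(x_i, x_j, W_theta, W_phi, w_f):
    # f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)])
    pair = np.concatenate([W_theta @ x_i, W_phi @ x_j])  # [theta(x_i), phi(x_j)]
    return float(np.maximum(0.0, w_f @ pair))            # w_f projects to a scalar
```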

*The above 4 variants demonstrate the flexibility of the generic non-local operation*

### Non-local Block

$\mathbf{z}_i = W_z\mathbf{y}_i + \mathbf{x}_i$

$``+\mathbf{x}_i"$ denotes a residual connection, allowing a non-local block to be inserted into any pre-trained model

The pairwise computation of a non-local block is lightweight when it is used in high-level, sub-sampled feature maps.
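A minimal NumPy sketch of the whole block with the embedded-Gaussian `f` (names hypothetical; the real implementation uses 1x1(x1) convolutions). If `W_z` is initialized to zero, the block starts out as an identity mapping, which is what makes insertion into a pre-trained model safe:

```python
import numpy as np

def nonlocal_block(x, W_theta, W_phi, W_g, W_z):
    # Embedded-Gaussian non-local block with residual: z_i = W_z y_i + x_i
    theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T
    logits = theta @ phi.T
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    f = np.exp(logits)
    y = (f / f.sum(axis=1, keepdims=True)) @ g      # normalized pairwise sum
    return y @ W_z.T + x                            # "+ x_i": residual connection
```

Here `W_theta`, `W_phi`, `W_g` map the `c` input channels down to `c/2` (the bottleneck halving described below), and `W_z` maps back up to `c`.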

#### Implementation

Following the bottleneck design, the number of channels of $W_g$, $W_\theta$ and $W_\phi$ is half of that of $\mathbf{x}$.

Subsampling trick to further reduce computation

$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{\hat{x}})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{\hat{x}}_j) g(\mathbf{\hat{x}}_j)$

where $\mathbf{\hat{x}}$ is a subsampled version of $\mathbf{x}$
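A sketch of the subsampled version (hypothetical names; the paper subsamples by pooling, while plain striding is used here as a stand-in). Only the $j$ positions are subsampled, so the output keeps the full resolution of $\mathbf{x}$:

```python
import numpy as np

def subsampled_nonlocal(x, stride=2):
    # Gaussian non-local op where the summed-over positions j come from a
    # subsampled x_hat; queries i still cover every position of x
    x_hat = x[::stride]                              # subsampled version of x
    logits = x @ x_hat.T
    logits = logits - logits.max(axis=1, keepdims=True)
    f = np.exp(logits)                               # f(x_i, x_hat_j)
    return (f / f.sum(axis=1, keepdims=True)) @ x_hat  # g = identity here
```

This cuts the pairwise computation by the stride factor without changing the output size.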

## Video Classification Models

- baseline network
- extended into 3D ConvNets
- proposed non-local nets

### 2D ConvNet baseline (C2D)

To isolate the temporal effects of non-local nets vs. 3D ConvNets, construct a simple 2D baseline architecture. The only operations involving the temporal domain are the pooling layers.

### Inflated 3D ConvNets (I3D)

- inflate the 3x3 kernel to 3x3x3
- 1x1 kernel to 3x1x1
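The inflation of a pre-trained 2D kernel into a 3D one can be sketched as follows (a hypothetical NumPy sketch): repeat the kernel `t` times along a new temporal axis and rescale by `1/t`, so that a video made of a repeated static frame produces the same response as the original 2D network.

```python
import numpy as np

def inflate_kernel(w2d, t):
    # Repeat a 2D kernel (kh, kw) along time -> (t, kh, kw), rescaled by 1/t
    # so the temporal slices sum back to the original 2D kernel
    return np.repeat(w2d[np.newaxis, ...], t, axis=0) / t
```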

### Non-local network

Insert non-local blocks into C2D or I3D

Investigate adding 1, 5 or 10 non-local blocks