Paper Reading: Non-local Neural Networks

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.


  • Present non-local operations as a generic family of building blocks for capturing long-range dependencies
  • Inspired by [1]
  • A non-local operation computes the response at a position as a weighted sum of the features at all positions

[1] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005.


Capturing long-range dependencies

  • Sequential data - recurrent operations
  • Image data - large receptive fields formed by deep stacks of conv. op.

Limitations of repeating local op.

  • computationally inefficient
  • optimization difficulties
  • make multi-hop dependency modeling difficult, e.g., when messages need to be delivered back and forth between distant positions

The proposed non-local op. is a generalization of the classical non-local means op. in [1] - intuitively, the non-local op. computes the response at a position as a weighted sum of the features at all positions in the input feature maps


  • instead of progressive behavior, directly capture long range dependencies
  • best results with only a few layers (e.g. 5)
  • can be easily combined with other op. (e.g. conv.)

In video classification

  • interactions between distant pixels in space and time
  • better than 2D and 3D conv. networks

Other tasks

  • obj. det./seg. and pose estimation
  • on top of Mask R-CNN baseline
  • increase accuracy at a small extra comp. cost

Non-local image processing

Non-local means was later developed into BM3D (block-matching and 3D filtering)

Perform filtering on a group of similar, but non-local, patches

Used for image denoising, texture synthesis, super-resolution and inpainting

K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. Transactions on Image Processing (TIP), 2007.

Graphical models

Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF).

  • A CRF can be exploited to post-process semantic segmentation predictions of a network
  • The iterative mean-field inference of CRF can be turned into a recurrent network and trained

In contrast, this work is a simpler feedforward block and targets classification and detection (instead of segmentation as in the CRF methods)

Both this work and the CRF methods are related to a more abstract model: graph neural networks.

Feedforward modeling for sequences

Recent trend using feedforward for modeling sequence in speech and language

Long-term depend. captured by deep 1-D conv.

amenable to parallelized implementation and can be more efficient than recurrent models


Self-attention

Self-attention computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space.

It can be viewed as a form of non-local means.

Interaction networks (IN)

IN were proposed recently for modeling physical systems

operating on graphs of objects involved in pairwise interactions


  • Vertex Attention IN (VAIN) more efficient in the context of multi-agent predictive modeling
  • Relation Networks compute a function on the feature embeddings at all pairs of positions in their input

This work also processes all pairs, but it indicates that the non-locality of the model is the key to its success, orthogonal to the ideas of attention/interaction/relation (e.g., a network can attend to a local region)

Video classification architectures

Feedforward models’ use of optical flow and trajectories also leverages long-range, non-local dependencies.

Non-local NN


$$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j)$$

i - the index of an output position
j - the index that enumerates all possible positions
x - input signal (image, sequence, video; often their features)
y - output signal (same size as x)
f - computes a scalar (representing the relationship) between i and j
g - unary function, computes a representation of the input signal at position j
C(x) - normalization factor for the response
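As a concrete illustration of the definition above, here is a minimal NumPy sketch of the generic non-local operation on toy (N, d) features, using the Gaussian pairwise function and an identity g; shapes and weights are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def nonlocal_op(x, f, g):
    """Generic non-local op: y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j).

    x: (N, d) array of N positions with d-dim features (toy shape).
    f: pairwise function returning a scalar weight for (x_i, x_j).
    g: unary function returning a representation of x_j.
    Here C(x) = sum_j f(x_i, x_j), as used for the Gaussian variants.
    """
    N = x.shape[0]
    gx = np.stack([g(x[j]) for j in range(N)])           # (N, d')
    y = np.empty_like(gx)
    for i in range(N):
        w = np.array([f(x[i], x[j]) for j in range(N)])  # pairwise weights
        y[i] = (w @ gx) / w.sum()                        # normalized weighted sum
    return y

# Toy usage: Gaussian f = exp(x_i^T x_j), identity g
x = np.random.default_rng(0).standard_normal((6, 4))
y = nonlocal_op(x, f=lambda a, b: np.exp(a @ b), g=lambda v: v)
```

Note that every output position attends to all N input positions, which is exactly what distinguishes this op from a local conv.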

Compare with conv.

  • Conv. op. can be viewed as summing up the weighted input in a local neighborhood (i-1 <= j <= i+1 in a 1D case with kernel size 3)
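For contrast, a 1-D convolution with kernel size 3 can be written as exactly such a local weighted sum, where j only ranges over the immediate neighborhood of i (zero-padding at the borders is an assumption made here for simplicity):

```python
import numpy as np

def conv1d_k3(x, w):
    """1-D conv (kernel size 3) as a local weighted sum:
    y_i = sum over j in [i-1, i+1] of w * x_j, zero-padded at borders."""
    xp = np.pad(x, 1)                                   # zero-pad both ends
    return np.array([w @ xp[i:i + 3] for i in range(len(x))])

y = conv1d_k3(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 1.0, 1.0]))
# each y_i sums only the local neighborhood i-1..i+1
```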

Compare with recurrent

  • Recurrent op. at time i - often only j = i or j = i-1

Compare with FC

  • this method computes responses based on relationships between different locations, whereas FC uses learned weights.
  • this method supports variable sizes and maintains the corresponding size in the output, while FC requires fixed-size input/output and loses positional correspondence


non-local models are not sensitive to the choice of pairwise function, indicating that the generic non-local behavior is the main reason for the observed improvements.


Gaussian

$$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\mathbf{x}_i^T \mathbf{x}_j}$$

Embedded Gaussian

$$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}$$

where $\theta(\mathbf{x}_i) = W_\theta \mathbf{x}_i$ and $\phi(\mathbf{x}_j) = W_\phi \mathbf{x}_j$

This can be viewed in the form of the self-attention model:

$$\mathbf{y} = \mathrm{softmax}(\mathbf{x}^T W_\theta^T W_\phi \mathbf{x})\, g(\mathbf{x})$$
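The softmax form above can be sketched directly in NumPy; the projection matrices here are random stand-ins for the learned $W_\theta$, $W_\phi$, $W_g$, and the (N, d) flattening of the feature map is an assumption for brevity:

```python
import numpy as np

def embedded_gaussian_nonlocal(x, W_theta, W_phi, W_g):
    """Embedded-Gaussian non-local op written as softmax self-attention.

    x: (N, d) features; W_theta, W_phi, W_g: (d, d_k) projections.
    The softmax over each row implements exp(theta^T phi) / C(x).
    """
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g   # embeddings
    logits = theta @ phi.T                            # (N, N) pairwise scores
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
    return attn @ g                                   # weighted average of g(x_j)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W = [rng.standard_normal((16, 8)) * 0.1 for _ in range(3)]
y = embedded_gaussian_nonlocal(x, *W)                 # (8, 8) output
```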

To show that the attentional behavior (due to the softmax) is not essential, the paper describes the following two alternatives.

Dot product

$$f(\mathbf{x}_i, \mathbf{x}_j) = \theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$

Main difference between this and embedded Gaussian is presence of softmax.


Concatenation

Concatenation is used by the pairwise function in Relation Networks for visual reasoning.

$$f(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{ReLU}(\mathbf{w}_f^T \left[ \theta(\mathbf{x}_i), \phi(\mathbf{x}_j) \right])$$
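A toy sketch of this concatenation-based pairwise function, taking $\theta$ and $\phi$ as identity and using a random stand-in for the learned weight vector $\mathbf{w}_f$ (both simplifying assumptions):

```python
import numpy as np

def f_concat(xi, xj, w_f):
    """Concatenation variant: f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)]).
    theta and phi are identity here for brevity; w_f is a random stand-in."""
    return np.maximum(w_f @ np.concatenate([xi, xj]), 0.0)  # non-negative scalar

rng = np.random.default_rng(4)
xi, xj = rng.standard_normal(4), rng.standard_normal(4)
w_f = rng.standard_normal(8)        # length = dim(theta) + dim(phi)
s = f_concat(xi, xj, w_f)           # scalar pairwise weight
```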

The above four variants demonstrate the flexibility of the generic non-local operation.

Non-local Block

$$\mathbf{z}_i = W_z \mathbf{y}_i + \mathbf{x}_i$$

"$+\mathbf{x}_i$" denotes a residual connection, allowing a non-local block to be inserted into any pre-trained model

The pairwise computation of a non-local block is lightweight when it is used in high-level, sub-sampled feature maps.


Following the bottleneck design, the number of channels of $W_g$, $W_\theta$ and $W_\phi$ is half of that of $\mathbf{x}$.

Subsampling trick to further reduce computation

$$\mathbf{y}_i = \frac{1}{\mathcal{C}(\hat{\mathbf{x}})} \sum_{\forall j} f(\mathbf{x}_i, \hat{\mathbf{x}}_j)\, g(\hat{\mathbf{x}}_j)$$

where $\hat{\mathbf{x}}$ is a subsampled version of $\mathbf{x}$
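A toy sketch of this trick: every position still attends to the whole (subsampled) signal, but the pairwise matrix shrinks by the subsampling factor. Simple strided subsampling stands in for pooling here, and Gaussian f with identity g are simplifying assumptions:

```python
import numpy as np

def subsampled_nonlocal(x, stride=2):
    """Subsampling trick: attend from every position i to x_hat, a
    subsampled version of x (strided here; pooling works too).
    Gaussian f = exp(x_i^T x_hat_j), identity g, for brevity."""
    x_hat = x[::stride]                            # (N/stride, d)
    w = np.exp(x @ x_hat.T)                        # (N, N/stride) weights
    w /= w.sum(axis=1, keepdims=True)              # normalize by C(x_hat)
    return w @ x_hat                               # output keeps all N positions

x = np.random.default_rng(2).standard_normal((12, 4)) * 0.1
y = subsampled_nonlocal(x)                         # (12, 4), same as x
```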

Video Classification Models

  • baseline network
  • extended into 3D ConvNets
  • proposed non-local nets

2D ConvNet baseline (C2D)

To isolate the temporal effects of non-local nets vs. 3D ConvNets, a simple 2D baseline architecture is constructed. The only operations involving the temporal domain are the pooling layers.

Inflated 3D ConvNets (I3D)

  • inflate the 3x3 kernel to 3x3x3
  • 1x1 kernel to 3x1x1
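The inflation step can be sketched as follows: repeat the pre-trained 2D kernel t times along a new temporal axis and rescale by 1/t, so the inflated conv reproduces the 2D response on a temporally constant input (the (out_c, in_c, kH, kW) layout is an assumed convention):

```python
import numpy as np

def inflate_kernel(w2d, t):
    """I3D-style inflation: tile a 2D conv kernel t times along time and
    divide by t, preserving the response on temporally constant inputs.

    w2d: (out_c, in_c, kH, kW) -> returns (out_c, in_c, t, kH, kW)
    """
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)  # add temporal axis
    return w3d / t                                     # rescale so sums match

w2d = np.random.default_rng(3).standard_normal((8, 3, 3, 3))
w3d = inflate_kernel(w2d, t=3)      # a 3x3 kernel inflated to 3x3x3
```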

Non-local network

Insert non-local blocks into C2D or I3D

Investigate adding 1, 5 or 10 non-local blocks

Author: Texot
Reprint policy: all articles in this blog, unless otherwise stated, follow the CC BY 4.0 reprint policy. If reproduced, please indicate the source: Texot!