How Neural Networks Hallucinate Missing Pixels for Image Inpainting

Written by kraken | Published 2020/10/10
Tech Story Tags: deep-learning | artificial-intelligence | data-science | machine-learning | generative-adversarial-network | computer-vision | research | neural-networks

TLDR Image inpainting is the art of synthesizing alternative content for the reconstruction of missing or deteriorated parts of an image such that the modification is semantically correct and visually realistic. In this blog, we discuss how the phenomenon of hallucination in neural networks can be utilized to perform this task, focusing on the role of image priors. The main idea is that even a randomly-initialized convolutional neural network can be used as a handcrafted prior with high-quality performance in inverse problems such as image inpainting and denoising.

When a human sees an object, certain neurons in our brain’s visual cortex light up with activity, but when we take hallucinogenic drugs, these drugs overwhelm our serotonin receptors and lead to a distorted visual perception of colours and shapes. Similarly, deep neural networks, which are modelled on structures in our brain, store data in huge tables of numeric coefficients that defy direct human comprehension. But when these networks’ activations are overstimulated (virtual drugs), we get phenomena like neural dreams and neural hallucinations. Dreams are the mental conjectures produced by our brain when the perceptual apparatus shuts down, whereas hallucinations are produced when this perceptual apparatus becomes hyperactive. In this blog, we will discuss how this phenomenon of hallucination in neural networks can be utilized to perform the task of image inpainting.

Image and Video Inpainting

Image inpainting is the art of synthesizing alternative content for the reconstruction of missing or deteriorated parts of an image such that the modification is semantically correct and visually realistic. Image inpainting has received significant attention from the computer vision and image processing community over the past years and has led to key advances in research and applications. Traditionally, inpainting is achieved either with exemplar-based approaches, which reconstruct one plausible missing pixel/patch at a time while maintaining neighbourhood consistency, or with diffusion-based approaches, which propagate local semantic structures into the missing parts. However, irrespective of the employed method, the core challenge of image inpainting is to maintain a global semantic structure and generate realistic texture details for the unknown regions. The traditional approaches fail to achieve this global semantic structure and realistic texture when the missing regions are large or highly irregular. Therefore, a component that can provide a plausible hallucination for the missing pixels is needed to tackle such inpainting problems. To design these hallucinative components, researchers generally choose deep neural networks, which provide high-order models of natural images. There is a plethora of use cases that use image inpainting to retouch undesired regions, remove distracting objects, or complete occluded regions in images. It can also be extensively applied to tasks including video un-cropping, re-targeting, re-composition, rotation, and stitching.
Image Inpainting Application example-Source
Similar to image inpainting, video inpainting aims to fill in a given space-time region with newly synthesized content. It reconstructs the missing regions of a given video sequence with pixels that are both temporally and spatially coherent. Traditional video inpainting algorithms formulate the problem as a patch-based optimization task and follow the traditional image inpainting pipelines: they fill the missing regions by sampling spatial-temporal patches from the known regions and then solve a minimization problem. However, most video inpainting algorithms face an obvious challenge due to the complex motion of objects and cameras. These challenges mostly stem from the assumption of a smooth and homogeneous optical motion field in the unknown region. As with image inpainting, a plausible hallucination of the motion field in the missing regions helps to tackle these challenges and generate seamless content for the video sequence, making the alteration almost imperceptible. Video inpainting is mostly used for video restoration (removing scratches), special effects and editing workflows (removing unwanted objects, watermarks, and logos), and video stabilization.
Video Inpainting Application Example-Source

Hallucinating without any Prior Learning

The excellent performance of convolutional neural networks is generally attributed to their ability to learn realistic image priors from a huge amount of data. In case you are wondering what “image prior” means, it is the “prior information” about our image dataset that is used to ease the choice of processing parameters and resolve indeterminacies in image processing, like the vector representations a CNN learns after training. On the contrary, however, research such as [1] shows that even the structure of a generative CNN is capable of capturing a lot of low-level image statistics prior to any data-intensive learning. The main idea is that even a randomly-initialized convolutional neural network can be used as a handcrafted prior with high-quality performance in standard inverse problems such as image inpainting and denoising. This idea not only highlights the inductive bias captured by generator networks but also bridges the gap between deep learning CNNs and learning-free algorithms based on handcrafted image priors. In this section, we will focus on how this technique of attaining image priors can be used to hallucinate unseen pixels in an image.
Image Inpainting using deep image prior-Source
It is a consensus that the structure of a CNN plays a key role in the performance of the network and that the network structure must resonate with the structure of the data. At the same time, we cannot expect an untrained network F(θ) to know about the specific appearance details of certain object categories. However, as suggested in [1], even a sequence of untrained convolutional filters has the ability to capture multi-scale low-level image statistics between pixel neighbourhoods, thanks to their locality and translation invariance. These statistics are sufficient to model the conditional image distribution p(x_filled | x_missing) required in the image inpainting problem. During formulation, this distribution is handled in a more generic manner: the problem is stated as an energy minimization (in our case, the minimization of a loss function such as MSE). We assume that the ground truth belongs to a manifold of points x that have null energy, E(x, x_in) = 0.
x* = argmin over x of [ E ( x_prior_estimated, x_input_with_missing_pixels ) + R(x) ] | where E can be a loss function like MSE, x_prior_estimated is the output of the randomly initialized network, x_input_with_missing_pixels is the input that needs to be inpainted, and the explicit regularizer R(x) can be omitted during solving, since the implicit prior captured by the network parametrization takes its place.
Image-space visualization during restoration with priors — source | DIP stands for Deep Image Prior. The visualisations use the tasks of super-resolution (left) and denoising (right) to illustrate how the manifold of points in image space is traversed during optimisation through the network parametrization.
To start the hallucination, we first need an image with missing or occluded pixels, indicated by a binary mask (M). Now, if a randomly initialized CNN estimates the missing region, we can calculate the loss as:
Loss = ‖ (x_prior_estimated − x_input_with_missing_pixels) ⦿ M ‖² | where ⦿ is element-wise multiplication
The above-mentioned equation is independent of the actual values of the missing pixels, which makes it impossible to optimize directly over pixel values. Therefore, x_prior_estimated is obtained by optimizing the network parameters w.r.t. this reparametrization. The produced hallucination leads to almost perfect results in many cases, with virtually no seams or artefacts. This approach does have certain drawbacks, such as inpainting large holes or highly semantic missing regions. But these drawbacks are acceptable given that this method is not trained on any supervised data, and it works surprisingly well in most other situations.
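To make this concrete, here is a minimal PyTorch-style sketch of the idea rather than the exact code from [1] or the linked notebook; build_hourglass() is a hypothetical stand-in for any randomly initialized encoder-decoder generator, and the loss is the masked MSE written above.

import torch
import torch.nn as nn

def build_hourglass() -> nn.Module:
    # Placeholder for a randomly initialized encoder-decoder generator.
    return nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
    )

def dip_inpaint(x_in, mask, steps=3000, lr=0.01):
    # x_in: (1, 3, H, W) image with holes; mask: (1, 1, H, W) with 1 = known pixel.
    net = build_hourglass()                                 # random weights act as the prior
    z = torch.randn(1, 32, x_in.shape[2], x_in.shape[3])    # fixed noise input, never optimized
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_hat = net(z)
        loss = ((x_hat - x_in) * mask).pow(2).mean()        # loss computed only on the known pixels
        loss.backward()
        opt.step()
    return net(z).detach()                                  # holes are filled by the hallucination

The key point is that only the network weights are optimized while the noise input z stays fixed, so the structure of the network itself supplies the prior.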
Image Inpainting using deep image prior-Source
The achieved hallucinations highlight that:
1. The network utilizes the global and local context of the image and interpolates the missing region with textures from the known parts.
2. There is a relationship between traditional self-similarity priors and deep learning architectures, which also suggests and explains the benefits of deep architectures with skip connections for general recognition tasks.
Run the colab notebook linked below if you want to try out the network on image inpainting (courtesy: DmitryUlyanov).

Hallucinating after Learning on Images

In this section, we will discuss some relevant network architectural components that help deep neural networks hallucinate. Before we discuss the components, it is important to look at the human behaviour that inspires their architecture and helps them achieve better hallucinations for the image restoration task. The basic process mainly involves two steps, conceptualization and painting, which maintain the global structural consistency and the local pixel continuity of the image. During the painting process, humans generally draw new lines from the end nodes of the previously drawn lines to ensure neighbouring pixel continuity and consistency. Keeping this in mind, we will discuss various components suggested in multiple research papers that aim to fulfil a similar purpose.
Edge Hallucination before we fill in the finer details and propagate colour.
Recent image inpainting studies have shown good-quality results by utilizing contextual information, mainly through two types of methods. The first family of methods uses spatial attention, which takes the neighbouring pixel features as a reference to restore the unknown pixels, thus ensuring the semantic consistency of the hallucinated content w.r.t. the global context. The second family of methods conditions the prediction of the missing pixels on the values of the valid pixels. Nevertheless, both types of methods sometimes fail to generate semantically flawless content and artefact-free boundaries. But if we utilize architectures like coherent semantic attention modules and gated convolutions, as suggested in papers [2] and [3], we can overcome most of these challenges. We will also look into the concept of periodic activation functions, like SIREN, for generating better implicit neural representations and thus better neural hallucinations.

Coherent semantic attention

Illustration of the CSA layer showing how each neural patch in the hole “m” searches for the most similar neural patch on the boundary-Source
Inspired by the human methodology of conceptualization and painting, the authors of [2] introduced the coherent semantic attention (CSA) layer. The CSA layer initializes each missing pixel value with the most similar feature pixel in the known region. Then these initialized pixels are iteratively optimized using the adjacent pixels’ values, assuming spatial consistency. The advantages of this process are two-fold: the first benefit is the global semantic consistency introduced by the initialization, and the second is the local feature coherency ensured by the optimization iterations. The original network first computes a rough prediction (I_p) using a simple autoencoder (so that similarity can be computed for the initialization process) and then feeds the rough prediction (I_p) and the input image (I_in) to a CSA-equipped encoder layer for refinement. The refinement network performs the two steps mentioned above (initialization and iterative optimization) on the input (I_p + I_in) to output the final result (I_r).
Iterative Optimization in simple terms
pred_pixel = A + B
A = similarity(pred_pixel, adjacent_pixel) × adjacent_pixel
B = similarity(pred_pixel, most_similar_pixel) × most_similar_pixel
where pred_pixel is the pixel to be hallucinated, adjacent_pixel is the adjacent (previously generated) pixel, and most_similar_pixel is the known pixel with the most similar feature. Also, the similarities are normalized before weighting.
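To make the weighting explicit, here is a rough NumPy sketch of this per-pixel update; the feature vectors, the cosine similarity measure, and the normalization are illustrative simplifications rather than the exact CSA layer from [2].

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def csa_update(pred_feat, adjacent_feat, most_similar_feat):
    # One iterative-optimization step for a single feature vector inside the hole.
    a = cosine_similarity(pred_feat, adjacent_feat)        # local coherency weight
    b = cosine_similarity(pred_feat, most_similar_feat)    # global consistency weight
    total = a + b + 1e-8
    a, b = a / total, b / total                            # normalize the two similarities
    return a * adjacent_feat + b * most_similar_feat

Repeating this update patch by patch, always starting from neighbours that have already been filled, mirrors the way humans extend new strokes from previously painted regions.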
The architecture of the model utilizing the CSA layer-Source

Gated Convolution

When we use vanilla CNNs, the convolutional filters apply the same operation to all pixels, irrespective of whether they are spatially located in the known or the unknown region. This drawback of vanilla CNNs leads to blurry outputs with visual artefacts in colour and around edges. To handle such problems, concepts like partial convolution and gated convolution have been suggested in recent studies [4, 3]. The main idea behind gated convolution is to learn a dynamic feature-gating mechanism for every spatial location and image channel. The gating values are simply a soft mask, automatically calculated from the data/features and multiplied back onto the features to regulate the values at certain spatial and channel indices. For example, if the input feature is I_in and the learnable convolution weight matrix for the gating mechanism is W_g, then the soft mask is calculated as:
gating_values = sigmoid ( ∑ ∑ W_g · I_in )
Illustration of (left) partial convolution and (right) gated convolution.-Source
The calculated soft mask is then multiplied back to the original feature. It is important to note that multiplying the mask before or after convolution is equivalent when convolutions are stacked layer-by-layer in the CNN. Gated convolution has two significant advantages:
1. It makes the hallucinative components more robust to holes of arbitrary shape.
2. It enables the network to learn to select features not only according to the mask and background but also according to the semantic segmentation information present in some channels.
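As a rough illustration, here is a minimal PyTorch-style gated convolution layer following the idea above; the layer sizes and the ELU activation on the feature branch are illustrative choices rather than something fixed by [3].

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # plays the role of W_g

    def forward(self, x):
        gating_values = torch.sigmoid(self.gate(x))    # soft mask in [0, 1] per location and channel
        features = F.elu(self.feature(x))
        return features * gating_values                # the gate regulates each spatial/channel index

# Example: gating applied to an RGB image concatenated with its binary hole mask.
x = torch.randn(1, 4, 256, 256)
out = GatedConv2d(4, 64)(x)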

SIREN — Sinusoidal Representation Networks

The task of image inpainting involves modelling the fine-grained details of image signals. But most of the commonly used methods fail to learn robust implicit neural representations of an image’s spatial derivatives, which may or may not be important during the generative process (depending on the difficulty of the task). To tackle this rarely considered issue, [5] proposes to leverage periodic activations for robustly modelling complex implicit neural representations. Unlike the traditional approach of using discrete representations for modelling the different types of signals in images, SIREN uses the sine as a periodic activation function. As the derivative of the sine is a cosine, the derivatives of a SIREN inherit the properties of SIRENs, which enables the supervision of any derivative of a SIREN with complicated signals. The authors of SIREN demonstrated its capability on image inpainting by fitting a 5-layer MLP SIREN to an image input and enforcing a prior on the representation. The results of the method can be seen in the figure below. SIREN therefore deserves a worthy mention among these components, and it surely opens several exciting avenues for future work on many types of inverse problems.
Image Inpainting results from a SIREN model, (left) Ground Truth and (right) Inpainted Image-Source
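For reference, here is a small PyTorch-style sketch of a SIREN-like MLP with sine activations; the 5-layer depth matches the experiment mentioned above, while the w0 = 30 frequency scaling and the uniform initialization bounds are simplified from the SIREN paper.

import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_f, out_f, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_f, out_f)
        with torch.no_grad():
            bound = 1.0 / in_f if is_first else math.sqrt(6.0 / in_f) / w0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

# Map 2D pixel coordinates to RGB values; fitting this network to the known pixels of a
# masked image and then querying all coordinates yields the inpainted result.
siren = nn.Sequential(
    SineLayer(2, 256, is_first=True),
    SineLayer(256, 256), SineLayer(256, 256), SineLayer(256, 256),
    nn.Linear(256, 3),
)
rgb = siren(torch.rand(1024, 2))   # 1024 sampled (x, y) coordinates in [0, 1]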

Hallucinating after Learning on Videos

Optical flow hallucination: (a) optical flow estimation on the input sequence, where the missing regions are the white squares; (b) hallucinated flow edges in red; (c) edge-guided flow completion -Source
Unlike image inpainting, video inpainting focuses on filling the space-time regions of a given video sequence with generated content. To synthesize this content, most traditional approaches used patch-based synthesis. But since the rise of learning-based methods, some of the most successful approaches have been flow-based: they jointly synthesize optical flow and colour to enable high-resolution outputs. The synthesized colour is generally propagated into the missing spatial-temporal regions along the flow trajectories, which ensures temporal coherence and also alleviates memory problems. In this section, we will discuss the flow-based approach suggested in paper [6].
Algorithm overview of flow guided video completion-Source
The key component of flow-based video inpainting approaches is the accurate and sharp synthesis of the optical flow fields’ edges for the objects in motion. The method proposed in [6], “Flow-edge Guided Video Completion”, aims specifically at accurate flow completion. To achieve this, the network’s first stage computes forward and backward flow between adjacent and non-adjacent frames of the sequence, and the flow in the missing pixel regions is then completed from it. To complete the flow, they first use a Canny edge detector to extract the edges of the known region and then use EdgeConnect to train a flow-edge completion network. This stage of the network is the major hallucinative component of the architecture, whose job is to hallucinate the flow edges in the missing region. The hallucinated edges of the flow maps are typically the most salient features and serve as the key input for producing a piecewise-smooth flow completion. Once the hallucination of the optical flow is complete, the network follows the backward and forward flow trajectories to propagate two candidate pixels for each missing pixel. The network also obtains three non-local flow vectors by checking three temporally distant frames of the sequence. Finally, the candidate values for each missing pixel are fused in the gradient domain using a confidence-weighted average. Fusing in the gradient domain ensures the removal of visual artefacts and visible colour seams.
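As a heavily simplified sketch of the final fusion step only, the snippet below averages a few candidate colours per missing pixel with their confidences; the actual method in [6] performs this fusion on gradients rather than colours and then reconstructs the frame from the fused gradients, which is omitted here.

import numpy as np

def fuse_candidates(candidates, confidences):
    # candidates: (K, H, W, 3) candidate colours per missing pixel;
    # confidences: (K, H, W) non-negative weights for each candidate.
    weights = confidences / (confidences.sum(axis=0, keepdims=True) + 1e-8)
    return (candidates * weights[..., None]).sum(axis=0)   # confidence-weighted average

# Example with a forward, a backward, and one non-local candidate per pixel.
cands = np.random.rand(3, 64, 64, 3)
confs = np.random.rand(3, 64, 64)
filled = fuse_candidates(cands, confs)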

Conclusion

In this blog, we focused on how hallucination in neural networks is utilized to perform the task of image inpainting. We discussed three major scenarios, covering the concepts of hallucinating pixels without any prior learning, after learning on images, and after learning on videos. Each of the discussed cases holds deep meaning and reflects the rich history of research in image/video inpainting in its own way. Nevertheless, we emphasized how all of these methods share the common goal of hallucinating unseen pixels and how each tackles the inverse problem of image/video inpainting in its respective way. The variety of applications where neural hallucinations can be applied is vast and limited only by the ingenuity of its designers.
Application of Image Inpainting in editing images-Source
My blogs are a reflection of what I worked on and simply convey my understanding of these topics. My interpretation of deep learning can be different from that of yours, but my interpretation can only be as inerrant as I am.

References

[1] Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2020). Deep Image Prior. International Journal of Computer Vision, 128, 1867–1888.
[2] Liu, H., Jiang, B., Xiao, Y., & Yang, C. (2019). Coherent Semantic Attention for Image Inpainting. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 4169–4178.
[3] Yu, J., Lin, Z.L., Yang, J., Shen, X., Lu, X., & Huang, T. (2019). Free-Form Image Inpainting With Gated Convolution. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 4470–4479.
[4] Liu, G., Reda, F.A., Shih, K.J., Wang, T., Tao, A., & Catanzaro, B. (2018). Image Inpainting for Irregular Holes Using Partial Convolutions. Proceedings of the European Conference on Computer Vision (ECCV).
[5] Sitzmann, V., Martel, J.N., Bergman, A., Lindell, D.B., & Wetzstein, G. (2020). Implicit Neural Representations with Periodic Activation Functions. ArXiv, abs/2006.09661.
[6] Gao, C., Saraf, A., Huang, J., & Kopf, J. (2020). Flow-edge Guided Video Completion. ArXiv, abs/2009.01835.

Written by kraken | Data Scientist and Visual Computing Researcher
Published by HackerNoon on 2020/10/10