ICCV 2019: Papers that indicate the future of computer vision (Satellites to 3D reconstruction)

Written by himanshu_ragtah | Published 2019/10/08
Tech Story Tags: computer-vision | future-of-computer-vision | 3d-reconstruction | deep-landmark-regression | pose-refinement | face-swapping | latest-tech-stories | hackernoon-top-story

TLDR: If you couldn’t make it to ICCV 2019 due to visa issues, no worries. Below is a list of the top papers everyone is talking about!

If you couldn’t make it to ICCV 2019 due to visa issues, no worries. Below is a list of top papers everyone is talking about!
------
Satellite Pose Estimation with Deep Landmark Regression and Nonlinear Pose Refinement
TLDR: The authors propose an approach to estimate the 6DOF pose of a satellite from a single image.
Model/Architecture used: Their approach combines machine learning and geometric optimisation: a network predicts the coordinates of a set of landmarks in the input image, the landmarks are associated with their corresponding 3D points on an a priori reconstructed 3D model, and the object pose is then solved for using non-linear optimisation. The approach is not only novel for this specific pose estimation task, helping open up a relatively new domain for machine learning and computer vision, but it also demonstrates superior accuracy, winning first place in the recent Kelvins Pose Estimation Challenge organised by the European Space Agency (ESA).
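As a rough illustration of the landmarks-to-pose step described above (not the authors' code), the sketch below uses OpenCV's PnP routines: an initial EPnP solve followed by Levenberg-Marquardt refinement stands in for the paper's non-linear optimisation. The function name and the pre-computed landmark/model correspondences are assumptions for illustration.

```python
# Hedged sketch of the landmark-to-pose step (assumed inputs, not the authors' code).
import cv2
import numpy as np

def estimate_pose(landmarks_2d, model_points_3d, camera_matrix):
    """landmarks_2d: (N, 2) predicted image coordinates of the landmarks.
    model_points_3d: (N, 3) corresponding points on the reconstructed 3D model.
    camera_matrix: (3, 3) intrinsic matrix."""
    # Initial pose from an EPnP solve, then non-linear (Levenberg-Marquardt) refinement.
    ok, rvec, tvec = cv2.solvePnP(
        model_points_3d.astype(np.float32),
        landmarks_2d.astype(np.float32),
        camera_matrix, None, flags=cv2.SOLVEPNP_EPNP)
    rvec, tvec = cv2.solvePnPRefineLM(
        model_points_3d.astype(np.float32),
        landmarks_2d.astype(np.float32),
        camera_matrix, None, rvec, tvec)
    return rvec, tvec  # 6DOF pose: rotation (Rodrigues vector) + translation
```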
Model accuracy: Visual inspection (see the figure in the paper) indicates high accuracy of their approach even on images where the satellite appears very small.
Datasets used: Spacecraft PosE Estimation Dataset (SPEED), which consists of high fidelity grayscale images of the Tango satellite. There are 12,000 training images with ground truth 6DOF poses (position and orientation) and 2,998 testing images without ground truth.

----------------------------
FSGAN: Subject Agnostic Face Swapping and Reenactment
TLDR: State-of-the-art face-swapping tech from ICCV 2019. FSGAN is subject agnostic and can be applied to pairs of faces without requiring training on those faces.

Model/Architecture used:

Overview of the proposed FSGAN approach. (a) The recurrent reenactment generator Gr and the segmentation generator Gs. Gr estimates the reenacted face Fr and its segmentation Sr, while Gs estimates the face and hair segmentation mask St of the target image It. (b) The inpainting generator Gc inpaints the missing parts of F̃r based on St to estimate the complete reenacted face Fc. (c) The blending generator Gb blends Fc and Ft using the segmentation mask St.
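The caption above describes a four-stage pipeline. The minimal sketch below shows how such generators might compose at inference time; the function and the pretrained generator callables (G_r, G_s, G_c, G_b) are assumptions for illustration, not the released FSGAN code.

```python
# Minimal sketch of how the four FSGAN generators compose at inference time,
# based on the caption above. G_r, G_s, G_c, G_b are assumed pretrained callables.
def face_swap(source_img, target_img, G_r, G_s, G_c, G_b):
    # (a) Reenact the source face to the target's pose/expression and segment the target.
    F_r, S_r = G_r(source_img, target_img)    # reenacted face + its segmentation
    S_t = G_s(target_img)                     # target face/hair segmentation mask
    # (b) Inpaint the parts of the reenacted face that are missing under the target mask.
    F_c = G_c(F_r, S_t)
    # (c) Blend the completed reenacted face into the target frame using the mask.
    return G_b(F_c, target_img, S_t)
```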

Model accuracy: FSGAN retains pose and expression much better than its baselines.
Datasets used:
-They use the video sequences of the IJB-C dataset [30] to train the reenactment generator Gr.
-They train VGG-19 CNNs for the perceptual loss on the VGGFace2 dataset and on the CelebA dataset for face attribute classification (VGGFace2 contains 3.3M images depicting 9,131 identities, whereas CelebA contains 202,599 images annotated with 40 binary attributes).
-They use the LFW Parts Labels set [19], with ∼3k images labeled for face and hair segmentation, removing the neck regions using facial landmarks.
-They conduct all experiments on videos from FaceForensics++ (FaceForensics++ provides 1,000 videos, from which they generated 1,000 synthetic videos on random pairs using DeepFakes and Face2Face).
----------------------------
Predicting 3D Human Dynamics from Video

TLDR: Given a video of a person in action, a neural autoregressive model predicts the person's 3D future motion.
Model/Architecture used: Predicting Human Dynamics (PHD), a neural autoregressive framework that takes past video frames as input and predicts the motion of a 3D human body model. PHD takes in a video sequence of a person and predicts the future 3D human motion; the paper visualises these predictions from two different viewpoints.
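To make the autoregressive idea concrete, here is a minimal sketch of rolling out future motion in a latent space, assuming an encoder, a temporal predictor, and a 3D body decoder as generic callables; it is not PHD's actual implementation.

```python
# Sketch of autoregressive rollout in a latent space (assumed interfaces, not PHD's code).
import torch

def rollout_future(past_frames, encoder, predictor, body_decoder, n_future=25):
    """past_frames: (T, C, H, W) past video frames; the three networks are assumed callables."""
    latents = [encoder(f.unsqueeze(0)) for f in past_frames]   # (1, D) latent per past frame
    future_bodies = []
    for _ in range(n_future):
        history = torch.stack(latents, dim=1)       # (1, T+k, D) latent history so far
        z_next = predictor(history)                 # predict the next latent state
        latents.append(z_next)                      # feed the prediction back in (autoregression)
        future_bodies.append(body_decoder(z_next))  # decode to 3D body (e.g. SMPL) parameters
    return future_bodies
```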
Model accuracy: The frame corresponding to the timestep at which accuracy improves the most effectively captures the “suggestive” moment in an action.
The method using autoregressive predictions in the latent space significantly outperforms the baselines.
Datasets used: the Human3.6M dataset and the in-the-wild Penn Action dataset
----------------------------

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization
TLDR: PIFu is an end-to-end deep learning method that can reconstruct a 3D model of a person wearing clothes from a single image.
Model/Architecture used: Given an input image, PIFu predicts the continuous inside/outside probability field of a clothed human, from which an iso-surface can be easily extracted for surface reconstruction.
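As a rough sketch of what a pixel-aligned implicit function query could look like (assuming a generic image encoder output, an MLP classifier, and a projection function; not the official PIFu code): project each 3D query point into the image, sample the feature map at that pixel, and let the MLP predict the inside/outside probability.

```python
# Hedged sketch of a pixel-aligned implicit function query (assumed components).
import torch
import torch.nn.functional as F

def query_occupancy(points, image_features, mlp, project):
    """points: (B, N, 3) 3D query points; image_features: (B, C, H, W) encoder output.
    `project` maps 3D points to normalised pixel coords in [-1, 1] plus depth."""
    xy, z = project(points)                               # (B, N, 2), (B, N)
    grid = xy.unsqueeze(2)                                 # (B, N, 1, 2) sampling grid
    feat = F.grid_sample(image_features, grid,
                         align_corners=True)               # (B, C, N, 1) pixel-aligned features
    feat = feat.squeeze(-1).permute(0, 2, 1)               # (B, N, C)
    feat = torch.cat([feat, z.unsqueeze(-1)], dim=-1)      # append depth to each feature
    return torch.sigmoid(mlp(feat))                        # (B, N, 1) inside/outside probability
```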
Model accuracy
- Comparison with other human digitization methods from a single image (PIFu has higher fidelity)
- Comparison with SiCloPe on texture inference (PIFu is better; it predicts textures on the surface mesh directly, removing projection artifacts)
- Comparison with learning-based multi-view methods (PIFu outperforms other learning-based multi-view methods qualitatively and quantitatively)
Datasets used:
-They evaluate PIFu on a variety of datasets, including RenderPeople and BUFF, which have ground-truth measurements, as well as DeepFashion, which contains a diverse variety of complex clothing.
-All approaches that PIFu is compared against are trained on the same High-Fidelity Clothed Human Dataset using three-view input images. (The High-Fidelity Clothed Human Dataset consists of photogrammetry data of 491 high-quality textured human meshes from RenderPeople, with a wide range of clothing, shapes, and poses, each mesh consisting of about 100,000 triangles.)
-Preliminary experiments are done on the ShapeNet dataset
----------------------------

Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud
TLDR: State-of-the-art performance on both bird's-eye-view and 3D object detection.
To handle the large amount of noise in the pseudo-LiDAR, they propose two innovations: 
(1) use a 2D-3D bounding box consistency constraint, adjusting the predicted 3D bounding box to have a high overlap with its corresponding 2D proposal after projecting onto the image; 
(2) use the instance mask instead of the bounding box as the representation of 2D proposals, in order to reduce the number of points not belonging to the object in the point cloud frustum
Model/Architecture used:
(a) Every pixel of the input image is lifted to 3D coordinates given the estimated depth, generating the pseudo-LiDAR point cloud; (b) instance mask proposals are detected for extracting point cloud frustums; (c) a 3D bounding box (blue) is estimated for each point cloud frustum and made consistent with the corresponding 2D proposal. Inputs and losses are shown in red and orange.
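Step (a), lifting pixels to a pseudo-LiDAR point cloud, comes down to back-projecting each pixel through the pinhole camera model using the estimated depth. A minimal sketch, assuming a predicted depth map and known intrinsics (not the authors' released code):

```python
# Lifting a dense depth map to a pseudo-LiDAR point cloud (assumed inputs).
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """depth: (H, W) estimated depth map in metres; fx, fy, cx, cy: camera intrinsics.
    Returns an (H*W, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid
    z = depth
    x = (u - cx) * z / fx                            # back-project with the pinhole model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```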
Model accuracy
For 3D detection in the moderate class with an IoU of 0.7, they raise accuracy by up to 15.3% AP, nearly quadrupling the performance of the prior art (from 5.7% to 21.0%). They also achieve an improvement of up to 6.0% AP (from 42.3% to 48.3%) over the best concurrent work (its monocular variant) in the moderate class with an IoU of 0.5.
Datasets used:
-They evaluate on the KITTI bird's-eye-view and 3D object detection benchmark, containing 7,481 training and 7,518 testing images as well as the corresponding LiDAR point clouds, stereo images, and full camera matrices. They use the same training and validation split as prior work. Notably, during training and testing, their approach does not use any LiDAR point cloud or stereo image data.
-They first train the instance segmentation network on the Cityscapes dataset with 3,475 training images and then fine-tune it on the KITTI dataset.
----------------------------

Photo-Realistic Facial Details Synthesis from Single Image
TLDR: The authors present a single-image 3D face synthesis technique that can handle challenging facial expressions while recovering fine geometric details.
Model/Architecture used:
Top: training stage for (a) emotion-driven proxy generation and (b) facial detail synthesis. Bottom: testing stage for an input image.
Model accuracy
To verify the robustness of the algorithm, the authors tested their emotion-driven proxy generation and facial detail synthesis approach on over 20,000 images.
-Expression Generation: Their emotion-based expression predictor exploits global information from images and is able to more accurately capture expressions, especially for jowls and eye bags.
-Facial Detail Synthesis: Their approach also handles surface noise from eyebrows and beards better while preserving skin details.
-Quantitative evaluation: Their approach produces much lower errors on strong expressions, especially near the nose and eyebrows where shape deformations are strong.
Lastly, their output displacement map is easy to integrate with existing rendering pipelines and can produce high-fidelity results.
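As a hedged illustration of how a displacement map plugs into a simple mesh pipeline (an assumption for illustration, not the paper's renderer), the sketch below displaces each vertex along its normal by the value sampled at its UV coordinate.

```python
# Applying a scalar displacement map to a mesh (illustrative, not the paper's code).
import numpy as np

def apply_displacement(vertices, normals, uvs, disp_map, scale=1.0):
    """vertices, normals: (N, 3); uvs: (N, 2) in [0, 1]; disp_map: (H, W) scalar displacement."""
    h, w = disp_map.shape
    px = np.clip((uvs[:, 0] * (w - 1)).astype(int), 0, w - 1)   # nearest-neighbour UV lookup
    py = np.clip((uvs[:, 1] * (h - 1)).astype(int), 0, h - 1)
    d = disp_map[py, px] * scale                                # per-vertex displacement amount
    return vertices + normals * d[:, None]                      # move each vertex along its normal
```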
Datasets used:
-To obtain emotion features, they use the AffectNet dataset.
-For training, they use high-resolution facial images from the AffectNet emotion dataset.
-To acquire training data for the geometry loss, they built a small-scale facial capture system consisting of 5 Canon 760D DSLRs and 9 polarized flash lights. They capture a total of 23 images for each scan: uniform-illumination images from 5 different viewpoints and 9 pairs of vertically polarized lighting images.

Written by himanshu_ragtah | Ex - Tesla. Kairos Society Fellow. SpaceX OpenLoop. TEDx speaker.
Published by HackerNoon on 2019/10/08