Creating implicit 3D representations from point clouds and/or RGB images

Divam Gupta (divamg)

Tarasha Khurana (tkhurana)

Harsh Sharma (hsharma2)

The aim of the project is to take a single RGB image, along with a point cloud, and create a 3D model from them. Rather than explicitly constructing a mesh or another explicit 3D model, we learn a neural representation that can be used to render the scene from different views. For this we use several models, including NeRF and a vanilla CNN-based renderer.

We tackle the problem of densifying sparse modalities, either RGB images or point clouds. We make use of models such as neural radiance fields (NeRF) to represent the underlying radiance field of a scene. We present our approaches for super-resolving these inputs and discuss the challenges that accompany them.

Problem statement

Given an RGB image and a point cloud, we aim to create a 3D representation. The RGB image can be used to colorize the LiDAR point cloud, yielding an RGB point cloud. We can either feed the point cloud to the model explicitly as voxels, or generate multiple views from the RGB point cloud by projecting it along different camera views.
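As a rough sketch of this step, the snippet below colorizes a point cloud that is already expressed in the camera frame by projecting it into the image with a pinhole model. The intrinsics `fx, fy, cx, cy` and the function name are illustrative assumptions; a real LiDAR-to-camera setup would additionally require the extrinsic calibration.

```python
import numpy as np

def colorize_point_cloud(points, rgb, fx, fy, cx, cy):
    """Attach RGB colors to a point cloud (in camera coordinates) by projecting it into the image.

    points: (N, 3) points in the camera frame
    rgb:    (H, W, 3) image
    Returns an (M, 6) array of [x, y, z, r, g, b] for points that project inside the image.
    """
    h, w, _ = rgb.shape
    z = points[:, 2]
    valid = z > 1e-3                                  # keep points in front of the camera
    u = np.round(fx * points[valid, 0] / z[valid] + cx).astype(int)
    v = np.round(fy * points[valid, 1] / z[valid] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return np.concatenate(
        [points[valid][inside], rgb[v[inside], u[inside]].astype(np.float32)], axis=1)
```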

Overview of the models

Background: Neural Radiance Fields

We primarily use neural radiance fields (NeRF) to represent the radiance field of a scene. These encode a scene as a continuous volumetric radiance field $f$ of color and density. Specifically, for a 3D point $x \in \mathbb{R}^3$ and viewing direction unit vector $d \in \mathbb{R}^3$, $f$ returns a differential density $\sigma$ and RGB color $c$: $f(x, d) = (\sigma, c)$. The volumetric radiance field can be rendered into a 2D image by first sampling 5D coordinates (location and viewing direction) along camera rays and feeding those locations into an MLP to produce a color and volume density. Given these sets of colors and densities, volume rendering techniques can be used to composite the values into an image.
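As a concrete sketch of the compositing step, the hedged example below applies the standard NeRF quadrature rule, $\hat{C}(r) = \sum_i T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$ with $T_i = \exp\!\left(-\sum_{j<i} \sigma_j \delta_j\right)$, to per-ray samples; the tensor shapes and names are illustrative.

```python
import torch

def composite_rays(sigmas, colors, z_vals):
    """Volume-render per-ray samples into pixel colors (standard NeRF quadrature).

    sigmas: (R, S)    densities along each ray
    colors: (R, S, 3) RGB at each sample
    z_vals: (R, S)    sample depths along each ray
    Returns (R, 3) composited colors.
    """
    # Distances between adjacent samples; pad the last interval with a large value.
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    alphas = 1.0 - torch.exp(-sigmas * deltas)            # opacity per sample
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = alphas * trans                              # (R, S)
    return (weights.unsqueeze(-1) * colors).sum(dim=1)    # (R, 3)
```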

The figure above illustrates the procedure that optimizes a continuous 5D $(x, y, z, \theta, \phi)$ neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images.

Intuitively, an implicit radiance field can help with tasks such as super-resolution, since parts of the scene that are not visible from one view may be visible from another. However, directly synthesizing a higher-resolution image from the learnt radiance field produces artifacts, and we try to resolve these issues using coarse-to-fine registration and self-attention mechanisms.

Super-resolving RGB Images

As our first experiment, we train a NeRF-based model on sparse input images from multiple views. This is roughly equivalent to training on sparse projection images rendered from an RGB point cloud.

Coarse-to-fine Registration

We use a weighted positional encoding, as introduced in the recent work BARF.

Implementation:

We linearly increase the weight of the positional encoding from 0 to 1 between epoch 40 and epoch 80. The idea is that super-resolution can be thought of as hole-filling, i.e., filling in missing pixels: if an image is to be super-resolved by 2x, this is the same as filling in every alternate row and column between the rows and columns of the given low-resolution image. We introduce missing pixels in the image in three ways (a sketch of the weight schedule and masking modes follows the list):

  1. 'random': Randomly remove 50% of all the pixels in an image during training.
  1. 'patch': Consistently mask out the central 1/10th region of all the images.
  1. 'alternate': Mask out alternate rows and columns. This simulates super-resolution.
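Below is a minimal sketch of how the positional-encoding weight schedule and the three masking modes could be implemented. The epoch bounds and mask ratios mirror the description above; the function names and the single-scalar weighting (a simplification of BARF's per-frequency schedule) are our own illustrative choices.

```python
import torch

def pe_weight(epoch, start=40, end=80):
    """Linearly ramp the positional-encoding weight from 0 to 1 between `start` and `end` epochs."""
    return float(min(max((epoch - start) / (end - start), 0.0), 1.0))

def weighted_positional_encoding(x, num_freqs, epoch):
    """Sin/cos positional encoding scaled by a single coarse-to-fine weight."""
    w = pe_weight(epoch)
    feats = [x]
    for k in range(num_freqs):
        for fn in (torch.sin, torch.cos):
            feats.append(w * fn((2.0 ** k) * x))
    return torch.cat(feats, dim=-1)

def make_mask(h, w, mode):
    """Binary mask of the pixels that are *kept* during training."""
    mask = torch.ones(h, w, dtype=torch.bool)
    if mode == "random":        # drop 50% of all pixels at random
        mask = torch.rand(h, w) > 0.5
    elif mode == "patch":       # hide a central patch (one reading of "the central 1/10th region")
        ch, cw = h // 10, w // 10
        mask[(h - ch) // 2:(h + ch) // 2, (w - cw) // 2:(w + cw) // 2] = False
    elif mode == "alternate":   # hide alternate rows and columns, simulating 2x super-resolution
        mask[1::2, :] = False
        mask[:, 1::2] = False
    return mask
```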

We first look at the performance of NeRF when only sparse input is given. This is followed by our coarse-to-fine registration scheme, which unfortunately does not work, and hence we are not able to super-resolve images appropriately.

Behaviour of NeRF on Sparsifying RGB Input

| Property | w/o coarse-to-fine (PSNR, dB) |
| --- | --- |
| NeRF | 27.5 |
| NeRF (50% masked) | 26.9 |
| NeRF (masked patch) | 23.4 |
| NeRF-alternate (super-resolution) | 25.8 |
Qualitative results for the PSNR values in the table above. The quantitative and qualitative behaviour of NeRF is as expected. The sparse images in the first row are representative only; in the code, the holes are introduced differently.

With Coarse-to-fine Registration

| Property | w/o coarse-to-fine (PSNR, dB) | w/ coarse-to-fine (PSNR, dB) |
| --- | --- | --- |
| NeRF | 27.5 | 24.8 |
| NeRF (50% masked) | 26.9 | 23.1 |
| NeRF (masked patch) | 23.4 | 20.6 |
| NeRF-alternate (super-resolution) | 25.8 | 22.2 |

Unfortunately, with coarse-to-fine registration, our synthesis results become blurrier, as can be seen in the reduced detail at the backrest of the chair.

Going from Point Clouds to an implicit 3D representation

After the experiments on sparse images, where sparsity is artificially generated, we run experiments with real RGB point clouds.

We take a single RGB image from the NYUv2 Depth dataset along with its corresponding RGB point cloud in camera coordinates. We generate images from new views using just this single RGB point cloud. These new views contain holes, which we try to fill with the above model.

Implementation:

Random poses are generated with 10 degrees of noise in rotation and 10 cm of noise in translation. We train the NeRF model with the white background option set to True, hence the outputs look different from the original image. The model synthesizes the scenes coarsely, so the artifacts from projecting into a new view go away, but the reconstruction is not fine enough.
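A hedged sketch of how such noisy views could be generated from the RGB point cloud: sample a small random rotation and translation, transform the points, and project them with a pinhole camera. The noise magnitudes and white background follow the description above; the intrinsics `fx, fy, cx, cy`, function names, and the lack of a z-buffer (occlusions are ignored) are illustrative simplifications.

```python
import numpy as np

def random_pose(rot_deg=10.0, trans_m=0.10):
    """Small random SE(3) perturbation: up to ~10 degrees of rotation and ~10 cm of translation."""
    angles = np.deg2rad(np.random.uniform(-rot_deg, rot_deg, size=3))
    cx_, cy_, cz_ = np.cos(angles)
    sx_, sy_, sz_ = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx_, -sx_], [0, sx_, cx_]])
    Ry = np.array([[cy_, 0, sy_], [0, 1, 0], [-sy_, 0, cy_]])
    Rz = np.array([[cz_, -sz_, 0], [sz_, cz_, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx, np.random.uniform(-trans_m, trans_m, size=3)

def render_sparse_view(points, colors, R, t, fx, fy, cx, cy, h, w):
    """Project an RGB point cloud into a new camera view; unfilled pixels stay white (holes)."""
    p = points @ R.T + t                           # transform into the new camera frame
    z = p[:, 2]
    keep = z > 1e-3
    u = np.round(fx * p[keep, 0] / z[keep] + cx).astype(int)
    v = np.round(fy * p[keep, 1] / z[keep] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    img = np.full((h, w, 3), 255, dtype=np.uint8)  # white background, as in training
    img[v[inside], u[inside]] = colors[keep][inside]
    return img
```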

Figure: the original NYUv2 image, three new views projected from its RGB point cloud, and pairs of target sparse views with the corresponding synthesized dense views.

Context NeRF

We experiment with a NeRF setting where we train across multiple images of different models rather than multiple views of the same model. This idea is similar to pixelNeRF, and is meant to make the network learn priors from the dataset.

This did not seem to capture the point clouds' and images' representations well, and the resulting renderings are not meaningful. We use the ShapeNet dataset for this task.

The point cloud is first mapped to a binary voxel grid of size 32×32×32 and then passed to a 3D CNN to obtain a point cloud context vector. The input image is encoded into a context vector by a 2D CNN.
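A minimal sketch of the two context encoders, assuming a 32×32×32 binary voxel grid, a 64×64 input image, and standard strided convolutions; the layer widths and the 128-d context size are illustrative, not the exact architecture we used.

```python
import torch
import torch.nn as nn

class VoxelEncoder(nn.Module):
    """3D CNN: binary voxel grid (1, 32, 32, 32) -> point cloud context vector."""
    def __init__(self, context_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 4
            nn.Flatten(),
            nn.Linear(64 * 4 * 4 * 4, context_dim),
        )

    def forward(self, voxels):          # voxels: (B, 1, 32, 32, 32)
        return self.net(voxels)         # (B, context_dim)

class ImageEncoder(nn.Module):
    """2D CNN: RGB image (3, 64, 64) -> image context vector."""
    def __init__(self, context_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, context_dim),
        )

    def forward(self, img):             # img: (B, 3, 64, 64)
        return self.net(img)            # (B, context_dim)

# The two context vectors are concatenated and fed to the NeRF MLP
# together with the positionally encoded location and viewing direction.
```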

ContextNeRF architecture

Resulting rendered images from ContextNeRF. The model doesn't seem to capture the image and 3D representation of the objects.

Non-NeRF model

We also try an encoder-decoder model which takes an image and a pose as input and outputs a new view. The model takes an image and a point cloud as input to generate a 3D representation vector. As in the model above, the point cloud is first mapped to a binary voxel grid of size 32×32×32 and passed to a 3D CNN to obtain the point cloud representation vector, and the input image is encoded into a representation vector by a 2D CNN.

The concatenated output is passed to a pose-transformation MLP which learns to transform the 3D representation to the given new pose.

A rendering CNN, a standard ConvNet with transposed convolutions, is used to render the transformed 3D representation into an image.

A simple encoder-decoder architecture. The model takes an image and a point cloud as input to generate a 3D representation vector. Given a pose, an MLP transforms the 3D embedding, which is then passed through transposed convolutions to render the final image.
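A hedged sketch of this encoder-decoder pipeline, reusing the hypothetical `ImageEncoder` and `VoxelEncoder` from the ContextNeRF sketch above; the embedding sizes, the pose parameterization (here a flattened 3×4 camera matrix), and the decoder widths are illustrative rather than the exact architecture.

```python
import torch
import torch.nn as nn

class PoseTransformMLP(nn.Module):
    """Transforms the 3D representation vector into the frame of a target pose."""
    def __init__(self, embed_dim=256, pose_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, embedding, pose):
        return self.net(torch.cat([embedding, pose], dim=-1))

class RenderingCNN(nn.Module):
    """Transposed-convolution decoder: embedding -> (3, 64, 64) image."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 128 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(), # 32 -> 64
        )

    def forward(self, embedding):
        x = self.fc(embedding).view(-1, 128, 4, 4)
        return self.net(x)

class NovelViewModel(nn.Module):
    """Image + voxelized point cloud + target pose -> rendered new view."""
    def __init__(self, context_dim=128):
        super().__init__()
        self.image_enc = ImageEncoder(context_dim)   # hypothetical encoders from the sketch above
        self.voxel_enc = VoxelEncoder(context_dim)
        self.pose_mlp = PoseTransformMLP(embed_dim=2 * context_dim)
        self.renderer = RenderingCNN(embed_dim=2 * context_dim)

    def forward(self, img, voxels, pose):
        embedding = torch.cat([self.image_enc(img), self.voxel_enc(voxels)], dim=-1)
        return self.renderer(self.pose_mlp(embedding, pose))
```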

We can see that the model is able to learn a decent 3D representation from a single input.

The first column is the input to the model, the second column is the ground-truth image at the new pose, and the third column is the 3D render along the new pose.

Visualizing the learned 3D representation.

To visualize the learned 3D representation, we render it from multiple views.
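For instance, with the hypothetical `NovelViewModel` sketched above, one could sweep the pose input and decode an image for each view; the pose construction here is purely illustrative.

```python
import numpy as np
import torch

def render_turntable(model, img, voxels, num_views=8):
    """Render the learned representation from several azimuth angles around the object."""
    frames = []
    for k in range(num_views):
        angle = 2 * np.pi * k / num_views
        c, s = np.cos(angle), np.sin(angle)
        # Rotation about the vertical axis, flattened into the 12-d pose vector used above.
        pose = np.array([[c, 0, s, 0],
                         [0, 1, 0, 0],
                         [-s, 0, c, 2.0]], dtype=np.float32).reshape(1, 12)
        with torch.no_grad():
            frames.append(model(img, voxels, torch.from_numpy(pose)))
    return frames
```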

Here all the 3D views are generated from a single input image.

GAN Based Model

We also experiment with combining the above model with a GAN, adding an adversarial loss and a discriminator. The discriminator takes the image generated at the new pose together with the input image and tries to tell generated images apart from real ones.

We tried several different hyperparameters, but we observe that the GAN does not significantly improve the results.
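A minimal sketch of how the adversarial term could be combined with a reconstruction loss, assuming a hypothetical `discriminator` that is conditioned on the input image and takes the (generated or real) novel view; the L1 reconstruction term and the loss weight are illustrative choices, not the exact configuration we used.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_view, target_view, input_img, discriminator, adv_weight=0.01):
    """Reconstruction loss plus a non-saturating adversarial term for the rendered view."""
    recon = F.l1_loss(pred_view, target_view)
    logits_fake = discriminator(torch.cat([input_img, pred_view], dim=1))
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    return recon + adv_weight * adv

def discriminator_loss(pred_view, target_view, input_img, discriminator):
    """Discriminator tries to tell real novel views from generated ones, given the input image."""
    logits_real = discriminator(torch.cat([input_img, target_view], dim=1))
    logits_fake = discriminator(torch.cat([input_img, pred_view.detach()], dim=1))
    real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return 0.5 * (real + fake)
```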

References

  1. Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." European Conference on Computer Vision. Springer, Cham, 2020.
  1. Yu, Alex, et al. "pixelNeRF: Neural Radiance Fields from One or Few Images." arXiv preprint arXiv:2012.02190 (2020).
  1. Lin, Chen-Hsuan, et al. "BARF: Bundle-Adjusting Neural Radiance Fields." arXiv preprint arXiv:2104.06405 (2021).
  1. https://github.com/krrish94