Overview
In this assignment, we are given a set of scanned images from the Prokudin-Gorskii photo collection, where each scene is captured as three monochromatic glass-plate images taken through red, green, and blue filters, respectively. The goal is to colorize the images by aligning the three color channels and then combining them into a single color image.
Approach
We assume that the alignment needed between the three color channels is a simple translation and perform an exhaustive search over a set of (x, y) translations to find the one with the smallest error. This can be done directly on smaller images. For larger images, we downsample them to form an image pyramid, perform the search on the coarsest level first, and then refine the alignment based on the result from the previous level.
Metrics and Single-Scale Alignment
We tried three different metrics to measure the error between the color channels: the L2 norm (sum of squared differences), cosine similarity (normalized cross-correlation, NCC), and the structural similarity index measure (SSIM). All of these metrics can be applied either directly to the color channels or to their gradients. Instead of using something like scipy.ndimage.convolve, we compute the gradients in the two directions by taking the difference between images shifted by one pixel (with np.roll). The L2 norm or NCC can then be applied to the magnitude of the gradients.
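A minimal sketch of this gradient computation and of the two simpler metrics, assuming float images; the helper names are ours:

```python
import numpy as np

def gradient_magnitude(img):
    # Forward differences via np.roll: shift the image by one pixel along
    # each axis and subtract it from the unshifted image.
    dy = img - np.roll(img, 1, axis=0)   # vertical gradient
    dx = img - np.roll(img, 1, axis=1)   # horizontal gradient
    return np.sqrt(dx ** 2 + dy ** 2)

def l2_error(a, b):
    # Sum of squared differences; lower is better.
    return np.sum((a - b) ** 2)

def ncc_error(a, b):
    # Negated normalized cross-correlation, so that lower is better.
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return -np.mean(a * b)
```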
Compared to the L2 norm and NCC, SSIM is less sensitive to luminance and contrast, but more sensitive to structure and to small translations. SSIM values lie in the range [-1, 1], where 1 means perfect similarity, 0 means no similarity, and -1 indicates an inverse relation. We therefore use 1 - abs(ssim) as the error to minimize.
The original SSIM is computed over all sliding windows, which is too expensive for large images. Instead, we sample a set of NxN windows by randomly selecting K pixels from the top-10K pixels with the largest gradient magnitude, and compute the average SSIM over these windows on the gradient magnitude. We found that N=11 and K=300 work well on the given images. The L2 norm and NCC can also use the same sampling strategy, although it is unnecessary for them.
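A sketch of this window-sampling scheme, using skimage's structural_similarity as the per-window SSIM; the function name, the data_range assumption, and the choice to sample from the first image's gradients are ours:

```python
import numpy as np
from skimage.metrics import structural_similarity

def sampled_ssim_error(ga, gb, n=11, k=300, pool=10_000, seed=None):
    # ga, gb: gradient-magnitude images of the two channels.
    rng = np.random.default_rng(seed)
    h, w = ga.shape
    half = n // 2
    # Consider only window centers whose n x n window lies fully inside.
    interior = ga[half:h - half, half:w - half]
    top = np.argsort(interior.ravel())[-pool:]        # top-`pool` gradients
    picks = rng.choice(top, size=min(k, top.size), replace=False)
    ys, xs = np.unravel_index(picks, interior.shape)
    ys, xs = ys + half, xs + half
    vals = []
    for y, x in zip(ys, xs):
        wa = ga[y - half:y + half + 1, x - half:x + half + 1]
        wb = gb[y - half:y + half + 1, x - half:x + half + 1]
        # data_range=1.0 assumes inputs roughly normalized to [0, 1].
        vals.append(structural_similarity(wa, wb, data_range=1.0))
    return 1.0 - abs(np.mean(vals))   # SSIM in [-1, 1]; 1 = perfect match
```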
In the single-scale implementation, given a range of translations (x_min, x_max, y_min, y_max), we do a grid search over the range, shift the image by each translation, and compute the error over an internal region of the image. The internal region is defined by the user; we found that using the central 80% of the image works well. This allows us to ignore boundary artifacts both from the original image and from the shifting.
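A sketch of the single-scale search, with error_fn standing for any of the metrics above; the helper and parameter names are ours:

```python
import numpy as np

def align_single_scale(ref, mov, x_range, y_range, error_fn, keep=0.8):
    # Exhaustive grid search over integer translations of `mov` against `ref`.
    h, w = ref.shape
    # Central region (e.g. 80%) to ignore boundary and wrap-around artifacts.
    my, mx = int(h * (1 - keep) / 2), int(w * (1 - keep) / 2)
    best, best_err = (0, 0), np.inf
    for dy in range(y_range[0], y_range[1] + 1):
        for dx in range(x_range[0], x_range[1] + 1):
            shifted = np.roll(mov, (dy, dx), axis=(0, 1))
            err = error_fn(ref[my:h - my, mx:w - mx],
                           shifted[my:h - my, mx:w - mx])
            if err < best_err:
                best, best_err = (dx, dy), err
    return best   # (x, y) translation with the smallest error
```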
Multiscale Alignment
The user can specify the number of levels in the image pyramid, and we use sk.transform.resize to downsample the image by a factor of 2 at each level with anti-aliasing. At each level, we search around the best translation from the previous level for the new best translation. The maximum translation is specified at the finest level (level 0) and multiplied by s^i at level i, where s is a chosen factor. In our implementation, we use s=1.8, a maximum translation range of [-4, 4] at the finest level, and 4 levels in total. This results in a maximum search range of [-24, 24] at the coarsest level (8x downsampling).
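A coarse-to-fine sketch built on align_single_scale above; the structure and parameter names are our reading of the procedure:

```python
import math
import numpy as np
import skimage.transform

def align_pyramid(ref, mov, error_fn, levels=4, base_range=4, s=1.8):
    # Build the pyramid by repeated 2x downsampling with anti-aliasing.
    refs, movs = [ref], [mov]
    for _ in range(levels - 1):
        for imgs in (refs, movs):
            h, w = imgs[-1].shape
            imgs.append(skimage.transform.resize(
                imgs[-1], (h // 2, w // 2), anti_aliasing=True))
    dx = dy = 0
    for i in reversed(range(levels)):        # coarsest (levels-1) to finest (0)
        r = math.ceil(base_range * s ** i)   # e.g. ceil(4 * 1.8^3) = 24
        dx, dy = align_single_scale(refs[i], movs[i],
                                    (dx - r, dx + r), (dy - r, dy + r),
                                    error_fn)
        if i > 0:
            dx, dy = 2 * dx, 2 * dy          # propagate to the next finer level
    return dx, dy
```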
Results
Note: The following aligned images have been cropped using the auto cropping function described later. They are not auto contrasted or auto color balanced, however. The best translations found for the green and red channels are listed for each image:

Cathedral: Green channel shift: (2, 5). Red channel shift: (3, 12).

Emir: Green channel shift: (23, 48). Red channel shift: (41, 107).

Harvesters: Green channel shift: (17, 60). Red channel shift: (14, 124).

Icon: Green channel shift: (16, 41). Red channel shift: (22, 90).

Lady: Green channel shift: (10, 56). Red channel shift: (13, 120).

Self Portrait: Green channel shift: (30, 79). Red channel shift: (37, 176).

Three Generations: Green channel shift: (12, 52). Red channel shift: (9, 110).

Train: Green channel shift: (2, 42). Red channel shift: (29, 85).

Turkmen: Green channel shift: (18, 55). Red channel shift: (24, 114).

Village: Green channel shift: (9, 64). Red channel shift: (21, 137).
Additional Results
Note: These results are auto cropped, auto contrasted, and auto white balanced.
Bells & Whistles
PyTorch Implementation
The PyTorch version can be found in main_hw1_torch.py.
Automatic Cropping
We implement the auto cropping by applying two large 1D convolution kernels, similar to Sobel filters, to the luminance of the aligned RGB image. The kernels have the form [-1, ..., -1, 0, 1, ..., 1] and are normalized by the sum of the absolute values of the kernel. As applying them to the entire 2D image is slow, we apply them to the row and column average luminance instead.
We then take the maximum of the resulting signals and multiply it by a factor (we use 0.05 or 0.1) to get the threshold, since there can be multiple borders with large gradient values. Among all columns and rows whose horizontal/vertical gradient exceeds the threshold, we take the one closest to the center of the image, and crop the image at the found borders. The search for borders is limited to a 10% or 20% margin of the image, considering the usual border sizes. We chose a large kernel size (31) so that vertical or horizontal lines within the image are not mistaken for borders. As most borders in the high resolution images are very wide, this method works well.
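A sketch of the border search along one axis, using the channel mean as a luminance proxy; find_border and auto_crop are hypothetical helper names:

```python
import numpy as np

def find_border(profile, kernel_size=31, margin=0.2, factor=0.1):
    # `profile` is the row- or column-averaged luminance (a 1D signal).
    half = kernel_size // 2
    kernel = np.concatenate([-np.ones(half), [0.0], np.ones(half)])
    kernel /= np.abs(kernel).sum()            # normalize by sum of |kernel|
    grad = np.abs(np.convolve(profile, kernel, mode='same'))
    thresh = factor * grad.max()
    n = len(profile)
    m = int(n * margin)                       # only search within the margins
    left = [i for i in range(m) if grad[i] > thresh]
    right = [i for i in range(n - m, n) if grad[i] > thresh]
    # Among the candidates on each side, keep the one closest to the center.
    return (max(left) if left else 0), (min(right) if right else n)

def auto_crop(rgb):
    lum = rgb.mean(axis=2)                    # simple luminance proxy
    top, bottom = find_border(lum.mean(axis=1))
    left, right = find_border(lum.mean(axis=0))
    return rgb[top:bottom, left:right]
```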
Comparison between the original and cropped images can be found below:


Note that the vertical lines on the door are not mistaken for the border of the image.


This approach may sometimes fail to crop colored borders that are relatively smooth in luminance, though. For instance:


The smoothly varying borders at the bottom and on the right side are not cropped.
Automatic Contrast and White Balance
We implement a simple auto contrast and auto white balance scheme. The auto contrast is implemented by normalizing each channel to the range [0, 1]. The auto white balance is implemented with the Gray World assumption: we scale the red and blue channels so that their averages match that of the green channel, and clip the result.
We apply them after the auto cropping to reduce the influence of borders.
This simple method works reasonably well for most images.
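A minimal sketch of both steps, assuming float RGB images in [0, 1]; the function names are ours:

```python
import numpy as np

def auto_contrast(img):
    # Stretch each channel independently to the full [0, 1] range.
    out = img.astype(np.float64).copy()
    for c in range(3):
        ch = out[..., c]
        out[..., c] = (ch - ch.min()) / (ch.max() - ch.min() + 1e-8)
    return out

def gray_world_white_balance(img):
    # Gray World: scale red and blue so their means match the green mean.
    out = img.astype(np.float64).copy()
    g_mean = out[..., 1].mean()
    for c in (0, 2):                          # red and blue channels
        out[..., c] *= g_mean / (out[..., c].mean() + 1e-8)
    return np.clip(out, 0.0, 1.0)
```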


Left is the original, right is the auto contrasted and white balanced image.


Left is the original, right is the auto contrasted and white balanced image.


Left is the original, right is the auto contrasted and white balanced image.
Better Features
We apply SSIM to the gradient magnitude of the color channels, which is more robust to differences in luminance and contrast than applying the metrics directly to the color channels or using the L2 norm/NCC. Comparisons:



Top left: NCC on color channels. Top right: NCC on gradient magnitude. Bottom left: SSIM on color channels. Bottom right: SSIM on gradient magnitude. SSIM behaves stably even on the color channels, whereas NCC on the color channels can fail severely.




Zoomed-in view. Top left: NCC on color channels. Top right: NCC on gradient magnitude. Bottom left: SSIM on color channels. Bottom right: SSIM on gradient magnitude. NCC on the gradient magnitude shows slight misalignment (the green edge on the hat), whereas SSIM on the gradient magnitude shows no visible misalignment.
Acknowledgements
The website template was borrowed from Michaël Gharbi and Ref-NeRF.