Overview
In this assignment, we are given a set of scanned images from the Prokudin-Gorskii photo collection, where each scene is captured as three monochromatic glass-plate images taken through red, green, and blue filters, respectively. The goal is to colorize the images by aligning the three color channels and then combining them into a single color image.
Approach
We assume that the alignment needed between the three color channels is a simple translation and perform an exhaustive search over a set of (x, y) translations to find the one with the smallest error. This can be done directly on smaller images. For larger images, we downsample them to form an image pyramid, perform the search on the coarsest level first, and then refine the alignment based on the result from the previous level.
Metrics and Single-Scale Alignment
We tried three different metrics to measure the error between the color channels: the L2 norm (sum of squared differences), cosine similarity (normalized cross-correlation, NCC), and the structural similarity index measure (SSIM). All of these metrics can be applied either directly to the color channels or to their gradients. Instead of using something like scipy.ndimage.convolve, we compute the gradients in the two directions by taking the difference between images shifted by one pixel (with np.roll). The L2 norm or NCC can then be applied to the magnitude of the gradients.
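A minimal sketch of this gradient computation and of the two simpler metrics, assuming float images; the helper names are ours:

```python
import numpy as np

def gradient_magnitude(img):
    # Forward differences via np.roll: shift the image by one pixel along
    # each axis and subtract it from the unshifted image.
    dy = img - np.roll(img, 1, axis=0)   # vertical gradient
    dx = img - np.roll(img, 1, axis=1)   # horizontal gradient
    return np.sqrt(dx ** 2 + dy ** 2)

def l2_error(a, b):
    # Sum of squared differences; lower is better.
    return np.sum((a - b) ** 2)

def ncc_error(a, b):
    # Negated normalized cross-correlation, so that lower is better.
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return -np.mean(a * b)
```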
Compared to the L2 norm and NCC, SSIM is less sensitive to luminance and contrast, but more sensitive to structure and to small translations. SSIM values lie in the range [-1, 1], where 1 means perfect similarity, 0 means no similarity, and -1 indicates an inverse relation. We therefore use 1 - abs(ssim) as the error to minimize.
The original SSIM is computed over all sliding windows, which is too expensive for large images. Instead, we sample a set of NxN windows by randomly selecting K pixels from the top-10K pixels with the largest gradient magnitude, and compute the average SSIM over these windows on the gradient magnitude. We found that N=11 and K=300 work well on the given images. The L2 norm and NCC can also use the same sampling strategy, although it is unnecessary for them.
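A sketch of this window-sampling scheme, using skimage's structural_similarity as the per-window SSIM; the function name, the data_range assumption, and the choice to sample from the first image's gradients are ours:

```python
import numpy as np
from skimage.metrics import structural_similarity

def sampled_ssim_error(ga, gb, n=11, k=300, pool=10_000, seed=None):
    # ga, gb: gradient-magnitude images of the two channels.
    rng = np.random.default_rng(seed)
    h, w = ga.shape
    half = n // 2
    # Consider only window centers whose n x n window lies fully inside.
    interior = ga[half:h - half, half:w - half]
    top = np.argsort(interior.ravel())[-pool:]        # top-`pool` gradients
    picks = rng.choice(top, size=min(k, top.size), replace=False)
    ys, xs = np.unravel_index(picks, interior.shape)
    ys, xs = ys + half, xs + half
    vals = []
    for y, x in zip(ys, xs):
        wa = ga[y - half:y + half + 1, x - half:x + half + 1]
        wb = gb[y - half:y + half + 1, x - half:x + half + 1]
        # data_range=1.0 assumes inputs roughly normalized to [0, 1].
        vals.append(structural_similarity(wa, wb, data_range=1.0))
    return 1.0 - abs(np.mean(vals))   # SSIM in [-1, 1]; 1 = perfect match
```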
In the single-scale implementation, given a range of translations (x_min, x_max, y_min, y_max), we do a grid search over the range, shift the image by each translation, and compute the error over an internal region of the image. The internal region is defined by the user; we found that using the central 80% of the image works well. This allows us to ignore boundary artifacts both from the original image and from the shifting.
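A sketch of the single-scale search, with error_fn standing for any of the metrics above; the helper and parameter names are ours:

```python
import numpy as np

def align_single_scale(ref, mov, x_range, y_range, error_fn, keep=0.8):
    # Exhaustive grid search over integer translations of `mov` against `ref`.
    h, w = ref.shape
    # Central region (e.g. 80%) to ignore boundary and wrap-around artifacts.
    my, mx = int(h * (1 - keep) / 2), int(w * (1 - keep) / 2)
    best, best_err = (0, 0), np.inf
    for dy in range(y_range[0], y_range[1] + 1):
        for dx in range(x_range[0], x_range[1] + 1):
            shifted = np.roll(mov, (dy, dx), axis=(0, 1))
            err = error_fn(ref[my:h - my, mx:w - mx],
                           shifted[my:h - my, mx:w - mx])
            if err < best_err:
                best, best_err = (dx, dy), err
    return best   # (x, y) translation with the smallest error
```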
Multiscale Alignment
The user can specify the number of levels in the image pyramid, and we use sk.transform.resize to downsample the image by a factor of 2 at each level with anti-aliasing. At each level, we search around the best translation from the previous level for the new best translation. The maximum translation is specified at the finest level (level 0) and multiplied by s^i at level i, where s is a chosen factor. In our implementation, we use s=1.8, a maximum translation range of [-4, 4] at the finest level, and 4 levels in total. This results in a maximum search range of [-24, 24] at the coarsest level (8x downsampling).
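A coarse-to-fine sketch built on align_single_scale above; the structure and parameter names are our reading of the procedure:

```python
import math
import numpy as np
import skimage.transform

def align_pyramid(ref, mov, error_fn, levels=4, base_range=4, s=1.8):
    # Build the pyramid by repeated 2x downsampling with anti-aliasing.
    refs, movs = [ref], [mov]
    for _ in range(levels - 1):
        for imgs in (refs, movs):
            h, w = imgs[-1].shape
            imgs.append(skimage.transform.resize(
                imgs[-1], (h // 2, w // 2), anti_aliasing=True))
    dx = dy = 0
    for i in reversed(range(levels)):        # coarsest (levels-1) to finest (0)
        r = math.ceil(base_range * s ** i)   # e.g. ceil(4 * 1.8^3) = 24
        dx, dy = align_single_scale(refs[i], movs[i],
                                    (dx - r, dx + r), (dy - r, dy + r),
                                    error_fn)
        if i > 0:
            dx, dy = 2 * dx, 2 * dy          # propagate to the next finer level
    return dx, dy
```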
Results
Note: The following aligned images have been cropped using the auto cropping function described later. They are not auto contrasted or auto color balanced, however. The best translations found for the green and red channels are listed for each image:

Cathedral: Green channel shift: (2, 5). Red channel shift: (3, 12).

Emir: Green channel shift: (23, 48). Red channel shift: (41, 107).

Harvesters: Green channel shift: (17, 60). Red channel shift: (14, 124).

Icon: Green channel shift: (16, 41). Red channel shift: (22, 90).

Lady: Green channel shift: (10, 56). Red channel shift: (13, 120).

Self Portrait: Green channel shift: (30, 79). Red channel shift: (37, 176).

Three Generations: Green channel shift: (12, 52). Red channel shift: (9, 110).

Train: Green channel shift: (2, 42). Red channel shift: (29, 85).

Turkmen: Green channel shift: (18, 55). Red channel shift: (24, 114).

Village: Green channel shift: (9, 64). Red channel shift: (21, 137).
Additional Results
Note: These results are auto cropped, auto contrasted, and auto white balanced.
Bells & Whistles
PyTorch Implementation
The PyTorch version can be found in main_hw1_torch.py.
Automatic Cropping
We implement the auto cropping by applying two large 1D convolution kernels, similar to Sobel filters, to the luminance of the aligned RGB image. The kernels have the form [-1, ..., -1, 0, 1, ..., 1] and are normalized by the sum of the absolute values of the kernel. As applying them to the entire 2D image is slow, we apply them to the row and column average luminance instead.
We then take the maximum of the resulting signals and multiply it by a factor (we use 0.05 or 0.1) to get the threshold, since there can be multiple borders with large gradient values. Among all columns and rows whose horizontal/vertical gradient exceeds the threshold, we take the one closest to the center of the image, and crop the image at the found borders. The search for borders is limited to a 10% or 20% margin of the image, considering the usual border sizes. We chose a large kernel size (31) so that vertical or horizontal lines within the image are not mistaken for borders. As most borders in the high resolution images are very wide, this method works well.
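A sketch of the border search along one axis, using the channel mean as a luminance proxy; find_border and auto_crop are hypothetical helper names:

```python
import numpy as np

def find_border(profile, kernel_size=31, margin=0.2, factor=0.1):
    # `profile` is the row- or column-averaged luminance (a 1D signal).
    half = kernel_size // 2
    kernel = np.concatenate([-np.ones(half), [0.0], np.ones(half)])
    kernel /= np.abs(kernel).sum()            # normalize by sum of |kernel|
    grad = np.abs(np.convolve(profile, kernel, mode='same'))
    thresh = factor * grad.max()
    n = len(profile)
    m = int(n * margin)                       # only search within the margins
    left = [i for i in range(m) if grad[i] > thresh]
    right = [i for i in range(n - m, n) if grad[i] > thresh]
    # Among the candidates on each side, keep the one closest to the center.
    return (max(left) if left else 0), (min(right) if right else n)

def auto_crop(rgb):
    lum = rgb.mean(axis=2)                    # simple luminance proxy
    top, bottom = find_border(lum.mean(axis=1))
    left, right = find_border(lum.mean(axis=0))
    return rgb[top:bottom, left:right]
```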
Comparison between the original and cropped images can be found below:


Note that the vertical lines on the door are not mistaken for the border of the image.


This approach may sometimes fail to crop colored borders that are relatively smooth in luminance, though. For instance:


The smoothly varying borders at the bottom and on the right side are not cropped.
Automatic Contrast and White Balance
We implement a simple auto contrast and auto white balance scheme. The auto contrast is implemented by normalizing each channel to the range [0, 1]. The auto white balance is implemented with the Gray World assumption: we scale the red and blue channels so that their averages match that of the green channel, and clip the result.
We apply them after the auto cropping to reduce the influence of borders.
This simple method works reasonably well for most images.
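A minimal sketch of both steps, assuming float RGB images in [0, 1]; the function names are ours:

```python
import numpy as np

def auto_contrast(img):
    # Stretch each channel independently to the full [0, 1] range.
    out = img.astype(np.float64).copy()
    for c in range(3):
        ch = out[..., c]
        out[..., c] = (ch - ch.min()) / (ch.max() - ch.min() + 1e-8)
    return out

def gray_world_white_balance(img):
    # Gray World: scale red and blue so their means match the green mean.
    out = img.astype(np.float64).copy()
    g_mean = out[..., 1].mean()
    for c in (0, 2):                          # red and blue channels
        out[..., c] *= g_mean / (out[..., c].mean() + 1e-8)
    return np.clip(out, 0.0, 1.0)
```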


Left is the original, right is the auto contrasted and white balanced image.


Left is the original, right is the auto contrasted and white balanced image.


Left is the original, right is the auto contrasted and white balanced image.
Better Features
We apply SSIM to the gradient magnitude of the color channels, which is more robust to differences in luminance and contrast than applying the metrics directly to the color channels or using the L2 norm/NCC. Comparisons:



Top left: NCC on color channels. Top right: NCC on gradient magnitude. Bottom left: SSIM on color channels. Bottom right: SSIM on gradient magnitude. SSIM behaves stably even on the color channels, whereas NCC on the color channels can fail severely.




Zoomed-in view. Top left: NCC on color channels. Top right: NCC on gradient magnitude. Bottom left: SSIM on color channels. Bottom right: SSIM on gradient magnitude. NCC on the gradient magnitude shows slight misalignment (the green edge on the hat), whereas SSIM on the gradient magnitude shows no visible misalignment.
Acknowledgements
The website template was borrowed from Michaël Gharbi and Ref-NeRF.