Full article: Synthetic Images for Georeferencing Camera Images in Mobile Mapping Point-clouds

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

Accurate three-dimensional mapping and digital twinning provides a powerful tool for effective maintenance of civil infrastructure and supports efficient future planning of new developments. Three-dimensional mapping can be efficiently performed with a Mobile Mapping System (MMS) that records geospatial data from platform-mounted sensors. However, it is expensive to continuously update datasets by re-capturing with MMS. This paper outlines a novel method allowing camera-only approaches for updating and change detection. It resolves key issues with inherent resolution differences between MMS laser scanner point-clouds and camera images. An intermediary is used to register two disparate datasets. This novel approach to synthetic camera images (SCIs) bridges the differences between MMS point-clouds and camera images and aid in coarse registration of camera images to an outdoor MMS point-cloud. SCI coarse registration precision is maximized by generating surfaces, interpolating intensity values, and reducing noise with a median filter. Landmark features coarsely register the camera image to the MMS point-cloud. The coarse registration is most precise when the whole scene is captured either from the same location as the SCI or further from the scene. Landmarks precisely detect scenes when changes are less than 20%, and foliage does not exceed 20% of the camera image.

RÉSUMÉ

La cartographie tridimensionnelle précise et le jumelage numérique constituent des outils puissants pour l’entretien efficace des infrastructures civiles et favorisent une planification efficiente de nouveaux développements. La cartographie tridimensionnelle peut être réalisée efficacement à l’aide d’un système de cartographie mobile (SCM) qui enregistre les données géospatiales à partir de capteurs montés sur une plate-forme. Cependant, il est coûteux de mettre à jour en permanence les jeux de données en les enregistrant avec un SCM. Cet article décrit une nouvelle méthode permettant d’utiliser uniquement des caméras pour la mise à jour et la détection des changements. Il résout les principaux problèmes liés aux différences de résolution inhérentes entre les nuages de points des scanners laser SCM et les images des caméras. Un intermédiaire est utilisé pour enregistrer deux jeux de données disparates. Cette nouvelle approche, des images de caméra synthétique (ICS), comble les différences entre les nuages de points SCM et les images de caméra et facilite l’enregistrement grossier des images de caméra sur un nuage de points SCM extérieur. La précision du repérage grossier ICS est maximisée en générant des surfaces, en interpolant les valeurs d’intensité et en réduisant le bruit à l’aide d’un filtre médian. Des fonctions de repère enregistrent grossièrement l’image de la caméra dans le nuage de points SCM. Le repérage grossier est plus précis lorsque l’ensemble de la scène est capturé soit à partir du même emplacement que les ICS, soit plus loin dans la scène. Les points de repère détectent avec précision les scènes lorsque les changements sont inférieurs à 20% et que le feuillage ne dépasse pas 20% de l’image.

Introduction

Mapping infrastructure is important for the economy and public safety (Oliver et al. Citation2018). Mapping can be efficiently performed with a mobile mapping system (MMS) that records geographical data from platform-mounted sensors. The platform is moved along a route to observe as-built infrastructure in urban corridors or capture the landform for analysis. MMS data is used to generate a 3D model for infrastructure planning, environmental monitoring, emergency response planning, and resource management. The use of accurate and up-to-date, as-built 3D models support infrastructure expansion and upgrading like roads, railways, powerlines, bridges, and tunnels. Discussion of a 3D city model provides an example.

A 3D city model with the shapes of buildings and other existing objects can be reconstructed with highly detailed spatial information. It is becoming increasingly common for industry to use detailed point clouds to support engineering design. For example, this paper describes a 3D model generated from an MMS is used to plan the expansion to the light rail transit system in the City of Calgary northeast. The model is used to identify infrastructure impacted by the future rail line and any encroachment on private property. These models, and subsequent infrastructure upgrades, are also key needs for autonomous vehicles and smart cities (Lemmens Citation2018; Oliver et al. Citation2018; Shahrour Citation2018).

Sometimes spatial mapping data and related 3D models are incomplete due to occlusions or because datasets become outdated in dynamic environments. Occlusions occur when an object obstructs another object. For example, a tree obstructing the view of a house wall as show at .

Figure 1. Causes of incomplete data: (a) occlusions and (b) construction altering infrastructure.

MMS data is often captured in dynamic environments like a city where changes frequently occur due to construction or temporary features such as traffic and pedestrians as seen at . The current method to resolve occlusions and changes is to re-observe the occluded or changed portion of the map with a full set of observation or multiple passes of the MMS. Updating map data is resource intensive and costly because it requires subsequent MMS passes since there is no straightforward updating method without recapturing the whole dataset or a subset of the data capture especially when updating 3D legacy models.

An alternate solution is to capture newer 2D camera images to fill the missing data, or map gaps, caused by occlusions or changes. Lowry et al. (Citation2016) suggests that updating or complementing maps with image sequences may fill map gaps by recognizing the changes from the image sequence, but a single image or image pair is currently unable to fill a map gap. Filling map gaps can be accomplished by registering the camera images to the MMS point-cloud. The pose of the camera image facilitates registration to the point-cloud. However, not all cameras have sensors with pose information from global navigation satellite systems (GNSS). Limited GNSS line-of-sight in urban corridors or indoor spaces makes it difficult to acquire a precise pose from sensors. In addition, georeferencing camera images to an MMS point-cloud can be used in other applications like forensic scene documentation, vehicle navigation, or construction engineering.

Registration of subsequent camera images is challenging because of unknown relative orientation parameters between the MMS and camera. This problem is described in where the camera image has unknown translation and orientation parameters represented by the red vectors. Vector $(C_{i}^{C})$ represents the translation parameters. The orientation parameters (roll $(ω),$ pitch $(ϕ),$ and yaw $(κ)$ ) make up the relative orientation rotation matrix from camera image to the MMS frame $(R_{i}^{G} R_{G}^{MMS} = R_{i}^{MMS}) .$ Therefore, it is necessary to independently determine the camera’s relative orientation to the MMS point-cloud. Due to the often-georeferenced nature of the MMS point-cloud, the relative orientation is defined by the external orientation parameters (EOPs). EOPs include position ( $X_{MMS},$ $Y_{MMS},$ $Z_{MMS}$ ) and orientation ( $ω,$ $ϕ,$ and $κ$ ), in the global frame. However, this method can be used when the point-cloud is not georeferenced then the relative orientation parameters are position and orientation in the mapping frame.

Figure 2. Pose of MMS and camera.

The camera cannot be finely registered to the point-cloud without coarse registration, or an initial estimation of the pose. In camera-to-camera terms this is known as place recognition. To the authors’ best knowledge, there is no discussion in the literature that applies place recognition between the two disparate datasets because there are no common primitives between point-clouds and camera imagery because of their spectral and spatial resolution differences.

This paper poses the research question, is it possible to coarsely register subsequent imagery to existing 3D MMS models without prior image pose information? The paper responds by describing: (i) a novel method for registering newer camera images to the MMS point-cloud either captured by a non-technical user or from crowed sourced images that do not contain pose information; (ii) a new workflow for generating an intermediary between MMS point-clouds and subsequently captured images; and (iii) describes a novel adaptation of existing camera-to-camera feature registration methods. The synthetic camera image (SCI) provides an intermediary to address the spectral and spatial resolution differences between the two datasets.

The paper is set out in three main sections: Literature Review, Methods, and Results and Discussion. The literature review describes the background on methods used for addressing the spectral and spatial resolution differences and identifies the need for a novel SCI generation method for MMS point-clouds. The method section describes the implementation of these methods for a novel synthetic image generation for MMS point-clouds to coarsely register the subsequently captured camera images using the Calgary Greenline dataset. It also describes a novel method for feature matching between camera and synthetic images for coarse registration that makes use of landmark features employed from AlexNet, a linear generic pretrained CNN, for generating invariant feature descriptors from edgeboxes. These methods are used to create the synthetic images and landmark features to minimize resolution differences with camera images and are tested for precise coarse registration.

Literature review

Mismatched spatial and spectral resolutions are challenges for registration between point-clouds and camera images. The research of Ku et al. (Citation2018) and Forkuo and King (Citation2004) examine theses spatial resolution challenges and Wendt and Heipke (Citation2006) examine the spectral resolution challenges for terrestrial laser scanners. Their research has not addressed how these challenges pertain to registration of MMS point-clouds.

Ku et al. (Citation2018) identified spatial resolution where a terrestrial laser scanner beam does not strike a surface as the points in the point-cloud are at infinity and do not appear in the model. For example, the point-cloud image in has no points in the sky, but the camera image shows clouds that can sometimes be identified as features for matching.

Figure 3. Challenges registering between camera images and point-clouds.

The second spatial resolution challenge is that sparse point distribution is insufficient for comparison to the 2D photograph because of gaps between the points (Forkuo and King Citation2004; Ku et al. Citation2018). These gaps are seen on the roof and the large wall in the where black indicates no points between the white intensity pixels. It is also an issue where the outline of the garage is occluded in the camera image when captured from another observation trajectory.

Spectral resolution defines a sensor’s ability to discern electromagnetic spectrum features. Cameras capture visible light, while MMS laser scanners measure narrow-bandwidth laser return energy (Wendt and Heipke Citation2006). Camera pixels convey RGB values, whereas MMS pixels represent NIR return energy (Gonzalez and Woods Citation2008). Camera intensity, derived from RGB, differs from MMS point-cloud returned intensity complicating camera to MMS point-cloud registration.

For fine registration between these two sensors, Forkuo and King (Citation2004) proposed generating a synthetic camera image (SCI) as an intermediate to translate primitives. Their SCI method does not appear to have been pursued but provides a potential intermediate.

Forkuo and King (Citation2004) used image processing and feature registration for addressing the spectral and spatial challenges in registering camera images to high-density terrestrial-scanner point-clouds. Their SCI is generated from a high-density point-cloud captured with a terrestrial laser scanner. Each point from the point-cloud is represented as a pixel in the SCI. The camera image is captured with the same pose as the synthetic camera and Harris corners (Harris and Stephens Citation1988) register camera images to the point-cloud because it ignores the spectral resolution challenges. Their method for registration generates an intermediary SCI with a synthetic camera at the same time and location as the camera image resulting in no scale or temporal changes. However, this method is incomplete, as it does not address variance in scale or pose or the MMS resolution challenges described in . A method is required to support coarse registration between camera images and point-clouds by providing an intermediary.

The following subsections will describe the literature for addressing the spatial and spectral resolution challenges. It is separated into three subsections: (i) Synthetic Image Generation; (ii) Camera Image Processing; and (iii) Coarse Registration Features. Synthetic Image Generation describes the literature for addressing the spatial resolution and synthetic image capture using surface generation, raytracing, and intensity interpolation. Camera Image Processing describes literature for spectral resolution processing for matching with the MMS point-cloud and downsampling for the spatial resolution. Coarse Registration Features examines the literature for feature detection, description and matching in camera-to-camera place recognition or coarse registration.

Synthetic image generation

To direct the literature review it is first necessary to introduce the novel approach to generating SCIs from large mobile mapping point-clouds that resolves the resolution differences in coarse registration of newer camera images to MMS point-clouds. The novel SCI method, described at , involves surface generation, raytracing, interpolation, and image processing. The literature relating to these methods are described in the subsections below.

Figure 4. SCI generation flowchart.

Surface generation

Surface generation handles sparsity and occlusion issues within point-clouds. A surface is generated from the point-cloud to provide a digital object between points that removes the distances between points and gives the synthetic camera something to detect. The surface also occludes objects from outside the scene or that should be hidden from the synthetic camera viewpoint. However, any surface generation method must also minimize artifacts and holes in surfaces generated from the MMS point-cloud to align with the camera images.

Poisson, Delaunay, and fast surface reconstruction (fast recon) are ubiquitous surface generation methods that were explored in this research for MMS applications to minimize artifacts (Kazhdan et al. Citation2006; Delaunay Citation1934; Marton et al. Citation2009). Poisson surface generation creates a continuous vector field from the oriented points, finds the closest scalar function gradient that matches the vector field, and extracts the isosurface as seen at (Kazhdan et al. Citation2006). The closed surface is then cropped by adjusting the scalar field based on the point density (Rumpler et al. Citation2013). Cropping the lower-density values removes enclosing surfaces and artifacts from the surface. However, it also leaves gaps in the surface where objects are occluded or the incidence angle was too great as seen at . As a result, the cropping values are chosen manually to minimize the number of artifacts while preventing the appearance of gaps as seen at .

Figure 5. Poisson surface reconstruction: (a) the initial surface reconstruction; (b) cropped surface showing gaps and some artifacts.

In contrast, Delaunay projects the points onto the x-y plane. The points are triangulated so no other point is inside the circumcircle for the triangle (Delaunay Citation1934). The mesh structure is then returned to 3D. The result is no need for manual thresholding, but this method is known for producing slivers of triangles creating jagged edges and difficulty with non-uniform point density (Hilton et al. Citation1996). Fast recon is a surface growing algorithm (Marton et al. Citation2009). A k-neighborhood is selected for each point by searching for the points nearest neighbors in a radius r. The radius adapts to the local point density by multiplying the distance of point p to its nearest neighbor ( $d_{0}$ ) with a user-specified threshold ( $μ$ ). The neighborhood is projected onto a plane tangential to p and its neighborhood. Non-visible points are pruned. Points are also connected to p and consecutive points by edges to form triangles. These triangles have a maximum angle criterion to describe the characteristics of the holes in the surface. Larger angles cause fewer holes in the model. The triangles also have a minimum angle criterion that helps to minimize slivers and artifacts in the model.

These surface generation methods are tested in the research to find the surface generation method that produces the fewest artifacts and holes in MMS point-clouds. This is done to resolve spatial differences including proper object occlusion and point distribution between the MMS point-cloud and a camera image. The chosen surface generation method is the fast recon method and is described in detail in the results section. The surface is struck by a ray cast through the synthetic camera model. This provides a pixel intensity value for the struck surface. The next section describes the synthetic camera model followed by the interpolation of the intensity value.

Ray tracing & orientation

Raytracing simulates geometric optics by tracing oriented rays through a synthetic camera into object space, which in this paper is the mobile mapping space (Formella and Gill Citation1995). Generally, raytracing involves thousands of intersecting rays from various illumination models. This paper uses it to capture the SCI by projecting the ray from the image plane, through the synthetic camera model, and into the MMS surface space with no additional light sources. Terdiman’s raytracing algorithm then uses a bounding volume tree to build a hierarchy of bounding volumes to speed up collision detection (Terdiman Citation2003; Stich et al. Citation2009). The intensity value of the intersection between the ray and the surface is then stored as the pixel value.

The ray is cast from the optical center ( $C_{j}),$ through the pixel on the image plane at point $p .$ When the ray strikes a surface at point $P$ the intensity value is then stored in the pixel. Rays are iteratively cast through each pixel on the image plane described by the synthetic camera model. The synthetic camera model contains distortion parameters including principal point offset $(c_{x}, c_{y});$ lens distortion coefficients (not shown); and shear distortion (not shown). This paper assumes an ideal camera model where all camera equation distortion values are ignored. The resulting camera matrix is, [1] $K = {[\begin{matrix} \frac{1}{p_{s}} & 0 & c_{x} \\ 0 & - \frac{1}{p_{s}} & c_{y} \\ 0 & 0 & - \frac{1}{f} \end{matrix}]}^{}$ [1] where, $f$ is the focal length, $c_{x}$ and $c_{y}$ are the principal point offsets as shown at , and $p_{s}$ is the pixel size for the synthetic camera. These parameters are chosen based on the desired camera model. This research discusses parameter choices in the coarse registration subsection of the method section and principal distance subsection of the results and discussion section.

Figure 6. Projection of a ray through the synthetic camera model into mapping space.

The camera orientation in the point-cloud frame is the rotation matrix $R_{sci}^{PC} .$ The rotation matrix is then transformed to get the orientation from the SCI frame to the point-cloud frame $(R_{sci}^{M})$ (Ellum and El-Sheimy Citation2002). [2] $R = R_{3} (ω) R_{2} (ϕ) R_{1} (κ) = R_{sci}^{M}$ [2]

The direction ( $\overset{⇀}{d}$ ) of the ray passing through a pixel is calculated as, [3] $\overset{⇀}{d} = {KR}_{sci}^{M} [\begin{matrix} j \\ i \\ 1 \end{matrix}]$ [3] where $j$ is the column location of the pixel and $i$ is the row location of the pixel. The rotation matrix is applied to the mapping frame and is described in the SCI orientation parameters subsection of the results and discussion section.

Finally, the position of the synthetic camera must be generated in the MMS point-cloud frame. This vector is represented by ${\vec{C}}_{j}$ seen at . A ray is cast from ${\vec{C}}_{j}$ in the direction of each pixel in the synthetic camera array.

Intensity interpolation

To find the spectral or intensity value of the surface where the pixel strikes, it is necessary to interpolate from the nearest MMS point-cloud points. The natural neighbor interpolation method is used to preserve edges and minimize aliasing and artifacts in the final SCI. Natural neighbors interpolation has seven steps shown at (Fisher Citation2006, 97–108; Sibson Citation1981).

Figure 7. Vornoi cells for natural neighbors (adapted from Sibson (Citation1981)).

The new point $(P)$ is inserted, represented by the blue point in . Voronoi cells are drawn around the blue point and its neighbors. The white and blue areas represent these cells. The volume is calculated for these cells. $P$ is removed and the Voronoi cells are drawn with black lines around the neighbors only. The volumes are recalculated and the difference provides the weights, represented by green circles and numbers, for the intensity values. The natural neighbor shows significantly less aliasing and is used in feature matching between the synthetic and camera images. The camera image must be processed to minimize the spectral differences between the point-cloud and the camera image.

Camera image processing

As identified earlier, the spectral information differs between synthetic camera images and RGB cameras. To minimize this difference for matching, it is necessary to process the camera RGB image by transforming the color space. It is found that the hue, saturation and intensity (HSI) model minimized the spectral differences between the two image types (Gonzalez and Woods Citation2008).

HSI models decouple intensity information from color information (hue and saturation). Hue is used to describe the pure color’s attribute. This describes if the color is red, green, blue, yellow, etc. Saturation describes the amount of white light, and intensity describes the grey level. A reminder that this intensity is different than the laser scanner intensity, which describes the energy return of the laser.

It is necessary to downsample the image to provide similar spatial resolution to the synthetic image and overcome the spatial resolution problem. This is done by using the proportion of ground sampling distance of the camera and MMS point spacing as seen in where, [4] $Downsample ratio = \frac{{GSD}_{camera}}{PointSpacin g_{MMS}} = \frac{\frac{p_{s} * Distance}{f}}{sin (α_{scan}) * Distance} = \frac{p_{s}}{f * sin (α_{scan})}$ [4]

Figure 8. MMS point spacing vs camera ground sampling distance.

MMS point distribution is influenced by many variables including distance from scanner, scanning geometry like scanning angle, and velocity of the MMS (Puttonen et al. Citation2013). is a schematic representation of point distribution for simplicity.

Coarse registration features

Traditionally, keypoint feature-based registration is used for camera-to-camera. Scale-invariant feature transform (SIFT) (Lowe Citation2004) and speed-up robust features (SURF) (Bay et al. Citation2008), are two keypoint features used to successfully identify matching points for place-recognition between camera images (Gupta and Cecil Citation2014). These keypoint methods are dependent on intensity values to define the feature for matching.

Alternatively, image to point-cloud registration uses linear features or corner features like hessian features used by Forkuo and King (Citation2004). These features are robust to spectral and spatial differences. However, these methods are not viewpoint invariant. In contrast, CNNs are robust to spectral and spatial differences and viewpoint invariant but are not generalizable and are outperformed by traditional image processing and feature-based registration (Ku et al. Citation2018). Sunderhauf et al. (Citation2015) suggests landmark features for camera-to-camera coarse registration that are pose and intensity invariant.

Landmark features are a combination of traditional image processing with feature-based registration and CNNs. Objects are extracted from an image by drawing edge boxes around long contiguous edges. These are edgeboxes, which are then converted into a descriptor by running them through layers of a generic pretrained CNN. This descriptor is then compared using the cosine distance to find matches (Sunderhauf et al. Citation2015). This research proposes to adapt this method for camera image to point-cloud coarse registration.

Feature detection: edgeboxes

Edgeboxes use edges in the image to detect objects in the scene (Dollar and Zitnick Citation2013). A sliding window is moved around the image to find object proposals based on contours. Every proposal is given a score to identify if there is an object. A box is drawn around the object to capture the whole line and is stored as a feature.

CNNs are limited by 5 factors: (i) overfitting (McGrail and Rhodes Citation2020); (ii) data insufficiency (Xu et al. Citation2021); (iii) lack of interpretability (Mi et al. Citation2020); (iv) computational complexity (Kearns Citation1990); (v) continuous training (Parisi et al. Citation2019), and (vi) catastrophic forgetting (Parisi et al. Citation2019). These limitations can be avoided by extracting output before the decision layers and avoiding the necessity of training the network. Avoiding the limitations are why traditional features are used for this research. Therefore, edgeboxes were chosen over region-based CNNs such as Fast-RCNN and Mask-RCNN. Edgeboxes’ parameterization allows for the fine-tuning for the processed camera image and the SCI to capture the same features without the need for training on both datasets. The possibility of catastrophic forgetting, computational complexity, and over fitting are mitigated by edgeboxes.

Edgeboxes identify objects from the candidate bounding boxes is measured using intersection over union (IoU) where area of the intersection between candidate and ground truth boxes is divided by their union area (Zitnick and Dollár Citation2014). Some applications, like the camera-to-camera place recognition, work with a lower IoU score because they use similar images of similar scenes. However, for this application an IoU score greater than 0.85 is necessary because of the need for complete and distinct object features from both SCI and camera images. Lower scores mean fewer object features are being found in both data sets.

There are four important criteria that impact the IoU (Zitnick and Dollár Citation2014): (i) window step size; (ii) non-maximal suppression; (iii) minimum box score; and (iv) number of boxes. Each of these criteria is explained below. Observed values are provided in the results section.

The window step size defines how far the sliding window is moved in the image. Larger step sizes detect fewer overlapping objects meaning fewer detected objects and less feature detail. This leads to the need for different step sizes for SCIs and camera images. Smaller step sizes are used in the SCI to capture more detail and the edgebox captures more objects. Since, camera images have more detail it requires a larger step size to synchronize with the synthetic image that require smaller step sizes. This limited the number of smaller objects like mailboxes and outdoor lights that are usually detected by a smaller step size in the camera image.

The non-maximal suppression (NMS) is used to remove boxes if there is high overlap between two boxes. It accepts the box with the higher score. The threshold is used to determine the amount of overlap that is acceptable.

The minimum box score removes any boxes where the score is lower than the set threshold. The score is calculated from contours captured by the candidate box. Long edges connected with shallow curves are scored higher than edges connected with sharp curves.

The number of boxes is the maximum number of objects detected in the image. If edgeboxes detect fewer objects than the maximum number of boxes, it stores the smaller number of boxes. For example, if observing a shoe and the maximum number of boxes is 10 but we only count 3 – tongue, laces, and sole – the edgebox stores 3 for the observed objects.

Once the features are detected and an edgebox is drawn around the object, the box is cropped from the image. These cropped features are then passed through a CNN to extract the feature descriptor.

Feature description: neural network layers

CNNs are multi-layered neural networks specializing in recognizing visual patterns in images and are regularly used in camera-to-camera place recognition (Chen et al. Citation2014). CNNs rely on teaching datasets to make decisions. These decisions have been shown to be less generalizable than traditional feature matching methods and require colorized point-clouds (Ku et al. Citation2018). The five limitations of CNNs described above are avoided by stopping before the dense decision layers while using the convolutional and pooling layers to generate a robust distinctive landmark feature descriptor (Neubert et al. Citation2013). It is for these reasons that the research adopts an approach that ignores the dense decision layers for feature description.

The landmark feature method employs AlexNet, a linear generic pretrained CNN, for generating invariant feature descriptors from edgeboxes (Krizhevsky et al. Citation2012). The edgeboxes are passed through the neural layers and does not require specific environment training. AlexNet consists of four convolutional layers, two pooling layers, and linear feature-filtering neurons, providing intensity and pose-invariant solutions as seen at . This network efficiently processes images by convoluting, downsampling, and filtering for unique features (Krizhevsky et al. Citation2012). The output from the third convolutional layer is flattened and used as the feature descriptor in camera-to-camera imagery because Neubert et al. (Citation2013) found it was robust to appearance changes.

Figure 9. AlexNet Convolutional Neural Network design diagram for creating landmark feature descriptor (Patil et al. Citation2023).

shows the output of the first convolutional layer. Linear features are shown in the top line and color/intensity filter is highlighted in the third row. This paper tests these layers to find the most distinct feature for registration between camera and synthetic image.

Figure 10. Example of AlexNet’s first convolutional layer output.

Feature matching: cosine distance

The output from the chosen layer is then flattened into a one-dimensional array. This flattened descriptor is then passed into the decision layers for matching. However, for generalizability the flattened descriptor is matched using the cosine distance (Sunderhauf et al. Citation2015). The cosine distance between two flattened descriptor vectors ( $\overset{⇀}{r}$ and $\overset{⇀}{s}$ ) is: [5] $cos (θ) = \frac{\overset{⇀}{r} \overset{⇀}{s}}{‖ \overset{⇀}{r} ‖ ‖ \overset{⇀}{s} ‖}$ [5] where better matches result when the cosine distance is closest to one.

The synthetic feature descriptors are stored in a database with each image name, pose of the synthetic camera, and the feature descriptors for that image. The camera image feature descriptors are queried against this synthetic image database. The pose of the matched SCI is used as an approximate position of the camera image. The largest features are matched first. The best matches are refined by matching smaller features.

Methods

This section describes the implementation of the methods described in the literature review. Development of a new coarse registration method requires a new approach to SCI generation to resolve spatial and spectral resolution differences between the MMS point-cloud and subsequently captured camera images.

In the new method, synthetic images are generated from the MMS point-cloud following the steps shown in the left column of the flowchart at . A surface is generated from the MMS point-cloud using the fast recon surface generation method. The thresholds are tested and described in the results and discussion section. The synthetic image is generated using raytracing with orientation parameters perpendicular to MMS trajectory and captured at 10 meter intervals. Rays are traced through the camera model and the pixel value is interpolated with the natural neighbor method to minimize aliasing when it strikes a surface. A median filter is applied to the SCI to remove any salt-and-pepper noise while preserving edges. A database of the landmark feature descriptors is then generated from the series of SCIs. The testing of these parameters is described in the following subsections.

Figure 11. Flowchart for coarse registering subsequent camera images to MMS legacy model.

Camera image features are created by converting the image into the HSI color model where saturation and intensity are used to represent the SCI intensity and downsampling is done to match the spatial resolution of the SCIs. Features are detected in the processed image and passed through the CNN for description. These features are then queried against the SCI feature database using size similarity and the cosine distance to find the best match. Precision is used for the analysis of the registration as it represents the true positive registration over the total number of registrations. Another criterion, recall, represents the retrieved incidences over the relevant instances. This case does not have false negatives only false positives and true positives. Therefore, recall is not descriptive of the results and is not used in the analysis.

The precision of these matches is compared with the correct manually selected matches. A correct match is counted when the coarse registration and manually selected matches are the same. The total number of matches is used to determine the precision by, [6] $precision = \frac{Correct matches}{correct matches + incorre c t matches}$ [6]

shows the novel approach for coarse registration of a subsequent camera image to an MMS point-cloud using SCIs. The spatial resolution problem is solved using SCIs by surface generation, raytracing, natural neighbor intensity interpolation, and median filtering. Surface generation fills in the spaces between point-cloud points. Raytracing projects rays from the synthetic camera to find pixels that strike the surface. Natural neighbor intensity interpolation identifies the SCI pixel intensity value, and the median filter preserves edges and reduces noise in the final SCI. Research into coarse registration of camera images to non-colorized point-clouds.

Synthetic camera image

The surface generation, raytracing, and intensity interpolation and the camera image processing were tested in an indoor controlled experiment. A camera was setup at the same location as a terrestrial laser scanner. The experiment emulated Forkuo and King (Citation2004). However, fine registration was done using SIFT and SURF instead of Harris corners. The three surface generation methods were tested and compared in the indoor environment. They were examined for large holes and artifacts created from the surface generation methods. The raytracing and intensity interpolation were tested for their ability to generate the best representation of the laser scanner intensity image while minimizing aliasing.

The camera image processing was tested by matching SIFT and SURF features between the processed camera image and the SCI. The camera image was down sampled at the calculated 50% as well as 75%, 33% and 25% for comparison in the indoor experiment. Different color model representations were also tested. Grayscale and combinations of HSI and CMY color models were examined for matching. The coarse registration method is tested on an MMS data capture in an outdoor environment.

SIFT features were used to test the novel SCI method. SIFT works well on a dense field captured from a terrestrial scanner and provide precise matches between an indoor SCI and a captured image. However, SIFT descriptors are not unique when observing the outdoor scenes. Therefore, it was necessary to find a feature and descriptor that could match outdoor scenes to outdoor scenes. Landmark features are used because they mix global and linear features resulting in a feature robust to scale and intensity variations (Sunderhauf et al. Citation2015). Landmark features are modified to coarsely register camera images to the SCI using edgeboxes, neural network layers, and traditional matching techniques.

Coarse registration

The coarse registration method is tested on a large MMS dataset following the proposed light rail transit (LRT) line in Calgary, Alberta, Canada. The MMS dataset was captured in 2016 to expand Calgary’s LRT in the city’s northeast. Synthetic camera images are generated in two areas of the LRT route to capture a variety of building types and densities. It captures high-density commercial and residential, medium-density commercial, low-density commercial and residential, and parkland. The dataset is depicted in . It covers approximately 10 km but was reduced to 14 city blocks: 8th Avenue to 20th Avenue, and Beddington Boulevard to Bergen Crescent as these contained a sufficient variety of land uses and streetscapes.

Figure 12. Map of test area.

The synthetic image generation methods are tested in a non-controlled environment where outdoor scenes are captured perpendicular to the MMS trajectory at different intervals. This includes 4 description and matching methods: (i) surface generation, (ii) raytracing, (iii) intensity interpolation, and (iv) landmark features.

Nine influencing variables were tested in the outdoor experiments: (i) number of camera edgeboxes; (ii) number of synthetic image edgeboxes; (iii) height of synthetic image; (iv) vertical synthetic camera angle; (v) synthetic image frequency; (vi) camera spatial resolution; (vii) principal distance; (viii) camera angle; and (ix) camera distance. These are broken down into two categories: camera image experiments and synthetic image experiments.

summarizes the experimental variables that are used to test registration between camera images and SCIs. Each experiment was performed while keeping the other variables constant.

Table 1. List of test parameters.

Display Table

schematically shows the synthetic image experiments. Height of the synthetic image is the vertical distance of the synthetic camera from the model surface. Vertical angle is calculated and tested the vertical pose angle of the synthetic image to minimize road capture and maximize scene information. Synthetic image frequency tests how frequently the synthetic images are captured. This is tested at 20 meters, which captures each house and lot in the model and 10 meters, which captures additional images of each house and lot. The number of SCI edgeboxes examines the step size, NMS, and maximum number of boxes.

Figure 13. Synthetic image parameters.

The camera images were captured along the MMS trajectory as shown at . These images show a variety of different scenes that include, high and low-density commercial and residential, parkland, and empty lots. They were taken at multiple locations with different cameras to apply the different variables described above.

Figure 14. Camera image capture locations.

shows variables for the camera image experiments. The camera tested variables associated with the camera image used for matching with the synthetic images. The angle tested camera images taken normal to the building and at 45°. The camera distance captured the same side of the road as the scene (minimum distance), matching the MMS and SCI capture location with the opposite side of the road (maximum distance). The spatial resolution tests the downsampling of the camera image to match the spatial resolution of the synthetic image in the scene. Number of camera edgeboxes is tested in the same way as the SCIs. Principal distance is tested by capturing images using 3 different cameras an Apple iPhoneX_R with a focal length of 4 mm, a Cannon automatic zoom Optix with a focal length of 7 mm, and a cannon EOS Rebel T2i with a focal length of 18 mm.

Figure 15. Camera image parameters.

During the outdoor experiments it was found that there was significant noise that interfered with the feature matching. A median filter was applied to reduce the noise.

Median filtering

Surface noise relating to the MMS laser scanner are observed in the SCI outdoor scenes with roads and trees. They require processing to minimize noise while preserving edges for the feature matching. shows patterns that cause issues with the feature descriptor. However, linear features are important for the feature descriptors. Therefore it is necessary to apply a median filter to preserve the edges (Gonzalez and Woods Citation2008).

Figure 16. Road noise and patterns mis-identified as linear.

The median filter provides a reduction of salt and pepper noise on buildings, roads, and other horizontal surfaces while maintaining edges. The filter decreases the amount of noise on the road. However, it does not resolve the noise on the road and decreases the precision of the feature matching. A simple solution is to tilt the synthetic camera up from horizontal to minimize the visible amount of road in the synthetic images. This has the added benefit of capturing more distinct features of taller buildings in higher-density areas.