
Inferring implicit 3D representations from human figures on pictorial maps

Pages 97-113 | Received 12 Oct 2022, Accepted 06 Jun 2023, Published online: 26 Jun 2023

ABSTRACT

In this work, we present an automated workflow to bring human figures, one of the most frequently appearing entities on pictorial maps, to the third dimension. Our workflow is based on training data and neural networks for single-view 3D reconstruction of real humans from photos. We first let a network consisting of fully connected layers estimate the depth coordinate of 2D pose points. The resulting 3D pose points are input, together with 2D masks of body parts, into a deep implicit surface network to infer 3D signed distance fields (SDFs). By assembling all body parts, we derive 2D depth images and body part masks of the whole figure for different views, which are fed into a fully convolutional network to predict UV images. These UV images and the texture for the given perspective are inserted into a generative network to inpaint the textures for the other views. The textures are enhanced by a cartoonization network and facial details are resynthesized by an autoencoder. Finally, the generated textures are assigned to the inferred body parts in a ray marcher. We test our workflow with 12 pictorial human figures after having validated several network configurations. The created 3D models look generally promising, especially when considering the challenges of silhouette-based 3D recovery and real-time rendering of the implicit SDFs. Further improvement is needed to reduce gaps between the body parts and to add pictorial details to the textures. Overall, the constructed figures may be used for animation and storytelling in digital 3D maps.

Introduction

Technology companies – such as Meta, Microsoft or Sony – invest heavily in the creation of the metaverse these days (Gilbert, Citation2021). In the visions of these companies (e.g. Meta, Citation2021), people equipped with head-mounted displays can immerse themselves in virtual 3D environments for work or leisure activities. The virtual space may represent our physical world, be purely imaginary or be a mixture of both. Creating digital 3D representations of topographic elements and thematic content from the real world by abstraction and generalization would be of interest from a cartographic perspective. Early works have focused on the rendering of sketched 3D buildings (Döllner & Walther, Citation2003) or the modeling of pictorial 3D mountains and sights (Naz, Citation2005). This is opposed to “mirror worlds”, for instance in Google Earth, where the real world is convincingly reflected (Park & Kim, Citation2022) and photo-realistically rendered. Cartographic 3D models and mirror worlds are closely related to “digital twins”, which are virtual representations of real-world entities for mainly simulation purposes (Park & Kim, Citation2022), for example, historical reconstructions (Herold & Hecht, Citation2018) or urban planning (Schrotter & Hürzeler, Citation2020).

Avatars are one key concept of the metaverse (Park & Kim, Citation2022). These virtual 3D models of humans, animals or other personifications embody real humans or computer-controlled entities with whom users can interact. In a cartographic 3D environment, avatars may give background information to a topic, tell personal stories or serve as tour guides. For example, 3D figures could illustrate daily life (e.g. a farmer on a field), act in special events (e.g. a priest at a coronation ceremony), or represent famous persons (e.g. Goethe in Weimar). Past research has examined the animation of 3D objects, such as cars and horses, on the terrain (Evangelidis et al., Citation2018) and the integration of 3D characters into cartographic virtual reality environments (Matthys et al., Citation2021).

The enrichment of cartographic digital twins with human figures would be an analogy to historic maps, where human figures have been inserted for ethnographic or humoristic purposes, amongst others (Child, Citation1956). Historic, but also contemporary pictorial maps would be valuable sources for creating 3D models of the depicted figures. Similar to cartoons (X. Wang & Yu, Citation2020), pictorial figures on maps are usually composed of rather geometrically formed and possibly disproportionate shapes, whose low-detail textures are filled with flat colors and accentuated by sharp black edges. Nevertheless, the manual creation of 3D figures would be labor-intensive and cumbersome, and parametric models may not be detailed enough. Machine learning methods are a promising technique to reconstruct 3D models of humans and objects from photos (Fahim et al., Citation2021). In cartography, researchers have so far focused on detecting topographic elements on maps – such as buildings (Heitzler & Hurni, Citation2020) or water bodies (Wu et al., Citation2022) – or pictorial figures (Schnürer et al., Citation2022) with convolutional neural networks (CNNs), but not yet on transferring them into 3D space.

In this work, we aim to close this gap by applying a series of neural networks to infer 3D representations, encoded as signed distance fields (SDFs), from 2D figures on pictorial maps. Each point in an SDF holds a value denoting the distance to the nearest boundary of an object. In contrast to previous works, we do not recover the SDFs from textures but merely from silhouettes. Compared to meshes, point clouds, or voxels, implicit representations like SDFs have desirable properties such as infinite geometric detail and easy blending capabilities. We use sphere tracing (Hart, Citation1996) to render the figures in real-time, whereas other researchers have mainly used marching cubes (Lorensen & Cline, Citation1987) to polygonize the SDFs. The sphere tracing algorithm is relatively well-established in contrast to newer methods like neural rendering (e.g. Eslami et al., Citation2018; Lassner & Zollhöfer, Citation2021). Our work has great potential for skeletal animation since we construct the figures by compositing 3D body parts according to 3D pose points. We see atlases, education, museums, tourism and games in- and outside the metaverse as primary application areas.

Related work

In recent years, many advances have been made in reconstructing 3D persons and objects from single images using machine learning methods. For example, Omran et al. (Citation2018) apply CNNs to predict parameters of the pose and shape of a 3D person model by taking advantage of segmented body parts as an intermediate representation. Saito et al. (Citation2019) developed the “PIFu” architecture, which produces a 3D occupancy field for the geometry of a person by a multilayer perceptron (MLP) and texture colors by a generative adversarial network (GAN). In “ARCH,” described by Huang et al. (Citation2020), animation capabilities of human models are considered by including a semantic deformation field, amongst others. Lin et al. (Citation2020) encode images of objects in a hypernetwork, which is a network generating weights for another network. In the architecture of Lin et al. (Citation2020), the hypernetwork predicts parameters of implicit functions for an MLP, which converts encoded 3D coordinates into SDF and RGB values and which is updated by a recurrent neural network.

In a subset of single-view 3D reconstruction networks, coarse and detailed geometry are handled separately. In the deep implicit surface network (DISN), proposed by W. Wang et al. (Citation2019), SDFs are predicted from local and global features, which are extracted from feature maps of an image encoder. Branches for coarse and fine-level geometry exist also in “PIFuHD” (Saito et al., Citation2020). The successor of “PIFu” contains two CNNs, three MLPs and a conditional GAN predicting normal maps. Li and Zhang (Citation2021) demonstrated with “D2IM-Net” how to transfer surface details from displacement maps to coarse shapes by one image encoder and two decoder branches as well as including a Laplacian loss function.

While the above networks use 3D training data, another subgroup of object reconstruction networks, also known as off-the-shelf recognition systems, uses only the given 2D images for supervision. Liu et al. (Citation2019) elaborated a ray-based field probing technique to correct errors of predicted 3D implicit surfaces. Lunz et al. (Citation2020) added a proxy neural renderer to a GAN to render 2D images by the traditional non-differentiable rendering pipeline. In “U-CMR,” Goel et al. (Citation2020) optimized possible camera views to render meshes and textures of objects and birds. Ye et al. (Citation2021) render images from semi-implicit volumetric representations and only take approximate instance segmentation masks into account for supervision. In “pixelNeRF” (Yu et al., Citation2021), the volumetric density and color of objects are implicitly encoded along camera rays by a CNN.

A third subcategory of networks additionally outputs distinct parts or part memberships for the 3D reconstruction. In an early work, Agarwal and Triggs (Citation2006) approximated body parts by cuboids using non-linear regression with Support Vector Machines. In a more modern architecture, Niu et al. (Citation2018) extracted object parts as cuboids by sequential CNNs recovering masks and hierarchies. Paschalidou et al. (Citation2020) trained a partition network to split objects into two parts, a geometry network to find shape parameters of geometric primitives, and a structure network that links the partitions to the primitives.

Instead of reconstructing the object out of individual shapes, Varol et al. (Citation2018) relied in “BodyNet” on a voxel-based representation, which is predicted by CNNs together with 2D and 3D poses and 2D body parts. A more fine-grained form of part membership than body part segmentation is provided by UV coordinates, which link texture images to the surface of a 3D model. UV coordinates can be predicted from images and also be used for 3D reconstruction (e.g. Güler et al., Citation2017; Yao et al., Citation2019).

A last group of networks related to our research is concerned with the reconstruction of objects and figures based on silhouettes or sketches. Di and Yu (Citation2017) propose a stacked hierarchical network consisting of 3D CNNs to create objects from black-and-white silhouette images. Delanoy et al. (Citation2018) reconstruct voxelized objects from sketches by an image encoder-decoder CNN and an updater CNN. Brodt and Bessmeltsev (Citation2022) recover 3D poses from sketched characters by training a 2D pose estimation network and applying an optimization algorithm focusing on bone tangents, body part contacts and bone foreshortening. No literature could be found on reconstructing 3D persons or objects from paintings, comics/manga, or maps with machine learning methods.

In this work, we address this shortage by following a bottom-up approach, distantly related to deep local shapes (Chabra et al., Citation2020), to build pictorial human figures from individual body parts. In a top-down approach, by contrast, it may be more challenging to identify 3D body parts after having constructed a holistic 3D model. The enclosure of 3D body parts into bounding boxes may accelerate sphere tracing computations and may require less storage than covering the full 3D space. As the variance of SDF values within the bounding boxes is lower than the variance over the entire body, a more efficient training process and more fine-grained reconstruction results can be expected. For deriving 3D body parts from their 2D silhouettes, we use the DISN architecture (W. Wang et al., Citation2019) due to its simplicity and adaptability. 3D skeletal points, whose depth coordinates are predicted by another minimalistic network (Martinez et al., Citation2017), serve as anchor points for creating the 3D body parts. Textures based on the given view are generated by an inpainting network (Grigorev et al., Citation2019) using UV coordinates predicted by a U-Net (Ronneberger et al., Citation2015). Finally, the textures are enhanced by a cartoonization network (X. Wang & Yu, Citation2020) and an autoencoder (Gondara, Citation2016). Overall, we aim at providing an easily understandable yet effective pipeline for inferring implicit 3D representations of pictorial figures.

Data

Generalized 3D body meshes of a female and male person from the SMPL-X dataset (Pavlakos et al., Citation2019) form the basis for our experiments. In the following, we process the meshes () with a Blender plugin provided for the SMPL-X dataset and automate the steps with the Blender scripting API. We assign about 3200 poses from the AGORA dataset (Patel et al., Citation2021) to the meshes, half to females and the other half to males. Additionally, we vary height and weight parameters (i.e. 1.40 m & 60 kg, 1.80 m & 75 kg, 2.20 m & 90 kg) for the posed body meshes since pictorial humans may have distorted proportions. Next, we determine 3D pose points of the mesh by retrieving the bone heads from the skeleton. In total, we extract 20 pose points (head, neck, thorax, pelvis, left/right [l/r] shoulder, l/r elbow, l/r wrist, l/r hand, l/r hip, l/r knee, l/r ankle, l/r foot) and take the midpoint of two other pose points (l/r eye).

Figure 1. Our data processing workflow.

As a further processing step, we split the 3D body mesh into sub-meshes for different body parts (i.e. head, torso, upper arms, lower arms, hands, upper legs, lower legs, feet). For this, we first iterate through the mesh vertices and derive a body part index from the maximum weight associated with each vertex group. Secondly, we iterate through the mesh triangles and assign them the same body part index as the majority of their vertices. Triangles with the same body part index are then selected and separated from the mesh. To smooth the spikes at the boundaries, we split the two edges of a boundary triangle at their center points and assign the resulting smaller triangle to the other body part. Finally, we calculate center points for the vertices lying at any boundaries and connect them to close any holes that have arisen. The individual body parts are exported in OBJ format and converted to SDFs by the mesh-to-sdf library (Kleineberg et al., Citation2021).
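To make the vertex-group-based assignment more concrete, the following Blender Python sketch outlines the two iteration steps described above. The object name, the assumption of a triangulated mesh and the simplified majority vote are illustrative only and may differ from the actual implementation.

```python
# Sketch of the body part index assignment in Blender's Python API (bpy),
# assuming the SMPL-X mesh carries one vertex group per body part.
import bpy
from collections import Counter

obj = bpy.data.objects["SMPLX-mesh-male"]        # hypothetical object name
mesh = obj.data

# 1) Body part index per vertex = vertex group with the maximum weight.
vertex_part = []
for v in mesh.vertices:
    if v.groups:
        best = max(v.groups, key=lambda g: g.weight)
        vertex_part.append(best.group)           # index into obj.vertex_groups
    else:
        vertex_part.append(-1)

# 2) Body part index per triangle = majority vote of its vertices
#    (the mesh is assumed to be triangulated beforehand).
triangle_part = []
for poly in mesh.polygons:
    votes = Counter(vertex_part[i] for i in poly.vertices)
    triangle_part.append(votes.most_common(1)[0][0])

# Triangles sharing the same index can then be selected, separated into
# sub-meshes, smoothed at the boundaries and exported as OBJ files.
```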

After subdividing the body parts into watertight meshes, we create binary mask images for each body part and categorical mask images for all body parts using vertex colors and custom shaders in the rendering pipeline of Blender. Additionally, images with depth values and UV coordinates are generated using ray casting. By repositioning the orthographic camera, four views (i.e. front, left, back, right) are produced for each of the three types of 2D images.

2D body part masks, depth and UV images as well as 3D pose points and SDFs of individual body parts will serve as training and validation data for our networks. For enhancing the textures, we cartoonized the heads of about 3100 humans (X. Wang & Yu, Citation2020) from the PASCAL-Part dataset (Chen et al., Citation2014). All data items are normalized to equal sizes, but their original size is stored as metadata. As testing data, we annotated 2D skeletons and body part masks of 12 larger figures from historic and contemporary pictorial maps, which mainly originate from a pictorial map classification dataset (Schnürer et al., Citation2021). The selected test figures vary in poses, clothes, genders, skin colors, drawing styles and viewing perspectives.

Methods

3D pose estimation

We use a network proposed by Martinez et al. (Citation2017) to predict depth coordinates for a 2D skeleton. The network consists only of two blocks of fully connected and dropout layers as well as a residual connection. In the original work, humans are captured by four cameras having a perspective projection. We adapt the network by introducing an orthographic projection, where the depth coordinate of the 3D skeletons is omitted to construct the projected 2D coordinates for our training data. Since pictorial figures are mostly hand-drawn in arbitrary projections and an additional network would have to be trained to estimate the parameters of a perspective camera (e.g. focal length, distortion coefficients), we apply only the orthographic projection to our test data. Nevertheless, we report the results of a hypothetical perspective camera for our validation data. Another minor modification of the original network is the addition of five skeleton keypoints – one at each hand and foot, and one at the head. Those will be helpful for the 3D body part inference in the next step.
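The following TensorFlow sketch illustrates such a depth regressor with two residual blocks of fully connected, batch normalization and dropout layers, adapted to 21 pose points under an orthographic projection. The layer width of 1024 units follows the original baseline; all names and the exact input/output layout are assumptions for illustration, not the published implementation.

```python
# Minimal sketch (TensorFlow/Keras) of a Martinez-style depth regressor:
# flattened 2D pose in, one depth value per pose point out.
import tensorflow as tf
from tensorflow.keras import layers

NUM_POINTS = 21

def dense_block(x, units=1024, dropout=0.5):
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Dropout(dropout)(x)

inputs = layers.Input(shape=(NUM_POINTS * 2,))        # flattened 2D pose points
x = dense_block(inputs)
for _ in range(2):                                     # two residual blocks
    skip = x
    x = dense_block(x)
    x = dense_block(x)
    x = layers.Add()([x, skip])
outputs = layers.Dense(NUM_POINTS)(x)                  # one depth value per point

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
# model.fit(poses_2d, depths, batch_size=64, epochs=100)
```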

We train each network configuration for 100 epochs using the hyperparameters given by Martinez et al. (Citation2017) (i.e. batch size = 64, learning rate = 0.001, dropout = 0.5, batch normalization) since the authors already tested those extensively in their work. All experiments in this article are conducted on an NVIDIA GeForce GTX 1080 and our custom architectures are implemented with TensorFlow (Google, Citation2022). We set the height of our figures uniformly to 1.79 m since this is the average height of the two test subjects in the original dataset according to their body meshes. Quantitative results () show that the root mean squared errors (RMSE) of the orthographic projection are only slightly higher compared to the perspective one, whereas the percentages of correct keypoints (PCK150 mm) are slightly lower. An extreme outlier occurs, especially for the orthographic projection, when our validation data is predicted by the network trained on their data, demonstrating that re-training is necessary. After doing so, a similar accuracy to their original training and validation data is reached. The error slightly increases when adding the five pose points (i.e. l/r hand, l/r foot, midpoint between the eyes) to our skeletons. Qualitative results show that poses can be recovered well for pictorial figures (). Only in one of the 12 test figures was an arm positioned in front of the body instead of behind it (Figure 2, lower row).

Figure 2. Estimated 3D poses (green/violet) from 2D poses (red/blue) of pictorial figures from our test data after training the network of Martinez et al. (Citation2017) with our data (i.e. 21 pose points, orthographic projection).

Figure 3. SDF around a hand viewed from the top, back and left, and in oblique perspective. Only positive distance values (blue = small, green = intermediate, red = large distances) are colored.

Figure 4. Inferred body part SDFs (distance < 0) from masks by DISN one-stream with concatenated pose points. Each network producing a 3D body part is trained individually. The mask of the left hand of the second figure is empty since it is hidden in the original image.

Table 1. Average root mean squared errors and percentages of correct keypoints of pose points for estimating depth coordinates of human poses using their (Martinez et al., Citation2017) and our data as well as different projections and number of pose points.

3D body part inference

We generate 3D body parts from their 2D masks by a network called DISN (W. Wang et al., Citation2019). Originally, the network predicts 3D SDFs of objects in 2D images, which are encoded in a series of convolution and down-sampling operations to a final feature map (i.e. global features). Intermediate feature maps of the encoding process are up-sampled and concatenated so that local features can be retrieved point-wise. W. Wang et al. (Citation2019) propose two variations for the decoding part: In DISN one-stream, the encoded 3D query points, global features and local features are combined and decoded by fully connected layers. In DISN two-stream, 3D query points and global features as well as 3D query points and local features are first combined and decoded by fully connected layers in parallel, and the two results are finally added.

We extend the network by additionally concatenating 3D pose points, which have been estimated in the previous stage, to the global and local features together with the 3D query points. We use two 3D pose points as anchor points for each body part (e.g. elbow and wrist for a lower arm), except for the torso where four points are used (l/r hip, l/r shoulder). The pose points embed information about orientation and depth of the body parts. This has the advantage that an initial network proposed by W. Wang et al. (Citation2019), which estimates translation and rotation parameters to transform points from world space into camera space, can be omitted. Since we feed only 64 × 64px masks into the adapted network, we reduced its parameters (Appendix A). We sample the same amount of positive and negative SDF values (i.e. 2000 each) to get a distinct zero-iso-surface. The distance values are sampled randomly within the cubic grid to recover both coarse structures and fine details of body parts, which, however, leads to a trade-off in accuracy at either level of detail.
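As a sketch of how the pose points enter the decoder, the following TensorFlow snippet tiles the anchor pose points per query point, encodes them together with the query points by 1D convolutions, and concatenates the result with local and global image features before decoding one SDF value per point (one-stream variant). Feature and filter sizes only loosely follow Appendix A; all shapes and names are illustrative assumptions.

```python
# Sketch of the DISN one-stream decoder extended with pose points.
import tensorflow as tf
from tensorflow.keras import layers

N_QUERY = 4000          # 2000 positive + 2000 negative samples per body part
N_POSE = 2              # anchor pose points per body part (4 for the torso)

query_pts   = layers.Input(shape=(N_QUERY, 3))
pose_pts    = layers.Input(shape=(N_POSE * 3,))      # flattened anchor points
local_feat  = layers.Input(shape=(N_QUERY, 128))     # sampled point-wise from feature maps
global_feat = layers.Input(shape=(256,))

def point_encoder(x):
    for filters in (16, 64, 128):                    # reduced filter sizes (Appendix A)
        x = layers.Conv1D(filters, 1, activation="relu")(x)
    return x

# Tile the anchor pose points per query point and encode them jointly.
tiled_pose = layers.RepeatVector(N_QUERY)(pose_pts)
encoded = point_encoder(layers.Concatenate(axis=-1)([query_pts, tiled_pose]))

tiled_global = layers.RepeatVector(N_QUERY)(global_feat)
x = layers.Concatenate(axis=-1)([encoded, local_feat, tiled_global])
for filters in (512, 256):
    x = layers.Conv1D(filters, 1, activation="relu")(x)
sdf = layers.Conv1D(1, 1)(x)                          # one signed distance per query point

model = tf.keras.Model([query_pts, pose_pts, local_feat, global_feat], sdf)
```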

We report errors for inferring hands with and without pose points for the one-stream and two-stream architecture (). We selected a hand as an exemplary body part for our measurements () since fingers are the most difficult structure to recover (). Each configuration is trained five times for 200 epochs at a learning rate of 0.0001 and a batch size of four. Results show that errors are similar for the two architectures and decrease with the additional pose points in both cases. To construct all body parts (), we mirror symmetric body parts (e.g. right and left foot).

Table 2. Average root mean squared errors and intersections over unions on our validation data for inferring 3D SDFs of hands from 2D masks.

Table 3. Average root mean squared errors and intersections over unions on our validation data for inferring 3D SDFs of different body parts from 2D masks using DISN two-stream with pose points.

UV coordinates prediction

We predict UV coordinates, ranging from zero to one, from a depth image and body part masks of the figures by designing a network similar to U-Net (Ronneberger et al., Citation2015). The input data is derived from the outputs of the previous two steps. Each 3D body part is positioned at the midpoints of the bones of the 3D skeleton to form the full body. The size of each body part is determined by the height and width of the 2D body part mask as well as the enclosed 3D skeleton points. For the latter, a multi-layer perceptron – consisting of three layers with 20, 40 and 20 neurons – predicts the size from the enclosed 3D skeleton coordinates to compensate for the lack of depth information in the 2D body mask.
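A minimal sketch of this size regressor is given below; only the 20-40-20 layer layout is taken from the text, while the input and output dimensions are assumptions for illustration.

```python
# Sketch of the body part size regressor: a small MLP mapping the enclosed
# 3D skeleton coordinates of a body part to a size value. Dimensions are assumed.
import tensorflow as tf
from tensorflow.keras import layers

size_mlp = tf.keras.Sequential([
    layers.Input(shape=(2 * 3,)),          # e.g. two enclosed 3D skeleton points
    layers.Dense(20, activation="relu"),
    layers.Dense(40, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(1),                        # assumed: one size/depth extent per body part
])
```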

By assembling the inferred 3D body parts (Appendix B), we derive depth images and body part masks for four camera views (i.e. front, left, back, right). The additional front view will be helpful to generate textures for overlapping parts later on. Since the projection of the drawn figures may vary, we simply assume an orthographic projection. We feed depth images and body part masks, resized to 256 × 256px, in batches of eight into our U-Net-like network (). Our network consists of 1- and 2-strided convolutions, which are used for down- and up-sampling, as well as skip connections. The network is trained for 50 epochs at a learning rate of 0.001 and for another 50 epochs at a learning rate of 0.0001 with the Adam optimizer. Since the loss converged at similar values, we report the results of a single run. For our validation data, the additional body part masks, which are multiplied with the depth image, led to a lower error than training with depth images only (). Smooth UV images were produced for our test data ().
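The body mask channel mentioned in the caption of Figure 5 can, for example, enter the loss as sketched below. This is only a plausible formulation under the assumption that UV errors are evaluated inside the body region; the exact loss used in our experiments is not spelled out here.

```python
# Hedged sketch of a masked loss for the UV network: UV errors are only
# evaluated inside the body, and the third output channel is supervised
# as a body mask. Weighting and formulation are assumptions.
import tensorflow as tf

def uv_loss(y_true, y_pred):
    uv_true, mask_true = y_true[..., :2], y_true[..., 2:3]
    uv_pred, mask_pred = y_pred[..., :2], y_pred[..., 2:3]
    uv_term = tf.reduce_sum(tf.abs(uv_true - uv_pred) * mask_true) / (
        tf.reduce_sum(mask_true) + 1e-6)
    mask_term = tf.reduce_mean(tf.square(mask_true - mask_pred))
    return uv_term + mask_term
```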

Figure 5. Network architecture for predicting UV coordinates from a depth map. ci = 1 for inputting a depth image; ci = 14 for inputting a depth image multiplied by body part masks; co = 3 for outputting the two UV coordinate channels and a body mask channel used in the loss calculation. Numbers below each layer denote the channel dimension. Figure created by Net2Vis (Bäuerle et al., Citation2021).

Figure 6. Predicted UV coordinates of pictorial human figures from a depth image and body part masks by our fully convolutional network. The body part masks are one-hot encoded in 14 channels for the neural network. The UV coordinates are stored in two channels and have been mapped to a squared color circle (Figure 8) for visualization purposes.

Table 4. Mean absolute errors (MAE) and root mean squared errors (RMSE) for predicting UV coordinates of pictorial human figures from different input data.

Texture inpainting and enhancement

We create 256 × 256px textures when viewing the figure from behind, the left and the right by a generative network (Grigorev et al., Citation2019). Due to minor mismatches of the shape between the predicted UV coordinates in the previous step and the given texture, we input the intersection of both into the network. We add a gray rectangle to the background since the network was trained on human models standing in front of a white wall, which appears grayish due to lighting and shadows. As a post-processing step, we crop the output images to the given body masks.
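A schematic sketch of this pre- and post-processing is given below; variable names, the background intensity and the binary mask representation are assumptions for illustration.

```python
# Sketch of the pre- and post-processing around the inpainting network:
# intersect the given texture with the predicted UV mask, place the figure
# on a gray rectangle, and crop the generated views to the body masks.
import numpy as np

GRAY = 180  # assumed background intensity mimicking a gray-ish wall

def prepare_input(texture_rgb, uv_mask, body_mask):
    valid = (uv_mask & body_mask).astype(bool)       # intersection of both shapes
    canvas = np.full_like(texture_rgb, GRAY)         # gray background rectangle
    canvas[valid] = texture_rgb[valid]
    return canvas

def crop_output(generated_rgb, target_body_mask):
    out = generated_rgb.copy()
    out[~target_body_mask.astype(bool)] = 0          # crop to the given body mask
    return out
```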

Since the authors did not publish the code for training their generative network, we could only use a pre-trained version and thus report only qualitative results (). We apply texture maps, whose color values were retrieved from the inpainted textures using the predicted UV images (), to the body parts to render the final images (Appendix C). In general, the colors of clothes, skin and hair were mostly generated plausibly. Artifacts appear for uncommon poses (e.g. sitting, football playing) and at the shoes/feet. Coarse texture structures could be created; however, pictorial black strokes representing folds in the textures were not transferred from the source image. To take these pictorial characteristics into account, a cartoonization network (X. Wang & Yu, Citation2020) can be applied to the inpainted textures.

Figure 7. Generated textures (already cropped to the input mask) of two pictorial figures from a given image as well as source and target UV maps by the inpainting network. The inpainted image is cartoonized to mimic a more pictorial style.

Figure 8. UV coordinates in the inpainted texture (left) and in the color wheel (right) at the position of the left eye of a pictorial figure (Figure 7, upper row).

Since some of the inpainted faces clearly originate from real humans, we train an autoencoder to map them to a more pictorial style and to possibly recover missing facial details. We establish a shallow branch (Gondara, Citation2016) for denoising the hue and saturation channels, and a deeper branch including a bottleneck for learning structures and shadings in the value channel (). Colors (i.e. hue and saturation) and shadings (i.e. value) are weighted equally in the loss function. We augment the realistic input images by blurring and oil painting, by adding noise (i.e. Gaussian, salt-and-pepper) and by varying the hue. The target images have been converted from the input images by the above cartoonization network. The autoencoder is trained for 200 epochs at a batch size of 32 and a learning rate of 0.001 with the Adam optimizer. During training, head images with varying looks can be obtained (). Convincingly painted results, however, are rather the exception (roughly 5% of the generated images).
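The following sketch illustrates such augmentations on HSV head images with OpenCV and NumPy; the oil painting filter is omitted, and all probabilities and magnitudes are assumptions rather than the values actually used.

```python
# Sketch of training-time augmentations for the head autoencoder:
# blur, Gaussian noise, salt-and-pepper noise and hue shifts on HSV images.
import cv2
import numpy as np

def augment(hsv_img, rng=np.random.default_rng()):
    img = hsv_img.astype(np.float32)
    if rng.random() < 0.5:                                    # blur
        img = cv2.GaussianBlur(img, (5, 5), 0)
    if rng.random() < 0.5:                                    # Gaussian noise
        img += rng.normal(0, 5, img.shape)
    if rng.random() < 0.5:                                    # salt and pepper
        mask = rng.random(img.shape[:2])
        img[mask < 0.01] = 0
        img[mask > 0.99] = 255
    img[..., 0] = (img[..., 0] + rng.uniform(-10, 10)) % 180  # vary hue (OpenCV range 0-179)
    return np.clip(img, 0, 255).astype(np.uint8)
```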

Figure 9. Network architecture for enhancing textures of inpainted heads. We input HSV images together with predicted UV coordinates to account for the different orientations of the heads. This branch outputs the value of the color, whereas hue and saturation are outputted by another branch. Figure created by Net2Vis (Bäuerle et al., Citation2021).

Figure 10. Denoised and resynthesized textures of inpainted heads from different views (i.e. right, front, left, back).

Real-time rendering

The inferred figures can be rendered in real-time using the sphere tracing algorithm (Hart, Citation1996). A 256 × 256px image is rendered at 25 frames per second (FPS) and a 512 × 512px image at 11 FPS on an NVIDIA GeForce GTX 1080, even when enabling computationally intensive features (i.e. trilinear interpolation, normal calculation, texture blending). Optionally, we can enable a perspective projection by sending rays from a point location; however, the differences to the orthographic projection are only marginal. Also, diffuse lighting can be added by calculating the angle between a virtual light source and the surface normals.

To integrate the figures into existing 3D map environments such as virtual globes, which are mainly based on the traditional rendering pipeline, they can be rendered with a transparent background on billboards, while the sphere tracing algorithm is implemented in the fragment shader (Schnürer et al., Citation2017). Another option is to export a point cloud by returning the 3D coordinates of surface intersections in the ray marcher. The point cloud can be further turned into a triangle mesh by the ball-pivoting algorithm (Bernardini et al., Citation1999). As an example, we illustrate the outcome of the latter two conversion steps by placing the 3D figures on the original map in a 3D modeling software () and a virtual globe toolkit ().
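As a sketch of this second export path, the surface points returned by the ray marcher could be meshed with the ball-pivoting algorithm, for example via Open3D; the article does not name a specific library, and the pivoting radii below are illustrative.

```python
# Hedged sketch: surface points from the ray marcher are meshed with the
# ball-pivoting algorithm (Bernardini et al., 1999) using Open3D.
import numpy as np
import open3d as o3d

def points_to_mesh(surface_points, radii=(0.01, 0.02, 0.04)):
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(surface_points))
    pcd.estimate_normals()                      # normals are required for ball pivoting
    mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_ball_pivoting(
        pcd, o3d.utility.DoubleVector(list(radii)))
    return mesh

# Example: o3d.io.write_triangle_mesh("figure.obj", points_to_mesh(points))
```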

Figure 11. Remeshed pictorial 3D figure placed on the original map (Owen, Citation2015) in Blender (Blender Foundation, Citation2022).

Figure 12. Remeshed pictorial 3D figure in the virtual globe CesiumJS (Cesium GS, Citation2022). The figure is placed in front of the FIFA headquarters, similar to the original map (Flynn, Citationn.d.). 2D base map, 3D buildings and trees originate from the GeoAdmin API (swisstopo, Citation2022).

Use case

We see a large potential in adding the constructed figures as protagonists or secondary characters to story maps. Particularly, 3D maps convey the topography vividly and allow channeling of the depicted topic by occlusions (e.g. mountains, trees, fog). Instead of presenting multimedia content in overlays (e.g. Matt, Citation2019), we suggest placing essential figures or other objects directly and in a consistent style on the map to support the story.

In the following, we outline how a story map including characters, animals and additional 3D objects may be designed for children (). We take Charles Darwin’s (Citation1839) round-the-world journey as an example, specifically a stop in Port St. Julian, Patagonia, in January 1834. After a short introduction to this setting, the user can visit different places in any order. We provide different incentives to follow the story and to interact with characters and the environment (). The story ends after having explored all places.

Figure 13. Sketched story map for children about Charles Darwin in Patagonia (sources: Appendix D). The start location is in the middle and possible places to explore are at the corners. At each place, a puzzle needs to be solved after being provided with some contextual information (top right graphic of each scene). The reconstructed figures will be part of the story.

Table 5. Application of storytelling concepts (Thöny et al., Citation2018) to our sketched story map.

Pedagogically, our proposed story map may improve map and visualization literacy (e.g. route planning, interpretation of climate diagrams), and may help to correct misconceptions (e.g. Patagonians perceived as giants). It would connect interdisciplinary fields, such as biology (e.g. penguins), history (e.g. early explorers), and ethnology (e.g. indigenous people). Although some gamification elements are included, the focus is intended to remain on the scientific aspects. Our presented pipeline of different machine learning models will help map creators in constructing and potentially animating 3D human figures based on a given 2D template. With some artistic skills, the 2D figures may be drawn by the map creator instead of reusing the works of others. The sketched story map may be experienced on desktop and tablet computers, but also with head-mounted displays. Users of the latter devices perceive the 3D environment in virtual reality, which allows introducing metaverse concepts such as avatars. For instance, the quizzes may be solved together with other students, or a teacher may give hints or explanations. Their avatars may also be reconstructed from 2D figures, and the users' movements in reality may be mirrored, possibly by additional tracking systems (e.g. camera, gesture controllers). In the future, we anticipate an increase in these kinds of virtual excursions and learning activities for geography classes since they are more affordable than visiting the location in the real world and more intriguing than reading a textbook or watching a video.

Discussion

Our work is a continuation of another computer vision experiment for pictorial maps (Schnürer et al., Citation2022), where silhouettes of figures, their body parts and pose points have been segmented by two neural networks. Therefore, it can be assumed that these data items can be extracted automatically. Nevertheless, we annotated our test data manually to have a solid foundation for the current experiment. For our training data, we varied the sizes and weights of body meshes of real humans, which had a positive impact on reconstructing a test figure with thin long limbs and a small head. Considering clothes would definitely improve the reconstruction quality; for instance, our current pipeline does not support hats. However, publicly available training datasets of clothed humans did not contain 3D body meshes or body part segmentations. Moreover, some clothes (e.g. long skirts) would behave differently from the underlying body parts, but we aimed primarily at skeletal animation as a follow-up use case.

Besides selecting and enhancing the training data, the structure of the different networks may be modified. We showed that increasing the number of pose points from 16 to 21 led to a 1 cm higher error for the 3D pose estimation network, which is still tolerable and benefits the reconstruction of hands, feet and the head. We used captured poses of mainly standing persons as training data; alternatively, bones of a skeleton may be oriented according to a range of possible joint angles (Soucie et al., Citation2011) to better handle uncommon poses. We did not consider including pose points of fingers, which would have been available in the SMPL-X training dataset, since the hands of pictorial figures are usually small, their fingers are not easily distinguishable and sometimes number fewer than five. Instead of estimating the 3D coordinates directly, relative rotation angles of joints or limbs may be predicted; however, this would require more complex and constrained network designs, as noted by Martinez et al. (Citation2017). Estimating more camera parameters would probably increase the prediction accuracy, yet we achieved satisfactory results by simply assuming an orthographic projection, similar to other works (e.g. Huang et al., Citation2020).

While no fine-tuning was necessary for the pose estimation network, we carefully configured DISN to infer body parts. We also examined normalizing the data (e.g. sqrt/log transform), using 3D deconvolutions (instead of query points), sampling points near the surface (instead of equally spaced grid points), predicting a top and side view (in addition to the front view), and outputting a binary field (in addition to the SDF); but those did not significantly improve the reconstruction quality. Generally, silhouette-based 3D reconstruction is a more difficult task than texture-based recovery since textures may contain depth information, thus the qualitative results are only partly comparable to those of the original network. The addition of pose points helped to recover partly or totally hidden body parts, though hands that are not visible are approximated by ellipsoids, which is the average shape. Other issues concern the rather realistic forms of the body parts and the gaps between them ().

Figure 14. Failure cases of inferred body parts.

Predicting the UV coordinates from the depth map was a straightforward task by using a fully convolutional network similar to U-Net (Ronneberger et al., Citation2015). Inputting only depth maps would already have been sufficient to get a smooth image for the validation data (i.e. real humans). However, on our test data (i.e. pictorial humans), where the depth map is derived from the inferred 3D body parts, stains appeared on the UV image, which could be remedied by additionally feeding the masks of the body parts into the U-Net. We refrained from predicting UV coordinates or even texture colors together with the body parts because “color prediction is a non-trivial task as RGB colors are defined only on the surface while the [signed distance] field is defined over the entire 3D space” (Saito et al., Citation2019, p. 2308).

Although the inpainting network is biased toward generating textures of real humans, it produced adequate results on a coarse level for pictorial humans. To better match the texture with the input, symmetries could be exploited (Zhou et al., Citation2021). Furthermore, texture maps probably need to be deformed (Shu et al., Citation2018) in case the body part shapes deviate strongly. Since texture mappings for occluded body parts may be incorrect when viewing the figure from a perspective other than the four given ones, one may increase the number of textures to prevent these artifacts. Due to the lack of adequate training data, we cartoonized photos of humans to train our autoencoder; however, the generated textures are still quite realistic. Our autoencoder is able to recover some facial details, yet a more expressive latent space may be learned by a variational autoencoder. Overall, an end-to-end network would be desirable instead of our four consecutive networks to benefit from synergy effects.

To realize our exemplary use case, existing story map editors (e.g. ESRI, Citation2022) would need to be extended to support different storylines, game templates and textual options. The availability of a 3D model store would facilitate the reusability and the copyright management of the reconstructed figures and other objects. Positioning and viewing perspective, animation parameters and triggers, interactions with the map content and other characters, and possibly cartographic functions may be defined for the characters in the story map editor. Cognitive experiments would need to clarify optimal parameters (e.g. animation speed) in detail and whether including figures in story maps generally offers added value to the user.

Summary and future work

In this work, we generated implicit 3D representations of human figures appearing on pictorial maps using machine learning methods (). We showed that plausible poses and body parts can be inferred when training the networks with data of real humans. However, we see a need for improvement in refining shapes and textures, in supporting hair and clothes as well as in simplifying the workflow. Our automated workflow takes only a few minutes to complete, whereas manual sculpting and texturing of the 3D models would take several hours.

Figure 15. All pictorial human figures on maps from our test dataset and our inferred 3D models from different views.

Besides human figures, other pictorial entities like animals, sea monsters or ships may also be transferred to the third dimension. Moreover, 3D buildings or distinct landscape features may be derived from historic maps in oblique view (e.g. Murer, Citation1576). Cartography would be concerned with abstract and georeferenced representations, while the reconstruction of detailed representations would rather fall into the domains of other fields (e.g. archeology, anthropology, evolutionary biology).

The automatic 3D reconstruction of buildings and landscapes will accelerate the development of “time travel” applications, where users can see the past and future of a geographic area (e.g. Stadt Zürich, Citation2022). When additionally viewing a 3D city in virtual reality, users will get a more vivid impression of its structure (e.g. the narrowness of alleys). As envisioned in the concept of the metaverse, the virtual space may be populated with avatars, where our animation-ready 3D pictorial human figures come into play. We would propose the term “cartoverse” for this kind of cartographic metaverse. Several challenges would need to be addressed in the future to provide a fully functional cartoverse, for instance, how to avoid motion sickness during spatial navigation or for different perspectives, how to interact with 3D objects via speech or gesture recognition, or how to present thematic information in addition to the topographic elements.

Acknowledgments

We would like to thank Jost Schmid-Lanter and his colleagues from the map department of the Zurich Central Library for providing images of figures from a digital replica of the St Gallen Globe.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Training and test datasets, models, and source code are available at: http://narrat3d.ethz.ch/

Additional information

Funding

This work was supported by a UKRI Future Leaders Fellowship [grant number G104084].

References

  • Agarwal, A., & Triggs, B. (2006). Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 44–58. https://doi.org/10.1109/TPAMI.2006.21
  • Bäuerle, A., Van Onzenoodt, C., & Ropinski, T. (2021). Net2vis–a visual grammar for automatically generating publication-tailored cnn architecture visualizations. IEEE Transactions on Visualization and Computer Graphics, 27(6), 2980–2991. https://doi.org/10.1109/TVCG.2021.3057483
  • Bernardini, F., Mittleman, J., Rushmeier, H., Silva, C., & Taubin, G. (1999). The ball-pivoting algorithm for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 5(4), 349–359. https://doi.org/10.1109/2945.817351
  • Blender Foundation. (2022). Blender. https://www.blender.org/
  • Brodt, K., & Bessmeltsev, M. (2022). Sketch2pose: Estimating a 3D character pose from a bitmap sketch. ACM Transactions on Graphics, 41(4), 1–15. https://doi.org/10.1145/3528223.3530106
  • Cesium GS. (2022). CesiumJS. Cesium. https://cesium.com/platform/cesiumjs/
  • Chabra, R., Lenssen, J. E., Ilg, E., Schmidt, T., Straub, J., Lovegrove, S., & Newcombe, R. (2020). Deep local shapes: learning local SDF priors for detailed 3D reconstruction. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 (pp. 608–625). Springer International Publishing. https://doi.org/10.1007/978-3-030-58526-6_36
  • Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1971–1978). https://doi.org/10.1109/CVPR.2014.254
  • Child, H. (1956). Decorative maps (1st ed.). Studio Publications.
  • Darwin, C. (1839). The voyage of the beagle. Project Gutenberg EBook. https://www.gutenberg.org/files/944/944-h/944-h.htm
  • Delanoy, J., Aubry, M., Isola, P., Efros, A. A., & Bousseau, A. (2018). 3D sketching using multi-view deep volumetric prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(1), 21:1–22. https://doi.org/10.1145/3203197
  • Di, X., & Yu, P. (2017). 3D reconstruction of simple objects from a single view silhouette image. ArXiv:1701.04752 [Cs]. http://arxiv.org/abs/1701.04752
  • Döllner, J., & Walther, M. (2003). Real-time expressive rendering of city models. Proceedings on Seventh International Conference on Information Visualization, IV (pp. 245–250). https://doi.org/10.5555/938981.939605
  • Eslami, S. M. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., Reichert, D. P., Buesing, L., Weber, T., Vinyals, O., Rosenbaum, D., Rabinowitz, N., King, H., Hillier, C., Botvinick, M. … Hassabis, D. (2018). Neural scene representation and rendering. Science, 360(6394), 1204–1210. https://doi.org/10.1126/science.aar6170
  • ESRI. (2022). ArcGIS StoryMaps. https://storymaps.arcgis.com
  • Evangelidis, K., Papadopoulos, T., Papatheodorou, K., Mastorokostas, P., & Hilas, C. (2018). 3D geospatial visualizations: Animation and motion effects on spatial objects. Computers & Geosciences, 111, 200–212. https://doi.org/10.1016/j.cageo.2017.11.007
  • Fahim, G., Amin, K., & Zarif, S. (2021). Single-view 3D reconstruction: A survey of deep learning methods. Computers & Graphics, 94, 164–190. https://doi.org/10.1016/j.cag.2020.12.004
  • Flynn, D. (n.d.). Zurich City Map. Pinterest. https://www.pinterest.co.uk/pin/469922542351340326/
  • Gilbert, B. (2021). Zuckerberg is most worried about Apple, Google, Microsoft, Sony and others as the main competition for the “metaverse.” Business Insider. https://www.businessinsider.com/facebook-says-apple-sony-microsoft-google-are-metaverse-competition-2021-11
  • Goel, S., Kanazawa, A., & Malik, J. (2020). Shape and viewpoint without keypoints. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer vision – ECCV 2020 (pp. 88–104). Springer International Publishing. https://doi.org/10.1007/978-3-030-58555-6_6
  • Gondara, L. (2016). Medical image denoising using convolutional denoising autoencoders. IEEE 16th International Conference on Data Mining Workshops (pp. 241–246). https://doi.org/10.1109/ICDMW.2016.0041
  • Google. (2022). TensorFlow. https://www.tensorflow.org/
  • Grigorev, A., Sevastopolsky, A., Vakhitov, A., & Lempitsky, V. (2019). Coordinate-based texture inpainting for pose-guided human image generation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12127–12136). https://doi.org/10.1109/CVPR.2019.01241
  • Güler, R. A., Trigeorgis, G., Antonakos, E., Snape, P., Zafeiriou, S., & Kokkinos, I. (2017). DenseReg: Fully convolutional dense shape regression in-the-wild. IEEE Conference on Computer Vision and Pattern Recognition (pp. 2614–2623). https://doi.org/10.1109/CVPR.2017.280
  • Hart, J. C. (1996). Sphere tracing: A geometric method for the antialiased ray tracing of implicit surfaces. The Visual Computer, 12(10), 527–545. https://doi.org/10.1007/s003710050084
  • Heitzler, M., & Hurni, L. (2020). Cartographic reconstruction of building footprints from historical maps: A study on the Swiss Siegfried map. Transactions in GIS, 24(2), 442–461. https://doi.org/10.1111/tgis.12610
  • Herold, H., & Hecht, R. (2018). 3D reconstruction of urban history based on old maps. In S. Münster, K. Friedrichs, F. Niebling, & A. Seidel-Grzesińska (Eds.), Digital research and education in architectural heritage (pp. 63–79). Springer International Publishing. https://doi.org/10.1007/978-3-319-76992-9_5
  • Huang, Z., Xu, Y., Lassner, C., Li, H., & Tung, T. (2020). ARCH: Animatable reconstruction of clothed humans. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3090–3099). https://doi.org/10.1109/CVPR42600.2020.00316
  • Kleineberg, M., Sundt, P. B., & Davies, T. (2021). Mesh-to-sdf. https://github.com/marian42/mesh_to_sdf
  • Lassner, C., & Zollhöfer, M. (2021). Pulsar: Efficient sphere-based neural rendering. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1440–1449). https://doi.org/10.1109/CVPR46437.2021.00149
  • Lin, C.-H., Wang, C., & Lucey, S. (2020). SDF-SRN: Learning signed distance 3D object reconstruction from static images. Proceedings of the 34th International Conference on Neural Information Processing Systems (pp. 11453–11464). https://doi.org/10.5555/3495724.3496685
  • Liu, S., Saito, S., Chen, W., & Li, H. (2019). Learning to infer implicit surfaces without 3D supervision. Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 8295–8306). https://doi.org/10.5555/3454287.3455032
  • Li, M., & Zhang, H. (2021). D2IM-Net: Learning detail disentangled implicit fields from single images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10241–10250). https://doi.org/10.1109/CVPR46437.2021.01011
  • Lorensen, W. E., & Cline, H. E. (1987). Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4), 163–169. https://doi.org/10.1145/37402.37422
  • Lunz, S., Li, Y., Fitzgibbon, A., & Kushman, N. (2020). Inverse graphics GAN: Learning to generate 3D shapes from unstructured 2D data. ArXiv: 2002.12674 [Cs]. http://arxiv.org/abs/2002.12674
  • Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3d human pose estimation. IEEE International Conference on Computer Vision (pp. 2659–2668). https://doi.org/10.1109/ICCV.2017.288
  • Matt, A. (2019). Charles Darwin and the voyage of the HMS beagle. https://scout.wisc.edu/archives/r50646/charles_darwin_and_the_voyage_of_the_hms_beagle
  • Matthys, M., De Cock, L., Vermaut, J., Van de Weghe, N., & De Maeyer, P. (2021). An “Animated spatial time machine” in co-creation: reconstructing history using gamification integrated into 3D city modelling, 4D web and transmedia storytelling. ISPRS International Journal of Geo-Information, 10(7), Article 460. https://doi.org/10.3390/ijgi10070460
  • Meta. (2021). The Metaverse and How We’ll Build It Together. Connect 2021. https://www.youtube.com/watch?v=Uvufun6xer8
  • Murer, J. (1576). Murerplan. https://en.wikipedia.org/wiki/Murerplan
  • Naz, A. (2005). 3D interactive pictorial maps [Master’s thesis]. Texas A&M University. http://oaktrust.library.tamu.edu/handle/1969.1/1571
  • Niu, C., Li, J., & Xu, K. (2018). Im2struct: Recovering 3D shape structure from a single RGB image. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4521–4529). https://doi.org/10.1109/CVPR.2018.00475
  • Omran, M., Lassner, C., Pons-Moll, G., Gehler, P., & Schiele, B. (2018). Neural body fitting: Unifying deep learning and model based human pose and shape estimation. International Conference on 3D Vision (pp. 484–494). https://doi.org/10.1109/3DV.2018.00062
  • Owen, N. (2015, October). Queensland: National Geographic’s “Traveller” Mag. https://www.behance.net/gallery/30454283/Queensland-National-Geographics-Traveller-Mag
  • Park, S.-M., & Kim, Y.-G. (2022). A metaverse: Taxonomy, components, applications, and open challenges. IEEE Access, 10, 4209–4251. https://doi.org/10.1109/ACCESS.2021.3140175
  • Paschalidou, D., Van Gool, L., & Geiger, A. (2020). Learning unsupervised hierarchical part decomposition of 3D objects from a single RGB image. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1057–1067). https://doi.org/10.1109/CVPR42600.2020.00114
  • Patel, P., Huang, C.-H. P., Tesch, J., Hoffmann, D. T., Tripathi, S., & Black, M. J. (2021). AGORA: Avatars in geography optimized for regression analysis. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13463–13473). https://doi.org/10.1109/CVPR46437.2021.01326
  • Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A., Tzionas, D., & Black, M. J. (2019). Expressive body capture: 3D hands, face, and body from a single image. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10967–10977). https://doi.org/10.1109/CVPR.2019.01123
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, & A. F. Frangi (Eds.), Medical image computing and computer-assisted intervention – MICCAI 2015 (pp. 234–241). Springer International Publishing.
  • Saito, S., Huang, Z., Natsume, R., Morishima, S., Li, H., & Kanazawa, A. (2019). Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. 2019 IEEE/CVF International Conference on Computer Vision (pp. 2304–2314). https://doi.org/10.1109/ICCV.2019.00239
  • Saito, S., Simon, T., Saragih, J., & Joo, H. (2020). PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 81–90). https://doi.org/10.1109/CVPR42600.2020.00016
  • Schnürer, R., Eichenberger, R., Sieber, R., & Hurni, L. (2017). Animations for 3D solid charts in a virtual globe – techniques, use cases, and implementation. 28th International Cartographic Conference.
  • Schnürer, R., Öztireli, A. C., Heitzler, M., Sieber, R., & Hurni, L. (2022). Instance segmentation, body part parsing, and pose estimation of human figures in pictorial maps. International Journal of Cartography, 8(3), 291–307. https://doi.org/10.1080/23729333.2021.1949087
  • Schnürer, R., Sieber, R., Schmid Lanter, J., Öztireli, A. C., & Hurni, L. (2021). Detection of pictorial map objects with convolutional neural networks. The Cartographic Journal, 58(1), 50–68. https://doi.org/10.1080/00087041.2020.1738112
  • Schrotter, G., & Hürzeler, C. (2020). The digital twin of the city of Zurich for urban planning. PFG – Journal of Photogrammetry, Remote Sensing and Geoinformation Science, 88(1), 99–112. https://doi.org/10.1007/s41064-020-00092-2
  • Scotese, C. (2016). Plate tectonics, paleogeography, and ice ages (Modern World—540Ma). YouTube Animation. https://youtu.be/g_iEWvtKcuQ
  • Shu, Z., Sahasrabudhe, M., Alp Güler, R., Samaras, D., Paragios, N., & Kokkinos, I. (2018). Deforming autoencoders: Unsupervised disentangling of shape and appearance. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer vision – ECCV 2018 (pp. 664–680). Springer International Publishing.
  • Soucie, J. M., Wang, C., Forsyth, A., Funk, S., Denny, M., Roach, K. E., Boone, D., & The Hemophilia Treatment Center Network. (2011). Range of motion measurements: Reference values and a database for comparison studies. Haemophilia, 17(3), 500–507. https://doi.org/10.1111/j.1365-2516.2010.02399.x
  • Stadt Zürich. (2022). Zürich 4D. GeoAdmin API. https://api3.geo.admin.ch/
  • swisstopo. (2022). GeoAdmin API. https://api3.geo.admin.ch/
  • Thöny, M., Schnürer, R., Sieber, R., Hurni, L., & Pajarola, R. (2018). Storytelling in interactive 3D geographic visualization systems. ISPRS International Journal of Geo-Information, 7(3), Article 123. https://doi.org/10.3390/ijgi7030123
  • Varol, G., Ceylan, D., Russell, B., Yang, J., Yumer, E., Laptev, I., & Schmid, C. (2018). BodyNet: Volumetric inference of 3D human body shapes. In V. Ferrari, M. Hebert, C. Sminchisescu, & Y. Weiss (Eds.), Computer vision – ECCV 2018 (pp. 20–38). Springer International Publishing. https://doi.org/10.1007/978-3-030-01234-2_2
  • Wang, W., Xu, Q., Ceylan, D., Mech, R., & Neumann, U. (2019). DISN: Deep implicit surface network for high-quality single-view 3D reconstruction. Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 492–502). https://doi.org/10.5555/3454287.3454332
  • Wang, X., & Yu, J. (2020). Learning to cartoonize using white-box cartoon representations. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8087–8096). https://doi.org/10.1109/CVPR42600.2020.00811
  • Wu, S., Heitzler, M., & Hurni, L. (2022). Leveraging uncertainty estimation and spatial pyramid pooling for extracting hydrological features from scanned historical topographic maps. GIScience & Remote Sensing, 59(1), 200–214. https://doi.org/10.1080/15481603.2021.2023840
  • Yao, P., Fang, Z., Wu, F., Feng, Y., & Li, J. (2019). DenseBody: Directly regressing dense 3D human pose and shape from a single color image. ArXiv:1903.10153 [Cs]. http://arxiv.org/abs/1903.10153
  • Ye, Y., Tulsiani, S., & Gupta, A. (2021). Shelf-supervised mesh prediction in the wild. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8839–8848). https://doi.org/10.1109/CVPR46437.2021.00873
  • Yu, A., Ye, V., Tancik, M., & Kanazawa, A. (2021). pixelNerf: Neural radiance fields from one or few images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4576–4585). https://doi.org/10.1109/CVPR46437.2021.00455
  • Zhou, Y., Liu, S., & Ma, Y. (2021). NeRD: Neural 3D reflection symmetry detector. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15935–15944). https://doi.org/10.1109/CVPR46437.2021.01568

Appendices Appendix A.

Adaptations to the Deep Implicit Surface Network (DISN) for 3D body part inference

Query points and additional pose points, which are inputted into the network, are encoded by three 1D convolutions with four times smaller filter sizes (i.e. 16, 64, 128). In the image encoder, we reduced the number of filters for convolutions by four (i.e. 8, 16, 32, 64, 128) and use only one convolution layer after a strided convolution layer at each level. Similarly, we lowered the size of the global features by four (i.e. 256). The filter sizes for 1D convolutions in the decoder part remained unchanged (i.e. 512, 256, 1).
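The reduced image encoder can be sketched as follows in TensorFlow; kernel sizes, activations and the way the global features are pooled are assumptions, and only the filter sizes (8, 16, 32, 64, 128) and the global feature size (256) are taken from the description above.

```python
# Illustrative sketch of the reduced DISN image encoder: per level one strided
# convolution followed by a single convolution; intermediate feature maps are
# kept for the point-wise local feature sampling.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(64, 64, 1))        # 64 x 64 px body part mask
x = inputs
feature_maps = []
for filters in (8, 16, 32, 64, 128):
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    feature_maps.append(x)
global_features = layers.Dense(256, activation="relu")(layers.Flatten()(x))

encoder = tf.keras.Model(inputs, [global_features] + feature_maps)
```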

Appendix B.

Derivation of depth maps and body part masks for UV coordinates prediction

We implemented a custom ray marcher in Python using the library “Numba,” which enables executing parallel operations on the GPU, for rendering the inferred 3D objects represented by SDFs. An SDF value denotes the shortest distance (d) to object surfaces, while the sign indicates whether a point lies inside (d < 0), on (d = 0) or outside (d > 0) the object. We trilinearly interpolate the eight closest grid points of the evenly spaced SDF to get smoother surfaces. The SDFs of the different body parts can be combined by the union operation (i.e. min (d1, d2)). Afterward, a depth image can be obtained by iteratively cumulating the covered distances from the virtual camera to the body surface during each ray marching step. As an enhancement, we smooth the depth map with a 5 × 5 averaging filter. A body part mask is produced by returning the index of the first hit body part along the ray.
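The core of this ray marcher can be sketched as follows in plain Python/NumPy; the actual implementation runs in parallel on the GPU with Numba and interpolates the SDF grids trilinearly, which is abstracted away here behind per-part distance functions.

```python
# Hedged sketch of the ray marching core: body part SDFs are combined with the
# union operation min(d1, d2) and the covered distance is accumulated along
# the ray until a surface is hit (sphere tracing).
import numpy as np

def union_sdf(point, part_sdfs):
    # part_sdfs: list of callables returning the (interpolated) distance at a 3D point
    return min(sdf(point) for sdf in part_sdfs)

def march_ray(origin, direction, part_sdfs, max_steps=128, eps=1e-3, far=5.0):
    depth = 0.0
    for _ in range(max_steps):
        d = union_sdf(origin + depth * direction, part_sdfs)
        if d < eps:              # surface hit: return the accumulated depth
            return depth
        depth += d               # sphere tracing step by the safe distance
        if depth > far:
            break
    return np.inf                # background / no intersection
```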

Appendix C.

Application of textures to figures

The ray marcher (Appendix B) is extended by projecting the inpainted texture maps to the surfaces of the 3D body parts from four views (i.e. front, back, left, right). As an enhancement, we blend the obtained textures, that is, the steeper the angle of the surface normal to the texture, the more weight the color value gains from this texture. The normals are approximated by nearby SDF values at the surface points (i.e. n = [SDF(x+ε, y, z) – SDF(x-ε, y, z), SDF(x, y+ε, z) – SDF(x, y-ε, z), SDF(x, y, z+ε) – SDF(x, y, z-ε)]). For an interactive view, we pass the rendered images to the canvas of the library “matplotlib,” where mouse events can be captured to zoom and to rotate around the depicted figure.
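The normal approximation and the view-dependent blending can be sketched as follows; the epsilon and the exact weighting scheme are assumptions for illustration.

```python
# Sketch of the normal approximation by central differences of SDF values and
# of the blending weights: the more frontal a texture's view direction is to
# the surface normal, the more weight its color receives.
import numpy as np

def sdf_normal(sdf, p, eps=1e-2):
    # sdf: callable returning the signed distance at a 3D point p = (x, y, z)
    x, y, z = p
    n = np.array([
        sdf((x + eps, y, z)) - sdf((x - eps, y, z)),
        sdf((x, y + eps, z)) - sdf((x, y - eps, z)),
        sdf((x, y, z + eps)) - sdf((x, y, z - eps)),
    ])
    return n / (np.linalg.norm(n) + 1e-9)

def blend_weights(normal, view_dirs):
    # view_dirs: unit vectors toward the four texture cameras (front, back, left, right)
    w = np.maximum(0.0, np.array([np.dot(normal, v) for v in view_dirs]))
    return w / (w.sum() + 1e-9)
```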

Appendix D.

Sources of the 3D story map with pictorial figures