Research Article

A Multi-View Semi-supervised learning method for knee joint cartilage segmentation combining multiple feature descriptors and image modalities

Article: 2332398 | Received 03 Oct 2023, Accepted 11 Mar 2024, Published online: 05 Apr 2024

ABSTRACT

Multi-atlas based segmentation techniques constitute an effective approach in the automatic segmentation of medical images. Existing methods usually rely on single spectral descriptors extracted from a specific imaging modality. In this paper, we propose the Multi-View Knee Cartilage Segmentation (MV-KCS) approach for segmenting the knee joint articular cartilage from MR images. Operating under the semi-supervised learning framework, MV-KCS leverages spectral content from multiple feature spaces by constructing sparse graphs for each view individually and aggregating them via optimisation to obtain a common data graph. We consider two multi-view scenarios: in the former, the views correspond to multiple feature descriptors, while in the latter, they correspond to multiple image modalities. We propose two effective labelling schemes, implementing label propagation from the atlas library to the target image. The proposed methodology is applied to the publicly available Osteoarthritis Initiative repository. We devise a comprehensive experimental design to validate different test cases, comparing single-feature vs multi-features, multi-features vs feature stacking and multi-features vs multi-modalities. Comparative results and statistical analysis reveal that the proposed MV-KCS provides enhanced performance (DSC = 92.56% for FC, 89.91% for TC), outperforming a series of patch-based approaches, six recent state-of-the-art deep supervised models and three deep semi-supervised ones, in terms of both classification and volumetric measures.

1. Introduction

Osteoarthritis (OA) is one of the most widespread joint diseases worldwide, affecting as many as 10% of men and 18% of women over the age of 60 (Felipe and McCombie Citation2002). It is a complex condition whose onset and progression depend on a multitude of socioeconomic, genetic, biomechanical and other systemic factors (Felson Citation2000; Glyn-Jones et al. Citation2015). The leading causes of the condition are thought to be 1) bone misalignments caused by either congenital or pathogenic factors, 2) excess body weight, 3) loss of strength and muscle mass and 4) damage to peripheral nerves. It primarily affects the weight-bearing joints of the human body, most commonly the knee joint, causing pain and varying degrees of disability in patients. The pain and physical discomfort induced by the condition significantly affect patients' social functioning and mental health, further diminishing their quality of life. Rising life expectancy and ageing populations, especially in developed countries, are expected to place osteoarthritis among the leading causes of disability in the coming years. In addition to the severe negative effect on the everyday lives of the affected populations, this has the capacity to place a significant burden on national health systems. Consequently, research towards a better understanding and treatment of OA is a crucial issue.

Magnetic Resonance Imaging (MRI) provides an indispensable tool in the evaluation of cartilage volume and thickness, thus allowing a robust qualitative and quantitative analysis of cartilage morphology. However, manual delineations performed by expert radiologists and clinicians are time-consuming and susceptible to high inter- and intra-rater variability, highlighting the need for efficient and robust automatic methods of achieving reliable segmentation.

1.1. Related work

In the past decade, considerable research has been conducted towards fully automating the process of cartilage segmentation from MR images. The thin structure of cartilage, coupled with the inter-slice variability of femoral and tibial cartilage shape, poses significant challenges in achieving this goal. Most methods suggested to address these issues can be broadly divided into five groups: region-growing methods, statistical shape models, graph-based methods and, finally, classical machine learning and deep learning ones. A thorough review of the existing knee cartilage segmentation literature can be found in (Ebrahimkhani et al. Citation2020).

1.1.1. Region-based methods

Region growing algorithms operate on a predetermined set of seed points, expanding the initial regions by incorporating neighbouring voxels under a similarity measure. In most applications (Pakin et al. Citation2002; Öztürk and Albayrak Citation2016) they are utilised as an initial step to acquire a pre-segmentation mask, progressively refined until the desired result is obtained. Region growing methods are fairly easy to implement, but computationally expensive. Moreover, the required user input in the initial stages of segmentation hinders the widespread application of such methods in a fully automated setting.

1.1.2. Statistical shape models

Statistical Shape Models (SSM) and Active Appearance Models (AAM) enjoy wide use in knee joint segmentation applications. Shape is a primary feature of rigid anatomical structures and can be successfully utilised as an initial guiding tool for the subsequent complete delineation of the corresponding structure. SSMs can also serve as shape regularisers at the end of a segmentation pipeline, performing post-processing corrections (Ambellan et al. Citation2019). While adequately successful in applications involving a small number of subjects, SSMs impose over-restrictive shape variation, often requiring a large number of landmarks to deal with larger datasets.

1.1.3. Graph-based methods

Graph-based methods treat segmentation as an energy cost function optimisation problem, where an initial global graph is split into multiple subgraphs under certain constraints, each made to correspond to an object of interest in the image (Yin et al. Citation2010). Overall, graph-based methods have strong theoretical foundations and achieve desirable results, but similar to the Region-Growing ones require a user’s input.

1.1.4. Classical machine learning methods

This family of methods treats knee joint segmentation as a supervised classification task, estimating the label of each voxel from a set of features extracted from the image. The large number of possible combinations of features and classifiers allows for great variety of applications (Folkesson et al. Citation2007; Zhang et al. Citation2013). However, these types of models typically suffer from low generalisability, since the extracted features are usually tailored to a specific dataset.

1.1.5. Deep learning-based methods

In recent years, deep learning methods, primarily involving Convolutional Neural Networks (CNNs), have been steadily gaining popularity in medical image applications, in part because CNNs are capable of learning an appropriate set of features automatically. In (Prasoon et al. Citation2013), a deep model is presented, composed of three 2D CNNs, each one corresponding to one of the three orthogonal views of an MRI (sagittal, coronal, axial). A 2D U-net model is reported in (Norman et al. Citation2018) for the segmentation of cartilage and meniscus, while a variant of SegNet (Badrinarayanan et al. Citation2017) with an application to bone and cartilage segmentation is presented in (Liu et al. Citation2018). A combination of 2D and 3D CNNs, coupled with an SSM regularisation step, is reported in (Ambellan et al. Citation2019), while applications featuring fully 3D networks constitute only a recent development (Zhou et al. Citation2018; Dai et al. Citation2021; Peng et al. Citation2022), due to the vast computational load incurred in such cases. Although deep learning methods achieve overall attractive segmentation results, the lack of large-scale annotated medical image repositories might lead to overfitting, forcing researchers to rely on fine-tuning of pre-trained CNNs or on artificially augmenting the existing datasets.

1.2. Semi-supervised deep learning-based methods

Semi-supervised deep-learning methods constitute a promising alternative to standard deep-learning models in the face of sparse annotated data, as is the usual case in medical image segmentation applications, while at the same time showcasing substantially faster computation times. The authors in (Zhang et al. Citation2017) propose a deep adversarial network (DAN) comprising two sub-modules: a segmentation network that provides label maps and an evaluation network that assesses the quality of said segmentations. In (Yu et al. Citation2019), an uncertainty-aware mechanism is utilised, employing a student model and a teacher model, whereby the student attempts to minimise the segmentation loss by utilising labels supplied by the teacher, while the teacher continuously estimates an uncertainty map to bias the student towards harnessing information from the more reliable annotated targets. Finally, in (Luo et al. Citation2021) the authors design a novel uncertainty-aware mechanism, expanding the previous work of (Tarvainen and Valpola Citation2017) by designing a dual-task deep network that simultaneously learns a pixel-wise segmentation map and a level-set representation of the target.

1.3. Multi-atlas patch-based methods

Multi-atlas patch-based methods segment a target MRI $T$ by utilising a collected atlas library $\mathcal{A} = LB(T) = \{A_i, L_i\}_{i=1}^{n_A}$, consisting of $n_A$ subject MRIs ($A_i$) and their corresponding labelled masks ($L_i$). Voxels in the target image $T$ are automatically assigned labels by harnessing the available information from library $\mathcal{A}$. A necessary condition for the successful application of this process is that the atlas images in $\mathcal{A}$ and the target image $T$ all share a common coordinate space. This is commonly achieved by initially registering all atlases $A_i \in \mathcal{A}$ to the target $T$, usually via an affine or a deformable transformation (Hajnal et al. Citation2001).

These methods typically proceed as follows: for each voxel $x \in T$, a search volume $N(x)$ of size $N_s = n \times n \times n$ with $x$ at its centre is defined, and each voxel $y \in N(x)$ yields a patch $P$ of size $P_s = p \times p \times p$, $p < n$. Each of these patches gives rise to a corresponding patch vector $p(y) \in \mathbb{R}^{P_s}$, comprising the intensity values of all voxels in $P$. The collection of all patch vectors extracted from the search volumes within each atlas $\{A_i\}_{i=1}^{n_A}$ forms the patch library corresponding to the target voxel $x$, $PL = \{p_{A_i}(y),\ y \in N(x),\ i = 1, \ldots, n_A\}$, $PL \in \mathbb{R}^{P_s \times N_s n_A}$.

Under the assumption that spectral and label similarity across patches are encoded by the same underlying function, the goal is to express $p_T(x)$ as a linear decomposition of the patches in $PL$, via coefficients $w \in \mathbb{R}^{N_s n_A}$ reflecting that similarity, and then reconstruct the label of $p_T(x)$ as a combination of the corresponding atlas patches, utilising those same coefficients. Typical examples of this approach are found in (Rousseau et al. Citation2011; Zhang et al. Citation2012), where $w$ is computed via Non-local Means (NLM) filtering (Buades et al. Citation2011) and Sparse Coding (SC) optimisation (Zhang et al. Citation2015), respectively.

Patch-based methods overcome many of the challenges encountered by previous approaches, while simultaneously achieving state-of-the-art segmentation performance, as long as a suitably large and variable atlas library PL is chosen. However, the voxel-wise manner in which the segmentation process is carried out incurs large computational costs, since the full segmentation of a target image entails the exhaustive extraction of a large number of atlas patches, for each single target voxel.

1.4. Multi-view semi-supervised learning

Multi-view data are widespread among real world applications. In most cases, multiple views on the same dataset result either from a multitude of measuring methods, or from expressing a single view with distinct sets of features. Multi-view methods aim to learn a function for each view separately and then jointly optimise them in order to boost generalisation performance (Sun Citation2013).

As the availability of data, as well as the ease of their acquisition, steadily increases across multiple fields, multi-view learning algorithms are becoming more prevalent in the computational intelligence community. Successful applications of multi-view methods have been reported for numerous machine learning tasks, such as transfer learning, dimensionality reduction, clustering, semi-supervised learning and multi-task learning. A more comprehensive list of recent advances and the new challenges arising can be found in (Zhao et al. Citation2017).

In this study, we exclusively focus on applying multi-view learning (MVL) under the semi-supervised learning (SSL) framework. Adopting the manifold assumption, that the available data lie in a lower-dimensional manifold embedded in a higher-dimensional feature space, SSL aims to leverage the spectral information of few labelled data to predict the labels of the unlabelled ones (Belkin et al. Citation2006). Since direct access to the underlying manifold structure is impossible, a proxy is often used through the construction of a weighted spectral affinity graph $G(\mathcal{V}, \mathcal{E}, W)$, where the vertices $\mathcal{V}$ correspond to the data points and the edges $\mathcal{E}$ connect those vertices that display some similarity under a given measure, encoded in an adjacency matrix $W$ (Belkin and Niyogi Citation2005). SSL is particularly useful in applications where there is an abundance of unlabelled data, but acquisition of the corresponding labels is a costly and tedious process. Medical image segmentation falls exactly in this category, whereby the manual annotation of an MRI by experts entails a considerable workload.

The adoption of multi-view learning is a natural extension of the SSL framework, with the potential to greatly boost the latter's performance. Given $V$ distinct views of the same dataset, Multi-view Semi-supervised learning (MV-SSL) assumes the existence of $V$ distinct manifolds and constructs $V$ different weighted graphs $\{G^{(v)}(\mathcal{V}, \mathcal{E}, W^{(v)})\}_{v=1}^{V}$, one for each. Various combination schemes have been proposed for the fusion of those multiple graphs into a single entity (Sun Citation2011; Karasuyama and Mamitsuka Citation2013). The strength of MV-SSL stems from its ensemble-like rationale. Multiple views individually encode different spectral information of the data. Most often, some aspects of the data variability ignored by a particular view may be captured by another one and vice versa. Hence, their multi-view integration significantly boosts the robustness of the overall model.

1.5. Outline of proposed method

Recently, we proposed a multi-atlas patch-based approach for automatic knee cartilage segmentation utilising the SSL principles (Chadoulos et al. Citation2022). The MV-KCS method proposed herein effectively integrates the frameworks of MVL and SSL in a unifying scheme, thus leveraging the strengths of both to enhance classification accuracy. In that respect, MV-KCS is an extension of our previous work, which can be regarded as a single-view approach. Figure 1 provides a succinct illustration of the algorithmic procedure. A summary of the salient features and contributions of our proposed framework is listed below:

  • Multi-view Semi-supervised learning (MV-SSL): Under the multi-atlas setting, we classify target voxels by utilising both labelled data from the atlas library {A,L} and unlabeled ones from the target image T. Moreover, we further embed the SSL framework within the broader scope of MVL, thus enhancing the robustness of our method via leveraging multiple distinct views, each one capturing a different aspect of the data.

  • Sparse Coding (SC): The multiple weighted affinity graphs corresponding to each distinct view are constructed via a Sparse Coding optimisation process, whereby each voxel is represented as a linear combination of its sparse neighbours. The incorporation of sparsity aims at reducing noise and enhancing the robustness of the proposed method.

  • Voxels' labeling: Adapting to the MVL framework, we propose two multi-view graph-based labelling schemes, namely, Regional Label Propagation (MV-RegLP) and Hybrid Label Propagation (MV-HyLP), for the classification of the target voxels. The former considers the voxel descriptors at the regional level, while the latter further integrates a finer level of description by leveraging the notion of Label Map Estimates.

  • Out-of-sample labeling: An iterative sampling process successively generates out-of-sample batches of target voxels $X_o$ through progressive densification of a 3D mesh. In addition, local data are generated by sampling the atlas library at spatially correspondent locations. The resulting labelling scheme leverages information by considering voxel similarity along two axes, namely, global – local and spectral – spatial.

  • Experimental setup: The proposed MV-KCS considers two cases of multiple views, namely, different feature sets (descriptors), and different MRI acquisition protocols (sequences). Furthermore, we investigate several test cases, such as single-feature vs multi-features, multi-features vs feature stacking and multi-features vs multi-modalities, with the aim to demonstrate the efficacy of the multi-view setting.

Figure 1. General flowchart of the proposed method.


The remainder of this paper is organised as follows: Section 2 focuses on the preliminary steps undertaken with regard to image preprocessing, registration and atlas selection. In Section 3, we present the two sets of views examined in this study, namely, the feature sets extracted from popular image descriptors and the different MRI sequences. Section 4 describes the adopted multi-view aggregation framework, while Sections 5, 6 and 7 detail the proposed MV-KCS algorithm. Details relevant to the data and model hyperparameters, as well as the experimental layout, are presented in Section 8, while Section 9 deals with the presentation and discussion of the experimental results. Finally, Section 10 concludes this study with an overview of the key findings and a number of possible future extensions.

2. Background

We treat knee cartilage segmentation as a multi-class classification task ($c = 5$), with labels {Background: 0, Femoral Bone: 1, Femoral Cartilage: 2, Tibial Bone: 3, Tibial Cartilage: 4}. Our primary focus lies on the successful segmentation of the cartilage structures (femoral & tibial), while a secondary focus is placed on the bone structures.

2.1. Image preprocessing

Knee cartilage segmentation is a challenging task, owing mainly to the poor separation of the articular cartilage and the surrounding tissues (excluding bone), as well as the fact that the intensity profiles and texture of image regions occupied by those structures exhibit close similarities. This problem is further accentuated by the inter-subject variability inherent in magnetic resonance imaging. To ameliorate these shortcomings, we pre-process each MRI as follows (a code sketch is given after the list):

  1. Curvature flow filtering: A curvature-driven flow filter (Sethian Citation1999) is applied to each MRI, preserving surface boundaries between adjacent structures while simultaneously smoothing homogeneous image regions.

  2. Inhomogeneity correction: N3 intensity nonuniformity bias field correction (Sled et al. Citation1998) is applied to each image to reduce the intra-subject variability in intensity profiles across similar structures.

  3. Intensity standardization: All MRI histograms are standardised to a common template, according to the method described in (Nyul and Udupa Citation2000), in an effort to reduce inter-subject variability.

  4. Non-local-means denoising: A final filtering step is performed to account for any leftover artefacts from the previous steps and to further reduce noise. We opted for non-local-means patch-based denoising (Buades et al. Citation2011), due to its robust performance and widespread use in medical image applications. As a final step, all image intensities are rescaled to [0,100].
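As a rough illustration, the above chain can be assembled from off-the-shelf SimpleITK filters. The following is a minimal sketch, not the exact configuration used in this work: the time step and iteration counts are placeholder values, N4 substitutes for the cited N3 algorithm, and a median filter stands in for the non-local-means denoiser.

import SimpleITK as sitk

def preprocess_mri(image: sitk.Image, template: sitk.Image) -> sitk.Image:
    # 1. Curvature flow filtering: edge-preserving smoothing.
    img = sitk.CurvatureFlow(sitk.Cast(image, sitk.sitkFloat32),
                             timeStep=0.125, numberOfIterations=5)
    # 2. Inhomogeneity correction (N4 used here as a stand-in for N3).
    img = sitk.N4BiasFieldCorrectionImageFilter().Execute(img)
    # 3. Intensity standardisation against a common template histogram.
    img = sitk.HistogramMatching(img, sitk.Cast(template, sitk.sitkFloat32))
    # 4. Denoising (placeholder for patch-based non-local means).
    img = sitk.Median(img, [1, 1, 1])
    # Rescale all intensities to [0, 100].
    return sitk.RescaleIntensity(img, outputMinimum=0.0, outputMaximum=100.0)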

2.2. Registration & atlas selection

An essential component of any multi-atlas approach is the atlas library. Given a target image $T$ to be segmented, a series of pairwise transformations is performed, registering all atlases $\{A_i\}_{i=1}^{N_A}$ to $T$. To avoid the computational cost associated with the use of deformable models, here we employ an affine transformation. This registration aligns all atlases to the target image coordinate space, covering all linear deformations such as translations, rotations, shear and scale. The derived transformations are also applied to the corresponding label masks of each atlas $L_i$, to maintain the spatial correspondence between atlas image and mask pairs. The result of the above process is the construction of the atlas library $\{A_i^{(T)}, L_i^{(T)}\}_{i=1}^{N_A}$ registered to $T$.

Next, in an effort to reduce the required number of voxels to be segmented, we define a Region Of Interest (ROI) covering the entire cartilage structure and its surrounding volume. A pre-segmentation binary mask is derived via a Majority Voting filter operating on the registered atlas masks, by discarding the bone (femoral & tibial) labels and regarding the corresponding cartilage ones as a single unit. Finally, a binary morphological dilation filter expands this mask, yielding the resulting cartilage ROI for the image $T$, as illustrated in Figure 2. The ROI effectively defines the sampling volume for the target image $T$ and its associated atlas library $\{A_i, L_i\}_{i=1}^{N_A}$, disregarding all exterior voxels. The final morphological dilation step guarantees that the ROI encompasses all cartilage voxels across the target and atlas images.

Figure 2. Each atlas Ai is affinely registered to the target image T, and the obtained transformation is subsequently applied to the corresponding masks Li. A majority voting filter (MV) produces an initial pre-segmentation mask (L˜T) of cartilage-only labels and the Region of Interest (ROI) is finally defined as the morphological dilation of L˜T.


Given that non-linear deformations cannot be effectively handled by affine transformations, we perform a final atlas selection step by retaining in the atlas library only a subset of the registered atlases that exhibit good spatial alignment with the target image. To this end, a measure of spatial misalignment is calculated for every pair $\{T, A_i^{(T)}\}$ using the Mean Squared Difference ($MSD_i(ROI)$), where the voxels participating in the computation are exclusively those belonging to the ROI. The $n_A$ atlases with the lowest scores are included in the library, while the rest are discarded (Algorithm 1).

Algorithm 1:

Affine Registration & Atlas Selection

Data: Target $T$, atlases $\{A_i, L_i\}_{i=1}^{N_A}$, number of atlases to select $n_A$

Result: Registered selected atlases $\{A_i^{(T)}, L_i^{(T)}\}$, ROI

1. for $i \leftarrow 1$ to $N_A$ do

2.   $A_i^{(T)}, L_i^{(T)} = \mathrm{Register}(T, A_i)$;

3. $\tilde{L}^{(T)} = \mathrm{MV}(\{L_i^{(T)}\})$;

4. $ROI = \mathrm{BinaryMorphologicalDilation}(\tilde{L}^{(T)})$;

5. $MSD_i = \mathrm{mse}(T(x), A_i^{(T)}(x)),\ x \in ROI,\ i = 1, \ldots, N_A$;

6. select the $n_A$ atlases corresponding to the $n_A$ smallest $MSD_i$
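Assuming registration has already been carried out, the remaining steps of Algorithm 1 reduce to a few array operations. The sketch below is illustrative only: the cartilage label values and the dilation radius are assumptions, not values taken from this work.

import numpy as np
from scipy import ndimage

def select_atlases(target, atlases, masks, n_select, cartilage_labels=(2, 4)):
    # Atlases/masks are numpy arrays already registered to the target space.
    # Majority voting over the binarised (cartilage-only) atlas masks.
    cart = np.stack([np.isin(m, cartilage_labels) for m in masks])
    presegmentation = cart.mean(axis=0) > 0.5
    # Morphological dilation of the pre-segmentation yields the ROI.
    roi = ndimage.binary_dilation(presegmentation, iterations=3)
    # Rank the atlases by mean squared difference inside the ROI.
    msd = [np.mean((target[roi] - a[roi]) ** 2) for a in atlases]
    keep = np.argsort(msd)[:n_select]
    return keep, roi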

3. Multi-view acquisition

3.1. Patch/region description

A robust and distinctive characterisation of the voxels' information content is essential for cartilage segmentation. In that regard, the spatial scale considered is crucial: too large a spatial scale averages out meaningful information, while too small a one leads to noisy features. Here, we introduce two units of spatial aggregation, namely, the patch and the Regional Search Volume (SV). The patch $p_i = p(x_i)$ is defined as a $5 \times 5 \times 5$ volume centred around a voxel $x_i$. Each patch $p_i$ is characterised by a feature vector $x_i = h_i = f_{enc}(p_i) \in \mathbb{R}^q$, with $q$ representing the feature dimensionality and $f_{enc}$ the associated encoding function of the descriptor. Correspondingly, each feature vector is associated with a label vector $y_i = [y_{i,1}, \ldots, y_{i,c}]^T \in \mathbb{R}^c$. We distinguish between hard label vectors, derived from atlas images ($y_i^A$), and soft ones, derived from the target image ($y_i^T$).

The Regional Search Volume (SV) $R_i$ is defined as a $15 \times 15 \times 15$ volume comprising $3 \times 3 \times 3$ patches, with the central patch corresponding to the one associated with voxel $x_i$. $R_i$ may be viewed as the set of all patches contained within its spatial region, $R_i = \{p_{i_1}, \ldots, p_{i_{27}}\}$, where $p_{i_1}$ corresponds to the central voxel $x_i$ of $R_i$. The feature descriptor associated with the whole region is formed as a weighted aggregate of all the individual feature vectors of its constituent patches, $H_i = \sum_j w_{h_j} h_j^{(i)}$, $H_i \in \mathbb{R}^q$, where the weight associated with each patch is determined according to its city-block distance from the central one (Figure 3; a code sketch follows the figure caption). Finally, the label vectors are concatenated in a row-wise manner to produce the corresponding label matrix of $R_i$, $Y_i = [y_{i_1}, \ldots, y_{i_{27}}]^T \in \mathbb{R}^{27 \times c}$.


Figure 3. Schematic representation of the feature description process for a regional SV Ri, centered around a voxel xi. Each SV comprises 3×3×3 patches, which in turn consist of 5×5×5 voxels. For all patches pj(i)∈Ri, a feature vector hj(i) is calculated according to an encoding function (HOG, LBP, etc.), here denoted as fenc(⋅), where hj(i) serves as the individual descriptor for pj(i). Finally, the whole Ri is characterized by Hi as the weighted aggregation of all its constituent descriptors.
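As a concrete illustration of this aggregation, the sketch below forms $H_i$ as a weighted sum of the 27 patch descriptors. The text specifies only that the weights depend on the city-block distance from the central patch, so the inverse-distance weighting used here is an assumption.

import numpy as np

def regional_descriptor(patch_vectors, offsets):
    # patch_vectors: (27, q) descriptors; offsets: (27, 3) integer grid
    # offsets of each patch from the central one, components in {-1, 0, 1}.
    d = np.abs(offsets).sum(axis=1)   # city-block distance to the centre
    w = 1.0 / (1.0 + d)               # assumed inverse-distance weighting
    w = w / w.sum()
    return w @ patch_vectors          # H_i in R^q

offsets = np.array([(i, j, k) for i in (-1, 0, 1)
                    for j in (-1, 0, 1) for k in (-1, 0, 1)])
H_i = regional_descriptor(np.random.rand(27, 64), offsets)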

3.2. Feature descriptors

In this section we introduce our main multi-view case, involving the use of multiple feature descriptors to characterise a voxel and/or SV. Feature descriptors are selected so that they exhibit minimum redundancy, while simultaneously offering complementary information.

3.2.1. Histograms of Oriented Gradients (HOG)

Histograms of Oriented Gradients (HOG) (Navneet and Triggs Citation2020) are a well-established descriptor with proven performance across multiple computer vision tasks, including medical imaging ones (Sarwinda and Bustamam Citation2018). Here, we present a fully 3D variant of the HOG descriptor, based on (Kläser et al. Citation2008). Our implementation is modified to fit the feature extraction framework presented in Section 3.1. For a regional SV $R_i$, we compute the mean gradient vector $g_j^{(i)} \in \mathbb{R}^3$ of each constituent patch $p_j^{(i)} \in R_i$. To avoid the unequal bin sizes of the traditional 2D histogram over the surface of a sphere (Rister et al. Citation2017), we project $g_j^{(i)}$ onto the vertices of a regular polyhedron, obtaining the orientation histogram $q_j^{(i)} = C g_j^{(i)}$, with $C = [c_1, \ldots, c_q]^T \in \mathbb{R}^{q \times 3}$ the matrix of the polyhedron vertices' coordinates. $q_j^{(i)}$ is subsequently normalised and thresholded to yield the patch feature descriptor $h_j^{(HOG,i)} \in \mathbb{R}^q$. The regional SV's descriptor $H_i^{(HOG)}$ is finally obtained as illustrated in Figure 3. We found that $q = 42$ provides a good balance between descriptor size and performance.

3.2.2. Local Binary Patterns (LBP)

Local Binary Patterns (LBP) (Ojala et al. Citation1996) enjoy widespread use owing to their robustness and ease of implementation. Based on (Banerjee et al. Citation2013), we propose a 3D extension of the traditional 2D LBP by utilising the concept of Spherical Harmonics, thus circumventing the difficult problem of uniform sampling on the surface of a sphere while simultaneously achieving rotational invariance. The information content of a patch $p_i \in R_i$ is characterised through the encoding function $f = f_{lbp} = [s(g_0 - g_c), s(g_1 - g_c), \ldots, s(g_{P-1} - g_c)] \in \mathbb{R}^P$, where $g_c$ is the intensity of the central voxel, $g_i$, $i = 0, \ldots, P-1$, are the intensities of the $P$ neighbours sampled on the vertices of a regular polyhedron at radius $R$, and $s(\cdot)$ is the step function. Sampling the harmonic basis functions' values at the locations of the polyhedron vertices, we obtain $Y_l^m \in \mathbb{R}^{P \times l^2}$ (degree $l$, order $m$) and proceed to extract the expansion coefficients of $f$ via the inner product $c_l^m = \langle f, Y_l^m \rangle$. $f$ can now be reconstructed as a band-limited expansion with $l$ components via $\tilde{f}_k = \sum_{m=-k}^{k} c_l^m Y_l^m$, and the patch descriptor is finally obtained by taking the $\ell_2$-norm of the expansion's frequency components, $h_j^{(i)} = LBP_{P,R}^{ri3D} = \{\|\tilde{f}_0\|, \|\tilde{f}_1\|, \ldots, \|\tilde{f}_{l-1}\|\}$. As for HOG, we choose a polyhedron with $P = q = 42$ vertices, while the radius is set at $R = 2$.

3.2.3. Gray-Level Co-occurrence Matrix (GLCM) & Gray-Level Run-Length Matrix (GLRLM)

Features based on the computation of Gray-Level Co-occurrence Matrices (GLCM) (Haralick et al. Citation1973) and Gray-Level Run-Length Matrices (GLRLM) (Galloway Citation1975) are among the most commonly used for the description of image texture. Given an image with $L$ levels, both methods rely on the creation of an $L \times L$ matrix: GLRLM yields a matrix whose $(i,j)$ element denotes the number of occurrences of the intensity value $i$ for runs of length $j$ along a given direction $\theta$, while in GLCM, the element $(i,j)$ stores the number of times the pair of intensity values $i,j$ is present along a given direction $\theta$ at a distance $d$. The extension to the 3D case is possible through the inclusion of an additional direction $\phi$. A typical arrangement of directions and the corresponding displacement vectors for the computation of GLCM and GLRLM is found in Table 1.

Table 1. Displacement vectors for GLCM and GLRLM for volumetric data.

In our case, GLCM and GLRLM matrices are calculated for every patch $p_j^{(i)} \in R_i$ to be characterised. A set of features is then calculated from each matrix, yielding the corresponding feature vectors for the patch under consideration, $h_j^{(i),GLCM} \in \mathbb{R}^{q_{GLCM}}$ and $h_j^{(i),GLRLM} \in \mathbb{R}^{q_{GLRLM}}$.
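For illustration, features of this kind are readily computed with scikit-image. The snippet below is a 2D slice-wise GLCM sketch (the variant described here is fully 3D); the property list, level count, distances and angles are assumptions.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(patch, levels=32):
    # patch: small integer volume quantised to `levels` gray levels.
    feats = []
    for sl in patch:  # iterate over the patch's slices (2D illustration)
        glcm = graycomatrix(sl, distances=[1], angles=[0, np.pi / 2],
                            levels=levels, symmetric=True, normed=True)
        for prop in ("contrast", "homogeneity", "energy", "correlation"):
            feats.append(graycoprops(glcm, prop).ravel())
    return np.concatenate(feats)

patch = np.random.randint(0, 32, size=(5, 5, 5))
h = glcm_features(patch)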

3.2.4. Hand-Crafted Geometric Features (HCGF)

Finally, inspired by (Folkesson et al. Citation2007), we consider a final set of geometric features, relying heavily on the gradients of the voxels' intensities. Denoting by $I$ the intensity of a voxel and by $I_x, I_y, I_z$ the partial derivatives of $I$ along the $x, y, z$ axes, we compute the following features for every voxel within a given patch $p_i \in R_i$: spatial coordinates $(x, y, z)$, intensity $I$, first-order partial derivatives $I_x, I_y, I_z$, the eigenvalues and eigenvectors of the Hessian matrix $H = [I_{jk}]$ and the corresponding ones of the structure tensor $T = [I_j I_k]$, for $j, k \in \{x, y, z\}$. This results in a $q = 36$ dimensional feature vector for every voxel. To obtain a single feature descriptor for the whole patch, and to avoid the curse of dimensionality, we follow the same approach as for the regional descriptors, taking the weighted sum of all feature vectors, where the weights correspond to the city-block distances from the central voxel.

The above features provide a diversified description of the image content, covering many aspects of the data variability. In particular, HOG and HCGF are well suited for capturing the shape and orientation of anatomical structures, while LBP, GLCM and GLRLM excel at describing their texture. Each descriptor gives rise to a different view on the available data ($V = 5$, Figure 4).


Figure 4. Schematic representation of multi-view feature extraction for regional SV Ri. Each one of the j=1,⋯,27 patches is supplied with a feature vector for each descriptor(view) employed. The regional feature vector for each view is then calculated according to the process outlined in Figure 3. The overall description of Ri is the collection of all {Hi(v)}v=1V.

3.3. MRI sequences

We also consider a secondary multi-view case, whereby multiple views are associated with different MRI acquisition protocols (sequences). The sequences used in our experiments are the Dual Echo Steady State (DESS), T2-weighted (T2) and SPoiled Gradient Recalled echo (SPGR). DESS is the most widely used acquisition protocol for musculoskeletal and joint imaging, due to the adequate delineation it offers between cartilage, bone and some of the surrounding tissue. SPGR, while not yielding a similar quality of delineation as DESS, produces images characterised by greater homogeneity within the same anatomical region. Finally, T2 sequences are also widely applicable in musculoskeletal imaging and can offer even better contrast between cartilage and surrounding tissue (muscle, fat, etc.) (Peterfy et al. Citation2008). Figure 5 showcases the same cross-slice of a single subject across the three different acquisition protocols.

Figure 5. DESS, SPGR and T2 MRI sequences for the same subject. The differences in intensity profiles are apparent, especially between the T2 sequence and the rest, suggesting that features learned from these images will yield sufficiently different descriptions of the image information content.


An important preprocessing step in the multi-sequence multi-view case is the intra-subject alignment of the three images. Since each sequence follows a distinct acquisition protocol, there are differences regarding voxel spacing, image size, direction and origin. However, since the underlying anatomy is the same, non-linear deformations are not an issue (Schneider et al. Citation2008). Taking DESS as the reference sequence for each subject, a simple resampling and interpolation operation is sufficient to align all sequences in the same coordinate space. The SPGR and T2 sequences then undergo the exact same preprocessing steps outlined in Section 2.1.

4. The AMUSE model

4.1. Background on Graph-based SSL

Given a set of $n$ voxels, $n_A$ of which are sampled from an atlas library (labelled) and the remaining $n_T = n - n_A$ from a target image (unlabelled), $V$ feature sets are extracted, each corresponding to a different view. The obtained dataset is represented in a column-wise manner as $\{X^{(v)} = [x_1^{(v)}, \ldots, x_n^{(v)}],\ X^{(v)} \in \mathbb{R}^{q_v \times n}\}_{v=1}^{V}$, with the corresponding labels one-hot encoded into the label matrix $Y = [Y_A; Y_T] \in \mathbb{R}^{n \times c}$ (the components of $Y_T$ are initially set to zero).

Adopting the Gaussian Fields and Harmonic Functions (GFHF) approach (Zhu et al. Citation2003), an affinity matrix $W^{(v)} \in \mathbb{R}^{n \times n}$ is calculated for each view based on a pre-defined similarity measure, corresponding to a graph $G^{(v)} = (\mathcal{V}^{(v)}, \mathcal{E}^{(v)})$, whose vertices correspond to data points and whose edge weights encode the spectral similarity of those points within the context of the specific view. The Laplacian of each graph is derived as $L^{(v)} = D^{(v)} - W^{(v)}$, where $D^{(v)}$ is a diagonal matrix with $D_{ii}^{(v)} = \sum_{j=1}^{n} W_{ij}^{(v)}$ (Mohar Citation1991).

In the single-view case ($V = 1$), GFHF utilises the affinity matrix $W$ and the corresponding Laplacian $L$ to propagate labels from labelled data to unlabelled ones, under the constraint that already labelled data will not be affected. The underlying principle rests on the manifold assumption, i.e., data in a high-dimensional space share the same labels if they are spectrally similar within a lower-dimensional manifold embedded in that space (for which graph $G$ serves as a proxy). This idea can be rigorously defined in the following optimisation objective

(1) $\min_F \sum_{i=1}^{n_A} \| f_i - y_i \|_2^2 + \sum_{i,j=1}^{n} W_{ij} \| f_i - f_j \|_2^2$

where FRn×c denotes the learned labels, the first term above corresponds to the loss, and the second one acts as the manifold regulariser. Casting (1) as a trace minimisation problem, we get

(2) $\min_F \operatorname{Tr}(F^T L F) \quad \text{s.t. } F_A = Y_A$
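For a single view, the harmonic solution of (2) has the closed form derived later in Eq. (6) and can be sketched in a few lines of numpy (data ordered labelled-first):

import numpy as np

def gfhf_propagate(W, Y_A):
    # W: (n, n) affinity matrix; Y_A: (n_A, c) one-hot labels.
    n_A = Y_A.shape[0]
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    L_TT = L[n_A:, n_A:]
    L_TA = L[n_A:, :n_A]
    # Harmonic solution: F_T = -L_TT^{-1} L_TA Y_A (cf. Eq. (6)).
    return -np.linalg.solve(L_TT, L_TA @ Y_A)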

4.2. Multiple-view integration

In the multi-view setting, given multiple graphs G(v) and the corresponding affinity matrices W(v), a straightforward way to integrate the multiple views is via a linear combination, yielding:

(3) $\min_{F,\alpha} \sum_{v=1}^{V} \alpha_v \operatorname{Tr}(F^T L^{(v)} F) \quad \text{s.t. } S = \sum_{v=1}^{V} \alpha_v W^{(v)},\ F_A = Y_A,\ \alpha^T \mathbf{1} = 1,\ \alpha \geq 0$

where $L_S$ corresponds to the Laplacian of the combined affinity matrix $S = \sum_{v=1}^{V} \alpha_v W^{(v)}$. The optimal $\alpha$ resulting from the above problem is often too sparse to effectively take advantage of the information content across the multiple views. To this end, the equality constraint is relaxed to a quadratic penalty term, as follows:

(4) $\min_{F,\alpha,S} \operatorname{Tr}(F^T L_S F) + \lambda \big\| S - \sum_{v=1}^{V} \alpha_v W^{(v)} \big\|_F^2 \quad \text{s.t. } F_A = Y_A,\ \alpha^T \mathbf{1} = 1,\ \alpha \geq 0,\ S \mathbf{1} = \mathbf{1},\ S \geq 0$

4.3. AMUSE optimization

Since the objective function in (4) cannot be efficiently solved for all three variables simultaneously, an alternating optimisation process is proposed, whereby two of the three variables are fixed while the third one is updated, in an iterative fashion. Algorithm 2 summarises the steps outlined in the following paragraphs.

4.3.1. Update F, fixing S, α

With $S, \alpha$ fixed, the problem is equivalent to the single-view case (2), with $L = L_S$. Partitioning the Laplacian into a $2 \times 2$ block matrix and separating the labelled and unlabelled entries, we get the following optimisation objective:

(5) $\min_{F} \operatorname{Tr} \left( \begin{bmatrix} F_A \\ F_T \end{bmatrix}^T \begin{bmatrix} L_{AA} & L_{AT} \\ L_{TA} & L_{TT} \end{bmatrix} \begin{bmatrix} F_A \\ F_T \end{bmatrix} \right) \quad \text{s.t. } F_A = Y_A$

Taking the partial derivative of the objective with respect to FT and setting to zero, the following solution is obtained:

(6) $F_T = -(L_{TT})^{-1} L_{TA} Y_A$

Finally, F is updated as

(7) $F = [Y_A;\ F_T]$

4.3.2. Update S, fixing F, α

The objective of (4) for this round assumes the following formulation:

(8) $\min_{S} \operatorname{Tr}(F^T L_S F) + \lambda \big\| S - \sum_{v=1}^{V} \alpha_v W^{(v)} \big\|_F^2 \quad \text{s.t. } S \mathbf{1} = \mathbf{1},\ S \geq 0$

which can be further algebraically manipulated as follows:

$\operatorname{Tr}(F^T L_S F) + \lambda \big\| S - \sum_{v=1}^{V} \alpha_v W^{(v)} \big\|_F^2$
$= \frac{1}{2} \sum_{i,j=1}^{n} D_{ij} s_{ij} + \lambda \sum_{i,j=1}^{n} \big( s_{ij} - \sum_{v=1}^{V} \alpha_v W_{ij}^{(v)} \big)^2$
$= \lambda \sum_{i,j=1}^{n} \big( s_{ij} - \sum_{v=1}^{V} \alpha_v W_{ij}^{(v)} + \frac{D_{ij}}{4\lambda} \big)^2 + \text{constant}$

where $D_{ij} = \| f_i - f_j \|_2^2$. The simplex constraint $\sum_{j=1}^{n} s_{ij} = 1$ essentially means that the rows of $S$ can be treated independently of one another, leading to the following optimisation problem for $i = 1, \ldots, n$:

(9) $\min_{S_i} \sum_{j=1}^{n} \big( s_{ij} - \sum_{v=1}^{V} \alpha_v W_{ij}^{(v)} + \frac{D_{ij}}{4\lambda} \big)^2 \quad \text{s.t. } \sum_{j=1}^{n} s_{ij} = 1,\ s_{ij} \geq 0$

Denoting $T_{ij} = \sum_{v=1}^{V} \alpha_v W_{ij}^{(v)} - \frac{D_{ij}}{4\lambda}$ for notational brevity, the optimisation problem for each row $i$ of $S$ can finally be formulated as

(10) $\min_{S_i} \frac{1}{2} \| S_i - T_i \|_2^2 \quad \text{s.t. } \mathbf{1}^T S_i = 1,\ s_{ij} \geq 0$

This constitutes a projection problem and can be efficiently solved utilising an accelerated projected gradient method. The details of the solver’s implementation can be found in (Huang et al. Citation2015).
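The core of (10) is a Euclidean projection onto the probability simplex. A standard sorting-based routine is sketched below; this is a plain projection, not necessarily the accelerated solver of (Huang et al. Citation2015).

import numpy as np

def project_to_simplex(t):
    # Projects t onto {s : s >= 0, sum(s) = 1} (Eq. (10) for one row).
    u = np.sort(t)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(t) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(t - theta, 0.0)

# Each row S_i of S is the projection of the corresponding T_i.
S_i = project_to_simplex(np.array([0.6, 0.3, -0.2, 0.5]))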

Algorithm 2:

AMUSE Optimization Algorithm

Data: affinity matrices $\{W^{(v)} \in \mathbb{R}^{n \times n}\}_{v=1}^{V}$, label matrix $Y_A \in \mathbb{R}^{n_A \times c}$, regularisation parameter $\lambda \in \mathbb{R}$

Result: affinity matrix of the multi-view graph $S \in \mathbb{R}^{n \times n}$, label matrix of the unlabelled data $F_T \in \mathbb{R}^{n_T \times c}$, linear combination parameters $\alpha \in \mathbb{R}^V$

1. Initialize $S_0 = \frac{1}{V} \sum_{v=1}^{V} W^{(v)}$;

2. Initialize $\alpha_v = \frac{1}{V},\ v = 1, \ldots, V$;

3. while convergence criterion not met do

4.   Update $F$ with Eq. (7);

5.   Update $S$ with Eq. (10);

6.   Update $\alpha$ with Eq. (14);

4.3.3. Update α, fixing F, S

With $F, S$ fixed, the original problem (4) becomes

(11) $\min_{\alpha} \big\| S - \sum_{v=1}^{V} \alpha_v W^{(v)} \big\|_F^2 \quad \text{s.t. } \alpha^T \mathbf{1} = 1,\ \alpha \geq 0$

and can be reformulated in a more compact form as

(12) $\min_{\alpha} \| \operatorname{vec}(S) - \Lambda \alpha \|_2^2 \quad \text{s.t. } \alpha^T \mathbf{1} = 1,\ \alpha \geq 0$

where $\operatorname{vec}(\cdot)$ is the vectorisation (flattening) operator, $\operatorname{vec}(S) \in \mathbb{R}^{n^2}$, and $\Lambda = [\operatorname{vec}(W^{(1)}), \ldots, \operatorname{vec}(W^{(V)})] \in \mathbb{R}^{n^2 \times V}$.

Observing that $\Lambda \alpha = \operatorname{vec}\big( \sum_{v=1}^{V} \alpha_v W^{(v)} \big)$, the above problem can finally be cast in the form of a Quadratic Program (QP), as showcased below:

(13) $\min_{\alpha} \alpha^T \Lambda^T \Lambda \alpha - 2 \operatorname{vec}(S)^T \Lambda \alpha = \min_{\alpha} \alpha^T P \alpha - q^T \alpha \quad \text{s.t. } \alpha^T \mathbf{1} = 1,\ \alpha \geq 0$

where $P = \Lambda^T \Lambda$ and $q = 2 \Lambda^T \operatorname{vec}(S)$. Introducing an auxiliary variable $\beta \in \mathbb{R}^V$ and an additional equality constraint, problem (13) becomes

(14) $\min_{\alpha, \beta} \alpha^T P \alpha - q^T \beta \quad \text{s.t. } \alpha^T \mathbf{1} = 1,\ \alpha \geq 0,\ \alpha = \beta$

This problem can be efficiently solved using the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. Citation2010), whereby the original problem is split into optimising $\alpha$ and $\beta$ independently. The exact derivations of the final update rules can be found in (Nie et al. Citation2020).
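As a simpler stand-in for the ADMM scheme, the QP (13) can also be attacked by projected gradient descent over the simplex, reusing the projection routine sketched earlier; the step size and iteration count below are assumptions.

import numpy as np

def solve_alpha(W_list, S, n_iter=200):
    # Projected-gradient sketch for Eq. (13); not the ADMM of Nie et al.
    Lam = np.stack([W.ravel() for W in W_list], axis=1)   # (n^2, V)
    P = Lam.T @ Lam
    q = 2.0 * Lam.T @ S.ravel()
    alpha = np.full(len(W_list), 1.0 / len(W_list))
    lr = 1.0 / (np.linalg.norm(P, 2) + 1e-12)             # safe step size
    for _ in range(n_iter):
        grad = 2.0 * P @ alpha - q
        alpha = project_to_simplex(alpha - lr * grad)     # defined above
    return alpha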

5. Multi-view graph constructions

In this section, we describe the global data sampling process and the sparse graph construction of the multiple views.

5.1. Data sampling

A spatially stratified sampling process is used to yield the two global datasets of labelled ($X_g^A$) and unlabelled ($X_g^T$) voxels, respectively. The first step applies K-means clustering to the spatial coordinates of all voxels $x^T \in ROI^T$, yielding $n_g^T$ cluster centres. After interpolating those centres to the closest grid voxels, we obtain a global target (unlabelled) dataset $X_g^T = \{x_g^T(i)\}_{i=1}^{n_g^T}$. The above step guarantees a sufficient coverage of the entire cartilage ROI.

Next, the same process is repeated, this time clustering the coordinates of the voxels in $X_g^T$ into $n_s^T$ clusters ($n_s^T < n_g^T$), thus forming the subset $X_{gs}^T \subset X_g^T$. Having established a common coordinate space between the target image $T$ and the atlas library $\{A_i^{(T)}, L_i^{(T)}\}_{i=1}^{n_A}$ through registration (Section 2.2), all atlases $A_i$ are sampled at the spatially correspondent locations specified by the coordinates of $X_{gs}^T$. A global atlas (labelled) dataset $X_g^A = \{x_g^{A_i}(j),\ j: x_{gs}^T(j) \in X_{gs}^T\}_{i=1}^{n_A}$ ($n_g^A = n_s^T n_A$) is thus formed, along with the corresponding label matrix $Y_g^A \in \mathbb{R}^{n_g^A \times c}$. By restricting the sampling of atlas voxels to spatial locations corresponding to already sampled target ones, we maintain the patch-based directive: estimating the target voxels' labels from those of spatially aligned voxels in the atlas library.

The cardinalities $n_g^T = |X_g^T|$ and $n_g^A = |X_g^A|$ of the unlabelled and labelled datasets, respectively, are chosen to balance between improved accuracy (larger values) and moderate computational demands (smaller values). In our experiments, we consider a fixed number of global samples ($n_g = n_g^T + n_g^A$).

5.2. Views sparse graph coding

In constructing the graphs encoding data connectivity for each view, we incorporate the concept of the sparse neighbourhood (Zang and Zhang Citation2012). Each voxel $x_i^{(v)}$ is linearly reconstructed from its respective local neighbourhood $LN(x_i^{(v)})$, which comprises the set of $k$-nearest neighbours from the global dataset $X_g^{(v)} = [X_g^{T(v)}, X_g^{A(v)}] \in \mathbb{R}^{q_v \times n_g}$. Let $A_i^{(v)} \in \mathbb{R}^{q_v \times k}$, $A_i^{(v)} \subset \{X_g^{(v)} \setminus x_i^{(v)}\}$, denote the subset of $k$-nearest neighbours of $x_i^{(v)}$. The entire sparse graph $G_g^{(v)}$ associated with view $v = 1, \ldots, V$ is constructed via a series of sparse optimisations (Zhang et al. Citation2015):

(15) $\min_{w_i^{(v)}} \| w_i^{(v)} \|_1,\ \forall x_i^{(v)} \in X_g^{(v)} \quad \text{s.t. } \frac{1}{2} \| x_i^{(v)} - A_i^{(v)} w_i^{(v)} \|_2^2 \leq \varepsilon$

where $\varepsilon > 0$ is typically set to a small value. Eq. (15) is an $\ell_1$-minimisation problem under a quadratic constraint, which can be efficiently solved via coordinate descent. Each resulting vector $w_i^{(v)} \in \mathbb{R}^k$ is projected to unit $\ell_1$-norm ($\| w_i \|_1 = 1$) and resized to $\mathbb{R}^{n_g}$ with zero-padding. The edge weights for graph $G_g^{(v)}$ form the affinity matrix for view $v$, $W^{(v)} = [w_1^{(v)}, \ldots, w_{n_g}^{(v)}] \in \mathbb{R}^{n_g \times n_g}$. Algorithm 3 summarises the above procedure.

Algorithm 3:

Sparse Graph Construction

Data: labelled atlas voxels $X_g^{A(v)}$, unlabelled target voxels $X_g^{T(v)}$, number of neighbours $k$

Result: sparse graphs' edge weights $W^{(v)}$

1. for $v \in \{1, \ldots, V\}$ do

2.   identify the $k$-nearest neighbours of each $x^{(v)} \in X_g^{(v)}$;

3.   for each $x_i^{(v)} \in X_g^{(v)}$ do

4.     solve (15) for $w_i^{(v)}$;

5.   $W^{(v)} = [\{w_i^{(v)}\}_{i=1}^{n_g}]^T$;

In all of the above, each voxel $x_i^{(v)}$ is described by its associated regional SV descriptor corresponding to a distinct view (HOG, LBP, etc.). Each weight $w_{ij}^{(v)}$ encodes the spectral similarity of the regional SVs $R_i$ and $R_j$ for view $v$.
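A sketch of Algorithm 3 for a single view is shown below. It solves the Lagrangian (Lasso) form of (15), a common substitute for the constrained formulation; the sparsity weight, neighbourhood size and non-negativity of the weights are assumptions.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.neighbors import NearestNeighbors

def sparse_graph(X, k=10, sparsity=1e-3):
    # X: (q_v, n_g) matrix whose columns are regional SV descriptors.
    n = X.shape[1]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X.T)
    _, idx = nn.kneighbors(X.T)                  # idx[i, 0] is i itself
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = idx[i, 1:]                        # exclude the voxel itself
        # Lagrangian form of (15); non-negative weights assumed.
        lasso = Lasso(alpha=sparsity, positive=True, max_iter=2000)
        lasso.fit(X[:, nbrs], X[:, i])
        w = lasso.coef_
        if w.sum() > 0:
            w = w / np.abs(w).sum()              # project to unit l1-norm
        W[nbrs, i] = w                           # zero-padded elsewhere
    return W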

6. Proposed multi-view labeling schemes

We now propose two label propagation mechanisms, the MV-RegLP and MV-HyLP methods. The first one considers voxel descriptors' similarity at the regional level, while the second one integrates an additional spatial scale by incorporating the notion of Label Map Estimates.

6.1. Multi-view regional label propagation (MV-RegLP)

Regional Label Propagation (MV-RegLP) operates exclusively at the regional level, whereby each voxel $x_i$ is characterised by the multi-view feature descriptors $\{H_i^{(v)} \in \mathbb{R}^{q_v}\}_{v=1}^{V}$ of its associated $R_i$, and is assigned a label $y_i$ (the label of the central voxel $x_i$).

The labelling of MV-RegLP proceeds along the following steps:

  1. Compute the affinity matrices $\{W^{(v)}\}_{v=1}^{V}$ for each view through Algorithm 3.

  2. Solve the multi-view optimisation problem (4) via AMUSE (Algorithm 2) to obtain the combined affinity matrix

    (16) $W_g = \sum_{v=1}^{V} \alpha_v W^{(v)} \in \mathbb{R}^{n_g \times n_g}$

    along with the multi-view coefficients $\alpha \in \mathbb{R}^V$.

  3. Obtain the Laplacian of $W_g$ as $L = I - W_g$ to yield the label propagation weights

    (17) $\tilde{W}_g = -L_{TT}^{-1} L_{TA} \in \mathbb{R}^{n_g^T \times n_g^A}$

    which encode the regional transfer mechanism from labelled to unlabelled data.

  4. Target voxels $x_g^T \in X_g^T$ assume their labels via

    (18) $\hat{Y}_g^T = \tilde{W}_g Y_g^A$

    where $Y_g^A = [y_1^A \cdots y_{n_g^A}^A]^T \in \mathbb{R}^{n_g^A \times c}$ and $\hat{Y}_g^T = [y_1^T \cdots y_{n_g^T}^T]^T \in \mathbb{R}^{n_g^T \times c}$.

6.2. Multi-view Hybrid Label Propagation (MV-HyLP)

The feature descriptors $H_i^{(v)}$, $v = 1, \ldots, V$, associated with an SV $R_i$ provide a characterisation of voxel $x_i$ at a broad spatial scale. To leverage the information encoded within the constituent patches at a finer spatial level, for each view $v = 1, \ldots, V$ we define the patch-to-patch spectral similarity of $p_i, p_j$ via the chi-squared kernel (Zhang et al. Citation2006):

(19) $s^{(v)}(i,j) = s^{(v)}(p_i, p_j) = \sum_{k=1}^{q} \frac{ (h_{ik}^{(v)} - h_{jk}^{(v)})^2 }{ h_{ik}^{(v)} + h_{jk}^{(v)} }$

Next, a region-to-region label transfer mechanism is developed, whereby the label estimates $\hat{y}_{ik}^{T(v)}$ of target patches $p_{ik}^{T(v)} \in R_i^{T(v)}$ are computed as a weighted combination of all label vectors $y_{jl}^{A} \in R_j^{A}$:

(20) $\hat{y}_{ik}^{T(v)} = \frac{ \sum_{l=1}^{27} s^{(v)}(i_k, j_l)\, (y_{j_l}^A)^T }{ \sum_{l=1}^{27} s^{(v)}(i_k, j_l) } = \tilde{s}_{ij}^{(v)}(i_k, \cdot)\, Y_j^A$

where $\tilde{s}_{ij}^{(v)}(i_k, \cdot) = [\tilde{s}^{(v)}(i_k, j_1), \ldots, \tilde{s}^{(v)}(i_k, j_{27})] \in \mathbb{R}^{27}$ is the patch similarity vector normalised to unit sum, and $Y_j^A$ the label matrix associated with $R_j^A$. Then, the Label Map Estimate (LME) $\hat{Y}_i^{T(v)}(j) \in \mathbb{R}^{27 \times c}$ of $R_i^T$ is produced as:

(21) $\hat{Y}_i^{T(v)}(j) = \tilde{S}_{ij}^{(v)} Y_j^A, \quad v = 1, \ldots, V$

where $\tilde{S}_{ij}^{(v)} = [\tilde{s}_{ij}^{(v)}(i_1, \cdot), \ldots, \tilde{s}_{ij}^{(v)}(i_{27}, \cdot)] \in \mathbb{R}^{27 \times 27}$ encodes the region-to-region similarity between $R_i^T$ and $R_j^A$, with the index $j$ identifying the regional SV $R_j^A$ as the source of label information for $R_i^T$. The above equation provides an effective way of encoding the patch-level label distribution across regional SVs.
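The similarities (19) and the LME (21) for one view and one atlas SV can be sketched as follows. Converting the chi-squared distances into similarities via an exponential map is an assumption here, as the exact kernel normalisation is not spelled out above.

import numpy as np

def label_map_estimate(H_T, H_A, Y_A):
    # H_T, H_A: (27, q) patch descriptors of a target/atlas regional SV;
    # Y_A: (27, c) one-hot labels of the atlas SV's patches.
    num = (H_T[:, None, :] - H_A[None, :, :]) ** 2
    den = H_T[:, None, :] + H_A[None, :, :] + 1e-12
    chi2 = (num / den).sum(axis=2)               # (27, 27), Eq. (19)
    S = np.exp(-chi2)                            # assumed distance-to-similarity map
    S_tilde = S / S.sum(axis=1, keepdims=True)   # rows normalised to unit sum
    return S_tilde @ Y_A                         # LME, Eq. (21)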

The MV-HyLP method integrates label information at two spatial scales, proceeding as follows:

  1. The first step is identical to MV-RegLP, obtaining the sparse graphs $G_g^{(v)}$ and solving (4) via AMUSE (Algorithm 2) to obtain the label propagation weights $\tilde{W}_g$. This step provides the regional-level description of the data.

  2. For each view $v = 1, \ldots, V$, a regional SV $R_i^T$ is associated with a collection of SVs $\{R_j^A,\ j: x_j^A \in LN(x_i^T)\}$. Each $R_j^A$ produces a label map estimate $\hat{Y}_i^{T(v)}(j)$, computed using (21). Complementing the previous step, this one offers a finer level of description within each regional SV, by leveraging the patch-to-patch similarity matrices $\tilde{S}_{ij}^{(v)}$.

  3. Aggregate the view-specific LMEs $\hat{Y}_i^{T(v)}(j)$ over the different views using the multi-view coefficients $\alpha$ to obtain:

    (22) $\hat{Y}_i^T(j) = \sum_{v=1}^{V} \alpha_v \hat{Y}_i^{T(v)}(j)$

  4. Translating the manifold assumption from points – labels to regional SVs – LMEs, aggregate the derived label estimates for $R_i^T$ into a final label map, by utilising the label propagation coefficients of $\tilde{W}_g$:

    (23) $\hat{Y}_i^T = \sum_{j \in LN(x_g^T(i))} \tilde{w}_{ij}\, \hat{Y}_i^T(j)$

In view of (21) and (22), $\hat{Y}_i^T$ can be expressed in a more compact matrix form as:

(24a) $\hat{Y}_i^T = \sum_j Q_{ij} Y_j^A$
(24b) $Q_{ij} = \tilde{w}_{ij} \sum_{v=1}^{V} \alpha_v \tilde{S}_{ij}^{(v)}$

The terms $Q_{ij} \in \mathbb{R}^{27 \times 27}$ in (24b) provide the label transfer between $R_i^T$ and the atlas region $R_j^A$. The first factor, $\tilde{w}_{ij}$, represents their affinity at the regional level, while the second one is a weighted average of the region-to-region similarities across the different views.

  5. Finally, subsuming the label estimates of all target voxels in $X_g^T$, we obtain

(25) $[\hat{Y}_g^T] = Q\, [Y_g^A]$

where $Q = [Q_{ij}] \in \mathbb{R}^{27 n_g^T \times 27 n_g^A}$, and $[\hat{Y}_g^T] = [(\hat{Y}_1^T)^T, \ldots, (\hat{Y}_{n_g^T}^T)^T]^T \in \mathbb{R}^{27 n_g^T \times c}$, $[Y_g^A] = [(Y_1^A)^T, \ldots, (Y_{n_g^A}^A)^T]^T \in \mathbb{R}^{27 n_g^A \times c}$ are the stacked label matrices of the regional SVs of all voxels $x_g^T \in X_g^T$ and $x_g^A \in X_g^A$, respectively.

7. Out-of-sample label propagation

Stage-2 of the proposed method deals with induction on the remaining unlabelled target voxels. As in stage-1, out-of-sample voxels assume their labels via sparse reconstruction from their closest neighbours in the in-sample set. The iterative generation of out-of-sample batches relies on the creation of a 3D tetrahedral mesh, which allows the sampling of voxels at locations dependent on the ones of previously labelled samples. The mesh is progressively densified for each successive iteration until the entire target ROI is classified. The out-of-sample labelling proceeds as follows:

  1. Out-of-sample batch generation: At iteration $t = 0$, a 3D Delaunay tetrahedral mesh is constructed, with vertices corresponding to the voxels in $X_g$. Beginning stage-2 ($t = 1, 2, \ldots$), the centroids of the mesh tetrahedra are extracted and interpolated to the closest grid point to yield $X_o^{(v)} = \{x_o^{(v)}(i)\} \in \mathbb{R}^{q_v \times n_o}$, with $n_o$ the batch size, for views $v = 1, \ldots, V$.

  2. Local dataset generation: For every $x_o^{(v)} \in X_o^{(v)}$, the four parent vertices in the target image are collected to form the local dataset $X_l^{T(v)} = \{x_o^{T(v)}(i),\ i = 1, \ldots, n_l^T\} \in \mathbb{R}^{q_v \times n_l^T}$. The respective spatially correspondent voxels in the atlas library of those in $X_l^T$ and $X_o$ (centroids & parent vertices) are also collected, to form a second local dataset $X_l^{A(v)} \in \mathbb{R}^{q_v \times n_l^A}$. The label matrices associated with these two local datasets are, accordingly, $\hat{Y}_l^T \in \mathbb{R}^{n_l^T \times c}$ and $Y_l^A \in \mathbb{R}^{n_l^A \times c}$.

  3. Out-of-sample voxel embedding: The available global $X_g = [X_g^T X_g^A]$, $\hat{Y}_g = [(\hat{Y}_g^T)^T, (Y_g^A)^T]^T$ and local $X_l = [X_l^T X_l^A]$, $\hat{Y}_l = [(\hat{Y}_l^T)^T, (Y_l^A)^T]^T$ datasets yield complementary information on the data: the former is geared towards capturing a larger portion of the class variability across the whole ROI, while the latter contains patterns specific to smaller local regions. The existence of multiple views on the data lends yet another dimension to the overall description, rendering it more complete. Leveraging all of the above components, we connect the out-of-sample batch $X_o$ to both the global and local datasets via the learned sparse graphs. In particular, for each view $v = 1, \ldots, V$, the sparse graphs $\{G_{og}^{(v)}(X_o^{(v)}, X_g^{(v)}),\ G_{ol}^{(v)}(X_o^{(v)}, X_l^{(v)})\}$ are constructed according to (15), where each $x_o^{(v)} \in X_o^{(v)}$ is decomposed into a linear combination of its sparse neighbours in $X_g^{(v)}$ and $X_l^{(v)}$, respectively.

    The resulting affinity matrices $\{\tilde{W}_{og}^{(v)} \in \mathbb{R}^{n_g \times n_o},\ \tilde{W}_{ol}^{(v)} \in \mathbb{R}^{n_l \times n_o}\}_{v=1}^{V}$, with columns restricted to unit $\ell_1$-norm, are then aggregated via the coefficients $\alpha$ to yield the combined global and local multi-view matrices $\tilde{W}_{og} = \sum_{v=1}^{V} \alpha_v \tilde{W}_{og}^{(v)}$ and $\tilde{W}_{ol} = \sum_{v=1}^{V} \alpha_v \tilde{W}_{ol}^{(v)}$, as illustrated in Figure 6.

    Figure 6. Out-of-sample label propagation: two sparse graphs W˜og(v),W˜ol(v) are constructed for each view v of the out-of-sample batch Xo(v), connecting it to the global Xg(v) and local Xl(v) datasets respectively. The multi-view coefficients α are used to yield the final learned sparse graphs facilitating the labelling of voxels in Xo.


    Depending on the label transfer mechanism employed, the induction on out-of-sample voxels proceeds by extracting two sets of labels (global + local):

    (26a) $\hat{Y}_{og} = \tilde{W}_{og}^T Y_g, \quad \hat{Y}_{ol} = \tilde{W}_{ol}^T Y_l \quad$ (MV-RegLP)
    (26b) $[\hat{Y}_{og}] = Q_{og} [Y_g], \quad [\hat{Y}_{ol}] = Q_{ol} [Y_l] \quad$ (MV-HyLP)

    with $Q_{og}, Q_{ol}$ derived according to (24b). Finally, the complete label estimate is formulated by considering the convex combination of both the globally and locally derived components:

    (27a) $\hat{Y}_o = \mu \tilde{W}_{og}^T Y_g + (1 - \mu) \tilde{W}_{ol}^T Y_l \quad$ (MV-RegLP)
    (27b) $[\hat{Y}_o] = \mu Q_{og} [Y_g] + (1 - \mu) Q_{ol} [Y_l] \quad$ (MV-HyLP)

    where $\mu \in [0, 1]$ adjusts the contribution of global versus local label information.

  4. Mesh densification & termination: At the end of each iteration, the 3D mesh is densified by adding the current centroids to its list of vertices, drawing new edges between child (centroid) and parent nodes. A new round of out-of-sample label propagation begins by locating the centroids of the newly refined tetrahedra. This process terminates when a certain percentage of those tetrahedra (90%) fits entirely within a $5 \times 5 \times 5$ cell (Algorithm 4).

  5. Majority Voting & Post-processing: The remaining unlabelled voxels dispersed through the ROI are guaranteed to be surrounded by multiple labelled ones, owing to the progressive densification of the initial mesh. Thus, a simple Majority Voting rule over the vertices of the encompassing tetrahedra is sufficient to yield accurate classification results. Finally, a binary morphological closing – opening is performed to smooth the surfaces of the segmented structures, followed by a pass through a neighbour voting filter, to rectify potential isolated misclassifications.

Algorithm 4:

Iterative sampling procedure

Data: Global dataset XgT

1. Construct 3D mesh DT via triangulation of XgT;

2. initialise iteration count t=1;

3. while termination condition not met do

4. compute new batch of centroids Xo;

5. insert Xo in DT;

6. t=t+1;
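The centroid-extraction and densification loop of Algorithm 4 can be sketched with scipy's Delaunay triangulation; the fixed round count below stands in for the tetrahedron-size termination criterion described above.

import numpy as np
from scipy.spatial import Delaunay

def centroid_batches(points, n_rounds=3):
    # Iteratively triangulate the labelled voxel coordinates and emit the
    # tetrahedra centroids as out-of-sample batches.
    pts = np.asarray(points, dtype=float)
    for _ in range(n_rounds):
        tri = Delaunay(pts)
        # Centroid of each tetrahedron = mean of its four vertices.
        centroids = pts[tri.simplices].mean(axis=1)
        yield np.rint(centroids)            # snap to the nearest grid voxel
        # Densify the mesh by adding the centroids as new vertices.
        pts = np.vstack([pts, centroids])

for batch in centroid_batches(np.random.rand(50, 3) * 100):
    pass  # label `batch` via Eq. (27), then continue densifying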

8. Experimental setup

8.1. Image data

The MRI sequences used in this study were acquired from the publicly available Osteoarthritis Initiative (OAI) repository (Peterfy et al. Citation2008). The entirety of the dataset is utilised (507 images), accounting for the whole spectrum of joint degradation with regards to the Kellgren-Lawrence (K – L) grade (Biscaldi et al. Citation2018). For the multi-view setting involving the use of multiple feature descriptors, we use the sagittal 3D Dual-Echo-Steady-State (3D-DESS) sequence with water excitation, while for the multi-view setting dealing with different MRI sequences, we consider the 3D-DESS, T2-weighted (T2) and SPoiled Gradient Recalled echo (SPGR). In all comparative experiments and test cases, the problem is cast as a 5-class classification: Background Tissue (BT): 0, Femoral Bone (FB): 1, Femoral Cartilage (FC): 2, Tibial Bone (TB): 3 and Tibial Cartilage (TC): 4.

8.2. Evaluation measures

Segmentation performance is evaluated using three well-established volumetric measures, namely, the Dice Similarity Coefficient (DSC), the Volumetric Difference (VD) and the Volume Overlap Error (VOE). Denoting by $Y$ the ground truth label map and by $\hat{Y}$ the estimated one, they are defined as:

(28a) $DSC = 100 \cdot \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}$
(28b) $VOE = 100 \left( 1 - \frac{DSC}{200 - DSC} \right)$
(28c) $VD = 100 \cdot \frac{|\hat{Y}| - |Y|}{|Y|}$

Since the majority of voxels in a given MRI belong to either Background or Bone classes, we also include the typical classification measures Recall and Precision, to evaluate the sensitivity of our proposed methodology. All measures refer exclusively to the image content delineated by the respective cartilage ROIs.
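For reference, all three volumetric measures can be computed directly from a pair of label volumes; a minimal sketch:

import numpy as np

def segmentation_scores(y_true, y_pred, label):
    # DSC, VOE and VD (Eqs. (28a)-(28c)) for one structure, e.g. label=2
    # for femoral cartilage; inputs are integer label volumes.
    Y, Yh = (y_true == label), (y_pred == label)
    inter = np.logical_and(Y, Yh).sum()
    dsc = 100.0 * 2.0 * inter / (Y.sum() + Yh.sum())
    voe = 100.0 * (1.0 - dsc / (200.0 - dsc))
    vd = 100.0 * (Yh.sum() - Y.sum()) / Y.sum()
    return dsc, voe, vd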

8.3. Hyperparameter setting & sensitivity analysis

The proposed MV-RegLP and MV-HyLP integrate multiple components, the most notable being multi-atlas, sparse representation, semi-supervised learning and, finally, multi-view learning. Each component's contribution hinges upon one or more hyperparameters that control its behaviour: the number of selected atlases $n_A$ (Section 2.2), the percentage $ssl_{perc} = 100 \cdot n_g^A / n_g$ of labelled data included in the learning process (Section 5.1), the number $k$ of nearest neighbours used in constructing the sparse graphs (Section 5.2), the structural regularisation term $\lambda$ in the multi-view semi-supervised objective (Section 4.2) and, finally, the convex combination parameter $\mu$ controlling the trade-off between the global and local datasets in Equation (27).

Let $(\mathcal{A}, \mathcal{L}) = \{A_i, L_i\}_{i=1}^{n}$ be the entire image dataset ($n = 507$). The proposed MV-KCS follows the local modelling approach of multi-atlas segmentation, creating a particular model for each target image, $(T, \hat{L}_T) = f(LB(T))$, where $LB(T) = \{A_j^{(T)}, L_j^{(T)}\}_{j=1}^{n_A}$ denotes the corresponding atlas library and $f(\cdot)$ signifies the segmentation mapping function implemented by MV-KCS. To this end, we devise a suitable 5-fold cross-validation scheme for hyperparameter selection and for the evaluation of segmentation performance, respectively.

The optimal hyperparameter selection proceeds as follows:

  1. Each testing fold $(\mathcal{A}_t^{(r)}, \mathcal{L}_t^{(r)}) = \{A_{t,i}, L_{t,i}\}_{i=1}^{m}$, $r = 1, \ldots, 5$, $m = n/5$, is used as a source for selecting target images, $T_i^{(r)} = A_{t,i}^{(r)}$, $L_{T_i}^{(r)} = L_{t,i}^{(r)}$. The target labels $L_{T_i}^{(r)}$ are considered unknown and, hence, are not used in the model's construction.

  2. For each $A_{t,i}^{(r)}$, $i = 1, \ldots, m$, we select the nearest-neighbour image $A_{\nu,i}^{(r)} \in \{\mathcal{A} \setminus \mathcal{A}_t^{(r)}\}$ that minimises $MSE(A_{t,i}^{(r)}, A_{\nu,i}^{(r)})$ (Section 2.2). The collection of images $(\mathcal{A}_\nu^{(r)}, \mathcal{L}_\nu^{(r)}) = \{A_{\nu,i}^{(r)}, L_{\nu,i}^{(r)}\}_{i=1}^{m}$ forms the validation dataset with known labels, used to validate the models built on $(\mathcal{A}_t^{(r)}, \mathcal{L}_t^{(r)})$.

  3. For each $A_{\nu,i}^{(r)} \in \mathcal{A}_\nu^{(r)}$, $i = 1, \dots, m$, we select its corresponding library $\mathcal{LB}(A_{\nu,i}^{(r)}) = \{A_{tr,j}^{(r)}, L_{tr,j}^{(r)}\}_{j=1}^{n_A}$, with the atlases drawn from the training part, $A_{tr,j}^{(r)} \in \{\mathcal{A} \setminus (\mathcal{A}_t^{(r)} \cup A_{\nu,i}^{(r)})\}$.

  4. Next, we apply MV-KCS to build the models $(A_{\nu,i}^{(r)}, \hat{L}_{\nu,i}^{(r)}) = f(\mathcal{LB}(A_{\nu,i}^{(r)}))$, $i = 1, \dots, m$, $r = 1, \dots, 5$, for each combination in the parameter grid space. The optimal hyperparameter assumes the value providing the best overall performance on the validation parts, i.e. maximising $\sum_{i,r} \mathrm{DSC}(L_{\nu,i}^{(r)}, \hat{L}_{\nu,i}^{(r)})$.

Having determined the optimal parameters, the segmentation performance is evaluated using traditional 5-fold cross-validation. Concretely, for each $T_i^{(r)} \in \mathcal{A}_t^{(r)}$, $i = 1, \dots, m$, we select the corresponding library $\mathcal{LB}(T_i^{(r)}) = \{A_{tr,j}^{(r)}, L_{tr,j}^{(r)}\}_{j=1}^{n_A}$, where the atlases are now drawn from the training parts, $A_{tr,j}^{(r)} \in \{\mathcal{A} \setminus \mathcal{A}_t^{(r)}\}$. Models are then built via MV-KCS for the different testing folds, $(T_i^{(r)}, \hat{L}_{T_i}^{(r)}) = f(\mathcal{LB}(T_i^{(r)}))$, $i = 1, \dots, m$, $r = 1, \dots, 5$, and the results are averaged across all images in the dataset. Both the parameter validation and the performance evaluation are conducted under the MV-HyLP approach with the default multi-view setting (multiple descriptors).
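To make step 2 of the scheme concrete, the snippet below sketches the nearest-neighbour selection of validation images. The helper name and the plain voxel-wise MSE over whole, pre-aligned volumes are illustrative assumptions; the actual similarity criterion follows Section 2.2.

```python
import numpy as np

def select_validation_images(test_idx, images):
    """For each test image, pick its MSE-nearest neighbour outside the
    test fold to serve as a validation target with known labels."""
    pool = [j for j in range(len(images)) if j not in set(test_idx)]
    pairs = {}
    for i in test_idx:
        mse = [np.mean((images[i] - images[j]) ** 2) for j in pool]
        pairs[i] = pool[int(np.argmin(mse))]   # index of the validation image
    return pairs
```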

8.4. Single-view vs multi-view test cases

We consider a series of experimental cases, organised in three groups, to validate the effect of fusing multiple views on the performance of our models.

  1. Single-view vs Multi-view: This first group consists of a series of 4 test cases, as outlined in Table 2. Our primary concern in performing this test is two-fold: first, to compare the trivial single-descriptor case against the cases where multiple features are integrated via AMUSE (Section 3); secondly, to monitor the performance improvements obtained by the various feature subsets, each encoding a different aspect of the image content. The inclusion of features is designed so as to successively capture information pertaining to shape (HOG), shape and texture (HOG+LBP) and, finally, shape, texture and geometry (HOG+LBP+GLCM+GLRLM+HCGF). The descriptors GLCM and GLRLM are included together due to the similarity of their computation.

  2. Stacked vs Fused Features: This experiment compares the fusion of HOG, LBP and HCGF as carried out by MV-KCS against the single-view case where those three features are stacked together. This test case is important for determining whether the additional computational cost of the optimal MV learning is worth paying, or whether simple feature stacking suffices. We chose this particular subset of descriptors since they encode substantially different information content (shape, texture, geometry).

  3. Multi-feature vs Multi-modal views: In this case we evaluate our two multi-view settings, elaborated in Sections 3.2 and 3.3. Specifically, we aim to assess how the setting involving multiple feature descriptors compares against the one associated with multiple imaging sequences. The competing models make use of all the available features and modalities, respectively. For the multi-feature model, the DESS sequence is set as the default, while the HOG descriptor is utilised in the multi-sequence scenario.

Table 2. Test cases examined for the first group of experiments.

All cases utilise the MV-HyLP labelling method. The hyperparameters are set to the values determined by sensitivity analysis.

8.5. Competing methods

The efficacy of our proposed framework is evaluated against three supervised patch-based methods, eight supervised deep learning models, and three semi-supervised ones.

8.5.1. Patch-based methods

The Patch-based Sparse Coding (PBSC) (Zhang et al. Citation2012) is a popular approach in medical image segmentation. For consistency reasons, $N_s$ and $P_s$ are taken as $15 \times 15 \times 15$ and $5 \times 5 \times 5$, respectively. In addition, we implement a variant, termed PBSC(stacked), whereby each patch is characterised by a stacked feature vector of HOG, LBP and HCGF descriptors. The rationale for this choice is similar to the one outlined in Section 8.4. Aside from the feature description of each patch, the construction of the patch library PL is identical to that of the original paper.

A more recent variant in the patch-based family of methods is the Patch-based Joint-Label-Fusion (PB-JLF) (Nikolopoulos et al. Citation2020). Here, the authors opt for a more computationally intensive registration process, incorporating an additional deformable transformation after the initial affine step and achieving better correspondence between the respective image structures. In contrast to the two standard patch-based methods described in the above paragraph, all images are mapped to a common template space, offering a substantial boost in speed over alternative methods that operate on each target image space independently. The segmentation map for each target image is obtained via a joint label fusion mechanism, where each atlas is assigned a weight according to a voting process. The final results are obtained after an inverse transformation back to the original target image space. The method extends in a straightforward manner to incorporate additional image modalities; in the following experiments, the multi-modal setting is the default case considered for comparison.

8.5.2. Deep learning-based methods

Since deep learning is a prevalent approach, increasingly gaining popularity in recent years, we opt to compare our method against eleven deep networks (eight supervised and three semi-supervised), developed specifically to tackle medical image segmentation problems. The following paragraphs briefly summarise our own implementations of each model under consideration.

8.5.2.1. Triplanar CNN (Prasoon et al. Citation2013)

This method consists of a triad of CNNs, each operating on a 2D plane (xy,xz,yz) corresponding to the sagittal, coronal and axial views, respectively. For each voxel to be classified, a 2D patch centered around it is extracted from each of the three planes and is fed to the respective network. Each of the three branches performs an independent 2D segmentation task, with the final layer combining the individual outputs in order to produce the final 3D segmentation map.
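As a rough illustration of the input pipeline this implies, the sketch below extracts the three orthogonal 2D patches around a voxel; the patch size, the absence of border handling and the axis-to-plane naming are illustrative assumptions.

```python
import numpy as np

def triplanar_patches(volume: np.ndarray, x: int, y: int, z: int, size: int = 28):
    """Extract one 2D patch per plane (xy, xz, yz), centred on voxel (x, y, z),
    to feed the three CNN branches."""
    h = size // 2
    p_xy = volume[x - h:x + h, y - h:y + h, z]
    p_xz = volume[x - h:x + h, y, z - h:z + h]
    p_yz = volume[x, y - h:y + h, z - h:z + h]
    return p_xy, p_xz, p_yz
```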

8.5.2.2. SegNet (Badrinarayanan et al. Citation2017)

This is a convolutional Encoder-Decoder whose final layer performs classification in a pixel-wise manner. The topology of the Encoder part of the network is borrowed from VGG16 (Simonyan and Zisserman Citation2015), a state-of-the-art network developed by the Visual Geometry Group at the University of Oxford. To effectively deal with the issue of having a massive network but relatively few data for learning, we resort to transfer learning (Tajbakhsh et al. Citation2016) and initialise the Encoder weights to the values of the corresponding VGG16 ones. Those parameters are kept fixed; thus, the only trainable parts of the network are the Decoder and the final softmax layer. SegNet receives the available imaging modalities (DESS, T2, SPGR) as three input channels, effectively allowing us to leverage its architecture for multi-view learning, as each channel accepts a slice of one of the three sequences.
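A minimal PyTorch sketch of this transfer-learning arrangement is shown below, using the torchvision VGG16 weights as a stand-in for our initialisation; the decoder is omitted and the input size is illustrative.

```python
import torch
import torchvision

# Reuse the VGG16 convolutional stack as a frozen encoder (torchvision >= 0.13
# weight-enum API); only the decoder (not shown) would remain trainable.
weights = torchvision.models.VGG16_Weights.IMAGENET1K_V1
encoder = torchvision.models.vgg16(weights=weights).features
for p in encoder.parameters():
    p.requires_grad = False            # keep encoder weights fixed

x = torch.randn(1, 3, 224, 224)        # 3 channels <- DESS, T2, SPGR slices
features = encoder(x)                  # feeds the trainable decoder
```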

8.5.2.3. Multi-Modal (MM) CNN-FC

This deep learning model is based on the architecture proposed in (Guo et al. Citation2019) for the segmentation of multi-modal medical images. The authors suggest a simple arrangement of 5 convolutional layers, followed by 3 fully connected ones and a final softmax layer performing multi-class classification. They distinguish between three types of fusion architectures, based on the specific location in the network where the fusion takes place. Here, we implement a Type-II fusion network, in which each of the three available modalities is initially fed to its own convolutional branch. Feature fusion takes place at the first layer of the fully connected part of the network, meaning that features are learnt from each modality separately by the respective convolutional parts.
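The sketch below illustrates the Type-II fusion pattern in PyTorch: one small convolutional branch per modality, with concatenation at the first fully connected layer. All layer sizes are illustrative and not those of Guo et al.

```python
import torch
import torch.nn as nn

class TypeIIFusionNet(nn.Module):
    """One convolutional branch per modality; fusion at the first FC layer."""

    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(4), nn.Flatten())
            for _ in range(3)                      # DESS, T2, SPGR
        ])
        self.fc = nn.Sequential(nn.Linear(3 * 8 * 4 * 4, 64), nn.ReLU(),
                                nn.Linear(64, n_classes))

    def forward(self, dess, t2, spgr):
        feats = [b(m) for b, m in zip(self.branches, (dess, t2, spgr))]
        return self.fc(torch.cat(feats, dim=1))   # fusion happens here
```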

8.5.2.4. DenseVoxNet (Yu et al. Citation2017)

This is a convolutional network proposed for cardiovascular MRI segmentation. It consists of a downsampling and an upsampling part, with skip connections from each layer to all its subsequent ones, thus improving the information flow across the entire network and achieving more robust supervision overall. In our experiments, DenseVoxNet was trained using the same initialisation scheme for the parameters’ values as the one described in the original paper.

8.5.2.5. VoxResNet (Chen et al. Citation2016)

This is a deep residual network consisting of a series of stacked residual modules (VoxRes modules), each performing a series of Batch Normalisation and Convolution operations and containing a skip connection from the module's input to its output. By concatenating the different imaging modalities and feeding them as input to the network, the complementary information they provide is jointly fused in an implicit way.
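A minimal sketch of one such residual module, with an illustrative channel width, might look as follows.

```python
import torch.nn as nn

class VoxResModule(nn.Module):
    """BN -> ReLU -> Conv, twice, with the input added back via a skip."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)        # skip connection: input added to output
```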

8.5.2.6. KCB-Net (Peng et al. Citation2022)

This is a recently proposed network performing cartilage and bone segmentation. An initial slice-selection step, over all major orientations (coronal, axial, sagittal), singles out the most representative slices, on which expert annotations are performed. Three base learners (2D modules) are then trained on the annotated slices, pseudo-labels are assigned to the remaining ones, and a 3D module is subsequently trained. Following that, a variant of DenseVoxNet is used as a meta-learner to obtain the full segmentation. Finally, a post-processing step incorporating a fine-tuning network (Xie et al. Citation2022) enhances the surface delineation between adjacent cartilage and bone surfaces.

8.5.2.7. CAN3D

This is an extension of the Context Aggregation Network proposed in (Dai et al. Citation2021). Utilising a series of progressively dilated convolution operators, it can efficiently aggregate multi-scale information by extracting dense features and progressively expanding its receptive field, allowing the final voxel-wise classification to occur at full resolution. To deal with highly imbalanced class data (as in our case), CAN3D employs a hybrid loss function combining the Dice Similarity Coefficient (DSC) with the Dice Squared Focal Loss (DSF), a variant of DSC with MSC built in.
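The progressive-dilation idea can be sketched as a stack of 3D convolutions with growing dilation rates; the specific rates and channel width below are illustrative, not those of CAN3D.

```python
import torch.nn as nn

def dilated_stack(channels: int = 16) -> nn.Sequential:
    """Stacked 3x3x3 convolutions with dilation 1, 2, 4, 8: the receptive
    field grows exponentially while the spatial resolution is preserved."""
    layers = []
    for rate in (1, 2, 4, 8):
        layers += [nn.Conv3d(channels, channels, 3, padding=rate, dilation=rate),
                   nn.ReLU()]
    return nn.Sequential(*layers)
```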

8.5.2.8. 3D CNN + SSM

The work presented in (Ambellan et al. Citation2019) features a segmentation framework that combines 2D and 3D CNNs in conjunction with a Statistical Shape Model (SSM). The overall workflow consists of two branches, where the first one is tasked with the segmentation of the femoral and tibial bone structures, facilitated through a cascade of CNN and SSM steps. The resulting masks are then passed on to the second branch, where another 3D CNN finalises the segmentation by producing the respective maps for the femoral and tibial cartilage structures. All of the above steps are performed separately for both structures.

In all the above cases, the image dataset was split into three disjoint subsets, used for training, validation and testing, respectively (60%–20%–20%). All images were pre-processed as described in Section 2.1. The training parameters (learning rate, optimisation algorithm, etc.) and validation schemes of the above models were chosen based on the respective original papers.

8.5.2.9. DAN

The Deep Adversarial Network (DAN) (Zhang et al. Citation2017) exploits information from both labelled and unlabelled images by utilising two sub-modules: (1) a segmentation network (SN) tasked with providing annotated label maps and (2) an evaluation network (EN) that continuously ranks the performance achieved by SN. The training process is an iterative procedure whereby SN learns to output successively more accurate segmentation maps, guided by EN's grading of its performance on unlabelled images. Essentially, the EN module attempts to "teach" the student SN what an appropriate segmentation map looks like, rendering the overall model capable of utilising unlabelled data to obtain enhanced segmentation quality.

8.5.2.10. UA-MT

The Uncertainty-Aware Self-Ensembling model presented in (Yu et al. Citation2019) is another variant of the teacher-student setting. To deal with the possibility of unreliable targets from the teacher, the uncertainty-aware mean teacher (UA-MT) framework is proposed: the teacher estimates the uncertainty of each prediction it generates, and the student is gradually guided to learn only from the reliable targets. This process filters out unreliable outputs, which in turn encourages the model to supply higher-quality targets. Regarding the implementation details, the core of both teacher and student is designed around the topology of the V-Net (Milletari et al. Citation2016), removing the residual connections in each convolutional block and adapting the rest of the model as a Bayesian network, rendering it capable of estimating the uncertainty.
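The teacher's uncertainty estimate can be sketched with Monte-Carlo dropout, as below; the number of stochastic passes is an illustrative choice, and `model` is assumed to contain dropout layers and to output raw logits.

```python
import torch

def mc_dropout_uncertainty(model, x, n_passes: int = 8):
    """Average several stochastic softmax predictions and use the predictive
    entropy as a per-voxel uncertainty map (low entropy = reliable target)."""
    model.train()                          # keep dropout active at inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(n_passes)]).mean(0)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    return probs, entropy
```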

8.5.2.11. SS-DTC

The authors in (Luo et al. Citation2021) present a dual-task-consistency semi-supervised framework that jointly generates a voxel-wise segmentation map and a geometry-aware level-set representation of the target. A differentiable task-transform layer converts said representation into an approximate segmentation map, and a combined loss function then minimises the discrepancy between the predicted segmentation map and the probability map converted from the level-set function. This mechanism allows the model to harness unlabelled data to boost the overall segmentation performance.
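A hedged sketch of such a task-transform layer follows: a steep sigmoid converts a signed level-set prediction into an approximate binary mask so the two task heads can be compared; the steepness value is an illustrative choice.

```python
import torch

def levelset_to_mask(lsf: torch.Tensor, k: float = 1500.0) -> torch.Tensor:
    """Differentiable transform: positive level-set values map towards 1
    (inside the structure), negative ones towards 0."""
    return torch.sigmoid(k * lsf)
```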

For the three semi-supervised methods described above, we follow the same setting as in our own proposed methods regarding the percentage of labelled vs unlabelled MRIs used. The optimal parameters for each model were chosen according to the validation scheme described in Section 8.3.

9. Experimental results & discussion

9.1. Parameter sensitivity analysis

In this section, we validate the performance effect of the five critical hyperparameters ($n_A$, $ssl_{perc}$, $k$, $\lambda$, $\mu$). While varying one parameter, the rest remain fixed at their default values (Section 8.3).

9.1.1. Number of selected atlases nA

The number of selected atlases $n_A$ controls the size of the atlas library, which contributes labelled data at both stages of label propagation. During the transductive stage (stage-1), all atlases are sampled in a stratified manner at locations specified by the global target dataset $X_g^T$, with the number of sampled voxels per atlas being inversely proportional to the number of available atlases, $n_g^A = ssl_{perc}\, n_g^T / n_A$ (Section 5.1). At the inductive stage (stage-2), however, this relationship changes, as the iterative sampling process via the 3D mesh (Section 6) yields a number of voxels directly proportional to $n_A$.

Figure 7 showcases the effect of $n_A \in \{5, 10, 20\}$ on the DSC score for the two cartilage classes (FC, TC). In both cases $n_A = 10$ yields the best results, while $n_A = 5$ and $n_A = 20$ induce a slight drop in performance for the FC class. The same effects are observed for the TC class, albeit more pronounced. The concept of the bias-variance trade-off is useful in interpreting this behaviour: a small number of atlases implies that more voxels are sampled from each atlas, and labelling errors emanating from potential misalignments between the target and an atlas image may not be efficiently countered by the remaining atlases. On the other hand, a large $n_A$ means that fewer voxels are sampled from each atlas, greatly increasing the variance and noise in $X_g^A$. The optimal strategy is to strike a balance between the two: choosing enough atlases to cover as much label information content as possible, but not so many as to render the samples noisy.

Figure 7. Effect of number of atlases on DSC score.

9.1.2. Percentage of labelled voxels sslperc

The parameter $ssl_{perc}$ effectively controls the amount of supervision available to our method. For a fixed number of atlases, a greater $ssl_{perc}$ means that more voxels are drawn from each atlas to form the initial labelled global dataset. We test the effect of this parameter by evaluating models with $ssl_{perc} \in \{10\%, 20\%, 30\%\}$ (Figure 8).

Figure 8. Effect of amount of supervision on DSC score.

As can be seen, the segmentation performance for both FC and TC is directly proportional to the amount of supervision allowed. As expected, increasing sslperc yields better results, since more class-specific content is available to the learning algorithm. However, the accompanying computational cost rises considerably, suggesting that a more moderate amount of supervision should be considered.

9.1.3. Number of nearest neighbours k

The number $k$ of nearest neighbours is a significant parameter controlling the construction of the sparse graphs in both stages of the label propagation. Figure 9 shows the variation of the DSC score for neighbourhood sizes $k \in \{10, 25, 50\}$.

Figure 9. Effect of number of sparse neighbors on DSC score.

Better results are obtained when $k$ assumes values at the lower end of the range, with the best obtained for $k = 10$, while $k = 50$ performs markedly worse. This result can also be explained through the bias-variance trade-off. A large neighbourhood (high variance) allows each voxel to be connected with neighbours sharing a similar spectral presentation which may nevertheless belong to a different class (e.g. Femoral vs Tibial cartilage). Limiting $k$ forces the graph construction to draw edges only between the most similar voxels. Contrary to the selection of $n_A$, here the choice of incurring a high bias (small $k$) works best. A plausible explanation lies in the multi-view learning mechanism: each view of a specific neighbourhood may be characterised by a high bias, but their combination seems to have a regularising effect nullifying that bias.
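As a simplified illustration of small-k graph construction, the snippet below builds a symmetrised sparse k-NN graph with scikit-learn; in MV-KCS the edge weights would instead come from the sparse-coding step of Section 5.2, and the random features merely stand in for the voxel descriptors.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(1000, 64)          # stand-in voxel feature vectors
W = kneighbors_graph(X, n_neighbors=10, mode='distance', include_self=False)
W = 0.5 * (W + W.T)                   # symmetrise the sparse k-NN graph
```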

9.1.4. Structural regularisation parameter λ

Parameter $\lambda$ controls the amount of structural regularisation in AMUSE (Equation 4). In the extreme case $\lambda = 0$, the problem degenerates to the single-view case, while for $\lambda \to \infty$ the objective is trivialised to computing the combined multi-view graph as a linear combination of the single-view ones. Figure 10 highlights its effect on the DSC score for both cartilage classes, for values on a logarithmic scale, $\lambda \in \{0.01, 1, 100\}$.

Figure 10. Effect of structural regularization parameter on DSC score.

It is evident that increasing the value of λ yields better segmentation accuracy for both FC and TC, up to a certain point, while performance drops markedly for extremely small values. Interestingly, the segmentation accuracy on FC seems to diminish for the larger values of λ. Since the increase of λ strengthens the contribution of the combined graph learning, it seems that the multi-view component of MV-KCS is essential to the overall performance. Care must be taken however, since past a certain point the objective (4) is dominated by the second term and the solution is trivialised.

9.1.5. Convex combination parameter μ

Finally, we validate the effect of the convex combination parameter $\mu$ (Equation 27), which controls the relative contribution of the local and global datasets, $X_l$ and $X_g$, respectively (stage-2).

As can be seen from Figure 11, the performance follows a somewhat symmetrical pattern, achieving the best results for both classes in the $[0.4, 0.6]$ range. This suggests that a balance between local and global information is the optimal strategy. Setting $\mu$ to either $\mu = 1$ or $\mu = 0$ entails that the label-transfer mechanism relies exclusively on either global or local information, respectively. In the former case, any potential errors in the initial labelling of the voxels in $X_g^T$ keep propagating to the out-of-sample batches, without the regularising effect of $X_l^A$; in the latter, the rich label content of $X_g^A$ is completely discarded.

Figure 11. Effect of convex combination parameter on DSC score.

9.2. Test cases – results

Table 3 provides a thorough evaluation of our proposed labelling method MV-HyLP under all testing cases outlined previously. The reported results refer to both cartilage classes (FC, TC), while the performance quality is assessed in terms of Recall and Precision, as well as the volumetric indices DSC, VD and VOE.

Table 3. Performance comparison (means and std. deviations) of all test cases for the proposed MV-HyLP method. In case #1, GLM denotes the combination of GLCM and GLRLM, while All denotes the inclusion of all feature descriptors. Stacked and Fused in case #2 refer to the feature descriptors (HOG, LBP, HCGF) being concatenated and fused through AMUSE, respectively. Finally, in case #3, Multi-feature corresponds exactly to the last row of case #1.

9.2.1. Case #1: single-view vs multi-view

This case highlights the effects on performance of progressively incorporating additional feature descriptors in MV-KCS. The first model corresponds to the trivial single-view case, utilising solely the HOG descriptor. HOGs are very powerful in encoding information content pertaining to shape and orientation. Since cartilage anatomy has a very specific geometry, being locally approximated by a thin 3D plate, HOG manages to yield a satisfactory segmentation performance (DSC = (88.56%, 84.03%)). Although distinctive, HOG is not the best choice for capturing texture information, a salient feature of biomedical images.

This limitation is confronted by incorporating the LBP descriptor as an additional view. LBPs are excellent at capturing texture information, making them an ideal synergist to their HOG counterparts, a point clearly reflected in the improved segmentation scores (DSC = (89.67%, 85.79%)). Under the multi-view setting, each voxel now participates in two single-view sparse graphs, with its spectral neighbourhood in each case reflecting a different notion of similarity. Voxels of different classes exhibiting similar HOG descriptors may assume substantially different characterisations under the LBP view, and vice versa. In such a case, the affinity matrix $S = \alpha^{(\mathrm{HOG})} W^{(\mathrm{HOG})} + \alpha^{(\mathrm{LBP})} W^{(\mathrm{LBP})}$ will not be incentivised to keep each voxel in the neighbourhood of the other, and a potential source of erroneous label propagation is effectively eliminated. On the other hand, voxels with high affinity across all views are more likely to share the same label. This assumption is reflected in the construction of the combined graph, reinforcing the propagation of label content between them.
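The combination step itself reduces to a weighted sum of the per-view affinity matrices, as the sketch below shows; in MV-KCS the weights come out of the AMUSE optimisation, whereas here they are simply supplied.

```python
import numpy as np

def combined_affinity(views, alphas):
    """S = sum_v alpha_v * W_v over per-view affinity matrices W_v,
    with the weights normalised to a convex combination."""
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()
    return sum(a * W for a, W in zip(alphas, views))
```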

The above discussion remains intact when adding two more descriptors, and finally a fifth. Incorporating the GLCM and GLRLM features together, we observe a slight increase in performance (DSC = (90.37%, 86.54%)) for both classes of interest. Since GLCM and GLRLM mainly capture texture, as does LBP, and are variants of a similar underlying encoding, the additional views are not distinctive enough to provide a remarkable improvement. On the other hand, the inclusion of HCGF yields a more noticeable effect (DSC = (91.76%, 88.65%)). This can be attributed to the structure of the HCGF descriptor, which incorporates multiple sources of low-order statistics (location, intensity, first- and second-order derivatives, etc.) in its encoding, and thus offers a substantially different voxel characterisation relative to the preceding descriptors. Figure 12 showcases the gradual improvements in segmentation accuracy with the progressive incorporation of additional feature descriptors, for a specific subject.

Figure 12. Segmentation results for femoral (FC) and tibial (TC) cartilage for the four instances of case #1 (left to right: Ground Truth, HOG, HOG & LBP, HOG & LBP & GLCM, all features). The first part of the figure illustrates a case of successful application of MV-KCS, while the second part presents an instance characterised by poor performance. In both cases, however, the positive effect of incorporating additional views is noticeable.

The results seem to indicate that multi-view learning is a powerful tool, significantly boosting the performance. Nevertheless, care must be taken so that the selected views are complementary to each other.

Finally, to establish that the reported results are statistically different, the Friedman statistical test (Friedman Citation1937) is performed. The test yields a $\chi^2$ value of 312.6 with 3 degrees of freedom, corresponding to a p-value of $6.23 \times 10^{-9}$, rejecting the null hypothesis of statistical equivalence. Following that, a post-hoc Nemenyi test (Nemenyi Citation1963) is carried out to evaluate each pair individually. The results in Table 4 ($\alpha = 0.05$) suggest that, statistically, the inclusion of additional views further boosts segmentation performance.
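The statistical protocol can be reproduced with standard tooling, as sketched below; the random scores stand in for the per-image DSC results, and the third-party scikit-posthocs package is assumed for the Nemenyi test.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

scores = np.random.rand(507, 4)          # images x test-case configurations
stat, p = friedmanchisquare(*[scores[:, j] for j in range(4)])
if p < 0.05:                             # configurations differ significantly
    pairwise_p = sp.posthoc_nemenyi_friedman(scores)   # pairwise p-value matrix
```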

Table 4. Pairwise comparisons using the Nemenyi test for the cartilage DSC results of the test cases in Table 3. Reported are the corresponding p-values. The binary indices in the subscript of each method's name, MV-HyLP$_{xxxxx}$, indicate the inclusion or exclusion of the respective feature descriptors HOG, LBP, GLCM, GLRLM, HCGF.

9.2.2. Case #2: feature stacking vs multi-view graph learning

The purpose of this experiment is to evaluate two different approaches to combining feature descriptors: the multi-view learning implemented in MV-KCS, as contrasted with the simple stacking of multiple features. To keep the dimensionality of the stacked feature vector at a reasonable scale, the experiments consider the least redundant feature subset (HOG, LBP, HCGF). The results are reported in the second part of Table 3. It is clear that combining the views in a more structured manner yields superior performance (DSC = (90.67%, 87.12%)) for both the FC and TC classes. This is because simply stacking feature vectors most likely introduces considerable noise, while also distorting the similarity between data. This last point hinges heavily on the size of the feature vectors: in a high-dimensional space, there is a great number of ways to define similarity between voxels, and only a very small subset of those correspond to label similarity. A learning algorithm is therefore likely to identify the 'wrong' type of similarity, adversely affecting performance (DSC = (89.65%, 85.43%)). By combining multiple views via MV-KCS, through structural regularisation on a combined graph, we avoid the dimensionality issue and allow for a more natural integration of the information content supplied by each view. The superiority of the multi-view setting over feature stacking is statistically verified by the Wilcoxon signed-rank test (Wilcoxon Citation1946). The obtained p-value ($p = 4.8 \times 10^{-3}$) confirms this conclusion.

9.2.3. Case #3: multiple features vs multiple modalities

This final test case contrasts the two multi-view settings against each other. Specifically, we examine the merit of integrating multiple feature descriptors compared to utilising multiple imaging sequences (results in the third part of Table 3). The Multi-feature case exhibits the best performance (DSC = (92.56%, 89.91%)), holding a noticeable margin over the Multi-modal one (DSC = (89.93%, 87.31%)). Generating multiple data views via distinctive encodings seems to be the optimal strategy. By contrast, multiple modalities described by a single spectral descriptor yield redundant representations, limiting their ability to provide an effective information fusion.

9.3. Comparative results

The comparative results in Table 5 highlight that both our methods, MV-RegLP and MV-HyLP, achieve superior results against the competing methodologies across all examined measures. This effect is observed in both classes, with FC showcasing a slightly larger performance boost compared to TC.

Table 5. Summary of segmentation performance measures (means and std. deviations) on the two cartilage classes of our proposed method MV-RegLP and its modification MV-HyLP, compared to Patch-Based Sparse-Coding (PBSC), Patch-Based Sparse-Coding with stacked features (PBSC(stacked)), Patch-Based joint label fusion (PB-JLF), Triplanar CNN, SegNet, MM-CNN-FC, VoxResNet, DenseVoxNet, KCB-Net, CAN3D, 3D-CNN + SSM, DAN, UA-MT and SS-DTC. The hyperparameter values for MV-RegLP and MV-HyLP are those obtained after the parameter sensitivity analysis in Section 8.3.

The baseline PBSC method achieves the lowest overall score across all competitors (DSC = (82.23%, 78.85%)). However, its variant incorporating the stacked feature descriptors offers a noticeable improvement in accuracy (DSC = (84.35%, 81.91%)). Both methods build an extensive patch library PL for each target voxel by exhaustively extracting patches from its neighbourhood. However, the feature vectors comprising the patch library assume substantially different characterisations in each case: PBSC forms the feature vector by simply stacking the voxels' intensities, while PBSC(stacked) leverages the three powerful descriptors HOG, LBP and HCGF (Section 3.2.3). The incorporation of information relevant to texture, shape and geometry via the above stacked features demonstrably benefits PBSC(stacked), allowing it to more efficiently distinguish between similarly appearing classes (Femoral & Tibial Cartilage, Cartilages & Background). Finally, the PB-JLF method obtains the best overall performance among the three patch-based methods considered here. The incorporation of an extra deformable registration step on top of the affine one proves beneficial with respect to the accuracy of the resulting segmentation maps.

The eight supervised deep learning models exhibit relatively comparable results, with all of them achieving superior outcomes in both femoral and tibial cartilage segmentation with respect to the baseline patch-based methods. Overall, the 3D-CNN + SSM demonstrates better performance across most measures (DSC = (90.18%, 88.74%)), showcasing the efficacy of combining a fully 3D convolutional network with a sophisticated post-processing regularisation step to finalise the segmentation maps. KCB-Net comes second, demonstrating state-of-the-art segmentation accuracy (DSC = (88.92%, 87.96%)) owing to its refined training strategy incorporating both unsupervised and subsequent supervised learning steps. CAN3D reports a slightly lower performance than KCB-Net w.r.t. DSC scores (DSC = (88.32%, 86.68%)), showcasing the effectiveness of the Context Aggregation Network (CAN) framework. SegNet (DSC = (87.22%, 85.71%)) ranks just below CAN3D in both femoral & tibial cartilage. DenseVoxNet (DSC = (87.12%, 85.05%)) and VoxResNet (DSC = (85.81%, 84.52%)), utilising similar strategies based on skip connections, achieve comparable results on both cartilage components, albeit lower than SegNet, CAN3D and especially KCB-Net. MM-CNN-FC (DSC = (85.09%, 84.12%)) achieves the second-lowest performance among the deep models, albeit still superior to the patch-based ones. Finally, the Triplanar CNN showcases the lowest overall performance among the deep learning methods (DSC = (84.98%, 83.19%)), demonstrating that decoupling the spatial cohesion of the input image patches by splitting them into three separate planes is a sub-optimal choice.

The next three methods, utilising the semi-supervised setting in some capacity, demonstrate superior results both with regard to the patch-based methods and compared to the standard deep learning ones. Specifically, DAN (DSC = (88.65%, 87.61%)) reports superior performance to all previous methods with the exception of KCB-Net, while UA-MT (DSC = (89.61%, 88.13%)) and SS-DTC (DSC = (91.07%, 89.42%)) achieve better overall segmentation results than all previously reported methods. The performance of these approaches strengthens the argument for the semi-supervised setting, demonstrating the effectiveness of incorporating additional unlabelled data into the learning process.

Regarding our proposed labelling methods, MV-RegLP (DSC = (90.23%, 88.82%)) outperforms both patch-based methods by a significant margin, while exhibiting slightly better or similar results to those achieved by the deep learning-based ones (supervised or semi-supervised). The combined synergy of multi-view integration, semi-supervised learning, representation through sparse neighbourhoods and aggregation of global and local information proves to be an effective tool. MV-HyLP pushes the performance even further (DSC = (92.56%, 89.91%)) by introducing an additional spatial scale into the label-transfer mechanism, through the concept of LMEs, demonstrably outperforming the rest of the competing methods.

To statistically validate the observed differences across the competing methods, the Friedman statistical test is initially performed. The reported $\chi^2$ value with 5 degrees of freedom is 441.81, corresponding to a p-value of less than $2.2 \times 10^{-16}$, rejecting the null hypothesis of statistical equivalence among the competitors. This allows for the execution of Nemenyi's pairwise post-hoc test, to identify the statistically superior method. The reported p-values in Table 6 suggest that the performance discrepancy between PBSC and PBSC(stacked) is statistically significant, highlighting the positive impact of combining multiple descriptors instead of relying on raw voxel intensities alone, even in the trivial case of stacking them together. Regarding the eight supervised deep learning methods employed, all are proven to be statistically superior to the patch-based ones. More notably, all five methods evaluated in this study that fall, in some capacity, within the semi-supervised learning paradigm are demonstrated to achieve superior performance with regard to their fully supervised counterparts, with the sole exception being the 3D-CNN + SSM method. Concretely, the two methods employing the teacher-student framework (DAN, UA-MT) achieve statistically superior results compared to the standard patch-based and deep learning supervised approaches, showcasing its effectiveness in this particular application. Our own MV-RegLP manages to achieve better overall results, with SS-DTC and 3D-CNN + SSM, in turn, superseding it. Finally, our proposed MV-HyLP compares favourably against all previously reported methods.

Table 6. Pairwise comparisons using the Nemenyi test for the cartilage DSC results of the competing methodologies of Table 5. Reported are the corresponding p-values.

10. Conclusions & future work

The proposed MV-KCS successfully incorporates multi-view and semi-supervised learning into the multi-atlas patch-based segmentation framework. Leveraging multiple views of the data to learn a combined sparse graph, it achieves state-of-the-art segmentation performance by integrating information along two distinct axes, spectral-spatial and global-local. The proposed labelling mechanisms within MV-KCS, namely MV-RegLP and MV-HyLP, were proven statistically superior to existing patch-based and deep learning-based approaches, with MV-HyLP yielding the best results (DSC = 92.56% (FC), DSC = 89.91% (TC)). Furthermore, a comparison with other state-of-the-art methods in the knee cartilage segmentation literature firmly places MV-KCS among the top-performing methodologies. The implementation of both MV-RegLP and MV-HyLP is publicly available (see Footnote 2).

While demonstrating satisfactory results, the proposed MV-RegLP and MV-HyLP are not without limitations. For these methods to perform effectively, an atlas library has to be constructed for each target image to be segmented, significantly increasing the computational load and execution time when such libraries are not already available. Furthermore, determining voxel similarity in the original feature space, which underpins the construction of the sparse graphs, may not adequately capture the full information content of the voxels, possibly leading to sub-optimal representations. A potential remedy may lie in the use of deep models, which are better suited to capturing the varying degrees of similarity between the data.

Our future work will primarily proceed along three research directions: 1) evaluation of alternative multi-view semi-supervised learning schemes in knee cartilage segmentation; 2) application of the suggested methods to different medical imaging domains (e.g. brain segmentation); and 3) employment of deeper models within the graph-learning framework (e.g. Graph Neural Networks).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Due to the nature of the research and [ethical/legal/commercial] restrictions, supporting data is not available.

Additional information

Notes on contributors

Christos G. Chadoulos

Christos G. Chadoulos received the B.S. and M.S. degrees in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2017. He is currently pursuing the Ph.D. degree at the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki. His research interests include computer vision, image analysis, and machine learning.

Dimitrios E. Tsaopoulos

Dimitrios E. Tsaopoulos graduated with a BSc from the University of Thessaly, Greece, in the field of Sports Science in 2003. He received his PhD from Manchester Metropolitan University in 2007, focusing on in vivo human knee joint mechanics. He is currently a Research Director (Grade A) in the field of Biomechanics at CERTH/IBO. Dimitris aspires to the advancement of research and technologies in areas of human and animal movement biology, aiming at the diagnosis, assessment and treatment of locomotor dysfunction. He has participated in more than 20 national (UK and GR) and European research projects.

Serafeim P. Moustakidis

Serafeim P. Moustakidis has wide experience in computational intelligence, machine learning and data processing, with more than 12 years of research experience in various fields. His research focuses on developing novel algorithms for solving important existing and emerging problems. His main scientific interests cover various application fields such as Deep Learning, Big Data, Biomechanics, Bio-economy, Health, Remote Sensing, Energy Optimization, Non-Destructive Testing (NDT) and machine learning-empowered imaging. He has been involved in the technical implementation and the scientific or overall management of 25 R&D projects with a total budget of 35 million Euros. He has worked for several research organisations across Europe.

John B. Theocharis

John B. Theocharis (M'90) received the degree in electrical engineering and the Ph.D. degree from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1980 and 1985, respectively. He is currently a Professor in the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki. His research activities include fuzzy systems, neural networks, evolutionary algorithms, pattern recognition and image analysis. He has published numerous papers in several application areas such as neuro-fuzzy modeling, power demand and wind speed prediction, and land cover classification and segmentation from remotely sensed images. Recently his research has focused on addressing challenges in medical imaging using machine learning and deep learning techniques.

Notes

1. Here, we make the concession of using $x_i$ interchangeably to refer both to a voxel location and to its description.

References

  • Ambellan F, Tack A, Ehlke M, Zachow S. 2019. Automated segmentation of knee bone and cartilage combining statistical shape knowledge and convolutional neural networks: data from the osteoarthritis initiative. Med Image Anal. 52:109–23. doi: 10.1016/j.media.2018.11.009.
  • Badrinarayanan V, Kendall A, Cipolla R. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. Ieee T Pattern Anal. 39(12):2481–2495. arXiv:1511.00561. doi: 10.1109/TPAMI.2016.2644615.
  • Banerjee J, Moelker A, Niessen WJ, Van Walsum T. 2013. 3D LBP-based rotationally invariant region description Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics) 7728 LNCS (PART 1). 26–37. doi: 10.1007/978-3-642-37410-4_3.
  • Belkin M, Niyogi P. 2005. Towards a theoretical foundation for Laplacian-based manifold methods Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics) 3559 LNAI. 486–500. doi: 10.1007/11503415_33.
  • Belkin M, Niyogi P, Sindhwani V. 2006. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 7(2006):2399–2434.
  • Biscaldi E, Barra F, Evangelisti G, Ferrero S. 2018. Radiological assessment, endometrial cancer: risk factors. Management And Prognosis. 3:113–144. doi: 10.2307/3578513.
  • Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. 2010. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 3(1):1–122. doi: 10.1561/2200000016.
  • Buades A, Coll B, Morel J-M. 2011. Non-local means denoising. Image Processing On Line. 1:208–212. doi: 10.5201/ipol.2011.bcm_nlm.
  • Chadoulos C, Moustakidis S, Tsaopoulos D, Theocharis J. 2022. Multi-atlas segmentation of knee cartilage by propagating labels via semi-supervised learning. ACM International Conference Proceeding Series. p. 76–82. doi: 10.1145/3524086.3524098.
  • Chen H, Dou Q, Yu L, Heng P-A. 2016. VoxResNet: deep voxelwise residual networks for volumetric brain segmentation. p. 1–9. arXiv:1608.05895. http://arxiv.org/abs/1608.05895.
  • Dai W, Woo B, Liu S, Marques M, Tang F, Crozier S, Engstrom C, Chandra S. 2021. CAN3D: fast 3D knee MRI segmentation via compact context aggregation. Proceedings - International Symposium on Biomedical Imaging (ISBI). p. 1505–1508. arXiv:2109.05443v2. doi: 10.1109/ISBI48211.2021.9433784.
  • Ebrahimkhani S, Jaward MH, Cicuttini FM, Dharmaratne A, Wang Y, de Herrera AG. 2020. A review on segmentation of knee articular cartilage: from conventional methods towards deep learning. Artif Intell Med. 106(January 2019):101851. doi: 10.1016/j.artmed.2020.101851.
  • Felipe J, McCombie J. 2002. Burden of major musculoskeletal conditions. ERD Working Paper Series. 81(19):1–27.
  • Felson DT. 2000. NIH Conference Osteoarthritis: new insights. Ann Intern Med. 133(8):635–639. http://www.annals.org/content/133/8/635.short.
  • Folkesson J, Dam EB, Olsen OF, Pettersen PC, Christiansen C. 2007. Segmenting articular cartilage automatically using a voxel classification approach. IEEE Trans Med Imaging. 26(1):106–115. doi: 10.1109/TMI.2006.886808.
  • Friedman M. 1937. The use of ranks to avoid the Assumption of Normality Implicit in the analysis of variance. J Am Stat Assoc. 32(200):675–701. doi: 10.1080/01621459.1937.10503522.
  • Galloway MM. 1975. Texture analysis using gray level run lengths. Comput Graph Image Process. 4(2):172–179. doi: 10.1016/s0146-664x(75)80008-6.
  • Glyn-Jones S, Palmer AJ, Agricola R, Price AJ, Vincent TL, Weinans H, Carr AJ. 2015. Osteoarthritis. Lancet. 386(9991):376–387. doi: 10.1016/S0140-6736(14)60802-3.
  • Guo Z, Li X, Huang H, Guo N, Li Q. 2019. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans Radiat Plasma Med Sci. 3(2):162–169. doi: 10.1109/TRPMS.2018.2890359.
  • Hajnal JV, Hill DL, Hawkes DJ. 2001. Medical image registration. Medical Image Registration. 46:1–383. doi: 10.1051/epn:2000401.
  • Haralick R, Shanmugam K, Dinstein I. 1973. Textural features for image classification. IEEE Trans Syst Man Cybern. SMC-3(6):610–621. doi: 10.1109/TSMC.1973.4309314.
  • Huang J, Nie F, Huang H. 2015. A new simplex sparse learning model to measure data similarity for clustering, IJCAI International Joint Conference on Artificial Intelligence 2015-Janua (Ijcai); July 25-31; Buenos Aires, Argentina. p. 3569–3575.
  • Karasuyama M, Mamitsuka H. 2013. Multiple graph label propagation by sparse integration. IEEE Trans Neural Netw Learn Syst. 24(12):1999–2012. doi: 10.1109/TNNLS.2013.2271327.
  • Kläser A, Marszałek M, Schmid C. 2008. A spatio-temporal descriptor based on 3D-gradients. BMVC 2008 - Proceedings of the British Machine Vision Conference. doi: 10.5244/C.22.99.
  • Liu F, Zhou Z, Jang H, Samsonov A, Zhao G, Kijowski R. 2018. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magn Reson Med. 79(4):2379–2391. doi: 10.1002/mrm.26841.
  • Luo X, Chen J, Song T, Wang G. 2021. Semi-supervised medical image segmentation through dual-task consistency. 35th AAAI Conference on Artificial Intelligence (AAAI 2021). p. 8801–8809. arXiv:2009.04448. doi: 10.1609/aaai.v35i10.17066.
  • Milletari F, Navab N, Ahmadi SA. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation. Proceedings - 2016 4th International Conference on 3D Vision (3DV). p. 565–571. arXiv:1606.04797. doi: 10.1109/3DV.2016.79.
  • Mohar B. 1991. The Laplacian Spectrum of Graphs. Graph Theory, Combinatorics and Applications. Wiley; p. 871–898.
  • Navneet D, Triggs B. 2020. Histogram of oriented gradients for human detection. IEEE Trans Ind Informat. 16(7):4714–4725. doi: 10.1109/TII.2019.2950094.
  • Nemenyi P. 1963. Distribution-free multiple comparisons [Ph.D. thesis]. Princeton University.
  • Nie F, Tian L, Wang R, Li X. 2020. Multiview semi-supervised learning model for image classification. IEEE Trans Knowl Data Eng. 32(12):2389–2400. doi: 10.1109/TKDE.2019.2920985.
  • Nikolopoulos FP, Zacharaki EI, Stanev D, Moustakas K. 2020. Personalized knee geometry modeling based on multi-atlas segmentation and mesh refinement. IEEE Access. 8:56766–56781. doi: 10.1109/ACCESS.2020.2982061.
  • Norman B, Pedoia V, Majumdar S. 2018. Use of 2D U-net convolutional neural networks for automated cartilage and meniscus segmentation of knee MR imaging data to determine relaxometry and morphometry. Radiology. 288(1):177–185. doi: 10.1148/radiol.2018172322.
  • Nyul LG, Udupa JK. 2000. Standardizing the MR image intensity scales: making MR intensities have tissue specific meaning, medical imaging 2000. Image Display And Visualization. 3976:496–504.
  • Ojala T, Pietikäinen M, Harwood D. 1996. A comparative study of texture measures with classification based on feature distributions. Pattern Recogn. 29(1):51–59. doi: 10.1016/0031-3203(95)00067-4.
  • Öztürk CN, Albayrak S. 2016. Automatic segmentation of cartilage in high-field magnetic resonance images of the knee joint with an improved voxel-classification-driven region-growing algorithm using vicinity-correlated subsampling. Comput Biol Med. 72:90–107. doi: 10.1016/j.compbiomed.2016.03.011.
  • Pakin SK, Tamez-Pena JG, Totterman S, Parker KJ. 2002. Segmentation, surface extraction, and thickness computation of articular cartilage, medical imaging 2002. Image Processing. 4684:155. doi: 10.1117/12.467113.
  • Peng Y, Zheng H, Liang P, Zhang L, Zaman F, Wu X, Sonka M, Chen DZ. 2022. KCB-Net: a 3D knee cartilage and bone segmentation network via sparse annotation. Med Image Anal. 82(August):102574. doi: 10.1016/j.media.2022.102574.
  • Peterfy CG, Schneider E, Nevitt M. 2008. The osteoarthritis initiative: report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthritis Cartilage. 16(12):1433–1441. doi: 10.1016/j.joca.2008.06.016.
  • Prasoon A, Petersen K, Igel C, Lauze F, Dam E, Nielsen M. 2013. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics) 8150 LNCS (PART 2). 246–253. doi: 10.1007/978-3-642-40763-5_31.
  • Rister B, Horowitz MA, Rubin DL. 2017. Volumetric image registration from invariant keypoints. IEEE Trans Image Process. 26(10):4900–4910. doi: 10.1109/TIP.2017.2722689.
  • Rousseau F, Habas PA, Studholme C. 2011. A supervised patch-based approach for human brain labeling. IEEE Trans Med Imaging. 30(10):1852–1862. doi: 10.1109/TMI.2011.2156806.
  • Sarwinda D, Bustamam A. 2018. 3D-HOG features-based classification using MRI images to early diagnosis of Alzheimer's disease. Proceedings - 17th IEEE/ACIS International Conference on Computer and Information Science (ICIS). p. 457–462. doi: 10.1109/ICIS.2018.8466524.
  • Schneider E, NessAiver M, White D, Purdy D, Martin L, Fanella L, Davis D, Vignone M, Wu G, Gullapalli R. 2008. The osteoarthritis initiative (OAI) magnetic resonance imaging quality assurance methods and results. Osteoarthritis Cartilage. 16(9):994–1004. doi: 10.1016/j.joca.2008.02.010.
  • Sethian J. 1999. Advancing interfaces: level set and fast marching methods, in: level set methods and fast marching methods. 2nd. Cambridge Press. Ch. 16p. 12. http://www.math.berkeley.edu/sethian/2006/Papers/sethian.iciamproceedings.1999.pdf.
  • Simonyan K, Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition, 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings; May 7 - 9, 2015; San Diego, CA, USA. p. 1–14.
  • Sled JG, Zijdenbos AP, Evans AC. 1998. A nonparametric method for automatic correction of intensity nonuniformity in mri data. IEEE Trans Med Imaging. 17(1):87–97. doi: 10.1109/42.668698.
  • Sun S. 2011. Multi-view Laplacian support vector machines Lecture notes In Computer Science (Including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 7121 LNAI (PART 2). 209–222. doi: 10.1007/978-3-642-25856-5_16.
  • Sun S. 2013. A survey of multi-view machine learning. Neural Comput Applic. 23(7–8):2031–2038. doi: 10.1007/s00521-013-1362-6.
  • Tajbakhsh N, Shin JY, Gurudu SR, Hurst RT, Kendall CB, Gotway MB, Liang J. 2016. Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans Med Imaging. 35(5):1299–1312. arXiv:1706.00712. doi:10.1109/TMI.2016.2535302.
  • Tarvainen A, Valpola H. 2017. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Advances In Neural Information Processing Systems 2017-Decem. 30:1196–1205.
  • Wilcoxon F. 1946. Individual comparisons of grouped data by ranking methods. J Econ Entomol. 39(6):269. doi: 10.1093/jee/39.2.269.
  • Xie H, Pan Z, Zhou L, Zaman FA, Chen DZ, Jonas JB, Xu W, Wang YX, Wu X. 2022. Globally optimal OCT surface segmentation using a constrained IPM optimization. Opt Express. 30(2):2453. doi: 10.1364/oe.444369.
  • Yin Y, Zhang X, Williams R, Wu X, Anderson DD, Sonka M. 2010. LOGISMOS-layered optimal graph image segmentation of multiple objects and surfaces: cartilage segmentation in the knee joint. IEEE Trans Med Imaging. 29(12):2023–2037. doi: 10.1109/TMI.2010.2058861.
  • Yu L, Cheng JZ, Dou Q, Yang X, Chen H, Qin J, Heng PA. 2017. Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convnets Lecture notes in Computer Science (Including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 10434 LNCS. 287–295. arXiv:1708.00573. doi:10.1007/978-3-319-66185-8_33.
  • Yu L, Wang S, Li X, Fu CW, Heng PA. 2019. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation lecture notes in Computer Science (Including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 11765 LNCS. 605–613. arXiv:1907.07034. doi:10.1007/978-3-030-32245-8_67.
  • Zang F, Zhang JS. 2012. Label propagation through sparse neighborhood and its applications. Neurocomputing. 97:267–277. doi: 10.1016/j.neucom.2012.03.017.
  • Zhang D, Guo Q, Wu G, Shen D. 2012. Sparse patch-based label fusion for multi-atlas segmentation lecture notes in Computer Science (Including Subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 7509 LNCS (Mv). 94–102. doi: 10.1007/978-3-642-33530-3_8.
  • Zhang X, Hu L, Zhang L. 2013. An efficient multiple kernel computation method for regression analysis of economic data. Neurocomputing. 118:58–64. doi: 10.1016/j.neucom.2013.02.013.
  • Zhang J, Marszałek M, Lazebnik S, Schmid C, Local features and kernels for classification of texture and object categories: a comprehensive study, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2006 (2006) 1–28. doi:10.1109/CVPRW.2006.121.
  • Zhang Z, Xu Y, Yang J, Li X, Zhang D. 2015. A survey of sparse representation: algorithms and applications. IEEE Access. 3:490–530. arXiv:1602.07017. doi: 10.1109/ACCESS.2015.2430359.
  • Zhang Y, Yang L, Chen J, Fredericksen M, Hughes DP, Chen DZ. 2017. Deep adversarial networks for biomedical image segmentation utilizing unannotated images lecture notes in Computer Science (Including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 10435 LNCS. 408–416. doi: 10.1007/978-3-319-66179-7_47.
  • Zhao J, Xie X, Xu X, Sun S. 2017. Multi-view learning overview: recent progress and new challenges. Inf Fusion. 38:43–54. doi: 10.1016/j.inffus.2017.02.007.
  • Zhou Z, Zhao G, Kijowski R, Liu F. 2018. Deep convolutional neural network for segmentation of knee joint anatomy. Magn Reson Med. 80(6):2759–2770. doi: 10.1002/mrm.27229.
  • Zhu X, Lafferty J, Cs L, Edu CMU. 2003. Semi-Supervised learning using Gaussian fields and harmonic functions, AISTATS 2005 - Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics; August 21-24 2003; Washington, DC, USA.