Research Article

STSG: A Short Text Semantic Graph Model for Similarity Computing Based on Dependency Parsing and Pre-trained Language Models

Article: 2321552 | Received 13 Jun 2023, Accepted 07 Feb 2024, Published online: 04 Mar 2024

ABSTRACT

Short text semantic similarity is a crucial research area in natural language processing that aims to predict the similarity between two sentences. Because short texts are sparse, words tend to be treated as isolated within a sentence and the correlations between words are ignored, which makes it difficult to compute global semantic information. To address this, a short text semantic graph (STSG) model based on dependency parsing and pre-trained language models is proposed in this paper. It uses syntactic information to obtain word dependency relationships and incorporates them into pre-trained language models to enhance the global semantic information of sentences, thereby addressing semantic sparsity more effectively. A text semantic graph layer based on the graph attention network (GAT) is also realized, which treats word vectors as node features and word dependencies as edge features. The attention mechanism of GAT can identify the importance of different word correlations and thus model word dependencies effectively. On the challenging short text semantic benchmark dataset MRPC, the STSG model achieves an F1-score of .946, an improvement of 2.16% over previous SOTA approaches. At the time of writing, STSG achieves a new SOTA performance on the MRPC dataset.

Introduction

Short text semantic similarity (STS) computation is a fundamental problem in natural language processing that aims to predict the similarity between two sentences. STS computation has numerous applications, especially in information retrieval. STS can calculate the relevance of a user's question to the retrieved content, addressing the issue of information overload and improving search strategies and results (Chandrasekaran and Mago Citation2021; Mengting et al. Citation2021). For example, paraphrase recognition is used to determine the category scores of two sentences in text classification, and STS is used to assess the similarity between sentences in question answering. However, it is important to note that recognition tasks for short texts differ from those for long texts such as news articles and magazines. In addition, because the content of short texts is often sparse (Mengting et al. Citation2021), it is challenging to accurately compute the semantic similarity between two sentences.

There are three primary families of methods for computing short text semantic similarity. Deep learning-based methods (Jonas and Aditya Citation2016; Weidong et al. Citation2022; Zhiguo, Haitao, and Abraham Citation2016) generate vector representations of sentences and calculate their similarity using neural networks. However, these methods require a large amount of computing power and may struggle to accurately capture the semantics of a sentence. BERT (Devlin et al. Citation2018), T5 (Raffel et al. Citation2020), DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023), and other approaches based on pre-trained language models are pre-trained on a large-scale unlabeled corpus and then fine-tuned on the downstream task, achieving good results in text semantic similarity computing. SimCSE (Tianyu, Xingcheng, and Danqi Citation2021), DiffCSE (Yung-Sung et al. Citation2022), and other contrastive learning-based approaches (Lingling et al. Citation2024) learn to distinguish between similar and dissimilar representations of data points. Although existing methods have achieved good results in computing text semantic similarity, two challenges remain for research on the semantic similarity of short texts:

  1. Sparsity of Short Texts. Short texts are typically less than 200 words and contain sparse information, making it difficult to extract effective feature words. In addition, short texts often contain unpredictable language variations, such as emoticons, acronyms, and internet slang, and lack contextual information, which can result in a significant amount of extraneous information. For example, the sentence "I like the larger screen size of the iPhone 6 compared to its predecessor" does not mention that the iPhone 6 Plus, a similar product, also has a larger screen than its predecessor. Short texts may lack such related words, making it difficult for models to reason about global semantic information, which affects the accuracy of text similarity.

  2. Word Relevance Modelling. One of the main challenges facing STS models is how to capture semantics and underlying context (Zhe et al. Citation2019; Ghafour, Jamshid Bagherzadeh, and Mohammad-Reza Citation2022). Deep embedding models have been studied for STS tasks with promising results (Jianguo et al. Citation2019). However, existing models usually assume that words are independent of each other and that all words and attributes are equally important for a sentence. This assumption is not always true. The semantic capacity of short texts is limited, and it is difficult to discriminate the importance of different relationships if the correlations between words are ignored.

Motivation: Using neural models in STS is not only simple and efficient, but can also integrate a wider range of informative knowledge. Syntactic information and dependency trees have been used in numerous studies (Yuanhe et al. Citation2021). Dependency trees can establish long-distance connections between key words, which helps the model extract relations between sentence pairs more effectively. Graph neural networks have recently been used to combine grammatical knowledge with pre-trained language models, achieving good outcomes (Hironori et al. Citation2022; Liang, Chengsheng, and Yuan Citation2019; Yangfan et al. Citation2021). However, these methods tend to focus solely on the feature learning of local nodes in the graph neural network, while disregarding the feature learning of neighboring nodes. Graph Attention Networks (GAT) can learn distinct weights for neighboring nodes by using the attention mechanism, addressing the limitations of traditional graph neural networks such as GCN (Velikovi et al. Citation2017).

To address the above challenges, we propose the short text semantic graph (STSG) model, which is based on dependency parsing and pre-trained language models. The STSG model uses syntactic information to extract word dependency relationships and incorporates them into pre-trained language models to enhance the global semantic information of sentences, so it can address semantic sparsity more effectively. We also propose a text semantic graph layer based on the graph attention network (GAT) (Velikovi et al. Citation2017), which treats word vectors as node features and the dependency parsing relationships of words as edge features. The attention mechanism of GAT enhances its ability to capture remote dependencies, allowing it to identify the importance of different word correlations and model word dependencies effectively. The STSG model consists of four layers. 1) The dependency parsing layer extracts the dependency parsing relations of a short text and obtains multiple relation triples. 2) The sentence encoding layer encodes the semantic information and extracts word vectors of the short text using the DeBERTav3 model (Pengcheng, Jianfeng, and Weizhu Citation2023). 3) The text semantic graph layer learns graph attention features from word vectors and syntactic dependencies. 4) The similarity computing layer builds a fine-tuning model that calculates the similarity between sentences with a fully connected network.

The main contributions of this paper are summarized as follows:

  1. A short text semantic graph (STSG) model that uses syntactic information is proposed. It extracts word dependency relationships and incorporates them into pre-trained language models to enhance the global semantic information of sentences.

  2. A GAT-based text semantic graph layer is proposed, which uses word vectors as node features and word dependencies as edge features. It improves word relevance modeling by identifying the importance of different words.

  3. Extensive evaluation and analysis show that our framework not only achieves new SOTA performance on MRPC but also has strong generalization capabilities. It is also more effective on low-resource datasets.

There are still some limitations in the STSG model. Texts longer than 512 tokens are not supported. The model computes semantic similarity for English text, but cross-lingual semantic similarity, such as for Chinese, is not supported. Due to limited computational resources, we are unable to conduct experiments on recent large pre-trained language models, such as GPT4 (Achiam et al. Citation2023).

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The proposed STSG model is described in Section 3. The experimental settings and results are presented in Section 4. Finally, the work is concluded and future prospects are given in Section 5.

Related Works

In this section, we first describe the related research on short text semantic similarity in Section 2.1, followed by related work on dependency parsing in Section 2.2. Finally, we present applications of short text semantic similarity in Section 2.3.

Short Text Semantic Similarity

For semantic similarity in short texts, deep learning-based methods are one important line of work. Zhiguo, Haitao, and Abraham (Citation2016) focus on the importance of non-similar parts between two sentences and use a two-channel convolutional neural network (CNN) to decompose similar and non-similar components, which improves the global semantic representation ability of sentences. Paul, Maarten, and Mihai (Citation2016) use a bidirectional LSTM model to obtain information from both directions of the input text, which allows them to capture the bidirectional semantic information of sentences. Weidong et al. (Citation2022) propose Re-LSTM, a weighted word-embedding long short-term memory network that reduces model parameters and computation. Methods based on pre-trained language models are the other main approach to semantic similarity of short texts. BERT (Devlin et al. Citation2018) uses an existing unlabeled corpus to pre-train the Transformer model to obtain word embedding vectors, and then fine-tunes the model to complete the sentence similarity computation. Joshi et al. (Citation2020) propose a pre-trained model (called SpanBERT) based on word segmentation, which masks random adjacent words. Kevin et al. (Citation2020) propose an efficient pre-training method called replaced token detection (RTD): during pre-training, a generator replaces words in the sentence, and a discriminator determines which words have been replaced. T5 (Raffel et al. Citation2020) uses a unified pre-trained language model to learn multiple different NLP tasks. DeBERTa (He et al. Citation2021) and DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023) use disentangled attention to improve BERT and further improve model performance. However, these methods lack a semantically enhanced representation of the text and do not take into account the dependencies between different words. Contrastive learning-based approaches for short text semantic similarity have also been extensively researched. Sentence-BERT (Nils and Iryna Citation2019) improves BERT by using siamese and triplet networks to update parameters and generate semantic embedding vectors. DiffCSE (Yung-Sung et al. Citation2022) presents an unsupervised contrastive learning framework for sentence embeddings that is sensitive to the differences between original and edited sentences, enhancing their representations. SimCSE (Tianyu, Xingcheng, and Danqi Citation2021) proposes a straightforward contrastive learning framework that predicts the input sentence itself in the contrastive objective, using only standard dropout as noise, and achieves state-of-the-art sentence embeddings. Lingling et al. (Citation2024) propose similarity and relative similarity strategies to identify false-negative samples in contrastive learning, which further improves performance significantly. Previous research methods are summarized in Table 1.

Table 1. Previous research methods.

Dependency Parsing

Recently, some methods have combined grammatical information with pre-trained language models and achieved good results on NLP tasks. Zhe et al. (Citation2019) proposed the ACV tree, which combines word embeddings and syntactic information, and developed the ACVT kernel to calculate sentence similarity. Minh Hieu and Philip (Citation2020) introduced syntax into the attention mechanism and considered part-of-speech embeddings, dependency-based embeddings, and contextualized embeddings to improve the performance of extractors. Yu et al. (Citation2020) propose a continual learning framework (called ERNIE) for deeply integrating knowledge, which continuously learns lexical, syntactic, and semantic knowledge by introducing more tasks. Yuanhe et al. (Citation2021) propose a dependency-driven relation extraction method (called a-GCN), which applies graph convolutional network attention to different context words in the syntactic dependency tree to distinguish the importance of different word dependencies. Guimin et al. (Citation2021) propose type-aware map memories (called TaMM) for relation extraction, which can encode sentence dependencies and dependency types. Binyuan et al. (Citation2022) propose the S2SQL model for the text-to-SQL task, which jointly encodes the text syntax and the database schema. However, existing grammar-based methods do not consider the feature learning of neighboring nodes and cannot handle dependencies between words well. Based on this, the proposed STSG uses dependency parsing to achieve an accurate understanding of the semantic relationships between sentences. We use a dependency parsing layer to extract multiple relation triples, and use a GAT to learn attention scores between nodes from the triples.

Application of Short Text Semantic Similarity

Short text semantic similarity has various applications, including text classification, text clustering, sentiment analysis, information retrieval, social networks, academic plagiarism detection, and specific domains (Mengting et al. Citation2021; Song and Hai Citation2022). Successful approaches in the biomedical domain include ontology-based and neural network-based metrics. For example, Dazhi et al. (Citation2023) utilized textual semantic similarity to identify cancer driver genes. They introduced GASN, a similarity network-driven gamma distribution test for gene identification, which combines machine learning and distributional statistical methods. Yan et al. (Citation2024) proposed a multiple similarity deep graph network to deal with user clustering in human-computer interaction using textual semantic similarity. Wenjuan et al. (Citation2024) also used textual semantic similarity to detect personality traits and proposed a PS-GCN model that integrates psychological knowledge and affective semantic features through graph convolutional networks. This model efficiently exploits textual similarity to extract affective features, resulting in a significant improvement in the classification accuracy of the personality detection task. Zhaorui et al. (Citation2023) employed textual semantic similarity to address the task of visual language comprehension. They proposed a new metric model, Semantic Similarity Distance (SSD), to mitigate semantic inconsistencies and bridge the gap between text and image.

The Proposed Method

This section presents a detailed explanation of STSG. First, a method for addressing short text sparsity is introduced, which uses the dependency parsing layer and the sentence coding layer. Next, the semantic graph layer is introduced to distinguish the relevance of different words, which improves word relevance modeling. Finally, the similarity computing layer is used for similarity computation, as shown in Figure 1.

Figure 1. Workflow of the proposed STSG.


Problem Definition

Definition 1.

(Short text semantic similarity) Given two short sentences $S_1 = \{p_1, p_2, \ldots, p_m\}$ and $S_2 = \{q_1, q_2, \ldots, q_n\}$, where $p$ and $q$ are the words in sentences $S_1$ and $S_2$, respectively, and $m$ and $n$ represent the lengths of $S_1$ and $S_2$, respectively, a classifier is trained to accurately predict the similarity between $S_1$ and $S_2$. Lin (Citation1998) derived a text similarity theorem based on information theory, as shown in Equation 1:

$$\text{Similarity}(S_1, S_2) = \frac{\log P(\text{common}(S_1, S_2))}{\log P(\text{description}(S_1, S_2))} \tag{1}$$

where $P$ represents probability, $\text{common}(S_1, S_2)$ is the information common to $S_1$ and $S_2$, and $\text{description}(S_1, S_2)$ is the description of all information in $S_1$ and $S_2$. The similarity ranges between 0 and 1. The $\log$ simplifies the calculation of probabilities.

Dependency Parsing Layer

The syntactic dependencies between the words in sentences $S_1$ and $S_2$ are extracted by the dependency parsing layer, which comprises a dependency parsing module and a dependency representing module. Dependency parsing is used to obtain syntactic components and analyze the relationships between them. The attention over different dependencies between words in a short text is then calculated using these dependency parsing features. To obtain the similarity of sentences $S_1$ and $S_2$, the model jointly computes the semantic representation of the two sentences. In this paper, we constructed our dependency parsing module using the DDParser toolkit (Shuai et al. Citation2020) to generate dependency analysis results. DDParser is capable of processing text in both Chinese and English (Shuai et al. Citation2020).
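As a concrete illustration of the parsing step, the snippet below shows how word dependency triples of the form (head, dependent, relation) could be extracted. It uses spaCy's English pipeline purely as a stand-in parser, since the exact DDParser invocation is not given in the text; the function name and output format are illustrative.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")   # English pipeline used here as a stand-in parser

def dependency_triples(sentence: str):
    """Return the token list and (head_index, dependent_index, relation) triples."""
    doc = nlp(sentence)
    words = [t.text for t in doc]
    triples = [(t.head.i, t.i, t.dep_) for t in doc if t.dep_ != "ROOT"]
    return words, triples

print(dependency_triples("I like the larger screen size of the iPhone 6"))
```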

Given a sentence $S = \{W_1, W_2, \ldots, W_n\}$, where $W_n$ is the $n$-th word in $S$, the result of dependency parsing yields the syntactic dependency graph $G = (V, E)$ of $S$, where $V$ is the set of nodes corresponding to the words in $S$, and $E$ is the set of directed edges indicating the dependency relationships between words. Each edge carries a label indicating the specific dependency. Dependency syntax allows the core word to be mapped to its dependency syntax tree, as shown in Figure 2.

Figure 2. Dependency parsing layer.


Dependency parsing concentrates on words and their dependencies rather than on constituents. To enable the model to learn dependency representations, we convert the syntactic dependency graph $G$ into a set of relation triples $R$, as shown in Equation 2:

$$R = \{(W_1, W_3, r_1), (W_4, W_7, r_5), (W_6, W_8, r_3), \ldots, (W_n, W_m, r_i)\} \tag{2}$$

where $r_i$ represents the $i$-th type of dependency relation between words.

The relation triples $R$ are then converted into an adjacency matrix $A = (a_{ij})_{n \times n}$, where $n$ is the matrix size. To calculate the relational attention between words in the text semantic graph layer, noisy information in the dependencies must be removed. We therefore ignore the specific direction of the dependencies between words in the adjacency matrix $A$ and focus only on whether a relation exists. A value in the adjacency matrix $A$ indicates whether there is a dependency relation between the $i$-th word and the $j$-th word. The adjacency matrix $A$ is defined in Equation 3.

$$a_{ij} = \begin{cases} 1, & \text{if there is a dependency relation between } W_i \text{ and } W_j \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

Finally, the adjacency matrix $A$ is vectorized to obtain the dependency syntax representation $D$, which constitutes the edge features of the text semantic graph. The process of the dependency parsing layer is shown in Algorithm 1.

Algorithm 1

Algorithm 1 describes the computing process of the dependency parsing layer. First, in step 1, the output variable $D$ is declared, and in step 2, the syntactic dependency graph $G$ of the sentence $S$ is obtained. In steps 3-5, the relation triples $R$ are derived from $G$. In steps 6-7, the relation triples $R$ are converted into the adjacency matrix $A = (a_{ij})_{n \times n}$, and $a_{ij}$ is initialized to 0 (step 8). In steps 9-12, we check whether two words are connected by a dependency relation, and if so, we set $a_{ij}$ to 1. Finally, we vectorize the adjacency matrix (step 13).
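The following sketch mirrors Algorithm 1 under the assumption that the relation triples index tokens by position (as produced by the parsing sketch above); the function name and toy example are hypothetical.

```python
import numpy as np

def build_adjacency(n_words: int, triples):
    """Algorithm 1 (sketch): turn (head_idx, dep_idx, relation) triples into an
    undirected, unlabeled adjacency matrix A, then flatten it into the edge vector D."""
    A = np.zeros((n_words, n_words), dtype=np.float32)   # steps 6-8: initialise a_ij = 0
    for i, j, _rel in triples:                           # steps 9-12: mark dependency pairs
        A[i, j] = A[j, i] = 1.0                          # direction and label are ignored
    D = A.reshape(-1)                                    # step 13: vectorise the matrix
    return A, D

words = ["I", "like", "the", "screen"]
triples = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "det")]   # toy parse
A, D = build_adjacency(len(words), triples)
print(A)
```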

Sentence Coding Layer

The sentence coding layer encodes a sentence into vectors. Following DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023), we use a multi-layer bidirectional Transformer (Ashish et al. Citation2017; Mary and Marcus Citation2022) as the model backbone to encode the semantic information of short texts, as shown in Figure 3.

Figure 3. Sentence coding layer.


First, given two sentences $S_1 = \{W_{11}, W_{12}, W_{13}, W_{14}, \ldots, W_{1n}\}$ and $S_2 = \{W_{21}, W_{22}, W_{23}, W_{24}, \ldots, W_{2n}\}$, where $W_{ij}$ represents the $j$-th word in $S_i$, we learn the deep interaction features between $S_1$ and $S_2$ by concatenating the two sentences into a joint representation with the tags [CLS] and [SEP]. The tagged sequence $X_i$ is shown in Equation 4.

$$X_i = [\text{CLS}]\; W_{11}, W_{12}, W_{13}, W_{14}, \ldots, W_{1n}\; [\text{SEP}]\; W_{21}, W_{22}, W_{23}, W_{24}, \ldots, W_{2n}\; [\text{SEP}] \tag{4}$$

Then we encode the sequence $X_i$. Specifically, the sentence coding layer takes the sequence $X_i$ as input and applies $N$ Transformer layers to generate a contextual representation $H_n$, as shown in Equation 5:

$$H_n = \text{Transformer}_n(H_{n-1}) \tag{5}$$

where $n \in [1, N]$ indexes the $n$-th layer, $H_0$ is the embedding of the sequence $X_i$, and each Transformer layer contains an identical Transformer block. The Transformer block consists of disentangled attention (He et al. Citation2021) and a feed-forward neural network (FFN). Formally, the output of the Transformer block is calculated as shown in Equations 6-8:

$$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T \tag{6}$$

where $H_i$ represents the $i$-th word embedding of the sequence $X_i$, $P_{j|i}$ represents the relative position embedding of the sequence $X_i$, and $T$ denotes the transpose operation.

Decoupled matrices of content embeddings and relative position embeddings are used to calculate the attention weights between words. The projection matrices are applied as follows:

$$Q_c = H_0 W_{q,c}, \quad K_c = H_0 W_{k,c}, \quad V_c = H_0 W_{v,c}, \quad Q_r = P W_{q,r}, \quad K_r = P W_{k,r} \tag{7}$$

where $Q_c$, $K_c$, and $V_c$ represent the content vectors generated by the projection matrices $W_{q,c}, W_{k,c}, W_{v,c} \in \mathbb{R}^{d \times d}$, respectively, and $P \in \mathbb{R}^{2K \times d}$ represents the relative position embedding shared across layers. $Q_r$ and $K_r$ represent the relative position vectors generated by the projection matrices $W_{q,r}, W_{k,r} \in \mathbb{R}^{d \times d}$, respectively. The attention score is computed as shown in Equation 8:

$$z_{i,j} = Q_i^c (K_j^c)^T + Q_i^c (K_{r,\delta(i,j)})^T + K_j^c (Q_{r,\delta(i,j)})^T \tag{8}$$

where $z_{i,j}$ is the element in the $i$-th row and $j$-th column of the attention matrix $z$, $Q_i^c$ is the $i$-th row of $Q_c$, $K_j^c$ is the $j$-th row of $K_c$, $K_{r,\delta(i,j)}$ is the row of $K_r$ corresponding to the relative distance $\delta(i,j)$, and $Q_{r,\delta(i,j)}$ is the row of $Q_r$ corresponding to the relative distance $\delta(i,j)$. We then calculate the disentangled attention output $\hat{z}_i$, as shown in Eq. (9):

$$\hat{z}_i = \text{Softmax}\!\left(\frac{z_i}{\sqrt{3d}}\right) V_c \tag{9}$$

where Softmax represents the softmax activation function and $z_i$ denotes the $i$-th row of $z$. We feed $\hat{z}_i$ into the FFN and normalize the result to obtain the output $r_i$, as shown in Eq. (10):

$$r_i = \text{LN}\left(\text{FFN}(\hat{z}_i) + \hat{z}_i\right) \tag{10}$$

where LN represents the LayerNorm function (Lei Jimmy et al. Citation2016), and FFN represents a feed-forward neural network.
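To make Equations 7-10 concrete, the following is a minimal, single-head PyTorch sketch of disentangled attention. It is not the actual DeBERTav3 implementation: the real model uses bucketed relative distances, multiple heads, dropout, and additional scaling terms, all of which are simplified away here; the dimensions d and k are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAttentionSketch(nn.Module):
    """Simplified single-head sketch of Eqs. (7)-(10)."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.d, self.k = d, k
        self.Wq_c = nn.Linear(d, d, bias=False)
        self.Wk_c = nn.Linear(d, d, bias=False)
        self.Wv_c = nn.Linear(d, d, bias=False)
        self.Wq_r = nn.Linear(d, d, bias=False)
        self.Wk_r = nn.Linear(d, d, bias=False)
        self.rel_emb = nn.Embedding(2 * k, d)   # shared relative-position table P
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, H0):                      # H0: (n, d) embeddings of the sequence X_i
        n, d = H0.shape
        # Eq. (7): content and relative-position projections
        Qc, Kc, Vc = self.Wq_c(H0), self.Wk_c(H0), self.Wv_c(H0)
        P = self.rel_emb.weight                 # (2k, d)
        Qr, Kr = self.Wq_r(P), self.Wk_r(P)
        # clipped relative distance delta(i, j), mapped into [0, 2k)
        pos = torch.arange(n, device=H0.device)
        delta = (pos[None, :] - pos[:, None]).clamp(-self.k, self.k - 1) + self.k
        # Eq. (8): content-to-content, content-to-position, position-to-content terms
        c2c = Qc @ Kc.T
        c2p = torch.gather(Qc @ Kr.T, 1, delta)
        p2c = torch.gather(Kc @ Qr.T, 1, delta.T).T
        z = c2c + c2p + p2c
        # Eq. (9): scaled softmax attention applied to the content values
        out = F.softmax(z / math.sqrt(3 * d), dim=-1) @ Vc
        # Eq. (10): FFN, residual connection, and LayerNorm
        return self.norm(self.ffn(out) + out)
```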

Finally, we use Transformer decoders (Ashish et al. Citation2017) to obtain intermediate embeddings, which are fed into a pooling layer. We take the vector $S'$ of the pooling layer at the [CLS] position as the word vector of $X_i$, defined as $S' \in \mathbb{R}^{d \times d_q}$. We regard $S'$ as a node feature of the semantic graph. The procedure of the sentence coding layer is shown in Algorithm 2.

Algorithm 2

Algorithm 2 outlines the computing process of the sentence coding layer. In step 1, the output variable $S'$ is declared. Then, in step 2, the encoder layers are traversed. The content embedding and relative position embedding of the text are obtained in step 3, and five projection matrices are generated in step 4. The disentangled attention is calculated in step 5 and normalized in step 6. The attention output is fed into the FFN in step 7, and the output vector of each encoder layer is accumulated in steps 8-9. Finally, the output vector of the encoder layer is fed into the decoder to obtain the word vector $S'$ in step 10.
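A minimal sketch of the joint encoding step using the Hugging Face transformers library is shown below. The checkpoint name microsoft/deberta-v3-base and the use of the [CLS] hidden state as $S'$ are assumptions; the paper additionally passes intermediate embeddings through Transformer decoders and a pooling layer, which are omitted here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the paper only states that DeBERTav3 is the backbone.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")

def encode_pair(s1: str, s2: str) -> torch.Tensor:
    """Jointly encode two sentences as [CLS] s1 [SEP] s2 [SEP] (Eq. 4) and return
    the hidden state at the [CLS] position as the node feature S'."""
    inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape (1, hidden_size)

s_prime = encode_pair("The cat sat on the mat.", "A cat is sitting on a mat.")
print(s_prime.shape)
```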

Semantics Graph Layer

We construct a semantic graph layer based on the graph attention network (Velikovi et al. Citation2017), which can effectively fuse the word vectors of the text with the dependency representation. Specifically, we build a heterogeneous text graph that contains word nodes and dependency relations. The word vector representation of the text is the node feature of the graph, and the dependency representation of the text is the edge feature of the graph. The semantic graph layer learns the relation weights between word nodes using the graph attention mechanism, as shown in Figure 4.

Figure 4. Semantics graph layer.


When constructing an edge, we do not take into account the dependency type or direction: if two words in a sentence have a dependency relation, we construct an edge between the corresponding nodes. We represent the $F$-dimensional semantic features of the nodes in the graph attention network as $h = \{h_1, h_2, \ldots, h_N\}, h_i \in \mathbb{R}^F$, and the output node features as $h' = \{h'_1, h'_2, \ldots, h'_N\}, h'_i \in \mathbb{R}^{F'}$. Let the vector features of word nodes $u_i$ and $u_j$ be $h_i$ and $h_j$, respectively. For a word node $u_i$ we calculate the attention coefficient $e_{ij}$ between $u_i$ and each of its neighboring word nodes $u_j$, as shown in Eq. (11):

$$e_{ij} = a(W h_i, W h_j), \quad j \in \mathcal{N}_i \tag{11}$$

where $a$ represents a shared attention weight vector, and $W \in \mathbb{R}^{F' \times F}$ represents a learnable weight matrix that maps the node features into higher-dimensional features.

In order to assign attention weights between different word nodes, we normalize the attention coefficients $e_{ij}$ between the word node $u_i$ and all of its neighboring word nodes $u_j$, obtaining the attention score $\alpha_{ij}$, as shown in Eq. (12):

$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \tag{12}$$

where a LeakyReLU activation function is used, expanding Eq. (12) into Eq. (13).

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^T [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(a^T [W h_i \,\|\, W h_k]\right)\right)} \tag{13}$$

We then compute a weighted sum of the features of the neighboring word nodes to obtain the semantic feature representation $h'_i$ of word node $u_i$, as shown in Eq. (14):

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right) \tag{14}$$

where $\sigma$ represents an activation function, and $h'_i$ represents the feature representation of the $i$-th word node produced by the attention layer.

In order to make the attention more accurate, we use a multi-head attention mechanism, applying $K$ groups of independent attention weight matrices to Eq. (14), as shown in Eq. (15):

$$h'_i = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right) \tag{15}$$

Finally, to improve the semantic representation of the text, we concatenate the sentence representation $S'$ with the text semantic graph representation $h'_i$ to obtain the final text semantic representation, as shown in Eq. (16):

$$S_O = \text{Concat}(h'_i, S') \tag{16}$$

where Concat represents the concatenation operation. The procedure of the semantics graph layer is shown in Algorithm 3.

Algorithm 3

Algorithm 3 describes the computing process of the semantics graph layer. First, the output variable $S_O$ is declared (step 1), and the word vector $S'$ is taken as the node features of the graph (step 2). Next, the adjacency matrix $A$ is taken as the edge features of the graph (step 3), and the node and edge features are fed into the GAT to obtain the graph attention vector (step 4). Finally, the word vectors and graph attention vectors are concatenated to obtain the semantic graph vector $S_O$ (step 5).
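A compact sketch of this layer using the GATConv operator from PyTorch Geometric is shown below. Averaging the K attention heads (concat=False) matches Eq. (15); mean-pooling the node features before the concatenation of Eq. (16) is an assumption, since the paper does not specify how the per-node vectors are aggregated.

```python
import torch
from torch_geometric.nn import GATConv

class SemanticGraphLayer(torch.nn.Module):
    """Sketch of Algorithm 3: word vectors as node features, dependency edges from the
    adjacency matrix A, GAT attention (Eqs. 11-15), then concatenation with S' (Eq. 16)."""
    def __init__(self, in_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        # concat=False averages the K attention heads, matching Eq. (15)
        self.gat = GATConv(in_dim, out_dim, heads=heads, concat=False)

    def forward(self, node_feats, adj, s_prime):
        edge_index = adj.nonzero().t().contiguous()       # (2, num_edges) from matrix A
        h = torch.nn.functional.elu(self.gat(node_feats, edge_index))
        graph_repr = h.mean(dim=0, keepdim=True)          # pool node features (assumption)
        return torch.cat([graph_repr, s_prime], dim=-1)   # Eq. (16): S_O = Concat(h', S')

layer = SemanticGraphLayer(in_dim=768, out_dim=768)
nodes = torch.randn(12, 768)        # one feature vector per word node
A = torch.eye(12)                   # placeholder adjacency matrix from the parsing layer
s_o = layer(nodes, A, torch.randn(1, 768))
print(s_o.shape)                    # torch.Size([1, 1536])
```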

Similarity Computing Layer

The similarity computing layer builds a text classifier that calculates the semantic similarity between two sentences. A fully connected neural network classifies the text vector to obtain the semantic similarity. The final text semantic representation $S_O$ obtained in Section 3.4 is used as the input of a fully connected neural network for binary classification. The network produces a classification output in the range (0, 1) and is trained with the cross-entropy loss function. The network is trained jointly with the pre-trained language model DeBERTav3 (as described in Section 3.3) in a fine-tuning pattern to obtain the final output. The formula for the similarity calculation is shown in Eq. (17):

$$y = f(W \times S_O + b) \tag{17}$$

where $f$ is the activation function, $W$ represents the weight matrix, $S_O$ represents the semantic vector, and $b$ is a learnable bias. We use the cross-entropy loss function $L$ to measure the prediction quality, as shown in Eq. (18):

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \tag{18}$$

where $y_i$ represents the label of sample $i$, and $p_i$ represents the probability that sample $i$ is predicted to be positive.
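A minimal sketch of the similarity computing layer is shown below. The single linear layer with a sigmoid activation and the use of BCELoss are assumptions consistent with Eqs. (17) and (18); the paper does not specify the exact activation or hidden sizes.

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Fully connected classifier over the semantic vector S_O (Eq. 17)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)            # W and b of Eq. (17)

    def forward(self, s_o: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(s_o)).squeeze(-1)   # p_i in (0, 1)

head = SimilarityHead(in_dim=1536)
criterion = nn.BCELoss()                                  # binary cross-entropy, Eq. (18)
p = head(torch.randn(8, 1536))                            # a toy batch of 8 sentence pairs
loss = criterion(p, torch.randint(0, 2, (8,)).float())    # 0/1 similarity labels
loss.backward()
```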

Experiments and Results

In this section, we first describe the details of the experiments and the datasets in Sections 4.1 and 4.2, then introduce the baselines in Section 4.3 and the metrics in Section 4.4, and present the results of the main experiments in Section 4.5. In Section 4.6 we show results in the low-resource setting. Finally, we perform ablation testing to prove the effectiveness of each component in Section 4.7.

Experimental Details

The proposed STSG was implemented using the PyTorch 1.7 framework on a Windows server with an Intel(R) Core(TM) i7-10700F CPU @ 2.90 GHz, 32 GB of memory, and an NVIDIA GeForce RTX 3060 GPU. The authors of DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023) suggest hyper-parameters for fine-tuning DeBERTav3 on downstream tasks. To ensure a fair comparison with our baseline model DeBERTav3, we chose the hyper-parameters of STSG to be consistent with it. The original DeBERTav3 article has demonstrated the validity of these parameters, which are described in Table 2.

Table 2. Parameters of STSG.

Datasets

To assess the performance of STSG on short text similarity, we utilized the MRPC dataset (Alex et al. Citation2019), which is the most commonly used authoritative evaluation benchmark for such tasks (Devlin et al. Citation2018; Pengcheng, Jianfeng, and Weizhu Citation2023; Raffel et al. Citation2020; Yung-Sung et al. Citation2022). Additionally, we employed the challenging low-resource BIOSSES dataset (Gizem, Hakime, and Arzucan Citation2017) to validate the feasibility of STSG in low-resource text semantic similarity tasks, as shown in Table 3:

  1. MRPC (The Microsoft Research Paraphrase Corpus) is a dataset consisting of 5800 sentence pairs extracted from online news. The category distribution is unbalanced, with 68% of the pairs being positive samples, which poses a significant challenge to the model’s computation.

  2. BIOSSES is a dataset comprising 100 sentence pairs that were selected from medical abstracts in the biomedical field. The dataset uses score intervals of 0-5 to judge the similarity between the pairs. Due to its low-resource nature and extremely sparse samples, the dataset presents a challenge for models to learn from small samples.

Table 3. Summary of datasets.

Baselines

We used the following 10 models as experimental baselines.

  1. BERT (Devlin et al. Citation2018). This model uses an existing unlabeled corpus to pre-train the Transformer model to obtain word embedding vectors, and then fine-tunes the model to complete the sentence similarity computing.

  2. T5 (Raffel et al. Citation2020). This model uses the unified pre-trained language models to learn multiple different NLP tasks.

  3. ELECTRA (Kevin et al. Citation2020). This model uses a discriminative pre-trained text encoder to calculate the similarity of two sentences.

  4. ERNIE2.0 (Yu et al. Citation2020). This model is a continuous learning framework for deeply integrated knowledge to continuously learn vocabulary, syntactic, and semantic knowledge by introducing more tasks.

  5. SpanBERT (Joshi et al. Citation2020). This model is a pre-trained model based on word segmentation, which adds a mask to the random adjacent word to calculate text similarity.

  6. DistilBERT (Sanh et al. Citation2019). This model is a pre-trained universal language representation model to calculate text similarity.

  7. ESM-2 (Zemin et al. 2022). ESM-2 calculates the sequence text data using large-scale pre-trained models.

  8. XLNet (Yang et al. Citation2019). XLNet is a generalized autoregressive pre-trained model to calculate text similarity.

  9. DeBERTa (He et al. Citation2021). DeBERTa uses the disentangled attention to improve BERT and further improve the performance of the model.

  10. DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023). This model is an efficient and simple pre-trained model to calculate text similarity.

Metrics

(1) Precision. Precision describes the proportion of sentence pairs predicted as similar (classification result 1) that are truly similar:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{19}$$

(2) Recall. Recall describes the percentage of all truly similar sentence pairs (classification result 1) that are correctly predicted by the STSG model:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{20}$$

(3) Accuracy. Accuracy describes the percentage of all predictions made by the STSG model that are correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{21}$$

(4) F1-score. The F1-score is the harmonic mean of Precision and Recall. The value range of the F1-score is [0, 1]:

$$F1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{22}$$

where TP is the number of similar sentence pairs correctly predicted as similar, TN is the number of non-similar sentence pairs correctly predicted as non-similar, FP is the number of non-similar sentence pairs incorrectly predicted as similar, and FN is the number of similar sentence pairs incorrectly predicted as non-similar.

(5) Pearson. The Pearson metric measures the linear correlation between the predicted results of the model and the correct results, where 0 represents no correlation and 1 represents complete correlation. The corresponding formula is defined in Eq. (23):

$$\text{Pearson} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \tag{23}$$

(6) Spearman. The Spearman metric measures the rank correlation between the predicted outcomes of the model and the correct outcomes. The value ranges from −1 to 1, where −1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. The corresponding formula is defined in Eq. (24):

$$\text{Spearman} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{24}$$

(7) Kendall. Kendall’s correlation is a rank correlation coefficient used to assess the correlation between two random variables based on the rank of the data objects. The corresponding formula is defined as Eq. (25):

$$\text{Kendall} = \frac{c - d}{\sqrt{(c + d + t_x)(c + d + t_y)}} \tag{25}$$

where $c$ and $d$ represent the numbers of concordant and discordant pairs, and $t_x$ and $t_y$ denote the numbers of tied ranks in $x_i$ and $y_i$, respectively.

(8) MAE. The Mean Absolute Error (MAE) is the mean absolute difference between the true values and the predicted values, as shown in Eq. (26):

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - y_i \right| \tag{26}$$

(9) MSE. The Mean Squared Error (MSE) is the mean squared difference between the true and predicted values, as shown in Eq. (27):

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2 \tag{27}$$

where $x_i$ represents the true result, $y_i$ represents the predicted result, $d_i$ is the rank difference between $x_i$ and $y_i$, and $n$ is the number of samples.
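For reference, all of the metrics above can be computed with standard scientific Python libraries; the toy arrays below are illustrative only, not the paper's predictions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1])           # gold labels (toy data)
y_pred = np.array([1, 0, 1, 0, 0, 1])           # predicted labels (toy data)

# Classification metrics (Eqs. 19-22), as used on MRPC
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      accuracy_score(y_true, y_pred), f1_score(y_true, y_pred))

# Correlation and error metrics (Eqs. 23-27), as used on BIOSSES
scores_true = np.array([4.2, 1.0, 3.5, 2.8])
scores_pred = np.array([4.0, 1.3, 3.1, 2.5])
print(pearsonr(scores_true, scores_pred)[0],
      spearmanr(scores_true, scores_pred)[0],
      kendalltau(scores_true, scores_pred)[0])
print(np.mean(np.abs(scores_true - scores_pred)),      # MAE, Eq. (26)
      np.mean((scores_true - scores_pred) ** 2))        # MSE, Eq. (27)
```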

Results of the Main Experiment

(1) Results on MRPC dataset

Table 4 shows the performance of STSG compared to advanced methods on the MRPC dev dataset. STSG achieves the highest performance among all the methods, with an F1-score of .946. The proposed STSG has a recall of .913, outperforming the other methods. Methods based on pre-trained language models, such as BERT, T5, ELECTRA, DeBERTa, DeBERTav3, and SpanBERT, achieve F1-scores ranging from .900 to .926. Contrastive learning methods, such as Sentence-BERT, DiffCSE, and SimCSE, achieve F1-scores ranging from .760 to .777. The MRPC dataset has an unbalanced category distribution, with 68% of sentence pairs being positive samples; as a result, contrastive learning-based methods typically yield low F1-scores, whereas pre-trained language models such as DeBERTav3 and DeBERTa alone can already achieve good results. It is worth noting that pre-trained language models may struggle to learn the semantics of sentences, resulting in a large discrepancy between their F1-scores and Recall scores. In contrast, STSG, which combines syntactic information and pre-trained models, captures the semantics of sentences better. As a result, the similarity calculation is more accurate, and the differences among the F1-score, Recall, and Accuracy metrics are smaller.

Table 4. Evaluation on MRPC dataset.

(2) Evaluation on BIOSSES dataset

Table 5 shows the performance of STSG in comparison to other methods on the BIOSSES test dataset. STSG achieves the best MAE and MSE errors, with a Pearson value of .915 and a Spearman value of .835. The primary cause of these results is the limited number of training samples in the BIOSSES dataset, which comprises only 64 pairs of sentences and contains more noisy data. Methods that rely on pre-trained language models can be challenging to apply to low-resource tasks; for example, models such as DeBERTav3 and BioBERT do not yield high Pearson values. STSG enhances the global semantic information of sentences by utilizing syntactic dependencies. Compared to DeBERTav3, its Pearson value is improved by 8.67%. These results demonstrate that combining syntactic dependencies and pre-trained language models can be effective in textual semantic similarity tasks with small datasets.

Table 5. Evaluation on BIOSSES dataset.

The MAE and MSE metrics assess the deviation between the model's predicted values and the true values of the samples; the smaller the MAE and MSE, the better the predictive performance of the model. Table 5 shows that some methods, such as ERNIE2.0 and XLNet, have high Pearson values but large MAE and MSE values due to poor handling of sentence semantics. The proposed STSG has an MAE of .518 and an MSE of .376, the former 40.60% lower than the MAE of T5.

Results on the Low-Resource Setting

Table 6 shows the experiments conducted with various training ratios on the MRPC dev dataset. We divided the training set of MRPC into new subsets of different sizes (called training ratios) to train the model, and used the full validation set to evaluate the model in order to test the generalization ability of STSG on the classification problem.

Table 6. Low-resource settings of MRPC datasets.

We also use different training ratios on the MRPC dataset to plot Receiver Operating Characteristic (ROC) curves. Figure 5 shows the ROC curves of the proposed STSG and several baselines on the MRPC dataset. Specifically, the Area Under the Curve (AUC) of STSG is .913 at a training ratio of 100%, .869 at 50%, .772 at 20%, and .693 at 10%. The pre-trained language model approaches perform poorly in terms of AUC in the low-resource settings because of the imbalance of positive and negative samples in MRPC; they struggle to accurately identify negative examples. The results demonstrate that STSG achieves a high True Positive Rate for positive samples at different training ratios while maintaining a low False Positive Rate for negative samples.

Figure 5. ROC curve on MRPC dataset.

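The ROC curves and AUC values discussed above can be reproduced from model scores with scikit-learn; the labels and probabilities below are toy placeholders, not the paper's predictions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                       # toy gold labels
y_score = np.array([0.92, 0.81, 0.45, 0.67, 0.30, 0.52, 0.88, 0.12])  # toy model scores

fpr, tpr, _ = roc_curve(y_true, y_score)   # False/True Positive Rates at each threshold
print(auc(fpr, tpr))                        # Area Under the ROC Curve
```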

Table 7 shows that the F1-score of the proposed STSG at a training ratio of 1% is .812 and the accuracy is .684, representing an increase in F1-score of 14.04% and in accuracy of 12.50% compared to BERT. At a training ratio of 50%, the F1-score of STSG is .909 and the accuracy is .873.

Table 7. Evaluation on MRPC dataset by different training ratios.

The experimental results in Table 7 show that the proposed STSG outperforms the other baselines at different training ratios on the MRPC dataset. Existing methods ignore the semantic correlation between words, resulting in unstable results under different training ratios. In contrast, the proposed STSG learns dependencies between words and incorporates the feature learning of neighboring nodes, leading to stable prediction performance.

Figure 6 shows that the Pearson of STSG is .895 at a training ratio of 50%, .509 at 20%, and .578 at 10%. These results show that STSG can achieve accurate predictions on small subsets of BIOSSES with different proportions, and that its generalization ability exceeds that of comparable baselines. STSG takes into account the feature learning of neighboring nodes, which enables the model to better recognize the context in short texts.

Figure 6. Pearson on BIOSSES dataset.


Table 8 shows the experiments conducted with different training ratios on the BIOSSES test dataset. We split the training set of BIOSSES into new subsets of different sizes to train the model, and used the full validation set to verify the performance of the model and test the generalization ability of the proposed STSG.

Table 8. Low-resource settings of BIOSSES datasets.

Figure 7 presents the experimental results for different training scales on the BIOSSES test dataset. The BIOSSES training set was split into new subsets of varying sizes to train the model, and the model's MAE and MSE were validated using 100% and 50% sized validation sets. With a 50% training ratio, STSG achieved an MAE of .605 and an MSE of .555, representing an 11.03% decrease in MAE and a 36.21% decrease in MSE compared to T5. Compared to other methods, STSG exhibits smaller MAE and MSE values at different training ratios. Methods based on pre-trained language models tend to yield large errors in low-resource experiments because these models are not fine-tuned for downstream tasks and are difficult to adapt to small sample datasets. STSG utilizes syntactic information and textual semantic graphs to improve the model's generalization ability, resulting in strong predictive performance even on small datasets.

Figure 7. Errors on BIOSSES dataset.


Ablation Testing

Tables 9 and 10 show the ablation testing results of STSG on MRPC and BIOSSES, respectively. The STSG model without the dependency parsing layer is denoted -w/o DP. Table 9 shows that the F1-score of -w/o DP on the MRPC dataset is .926, a decrease of 2.11% compared to STSG. Similarly, Table 10 shows that the Pearson of -w/o DP is .842, a decrease of 7.98% compared to STSG. These results show that the proposed STSG takes into account the syntactic structure features between sentence pairs, which helps the model understand the true intention of the sentence expression, thus improving the accuracy of the similarity computation task.

Table 9. Ablation testing on MRPC dataset.

Table 10. Ablation testing on BIOSSES dataset.

Conclusion and Future Works

Short text semantic similarity computation is a fundamental problem in natural language processing that aims to predict the similarity between two sentences. To tackle the shortcomings of current short text semantic similarity methods, we introduce STSG, which is based on dependency parsing and pre-trained language models. The model utilizes syntactic information and incorporates it into pre-trained language models, thereby augmenting the global semantic information of sentences to address the issue of semantic sparsity. We propose a textual semantic graph layer that uses word vectors and dependency parsing relations as features of a Graph Attention Network (Velikovi et al. Citation2017) to improve word relevance modeling. However, the STSG model has limitations, such as its inability to process sentence pairs with more than 512 tokens. Additionally, the model can only calculate semantic similarity for English text. Moreover, due to limited computational resources, we were unable to experiment with the latest large models such as GPT4 (Achiam et al. Citation2023). Our future plans involve addressing the current limitations of STSG and proposing a new method for measuring the semantic similarity of text by combining textual semantic dependency analysis with large-scale language models.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Open Project of Sichuan Provincial Key Laboratory of Philosophy and Social Science for Language Intelligence in Special Education under Grant No. YYZN-2023-4 and the Ph.D. Fund of Chengdu Technological University under Grant No. 2020RC002.

References

  • Achiam, J., S. Adler, S. Agarwal, L. Ahmad, I. Akkaya. 2023 Dec 19. Gpt-4 technical report. arXiv Preprint arXiv 2303:08774.
  • Alex, W., S. Amanpreet, M. Julian, H. Felix, L. Omer, and B. Samuel. 2019 Feb 22. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv Preprint arXiv 1804:07461.
  • Ashish, V., S. Noam, P. Niki, U. Jakob, J. Llion. 2017. Attention is all you need. Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach.
  • Binyuan, H., G. Ruiying, W. Lihan, Q. Bowen, L. Bowen. 2022. S2SQL: Injecting syntax to question-schema interaction graph encoder for text-to-SQL parsers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 1254–29. Dublin.
  • Chandrasekaran, D., and V. Mago. 2021. Evolution of semantic similarity—A survey. ACM Computing Surveys 54 (2):1–37. doi:10.1145/3440755.
  • Dazhi, J., W. Runguo, H. Zhihui, L. Senlin, L. Cheng, and Y. Lin. 2023. GASN: Gamma distribution test for driver genes identification based on similarity networks. Connection Science 35 (1):1–19. doi:10.1080/09540091.2023.2167937.
  • Devlin, J., M. W. Chang, K. Lee, and K. Toutanova. 2018 Oct 11. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv 1810:04805.
  • Ghafour, A., M. Jamshid Bagherzadeh, and F. Mohammad-Reza. 2022. Learning bilingual word embedding mappings with similar words in related languages using GAN. Applied Artificial Intelligence 36 (1). doi:10.1080/08839514.2021.2019885.
  • Gizem, S., Ö. Hakime, and Ö. Arzucan. 2017. BIOSSES: A semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 33 (14):I49–58. doi:10.1093/bioinformatics/btx238.
  • Guimin, C., T. Yuanhe, S. Yan, and W. Xiang. 2021. Relation extraction with type-aware map memories of word dependencies. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2501–12. Bangkok.
  • He, P., X. Liu, J. Gao, and W. Chen. 2021 Oct 6. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv Preprint arXiv 2006:03654.
  • Hironori, T., S. Junya, F. Sulfayanti, and K. Akihiro. 2022. Anomaly detection using siamese network with attention mechanism for few-shot learning. Applied Artificial Intelligence 36 (1). doi:10.1080/08839514.2022.2094885.
  • Ilya, L., and H. Frank. 2019 Jan 4. Decoupled weight decay regularization. arXiv Preprint arXiv 1711:05101.
  • Jianguo, C., L. Kenli, B. Kashif, Z. Xu, L. Keqin. 2019. A Bi-layered parallel training architecture for large-scale convolutional neural networks. IEEE Transactions on Parallel and Distributed Systems 30(5):965–76. doi:10.1109/TPDS.2018.2877359.
  • Jonas, M., and T. Aditya. 2016. Siamese recurrent architectures for learning sentence similarity. Proceedings of the AAAI conference on artificial intelligence, 2786–92. Arizona.
  • Joshi, M., D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8:64–77. doi:10.1162/tacl_a_00300.
  • Kevin, C., L. Minh-Thang, V. Quoc, and M. Christopher. 2020 May 23. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv Preprint arXiv 2003:10555.
  • Lei Jimmy, B., K. Ryan, and H. Geoffrey. 2016 Jul 21. Layer normalization. arXiv Preprint arXiv 1607:06450.
  • Liang, Y., M. Chengsheng, and L. Yuan. 2019. Graph convolutional networks for text classification. Proceedings of the AAAI conference on artificial intelligence, 7370–77. Hawaii.
  • Lin, D. 1998. An information-theoretic definition of similarity. Proceedings of the International Conference on Machine Learning, 296–304. Madison.
  • Lingling, X., X. Haoran, W. Fu Lee, T. Xiaohui, W. Weiming, and Q. Li. 2024. Contrastive sentence representation learning with adaptive false negative cancellation. Information Fusion 102:102065–102065. doi:10.1016/j.inffus.2023.102065.
  • Mary, P., and H. Marcus. 2022 Jul 19. Formal algorithms for transformers. arXiv Preprint arXiv 2207:09238.
  • Mengting, H., Z. Xuan, Y. Xin, J. Jiahao, Y. Wei, and C. Gao. 2021. A survey on the techniques, applications, and performance of short text semantic similarity. Concurrency and Computation: Practice and Experience 33 (5). doi:10.1002/cpe.5971.
  • Minh Hieu, P., and O. Philip. 2020. Modelling context and syntactical features for aspect-based sentiment analysis. Proceedings of the 58th annual meeting of the association for computational linguistics, 3211–20. Washington.
  • Nils, R., and G. Iryna. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language, 3980–90. Hong Kong.
  • Paul, N., V. Maarten, and R. Mihai. 2016. Learning text similarity with siamese recurrent networks. Proceedings of the 1st Workshop on Representation Learning for NLP, 148–57. Berlin.
  • Pengcheng, H., G. Jianfeng, and C. Weizhu. 2023. DeBERTav3: Improving DeBERTa using ELECTRA-Style pre-training with gradient-disentangled embedding sharing. The Eleventh International Conference on Learning Representations, Kigali.
  • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21 (1):5485–551.
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019 Mar 1. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv Preprint arXiv 1910:01108.
  • Shuai, Z., W. Lijie, S. Ke, and X. Xinyan. 2020 Sep 3. A practical Chinese dependency parser based on a large-scale dataset. arXiv Preprint arXiv 2009:00901.
  • Song, C., and L. Hai. 2022. BERT-Log: Anomaly detection for system logs based on pre-trained language model. Applied Artificial Intelligence 36 (1). doi:10.1080/08839514.2022.2145642.
  • Tianyu, G., Y. Xingcheng, and C. Danqi. 2021. SimCSE - simple contrastive learning of sentence embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 6894–910. Punta Cana.
  • Velikovi, P., G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. 2017 Feb 4. Graph attention networks. arXiv Preprint arXiv 1710:10903.
  • Weidong, Z., L. Xiaotong, J. Jun, and X. Rongchang. 2022. Re-LSTM: A long short-term memory network text similarity algorithm based on weighted word embedding. Connection Science 34 (1):2652–70. doi:10.1080/09540091.2022.2140122.
  • Wenjuan, L., S. Zhengyan, W. Subo, Z. Shunxiang, Z. Guangli, and L. Chen. 2024. PS-GCN: Psycholinguistic graph and sentiment semantic fused graph convolutional networks for personality detection. Connection Science 36 (1). doi:10.1080/09540091.2023.2295820.
  • Yan, K., P. Bin, K. Yongqi, Y. Yun, C. Jianguo, and X. Xie. 2024. Two-stage perceptual quality oriented rate control algorithm for HEVC. ACM Transactions on Multimedia Computing, Communications and Applications 20 (5):1–20. doi:10.1145/3636510.
  • Yang, Z., Z. Dai, Y. Yang, G. Jaime, R. R. Salakhutdinov, and Q. V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5753–63. Vancouver.
  • Yangfan, L., C. Cen, D. Mingxing, Z. Zeng, and L. Kenli. 2021. Attention-aware encoder–decoder neural networks for heterogeneous graphs of things. IEEE Transactions on Industrial Informatics 17 (4):2890–98. doi:10.1109/TII.2020.3025592.
  • Yifan, P., Y. Shankai, and L. Zhiyong. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. Proceedings of the 18th BioNLP Workshop and Shared Task, 58–65. Florence.
  • Yuanhe, T., C. Guimin, S. Yan, and W. Xiang. 2021. Dependency-driven relation extraction with attentive graph convolutional networks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 4458–71. Bangkok.
  • Yung-Sung, C., D. Rumen, L. Hongyin, Z. Yang, C. Shiyu, M. Soljačić, S. W. Li, W. T. Yih, Y. Kim, and J. Glass. 2022. DiffCSE: Difference-based contrastive learning for sentence embeddings. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4207–18. Seattle.
  • Yu, S., W. Shuohuan, L. Yukun, F. Shikun, T. Hao, H. Wu, and H. Wang. 2020. Ernie 2.0: A continual pre-training framework for language understanding. Proceedings of the AAAI conference on artificial intelligence, 8968–75. New York.
  • Zeming, L., A. Halil, R. Roshan, H. Brian, Z. Zhongkai, L. Wenting, S. Nikita, V. Robert, K. Ori, S. Yaniv, et al. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379 (6637):1123–30.
  • Zhaorui, T., Y. Xi, Y. Zihan, W. Qiufeng, Y. Yuyao, A. Nguyen, and K. Huang. 2023. Semantic similarity distance: Towards better text-image consistency metric in text-to-image generation. Pattern Recognit 144:109883–109883. doi:10.1016/j.patcog.2023.109883.
  • Zhe, Q., W. Zhi-Jie, L. Yuquan, Y. Bin, L. Kenli, and J. Yin. 2019. An efficient framework for sentence similarity modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (4):853–65. doi:10.1109/TASLP.2019.2899494.
  • Zhiguo, W., M. Haitao, and I. Abraham. 2016 Feb 23. Sentence similarity learning by lexical decomposition and composition. arXiv Preprint arXiv 1602:07019.