Research Article

STSG: A Short Text Semantic Graph Model for Similarity Computing Based on Dependency Parsing and Pre-trained Language Models

Article: 2321552 | Received 13 Jun 2023, Accepted 07 Feb 2024, Published online: 04 Mar 2024

ABSTRACT

Short text semantic similarity is a crucial research area in natural language processing that aims to predict the similarity between two sentences. Because short texts are sparse, words tend to be treated as isolated within a sentence and the correlations between words are ignored, which makes it difficult to compute global semantic information. To address this, a short text semantic graph (STSG) model based on dependency parsing and pre-trained language models is proposed in this paper. It uses syntactic information to obtain word dependency relationships and incorporates them into pre-trained language models to enhance the global semantic information of sentences, thereby addressing semantic sparsity more effectively. A text semantic graph layer based on the graph attention network (GAT) is also realized, which treats word vectors as node features and word dependencies as edge features. The attention mechanism of GAT can identify the importance of different word correlations and thus model word dependencies effectively. On the challenging short text semantic benchmark dataset MRPC, the STSG model achieves an F1-score of .946, an improvement of 2.16% over previous SOTA approaches. At the time of writing, STSG achieves a new SOTA performance on the MRPC dataset.

Introduction

Short text semantic similarity (STS) computation is a fundamental problem in natural language processing that aims to predict the similarity between two sentences. STS computation has numerous applications, especially in information retrieval. STS can calculate the relevance of a user's question to the retrieved content, addressing the issue of information overload and improving search strategies and results (Chandrasekaran and Mago Citation2021; Mengting et al. Citation2021). For example, paraphrase recognition is used to determine the category scores of two sentences in text classification, and STS is used to assess the similarity between sentences in question answering. However, it is important to note that recognition tasks for short texts differ from those for long texts such as news articles and magazines. In addition, because the content of short texts is often sparse (Mengting et al. Citation2021), it is challenging to accurately compute the semantic similarity between two sentences.

There are three primary families of methods for computing short text semantic similarity. Deep learning-based methods (Jonas and Aditya Citation2016; Weidong et al. Citation2022; Zhiguo, Haitao, and Abraham Citation2016) generate vector representations of sentences and calculate their similarity using neural networks. However, these methods require a large amount of computing power and may struggle to accurately capture the semantics of a sentence. BERT (Devlin et al. Citation2018), T5 (Raffel et al. Citation2020), DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023), and other approaches based on pre-trained language models are pre-trained on a large-scale unlabeled corpus and then fine-tuned on the downstream task, achieving good results in text semantic similarity computing. SimCSE (Tianyu, Xingcheng, and Danqi Citation2021), DiffCSE (Yung-Sung et al. Citation2022), and other contrastive learning-based approaches (Lingling et al. Citation2024) learn to distinguish between similar and dissimilar representations of data points. Although existing methods have achieved good results in computing text semantic similarity, two challenges remain for research on the semantic similarity of short texts:

  1. Sparsity of Short Texts. Short texts are typically less than 200 words and contain sparse information, making it difficult to extract effective feature words. In addition, short texts often contain unpredictable language variations, such as emoticons, acronyms, and internet slang, and lack contextual information, which can result in a significant amount of extraneous information. For example, the sentence "I like the larger screen size of the iPhone 6 compared to its predecessor" does not mention that the iPhone 6 Plus, a similar product, also has a larger screen than its predecessor. Short texts may lack such related words, making it difficult for models to reason about global semantic information, which affects the accuracy of text similarity.

  2. Word Relevance Modelling. One of the main challenges facing STS models is how to capture semantics and underlying context (Zhe et al. Citation2019; Ghafour, Jamshid Bagherzadeh, and Mohammad-Reza Citation2022). Deep embedding models have been studied for STS tasks with promising results (Jianguo et al. Citation2019). However, existing models usually assume that words are independent of each other and that all words and attributes are equally important for a sentence. This assumption is not always true. The semantic capacity of short texts is limited, and it is difficult to discriminate the importance of different relationships if the correlations between words are ignored.

Motivation: Using neural models in STS is not only simple and efficient, but can also integrate a wider range of informative knowledge. Syntactic information and dependency trees have been used in numerous studies (Yuanhe et al. Citation2021). Dependency trees can establish long-distance connections between key words, which helps the model extract relations between sentence pairs more effectively. Graph neural networks have recently been used to combine grammatical knowledge with pre-trained language models, achieving good outcomes (Hironori et al. Citation2022; Liang, Chengsheng, and Yuan Citation2019; Yangfan et al. Citation2021). However, these methods tend to focus solely on the feature learning of local nodes in the graph neural network, while disregarding the feature learning of neighboring nodes. Graph Attention Networks (GAT) can learn distinct weights for neighboring nodes by using the attention mechanism, addressing the limitations of traditional graph neural networks such as GCN (Velikovi et al. Citation2017).

To address the above challenges, we propose the short text semantic graph (STSG) model, which is based on dependency parsing and pre-trained language models. The STSG model uses syntactic information to extract word dependency relationships and incorporates them into pre-trained language models to enhance the global semantic information of sentences, so it can address semantic sparsity more effectively. We also propose a text semantic graph layer based on the graph attention network (GAT) (Velikovi et al. Citation2017), which treats word vectors as node features and the dependency parsing relationships of words as edge features. The attention mechanism of GAT enhances its ability to capture remote dependencies, allowing it to identify the importance of different word correlations and model word dependencies effectively. The STSG model consists of four layers. 1) The dependency parsing layer extracts the dependency parsing relations of a short text and obtains multiple relation triples. 2) The sentence encoding layer encodes the semantic information and extracts word vectors of the short text using the DeBERTav3 model (Pengcheng, Jianfeng, and Weizhu Citation2023). 3) The text semantic graph layer learns graph attention features from word vectors and syntactic dependencies. 4) The similarity computing layer builds a fine-tuning model that calculates the similarity between sentences with a fully connected network.

The main contributions of this paper are summarized as follows:

  1. A short text semantic graph (STSG) model that uses syntactic information is proposed. It extracts word dependency relationships and incorporates them into pre-trained language models to enhance the global semantic information of sentences.

  2. A GAT-based text semantic graph layer is proposed, which uses word vectors as node features and word dependencies as edge features. It improves word relevance modeling by identifying the importance of different words.

  3. Extensive evaluation and analysis show that our framework not only achieves new SOTA performance on MRPC but also has strong generalization capabilities. It is also more effective on low-resource datasets.

There are still some limitations in the STSG model. Texts longer than 512 tokens are not supported. The model computes semantic similarity for English text, but cross-lingual semantic similarity, such as for Chinese, is not supported. Due to limited computational resources, we are unable to conduct experiments on recent large pre-trained language models, such as GPT4 (Achiam et al. Citation2023).

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The proposed STSG model is described in Section 3. The experimental settings and results are presented in Section 4. Finally, the work is concluded and future prospects are given in Section 5.

Related Works

In this section, we first describe the related research on short text semantic similarity in Section 2.1, followed by related work on dependency parsing in Section 2.2. Finally, we present applications of short text semantic similarity in Section 2.3.

Short Text Semantic Similarity

For semantic similarity in short texts, deep learning-based methods are one important line of work. Zhiguo, Haitao, and Abraham (Citation2016) focus on the importance of non-similar parts between two sentences and use a two-channel convolutional neural network (CNN) to decompose similar and non-similar components, which improves the global semantic representation ability of sentences. Paul, Maarten, and Mihai (Citation2016) use a bidirectional LSTM model to obtain information from both directions of the input text, which allows them to capture the bidirectional semantic information of sentences. Weidong et al. (Citation2022) propose Re-LSTM, a weighted word-embedding long short-term memory network that reduces model parameters and computation. Methods based on pre-trained language models are the other main approach to semantic similarity of short texts. BERT (Devlin et al. Citation2018) uses an existing unlabeled corpus to pre-train the Transformer model to obtain word embedding vectors, and then fine-tunes the model to complete the sentence similarity computation. Joshi et al. (Citation2020) propose a pre-trained model (called SpanBERT) based on word segmentation, which masks random adjacent words. Kevin et al. (Citation2020) propose an efficient pre-training method called replaced token detection (RTD): during pre-training, a generator replaces words in the sentence, and a discriminator determines which words have been replaced. T5 (Raffel et al. Citation2020) uses a unified pre-trained language model to learn multiple different NLP tasks. DeBERTa (He et al. Citation2021) and DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023) use disentangled attention to improve BERT and further improve model performance. However, these methods lack a semantically enhanced representation of the text and do not take into account the dependencies between different words. Contrastive learning-based approaches for short text semantic similarity have also been extensively researched. Sentence-BERT (Nils and Iryna Citation2019) improves BERT by using siamese and triplet networks to update parameters and generate semantic embedding vectors. DiffCSE (Yung-Sung et al. Citation2022) presents an unsupervised contrastive learning framework for sentence embeddings that is sensitive to the differences between original and edited sentences, enhancing their representations. SimCSE (Tianyu, Xingcheng, and Danqi Citation2021) proposes a straightforward contrastive learning framework that predicts the input sentence itself in the contrastive objective, using only standard dropout as noise, and achieves state-of-the-art sentence embeddings. Lingling et al. (Citation2024) propose similarity and relative similarity strategies to identify false-negative samples in contrastive learning, which further improves performance significantly. Previous research methods are summarized in Table 1.

Table 1. Previous research methods.

Dependency Parsing

Recently, some methods have combined grammatical information with pre-trained language models and achieved good results on NLP tasks. Zhe et al. (Citation2019) proposed the ACV tree, which combines word embeddings and syntactic information, and developed the ACVT kernel to calculate sentence similarity. Minh Hieu and Philip (Citation2020) introduced syntax into the attention mechanism and considered part-of-speech embeddings, dependency-based embeddings, and contextualized embeddings to improve the performance of extractors. Yu et al. (Citation2020) propose a continual learning framework (called ERNIE) for deeply integrating knowledge, which continuously learns lexical, syntactic, and semantic knowledge by introducing more tasks. Yuanhe et al. (Citation2021) propose a dependency-driven relation extraction method (called a-GCN), which applies graph convolutional network attention to different context words in the syntactic dependency tree to distinguish the importance of different word dependencies. Guimin et al. (Citation2021) propose type-aware map memories (called TaMM) for relation extraction, which can encode sentence dependencies and dependency types. Binyuan et al. (Citation2022) propose the S2SQL model for the text-to-SQL task, which jointly encodes the text syntax and the database schema. However, existing grammar-based methods do not consider the feature learning of neighboring nodes and cannot handle dependencies between words well. Based on this, the proposed STSG uses dependency parsing to achieve an accurate understanding of the semantic relationships between sentences. We use a dependency parsing layer to extract multiple relation triples, and use a GAT to learn attention scores between nodes from the triples.

Application of Short Text Semantic Similarity

Short text semantic similarity has various applications, including text classification, text clustering, sentiment analysis, information retrieval, social networks, academic plagiarism detection, and specific domains (Mengting et al. Citation2021; Song and Hai Citation2022). Successful approaches in the biomedical domain include ontology-based and neural network-based metrics. For example, Dazhi et al. (Citation2023) utilized textual semantic similarity to identify cancer driver genes. They introduced GASN, a similarity network-driven gamma distribution test for gene identification, which combines machine learning and distributional statistical methods. Yan et al. (Citation2024) proposed a multiple similarity deep graph network to deal with user clustering in human-computer interaction using textual semantic similarity. Wenjuan et al. (Citation2024) also used textual semantic similarity to detect personality traits and proposed a PS-GCN model that integrates psychological knowledge and affective semantic features through graph convolutional networks. This model efficiently exploits textual similarity to extract affective features, resulting in a significant improvement in the classification accuracy of the personality detection task. Zhaorui et al. (Citation2023) employed textual semantic similarity to address the task of visual language comprehension. They proposed a new metric model, Semantic Similarity Distance (SSD), to mitigate semantic inconsistencies and bridge the gap between text and image.

The Proposed Method

This section presents a detailed explanation of STSG. First, a method for addressing short text sparsity is introduced, which uses the dependency parsing layer and the sentence coding layer. Next, the semantic graph layer is introduced to distinguish the relevance of different words, which improves word relevance modeling. Finally, the similarity computing layer is used for similarity computation, as shown in Figure 1.

Figure 1. Workflow of the proposed STSG.


Problem Definition

Definition 1.

(Short text semantic similarity) Given two short sentences $S_1 = \{p_1, p_2, \ldots, p_m\}$ and $S_2 = \{q_1, q_2, \ldots, q_n\}$, where $p$ and $q$ are the words in sentences $S_1$ and $S_2$, respectively, and $m$ and $n$ represent the lengths of $S_1$ and $S_2$, respectively, a classifier is trained to accurately predict the similarity between $S_1$ and $S_2$. Lin (Citation1998) derived a text similarity theorem based on information theory, as shown in Equation 1:

$$\text{Similarity}(S_1, S_2) = \frac{\log P(\text{common}(S_1, S_2))}{\log P(\text{description}(S_1, S_2))} \tag{1}$$

where $P$ represents probability, $\text{common}(S_1, S_2)$ is the information common to $S_1$ and $S_2$, and $\text{description}(S_1, S_2)$ is the description of all information in $S_1$ and $S_2$. The similarity ranges between 0 and 1. The $\log$ simplifies the calculation of probabilities.

Dependency Parsing Layer

The syntactic dependencies between the words in sentences $S_1$ and $S_2$ are extracted by the dependency parsing layer, which comprises a dependency parsing module and a dependency representing module. Dependency parsing is used to obtain syntactic components and analyze the relationships between them. The attention over different dependencies between words in a short text is then calculated using these dependency parsing features. To obtain the similarity of sentences $S_1$ and $S_2$, the model jointly computes the semantic representation of the two sentences. In this paper, we constructed our dependency parsing module using the DDParser toolkit (Shuai et al. Citation2020) to generate dependency analysis results. DDParser is capable of processing text in both Chinese and English (Shuai et al. Citation2020).
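As a concrete illustration of the parsing step, the snippet below shows how word dependency triples of the form (head, dependent, relation) could be extracted. It uses spaCy's English pipeline purely as a stand-in parser, since the exact DDParser invocation is not given in the text; the function name and output format are illustrative.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")   # English pipeline used here as a stand-in parser

def dependency_triples(sentence: str):
    """Return the token list and (head_index, dependent_index, relation) triples."""
    doc = nlp(sentence)
    words = [t.text for t in doc]
    triples = [(t.head.i, t.i, t.dep_) for t in doc if t.dep_ != "ROOT"]
    return words, triples

print(dependency_triples("I like the larger screen size of the iPhone 6"))
```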

Given a sentence $S = \{W_1, W_2, \ldots, W_n\}$, where $W_n$ is the $n$-th word in $S$, the result of dependency parsing yields the syntactic dependency graph $G = (V, E)$ of $S$, where $V$ is the set of nodes corresponding to the words in $S$, and $E$ is the set of directed edges indicating the dependency relationships between words. Each edge carries a label indicating the specific dependency. Dependency syntax allows the core word to be mapped to its dependency syntax tree, as shown in Figure 2.

Figure 2. Dependency parsing layer.


Dependency parsing concentrates on words and their dependencies rather than on constituents. To enable the model to learn dependency representations, we convert the syntactic dependency graph $G$ into a set of relation triples $R$, as shown in Equation 2:

$$R = \{(W_1, W_3, r_1), (W_4, W_7, r_5), (W_6, W_8, r_3), \ldots, (W_n, W_m, r_i)\} \tag{2}$$

where $r_i$ represents the $i$-th type of dependency relation between words.

The relation triples $R$ are then converted into an adjacency matrix $A = (a_{ij})_{n \times n}$, where $n$ is the matrix size. To calculate the relational attention between words in the text semantic graph layer, noisy information in the dependencies must be removed. We therefore ignore the specific direction of the dependencies between words in the adjacency matrix $A$ and focus only on whether a relation exists. A value in the adjacency matrix $A$ indicates whether there is a dependency relation between the $i$-th word and the $j$-th word. The adjacency matrix $A$ is defined in Equation 3.

$$a_{ij} = \begin{cases} 1, & \text{if there is a dependency relation between } W_i \text{ and } W_j \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

Finally, the adjacency matrix $A$ is vectorized to obtain the dependency syntax representation $D$, which constitutes the edge features of the text semantic graph. The process of the dependency parsing layer is shown in Algorithm 1.

Algorithm 1

Algorithm 1 describes the computing process of the dependency parsing layer. First, in step 1, the output variable $D$ is declared, and in step 2, the syntactic dependency graph $G$ of the sentence $S$ is obtained. In steps 3-5, the relation triples $R$ are derived from $G$. In steps 6-7, the relation triples $R$ are converted into the adjacency matrix $A = (a_{ij})_{n \times n}$, and $a_{ij}$ is initialized to 0 (step 8). In steps 9-12, we check whether two words are connected by a dependency relation, and if so, we set $a_{ij}$ to 1. Finally, we vectorize the adjacency matrix (step 13).
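The following sketch mirrors Algorithm 1 under the assumption that the relation triples index tokens by position (as produced by the parsing sketch above); the function name and toy example are hypothetical.

```python
import numpy as np

def build_adjacency(n_words: int, triples):
    """Algorithm 1 (sketch): turn (head_idx, dep_idx, relation) triples into an
    undirected, unlabeled adjacency matrix A, then flatten it into the edge vector D."""
    A = np.zeros((n_words, n_words), dtype=np.float32)   # steps 6-8: initialise a_ij = 0
    for i, j, _rel in triples:                           # steps 9-12: mark dependency pairs
        A[i, j] = A[j, i] = 1.0                          # direction and label are ignored
    D = A.reshape(-1)                                    # step 13: vectorise the matrix
    return A, D

words = ["I", "like", "the", "screen"]
triples = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "det")]   # toy parse
A, D = build_adjacency(len(words), triples)
print(A)
```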

Sentence Coding Layer

The sentence coding layer encodes a sentence into vectors. Following DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023), we use a multi-layer bidirectional Transformer (Ashish et al. Citation2017; Mary and Marcus Citation2022) as the model backbone to encode the semantic information of short texts, as shown in Figure 3.

Figure 3. Sentence coding layer.


First, given two sentences $S_1 = \{W_{11}, W_{12}, W_{13}, W_{14}, \ldots, W_{1n}\}$ and $S_2 = \{W_{21}, W_{22}, W_{23}, W_{24}, \ldots, W_{2n}\}$, where $W_{ij}$ represents the $j$-th word in $S_i$, we learn the deep interaction features between $S_1$ and $S_2$ by concatenating the two sentences into a joint representation with the tags [CLS] and [SEP]. The tagged sequence $X_i$ is shown in Equation 4.

$$X_i = [\text{CLS}]\; W_{11}, W_{12}, W_{13}, W_{14}, \ldots, W_{1n}\; [\text{SEP}]\; W_{21}, W_{22}, W_{23}, W_{24}, \ldots, W_{2n}\; [\text{SEP}] \tag{4}$$

Then we encode the sequence $X_i$. Specifically, the sentence coding layer takes the sequence $X_i$ as input and applies $N$ Transformer layers to generate a contextual representation $H_n$, as shown in Equation 5:

$$H_n = \text{Transformer}_n(H_{n-1}) \tag{5}$$

where $n \in [1, N]$ indexes the $n$-th layer, $H_0$ is the embedding of the sequence $X_i$, and each Transformer layer contains an identical Transformer block. The Transformer block consists of disentangled attention (He et al. Citation2021) and a feed-forward neural network (FFN). Formally, the output of the Transformer block is calculated as shown in Equations 6-8:

$$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T \tag{6}$$

where $H_i$ represents the $i$-th word embedding of the sequence $X_i$, $P_{j|i}$ represents the relative position embedding of the sequence $X_i$, and $T$ denotes the transpose operation.

Decoupled matrices of content embeddings and relative position embeddings are used to calculate the attention weights between words. The projection matrices are applied as follows:

$$Q_c = H_0 W_{q,c}, \quad K_c = H_0 W_{k,c}, \quad V_c = H_0 W_{v,c}, \quad Q_r = P W_{q,r}, \quad K_r = P W_{k,r} \tag{7}$$

where $Q_c$, $K_c$, and $V_c$ represent the content vectors generated by the projection matrices $W_{q,c}, W_{k,c}, W_{v,c} \in \mathbb{R}^{d \times d}$, respectively, and $P \in \mathbb{R}^{2K \times d}$ represents the relative position embedding shared across layers. $Q_r$ and $K_r$ represent the relative position vectors generated by the projection matrices $W_{q,r}, W_{k,r} \in \mathbb{R}^{d \times d}$, respectively. The attention score is computed as shown in Equation 8:

$$z_{i,j} = Q_i^c (K_j^c)^T + Q_i^c (K_{r,\delta(i,j)})^T + K_j^c (Q_{r,\delta(i,j)})^T \tag{8}$$

where $z_{i,j}$ is the element in the $i$-th row and $j$-th column of the attention matrix $z$, $Q_i^c$ is the $i$-th row of $Q_c$, $K_j^c$ is the $j$-th row of $K_c$, $K_{r,\delta(i,j)}$ is the row of $K_r$ corresponding to the relative distance $\delta(i,j)$, and $Q_{r,\delta(i,j)}$ is the row of $Q_r$ corresponding to the relative distance $\delta(i,j)$. We then calculate the disentangled attention output $\hat{z}_i$, as shown in Eq. (9):

$$\hat{z}_i = \text{Softmax}\!\left(\frac{z_i}{\sqrt{3d}}\right) V_c \tag{9}$$

where Softmax represents the softmax activation function and $z_i$ denotes the $i$-th row of $z$. We feed $\hat{z}_i$ into the FFN and normalize the result to obtain the output $r_i$, as shown in Eq. (10):

$$r_i = \text{LN}\left(\text{FFN}(\hat{z}_i) + \hat{z}_i\right) \tag{10}$$

where LN represents the LayerNorm function (Lei Jimmy et al. Citation2016), and FFN represents a feed-forward neural network.
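To make Equations 7-10 concrete, the following is a minimal, single-head PyTorch sketch of disentangled attention. It is not the actual DeBERTav3 implementation: the real model uses bucketed relative distances, multiple heads, dropout, and additional scaling terms, all of which are simplified away here; the dimensions d and k are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAttentionSketch(nn.Module):
    """Simplified single-head sketch of Eqs. (7)-(10)."""
    def __init__(self, d: int, k: int):
        super().__init__()
        self.d, self.k = d, k
        self.Wq_c = nn.Linear(d, d, bias=False)
        self.Wk_c = nn.Linear(d, d, bias=False)
        self.Wv_c = nn.Linear(d, d, bias=False)
        self.Wq_r = nn.Linear(d, d, bias=False)
        self.Wk_r = nn.Linear(d, d, bias=False)
        self.rel_emb = nn.Embedding(2 * k, d)   # shared relative-position table P
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, H0):                      # H0: (n, d) embeddings of the sequence X_i
        n, d = H0.shape
        # Eq. (7): content and relative-position projections
        Qc, Kc, Vc = self.Wq_c(H0), self.Wk_c(H0), self.Wv_c(H0)
        P = self.rel_emb.weight                 # (2k, d)
        Qr, Kr = self.Wq_r(P), self.Wk_r(P)
        # clipped relative distance delta(i, j), mapped into [0, 2k)
        pos = torch.arange(n, device=H0.device)
        delta = (pos[None, :] - pos[:, None]).clamp(-self.k, self.k - 1) + self.k
        # Eq. (8): content-to-content, content-to-position, position-to-content terms
        c2c = Qc @ Kc.T
        c2p = torch.gather(Qc @ Kr.T, 1, delta)
        p2c = torch.gather(Kc @ Qr.T, 1, delta.T).T
        z = c2c + c2p + p2c
        # Eq. (9): scaled softmax attention applied to the content values
        out = F.softmax(z / math.sqrt(3 * d), dim=-1) @ Vc
        # Eq. (10): FFN, residual connection, and LayerNorm
        return self.norm(self.ffn(out) + out)
```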

Finally, we use Transformer decoders (Ashish et al. Citation2017) to obtain intermediate embeddings, which are fed into a pooling layer. We take the vector $S'$ of the pooling layer at the [CLS] position as the word vector of $X_i$, defined as $S' \in \mathbb{R}^{d \times d_q}$. We regard $S'$ as a node feature of the semantic graph. The procedure of the sentence coding layer is shown in Algorithm 2.

Algorithm 2

Algorithm 2 outlines the computing process of the sentence coding layer. In step 1, the output variable $S'$ is declared. Then, in step 2, the encoder layers are traversed. The content embedding and relative position embedding of the text are obtained in step 3, and five projection matrices are generated in step 4. The disentangled attention is calculated in step 5 and normalized in step 6. The attention output is fed into the FFN in step 7, and the output vector of each encoder layer is accumulated in steps 8-9. Finally, the output vector of the encoder layer is fed into the decoder to obtain the word vector $S'$ in step 10.
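A minimal sketch of the joint encoding step using the Hugging Face transformers library is shown below. The checkpoint name microsoft/deberta-v3-base and the use of the [CLS] hidden state as $S'$ are assumptions; the paper additionally passes intermediate embeddings through Transformer decoders and a pooling layer, which are omitted here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the paper only states that DeBERTav3 is the backbone.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")

def encode_pair(s1: str, s2: str) -> torch.Tensor:
    """Jointly encode two sentences as [CLS] s1 [SEP] s2 [SEP] (Eq. 4) and return
    the hidden state at the [CLS] position as the node feature S'."""
    inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape (1, hidden_size)

s_prime = encode_pair("The cat sat on the mat.", "A cat is sitting on a mat.")
print(s_prime.shape)
```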

Semantics Graph Layer

We construct a semantic graph layer based on the graph attention network (Velikovi et al. Citation2017), which can effectively fuse the word vectors of the text with the dependency representation. Specifically, we build a heterogeneous text graph that contains word nodes and dependency relations. The word vector representation of the text is the node feature of the graph, and the dependency representation of the text is the edge feature of the graph. The semantic graph layer learns the relation weights between word nodes using the graph attention mechanism, as shown in Figure 4.

Figure 4. Semantics graph layer.


When constructing an edge, we do not take into account the dependency type or direction: if two words in a sentence have a dependency relation, we construct an edge between the corresponding nodes. We represent the $F$-dimensional semantic features of the nodes in the graph attention network as $h = \{h_1, h_2, \ldots, h_N\}, h_i \in \mathbb{R}^F$, and the output node features as $h' = \{h'_1, h'_2, \ldots, h'_N\}, h'_i \in \mathbb{R}^{F'}$. Let the vector features of word nodes $u_i$ and $u_j$ be $h_i$ and $h_j$, respectively. For a word node $u_i$ we calculate the attention coefficient $e_{ij}$ between $u_i$ and each of its neighboring word nodes $u_j$, as shown in Eq. (11):

$$e_{ij} = a(W h_i, W h_j), \quad j \in \mathcal{N}_i \tag{11}$$

where $a$ represents a shared attention weight vector, and $W \in \mathbb{R}^{F' \times F}$ represents a learnable weight matrix that maps the node features into higher-dimensional features.

In order to assign attention weights between different word nodes, we normalize the attention coefficients $e_{ij}$ between the word node $u_i$ and all of its neighboring word nodes $u_j$, obtaining the attention score $\alpha_{ij}$, as shown in Eq. (12):

$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \tag{12}$$

where a LeakyReLU activation function is used, expanding Eq. (12) into Eq. (13).

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^T [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(a^T [W h_i \,\|\, W h_k]\right)\right)} \tag{13}$$

We then compute a weighted sum of the features of the neighboring word nodes to obtain the semantic feature representation $h'_i$ of word node $u_i$, as shown in Eq. (14):

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right) \tag{14}$$

where $\sigma$ represents an activation function, and $h'_i$ represents the feature representation of the $i$-th word node produced by the attention layer.

In order to make the attention more accurate, we use a multi-head attention mechanism, applying $K$ groups of independent attention weight matrices to Eq. (14), as shown in Eq. (15):

$$h'_i = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right) \tag{15}$$

Finally, to improve the semantic representation of the text, we concatenate the sentence representation $S'$ with the text semantic graph representation $h'_i$ to obtain the final text semantic representation, as shown in Eq. (16):

$$S_O = \text{Concat}(h'_i, S') \tag{16}$$

where Concat represents the concatenation operation. The procedure of the semantics graph layer is shown in Algorithm 3.

Algorithm 3

Algorithm 3 describes the computing process of the semantics graph layer. First, the output variable $S_O$ is declared (step 1), and the word vector $S'$ is taken as the node features of the graph (step 2). Next, the adjacency matrix $A$ is taken as the edge features of the graph (step 3), and the node and edge features are fed into the GAT to obtain the graph attention vector (step 4). Finally, the word vectors and graph attention vectors are concatenated to obtain the semantic graph vector $S_O$ (step 5).
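A compact sketch of this layer using the GATConv operator from PyTorch Geometric is shown below. Averaging the K attention heads (concat=False) matches Eq. (15); mean-pooling the node features before the concatenation of Eq. (16) is an assumption, since the paper does not specify how the per-node vectors are aggregated.

```python
import torch
from torch_geometric.nn import GATConv

class SemanticGraphLayer(torch.nn.Module):
    """Sketch of Algorithm 3: word vectors as node features, dependency edges from the
    adjacency matrix A, GAT attention (Eqs. 11-15), then concatenation with S' (Eq. 16)."""
    def __init__(self, in_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        # concat=False averages the K attention heads, matching Eq. (15)
        self.gat = GATConv(in_dim, out_dim, heads=heads, concat=False)

    def forward(self, node_feats, adj, s_prime):
        edge_index = adj.nonzero().t().contiguous()       # (2, num_edges) from matrix A
        h = torch.nn.functional.elu(self.gat(node_feats, edge_index))
        graph_repr = h.mean(dim=0, keepdim=True)          # pool node features (assumption)
        return torch.cat([graph_repr, s_prime], dim=-1)   # Eq. (16): S_O = Concat(h', S')

layer = SemanticGraphLayer(in_dim=768, out_dim=768)
nodes = torch.randn(12, 768)        # one feature vector per word node
A = torch.eye(12)                   # placeholder adjacency matrix from the parsing layer
s_o = layer(nodes, A, torch.randn(1, 768))
print(s_o.shape)                    # torch.Size([1, 1536])
```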

Similarity Computing Layer

The similarity computing layer builds a text classifier that calculates the semantic similarity between two sentences. A fully connected neural network classifies the text vector to obtain the semantic similarity. The final text semantic representation $S_O$ obtained in Section 3.4 is used as the input of a fully connected neural network for binary classification. The network produces a classification output in the range (0, 1) and is trained with the cross-entropy loss function. The network is trained jointly with the pre-trained language model DeBERTav3 (as described in Section 3.3) in a fine-tuning pattern to obtain the final output. The formula for the similarity calculation is shown in Eq. (17):

$$y = f(W \times S_O + b) \tag{17}$$

where $f$ is the activation function, $W$ represents the weight matrix, $S_O$ represents the semantic vector, and $b$ is a learnable bias. We use the cross-entropy loss function $L$ to measure the prediction quality, as shown in Eq. (18):

$$L = \frac{1}{N}\sum_{i=1}^{N} L_i = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \tag{18}$$

where $y_i$ represents the label of sample $i$, and $p_i$ represents the probability that sample $i$ is predicted to be positive.
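A minimal sketch of the similarity computing layer is shown below. The single linear layer with a sigmoid activation and the use of BCELoss are assumptions consistent with Eqs. (17) and (18); the paper does not specify the exact activation or hidden sizes.

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Fully connected classifier over the semantic vector S_O (Eq. 17)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)            # W and b of Eq. (17)

    def forward(self, s_o: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(s_o)).squeeze(-1)   # p_i in (0, 1)

head = SimilarityHead(in_dim=1536)
criterion = nn.BCELoss()                                  # binary cross-entropy, Eq. (18)
p = head(torch.randn(8, 1536))                            # a toy batch of 8 sentence pairs
loss = criterion(p, torch.randint(0, 2, (8,)).float())    # 0/1 similarity labels
loss.backward()
```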

Experiments and Results

In this section, we first describe the details of the experiments and the datasets in Sections 4.1 and 4.2, then introduce the baselines in Section 4.3 and the metrics in Section 4.4, and present the results of the main experiments in Section 4.5. In Section 4.6 we show results in the low-resource setting. Finally, we perform ablation testing to prove the effectiveness of each component in Section 4.7.

Experimental Details

The proposed STSG was implemented using the PyTorch 1.7 framework on a Windows server with an Intel(R) Core(TM) i7-10700F CPU @ 2.90 GHz, 32 GB of memory, and an NVIDIA GeForce RTX 3060 GPU. The authors of DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023) suggest hyper-parameters for fine-tuning DeBERTav3 on downstream tasks. To ensure a fair comparison with our baseline model DeBERTav3, we chose the hyper-parameters of STSG to be consistent with it. The original DeBERTav3 article has demonstrated the validity of these parameters, which are described in Table 2.

Table 2. Parameters of STSG.

Datasets

To assess the performance of STSG on short text similarity, we utilized the MRPC dataset (Alex et al. Citation2019), which is the most commonly used authoritative evaluation benchmark for such tasks (Devlin et al. Citation2018; Pengcheng, Jianfeng, and Weizhu Citation2023; Raffel et al. Citation2020; Yung-Sung et al. Citation2022). Additionally, we employed the challenging low-resource BIOSSES dataset (Gizem, Hakime, and Arzucan Citation2017) to validate the feasibility of STSG in low-resource text semantic similarity tasks, as shown in Table 3:

  1. MRPC (The Microsoft Research Paraphrase Corpus) is a dataset consisting of 5800 sentence pairs extracted from online news. The category distribution is unbalanced, with 68% of the pairs being positive samples, which poses a significant challenge to the model’s computation.

  2. BIOSSES is a dataset comprising 100 sentence pairs that were selected from medical abstracts in the biomedical field. The dataset uses score intervals of 0-5 to judge the similarity between the pairs. Due to its low-resource nature and extremely sparse samples, the dataset presents a challenge for models to learn from small samples.

Table 3. Summary of datasets.

Baselines

We used the following 10 models as experimental baselines.

  1. BERT (Devlin et al. Citation2018). This model uses an existing unlabeled corpus to pre-train the Transformer model to obtain word embedding vectors, and then fine-tunes the model to complete the sentence similarity computing.

  2. T5 (Raffel et al. Citation2020). This model uses the unified pre-trained language models to learn multiple different NLP tasks.

  3. ELECTRA (Kevin et al. Citation2020). This model uses a discriminative pre-trained text encoder to calculate the similarity of two sentences.

  4. ERNIE2.0 (Yu et al. Citation2020). This model is a continuous learning framework for deeply integrated knowledge to continuously learn vocabulary, syntactic, and semantic knowledge by introducing more tasks.

  5. SpanBERT (Joshi et al. Citation2020). This model is a pre-trained model based on word segmentation, which adds a mask to the random adjacent word to calculate text similarity.

  6. DistilBERT (Sanh et al. Citation2019). This model is a pre-trained universal language representation model to calculate text similarity.

  7. ESM-2 (Zemin et al. 2022). ESM-2 calculates the sequence text data using large-scale pre-trained models.

  8. XLNet (Yang et al. Citation2019). XLNet is a generalized autoregressive pre-trained model to calculate text similarity.

  9. DeBERTa (He et al. Citation2021). DeBERTa uses the disentangled attention to improve BERT and further improve the performance of the model.

  10. DeBERTav3 (Pengcheng, Jianfeng, and Weizhu Citation2023). This model is an efficient and simple pre-trained model to calculate text similarity.

Metrics

(1) Precision. Precision describes the proportion of sentence pairs predicted as similar (classification result 1) that are truly similar:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{19}$$

(2) Recall. Recall describes the percentage of all truly similar sentence pairs (classification result 1) that are correctly predicted by the STSG model:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{20}$$

(3) Accuracy. Accuracy describes the percentage of all predictions made by the STSG model that are correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{21}$$

(4) F1-score. The F1-score is the harmonic mean of Precision and Recall. The value range of the F1-score is [0, 1]:

$$F1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{22}$$

where TP is the number of similar sentence pairs correctly predicted as similar, TN is the number of non-similar sentence pairs correctly predicted as non-similar, FP is the number of non-similar sentence pairs incorrectly predicted as similar, and FN is the number of similar sentence pairs incorrectly predicted as non-similar.

(5) Pearson. The Pearson metric measures the linear correlation between the predicted results of the model and the correct results, where 0 represents no correlation and 1 represents complete correlation. The corresponding formula is defined in Eq. (23):

$$\text{Pearson} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \tag{23}$$

(6) Spearman. The Spearman metric measures the rank correlation between the predicted outcomes of the model and the correct outcomes. The value ranges from −1 to 1, where −1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. The corresponding formula is defined in Eq. (24):

$$\text{Spearman} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{24}$$

(7) Kendall. Kendall’s correlation is a rank correlation coefficient used to assess the correlation between two random variables based on the rank of the data objects. The corresponding formula is defined as Eq. (25):

$$\text{Kendall} = \frac{c - d}{\sqrt{(c + d + t_x)(c + d + t_y)}} \tag{25}$$

where $c$ and $d$ represent the numbers of concordant and discordant pairs, and $t_x$ and $t_y$ denote the numbers of tied ranks in $x_i$ and $y_i$, respectively.

(8) MAE. The Mean Absolute Error (MAE) is the mean absolute difference between the true values and the predicted values, as shown in Eq. (26):

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - y_i \right| \tag{26}$$

(9) MSE. The Mean Squared Error (MSE) is the mean squared difference between the true and predicted values, as shown in Eq. (27):

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2 \tag{27}$$

where $x_i$ represents the true result, $y_i$ represents the predicted result, $d_i$ is the rank difference between $x_i$ and $y_i$, and $n$ is the number of samples.
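For reference, all of the metrics above can be computed with standard scientific Python libraries; the toy arrays below are illustrative only, not the paper's predictions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1])           # gold labels (toy data)
y_pred = np.array([1, 0, 1, 0, 0, 1])           # predicted labels (toy data)

# Classification metrics (Eqs. 19-22), as used on MRPC
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      accuracy_score(y_true, y_pred), f1_score(y_true, y_pred))

# Correlation and error metrics (Eqs. 23-27), as used on BIOSSES
scores_true = np.array([4.2, 1.0, 3.5, 2.8])
scores_pred = np.array([4.0, 1.3, 3.1, 2.5])
print(pearsonr(scores_true, scores_pred)[0],
      spearmanr(scores_true, scores_pred)[0],
      kendalltau(scores_true, scores_pred)[0])
print(np.mean(np.abs(scores_true - scores_pred)),      # MAE, Eq. (26)
      np.mean((scores_true - scores_pred) ** 2))        # MSE, Eq. (27)
```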

Results of the Main Experiment

(1) Results on MRPC dataset

Table 4 shows the performance of STSG compared to advanced methods on the MRPC dev dataset. STSG achieves the highest performance among all the methods, with an F1-score of .946. The proposed STSG has a recall of .913, outperforming the other methods. Methods based on pre-trained language models, such as BERT, T5, ELECTRA, DeBERTa, DeBERTav3, and SpanBERT, achieve F1-scores ranging from .900 to .926. Contrastive learning methods, such as Sentence-BERT, DiffCSE, and SimCSE, achieve F1-scores ranging from .760 to .777. The MRPC dataset has an unbalanced category distribution, with 68% of sentence pairs being positive samples; as a result, contrastive learning-based methods typically yield low F1-scores, whereas pre-trained language models such as DeBERTav3 and DeBERTa alone can already achieve good results. It is worth noting that pre-trained language models may struggle to learn the semantics of sentences, resulting in a large discrepancy between their F1-scores and Recall scores. In contrast, STSG, which combines syntactic information and pre-trained models, captures the semantics of sentences better. As a result, the similarity calculation is more accurate, and the differences among the F1-score, Recall, and Accuracy metrics are smaller.

Table 4. Evaluation on MRPC dataset.

(2) Evaluation on BIOSSES dataset

Table 5 shows the performance of STSG in comparison to other methods on the BIOSSES test dataset. STSG achieves the best MAE and MSE errors, with a Pearson value of .915 and a Spearman value of .835. The primary cause of these results is the limited number of training samples in the BIOSSES dataset, which comprises only 64 pairs of sentences and contains more noisy data. Methods that rely on pre-trained language models can be challenging to apply to low-resource tasks; for example, models such as DeBERTav3 and BioBERT do not yield high Pearson values. STSG enhances the global semantic information of sentences by utilizing syntactic dependencies. Compared to DeBERTav3, its Pearson value is improved by 8.67%. These results demonstrate that combining syntactic dependencies and pre-trained language models can be effective in textual semantic similarity tasks with small datasets.

Table 5. Evaluation on BIOSSES dataset.

The MAE and MSE metrics assess the deviation between the model's predicted values and the true values of the samples; the smaller the MAE and MSE, the better the predictive performance of the model. Table 5 shows that some methods, such as ERNIE2.0 and XLNet, have high Pearson values but large MAE and MSE values due to poor handling of sentence semantics. The proposed STSG has an MAE of .518 and an MSE of .376, the former 40.60% lower than the MAE of T5.

Results on the Low-Resource Setting

Table 6 shows the experiments conducted with various training ratios on the MRPC dev dataset. We divided the training set of MRPC into new subsets of different sizes (called training ratios) to train the model, and used the full validation set to evaluate the model in order to test the generalization ability of STSG on the classification problem.

Table 6. Low-resource settings of MRPC datasets.

We also use different training ratios on the MRPC dataset to plot Receiver Operating Characteristic (ROC) curves. Figure 5 shows the ROC curves of the proposed STSG and several baselines on the MRPC dataset. Specifically, the Area Under the Curve (AUC) of STSG is .913 at a training ratio of 100%, .869 at 50%, .772 at 20%, and .693 at 10%. The pre-trained language model approaches perform poorly in terms of AUC in the low-resource settings because of the imbalance of positive and negative samples in MRPC; they struggle to accurately identify negative examples. The results demonstrate that STSG achieves a high True Positive Rate for positive samples at different training ratios while maintaining a low False Positive Rate for negative samples.

Figure 5. ROC curve on MRPC dataset.

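The ROC curves and AUC values discussed above can be reproduced from model scores with scikit-learn; the labels and probabilities below are toy placeholders, not the paper's predictions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                       # toy gold labels
y_score = np.array([0.92, 0.81, 0.45, 0.67, 0.30, 0.52, 0.88, 0.12])  # toy model scores

fpr, tpr, _ = roc_curve(y_true, y_score)   # False/True Positive Rates at each threshold
print(auc(fpr, tpr))                        # Area Under the ROC Curve
```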

Table 7 shows that the F1-score of the proposed STSG at a training ratio of 1% is .812 and the accuracy is .684, representing an increase in F1-score of 14.04% and in accuracy of 12.50% compared to BERT. At a training ratio of 50%, the F1-score of STSG is .909 and the accuracy is .873.

Table 7. Evaluation on MRPC dataset by different training ratios.

The experimental results in Table 7 show that the proposed STSG outperforms the other baselines at different training ratios on the MRPC dataset. Existing methods ignore the semantic correlation between words, resulting in unstable results under different training ratios. In contrast, the proposed STSG learns dependencies between words and incorporates the feature learning of neighboring nodes, leading to stable prediction performance.

Figure 6 shows that the Pearson of STSG is .895 at a training ratio of 50%, .509 at 20%, and .578 at 10%. These results show that STSG can achieve accurate predictions on small subsets of BIOSSES with different proportions, and that its generalization ability exceeds that of comparable baselines. STSG takes into account the feature learning of neighboring nodes, which enables the model to better recognize the context in short texts.

Figure 6. Pearson on BIOSSES dataset.


Table 8 shows the experiments conducted with different training ratios on the BIOSSES test dataset. We split the training set of BIOSSES into new subsets of different sizes to train the model, and used the full validation set to verify the performance of the model and test the generalization ability of the proposed STSG.

Table 8. Low-resource settings of BIOSSES datasets.

Figure 7 presents the experimental results for different training scales on the BIOSSES test dataset. The BIOSSES training set was split into new subsets of varying sizes to train the model, and the model's MAE and MSE were validated using 100% and 50% sized validation sets. With a 50% training ratio, STSG achieved an MAE of .605 and an MSE of .555, representing an 11.03% decrease in MAE and a 36.21% decrease in MSE compared to T5. Compared to other methods, STSG exhibits smaller MAE and MSE values at different training ratios. Methods based on pre-trained language models tend to yield large errors in low-resource experiments because these models are not fine-tuned for downstream tasks and are difficult to adapt to small sample datasets. STSG utilizes syntactic information and textual semantic graphs to improve the model's generalization ability, resulting in strong predictive performance even on small datasets.

Figure 7. Errors on BIOSSES dataset.


Ablation Testing

Tables 9 and 10 show the ablation testing results of STSG on MRPC and BIOSSES, respectively. The STSG model without the dependency parsing layer is denoted -w/o DP. Table 9 shows that the F1-score of -w/o DP on the MRPC dataset is .926, a decrease of 2.11% compared to STSG. Similarly, Table 10 shows that the Pearson of -w/o DP is .842, a decrease of 7.98% compared to STSG. These results show that the proposed STSG takes into account the syntactic structure features between sentence pairs, which helps the model understand the true intention of the sentence expression, thus improving the accuracy of the similarity computation task.

Table 9. Ablation testing on MRPC dataset.

Table 10. Ablation testing on BIOSSES dataset.

Conclusion and Future Works

Short text semantic similarity computation is a fundamental problem in natural language processing that aims to predict the similarity between two sentences. To tackle the shortcomings of current short text semantic similarity methods, we introduce STSG, which is based on dependency parsing and pre-trained language models. The model utilizes syntactic information and incorporates it into pre-trained language models, thereby augmenting the global semantic information of sentences to address the issue of semantic sparsity. We propose a textual semantic graph layer that uses word vectors and dependency parsing relations as features of a Graph Attention Network (Velikovi et al. Citation2017) to improve word relevance modeling. However, the STSG model has limitations, such as its inability to process sentence pairs with more than 512 tokens. Additionally, the model can only calculate semantic similarity for English text. Moreover, due to limited computational resources, we were unable to experiment with the latest large models such as GPT4 (Achiam et al. Citation2023). Our future plans involve addressing the current limitations of STSG and proposing a new method for measuring the semantic similarity of text by combining textual semantic dependency analysis with large-scale language models.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Open Project of Sichuan Provincial Key Laboratory of Philosophy and Social Science for Language Intelligence in Special Education under Grant No. YYZN-2023-4 and the Ph.D. Fund of Chengdu Technological University under Grant No. 2020RC002.

References

  • Achiam, J., S. Adler, S. Agarwal, L. Ahmad, I. Akkaya. 2023 Dec 19. Gpt-4 technical report. arXiv Preprint arXiv 2303:08774.
  • Alex, W., S. Amanpreet, M. Julian, H. Felix, L. Omer, and B. Samuel. 2019 Feb 22. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv Preprint arXiv 1804:07461.
  • Ashish, V., S. Noam, P. Niki, U. Jakob, J. Llion. 2017. Attention is all you need. Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach.
  • Binyuan, H., G. Ruiying, W. Lihan, Q. Bowen, L. Bowen. 2022. S2SQL: Injecting syntax to question-schema interaction graph encoder for text-to-SQL parsers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 1254–29. Dublin.
  • Chandrasekaran, D., and V. Mago. 2021. Evolution of semantic similarity—A survey. ACM Computing Surveys 54 (2):1–37. doi:10.1145/3440755.
  • Dazhi, J., W. Runguo, H. Zhihui, L. Senlin, L. Cheng, and Y. Lin. 2023. GASN: Gamma distribution test for driver genes identification based on similarity networks. Connection Science 35 (1):1–19. doi:10.1080/09540091.2023.2167937.
  • Devlin, J., M. W. Chang, K. Lee, and K. Toutanova. 2018 Oct 11. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv 1810:04805.
  • Ghafour, A., M. Jamshid Bagherzadeh, and F. Mohammad-Reza. 2022. Learning bilingual word embedding mappings with similar words in related languages using GAN. Applied Artificial Intelligence 36 (1). doi:10.1080/08839514.2021.2019885.
  • Gizem, S., Ö. Hakime, and Ö. Arzucan. 2017. BIOSSES: A semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 33 (14):I49–58. doi:10.1093/bioinformatics/btx238.
  • Guimin, C., T. Yuanhe, S. Yan, and W. Xiang. 2021. Relation extraction with type-aware map memories of word dependencies. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2501–12. Bangkok.
  • He, P., X. Liu, J. Gao, and W. Chen. 2021 Oct 6. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv Preprint arXiv 2006:03654.
  • Hironori, T., S. Junya, F. Sulfayanti, and K. Akihiro. 2022. Anomaly detection using siamese network with attention mechanism for few-shot learning. Applied Artificial Intelligence 36 (1). doi:10.1080/08839514.2022.2094885.
  • Ilya, L., and H. Frank. 2019 Jan 4. Decoupled weight decay regularization. arXiv Preprint arXiv 1711:05101.
  • Jianguo, C., L. Kenli, B. Kashif, Z. Xu, L. Keqin. 2019. A Bi-layered parallel training architecture for large-scale convolutional neural networks. IEEE Transactions on Parallel and Distributed Systems 30(5):965–76. doi:10.1109/TPDS.2018.2877359.
  • Jonas, M., and T. Aditya. 2016. Siamese recurrent architectures for learning sentence similarity. Proceedings of the AAAI conference on artificial intelligence, 2786–92. Arizona.
  • Joshi, M., D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8:64–77. doi:10.1162/tacl_a_00300.
  • Kevin, C., L. Minh-Thang, V. Quoc, and M. Christopher. 2020 May 23. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv Preprint arXiv 2003:10555.
  • Lei Jimmy, B., K. Ryan, and H. Geoffrey. 2016 Jul 21. Layer normalization. arXiv Preprint arXiv 1607:06450.
  • Liang, Y., M. Chengsheng, and L. Yuan. 2019. Graph convolutional networks for text classification. Proceedings of the AAAI conference on artificial intelligence, 7370–77. Hawaii.
  • Lin, D. 1998. An information-theoretic definition of similarity. Proceedings of the International Conference on Machine Learning, 296–304. Madison.
  • Lingling, X., X. Haoran, W. Fu Lee, T. Xiaohui, W. Weiming, and Q. Li. 2024. Contrastive sentence representation learning with adaptive false negative cancellation. Information Fusion 102:102065–102065. doi:10.1016/j.inffus.2023.102065.
  • Mary, P., and H. Marcus. 2022 Jul 19. Formal algorithms for transformers. arXiv Preprint arXiv 2207:09238.
  • Mengting, H., Z. Xuan, Y. Xin, J. Jiahao, Y. Wei, and C. Gao. 2021. A survey on the techniques, applications, and performance of short text semantic similarity. Concurrency and Computation: Practice and Experience 33 (5). doi:10.1002/cpe.5971.
  • Minh Hieu, P., and O. Philip. 2020. Modelling context and syntactical features for aspect-based sentiment analysis. Proceedings of the 58th annual meeting of the association for computational linguistics, 3211–20. Washington.
  • Nils, R., and G. Iryna. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language, 3980–90. Hong Kong.
  • Paul, N., V. Maarten, and R. Mihai. 2016. Learning text similarity with siamese recurrent networks. Proceedings of the 1st Workshop on Representation Learning for NLP, 148–57. Berlin.
  • Pengcheng, H., G. Jianfeng, and C. Weizhu. 2023. DeBERTav3: Improving DeBERTa using ELECTRA-Style pre-training with gradient-disentangled embedding sharing. The Eleventh International Conference on Learning Representations, Kigali.
  • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21 (1):5485–551.
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019 Mar 1. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv Preprint arXiv 1910:01108.
  • Shuai, Z., W. Lijie, S. Ke, and X. Xinyan. 2020 Sep 3. A practical Chinese dependency parser based on a large-scale dataset. arXiv Preprint arXiv 2009:00901.
  • Song, C., and L. Hai. 2022. BERT-Log: Anomaly detection for system logs based on pre-trained language model. Applied Artificial Intelligence 36 (1). doi:10.1080/08839514.2022.2145642.
  • Tianyu, G., Y. Xingcheng, and C. Danqi. 2021. SimCSE - simple contrastive learning of sentence embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 6894–910. Punta Cana.
  • Velikovi, P., G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. 2017 Feb 4. Graph attention networks. arXiv Preprint arXiv 1710:10903.
  • Weidong, Z., L. Xiaotong, J. Jun, and X. Rongchang. 2022. Re-LSTM: A long short-term memory network text similarity algorithm based on weighted word embedding. Connection Science 34 (1):2652–70. doi:10.1080/09540091.2022.2140122.
  • Wenjuan, L., S. Zhengyan, W. Subo, Z. Shunxiang, Z. Guangli, and L. Chen. 2024. PS-GCN: Psycholinguistic graph and sentiment semantic fused graph convolutional networks for personality detection. Connection Science 36 (1). doi:10.1080/09540091.2023.2295820.
  • Yan, K., P. Bin, K. Yongqi, Y. Yun, C. Jianguo, and X. Xie. 2024. Two-stage perceptual quality oriented rate control algorithm for HEVC. ACM Transactions on Multimedia Computing, Communications and Applications 20 (5):1–20. doi:10.1145/3636510.
  • Yang, Z., Z. Dai, Y. Yang, G. Jaime, R. R. Salakhutdinov, and Q. V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5753–63. Vancouver.
  • Yangfan, L., C. Cen, D. Mingxing, Z. Zeng, and L. Kenli. 2021. Attention-aware encoder–decoder neural networks for heterogeneous graphs of things. IEEE Transactions on Industrial Informatics 17 (4):2890–98. doi:10.1109/TII.2020.3025592.
  • Yifan, P., Y. Shankai, and L. Zhiyong. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. Proceedings of the 18th BioNLP Workshop and Shared Task, 58–65. Florence.
  • Yuanhe, T., C. Guimin, S. Yan, and W. Xiang. 2021. Dependency-driven relation extraction with attentive graph convolutional networks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 4458–71. Bangkok.
  • Yung-Sung, C., D. Rumen, L. Hongyin, Z. Yang, C. Shiyu, M. Soljačić, S. W. Li, W. T. Yih, Y. Kim, and J. Glass. 2022. DiffCSE: Difference-based contrastive learning for sentence embeddings. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4207–18. Seattle.
  • Yu, S., W. Shuohuan, L. Yukun, F. Shikun, T. Hao, H. Wu, and H. Wang. 2020. Ernie 2.0: A continual pre-training framework for language understanding. Proceedings of the AAAI conference on artificial intelligence, 8968–75. New York.
  • Zeming, L., A. Halil, R. Roshan, H. Brian, Z. Zhongkai, L. Wenting, S. Nikita, V. Robert, K. Ori, S. Yaniv, et al. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379 (6637):1123–30.
  • Zhaorui, T., Y. Xi, Y. Zihan, W. Qiufeng, Y. Yuyao, A. Nguyen, and K. Huang. 2023. Semantic similarity distance: Towards better text-image consistency metric in text-to-image generation. Pattern Recognit 144:109883–109883. doi:10.1016/j.patcog.2023.109883.
  • Zhe, Q., W. Zhi-Jie, L. Yuquan, Y. Bin, L. Kenli, and J. Yin. 2019. An efficient framework for sentence similarity modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (4):853–65. doi:10.1109/TASLP.2019.2899494.
  • Zhiguo, W., M. Haitao, and I. Abraham. 2016 Feb 23. Sentence similarity learning by lexical decomposition and composition. arXiv Preprint arXiv 1602:07019.