Computer Science

Exploring the health care system’s representation in the media through hierarchical topic modeling

Article: 2324614 | Received 10 May 2023, Accepted 24 Feb 2024, Published online: 13 Mar 2024

Abstract

As a large social structure, the health care system is often reflected in media publications. This creates a significant impact on society’s attitude towards the system and the state in general. In order to predict and correct state policies, media actions, and identify media shortcomings, it is necessary to analyze the image portrayed by the media and the public’s attitude towards it. In this article, we present the results of a multidirectional analysis of a corpus of media publications related to health care. We propose a method for analyzing the information image of health care formed by the mass media based on a topic model of a text corpus. The method evaluates reader interest in various healthcare topics, the dynamics of changes in publication sentiment, and the main information trends. The article presents the results of analyzing a corpus of mass media publications in Kazakhstan from January 2020 to January 2023.

1. Introduction

Relatively low efficiency and productivity within the health care sector provoke an increase in fiscal restrictions, which in turn contribute to an increase in social tension and a decline in economic growth, especially during pandemics (Baldwin & Weder di Mauro, 2020). At the same time, the health care system is not only a medical system; it is part of the social and economic structures of society. The prevalence of economic restrictions aggravates the problems caused by high societal expectations and increased demand for health care services (Atun, 2015). Moreover, insufficient or ineffective representation of health care authorities in the information space contributes to the spread of rumors and misinformation (Tasnim et al., 2020) and affects the mental health of the population (Gao et al., 2020; Giri & Maurya, 2021; Aslam et al., 2020; Hamidein et al., 2020). When readers lack knowledge and experience regarding what is happening, they are especially dependent on the information provided by the mass media (Miller, 1998). Ultimately, this situation leads to an infodemic, in which rumors and false information spread faster than viruses (Pichai, 2023).

It is therefore necessary to assess and monitor the extent to which information topics related to health care are of interest to society, which of these topics demonstrate rapid growth, how state policy in the field of health care is reflected in mass media publications, and so on. In other words, it is important to carry out a systematic multidirectional analysis of the information image of the health care system formed by the media. There are quite a few natural language processing tools for solving this task (Barakhnin et al., 2018; Sadovskaya et al., 2021).

One such tool is topic modeling. Topic modeling is a method based on the statistical characteristics of collections of documents, which is used in tasks of automatic abstracting, information extraction, information retrieval and classification (Mashechkin et al., 2013). The method is based on a clearly expressed pattern: the corpus of documents forms groups in which the frequency of occurrence of words or combinations of words is different. In this research topic modeling is used to identify the texts related to health care, the topics that are of most interest to readers, the information trends that describe health care, the sentiment of publications, and the coverage of public health policy. The directions of analysis of the information image of health care in the study are illustrated in Figure 1.

Figure 1. Multi-vector analysis of the information image of health care.


Reader interest reflects the preferences of readers for specific topics or themes over a certain period of time. Information trends reflect changes in the media’s coverage of the most important news over time. Coverage of public policy shows how fully the media take into account the government documents aimed at regulating the health care system. Sentiment illustrates the media’s attitude towards health care related topics. Taken together, these vectors provide several complementary perspectives on the work of the mass media describing the health care system.

The main contribution of the study is as follows.

First, we have proposed a method that allows us to perform a multidirectional analysis of the information image of health care formed by the mass media using a topic model for the corpus of texts.

Second, the reader interest, the dynamics of changes in the sentiment of publications and the main information trends are evaluated on the basis of the proposed method.

Third, the method has been tested and the results of its full-scale application on the basis of the corpus of the mass media publications in Kazakhstan are presented.

The study consists of the following sections:

In the first section the authors consider the literature sources that address the relevant topic of the study.

The second section describes the data collection system, the corpus of texts and the hierarchical topic model, with the help of which the media analysis was performed.

The third section describes and discusses the obtained results.

The conclusion summarizes the results, analyzes the limitations of the method and possible future research.

2. Related works

The influence of the media plays an important role in shaping health policy (Baba et al., 2017). For example, an integrated systematic review (Shen et al., 2021) provides an assessment of the impact of planned media interventions, including social networks, on the health policy-making process. The media have a positive impact when their tools are used to improve the accountability of government, to identify priorities, to initiate discussions and to raise awareness (Bou-Karroum et al., 2017). In times of social upheaval, the role of traditional media increases, and people who are usually indifferent to information return to traditional sources of news (Casero-Ripolles, 2020). At the same time, the media use special techniques and manipulation mechanisms to form a certain opinion and attitude (Bushman & Whitaker, 2012; Stacks et al., 2015). An infodemic occurred during a global health crisis, leading to the spread of false information and recommendations that were exchanged by media outlets without proper verification and review by specialists (Mheidly & Fares, 2020). News headlines became mostly negative (51% negative and only 30% positive) (Aslam et al., 2020). The number of publications with a negative sentiment regarding the health care system increased sharply (Yakunin et al., 2021). As a result, this information background led to the growth of negative emotions (Hamidein et al., 2020). Due to the great influence of the media, it is necessary to assess the negative impact of media systems and to encourage their positive impact (Bushman & Whitaker, 2012). Properly adjusted public health and media policies can help to increase public confidence in both traditional media (Tandoc, 2018) and the health care system as a whole.

Media monitoring tools, which are part of reputation management (Bauer & Suerdem, 2016; Barysevich, 2019), are used to evaluate the media. However, these systems are inherently limited in evaluation criteria (usually sentiment assessment) and methods (usually keyword search) (Mukhamediev et al., 2020).

From the point of view of monitoring the publications devoted to health care, the literature provides examples of solving individual problems, such as the sentiment of publications (Aslam et al., 2020), detection of misinformation and rumors (Song et al., 2021), attitudes towards the government social policy in the field of public health (Jo & Chang, 2020; Mohamed Ridhwan & Hargreaves, 2021), and analysis of information trends (Yakunin et al., 2022).

Many works have used sentiment analysis as a tool to study the reactions of people during a pandemic through their social media posts (Dubey, 2020), to identify differences between people’s reactions in different countries (Li et al., 2020), and to assess the dynamics of people’s sentiment (Zhou et al., 2021; Yin et al., 2020). Various techniques are used to extract relevant posts (tweets), including those based on machine learning. For example, a recurrent neural network was used in (Nemes & Kiss, 2020) to extract posts related to the pandemic. It seems useful to combine these disparate solutions into a system that allows publication activity to be assessed from several perspectives.

However, a sufficiently accurate assessment of sentiment or policy requires a substantial corpus of texts labeled by experts. In the absence of a labeled corpus, it is productive to use clustering methods, a variation of which is topic modeling. Topic modeling is widely used in the tasks of automatic abstracting, information extraction, information retrieval and classification (Vayansky & Kumar, 2020; Yakunin et al., 2021) and identification (Kirill et al., 2020). The basis of modern topic models is the statistical model of natural language. Probabilistic topic models describe documents by a discrete distribution on a set of topics, and topics by a discrete distribution on a set of terms (Vorontsov & Potapenko, 2012). As a result, the topic model determines which topics each document belongs to and which words form each topic. The groups of terms and phrases formed in the process of topic modeling help, in particular, to solve the problems of synonymy and polysemy of terms (Parhomenko et al., 2017). Topic modeling using various algorithms is widely used to analyze health-related topics (Erzurumlu & Pachamanova, 2020; Rajendra Prasad et al., 2019) and especially COVID-19 (Yin et al., 2022).

In the present study the topic modeling methods are used to solve the problem of analyzing the social impact of the health care system. Specifically, the following tasks were set and solved:

  • Ranking of topic groups of media texts (topics) devoted to health care according to the degree of reader interest;

  • Identification of leading information trends in the health care domain and assessment of the dynamics of their change;

  • Evaluation of government policy in the field of health care according to the media coverage;

  • Assessment of the dynamics of changes in the sentiment of publications on the topic of health care.

The models and the obtained results are discussed below.

3. Method

The methodological scheme of the study (see Figure 2) includes the formation of a corpus of documents, the formation of a topic model of the corpus and the identification of documents related to health care. Then the following problems are solved:

Figure 2. Rating of topics by attractiveness to readers.

  • Ranking topics according to the degree of reader interest;

  • Analysis of the dynamics of health care information trends;

  • Evaluation of health care policy;

  • Assessing the sentiment of mass media publications on health issues.

3.1. Formation of the corpus of media documents

In order to solve the stated tasks, a system of data collection from open sources of information was developed (Yakunin et al., 2021; Barakhnin et al., 2019). Using this system, a corpus of media documents was formed (Yakunin et al., 2021; Yakunin et al., 2021). The collection of publications began in February 2020 and continues to the present time. At the time of writing the corpus consists of 7 367 372 documents from the leading news publications of Kazakhstan and some Russian sources. The corpus includes articles from more than 20 major Kazakhstani news resources (Tengri News, Forbes.kz, Vlast, Komsomolskaya Pravda, and some major social networks) (see Figure 3).

Figure 3. A corpus of media documents. The percentage of content and number of articles in the corpus in thousands are given for each source.


The resulting corpus is used in the subsequent stages of analysis. The core of the analysis process is topic modeling.

3.2. Formation of the topic model of the corpus of media texts and identification of the documents related to health care

As indicated above, one of the most effective methods for analyzing unlabeled text corpora is topic modeling. Most often researchers use latent Dirichlet allocation (LDA) (Jelodar et al., 2018) to form a topic model. In this study we used Additive Regularization of Topic Models (ARTM) (Vorontsov et al., 2015), which is an extension of LDA. It differs in its use of configurable regularizers that allow fine-tuning of the desired result of the model, including reducing or increasing the propensity of the model to include a word and/or document in several topics simultaneously, changing the propensity of the model to have more or fewer non-zero weights in the resulting matrix, etc.

LDA can be expressed by the following equation (Blei et al., 2003):

$$p(w \mid d) = \sum_{t \in T} p(w \mid t, d)\, p(t \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d) = \sum_{t \in T} \varphi_{wt}\, \theta_{td} \tag{1}$$

where the sum aggregates the conditional distributions across all topics in the set $T$, $p(w \mid t)$ is the conditional distribution of words in topics, and $p(t \mid d)$ is the conditional distribution of topics in documents. These ratios are valid under the assumption that there is no need to maintain the order of documents in the corpus and the order of words in the documents. The LDA method assumes that the components $\varphi_{wt}$ and $\theta_{td}$ are generated by a continuous multidimensional Dirichlet distribution. The goal of the algorithm is to find the parameters $\varphi_{wt}$ and $\theta_{td}$ by maximizing the likelihood function with the corresponding regularization:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\, \theta_{td} + R(\varphi, \theta) \to \max \tag{2}$$

where $n_{dw}$ is the number of occurrences of word $w$ in document $d$; $\varphi_{wt}$ is the distribution of word $w$ in topic $t$, and $\theta_{td}$ is the distribution of topic $t$ over documents $d$.
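As a minimal sketch of equations (1) and (2), the snippet below evaluates p(w|d) and the unregularized log-likelihood for a toy model. The matrices `phi` and `theta`, the counts `n_dw`, and both function names are hypothetical illustrations, not the authors' implementation or the BigARTM API.

```python
import math

def word_given_doc(phi, theta, w, d):
    """p(w|d) = sum over topics t of phi[w][t] * theta[t][d], as in equation (1)."""
    return sum(phi[w][t] * theta[t][d] for t in range(len(theta)))

def log_likelihood(n_dw, phi, theta):
    """Unregularized part of equation (2): sum_d sum_w n_dw * ln p(w|d)."""
    total = 0.0
    for d, counts in enumerate(n_dw):
        for w, n in counts.items():
            total += n * math.log(word_given_doc(phi, theta, w, d))
    return total

# Toy data (made-up numbers): 3 words, 2 topics, 2 documents.
phi = [[0.7, 0.1],    # phi[w][t] = p(w|t); each column sums to 1
       [0.2, 0.3],
       [0.1, 0.6]]
theta = [[0.9, 0.2],  # theta[t][d] = p(t|d); each column sums to 1
         [0.1, 0.8]]
n_dw = [{0: 5, 1: 2}, {1: 1, 2: 4}]  # word counts per document

print(word_given_doc(phi, theta, 0, 0))
print(log_likelihood(n_dw, phi, theta))
```

A regularized objective in the sense of (2) would add the value of R(φ, θ) to the returned log-likelihood.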

In additive regularization, the logarithm of the likelihood, which recovers the original word distribution over the documents $D$, is combined with a weighted sum of regularizers based on a variety of criteria:

$$R(\varphi, \theta) = \sum_{i=1}^{n} \tau_i R_i(\varphi, \theta) \tag{3}$$

that is, a weighted linear combination of regularizers with non-negative weights $\tau_i$. ARTM provides a set of regularizers implemented on the basis of the Kullback–Leibler divergence, which in this case expresses the entropy difference between the distribution of the initial matrix $\hat{p}(w \mid d)$ and that of the model $p(w \mid d)$, with Dirichlet hyperparameters $\beta_0 \beta_t$ and $\alpha_0 \alpha_t$ (identical to the implementation of latent Dirichlet allocation, LDA, in which the hyperparameters can only be positive):

$$R(\varphi, \theta) = \beta_0 \sum_{t \in T} \sum_{w \in W} \beta_{wt} \ln \varphi_{wt} + \alpha_0 \sum_{d \in D} \sum_{t \in T} \alpha_{td} \ln \theta_{td} \to \max \tag{4}$$

Therefore, we can identify the background topics by determining the vocabulary of the language, or calculate the overall vocabulary in the section of each document.

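The sparsing regularizer, which is the inverse of the smoothing regularizer, is defined as follows:

$$R(\varphi, \theta) = -\beta_0 \sum_{t \in T} \sum_{w \in W} \beta_{wt} \ln \varphi_{wt} - \alpha_0 \sum_{d \in D} \sum_{t \in T} \alpha_{td} \ln \theta_{td} \to \max \tag{5}$$

It is aimed at identifying significant topic words, the so-called lexical cores, in addition to the thematic topics in each document, by zeroing out small probabilities.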

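The decorrelating regularizer makes topics more ‘different’. Selection of topics allows the model to discard small, non-informative, duplicative and dependent topics.

$$R(\varphi, \theta) = -0.5\, \tau \sum_{t \in T} \sum_{s \in T \setminus t} \operatorname{cov}(\varphi_t, \varphi_s) \to \max, \qquad \operatorname{cov}(\varphi_t, \varphi_s) = \sum_{w \in W} \varphi_{wt}\, \varphi_{ws} \tag{6}$$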

This regularizer does not depend on the matrix $\theta$. The differences between the discrete distributions are determined through $\varphi_{wt} = p(w \mid t)$: the measure is the covariance of the current distribution of words in topic $\varphi_t$ with the distributions $\varphi_s$ of the other topics, where $s \in T \setminus t$.
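The decorrelating regularizer can be illustrated with a short sketch that evaluates its value for a given word-topic matrix. It assumes the usual ARTM form with a negative sign, so that overlapping topics are penalized; the function name and the toy matrices are hypothetical.

```python
def decorrelation(phi, tau):
    """R = -0.5 * tau * sum_t sum_{s != t} sum_w phi[w][t] * phi[w][s].

    phi[w][t] = p(w|t). The value is 0 when topics share no words and
    becomes more negative as topics overlap.
    """
    n_topics = len(phi[0])
    total = 0.0
    for t in range(n_topics):
        for s in range(n_topics):
            if s != t:
                total += sum(row[t] * row[s] for row in phi)
    return -0.5 * tau * total

disjoint = [[1.0, 0.0],
            [0.0, 1.0]]   # two topics with no words in common
identical = [[0.5, 0.5],
             [0.5, 0.5]]  # two topics with the same word distribution

print(decorrelation(disjoint, tau=1.0))
print(decorrelation(identical, tau=1.0))
```

Because the objective is maximized, the optimizer is pushed away from the second configuration and towards the first.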

The following perplexity metric (Krasnov & Sen, 2019) was used to assess the quality of topic models:

$$PP(p) = 2^{H(p)} = 2^{-\sum_{x} p(x) \log_2 p(x)} = \prod_{x} p(x)^{-p(x)} \tag{7}$$

where $H(p)$ is the information entropy of the distribution, and $x$ is an iterator over samples (documents). Perplexity has no absolute scale against which a single value can be judged; therefore, it is usually used to compare different models on the same dataset or to detect the ‘elbow effect’ to determine the optimal number of topics. The lower the perplexity, the better the texts from the topic group match each other.
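A minimal sketch of equation (7) for a single discrete distribution is given below. It computes perplexity both through the entropy and through the equivalent product form; the function names are illustrative only.

```python
import math

def perplexity(p):
    """PP(p) = 2 ** H(p), where H(p) = -sum_x p(x) * log2 p(x)."""
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    return 2 ** entropy

def perplexity_product(p):
    """Equivalent product form: prod_x p(x) ** (-p(x))."""
    result = 1.0
    for x in p:
        if x > 0:
            result *= x ** (-x)
    return result

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(perplexity(uniform))
print(perplexity(peaked))
```

A uniform distribution over four outcomes has perplexity 4, while a sharply peaked distribution approaches 1, which matches the reading that lower values indicate a more concentrated model.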

Contrast is calculated as follows:

$$Con = \frac{1}{|W_t|} \sum_{w \in W_t} p(t \mid w) \tag{8}$$

where $W_t$ is the core of the topic, i.e. the words from the topic whose relative weight is greater than or equal to the specified threshold. This metric allows estimating how the topics differ from each other. The higher its value, the more the topic under study contrasts with the others.

Purity is determined as follows:

$$Pur = \sum_{w \in W_t} p(w \mid t) \tag{9}$$

where $W_t$ is again the core of the topic. Purity shows how many ‘superfluous’ words are used in a topic group. The higher the Purity value, the fewer common words are used in the documents related to the topic group.
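The two kernel-based metrics (8) and (9) can be sketched as follows. Two assumptions are made here that the text does not fix: the ‘relative weight’ that defines the core is read as p(t|w), and p(t|w) is derived from the word-topic matrix under a uniform topic prior. All names and numbers are hypothetical.

```python
def topic_given_word(phi, w, t):
    """p(t|w) from phi[w][t] = p(w|t), assuming a uniform prior over topics."""
    row = phi[w]
    return row[t] / sum(row)

def kernel(phi, t, threshold):
    """Core of topic t: words with p(t|w) >= threshold (an assumed definition)."""
    return [w for w in range(len(phi)) if topic_given_word(phi, w, t) >= threshold]

def contrast(phi, t, threshold):
    """Equation (8): mean of p(t|w) over the core of topic t."""
    core = kernel(phi, t, threshold)
    if not core:
        return 0.0
    return sum(topic_given_word(phi, w, t) for w in core) / len(core)

def purity(phi, t, threshold):
    """Equation (9): sum of p(w|t) over the core of topic t."""
    return sum(phi[w][t] for w in kernel(phi, t, threshold))

# Toy word-topic matrix: 3 words, 2 topics (made-up numbers).
phi = [[0.7, 0.1],
       [0.2, 0.3],
       [0.1, 0.6]]

print(kernel(phi, 0, 0.5), contrast(phi, 0, 0.5), purity(phi, 0, 0.5))
```

With the threshold 0.5 only the first word belongs to the core of topic 0, so contrast is its p(t|w) and purity is its p(w|t).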

The topic model describing health care was formed as follows. Using the BigARTM library (Vorontsov et al., 2015), a topic model consisting of 200 topic groups (topics) was formed, from which the experts selected 12 topic groups related to medicine; the experts used the Perplexity, Contrast, and Purity metrics described above. Topic modeling was then performed again on these 12 topic groups, forming 150 next-level topic groups. The 47 most relevant topics were selected from these 150 topics, using a threshold above 0.05 for the membership matrix. They were used to form the final model of 100 topic groups on a sub-corpus of 100 481 documents. Each topic group contains texts with membership greater than 0.1. The final topic model with the most relevant words for each cluster is shown in Figure 4 (Yakunin et al., 2022). In this figure we have shown some of the first and last clusters separated by a red line. The full graphical version of this model is presented in Appendix A.
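The membership rule stated above (a text belongs to a topic group when its membership exceeds 0.1) can be sketched as a simple filter. The data layout, the function name and the toy values are assumptions made for illustration; the original pipeline is not shown in the article.

```python
def assign_documents(theta, threshold=0.1):
    """Group document indices by topic, keeping memberships above the threshold.

    theta is a list with one {topic: membership} dict per document.
    """
    groups = {}
    for doc_index, memberships in enumerate(theta):
        for topic, weight in memberships.items():
            if weight > threshold:
                groups.setdefault(topic, []).append(doc_index)
    return groups

# Toy memberships for two documents (made-up numbers).
theta = [{"vaccination": 0.8, "quarantine": 0.2},
         {"vaccination": 0.05, "quarantine": 0.95}]

print(assign_documents(theta))
```

A document can fall into several topic groups at once, which is consistent with the soft clustering that a topic model produces.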

Figure 4. The topic groups of texts ranked according to the degree of weight in the sub-corpus of texts devoted to health care.


The constructed topic model was used to solve the above listed problems. The results are discussed below.

4. Results and discussion

4.1. Classification of topic groups according to the degree of attractiveness for readers

In order to determine reader interest in a topic group of publications, the number of views of each news item was determined and then normalized to the range from 0 to 1. For each of the 100 final topic groups (topics), the average value of this normalized indicator (topic attractiveness, TA) was calculated. For each topic, the main words, volume (number of documents included in it), weight (calculated according to (Mukhamediev et al., 2020)), TA and headings of the articles most relevant to the topic are determined.
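The TA calculation described above can be sketched as follows, assuming min-max scaling for the normalization to [0, 1] (the text does not name the scaling method). The function names, view counts and topic labels are hypothetical.

```python
def min_max(values):
    """Scale values linearly to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def topic_attractiveness(views, doc_topics):
    """Average normalized view count over the documents of each topic."""
    normalized = min_max(views)
    sums, counts = {}, {}
    for score, topics in zip(normalized, doc_topics):
        for topic in topics:
            sums[topic] = sums.get(topic, 0.0) + score
            counts[topic] = counts.get(topic, 0) + 1
    return {topic: sums[topic] / counts[topic] for topic in sums}

views = [100, 300, 500]                                     # views per article
doc_topics = [["regions"], ["regions", "sport"], ["sport"]]  # topics per article

print(topic_attractiveness(views, doc_topics))
```

Ranking the topics by the resulting values gives an ordering of the kind used to compare reader interest across topic groups.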

The results show that TA ranges from 0.13 to 0.69. The lowest TA was recorded for the topic described by the words: ‘kazinform, interesting, mia, mia_kazinform, correspondent_mia, correspondent_mia_kazinform, transfer_correspondent_mia, transfer, kazinform_link, transfer_mia_kazinform’. Relevant article titles: ‘Kazakh athletes are not sent on business trips to the countries with coronavirus’, ‘Deputy Prime Minister of Uzbekistan dies’, ‘First case of coronavirus detected in Uzbekistan’. Evidently, during the period under consideration, readers showed little interest in sports events and foreign news in the context of health care. At the same time, a high TA was recorded for the topic group described by the words ‘region, Kazakhstan_region, Kazakhstan, Almaty, city, Atyrau, Turkestan, Shymkent, Zhambyl, Atyrau_region…’. The title of the most relevant article: ‘The number of COVID-19 cases exceed 165 thousand in Kazakhstan’. This indicator reflects the fact that during the pandemic, readers paid increased attention to news about the spread of the epidemic in the regions of Kazakhstan and the country as a whole. Topic clusters, ranked by TA, are shown in Figure 5. In this figure we have shown some of the first and last clusters separated by a red line. The full version of the figure is presented in Appendix A.

Figure 5. Topic groups of media articles on health care, ranked by topic attractiveness.


4.2. Health care information trends

Considering that the coronavirus pandemic had an active impact on the information field in the period in question (2020–2022), significant topic groups were compared with the objective indicators of the pandemic (Dong et al., 2020), such as:

  • The total number of new coronavirus tests performed daily;

  • Percentage of positive test results as average over 7 days (inverse of tests per case);

  • 7-day smoothed number of new confirmed cases of coronavirus;

  • 7-day smoothed rate of new coronavirus related deaths;

  • 7-day average number of tests performed per confirmed case of coronavirus, which is the inverse of the positivity rate;

  • Real-time estimate of the effective reproduction rate (R) of the coronavirus;

  • Stringency Index of governmental response: this indicator combines 9 response indicators such as school and workplace closures, travel bans, etc.; the value is scaled from 0 to 100 (100 is the most stringent response).

The study (Yakunin et al., 2021) assessed the pairwise correlation between the size of each topic group and the COVID-19 epidemiological indicators listed above. For each pandemic indicator, the topic group that correlates best with its dynamics was selected for the assessment.
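The selection step described above can be sketched with a plain Pearson correlation. The article does not state which correlation coefficient was used or whether its sign was taken into account, so both the coefficient and the use of the absolute value are assumptions; the series below are made-up numbers.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)

def best_topic(topic_sizes, indicator):
    """Name of the topic whose size series has the largest |correlation|."""
    return max(topic_sizes, key=lambda name: abs(pearson(topic_sizes[name], indicator)))

# Hypothetical weekly topic sizes and a smoothed new-cases indicator.
topic_sizes = {"vaccination": [1, 3, 2, 5],
               "remote_learning": [2, 4, 6, 8]}
new_cases = [10, 20, 30, 40]

print(best_topic(topic_sizes, new_cases))
```

Repeating the selection for each indicator yields one representative topic group per indicator.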

It was found that in 2020 the topic of remote learning for schoolchildren was actively raised only at the beginning of the academic year (in September), and there is no direct correlation with the smoothed number of new cases. However, in 2021–2022 there is a strong correlation with the smoothed number of new cases. It can be inferred that the public anticipated the easing of quarantine restrictions and a return to in-person education; however, a deteriorating epidemiological landscape heightened interest in these developments. The topic of fakes in the field of health care continues to be relevant, and unlike the topics of vaccination, medicines and the general epidemiological situation, its publication activity does not fade. In early 2022, the link between publication output and epidemiological indicators weakened relative to the first half of 2021. The association with search queries regarding the coverage of the economic crisis, remote work, microcredit, and similar topics has intensified. The increase in correlation with the Stringency Index is especially noticeable, particularly in connection with the dramatic changes (removal) of quarantine restrictions. At the same time, the correlation with issues related to health care, vaccinations, etc. decreased. This may indicate that the population is more concerned about the pragmatic issues associated with the quarantine restrictions than about health issues. Correlations with relative indicators, such as the reproduction rate and tests per case, increased. These indicators reflect the epidemiological situation more objectively than the absolute indicators. This is an important sign that the media began to reflect the epidemiological situation in the country more objectively at the end of 2021 compared to the initial period of the pandemic.

By the end of 2022, there were additional changes in the size of the mentioned topics (Figures 6–11).

Figure 6. Dynamics of the publication activity on the topic ‘Incidence, School, Child, Growth, Epidemiological’.


Figure 7. Dynamics of the publication activity on the topic ‘Vaccination, Vaccines, COVID’.


Figure 8. Dynamics of the publication activity on the topic ‘Remedy, Medicinal, Drug, Medicine, Medicinal_Remedy’.


Figure 9. Dynamics of the publication activity on the topic ‘Fake, False Information, Disinformation’.


Figure 10. Dynamics of the publication activity on the topic ‘Crisis, Lending, Debt, Microcredit’.


Figure 11. Dynamics of the publication activity on the topic ‘Case, Register, Coronavirus, Infection, Coronavirus_Infection’.


The graph in Figure 11 shows that interest in the COVID-19 pandemic is steadily declining. If we compare January 2021 and January 2023, interest in this topic has dropped by about an order of magnitude. The value on the ordinate axis is the ratio of the sum of the weights of documents in this topic to the sum of all weights of all topics for a given period. That is, the value can be interpreted as the share of the given topic in the information flow. It can be seen that at the peak of interest at the beginning of 2020, one topic related to the COVID-19 pandemic occupied about 10% of all information in the mass media. If we take into account all topics related to the COVID-19 pandemic, then the total share of media information related to the pandemic reached a peak of 15%. At the beginning of 2023, this figure fell to 1%. The one percent figure is significant, but quite comparable to other topics. For example, the topic of artificial intelligence, estimated in this way, is between 1 and 5 percent.
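The share plotted on the ordinate axis follows directly from the definition given above: the sum of document weights in one topic divided by the sum of all weights of all topics in the period. A minimal sketch with hypothetical topic names and weights:

```python
def topic_share(doc_weights, topic):
    """Share of one topic in the information flow of a single period.

    doc_weights is a list with one {topic: weight} dict per document
    published in the period.
    """
    topic_sum = sum(weights.get(topic, 0.0) for weights in doc_weights)
    total = sum(sum(weights.values()) for weights in doc_weights)
    return topic_sum / total if total else 0.0

# Two documents from one period (made-up weights).
period_docs = [{"covid": 0.6, "economy": 0.4},
               {"covid": 0.2, "economy": 0.8}]

print(topic_share(period_docs, "covid"))
```

Computing this value for each period in turn gives a time series of the kind shown in the dynamics figures.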

On the whole, the graphs above show that the volume of publications in the considered topic groups falls approximately from the middle of 2022. The exceptions are the topic groups described by the groups of words ‘crisis, lending, debt, microcredit’, ‘fake, false information, disinformation’.

4.3. Assessing coverage of health policy

The issue of assessing the media coverage of health care policy has become especially relevant over the past three years, also in relation to the pandemic. The media have the potential to influence health policy. For example, an integrated systematic review (Shen et al., 2021) presents an assessment of the impact of planned media interventions, including social networks, on the health policy-making process.

Using the above-described corpus of media texts and the Health Code (On Public Health and Healthcare System, n.d.), an inter-corpus topic imbalance analysis was conducted to understand the level of public involvement in policy making and the most interesting health topics covered by the media. To assess the measure of influence of the information, we propose to use, within each topic, the distribution across the corpora of the documents that belong to that topic (Figure 12).

Figure 12. Imbalance across the corpus of news and health code.


The figure shows the imbalance in the news corpus in dark blue, and the imbalance in the health code in light blue; the result of about 0.5 shows the topics of health policy that are particularly covered in the media. These include:

  • Fund, medical, tenge, service, social;

  • Fact, court, violation, article, case;

  • Disease, Senior, Reaction, Vaccination;

  • Driver, movement, fetus, week, term;

  • Body, relative, deceased, TV channel, quarantine;

  • Testing, ENT (single national testing), national, education, passing;

  • Cigarette, product, tobacco;

  • Development, develop, scientific, technology, production.

These are precisely the topic groups of media texts where the discussion of the actions of the government and its strategic documents on health care is most fully developed.

4.4. Assessment of the sentiment of media publications on health issues

The topic model made it possible to exclude personal bias from the analysis process, which increased the usefulness of the model in the task of assessing the reflection of the epidemiological situation in the media. At the same time, it was found that this approach has two main limitations: it takes into account the dynamic weight of only one topic, and it cannot take into account the opinion of experts. To eliminate these limitations, the MMA (Mass Media Assessment) method described in (Mukhamediev et al., 2020) was used. This method is based on topic modeling and expert labeling of topics by sentiment. It allows classification models to be created with a low volume of high-level manual labeling: instead of labeling each document, an expert labels topics by sentiment, in this case by mood in the range from −1 to +1. The result for each document is obtained by summing the expert labels weighted by the document’s membership in each topic. The mass media score is obtained by summing the scores of the documents related to the media source, followed by normalization. In some cases, this approach made it possible to achieve an ROC AUC of 0.93, which is comparable to modern deep learning classifiers (Yakunin et al., 2021). Using this method, the distribution of the media was formed for the queries ‘maximum negative’ and ‘maximum positive’. The results of topic modeling and sentiment analysis according to the criteria of ‘maximum negative’ and ‘maximum positive’ are shown in Figures 13 and 14.
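The scoring scheme described above can be sketched in a few lines. The document score follows the text directly (expert topic labels weighted by topic membership). The normalization of the media score is not specified in the article, so dividing by the number of documents per source is an assumption; names, labels and memberships are hypothetical.

```python
def document_score(memberships, labels):
    """Sum of expert topic labels (in [-1, +1]) weighted by topic membership."""
    return sum(weight * labels.get(topic, 0.0)
               for topic, weight in memberships.items())

def media_scores(documents, labels):
    """Per-source score: summed document scores divided by the document count.

    documents is a list of (source, {topic: membership}) pairs. Averaging is
    an assumed stand-in for the unspecified normalization step.
    """
    totals, counts = {}, {}
    for source, memberships in documents:
        totals[source] = totals.get(source, 0.0) + document_score(memberships, labels)
        counts[source] = counts.get(source, 0) + 1
    return {source: totals[source] / counts[source] for source in totals}

labels = {"outbreak": -1.0, "recovery": 0.5}  # hypothetical expert labels
documents = [("source_a", {"outbreak": 0.5, "recovery": 0.5}),
             ("source_a", {"recovery": 1.0}),
             ("source_b", {"outbreak": 1.0})]

print(media_scores(documents, labels))
```

Topics without an expert label contribute zero, so unlabeled topics are treated as neutral in this sketch.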

Figure 13. Imbalance across the corpus of news and health code.


Figure 14. Negative and positive media on the topic of health care, January 2023. The articles of a positive sentiment are shown in green, negative in red, and articles of a neutral sentiment are in yellow in the total volume of media articles.


It should be noted that the definition of sentiment used here differs from the generally accepted one. We defined sentiment not as the author’s opinion on an issue, but as the overall positive or negative significance of the described event for society.

5. Conclusion

Topic modeling allows evaluation of large groups of textual documents. In this research we solved a set of problems that evaluate the representation of health care in the media, applying topic modeling and the method described in (Mukhamediev et al., 2020). Such an information image can be evaluated in different ways. After selecting the documents related to health care, we chose 4 assessment methods.

First, we evaluated the topic groups of documents according to the degree of reader interest, which made it possible to identify the most relevant topics in the time period under consideration. In our case, a fairly predictable result was obtained: topics related to the spread of coronavirus significantly exceeded sports medicine topics in terms of interest during the pandemic.

Second, we analyzed the main information trends in articles on health care and assessed the correlation of these trends with objective indicators of the coronavirus pandemic. With certain reservations, this result can be used to assess the objectivity of the media: the higher the correlation, the more objectively the media describe information trends during the pandemic.
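As an illustration of this kind of comparison, the sketch below computes a correlation coefficient between the monthly weight of a topic and a monthly epidemiological indicator. The choice of the Pearson coefficient is an assumption made for this example, not a statement of the measure used in the study, and both series are made-up toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy monthly series (made up): share of a pandemic-related topic in
# publications, and the number of newly registered cases.
topic_weight = [0.10, 0.30, 0.50, 0.40, 0.20]
new_cases = [100, 320, 480, 410, 190]

r = pearson(topic_weight, new_cases)
print(round(r, 3))
```

A value of `r` close to 1 would indicate that publication activity on the topic follows the epidemiological indicator closely; the reservations mentioned above still apply, since correlation alone does not establish objectivity.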

Third, by comparing state government and media documents, we assessed the extent to which the state policy in the field of health care is reflected in the period under review and identified those topic groups where it is most fully represented. These groups include documents of legal, economic and educational nature.

Fourth, we assessed the sentiment of health publications in aggregate and by individual mainstream media. We identified a clear increase in the number of negative-sentiment documents during periods of increased COVID-19 incidence, as well as an increase in positive-sentiment articles over time, which is associated both with publication policy and with the gradual improvement in the quality of treatment.

The applied method, based on topic modeling, proved to be a fairly effective mechanism for assessing the publication activity of the media on the topic of health care. It allowed:

  • Ranking health-related topics according to the degree of reader interest;

  • Identifying health care information trends;

  • Identifying those topic groups in which the state policy in the field of health care is discussed;

  • Assessing the dynamics of changes in the publication activity of the mass media on health issues.

The four-sided ‘portrait’ considered here can be used as an indicator of an infodemic. A sharp increase in negative articles, together with an increase in the number of publications on the topic ‘fake, false information, disinformation’, can be considered quantifiable evidence of an infodemic.

6. Limitations of the study

Despite the merits of the applied method, it also has significant limitations associated with the use of a statistical language model:

  1. The method does not allow for the identification of complex semantic contexts, since the model operates with statistical patterns rather than with semantic patterns that can be obtained using the vector representation of words and texts;

  2. As a result, the model can evaluate only general patterns and is not able to identify subtle distinctions such as sarcasm, irony, and humor.

7. Future research

We expect that applying the embedded topic model (Dieng et al., Citation2020) in a future study will help to address the problems identified above. In addition, it seems important to explore the ‘portrait’ of health care in a relatively calm period free of pandemics.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This research was funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan, Grant No. AP09259587, ‘Developing of methods and algorithms of intelligent GIS for multi-criteria analysis of healthcare data’ and Grant No. BR21881908, Complex of urban ecological support (CUES). This investigation develops results of the project ‘Development of methods of healthcare system risk and reliability evaluation under coronavirus outbreak’ which has been supported by the Slovak Research and Development Agency under Grant no. PP COVID-20-0013. The work was partially supported by the Integrated Infrastructure Operational Program for the project: Systemic Public Research Infrastructure - Biobank for Cancer and Rare diseases, ITMS: 313011AFG5, co-financed by the European Regional Development Fund.

References

  • Aslam, F., Awan, T. M., Syed, J. H., Kashif, A., & Parveen, M. (2020). Sentiments and emotions evoked by news headlines of coronavirus disease (COVID-19) outbreak. Humanities and Social Sciences Communications, 7(1), 1–12. https://doi.org/10.1057/s41599-020-0523-3
  • Atun, R. (2015). Transitioning health systems for multimorbidity. Lancet (London, England), 386(9995), 721–722. https://doi.org/10.1016/S0140-6736(14)62254-6
  • Baba, C., M Cherecheş, R., & Mosteanu, O. (2017). The mass media influence on the impact of health policy. Transylvanian Review of Administrative Sciences, 3(19), 15–20.
  • Baldwin, R., & Weder di Mauro, B. (2020). Economics in the time of COVID-19. CEPR Press VoxEU.org eBook.
  • Barakhnin, V. B., Duisenbayeva, A. N., Kozhemyakina, O. Y., Yergaliyev, Y. N., & Muhamedyev, R. I. (2018). The automatic processing of the texts in natural language: Some bibliometric indicators of the current state of this research area. Journal of Physics: Conference Series, 1117(1), 012001.
  • Barakhnin, V., Kozhemyakina, O., Mukhamedyev, R., Borzilova, Y., & Yakunin, K. (2019). The design of the structure of the software system for processing text document corpus. Business Informatics, 13(4), 60–72. https://doi.org/10.17323/2587-814X.2019.1.60.72
  • Barysevich, A. (2019). 20 of the best social media monitoring tools to consider. Social Media Today. https://www.socialmediatoday.com/news/20-of-the-best-social-media-monitoring-tools-to-consider/545036/
  • Bauer, M. W., & Suerdem, A. (2016). Developing science culture indicators through text mining and online media monitoring. In OECD Blue Sky Forum on Science and Innovation Indicators; LSE Research (Eds.), Proceedings of the conference held in Ghent (pp. 19–21).
  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
  • Bou-Karroum, L., El-Jardali, F., Hemadi, N., Faraj, Y., Ojha, U., Shahrour, M., Darzi, A., Ali, M., Doumit, C., Langlois, E. V., Melki, J., AbouHaidar, G. H., & Akl, E. A. (2017). Using media to impact health policy-making: An integrative systematic review. Implementation Science, 12(1), 52–66. https://doi.org/10.1186/s13012-017-0581-0
  • Bushman, B. J., & Whitaker, J. L. (2012). Media influence on behavior. In V. S. Ramachandran (Ed.), Encyclopedia of human behavior (2nd ed., pp. 571–575). Elsevier Inc.
  • Casero-Ripolles, A. (2020). Impact of COVID-19 on the media system: Communicative and democratic consequences of news consumption during the outbreak. El Profesional De La Información, 29(2). e-ISSN: 1699–2407. https://doi.org/10.3145/epi.2020.mar.23
  • Dieng, A. B., Ruiz, F. J., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439–453. https://doi.org/10.1162/tacl_a_00306
  • Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real time. The Lancet. Infectious Diseases, 20(5), 533–534. https://doi.org/10.1016/S1473-3099(20)30120-1
  • Dubey, A. D. (2020). Twitter sentiment analysis during covid19 outbreak. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3572023
  • Erzurumlu, S. S., & Pachamanova, D. (2020). Topic modeling and technology forecasting for assessing the commercial viability of healthcare innovations. Technological Forecasting and Social Change, 156, 120041. https://doi.org/10.1016/j.techfore.2020.120041
  • Gao, J., Zheng, P., Jia, Y., Chen, H., Mao, Y., Chen, S., Wang, Y., Fu, H., & Dai, J. (2020). Mental health problems and social media exposure during COVID-19 outbreak. PloS One, 15(4), e0231924. https://doi.org/10.1371/journal.pone.0231924
  • Giri, S. P., & Maurya, A. K. (2021). A neglected reality of mass media during COVID-19: Effect of pandemic news on individuals’ positive and negative emotion and psychological resilience. Personality and Individual Differences, 180, 110962. https://doi.org/10.1016/j.paid.2021.110962
  • Hamidein, Z., Hatami, J., & Rezapour, T. (2020). How people emotionally respond to the news on COVID-19: An online survey. Basic and Clinical Neuroscience, 11(2), 171–178. https://doi.org/10.32598/bcn.11.covid19.809.2
  • Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2018). Latent Dirichlet Allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications, 77(24), 1–43. https://doi.org/10.1007/s11042-018-6611-0
  • Jo, W., & Chang, D. (2020). Political consequences of COVID-19 and media framing in South Korea. Frontiers in Public Health, 8, 572584. https://doi.org/10.3389/fpubh.2020.572584
  • Kirill, Y., Mihail, I. G., Sanzhar, M., Rustam, M., Olga, F., & Ravil, M. (2020). Propaganda identification using topic modelling. Procedia Computer Science, 178, 205–212. https://doi.org/10.1016/j.procs.2020.02.106
  • Krasnov, F., & Sen, A. (2019). The number of topics optimization: Clustering approach. Machine Learning and Knowledge Extraction, 1(1), 416–426. https://doi.org/10.3390/make1030031
  • Li, X., Zhou, M., Wu, J., Yuan, A., Wu, F., & Li, J. (2020). Analyzing COVID-19 on online social media: Trends, sentiments and emotions. ArXiv, abs/2005.14464.
  • Mashechkin, I., Petrovsky, M., & Tsarev, D. (2013). Methods for calculating the relevance of text fragments based on topic models in the problem of automatic annotation. Computational Methods and Programming, 14(1), 91–102.
  • Mheidly, N., & Fares, J. (2020). Leveraging media and health communication strategies to overcome the COVID-19 infodemic. Journal of Public Health Policy, 41(4), 410–420. https://doi.org/10.1057/s41271-020-00247-w
  • Miller, D. (1998). Promotional strategies and media power. In A. Briggs & P. Cobley (Eds.), The media: An introduction (pp. 65–80). Longman.
  • Mohamed Ridhwan, K., & Hargreaves, C. A. (2021). Leveraging twitter data to understand public sentiment for the COVID‐19 outbreak in Singapore. International Journal of Information Management Data Insights, 1(2), 100021. https://doi.org/10.1016/j.ijimdi.2021.100021
  • Mukhamediev, R. I., Yakunin, K., Mussabayev, R., Buldybayev, T., Kuchin, Y., Murzakhmetov, S., & Yelis, M. (2020). Classification of negative information on socially significant topics in mass media. Symmetry, 12(12), 1945. https://doi.org/10.3390/sym12121945
  • Nemes, L., & Kiss, A. (2020). Social media sentiment analysis based on Covid-19. Journal of Information and Telecommunication, 5(1), 1–15. https://doi.org/10.1080/24751839.2020.1790793
  • On Public Health and Healthcare System. (n.d.). Retrieved February 27, 2023, from https://www.adilet.zan.kz/eng/docs/K2000000360
  • Parhomenko, P. A., Grigorev, A. A., & Astrakhantsev, N. A. (2017). A survey and an experimental comparison of methods for text clustering: Application to scientific articles. Proceedings of the Institute for System Programming of the RAS, 29(2), 161–200. https://doi.org/10.15514/ISPRAS-2017-29(2)-11
  • Pichai, S. (2023, February 24). COVID-19: How we’re continuing to help. Google Blog. https://blog.google/inside-google/company-announcements/covid-19-how-were-continuing-to-help/
  • Rajendra Prasad, K., Mohammed, M., & Noorullah, R. M. (2019). Visual topic models for healthcare data clustering. Evolutionary Intelligence, 14(2), 545–562. https://doi.org/10.1007/s12065-019-00300-y
  • Sadovskaya, L., Mukhamediev, R., Kosyakov, D., & Guskov, A. (2021). Natural language text processing: a review of publications. Artificial Intelligence and Decision Making, 10, 2552–2577.
  • Shen, B., Guan, T., Ma, J., Yang, L., & Liu, Y. (2021). Social network research hotspots and trends in public health: A bibliometric and visual analysis. Public Health in Practice (Oxford, England), 2, 100155. https://doi.org/10.1016/j.puhip.2021.100155
  • Song, X., Petrak, J., Jiang, Y., Singh, I., Maynard, D., & Bontcheva, K. (2021). Classification aware neural topic model for COVID-19 disinformation categorisation. PloS One, 16(2), e0245986. https://doi.org/10.1371/journal.pone.0247086
  • Stacks, D. W., Cathy Li, Z., & Spaulding, C. (2015). Media effects. International Encyclopedia of the Social & Behavioral Sciences, 29–34.
  • Tandoc, E. C. (2018). Tell me who your sources are. Journalism Practice, 13(2), 178–190. https://doi.org/10.1080/17512786.2017.1423237
  • Tasnim, S., Hossain, M. M., & Mazumder, H. (2020). Impact of rumors and misinformation on COVID-19 in social media. Journal of Preventive Medicine and Public Health = Yebang Uihakhoe Chi, 53(3), 171–174. https://doi.org/10.3961/jpmph.20.094
  • Vayansky, I., & Kumar, S. A. P. (2020). A review of Topic modeling methods. Information Systems, 94, 101582. https://doi.org/10.1016/j.is.2019.101582
  • Vorontsov, K. V., & Potapenko, A. A. (2012). Regularization, robustness and sparsity of probabilistic topic models. Computer Research and Modeling, 4(4), 693–706. (Russian) https://www.elibrary.ru/item.asp?id=17786186; https://doi.org/10.20537/2076-7633-2012-4-4-693-706
  • Vorontsov, K., Frei, O., Apishev, M., Romov, P., & Dudarenko, M. (2015). BigARTM: Open Source Library for regularized Multimodal topic modeling of large collections. Communications in Computer and Information Science, 535, 370–381. https://doi.org/10.1007/978-3-319-26123-6_32
  • Yakunin, K., Kalimoldayev, M., Mukhamediev, R. I., Mussabayev, R., Barakhnin, V., Kuchin, Y., Murzakhmetov, S., Buldybayev, T., Ospanova, U., Yelis, M., Zhumabayev, A., Gopejenko, V., Meirambekkyzy, Z., & Abdurazakov, A. (2021). KazNews-Dataset: Single country overall digital mass media publication corpus. Data, 6(3), 31. https://doi.org/10.3390/data6030031
  • Yakunin, K., Mukhamediev, R. I., Yelis, M., Kuchin, Y., Symagulov, A., Levashenko, V., Zaitseva, E., Aubakirov, M., Yunicheva, N., Muhamedijeva, E., Gopejenko, V., & Popova, Y. (2022). Analysis of the correlation between mass-media publication activity and COVID-19 epidemiological situation in early 2022. Information, 13(9), 434. https://doi.org/10.3390/info13030434
  • Yakunin, K., Mukhamediev, R. I., Zaitseva, E., Levashenko, V., Yelis, M., Symagulov, A., Kuchin, Y., Muhamedijeva, E., Aubakirov, M., & Gopejenko, V. (2021). Mass media as a mirror of the COVID-19 pandemic. Computation, 9(12), 140. https://doi.org/10.3390/computation9120140
  • Yakunin, K., Mukhamediev, R., Kuchin, Y., Musabayev, R., Buldybayev, T., & Murzakhmetov, S. (2021). Classification of negative publication in mass media using topic modeling. Journal of Physics: Conference Series, 1727(1), 012019. https://doi.org/10.1088/1742-6596/1727/1/012019
  • Yin, H., Song, X., Yang, S., & Li, J. (2022). Sentiment analysis and topic modeling for COVID-19 vaccine discussions. World Wide Web, 25(3), 1067–1083. https://doi.org/10.1007/s11280-022-01029-y
  • Yin, H., Yang, S., & Li, J. (2020). Detecting topic and sentiment dynamics due to Covid-19 pandemic using social media. Advanced Data Mining and Applications, 25(3), 610–623. https://doi.org/10.1007/978-3-030-65390-3_46
  • Zhou, J., Zogan, H., Yang, S., Jameel, S., Xu, G., & Chen, F. (2021). Detecting community depression dynamics due to Covid-19 pandemic in Australia. IEEE Transactions on Computational Social Systems, 8(4), 982–991. https://doi.org/10.1109/tcss.2020.3047604

Appendix A

The topic groups of texts, ranked by weight in the sub-corpus of texts devoted to health care, are presented in the following tables.

The final topic model with the most relevant words for each cluster.

Topic groups of media articles on health care, ranked by topic attractiveness.