Research Article

What do you think caused your ALS? An analysis of the CDC national amyotrophic lateral sclerosis patient registry qualitative risk factor data using artificial intelligence and qualitative methodology

Received 30 Jan 2024, Accepted 21 Apr 2024, Published online: 08 May 2024

Abstract

Objective: Amyotrophic lateral sclerosis (ALS) is an incurable, progressive neurodegenerative disease with a significant health burden and poorly understood etiology. This analysis assessed the narrative responses from 3,061 participants in the Centers for Disease Control and Prevention’s National ALS Registry who answered the question, “What do you think caused your ALS?” Methods: Data analysis used qualitative methods and artificial intelligence (AI) using natural language processing (NLP), specifically, Bidirectional Encoder Representations from Transformers (BERT) to explore responses regarding participants’ perceptions of the cause of their disease. Results: Both qualitative and AI analysis methods revealed several, often aligned themes, which pointed to perceived causes including genetic, environmental, and military exposures. However, the qualitative analysis revealed detailed themes and subthemes, providing a more comprehensive understanding of participants’ perceptions. Although there were areas of alignment between AI and qualitative analysis, AI’s broader categories did not capture the nuances discovered using the more traditional, qualitative approach. The qualitative analysis also revealed that the potential causes of ALS were described within narratives that sometimes indicate self-blame and other maladaptive coping mechanisms. Conclusions: This analysis highlights the diverse range of factors that individuals with ALS consider as perceived causes for their disease. Understanding these perceptions can help clinicians to better support people living with ALS (PLWALS). The analysis highlights the benefits of using traditional qualitative methods to supplement or improve upon AI-based approaches. This rapidly evolving area of data science has the potential to remove barriers to accessing the rich narratives of people with lived experience.

Introduction

Amyotrophic lateral sclerosis (ALS) is a fatal, incurable disease that causes progressive degeneration of motor neurons. In the United States in 2018 up to 29,824 persons (9.1 per 100,000 population) were estimated to be living with ALS, with an incidence rate of 1.6 per 100,000 population (Citation1, Citation2). Most people living with ALS (PLWALS) receive a diagnosis 10–16 months after initial symptom onset and survive for 2–5 years after diagnosis (Citation3). ALS and other neurological and psychiatric disorders are attributed to environmental factors and pose a significant health burden to the U.S. population (Citation4–11).

The federal Agency for Toxic Substances and Disease Registry (ATSDR), Centers for Disease Control and Prevention (CDC), established the National ALS Registry (Registry) to better describe the epidemiological trends of ALS in the United States, identify and examine risks and potential causes, and determine the disease’s public health burden (Citation12). The Registry collects data from existing national databases and PLWALS who sign up to participate through a voluntary online portal. PLWALS who register gain access to participate in research, receive updates on clinical trials and epidemiological studies, and can donate specimens to the National ALS Biorepository at no cost.

After PLWALS complete the online registry enrollment, they can voluntarily complete up to 18 surveys related to demographics and possible risk factors for ALS; researchers can request deidentified data from these surveys for their own research. Participants are also asked two open-ended questions, “What do you think caused your ALS?” and “What do you think causes ALS in general?”, as one of the surveys. In this analysis, the researchers used traditional qualitative methods and natural language processing (NLP) technology to analyze narrative responses to the first question.

Methods

The Registry’s methods have been described previously (Citation13). Informed consent was obtained under a protocol approved by CDC’s Institutional Review Board (IRB #5768.0; expires 10/18/24).

Using the Registry’s online portal, participants completed concise surveys on various ALS risk factors and experiences. As of January 2022, participants were offered 18 survey modules within the web portal (Citation14). These questionnaires were developed and validated by Stanford University’s ALS Consortium of Epidemiologic Studies (ACES) (Citation15, Citation16). The surveys are designed to allow participants to answer the questions independent of a healthcare professional. Participation in the Registry’s online portal is entirely voluntary. Some participants sign up for the portal but never fill out any of the surveys; others complete all 18 surveys.

The dataset analyzed in this investigation represents data from 2014 until December 31, 2021 (the date the data were exported from the Registry for analysis). A total of 3,061 Registry respondents completed the survey and were included in the dataset. Participant demographic characteristics were also included in the analysis.

Qualitative analysis

Data were analyzed in an iterative process that drew upon grounded theory (Citation17). Grounded theory uses a “line by line” coding approach, in which each line of the participant narrative is captured with initial codes that evolve into focused codes. To expedite this approach, a dictionary-based named entity recognition (NER) analysis was performed, coding a “1” for every response that contained certain words or phrases. Although each line was ultimately coded by hand during the qualitative analysis, the NER analysis allowed for quick filtering and sorting of concepts within the same thematic cluster. This was especially useful for quickly classifying participant responses consisting only of phrases such as “don’t know” or “no idea,” and for flagging responses that might be related to genetics or military experiences.
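The dictionary-based flagging step described above can be sketched as follows. This is a minimal illustration only, not the study's implementation: the term lists in `TERM_DICTIONARY`, the function names, and the sample responses are all hypothetical, and the actual word lists used in the analysis are not reported.

```python
import re

# Hypothetical term dictionary (assumption): the study's actual word lists are not reported.
TERM_DICTIONARY = {
    "unknown": ["don't know", "no idea", "unknown"],
    "genetic": ["genetic", "genetics", "gene", "hereditary", "familial"],
    "military": ["military", "army", "navy", "veteran", "agent orange"],
}

def normalize(text):
    """Lowercase the text and straighten curly apostrophes so phrases match consistently."""
    return text.lower().replace("\u2019", "'")

def flag_response(text, dictionary=TERM_DICTIONARY):
    """Return {concept: 1 or 0}; 1 means at least one listed term occurs as a whole word or phrase."""
    cleaned = normalize(text)
    flags = {}
    for concept, terms in dictionary.items():
        pattern = r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
        flags[concept] = 1 if re.search(pattern, cleaned) else 0
    return flags

# Invented sample responses, for illustration only.
responses = [
    "I don\u2019t know",
    "Served in the Army; exposed to Agent Orange",
    "My mother and uncle had it, so probably genetic",
]
flags = [flag_response(response) for response in responses]
```

Flags of this kind only pre-sort responses for the analyst; as the text notes, each line was still coded by hand.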

After initial data analysis and theme development, a modified member checking exercise was performed. Member checking is a technique commonly used in qualitative research where researchers return a summary or interpretation of findings to participants and participants are asked to confirm the researcher’s interpretations. Because the principal investigator does not know the identity of the participants in this de-identified dataset, PLWALS and caregivers outside of the analysis were informally consulted on themes relevant to their specific areas of interest, including veterans with ALS and people from familial ALS communities. Participants’ own responses overwhelmingly agreed with the analysis of responses. Coding of qualitative data and analyses of participant characteristics were performed in SPSS version 27.

Artificial intelligence analysis

Artificial intelligence (AI)/NLP technology was also used to perform topic modeling on the open-ended survey responses. The AI analysis was conducted in April 2022 to generate the top 10 topics based on participants’ responses. Data cleaning, standardization, and topic modeling were performed using open-source libraries in the Python programming language (Python Software Foundation, https://www.python.org/). For data cleaning and standardization, “stop words” (e.g. a, and, the) were removed; these are common words that can be safely filtered out without altering the meaning of the text. Stop words were removed using the Python Gensim stop word library, and each word was lemmatized using NLTK’s WordNet Lemmatizer (Citation18, Citation19).
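The cleaning step can be illustrated with a small, dependency-free sketch. It does not call Gensim or NLTK: `STOP_WORDS` is a tiny hypothetical stand-in for Gensim's stop word list, and `LEMMA_LOOKUP` is a hypothetical lookup table standing in for NLTK's WordNet Lemmatizer, so the output only approximates what those libraries would produce.

```python
import re

# Tiny illustrative stop word set (assumption); Gensim's list is much larger.
STOP_WORDS = {"a", "an", "and", "the", "i", "in", "to", "of", "was", "my", "for"}

# Hypothetical lemma lookup (assumption) standing in for a WordNet-based lemmatizer.
LEMMA_LOOKUP = {
    "chemicals": "chemical",
    "pesticides": "pesticide",
    "years": "year",
    "injuries": "injury",
}

def clean_response(text):
    """Lowercase and tokenize, drop stop words, then map each token to its lemma if one is listed."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [LEMMA_LOOKUP.get(token, token) for token in kept]

# Invented sample response, for illustration only.
cleaned = clean_response("I was exposed to chemicals and pesticides for years")
```

Tokens without an entry in the lookup table pass through unchanged, which is why "exposed" is not reduced to "expose" here; a full lemmatizer would typically handle such cases when given part-of-speech information.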

The cleaned and tokenized text responses were analyzed with Bidirectional Encoder Representations from Transformers (BERT), a machine learning approach for NLP. This was accomplished using the BERTopic library, leveraging the BERT-based all-MiniLM-L6-v2 sentence-transformers model (Citation20, Citation21). Unigrams, bigrams, and trigrams were all included in the topic modeling process. This analysis resulted in a list of the 10 most prominent topical clusters, based on the vocabulary and syntax of the free-text responses (Table 2).
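To make the unigram, bigram, and trigram setting concrete, the sketch below enumerates the candidate terms that such a range yields for one cleaned response. It is a plain-Python illustration rather than the BERTopic pipeline itself; the function name `ngrams` and the sample tokens are assumptions introduced here.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return every contiguous n-gram, for n from n_min to n_max, joined with single spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Invented sample tokens, for illustration only.
terms = ngrams(["agent", "orange", "exposure"])
```

Including bigrams and trigrams lets a multi-word concept such as "agent orange" surface as a single keyword for a cluster instead of being split into two unrelated words.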

Table 2. Topics identified by artificial intelligence for responses to question, ‘What do you think caused your ALS?’

Each of the ten clusters was then labeled based on the predominant themes and keywords used in that cluster. The AI model did not assign a specific label; it simply identified the cluster of responses. The themes were assigned manually based on the keywords and topical themes of the identified cluster. The themes remained broad to eliminate overlapping topics. If similar topics were still in the top ten, they remained as separate topics.

To minimize bias, the qualitative analyst (DB) was blinded to the results of the AI analysis (performed by JR and EK) until after the qualitative analysis was complete. A subgroup analysis of the items that were “ungrouped,” or not assigned to one of the top ten clusters, was performed by DB, resulting in all “ungrouped” AI responses mapping to one or more qualitative themes or subthemes (Table 3).

Table 3. Comparison of top artificial intelligence (AI) topics and qualitative analysis themes, with sample responses from persons with ALS.

Results

Demographic characteristics

Participant characteristics recorded on closed-ended question responses included diagnosis year, age at diagnosis, gender, marital status, race, census region, military history, smoking history, alcohol history, family history of ALS, and family history of neurological diseases (Table 1). The distribution of diagnosis years ranged from before 2011 to 2021. Most participants were diagnosed between 2013 and 2019, with the highest proportion occurring in 2014 (457 participants; 14.9%). Most of the participants were male (1,790; 58.5%), married (2,418; 79.2%), and White (2,940; 96.0%). The most common census region was Region 3, the southern United States (1,118 participants; 37.0%). Most participants were aged 60–69 years (1,175; 38.4%) or 50–59 years (873; 28.5%) at diagnosis.

Table 1. National amyotrophic lateral sclerosis (ALS) registry participant characteristics (N = 3,061).

The Registry also collects data on participants’ occupational and lifestyle characteristics. In this analysis, 583 participants (19.1%) had a military history, 1,311 participants (42.8%) had ever smoked, and 2,467 participants (80.6%) had ever consumed alcohol. Most participants did not have a family history of ALS (2,862; 93.5%) or a family history of neurological diseases (2,484; 81.5%) (Table 1).

AI generated cluster analysis

Using the AI classification described above, most participant responses (60%) did not fit into a specific AI cluster and were classified as “ungrouped” (Table 2). Among the identifiable topics, the most common response was classified as “unknown” (16.3%). The next most common topics were family history of other neurological diseases (4.5%) and heredity (4.4%) (Table 2).
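Shares of this kind can be tabulated directly from per-response topic assignments. The sketch below assumes that ungrouped responses carry the label -1, which is BERTopic's convention for outliers; the assignment list is invented for illustration and is not the study's data.

```python
from collections import Counter

UNGROUPED = -1  # assumption: outlier label, following BERTopic's convention

def topic_shares(assignments):
    """Return {topic_id: percent of responses}, rounded to one decimal place."""
    counts = Counter(assignments)
    total = len(assignments)
    return {topic: round(100 * n / total, 1) for topic, n in counts.most_common()}

# Invented assignments for ten responses, for illustration only.
assignments = [UNGROUPED] * 6 + [0] * 2 + [1, 2]
shares = topic_shares(assignments)
```

With these invented assignments, six of ten responses are ungrouped, mirroring the roughly 60% ungrouped share reported above.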

AI and manual thematic analysis of perceived cause of ALS

Table 3 compares the major themes emerging from the AI and traditional qualitative analyses of the perceived cause of ALS, with a few representative participant quotations. The major difference was that the manual thematic analysis grouped responses into themes and subthemes, which the AI/NLP analysis did not. This allowed for all the “ungrouped” responses identified by AI/NLP to be labeled in the manual thematic analysis process.

The AI clusters of responses around perceived cause aligned with the qualitative themes and subthemes in several areas with varying degrees of precision (Table 3). For example, AI identified a cluster of responses called “Chemical exposure,” which corresponds to the “Environment/Exposure to chemicals/Pesticides” theme in the qualitative analysis. The AI model identified a “Diet/Exercise” cluster as a potential perceived “cause” of ALS, which corresponds to subthemes within the “Lifestyle” theme in the qualitative analysis. The subthemes of drugs/alcohol and exercise/sports/athletics/heavy physical labor in the qualitative analysis are also aligned with the AI clusters. The AI approach identified a cluster of responses as “Family history of other neurological disease,” which had a nearly identical corresponding theme in the qualitative analysis.

The AI model also identified a cluster of “Head Trauma” responses as a potential perceived “cause” of ALS, whereas the corresponding qualitative analysis theme was “Accident/Injury” with “Head injury” as a subtheme. The AI cluster and qualitative analysis both identified “Genetic(s)” as a potential perceived “cause” of ALS; the qualitative analysis included subthemes to differentiate general versus specific genetic etiological concepts, which were not specifically identified by the AI clustering method. AI modeling and qualitative analysis both identified “Military” as a perceived “cause” of ALS. The subthemes of medications and immunizations, occupational exposures (including burn pit and nuclear radiation exposure), and Agent Orange exposure identified in the qualitative analysis are more specific than the AI clustering method’s broader category of “Military.”

Discussion

In this analysis, we analyzed the free text responses of PLWALS to gain a comprehensive understanding of their life experiences and uncover potential factors that might have contributed to the development of the disease. The responses provide valuable insights into the participants’ beliefs about the cause of their ALS and highlight the importance of considering participants’ perspectives in ALS research.

One of the key findings from the patient responses was the diversity of opinions regarding the perceived potential causes of ALS. This variation in beliefs underscores the complex and heterogeneous nature of ALS. The narrative responses examined here provide unprecedented detail on specific incidents regarding previously identified exposures, such as military service, environmental exposures, and occupational exposures. These details might provide important, hypothesis-generating data that could not be gleaned from a multiple-choice survey question.

The responses from PLWALS revealed the emotional and psychological impact of living with a poorly understood and incurable condition. Many participants expressed sentiments related to frustration, confusion, helplessness, and maladaptive coping, including self-blame. One PLWALS stated, “I am also very frustrated that after at least thirty years, there is still not treatment or cure for this horrible disease.” A caregiver, responding on behalf of a PLWALS, stated, “He was a brilliant and smart man, and now with his dementia he can do nothing. the atrophy in his hands and arms is markedly disfiguring and they are useless to him. It is a very horrible and frustrating disease.” Self-blame was most frequently found in the “Lifestyle factors” theme, which covers diet, exercise, illicit drugs, tobacco use, and alcohol. Maladaptive coping is known to be associated with poorer patient-reported outcomes in other diseases (Citation22). The concerning sentiments shared by Registry respondents might contribute to higher levels of depression and anxiety, which are related to poor health-related quality of life among people with ALS (Citation23).

The themes developed and sentiments discovered in this analysis present opportunities to translate results into improved support for PLWALS and their families. Clinicians and therapists can use this information to create tailored, empathetic approaches to working with PLWALS. Understanding the importance of these lifestyle factors in patients’ minds can enable clinicians and therapists to develop supportive interventions that promote better coping strategies, providing evidence-based information that helps to alleviate feelings of guilt and reinforce the complex and multifactorial nature of the disease. By acknowledging and addressing patients’ beliefs and fears about the causes of their ALS, healthcare professionals can strengthen the patient-provider relationship and contribute to improved mental well-being and overall quality of life for those living with this devastating disease.

Limitations

This analysis has several limitations. Because participation is voluntary, some registered participants do not complete the survey on which this analysis is based. Participants with internet access are presumably more likely to participate; this might skew toward a younger sample. The proportion of participants who were younger at diagnosis (ages 40–49 years) is overrepresented in this sample (10.6%) compared with the national prevalence in the Registry of 8.3% (Citation23). The oldest age group, 80 years and older, is underrepresented in this analysis (2.4%) compared with data in the National ALS Registry (8%). At 4.0%, non-White race is underrepresented in this sample compared with 11.9% in the Registry as a whole (Citation23). Potential reasons for these discrepancies include barriers to accessing the technology needed for self-registration; lack of awareness of the Registry, which could be due to lower utilization of ALS specialty clinics; and reduced participation by residents of the Western United States, a region with a substantial non-White population (Citation2).

Many free-text responses were labeled as “ungrouped” during topic modeling, indicating that the model was not able to assign those responses exclusively to any of the identified topics. This could happen for several reasons. The generic sentence-transformers model was not fine-tuned, and as such, it might have had difficulty generalizing to this specific use case and assigning the ALS-related responses to a specific topic. This could be due to irrelevant language in the free-text responses that confuses the generically trained model. Novel tokens or meaningless (“noisy”) data might result in specific responses being labeled as “ungrouped” despite the presence of otherwise categorizable information.

For example, the BERT model might have difficulty assigning the comment “Had shingles twice. Also spent 20+ years around cat litter.” to a specific topic. This analytic challenge could be due to several factors, including ambiguity, limited context, or ill-formed sentence structure. Ambiguity plays a role in ungrouping because the comment contains two distinct statements that might not be directly related to each other. The BERT model will probably have difficulty determining a common context for the two statements, which are disparate and seemingly unrelated to ALS. As a result, the model will struggle to assign this response to a specific topic cluster. The limited context of this response also might pose a challenge. The comment is very concise and contains no mention of a relationship to ALS. Transformers-based models such as BERT rely on language context and surrounding information to generate embeddings for a given group of tokens. Because of that, this response might not contain enough textual context to be labeled accurately within a specific topic.

In contrast, during qualitative analysis, in the example above, “Had shingles twice” was assigned to the theme “Personal Medical History” and “Also spent 20+ years around cat litter” was assigned to the theme “Environment/exposure to chemicals and pesticides.”

Tokenization and sentence structure also might play a role in topic ambiguity. Using a language model requires breaking statements into tokens (sub-words, words, or multi-word chunks) and analyzing their relationships. A comment’s structure, with two separate statements syntactically joined by “also,” might affect the model’s ability to understand the intended meaning. If the tokenization and sentence structure do not align well with the patterns that the BERT-based model has learned, it might not group the comment correctly.

Topic modeling performance might be improved by fine-tuning the underlying model on a specific dataset or task, such as a curated dataset of comments related to ALS. This might help BERT, or another transformers-based model, to better understand the context and associations in the data and improve its ability to group comments such as the one mentioned. Topic modeling also might be improved by using the embeddings of large language models (LLMs), which have been released since this study and have seen significant gains in generalizing to unseen text. Although the fine-tuning process was outside the scope of this analysis, and many of the high-performing LLMs were not available at the time of this analysis, they represent promising avenues for future research. For this analysis, given the large number of ungrouped responses, it was important to assess those data points separately to determine whether they represent valuable information that was missed by the model or by the data pre-processing steps.

Conclusion

This is an analysis of a large qualitative dataset with AI and traditional qualitative approaches. Although the qualitative approach resulted in a more comprehensive theme and subtheme development, AI provides a reinforcing check for the traditional qualitative analysis. AI also identifies specific topics that might be important but not given the same prioritization in traditional thematic analysis. By combining AI and traditional qualitative analytical techniques, researchers leave no stone unturned in the quest for the most accurate and actionable characterization of the data.

The narrative responses housed in the National ALS Registry represent, to our knowledge, the largest collection of lived experience data for people with ALS in the world and provide insight into the participants’ theories about the perceived causes of their ALS. Engaging with people with ALS on the perceived causes of their disease offers numerous benefits to researchers and the ALS patient community. These benefits include gaining a deeper understanding of the complex factors contributing to ALS onset and progression, fostering a more empathetic and supportive clinical and research environment, and leveraging patient insights to drive research. It is important for scientists to tap into the opportunities presented by this rich dataset. This analysis also demonstrates that it is possible to successfully analyze large narrative datasets with open-source technology such as Python. This analysis highlights the value of having a partnership between AI and human analysis. Insights gleaned from our experimentation with different approaches to the analysis of these large, unstructured data will be helpful to scientists in all disease areas. Future studies should continue to prioritize patient engagement and incorporate their perspectives into the research process to advance our understanding of ALS and improve patient care.

Disclaimer

The findings and conclusions in this report are those of the author(s) and do not necessarily represent the official position of the Centers for Disease Control and Prevention/Agency for Toxic Substances and Disease Registry.

Acknowledgements

We would like to acknowledge the participants who shared their valuable insights that made this analysis possible; the I AM ALS Veteran’s Affairs Team and Ms. Connie Becker for modified member checking. We would also like to thank Diane Walters for her copyediting assistance.

Data availability statement

The data are available via CDC - Amyotrophic Lateral Sclerosis: Research Application Form.

Declaration of interest statement

The authors report there are no competing interests to declare.

References

  • Mehta P, Raymond J, Punjani R, Larson T, Han M, Bove F, et al. Incidence of amyotrophic lateral sclerosis in the United States, 2014–2016. Amyotroph Lateral Scler Frontotemporal Degener. 2022;23:378–82.
  • Mehta P, Raymond J, Zhang Y, Punjani R, Han M, Larson T, et al. Prevalence of amyotrophic lateral sclerosis in the United States, 2018. Amyotroph Lateral Scler Frontotemporal Degener. 2023;24:1–7.
  • Richards D, Morren JA, Pioro EP. Time to diagnosis and factors affecting diagnostic delay in amyotrophic lateral sclerosis. J Neurol Sci. 2020;417:117054.
  • Andrew AS, Bradley WG, Peipert D, Butt T, Amoako K, Pioro EP, et al. Risk factors for amyotrophic lateral sclerosis: a regional United States case-control study. Muscle Nerve. 2021;63:52–9.
  • Goutman SA, Boss J, Patterson A, Mukherjee B, Batterman S, Feldman EL. High plasma concentrations of organic pollutants negatively impact survival in amyotrophic lateral sclerosis. J Neurol Neurosurg Psychiatry. 2019;90:907–12.
  • Cox PA, Kostrzewa RM, Guillemin GJ. BMAA and neurodegenerative illness. Neurotox Res. 2018;33:178–83.
  • Bozzoni V, Pansarasa O, Diamanti L, Nosari G, Cereda C, Ceroni M. Amyotrophic lateral sclerosis and environmental factors. Funct Neurol. 2016;31:7–19.
  • Weisskopf MG, Cudkowicz ME, Johnson N. Military service and amyotrophic lateral sclerosis in a population-based cohort. Epidemiology. 2015;26:831–8.
  • Capozzella A, Sacco C, Chighine A, Loreti B, Scala B, Casale T, et al. Work related etiology of amyotrophic lateral sclerosis (ALS): a meta-analysis. Ann Ig. 2014;26:456–72.
  • Yu Y, Su FC, Callaghan BC, Goutman SA, Batterman SA, Feldman EL. Environmental risk factors and amyotrophic lateral sclerosis (ALS): A case-control study of ALS in Michigan. PLoS One. 2014;9:e101186.
  • Cox PA, Richer R, Metcalf JS, Banack SA, Codd GA, Bradley WG. Cyanobacteria and BMAA exposure from desert dust: a possible link to sporadic ALS among Gulf War veterans. Amyotroph Lateral Scler. 2009;10 Suppl 2:109–17.
  • US Public Health Service. ALS registry act. Washington, DC: 110th Congress; 2008.
  • Antao VC, Horton DK. The national amyotrophic lateral sclerosis (ALS) registry. J Environ Health. 2012;75:28–30.
  • Cedarbaum JM, Stambler N, Malta E, Fuller C, Hilt D, Thurmond B, et al.; BDNF ALS Study Group (Phase III). The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function. J Neurol Sci. 1999;169:13–21.
  • Horton DK, Mehta P, Antao VC. Quantifying a nonnotifiable disease in the United States: the national amyotrophic lateral sclerosis registry model. JAMA. 2014;312:1097–8.
  • US Department of Health and Human Services. HHS regional offices. Washington, DC: US Department of Health and Human Services; 2017. https://www.hhs.gov/about/agencies/iea/regional-offices/index.html
  • Charmaz K. Constructing grounded theory: a practical guide through qualitative analysis. Thousand Oaks (CA): Sage Publications; 2006.
  • Gensim. parsing.preprocessing—Functions to preprocess raw text. 2022. [accessed 2023 May 26]. https://radimrehurek.com/gensim/parsing/preprocessing.html.
  • NLTK Project. Documentation: Source code for nltk.stem.wordnet. 2023. [accessed 2023 May 27]. https://www.nltk.org/_modules/nltk/stem/wordnet.html.
  • Grootendorst M. BERTopic. Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint. arXiv:2203.05794. 2022. https://maartengr.github.io/BERTopic/index.html#citation.
  • Reimers N. Pretrained models. SBERT.net. 2022. https://www.sbert.net/docs/pretrained_models.html.
  • Chao CY, Lemieux C, Restellini S, Afif W, Bitton A, Lakatos PL, et al. Maladaptive coping, low self-efficacy and disease activity are associated with poorer patient-reported outcomes in inflammatory bowel disease. Saudi J Gastroenterol. 2019;25:159–66.
  • van Groenestijn AC, Kruitwagen-van Reenen ET, Visser-Meily JM, van den Berg LH, Schröder CD. Associations between psychological factors and HRQoL and global quality of life in patients with ALS: a systematic review. Health Qual Life Outcomes. 2016;14:107.