Research Article

Sourcing public policy: organisation publishing in Wikipedia

Received 23 Apr 2023, Accepted 12 Apr 2024, Published online: 20 May 2024

ABSTRACT

Organisations across multiple sectors are prolific publishers in a range of genres including research reports, policy briefs, fact sheets, datasets and much more. Sometimes referred to as grey literature, these publications play a critical role in the circulation of research and ideas on public policy and public interest issues, yet they are often overlooked as part of the research publishing system, including on Wikipedia and Wikimedia. One of the cornerstones of Wikipedia is its reliance on citations from reliable sources; however, little is known about the way in which organisation-produced and disseminated publications are understood and used as sources on Wikimedia platforms. This article reviews the literature and analyses available data on the sources used on English Wikipedia and what they can tell us about the extent to which policy reports from organisations are cited and the issues that arise in evaluating their reliability. We then provide a case study of a project underway by the Analysis & Policy Observatory (APO), a digital library of policy reports based in Australia, which aims to improve the presence of high-quality policy and research material and coverage of policy issues on Wikipedia.

Introduction

Digital technologies and the internet have radically reduced the cost and complexity of production, dissemination, discovery and access to research publications resulting in disruptions to traditional publication models and new opportunities for both formal and informal publishing practices to expand. Academic journals and books from commercial and scholarly publishers, as well as news media, have been transformed by the digital media economy, and these same technologies have also enabled a massive increase in the number and diversity of organisations engaged in directly producing, publishing and disseminating research and information in the public sphere (Lawrence, Citation2018; Wellstead & Howlett, Citation2022; Williams & Lewis, Citation2021). Governments, civil society organisations and research institutes have embraced the self-publishing capabilities provided by digital publishing technologies such as desktop publishing software, PDFs, websites, email newsletters, social media and other tools that provide the means for low-cost in-house production, publication and distribution of their own content.

Media economies, institutions, artefacts and practices structure the way knowledge circulates in society and are structured in turn by social, economic and technical systems, institutions and practices (Lievrouw, Citation2013; Lievrouw & Loader, Citation2020). Recognising organisation publishing as a media practice, economy and artefact provides a way of foregrounding the material affordances of various formal and informal publishing practices, and the formats, genres, production and distribution systems in use, particularly for public policy and practice. Public policy is a complex, dynamic, multisector and multicentric environment that relies on a diverse evidence ecosystem (Cairney et al., Citation2019; Davies et al., Citation2019) and online organisation-based publishing has become increasingly important for public policy and public interest issues (Lawrence, Citation2022). Organisations, from the Intergovernmental Panel on Climate Change (IPCC) to local interest and advocacy groups, publish material to inform and influence their constituency and wider public debate. Organisation publishing (also known as grey literature) is highly diverse, involving a wide range of genres beyond academic journals and books including research and technical reports, conference papers, discussion papers, working papers, preprints, evaluations, briefings, reviews, case studies, factsheets, statistics, datasets and more. While some material is still produced in print, most publications are published online and free to access and download via organisation websites. This provides a high degree of flexibility, allowing organisations to produce content in ways that are more timely, targeted and accessible than academic journals and books (Lawrence, Citation2018).
Organisation publications also provide diverse perspectives including from community groups, government agencies, interest groups, commercial companies, professional associations and think tanks which may not be available through formal scholarly or commercial publishing.

Organisation-based publishing operates on a spectrum from informal to formal media production which means it is highly variable in terms of content quality, bibliographic standards and professional publishing practices. On one end of the spectrum are organisations that conduct rigorous research with extensive review processes and publish using professional production standards while at the other end of the spectrum are reports of variable quality and poor production or bibliographic standards. Many reports and papers from organisations lack Digital Object Identifiers (DOIs) or adequate metadata and bibliographic information and their diverse, disaggregated and dispersed nature makes evaluation and tracking of these sources extremely challenging (Sedgwick & Ross, Citation2020). As a result, they are often overlooked as part of scholarly communication and the research publishing system, and poorly managed within library, publishing and information management systems. Lack of long-term management by either publishers or collecting services has led to much content disappearing from online access and large-scale link rot.

While there have been some studies on the role of organisation publishing or grey literature as a source of information in various areas such as evidence-based policy, systematic reviews, health and environmental studies (see for example Lawrence, Citation2022; Macdonald et al., Citation2016; Mahood et al., Citation2014), its role as a reference source in Wikipedia has received little attention. In considering the role of organisation publications as a source for Wikipedia, in this article we explore: What types of research publications and sources are used and valued in Wikipedia? What guidance does Wikipedia provide on verifying and citing material published by organisations? What can be done to support the verification and citation of policy reports and papers from organisations on Wikipedia? We consider the current guidelines for reliable sources on Wikipedia, and to what extent they provide advice and guidance on citing reports and papers from organisations. We then examine the literature on Wikipedia references and what we know about the kinds of sources being used. Finally, we provide a case study looking at the challenges in citing material published by organisations as reliable sources for Wikipedia and work being done to overcome these through a Wikimedia Foundation (WMF) Alliance Fund project, The missing link: Incorporating policy reports into the free knowledge ecosystem. Undertaken by the Analysis & Policy Observatory (APO), the project aimed to integrate quality research and policy reports and content from Australia and New Zealand into Wikidata and English Wikipedia. We then discuss the implications of this work and provide recommendations for improving both the way in which Wikipedia cites reports and also how Wikimedia projects may provide an example and guidelines for the wider knowledge ecosystem in using organisation publications.

Wikipedia and the challenge of diverse research publishing practices

Wikipedia (WP) and the wider Wikimedia (WM) ecosystem of projects are part of, and beneficiaries of, the transformation from print to digital publishing and the related move towards open scholarship and “social knowledge creation” (Arbuckle et al., Citation2022). Indeed, WP is a prime example of the way in which digital technologies have enabled the production and distribution of knowledge in radically new ways. Despite this and the diversity of sources and genres involved in research communication discussed earlier, there continues to be a very narrow view of what constitutes research publishing and large parts of this knowledge ecosystem such as organisation publishing, or grey literature, are regularly overlooked, including on platforms such as WP. As digital publishing continues to evolve we must pay close attention to the impacts of infrastructure for open knowledge and the political economy of the way knowledge is produced and distributed (Arbuckle et al., Citation2022). Given the importance of WP in the open knowledge ecosystem and the flow of information across the internet it is essential that we understand and improve the way sources are used and understood on WP and other WM platforms.

One of the cornerstones of WP is its reliance on citations from reliable sources; however, little is known about the nature and extent of citations of organisation research and policy publications on WP and other WM platforms such as Wikidata. These platforms are not only significant in themselves, with WP being among the top ten most visited websites in the world for the last decade, but they also play a critical role in the wider information flows across search engines, information boxes and more recently large language models (Ford & Graham, Citation2016). As Crompton et al. (Citation2020) note, contributions to WP and Wikidata

not only help the Wikimedia suite of projects (Wikidata’s data populates infoboxes across the WP in all languages), but also shapes projects that draw on Wikidata data, from small ones, like Linked Familiarity, to large ones like the Google Knowledge Graph, which shapes search results, and through those results, what people can know by using Google Search.

At the same time WP and other WM projects also have to continually monitor and defend the site from mis- and disinformation, vandalism, pranks, bias, omissions, inaccuracies and other dangers inherent in such an open project. These issues have become more prominent as our life online has expanded. This leaves the community and WP editors facing considerable pressure in terms of verifying and using organisation publications—an issue replicated across the wider scholarly communication system. A decade ago Naomi Oreskes warned us of the dangers presented by the “merchants of doubt” (Oreskes & Conway, Citation2010) and this is particularly the case for many issues requiring a public policy intervention, such as environmental policy (Oppenheimer et al., Citation2019). There is no doubt that vested interests use publishing and communication systems to advocate and influence, and organisation-based publishing therefore requires a high level of critical review and evaluation. At the same time, as we have discussed, a huge amount of extremely valuable material is produced by organisations; we therefore cannot afford to simply ignore organisation publishing, as it is an essential source of knowledge on many public interest issues. It is therefore critical that WP provides clear guidance on how to use the diverse materials being published online, including by organisations (Bruckman, Citation2022). As the Wikimedia Foundation white paper on knowledge integrity notes:

Technology platforms across the web are looking at Wikipedia as the neutral arbiter of information, but as Wikimedia aspires to extend its scope and scale, the possibility that parties with special interests will manipulate content, or bias to go undetected, becomes material. (Zia et al., Citation2019)

As an encyclopaedia, WP aims to summarise and describe established ideas and facts about the world which can be verified through external sources. WM projects have benefitted from the increasingly open and online access to a wide variety of sources, yet they also struggle to come to terms with what counts as a reliable source in this dynamic and changing space. WM editors now find themselves caught between guidelines that emphasise using traditional sources which may be overly narrow and not representative of the diverse genres, formats, viewpoints, locations and communities they seek to represent, and the very real dangers of poor-quality research, vested interests and mis- and disinformation that have also proliferated online. Given the importance of WM as a public knowledge platform, this is a situation that must be addressed across many domains of knowledge and particularly for the public interest and public policy issues for which an encyclopaedia is essential. As Baltz warns:

For almost as long as Wikipedia has existed, critics have argued that these biases shape its pages’ contents, limiting and slanting coverage that is now viewed nearly 10 billion times each month. Groups that are underrepresented in academia tend to be missing at an even higher rate on Wikipedia. And there is growing evidence that Wikipedia articles have tangible effects, including the power to influence the contents of scientific papers. Wikipedia does not just passively reflect biases. It amplifies and reinforces them. (Baltz, Citation2021)

Wikipedia’s reliable sources policies and editing choices also flow through to other platforms such as Wikidata. Wikidata is a sister project to WP which contains structured data about entities. It consists of items that have a label, a description and any number of properties and values, which are linked in statements that closely resemble a semantic triple (Cantallops et al., Citation2019). Although initially intended to provide structured data for facts and info boxes on WP, Wikidata also includes a huge number of bibliographic entities. While Wikidata has expanded rapidly over the last decade, Crompton (Citation2020) points out that this growth is uneven:

Wikidata’s coverage is uneven and biased in favour of entities that are of interest to editors or entity types for which there are automatic ingestion tools. Wikidata … is also biased in favour of the type of content that is already in English Wikipedia, which itself is skewed towards the typical or traditional interests … 
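To make the statement model described above concrete, the following minimal sketch flattens one Wikidata statement into a semantic triple. The identifiers and labels (Q42 for Douglas Adams, P69 for “educated at”, Q691283 for St John's College) follow Wikidata's well-known introductory example, but the dictionary layout is our own illustration, not Wikidata's API format:

```python
# A Wikidata statement approximated as a subject-predicate-object triple.
# Items carry a label and description; properties link items to values.
statement = {
    "item": {"id": "Q42", "label": "Douglas Adams",
             "description": "English science fiction writer and humorist"},
    "property": {"id": "P69", "label": "educated at"},
    "value": {"id": "Q691283", "label": "St John's College"},
}

# The same statement flattened into the triple form the text describes.
triple = (
    statement["item"]["id"],
    statement["property"]["id"],
    statement["value"]["id"],
)
```

Bibliographic entities in Wikidata use exactly this structure, which is what makes bulk ingestion of report metadata (discussed in the case study below) possible.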

Wikipedia content policies and guidelines

A key part of Wikimedia’s defence system against misinformation is its policies and guidelines, which guide the type of content that can be included and the selection of sources and references to support that content. While these protect WP from unreliable information, they also create a culture that preferences certain types of sources over others.

Although WP likes to say there are no rules, there are numerous policies and guidelines developed by the community to describe best practices, clarify principles and resolve conflicts. Policies “have wide acceptance among editors and describe standards all users should normally follow” (WP, Citation2023f: Policies and Guidelines). The three core content policies are:

  1. Neutral point of view (WP, Citation2023d: Neutral point of view)—All Wikipedia articles and other encyclopedic content must be written from a neutral point of view, representing significant views fairly, proportionately and without bias.

  2. Verifiability (WP, Citation2023k: Verifiability)—Material challenged or likely to be challenged, and all quotations, must be attributed to a reliable, published source. In Wikipedia, verifiability means that people reading and editing the encyclopedia can check that information comes from a reliable source.

  3. No original research (WP, Citation2023e: No original research)—Wikipedia does not publish original thought: all material in Wikipedia must be attributable to a reliable, published source. Articles may not contain any new analysis or synthesis of published material that serves to advance a position not clearly advanced by the sources. (WP: Core content policies)

A guideline “is a set of best practices that are supported by the consensus of Wikipedia editors. Editors should attempt to follow guidelines, though they are best treated with common sense” (WP, Citation2023f: Policies and Guidelines). Finding one’s way around WP’s guidelines is a challenge in itself. Within the Citing sources page, the key instruction is to cite “reliable sources”, which links to further instructions. There are also specialist guideline pages for reliable sources in science and medicine, but no specific information for humanities or social sciences research. According to most of these guidelines, reliability may be judged on the specific work itself (the article, book), the creator of the work (the writer, journalist), or the publisher of the work. Yet overall, WP’s general guidelines for the most reliable sources are grounded in traditional notions of print culture, based around academic and scholarly publishing and mainstream news media:

  • peer-reviewed journals

  • books published by university presses

  • university-level textbooks

  • magazines, journals and books published by respected publishing houses

  • mainstream newspapers.

The key recommendation is that, “When available, academic and peer-reviewed publications, scholarly monographs, and textbooks are usually the most reliable sources” (WP, Citation2023g: Reliable sources).

The guidelines do include a range of formats such as printed or digital text, audio, video and multimedia materials that have been distributed, broadcast or archived, and the definition of published is suitably broad, including “any source that was made available to the public in some form”. Yet while there is much sensible advice on these pages, the general guidelines on reliable sources and citations have little to say about reports and papers produced by organisations. There are various warnings against the dangers of using self-published material but no explanation of whether this might apply to reputable institutions such as the World Health Organisation or any other organisations, nor any mention of grey literature in the main guidelines. The advice in the self-publishing section is to try the reliable sources noticeboard (WP, Citation2023h: Reliable Sources/Noticeboard) if you need to find out what the WP community thinks of a source. The noticeboard provides a space for discussion and debate of specific sources and genres, while the WP, Citation2023i: Reliable sources/Perennial sources page provides a summary list of titles where consensus on their reliability has been reached. At the time of writing, only 514 sources were listed, most of them news media and only a handful organisations, so this list is of limited use given that thousands of organisations publish policy research. Many other sources are discussed on the noticeboard, but it is not easy to navigate. We discuss engagement with the reliable sources noticeboard further in the case study below.

Within the numerous pages of citation policies and guidelines, it is only in the guidelines on reliable sources for medicine and science that we get information on the importance of publications from expert bodies and organisations. The WP: Identifying reliable sources (medicine) guidelines state that:

Statements and information from reputable major medical and scientific bodies may be valuable encyclopedic sources. These bodies include the U.S. National Academies (including the National Academy of Medicine and the National Academy of Sciences), the British National Health Service, the U.S. National Institutes of Health and Centers for Disease Control and Prevention and the World Health Organisation. The reliability of these sources ranges from formal scientific reports, which can be the equal of the best reviews published in medical journals, through public guides and service announcements, which have the advantage of being freely readable, but are generally less authoritative than the underlying medical literature. (WP, Citation2023b: Identifying reliable sources (medicine))

Similarly, the WP: Identifying reliable sources (science) guidelines state:

Ideal sources for these articles include comprehensive reviews in independent, reliable published sources, such as reputable scientific journals, statements and reports from reputable expert bodies, widely recognized standard textbooks and handbooks written by experts in a field, expert-curated databases and reference material, or high-quality non-specialist publications. (WP, Citation2023c: Identifying reliable sources (science), emphasis added)

The WP: Identifying reliable sources (science), guidelines also include explicit discussion of “white and grey literature”:

Government agencies and non-governmental organisations often produce reports that are internally vetted and reviewed. When using such a report as a source, consider the purpose of the organisation, its reputation in the desired context, and the reception of the specific report. (WP: Identifying reliable sources (science))

These are clear and strong statements of the value—and the caution required—in using organisation publications; however, they are unlikely to be read by most contributors outside of the medicine and science fields. Beyond policies and guidelines, the use of reports is further discouraged by the lack of tools available. The lack of attention to reports is reflected in the Citoid citation tool and template, which displays only four template options: books, journals, news and web pages. This makes citing reports ({{Cite report}}) and other formats much harder to do accurately using existing tools and templates, especially for new editors.
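Until the tooling improves, a report citation can still be entered by hand. The following wikitext sketch uses standard Citation Style 1 {{Cite report}} parameters; the report details shown (author, title, publisher, URL) are invented for illustration:

```wikitext
<ref>{{cite report
 |last=Citizen |first=Jane
 |title=Annual Policy Review 2023
 |publisher=Example Policy Institute
 |date=2023
 |url=https://example.org/annual-policy-review-2023.pdf
 |access-date=1 March 2024}}</ref>
```

Because Citoid does not offer this template, editors must know the parameter names in advance, which is precisely the barrier for new contributors noted above.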

Despite the increasing importance of digital publishing by organisations, particularly for public interest issues, general WP policies and guidelines provide scant advice on its role as a reliable source (WP, Citation2023a). In the next section, we review the literature to understand what we know about actual citation practices on WP.

Related works on Wikipedia sources

Despite the importance of verifiability and reliable sources in WP policies, there are significant challenges in accessing Wikipedia citation data at scale. Arroyo-Machado et al. (Citation2022b) list 15 key publications since 2007, mostly within computer science and the sciences, with only one focussed on the humanities. A large factor inhibiting studies of WP references is their inaccessibility for large-scale data analyses (Singh et al., Citation2021). There is still no standard format for references on WP and no central database of referenced sources. Generally, researchers have to extract references via multiple templates from a data dump or an API and then try to classify them. Various methods and tools have been developed for this purpose and Arroyo-Machado et al. (Citation2022b) provide a summary table of reference data sources by format, update frequency, data quantity, type and challenges, which includes: Wikimedia Dumps, MediaWiki and Wikimedia APIs, Wiki Replicas, Event Streams, Analytics dumps, WikiStats, DBpedia, XTools, repositories and Altmetric aggregators. There are also resources on how to access Wikipedia data available through the WM research community (WMF, Citation2024).
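The template-extraction step described above can be sketched in a few lines. This is a deliberately simplified illustration (real pipelines use a proper wikitext parser such as mwparserfromhell and handle nested templates, which this regex does not; the sample wikitext is invented):

```python
import re

def extract_citation_templates(wikitext):
    """Return (template_name, fields) pairs for {{cite ...}} templates.

    Simplified sketch: assumes templates are not nested, which is often
    false in real articles -- production code should use a real parser.
    """
    citations = []
    # Match {{cite book|...}}, {{cite journal|...}}, {{cite report|...}}, etc.
    for match in re.finditer(r"\{\{\s*(cite \w+)\s*\|([^{}]*)\}\}",
                             wikitext, re.IGNORECASE):
        name = match.group(1).lower()
        fields = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, _, value = part.partition("=")
                fields[key.strip()] = value.strip()
        citations.append((name, fields))
    return citations

sample = (
    "Text.<ref>{{cite report |title=State of the Climate |publisher=CSIRO "
    "|date=2020 |url=https://example.org/report.pdf}}</ref> More text."
    "<ref>{{cite journal |title=A paper |journal=Nature |doi=10.1000/xyz}}</ref>"
)
cites = extract_citation_templates(sample)
```

Once extracted, each template's fields can be inspected for identifiers or publisher names, which is where classification efforts such as those surveyed by Arroyo-Machado et al. begin.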

One of the main approaches in classifying WP citations has been to analyse identifiers such as DOIs or ISBNs. This was the approach used by WMF researchers in 2018 in a study that aimed to stimulate research into sources on Wikipedia (WMF, Citation2018). While this provides interesting results in terms of data on the formal publications that are being cited, it misses a great deal as many older journal articles, most research reports and papers and news articles do not have identifiers. Singh et al. (Citation2021) found in their large-scale analysis of 29 million Wikipedia citations that only 7 per cent of Wikipedia pages cite a journal article with a DOI and 13 per cent cite an item with an ISBN. The rest (80 per cent) were described as web links. Singh et al. have made their dataset available in Zenodo and this data has been further analysed to map science (Yang & Colavizza, Citation2022a) and humanities academic sources (Torres-Salinas et al., Citation2019). There has also been a study of news media sources from the same data (Yang & Colavizza, Citation2022b), which were estimated to be 30 per cent of citations.
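The identifier-based bucketing used in studies like Singh et al.'s can be approximated with simple pattern matching. This is a hedged sketch: the DOI and ISBN patterns below are simplified, check digits are not validated, and the example citation strings are invented:

```python
import re

# Simplified patterns: a real pipeline would validate ISBN check digits
# and handle identifiers split across template fields.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+", re.IGNORECASE)
ISBN_RE = re.compile(
    r"\bISBN[:\s]*((?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dX])\b", re.IGNORECASE)

def classify_citation(citation_text):
    """Bucket a raw citation string the way identifier-based studies do:
    'doi' and 'isbn' for formally identified works, 'web' for the rest."""
    if DOI_RE.search(citation_text):
        return "doi"
    if ISBN_RE.search(citation_text):
        return "isbn"
    return "web"

citations = [
    "Example paper (2021) doi:10.5281/zenodo.3940692",
    "Example book (2018). ISBN 978-0-12-345678-9",
    "https://apo.org.au/ An example policy report with no identifier",
]
buckets = [classify_citation(c) for c in citations]
```

The third string illustrates the problem the text identifies: a policy report with no DOI or ISBN falls into the undifferentiated "web" bucket, which is why identifier-based studies undercount organisation publications.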

Lewoniewski and colleagues (Citation2020) have also conducted detailed analysis of the most used references by source across all language Wikipedias. As expected, they found that a significant portion of references are to academic publications and news media; however, they also identified a large number of official government datasets (such as census data) and major organisations (such as the World Health Organisation, the United Nations and UNESCO) as popular sources (Lewoniewski, Citation2022). Access to this data is available via the BestRef website (Lewoniewski, Citation2024), which provides data for each language WP based on various calculation methods.

Beyond large-scale analyses, other important research has focused on more contextual and qualitative approaches, sometimes combined with large-scale data, involving specific pages, topics, or a random sample of articles (see for example Avieson, Citation2019; Citation2022; Dehdarirad et al., Citation2018; Ford, Citation2022; Ford et al., Citation2013; Luyt, Citation2021; Luyt & Tan, Citation2010). These kinds of case studies and targeted approaches provide additional insights and more nuanced stories about the vast Wikipedia citation data. A 2013 study by Ford et al., which analysed a random sample of 500 Wikipedia articles, found that 45 per cent of articles cited publications from organisations and 18 per cent cited data (Ford et al., Citation2013). These types of content-analysis methods are a valuable complement to the more large-scale quantitative approaches as they provide a more detailed picture of the relationship between topics and citations. Yet few recent studies have looked at the nature or extent of organisation-based publications as sources, whether via large-scale quantitative data analysis or via smaller-scale contextual analysis or other approaches. We have therefore conducted a brief analysis of existing data to gain some initial insights.

Analysis of existing data on Wikipedia references and organisation sources

Based on the BestRef data of 2020, we conducted a synthesis of the top 200 sources (a limit set by the website), based on the F-model, that is, how many references contain the analysed domain (Table 1). Commercial and public media (32 per cent) and entertainment (27 per cent) sources were the most common, together representing over half (59 per cent) of the top 200 domains. These were followed by government (13 per cent), libraries/databases (8.5 per cent), NGOs (8 per cent) and academic sites (6 per cent). Search engines and social media were a minor presence in the mix, as were a handful of sites that had closed.

Table 1. Top 200 sources on English Wikipedia classified by type.

Examples of sources cited include:

  • Commercial and public media: NY Times, BBC, Guardian, newspapers.com, Google news, LA Times, Washington Post, UK Telegraph, India Times

  • Entertainment: YouTube, Billboard.com, IMDb, AllMusic, Baseball-Reference

  • Government: census.gov, NASA, National Oceanic and Atmospheric Administration (NOAA.gov), National Institutes of Health (NIH.gov), Integrated Taxonomic Information System (ITIS.gov), Statistics Poland, the European Union (Europa.eu)

  • Libraries and collections: Google Books, Internet Archive, National Library of Australia (NLA), HathiTrust, Library of Congress, ResearchGate, National Library of New Zealand

  • Non-government organisations: International Union for Conservation of Nature (IUCN), Global Biodiversity Information Facility (GBIF.org), British History Association, Historic England, UN, UNESCO.

This brief analysis shows that government, NGO, university and commercial sources are a significant part of WP’s sources, in addition to academic journals and books, and we might expect that for public policy related topics the distribution would be even more significant. Lewoniewski (Citation2022) has found that different topics feature different types of sources. However, to really understand the sources being used requires a much closer analysis and evaluation of specific references and the topics to which they are being applied. Based on their work developing a knowledge graph of WP informatics, Arroyo-Machado et al. (Citation2022a) have set up an interactive website called Wikinformetrics which has a dataset with a scatter plot and ranked tables based on a range of indicators including page views, age, length, number of edits of the article, references, published references and others. The top 1,000 Wikipedia articles for each indicator are included in the plot, making a total of 7,374 unique articles in the dataset.

As a way of analysing this data we chose climate as a search term, one of the most important public interest issues of our time, crossing both scientific and public policy spheres. A search on the Wikinformetrics rankings for Wikipedia articles with climate in the title returned six articles on climate policy and environmental science, and for all of them “published references”, by which they mean those with an identifier such as a DOI or ISBN, were only a small minority of the references cited compared to URL references and total references. Even the page on Scientific consensus on climate change only had 22 “published references” out of 147 in total. Without the capacity to drill down further to a detailed analysis of the sources actually cited, which could include books and journal articles without identifiers, as well as reports and websites, it is difficult to know what is really being cited from this dataset, but clearly a range of sources are being used beyond these traditional formats. Given the importance of climate change as a public interest topic, and the role of WP in the knowledge ecosystem, we are missing a great deal of critical information on WP’s verifiability and the sources actually being cited on critical policy topics.

Case study: missing link project

The Analysis & Policy Observatory (APO) plays a central role in Australia and New Zealand’s free knowledge ecosystem by sourcing material published by organisations and creating metadata to make it discoverable in an open access digital repository and across internet and library search engines. APO also plays a crucial role in disseminating this public policy and research material to more than 15,000 newsletter subscribers across government, the not-for-profit sector and academia, as well as the general public, and in working with government and researchers to create an evidence base on an issue and build engagement with it.

The objective of the WMF funded project, The missing link: Incorporating policy reports into the free knowledge ecosystem, was to improve the presence of valuable policy and research material and coverage of important policy issues in WM. The project involved identifying the most reputable and prolific publishing organisations as sources in the APO database, feeding APO report metadata into Wikidata, and promoting awareness and use of these sources.
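The metadata-feeding step can be sketched as a mapping from a repository record to Wikidata statements. In this minimal illustration the property IDs (P1476 “title”, P123 “publisher”, P577 “publication date”, P953 “full work available at URL”) are real Wikidata properties, but the record fields and mapping function are our own hypothetical simplification of the project's pipeline, not its actual code:

```python
# Hypothetical APO repository record for one policy report.
record = {
    "title": "Example Policy Report",
    "publisher": "Example Policy Institute",
    "date": "2023-06-01",
    "url": "https://example.org/example-policy-report",
}

# Map repository fields to Wikidata property IDs. A simplification:
# real ingestion resolves the publisher to an existing Wikidata item,
# attaches references, and uses a bulk tool such as QuickStatements.
FIELD_TO_PROPERTY = {
    "title": "P1476",      # title
    "publisher": "P123",   # publisher
    "date": "P577",        # publication date
    "url": "P953",         # full work available at URL
}

def to_statements(record):
    """Return (property_id, value) pairs ready for bulk upload."""
    return [(FIELD_TO_PROPERTY[k], v)
            for k, v in record.items() if k in FIELD_TO_PROPERTY]

statements = to_statements(record)
```

Once such statements exist as a bibliographic item in Wikidata, the report becomes citable infrastructure across the WM ecosystem rather than an unstructured web link.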

The project aimed to address two key issues in particular.

  1. Current public policy sources: As discussed above, WP’s primary reference sources are either digital news media or commercial or scholarly publications, such as books and academic journals. While news media contain the latest and most up-to-date information, they often do not present in-depth research or policy on a particular issue; and while books and journals fill this gap, the time taken to publish leads to a loss of currency and they tend not to be open access.

  2. Underrepresented perspectives: Smaller organisations, such as those that are First Nations-led, from the Pacific Islands, and from other underrepresented communities, are more likely to publish their own material than to publish through commercial channels.

The project aimed to upload only reports from publishing organisations that would be considered reliable, in order to support editors in complying with WP policy. The biggest challenge in the project was defining a reliable source, owing to the diverse range of guidance on reliability across Wikimedia. As previously demonstrated in this paper, there are many policies and guidelines about WP sources and many different perspectives, at times contradicting other guidance. The guidelines consulted include:

  • WP: Reliable sources

  • WP: Reliable Sources Noticeboard/Perennial Sources

  • WP: No original research (Reliable sources)

  • WP: Tiers of Reliability

Two different reliability scales were also discovered: the reliability scale on the Perennial sources noticeboard (WP: Perennial Sources) and the Tiers of reliability scale (WP, Citation2023j: Tiers of Reliability). The scale in Figure 1 is focussed on rating a specific publisher, while the Tiers of Reliability scale is focussed on the type of publication but also includes the publisher (at a high level). Some aspects of the two scales contradict each other, while others cross-reference one another. However, neither refers to reports published by organisations or grey literature. The Tiers of Reliability scale refers to “expert” and “non-expert” self-published material with very little guidance on how to distinguish the two.

Figure 1. Reliability scale on the reliable sources noticeboard. (Source: WP, Citation2023i: Perennial Sources).


WP: Tiers of reliability scale (March 2023)

Tier 1: most reliable

  • Peer-reviewed publications: peer-reviewed articles in academic journals, systematic literature reviews and other review articles, peer-reviewed conference papers

  • Academic books: books published by university presses and specialist encyclopaedias

Tier 2: more reliable

  • Non-peer-reviewed academic publications: articles in academic journals, book reviews, conference papers

  • Mass-market books

  • Highly-reputable international journalism

Tier 3: reliable

  • Tertiary sources: general reference works e.g. encyclopaedias and dictionaries

  • Other generally reliable news sources: national and international journalism that is less than top-rated, regional and local news, trade publications

  • Any source listed as green at WP:RSP would be at least in this tier

  • Expert self-published

Tier 4: limited use

  • Non-expert self-published: official websites, brochures and other promotional materials, opinion/editorial, vanity press, predatory publishing, churnalism, government propaganda, theses

  • Questionable sources

  • Primary sources

Throughout the process of determining how a “reliable source” could be defined, it was essential to distinguish which aspect of the publication is the subject of the assessment, that is:

  • the work itself,

  • the author,

  • the type of publication, or

  • the publisher, which falls into two categories: commercial publisher or non-commercial publisher (often the organisation itself).

While this project was born out of an identified need to increase the use of a type of publication, loosely defined as “policy reports”, and to help WP editors more easily select reliable reports for use, we primarily focussed on determining the reliability of the publisher, specifically non-commercial publishing organisations.

Criteria for inclusion of publishing organisations and reports

Drawing on the variety of policies and guidelines on reliable sources and conversations on various Wikimedia forums, we developed two criteria for selecting reports to be uploaded into Wikidata for citation on WP.

Criteria 1: The publishing organisations need to be considered Australia and New Zealand's most notable and eminent government and research organisations. Advice from the WM community was to exclude advocacy organisations. However, this may exclude some reports authored by academic researchers as advocacy organisations often engage universities to produce research reports that they publish.

Criteria 2: For each publishing organisation, APO conducted a brief review of all catalogued reports and documents to ensure any policy reports were published by organisations that have a remit for or governance over the policy area, and that research reports were based on research and not opinion.

It should be noted that WP editors using these reports as sources are expected to apply a “no consensus” rating (from the Reliable Sources Noticeboard reliability scale), which requires in-text attribution.

Outcomes

APO identified 64 publishing organisations from Australia and New Zealand from which to select the reports. However, this was narrowed down to 36 publishers due to the large quantity of reports from each publisher catalogued on APO (an average of 200 reports per publisher, which far exceeded our goal of uploading 1,000 reports).

The process of vetting the reports took longer than expected, not because of checking that they met Criteria 2, but because many records contained broken links (APO’s process for preventing and managing broken links has improved over the last few years, so this issue was more common for older records). In the end a total of 3,097 reports were uploaded into Wikidata.
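Broken-link checks of the kind described can be automated. The sketch below, which assumes a simple HEAD-request strategy and treats 4xx/5xx responses and failed requests as broken, is illustrative rather than APO's actual vetting tooling.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def link_status(url, timeout=10.0):
    """Return the HTTP status code for a URL, or None when the request
    fails outright (DNS error, timeout, malformed URL, etc.)."""
    try:
        req = Request(url, method="HEAD",
                      headers={"User-Agent": "apo-linkcheck/0.1"})  # hypothetical agent string
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:        # the server responded with an error code
        return err.code
    except (URLError, ValueError):  # no usable response at all
        return None

def is_broken(status):
    """A record's link counts as broken when there was no response or
    the server answered with a 4xx/5xx status."""
    return status is None or status >= 400

# Usage against real records requires network access, e.g.:
#   broken = {rid for rid, url in records.items() if is_broken(link_status(url))}
```

A production checker would also want retries, rate limiting and a fallback to GET for servers that reject HEAD requests.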

As the purpose of this project was to improve the coverage of policy issues on English WP, a key element was to promote the awareness and use of the reports. While report metadata on Wikidata can improve the quality of citations, it does not provide a searchable database in which to find appropriate sources. To address this APO created a collection of the reports on its website to enable prospective editors to discover and locate reliable sources. The APO team also added references to existing pages and content, new sections to existing pages with the relevant sources, and created new pages—all on Australian public interest issues.

Discussion: Improving public policy sources in Wikimedia

As noted above, there are significant issues in conducting research on WP sources due to the lack of data and detailed analysis. A major issue for transparency, verifiability and the study of sources on WP that must be addressed is the lack of a dedicated citation database.

While Wikidata includes a large number of bibliographic records—to the point that they have come to dominate the database and even slow down performance—it is certainly not a comprehensive or dynamically updated set of scholarly publications and therefore difficult to use for citation analysis or metrics across a field or discipline without careful curation and data input first. While the value of this aim is debatable, the issue is more complex in the case of reports which are generally poorly curated, managed and cited. There is certainly a need for a bibliographic database or aggregator for research reports and papers not being collected and curated in commercial or scholarly platforms and it may be a useful role for Wikidata or a spin-off bibliographic database. However, we would argue that the main imperative is to have a database for WP citations. One option currently being discussed in the WM community is to export the bibliographic content from Wikidata into a separate linked database and then focus on harvesting and adding items referenced in WP projects. This may also be the precursor to a more ambitious proposal to develop a WP citation database.

In 2020 Liam Wyatt put forward a proposal for a Shared Citations database designed to centralise the hosting and metadata-management of individual references used in any WM project (WMF, Citation2023). As the proposal states, “Our references are high in maintenance, technical complexity, and duplication of effort. This results in knowledge gaps and biases that are difficult to quantify and address”. The idea is similar to the role of WikiCommons as a shared database of images that can be used on any Wikimedia project without duplication. The Shared Citations proposal aimed to reduce the time and effort required for editors to add references, reduce duplication, standardise referencing and increase the understanding of what is being cited on Wikimedia projects. It suggested building a dedicated database that could be linked to Wikidata and other databases via various facets including author, publisher, main subject, organisation, location and journal, which would not only make the citation of sources easier, but also enable in-depth analysis and visualisation of sources. “This would enable significant research (both internal and academic) to be undertaken for the first time” (WMF, Citation2023). This would be particularly useful for classifying and monitoring the huge number of websites and organisations which currently feature as sources yet remain largely unexamined. However, at this point, there has been no further progress on the Shared Citations proposal, perhaps because of the scale and cost involved in that particular model. The problem remains, and whatever the solution, it is essential that the citation management issue is recognised as a priority for WP’s commitment to verifiability.
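The deduplication goal at the heart of the proposal can be illustrated with a minimal sketch. The `Citation` record, its fields and the DOI-first identity rule below are our assumptions for illustration, drawn loosely from the facets the proposal names; they are not a schema from the actual proposal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    """One record in a hypothetical shared citations store. The fields
    mirror the facets named in the proposal (author, publisher, main
    subject, journal); this is a sketch, not an actual WMF schema."""
    title: str
    author: str = ""
    publisher: str = ""
    main_subject: str = ""
    journal: str = ""
    doi: str = ""

def dedup_key(c):
    """Use the DOI as a reference's identity where one exists; fall back
    to a normalised title/author pair for sources without identifiers,
    such as organisation reports."""
    if c.doi:
        return ("doi", c.doi.lower())
    return ("title-author", c.title.strip().lower(), c.author.strip().lower())

def merge(citations):
    """Collapse duplicate references across projects, keeping the first
    copy seen, so each reference is stored and maintained only once."""
    store = {}
    for c in citations:
        store.setdefault(dedup_key(c), c)
    return store
```

Even this trivial identity rule shows why organisation reports are the hard case: without identifiers, deduplication falls back on fuzzy metadata matching.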

One area that could be developed to support management of diverse formats is a coherent taxonomy of publication genres and formats within Wikidata. Genres for public policy already on Wikidata include:

  • Report (Q10870555)

  • Evaluation reports (Q109466918)

  • Research report (Q59387148)

  • Technical report (Q3099732)

  • White paper (Q223729)

  • Working paper (Q1228945)

  • Literature review (Q2412849)

  • Systematic review (Q1504425)

  • Public policy (Q546113)

  • Strategy (Q185451)

Others may well be required and Wikidata could play a key role in helping to identify and develop schemas for the types of genres currently being produced and distributed in the public arena, something that is currently lacking.
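The genre items listed above can also be used programmatically, for example to tag uploaded reports or to query Wikidata for items of a given genre. In the sketch below the QIDs are taken from the list; the helper names and the SPARQL query shape (using the standard `wdt:P31` instance-of path against the public query service) are illustrative assumptions.

```python
# Genre QIDs from the list above, as a small lookup table.
POLICY_GENRES = {
    "report": "Q10870555",
    "evaluation report": "Q109466918",
    "research report": "Q59387148",
    "technical report": "Q3099732",
    "white paper": "Q223729",
    "working paper": "Q1228945",
    "literature review": "Q2412849",
    "systematic review": "Q1504425",
    "public policy": "Q546113",
    "strategy": "Q185451",
}

def genre_qid(label):
    """Look up a genre QID by its English label, case-insensitively."""
    return POLICY_GENRES.get(label.strip().lower())

def sparql_for_genre(label):
    """Build a SPARQL query selecting items that are instances (P31) of
    the given genre, suitable for the public Wikidata query service;
    returns None for unknown genres. The query shape is a sketch."""
    qid = genre_qid(label)
    if qid is None:
        return None
    return f"SELECT ?item WHERE {{ ?item wdt:P31 wd:{qid} . }} LIMIT 100"
```

A fuller taxonomy would also need subclass relations between these genres (e.g. systematic review as a kind of literature review), which is exactly the schema work suggested above.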

Another area where Wikidata could assist is in expanding the number of organisations and their related entities recorded, to support verifiability and evaluation. This would help to ensure this information flows into other systems such as the recent OpenAlex initiative (Priem et al., Citation2022), a new, fully-open scientific knowledge graph (SKG) launched to replace the discontinued Microsoft Academic Graph (MAG).

Without fundamental changes to WP’s guidelines on reliable sources, improving the infrastructure for the sourcing and analysis of organisation publishing can only result in limited improvements in the presence of public policy material on WP. We strongly recommend that the Wikipedia guidelines for medicine and science on report literature be added to the overarching guidelines, as they are essential for many other public interest topics covered on Wikipedia and could help to inform the wider research communication system. And as we have covered in detail above, we need to go beyond large-scale citation analysis to drill down and find exactly which genres, publishers, organisations and other sources are being used as references for key public policy issues on WP.

Conclusion

Wikipedia must keep up with the changing dynamics of digital publishing and the way diverse actors are producing online content. There is no doubt that this is a challenge and a key step involves having a realistic and up-to-date perspective on the way in which knowledge is produced and circulated in various contexts. Public policy issues are one context with a complex ecosystem of stakeholders across multiple sectors. It is essential that we harness the scale and collaboration systems of WP, Wikidata and other projects to improve understanding of organisation publishing and reports as potential sources of evidence. Wikimedia projects need to have clear guidelines on the use of reports published by organisations, which would benefit not only the WP editing community but the wider knowledge ecosystem. Organisation publications are often dispersed and disaggregated, with ad hoc standards and metadata and poor long-term management, making them difficult to find, catalogue, reuse and evaluate. In partnering with digital libraries such as the Analysis & Policy Observatory, Wikimedia—particularly Wikipedia and Wikidata—can play a bigger role to improve the way reports and papers are used, cited and managed and improve understanding of research publishing diversity. This will contribute to the mission of knowledge equity as part of creating a global public knowledge commons that is equitable, credible, sustainable, effective and efficient, and ensure Wikimedia is able to meet the demands of the next generation of knowledge production.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References