Internet Histories
Digital Technology, Culture and Society
Volume 7, 2023 - Issue 4
Research Articles

Continuity and discontinuity in web archives: a multi-level reconstruction of the firsttuesday community through persistences, continuity spaces and web cernes

Pages 354-385 | Received 03 Apr 2023, Accepted 06 Aug 2023, Published online: 08 Sep 2023
 

Abstract

Web archives are not direct traces of the web; they are direct traces of crawlers. By design, the structure of web archives limits our capacity to explore the memory of the Web. These structural issues induce temporal discontinuities such as inconsistency, redundancy and blindness. In this paper, we address the question of re-injecting continuity within large corpora of web archives. We introduce the notions of persistences (series of time-stable snapshots of archived web pages) and continuity spaces (networks of time-consistent persistences). We demonstrate how, on the basis of a quality score, persistences can be used to select subsets of web archives within which in-depth historical analysis can be conducted at scale. We then propose a new visualization approach, called web cernes, to reconstruct the multi-level temporal evolution of an archived community of web sites. Finally, we apply our framework to study the history of the firsttuesday movement: a constellation of entrepreneurial web sites that acted in the interest of the economic growth of the web in the early 2000s.

Acknowledgements

I would first like to thank Y. Le Jean for giving me the opportunity to organise a research residency at her home in June 2021 and thus begin my work on the First Tuesday archives. I would also like to thank V. Schafer and F. Pailler for their friendly criticism of the web cernes method and for their participation, as historians, in helping me formulate the research questions of Section 7.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 Launched in 1996, the Internet Archive library keeps the traces of over 747 billion web pages, see https://archive.org/.

2 See the fable of “Funes the Memorious” in Borges’ anthology Ficciones.

3 See (Masanes, 2006) for a wide review of web archiving techniques.

4 An archived known change is detected if the content of p has evolved between two consecutive crawls.

5 The figure of 1000 resources is the rough size used by Brügger et al. (2020) to mark the boundary between quantitative and qualitative research in digital web history.

6 By dead, we refer to web sites that no longer exist on the living web, according to the terminology introduced by Lobbé (2018).

10 Those local places were either standalone sites – quasi-siblings of the original firsttuesday.com – or simple discussion groups hosted by larger platforms (Yahoo Groups, e-groups, etc.).

11 An un-archived page is a web page cited in D1 whose hostname matches the prefix firsttuesday but which is not part of the Internet Archive database.

12 The corpora D1, D2 and D3 can be downloaded at https://doi.org/10.7910/DVN/EAFOHY

13 We believe that any web archive explorer must be able to manipulate simple lines of code, as these archives cannot be dissociated from their technical nature. This is why we have not built any turnkey software but have instead chosen Python scripts that can be modified if necessary and are nonetheless very easy to use.

14 Déndron is downloadable at https://gitlab.iscpif.fr/qlobbe/dendron. Licence: AGPL + CECILL v3

15 We note, however, that it might be interesting to focus on sequences of impermanences to study moments of high activity in an archived web page.

16 Future developments will be dedicated to the review of more precise comparison measures (Wills & Meyer, 2020) able to deal with the complete topographic properties of graphs (nodes and links).

17 34 sites have been archived by the Internet Archive; 3 have not.

18 We here use the force atlas algorithm (Bastian et al., 2009).

19 The sites with extensions north.com, .es, .br, .be, scotland.es, frankfurt.com, cincinnati.com and krakow.pl are considered as poorly connected.

20 For some fragments, we have not been able to extract a place from the raw text. This is mainly due to an incomplete preservation of the pages.

21 Network downloadable at https://doi.org/10.7910/DVN/EAFOHY

22 The corpus D4 is reduced to a set of 544 documents containing raw textual descriptions.

23 The corpus, the list of terms and the phylomemy can be downloaded at https://doi.org/10.7910/DVN/EAFOHY

24 Collective shapes are shapes resulting from entities in dynamic interactions within a given environment (Bourgine & Lesne, 2015).

25 The corpus and the vocabulary can be downloaded at https://doi.org/10.7910/DVN/EAFOHY

Additional information

Notes on contributors

Quentin Lobbé

Quentin Lobbé is a researcher at the Complex Systems Institute of Paris Île de France (ISCPIF – CNRS). His research lies at the intersection of Digital Humanities, Computational Social Science and Complex Systems. His scientific contributions focus on the reconstruction of social dynamics based on the analysis of digital traces. Quentin Lobbé is particularly interested in the analysis of knowledge dynamics and in the exploration of web archive corpora.
