Discussion

Attention is all low-resource languages need

Received 24 Mar 2024, Accepted 25 Mar 2024, Published online: 11 Apr 2024

The question of what it might mean “to take Indigenous languages seriously” has never been more pressing. It is true that minority, Indigenous, or otherwise low-resource languages have been almost “invisible” (Cronin Citation1998, 158) in translation studies (TS) since the discipline’s inception, and that these languages are still considered to occupy its periphery. But is this not the natural state of things? We would not expect smaller languages to enjoy the same level of representation as mainstream global languages, but we would still expect some representation, some attention, and this issue will only become more critical in a future dominated by AI tools that multiply linguistic inequalities.

Lemieux and Roy are right to call for a rethinking of the narratives of extinction that so often accompany discussions of Indigenous and minority languages and cultures (cf. “seeing in Indigenous culture a ‘vanishing race’” [Citation2024, 194]). Such narratives can become a self-fulfilling prophecy: the more we say something is endangered, the greater the urge to gatekeep knowledge and preserve its “purity” (Henitiuk and Mahieu [Citation2024] talk of the “self-proclaimed gatekeepers jealously guarding their authority”). A more egalitarian approach is needed. This is not to say we should be blind to the dangers of language extinction, merely that all victories should be celebrated, however major or minor. I commend Henitiuk and Mahieu, the instigators of this timely debate, on the 2021 publication of their trilingual edition of “Uumajursiutik unaatuinnamut”, complete with the story in Inuktitut syllabics. Allowing readers to experience the story as it was originally written, alongside (more accessible) renditions in major languages, is an important step forward for language representation and a major victory for Inuktitut. Critical editions such as this provide the context readers need to delve into an Indigenous culture.

How far can readers realistically be expected to delve, however? In a globalized society, languages are a commodity, and parents would presumably prefer their children to learn languages that maximize their future opportunities. In TS, too, there is a general emphasis on professional contexts of translation, and the translation market for endangered languages is, by definition, severely restricted. The under-representation of low-resource languages in TS is, then, a reflection of the same problem: the “free market economy” of language. Is it the place of TS scholars specifically to redress global imbalances such as these? Perhaps not, and it would be naïve to suggest that TS scholars should race to the margins and forsake the mainstream. After all, the time investment required to learn a low-resource language is prohibitive for most.

The question is, in all such cases, one of relative power (O’Reilly Citation2001, 9). Is there a way to empower smaller languages? The advancement of AI tools, the dawn of big data, and the emergence of large language models have brought those of us in TS an overwhelming number of new tools and technologies. We can hardly keep up with the pace of change. But these new inventions are, generally speaking, the sole preserve of languages that possess a critical mass of data to be processed, the linguistic fodder to be churned through the digital training mills. Is there a way to make sure that low-resource languages do not become unwitting casualties of technological progress?

While large language models grow increasingly unwieldy, creaking under the weight of their billions of parameters, even a “small” language model still has under 100 million parameters, each of which must be trained on data. How do we train even a million parameters on languages used by only a small number of people? This is clearly unrealistic. The problems faced by languages at the narrow end of the power spectrum begin before we even get to the point of sifting through the data, however. The data must first be encoded, and even at this level, disparities in how languages are represented at the token level ensure that inequality is “baked in”. Petrov et al. (Citation2024, 4) show how one of the Shan words for “you”, မႂ်း, is tokenized by ChatGPT and GPT-4 as a composite of nine tokens representing one consonant and three diacritics (because the diacritics are encoded as separate Unicode code points). The English “you”, in contrast, consists of three Unicode characters but only a single token. This effectively limits how much Shan text the models can process at once. Tokenizer disparity has direct implications for pricing, too: if large language model application programming interfaces (APIs) charge for access per token, then requests in low-resource languages become more expensive to process.
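
The disparity is easy to observe first-hand. Below is a minimal sketch, assuming the openly available tiktoken library (which exposes the byte-pair tokenizers used by ChatGPT and GPT-4; the library is my illustration, not part of Petrov et al.’s study), that counts the tokens consumed by the same word in English and in Shan.

```python
# A minimal sketch (assuming the tiktoken library) comparing how many tokens
# the same word costs in English and in Shan. Exact counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

for label, word in [("English", "you"), ("Shan", "မႂ်း")]:
    tokens = enc.encode(word)
    print(f"{label}: {len(word)} code points -> {len(tokens)} tokens")

# Petrov et al. (2024) report nine tokens for the Shan word against a single
# token for English "you": the same meaning, many times the token budget.
```

On a per-token pricing model, the ratio between the two counts is, in effect, the surcharge a Shan speaker pays for the same request.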

Researchers have also identified “concerning patterns” (Nguyen and Anderson Citation2023) in how generative AI translates low-resource languages: the models are prone to producing inconsistent or even incorrect translations, and to making up words in minority languages entirely. GPT-4 tells me that “hello” in Naxi (a minority language from southwest China) is “mba”, apparently “pronounced similar to ‘mbah’”, to which I can really only say “bah!”, for the simple reason that this is nonsense; the AI is hallucinating an imagined word. The increasing use of language models as alternatives to traditional search engines (and, at some point in the future, perhaps of trained models as replacements for CAT tools altogether) will only make low-resource languages even more invisible, a problem compounded for those languages written in unusual scripts.

Getting even a single word in a low-resource script into print has always been an arduous process. Even major scripts present typographical challenges: published versions of Ezra Pound’s Pisan Cantos omit over fifty sets of Chinese characters that the poet had directed his publishers to include. Lower-resource languages have it even harder. When Pound wanted to include the minority Naxi script in his long poem, only two of these highly “pictographic” characters actually made it into print; a number of others remain only in his early manuscripts. The presence of these two characters (at the end of Canto CXII) is itself a significant victory for this minority writing, but the other Naxi graphs Pound wrote have been forgotten, exiled from the Cantos as we know them for two reasons: first, scholars are generally not equipped to deal with the originals, unlike, say, his Chinese or Greek, which could at least to some extent be restored in a critical edition (see Bush and Ten Eyck Citation2013); and second, we cannot even write the characters easily on a computer.

Consider the appearance in one of Pound’s notebooks of the Naxi sacred mountain, with the oak growing on the mountaintop and the setting sun to the side (Figure 1).

Figure 1. “Quercus on Mt Sumeru” Naxi writing from Pound’s Notebook No.1 (Citation2010 [1958], left), and in standardized Naxi graphs (right).


This is the sacred “Mount Sumeru” in visual form, presented directly to the reader alongside the English lines “Quercus cleistocarpa / on Mt Sumeru”. In the final published form of the Cantos, however, the composite Naxi graph is absent; only the English line “Quercus on Mt Sumeru” remains. Before we even reach the issue of tokenization, there is a more fundamental problem here: would it be possible to encode such a compound graph, the tree sprouting from the side of a mountaintop, at all? There is, as yet, no Unicode encoding for Naxi, and this adds another layer to the unfairness and imbalance such scripts face in the digital marketplace: we can only draw the graph, not write it on our computers, and the writing is therefore easily dismissed as something fundamentally unserious, nothing more than a pretty picture.
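
The gap between an encoded and an unencoded script can be made concrete. Here is a minimal sketch, using only the Python standard library (the example is mine, not drawn from the sources cited), that lists the code points behind the Shan word discussed above; no comparable listing can be written for the Naxi graph, because there are no code points to list.

```python
# A minimal sketch (Python standard library only) of what Unicode support
# does and does not make possible. The Shan word cited earlier decomposes
# into named code points; the Naxi Dongba graph from Pound's notebook has
# none, so it can only circulate as a drawing or an image file.
import unicodedata

shan_word = "မႂ်း"  # the Shan word for "you" discussed by Petrov et al. (2024)
for ch in shan_word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, 'UNNAMED CHARACTER')}")

# There is no equivalent loop for the "Quercus on Mt Sumeru" graph: with no
# Unicode encoding for Naxi Dongba, there is simply nothing to iterate over.
```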

There is still hope for low-resource languages, however. As Tanzer et al. (Citation2023) have shown, models can still learn to translate from small amounts of data: in their benchmark, from a single grammar book. It is unrealistic to expect scholars of Indigenous or low-resource languages to collect datasets of the scale that current machine learning methods demand; we must instead rely on more qualitative data, such as a carefully produced bilingual translation. TS scholars and practitioners are uniquely positioned to provide it. As scholars, we can help usher in the publication of articles and books that explore the relationships between languages at different ends of the power spectrum, and, as translators, we can produce bilingual (or multilingual) editions of important texts. This is translation as a “way in” to a language, allowing more people the opportunity to access it. For those of us who work on low-resource languages, it is our job simply to get that data out there, onto the page, digital or in print, in a responsible fashion; simply put, to give these languages some attention. The landmark paper introducing the transformer architecture upon which modern LLMs are built was entitled “Attention is All You Need” (Vaswani et al. Citation2017), after the much-heralded attention mechanism that drives these models (sketched below). For low-resource languages, a little attention can go a long way too.
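
For readers curious what that mechanism actually computes, here is a minimal sketch of the scaled dot-product attention described by Vaswani et al. (2017), written with numpy; the function and variable names are illustrative only.

```python
# A minimal sketch (assuming numpy; names are illustrative) of the scaled
# dot-product attention that gives "Attention is All You Need" its title:
# every position in a sequence computes a weighted mix of all the others.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (sequence_length, dimension)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                               # weighted sum of the values

# Toy example: three "tokens", each represented by a four-dimensional vector.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # -> (3, 4)
```

The computation itself is indifferent to which language the tokens come from; the question is only whether a language’s text is there to be attended to at all.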

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Duncan Poupard

Duncan Poupard is Associate Professor in the Department of Translation at the Chinese University of Hong Kong. His research is primarily concerned with the written heritage of the Naxi minority of southwest China. He has worked with museums and libraries around the world on revitalizing the unique Naxi manuscript traditions, and his most recent book is a critical edition of a central origin myth, A Pictographic Naxi Origin Myth from Southwest China (Leiden University Press, 2023).

References

  • Bush, Ronald, and David Ten Eyck. 2013. “A Critical Edition of Ezra Pound’s Pisan Cantos: Problems and Solutions.” Textual Cultures 8 (2): 121–141. https://doi.org/10.14434/tc.v8i2.13278. https://www.jstor.org/stable/26500700.
  • Cronin, Michael. 1998. “The Cracked Looking Glass of Servants. Translation and Minority Languages in a Global Age.” The Translator 4 (2): 145–162. https://doi.org/10.1080/13556509.1998.10799017.
  • Henitiuk, Valerie, and Marc-Antoine Mahieu. 2024. “Tangled Lines: What Might it Mean to Take Indigenous Languages Seriously?” Translation Studies 17 (1): 169–180. https://doi.org/10.1080/14781700.2023.2270551.
  • Lemieux, René, and William Roy. 2024. “On the Necessity to Celebrate Indigenous Translation as Performance.” Translation Studies 17 (1): 190–194. https://doi.org/10.1080/14781700.2023.2271471.
  • Nguyen, Sydney, and Carolyn Jane Anderson. 2023. “Do All Minority Languages Look the Same to GPT-3? Linguistic (Mis)information in a Large Language Model.” Proceedings of the Society for Computation in Linguistics 6 (44). https://doi.org/10.7275/xdf4-mh72.
  • O’Reilly, Camille C. 2001. “Introduction: Minority Languages, Ethnicity and the State in the European Union.” In Language, Ethnicity and the State Volume 1: Minority Languages in the European Union, edited by Camille C. O’Reilly, 1–19. Basingstoke: Palgrave.
  • Petrov, Aleksandar, Emanuele La Malfa, Philip H.S. Torr, and Adel Bibi. 2024. “Language Model Tokenizers Introduce Unfairness between Languages.” Advances in Neural Information Processing Systems 36. https://doi.org/10.48550/arXiv.2305.15425.
  • Pound, Ezra. 2010. Drafts & Fragments Facsimile Notebooks 1958–1959, Vol. 1. New York: Glenn Horowitz Booksellers.
  • Tanzer, Garrett, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. 2023. “A Benchmark for Learning to Translate a New Language from One Grammar Book.” https://doi.org/10.48550/arXiv.2309.16575.
  • Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention is All You Need.” Advances in Neural Information Processing Systems 30. https://arxiv.org/abs/1706.03762.