Literature, Linguistics & Criticism

Is Arabic punctuation rule-governed?

Article: 2303818 | Received 18 Sep 2023, Accepted 06 Jan 2024, Published online: 31 Jan 2024

Abstract

This paper investigates the extent to which Arabic punctuation is rule-governed, with the aim of improving text comprehension, disambiguation, and machine translation. The study highlights the lack of systematic punctuation in Arabic written discourse, which may be attributed to difficulties in sentence boundary identification or inadequate differentiation between various conjunctions. The punctuation behavior of Arabic speakers is examined in relation to sentence boundary identification and the level of agreement among Arabic specialists is assessed. A quantitative analysis of paragraph and sentence lengths across genres, categories of writers, and in comparison to English is conducted using five corpora specifically compiled for this study. Additionally, a punctuation survey is carried out to evaluate specialists’ agreement on sentence boundary identification. The results indicate that writers of Arabic interpret punctuation rules differently and that Arabic punctuation practice is irregular. The study suggests that standardization of Arabic punctuation rules is necessary to facilitate comprehension and automatic text processing.

Introduction

Punctuation refers to the use of marks and symbols in text to clarify meaning, convey tone, and indicate the structure and organization of language. It involves the strategic placement of symbols such as commas, periods, and semicolons to help readers understand the intended meaning of a sentence or paragraph. The purpose of punctuation is to create clear and effective communication. To punctuate correctly requires an awareness of mental grammar, the internalized system of rules that speakers and writers use to produce and to comprehend language.

Written communication relies heavily on the use of punctuation marks, including semicolons, periods, and commas, which fulfill crucial functions. Semicolons are used to separate two independent clauses that are closely related in meaning and to separate items in a list when the items themselves contain commas. Periods are used to end a sentence that is a complete thought, and they are used after abbreviations. Commas are used to separate items in a list, separate clauses in a sentence, and set off nonessential information in a sentence (Kaufman & Straus, 2021). The use of these punctuation marks can be complex, and it certainly requires careful consideration of the context and structure of the sentence.

Punctuation is advantageous for syntactic processing and can have a substantial impact on how a grammar parses natural language text (Jones, 1994). Modern state-of-the-art language models, such as BERT, ELMo, and OpenAI’s GPT-2 and GPT-3, now treat punctuation as part of their vocabulary, based on this and similar findings (Păiş & Tufiş, 2022; Yagi et al., 2021).

Adding punctuation and capitalization to speech recognition output using neural machine translation models greatly improves the readability of automatic speech transcripts and can also help many natural language processing tools that may be applied downstream (Varavs & Salimbajevs, 2018). Punctuation also contributes to faster reading and easier transfer of the writer’s ideas and emotions (Awad, 1970).

Punctuation marks reflect the covert prosody of written language (Chafe, 1988). Both writers and readers experience the auditory imagery of intonation, accents, and hesitations. Certain important aspects of this ‘covert prosody’ are reflected in punctuation. Good writing requires an awareness of prosodic imagery.

Despite the acknowledged importance of punctuation in written texts, its systematic use in Arabic written discourse remains limited (Alkhatib et al., 2020). When formulating punctuation rules, writers often resort to the term ‘sentence’ (see Zakī (1912/2013), Ūgān (1999), ‘Aṣfūr (1999), and ‘Abdallāh (2001)), which introduces a level of ambiguity. Even distinguished Arab scholars occasionally use ‘sentence’ and ‘paragraph’ interchangeably. This study sets out to explore and illustrate the extent to which the identification of sentence boundaries poses challenges for contemporary writers of Arabic (Sawalha et al., 2019).

The difficulty in marking sentence boundaries in Arabic has been emphasized by Williams (1989, pp. 89–91), who commented that ‘punctuation is used very erratically’, rendering a formal definition of the Arabic sentence ‘impossible, if not misguided’. Arabic lacks a clear distinction between subordinating and coordinating conjunctions, as well as discourse adjuncts, resulting in situations where the conjunction ‘wa’ functions as both a ‘coordinator’ and a ‘subordinator’ depending on the context (Yagi & Ali, 2008). Additionally, Williams noted that the indefinite relative clause in Arabic stands independently from its containing clause.

In Arabic writing, the use of punctuation has been historically problematic. Holes (2004) underlines that until the latter part of the nineteenth century, Arabic writing largely lacked punctuation, and even today, there is no fully standardized system in place. The usage of periods and commas can be highly variable and idiosyncratic, especially in literary contexts. However, the presence or absence of punctuation is not a significant concern because Arabic relies on a native system of textual chunking, where coordinating and subordinating conjunctions play a dual role by formally signaling the beginnings and endings of sense groups and indicating the logical or functional relationships between them. Keskes et al. (2014) lend support to this observation, noting that punctuation marks are not widely used in contemporary Arabic texts, particularly in long and complex sentences. It is not uncommon to find lengthy paragraphs with only one punctuation mark at the end, such as a dot.

The unconventional use of punctuation in Arabic literature is evident everywhere, even in the works of 20th-century luminaries. Consider, for instance, Naguib Mahfouz’s al-Qāhirah al-Jadīdah, where the third chapter comprises two paragraphs: the first extends over four pages, while the second is confined to 3.5 lines. On average, sentences here consist of 55 words, with punctuation rules appearing to be a secondary consideration (Maḥfūẓ, 1962). A similar disregard for the newly introduced punctuation system is observed in Al-Aqqad’s Sara, specifically in the chapter ‘ilāj al-shak, where a paragraph of 314 words is punctuated by a solitary full stop, two instances of ellipses, two exclamations, and 16 commas (ʿAqqād, 1960). Ahmad Amin’s Fayḍ al-Khāṭir presents yet another variation of this trend; it is common to find paragraphs of 100 words, punctuated by a single full stop and sentences averaging 100 words each (Amīn, 1947). This highlights the idiosyncratic use of punctuation in Arabic.

Therefore, this paper aims to examine empirically the extent to which Arabic punctuation follows established rules.

The study aims to address the following questions:

  1. Do Arabic speakers employ sentence terminal marks consistently?

  2. Are Arabic specialists bound by the same punctuation rules?

  3. Would the punctuation behavior in English-Arabic translation mirror that of the English source text?

Negative findings in response to the first two questions underscore the need for the development of transparent, intuitive, and user-friendly punctuation rules. Such rules could pave the way for the creation of an automated predictive punctuation system that could be integrated into modern keyboards. This paper serves as an interim report on a more ambitious project with the ultimate goal of producing such a system.

Literature review

In the late 19th century, Zaynab Fawwāz, a Lebanese writer, advocated for the formal adoption of a punctuation system in Arabic. Inspired by French punctuation and the informal use of European punctuation in Arabic newspapers, she highlighted the usefulness of punctuation in a letter to al-Fatā Magazine (Fawwāz, 1905/2014, pp. 73–74). Although the original version of her letter is unavailable, its influence in promoting punctuation in Arabic writing is noteworthy.

Aḥmad Zakī, a prominent figure in the development of Arabic punctuation, embarked on the ambitious task of designing a comprehensive punctuation system. His primary goals were to enhance reading comprehension, convey writer intentions and emotions, and provide guidance for text chunking and voice modulation. To achieve these objectives, Zakī drew upon various sources, including European punctuation systems and the Arabic tradition of grammar, tajwīd ‘cantillation’, and dictionaries. His aim was to create a coherent system that aligned with the rules of al-waqf wa-al-ibtidāʼ ‘pausing and starting’ in Quran reading (Zakī, 1912/2013). The outcome of his efforts was a 40-page booklet, the first guide on modern Arabic punctuation, which he submitted to the Minister of Education, with a note stating that he had fulfilled his mandate.

The origin and evolution of punctuation in old Arabic manuscripts was the focus of a study by Al-Jawharī et al. (2012) that was a translation of Jaouhari (2009). He found that different types of texts, such as the Quran, Hadith, and chancery writing, had diverse punctuation practices. These included colored dots, dashes, circles, superscript letters, abbreviations, and varying-size blank spaces. The punctuation marks had multiple functions, such as marking section breaks, sentence terminals, pause locations, vocalization, syntactic relations, rhetorical figures of speech, and textual variants. The study showed that punctuation was used in various Classical Arabic text genres, but it did not follow a unified and coherent system.

The status of punctuation in Modern Standard Arabic (MSA) was best characterized by Khafaji (2001), who analyzed the punctuation norms in modern texts and the emerging trends in their use. He conducted an experiment where he asked ten Arabic professors to punctuate a fiction excerpt with all punctuation marks removed. He compared their punctuation with the original text and found that the comma and period were the most common marks, accounting for 85% of the total number of punctuation marks. He discovered that only 38% of the original punctuation marks were retained by the participants. Agreement on the use of the comma was slightly less than 50%, while for the period it was even lower. Only one participant matched five cases out of 29 with the original text. Khafaji defined a sentence as ‘a minimum stretch of text that is structurally independent or self-sufficient and allows a long pause in speech or a period, question mark, or exclamation mark in writing’. He concluded that ‘punctuation usage in original Arabic texts is very fluid’ and that ‘the punctuation rules in MSA references are too general and largely prescriptive… These rules need to be revised and modified.’ He recommended further research to verify his findings and explore the use of other punctuation marks.

Building upon these findings, Alkohlani (2016) emphasized the highly variable and idiosyncratic nature of punctuation marks in Arabic and cited Ditters (1991) and Stetkevych (2006) to support her claim. She highlighted that writers of Arabic often used punctuation marks for decorative purposes and drew support from an observation by Ghazala (2004, p. 230) that they use punctuation ‘poorly and haphazardly’. Alkohlani proposed that sentence boundary decisions be based on syntactic and semantic criteria, such as word order, conjunctions, and semantic coherence. According to her analysis, a sentence boundary should be identified when all the constituents involved in the dependency relation are realized, forming an independent grammatical structure tied together with a dependency relation.

Abuhamdia (2000) investigated the source of the current confusion in Arabic punctuation by examining a large collection of punctuation guides. He attributed the confusion to the lack of a standard structural definition of ‘sentence’ that caused Arabs to disagree on where to end a sentence. He gave two unpunctuated passages, one in Arabic and one in English, to two groups of Arabic and English professors, respectively. He asked them to identify the number of sentences and their boundaries in each text. All participants punctuated the English passage correctly and in a similar manner, but not the Arabic one. They did not agree on how many sentences were in the passage or where they should end. He suggested that the Arabic sentence should be defined as a single predication, like the concept of kalām in Classical Arabic grammar. He explained that a simple sentence would have a subject and a predicate with their extensions and modifications, while a complex sentence would have one main predication linked with one or more dependent predications.

There is a consensus among scholars that Arabic punctuation is in a state of flux and that sentence boundary identification is at the core of the punctuation problem.

Methodology

To answer the research questions, an investigation into how punctuation is practiced by Arabic speakers is necessary. This section outlines the rationale and methodology of the data collection and data analysis procedures.

Data collection

Examining native speaker language production is a key starting point for understanding their punctuation patterns. To address the first research question of whether Arabic speakers use sentence terminal marks in a consistent manner, a corpus of journalistic texts was compiled and named EditCorp. Editorials, written by senior editors, are persuasive pieces that address current or controversial topics with the aim of influencing public opinion, policy, or politics. They are typically short and more concise than news articles, employing a distinctive tone and language. These pieces include opinions, interpretations, evaluations, and recommendations that reflect the publication’s views and values. By addressing timely or contentious issues, they play a crucial role in shaping public discourse.

We focused on editorials since they are widely read and are perhaps norm-setting politically as well as linguistically (Alshargi et al., 2019).

It may be questioned whether editorials are representative of other Arabic text genres. To address this issue, we investigated the same punctuation practice across various fields of scholarship exploring how academics in different disciplines use punctuation marks. For this purpose, we compiled BookCorp, another corpus of randomly selected paragraphs from a large number of books spanning a variety of fields in an open-access online library.

To answer the second research question and investigate whether Arabic specialists are similar in their punctuation behavior and whether this behavior is guided by the same set of rules, we created AbsCorp, a corpus of abstracts of articles written by professors of Arabic language and literature. This allowed us to explore the use of punctuation in scholarly research written by those who teach punctuation.

If writers in Arabic were to translate English texts into Arabic, would their punctuation behavior mirror that of the English text or be similar to the punctuation behavior in editorials, books, and abstracts? To answer this, the third research question, we compiled our English-Arabic parallel corpus (En-ArPC) that is aligned at sentence-level. To conclusively demonstrate that what is punctuated like a sentence is occasionally akin to a paragraph, we compiled another translation corpus, the Arabic-English parallel corpus (Ar-EnPC).

Could it be that the problem with Arabic punctuation lies with the determination of what a sentence is? If Arabic specialists were given a controlled piece of text and were asked to focus only on sentence terminal marks, would they reach similar conclusions as to where a sentence ended? We conducted a survey where Arabic specialists were given a punctuation-free passage and asked to punctuate it with a sentence terminal mark (i.e., a full stop), a non-sentence terminal mark (i.e., a comma), or no punctuation mark. Our primary focus was the extent to which they agreed on sentence-boundary identification.

Since there are no objective criteria to judge the accuracy of punctuation in all the above endeavors, it may be informative to contrast the punctuation behavior in Arabic with that in English in comparable contexts; this will give perspective as to what rule-governed punctuation behavior is like. For this purpose, small English corpora of editorials, journal article abstracts, and sample paragraphs from English books were compiled. Furthermore, we studied the punctuation in the English component of the En-ArPC to contrast it with that in the Arabic component; thus, this corpus made it possible to contrast Arabic with English in relation to the same textual content. We also studied the effect of translation in the opposite direction on punctuation by contrasting the two components of the Ar-EnPC, where the ideas are kept constant and only the language changed.

The next subsection will describe the data collection methods used in each of the six pursuits.

Editorials

To compile the corpus of editorials (EditCorp), we ran a query on Google using the search terms:

مجلة OR صحيفة OR “افتتاحية جريدة”

iftitāḥiyya jarīda OR ṣaḥīfa OR majalla

editorial AND journal OR newspaper OR magazine

This yielded a set of webpages containing editorials from a variety of Arabic newspapers, magazines, and journals. From this set, 100 editorials were randomly selected and the second paragraph of each editorial was extracted for analysis. The second paragraph was chosen in order to avoid any special stylistics or symbols that are often found in the first paragraph. In Python, the second paragraph was processed by counting the number of words, commas, periods, semicolons, question marks, exclamation marks, sentences, and words per sentence.
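The authors' script is not reproduced in the article, so the following is only a minimal sketch of how such per-paragraph counts might be obtained. The function name `paragraph_metrics`, the grouping of Arabic and Latin forms of each mark, and the rule that a sentence ends at a run of `.`, `!`, `?`, or `؟` are all assumptions made for illustration.

```python
import re

# Marks counted per paragraph; Arabic and Latin forms are grouped together.
# (Hypothetical grouping: the article does not say how the forms were handled.)
MARKS = {
    "commas": ",\u060c",          # Latin comma and Arabic comma
    "periods": ".",
    "semicolons": ";\u061b",      # Latin and Arabic semicolon
    "question_marks": "?\u061f",  # Latin and Arabic question mark
    "exclamation_marks": "!",
}
# Assumption: a sentence ends at a run of one or more of these marks.
TERMINALS = ".!?\u061f"


def paragraph_metrics(paragraph: str) -> dict:
    """Count words, punctuation marks, sentences, and words per sentence."""
    words = paragraph.split()
    counts = {
        name: sum(paragraph.count(ch) for ch in chars)
        for name, chars in MARKS.items()
    }
    # Split on runs of terminal marks and keep non-empty segments, so that
    # "..." or "?!" closes one sentence rather than several.
    pattern = f"[{re.escape(TERMINALS)}]+"
    segments = [s for s in re.split(pattern, paragraph) if s.strip()]
    counts["words"] = len(words)
    counts["sentences"] = len(segments)
    counts["words_per_sentence"] = (
        len(words) / len(segments) if segments else 0.0
    )
    return counts
```

Calling `paragraph_metrics` on each extracted second paragraph would yield one row of counts per editorial. A paragraph with no terminal mark at all is counted here as a single sentence, which is one of several defensible choices.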

A similar procedure was used in the compilation of the English editorials, except that we specified the sources and restricted the data to newspaper editorials. We culled around 10 editorials from the latest issues of each of the Atlantic, Bloomberg, the Economist, Foreign Correspondent, Financial Times, Guardian, Independent, New York Times, Washington Post, and Wall Street Journal. The corpus consisted of the second paragraph in each of the 100 English editorials that we downloaded. The counts of punctuation marks and words in each paragraph, as well as the average number of words per sentence within the paragraph, were recorded.

Books

The corpus of book paragraphs (BookCorp) was compiled by randomly selecting the third paragraph of the third chapter from each book in the Hindawi open access online library, which included 24 different field categories and a total of 2845 books. For English books, the Gutenberg online library was used and the third paragraph of the third chapter was extracted from 68 books across 16 different fields of study. The extracted paragraphs were then processed using Python by counting the same metrics as the editorial paragraphs.

Abstracts

The corpus of abstracts (AbsCorp) consisted of 28 article abstracts in Arabic literature or Arabic linguistics that one of the authors had refereed for journal publication, as well as 90 abstracts of Linguistics and Literature articles in English that were collected from JSTOR. Because abstracts are usually only one paragraph long, the entire paragraph was processed in the same automated manner as the paragraphs in editorials and books.

Parallel corpora

To investigate whether rule-governed text in English would alert Arabic writers to sentence boundaries, the sentence-aligned En-ArPC was used. This corpus is a 5124 sentence portion of an English-Arabic parallel corpus of 19th century English literature, which is being constructed by one of our students. Since En-ArPC did not preserve the paragraph structure of texts, only the number of commas, periods, semicolons, question marks, exclamation marks, and words per English sentence and their Arabic translations were counted.

We also compiled a small ad hoc corpus of Arabic-English literary translations, which we refer to as Ar-EnPC, in order to establish that the same textual content is punctuated differently in English and Arabic. This corpus contains 3,207 sentences collected from seven different novels, each translated by a different translator.

Punctuation survey

The source of problems with Arabic punctuation may be identified by examining whether difficulty in determining sentence boundaries could be a contributing factor. Specifically, we investigated whether Arabic experts are guided by a consistent set of rules when identifying sentence boundaries. To address this question, we designed an experiment that assessed the level of agreement among Arabic experts on sentence boundaries. The experiment involved presenting a controlled text without punctuation to 100 Arabic experts to punctuate it using a pull-down menu that offered the options of a full stop, comma, or no punctuation at phrase boundaries. By focusing solely on sentence terminal marks, our goal was to determine the degree of agreement among experts in punctuating the text, and to gain insight into the challenge of identifying sentence boundaries.

The passage selected for the survey was a newspaper editorial. To focus our investigation, we concentrated only on the period and comma marks, as they were the most frequent, the most abused, and the least rule-bound. In a forthcoming study on automatic punctuation, we counted the punctuation marks in a corpus of 332 million words and found that the comma and period were the most frequent marks. Therefore, our survey instrument focused only on these two marks.

The population surveyed consisted of specialists in Arabic and in Islamic Studies, including professors and PhD students. One hundred respondents participated in the experiment, but seven surveys were discarded because they were incomplete. This left 93 valid questionnaires, 70% of which came from Arabic specialists. Arabic PhD holders and PhD candidates constituted 48% of participants.

The survey instrument was a self-contained 291-word editorial from a popular independent daily Arabic newspaper on the topic of how political parties sought to appeal to a disgruntled populace against the ideals of the European Union. To prepare it for the survey, all punctuation marks were removed, and the text was segmented syntactically into chunks, each of which constituted a phrase or clause that could have called for a punctuation mark. The resulting 53 chunks were then programmed into Google Forms, which presented them side by side as they would have appeared in the original newspaper editorial, with a small blank at the end of each chunk. Participants were instructed to select one of three options from a pull-down menu in each blank: ‘nil’ for no punctuation marks, ‘com’ for a comma, and ‘per’ for a period (full stop).

Before conducting the survey and to refine its instrument, a pilot study was carried out with 100 B.A. students, enrolled in a university compulsory course entitled Art of Writing and Expression. After making necessary adjustments, we reviewed the validity of the survey instrument.

To determine content validity, we identified three professors with expert knowledge in Arabic naħū ‘syntax’ and asked them to review the survey instrument and rate the appropriateness of the phrase breaks in the passage. Using a 3-point scale, the referees independently rated each blank as essential, desirable but not essential, or not needed. The content validity index (CVI) was calculated based on the proportion of items rated ‘essential’ or ‘desirable’ by all three referees. Our CVI was 0.88, which indicates good content validity.
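The calculation described above can be sketched as follows, assuming each item (blank) carries one rating per referee. The function name and the sample ratings are hypothetical and are not the study's data.

```python
def content_validity_index(ratings, acceptable=("essential", "desirable")):
    """Proportion of items that every referee rated as acceptable.

    ratings: one sequence of referee ratings per item (blank).
    """
    if not ratings:
        raise ValueError("no items to rate")
    agreed = sum(all(r in acceptable for r in item) for item in ratings)
    return agreed / len(ratings)


# Hypothetical ratings for four blanks by three referees.
sample = [
    ("essential", "essential", "desirable"),
    ("essential", "desirable", "desirable"),
    ("not needed", "essential", "essential"),  # one referee objects
    ("essential", "essential", "essential"),
]
cvi = content_validity_index(sample)  # 3 of 4 items pass
```

Under this reading, an item counts toward the index only when none of the referees rated it ‘not needed’.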

In order to establish the construct validity of the survey instrument, we conducted a thorough examination of the instrument’s ability to accurately capture the rules of punctuation that are applicable to sentence terminals. To this end, we sought the expertise of a panel of three foreign language professors, who were asked to evaluate its face validity, specifically with regard to its ability to effectively measure participants’ understanding of punctuation rules and ensure clarity of sentence meaning. Based on their feedback, we refined the survey instrument by eliminating any ambiguities and enhancing its presentation. We then asked the professors to independently punctuate the provided passage by selecting one option from a pull-down menu in each blank at the end of a phrase chunk, where ‘nil’ represented no punctuation mark, ‘com’ a comma, and ‘per’ a period (full stop).

The inter-rater agreement among the three professors evaluating 53 text chunks was measured using Fleiss’ Kappa. This statistic was chosen for the assessment of agreement between the three jurors and subsequently between the punctuation survey respondents. Our choice of Fleiss’ Kappa was based on three reasons: (a) the number of raters (three jurors in one case and 93 respondents in the other), (b) the categorical nature of the analyzed data (punctuation marks), and (c) Fleiss’ Kappa’s robustness against chance agreement. We found it more suitable than Cohen’s Kappa, which is limited to two raters, and less demanding in terms of calculation than Krippendorff’s Alpha, which is more complex and less familiar.

The analysis revealed a high level of agreement among the jurors, with a kappa value of 0.789 (95% CI, .633-.944), and a statistically significant p-value of less than .0005. These results suggest that our punctuation passage is a reliable tool for evaluating the proficiency of Arabic experts’ punctuation practice.
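For readers unfamiliar with the statistic, the sketch below shows the standard Fleiss’ Kappa computation on a table of rating counts. It is an illustration, not the software the authors used; the helper `to_count_table` and the label order are assumptions that mirror the survey's ‘nil’/‘com’/‘per’ options.

```python
LABELS = ("nil", "com", "per")  # survey options; the order is arbitrary


def to_count_table(responses):
    """responses[i] is the list of labels that raters chose for chunk i."""
    return [[chunk.count(label) for label in LABELS] for chunk in responses]


def fleiss_kappa(table):
    """Fleiss' kappa for an items x categories table of rating counts.

    table[i][j] is the number of raters who assigned item i to category j;
    every row must sum to the same number of raters.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Observed agreement: mean share of agreeing rater pairs per item.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    n_cats = len(table[0])
    p_exp = sum(
        (sum(row[j] for row in table) / total) ** 2 for j in range(n_cats)
    )
    # Undefined (division by zero) if every rating falls in one category.
    return (p_obs - p_exp) / (1 - p_exp)
```

With three jurors and 53 chunks, `responses` would hold 53 lists of three labels each; the same function applies unchanged to the 93 survey respondents.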

Data analysis

Once the six data collection instruments were developed and the data was tabulated, a statistical analysis was performed, and all relevant descriptive statistics were extracted. The study specifically focused on the length of paragraphs, number of sentences per paragraph, and average length of sentences. As commas are often used at the writer’s discretion and other punctuation marks such as semicolons, colons, exclamations, and question marks are relatively infrequent, the analysis focused only on the period. Other sentence terminal marks, including ‘;’, ‘!’, and ‘?’, were recoded as periods (full stops) to simplify the statistical analysis. The comma and period were chosen as the focus of the analysis because they are the most frequently used punctuation marks, the most misused, and the most relevant for automatic language processing. The other marks, on the other hand, are rare, less controversial, or less prone to misuse. Analysis of the questionnaires also focused on the comma and period for the same reasons.

Findings

Below is a discussion of the results of the investigation presented under the same headings as those in the data collection subsections.

Editorials

To answer the first research question as to whether Arabic writers used sentence terminal marks in a consistent manner, we analyzed the corpus of editorials. In this corpus, the average length of a sentence is 44.53 words and of a paragraph is about 64.83 words, with each paragraph consisting of 1.69 sentences on average.

The findings of the study reveal a statistically significant moderate positive correlation between paragraph length and sentence length. The correlation coefficient of 0.38, with a p-value of less than 0.001 and 99 degrees of freedom, suggests that the observed relationship is not likely to be due to chance. Specifically, as the length of the paragraph increases, so does the length of sentence. The coefficient of determination (R-squared) reveals that 14.44% of the variation in sentence length can be explained by the variation in paragraph length.
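As a minimal sketch of the computation behind these figures, the function below derives Pearson's r from two equal-length lists (for example, paragraph lengths and mean sentence lengths). The function name and the sample lists are hypothetical. For a simple two-variable correlation, the coefficient of determination is the square of r, which is how 0.38 corresponds to 14.44%.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length (>= 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paragraph lengths and mean sentence lengths (in words).
paragraph_lengths = [40, 55, 62, 80, 120]
sentence_lengths = [20, 55, 31, 40, 60]
r = pearson_r(paragraph_lengths, sentence_lengths)
r_squared = r ** 2  # share of variance explained

# The reported values are related in the same way: 0.38 squared is 0.1444.
reported_r_squared = round(0.38 ** 2, 4)
```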

As for commas, the average number per paragraph is 3.58, which may suggest that the writers occasionally use commas in lieu of the period when they want to mark the end of a sentence.

For perspective purposes, here are the same metrics for English editorials contrasted with Arabic editorials:

Books

To rule out that punctuation behavior in editorials is unique to media Arabic, and to confirm the answer to the first research question as to whether Arabic speakers use sentence terminal marks consistently, we analyzed the corpus of books. We considered the average sentence size in our book corpus to see whether the careful rigor of book authorship would have any impact on sentence and paragraph sizes in English and Arabic, and whether discipline would influence their length. A General Linear Model-Univariate (GLM-Univariate) was conducted since the values in the dependent variable are continuous and the independent variables are the categories of language and discipline.

The results indicate that there is a significant effect of language (F(1, 2059) = 92.948, p < .001) and a marginally significant effect of discipline (F(2, 2059) = 2.900, p = .055) on paragraph size in the data; English paragraphs in books tended to be longer than those in Arabic books. The mean number of words per paragraph in English books was found to be 174.26 (SD = 122.36), while in Arabic books it was 74.15 (SD = 61.73). Additionally, Science books had the highest mean number of words per paragraph, followed by Social Science books and Literature books. These findings suggest that language and book category may play a role in determining the length of paragraphs in books. However, the interaction between language and discipline is not significant (F(2, 2059) = 2.190, p = .112), indicating that the relationship between language and paragraph size did not differ significantly across the three categories of disciplines. The model explains 10.5% of the variance in the dependent variable.

The number of sentences per paragraph confirms this result; there is a significant effect of language on the mean number of sentences per paragraph (F(1, 2054) = 182.50, p < .001), with Arabic books having a mean of 2.22 sentences (SD = 2.12) and English books a mean of 6.71 (SD = 4.92).

In terms of sentence length, the results indicate that there is a statistically significant difference in the mean number of words per sentence between books written in Arabic and those written in English (F(1, 1978) = 6.569, p = .010), with a mean Arabic sentence length of 40.69 words (SD = 33.695) and a mean English sentence length of 29.42 words (SD = 13.782).

In conclusion, these results suggest that paragraphs in Arabic books tended to be less than half the size of English paragraphs and consisted of around two sentences on average, whereas their English counterparts had an average of more than six sentences. Arabic sentences, in turn, tended to be 1.4 times the length of English sentences.

Abstracts

To answer the second research question regarding whether Arabic specialists punctuate in accordance with the same set of rules, abstracts were analyzed. All the Arabic abstracts in AbsCorp had been submitted to an academic journal for consideration of publication; the English ones, on the other hand, had already been published in a journal indexed by JSTOR. In terms of abstract size, the English abstracts were found to be longer on average than the Arabic ones (203 vs. 93 words long, respectively), and they had more sentences (8.19 vs. 2.07 sentences). In terms of sentence length, the average sentence in the English abstracts is 27.02 words/sentence, while it is 56.18 words per sentence in the Arabic abstracts. This is understandable since Arabic abstracts do not conform to the tradition that they contextualize the topic, state its goals, methods, major results, and the implications of the findings.

Parallel corpora

In order to rule out the possibility that Arabic might have developed its own unique punctuation system, we will check whether or not translating English into Arabic would nudge the translator into recognizing sentence boundaries; if it does and Arabic sentences mirror English sentences in punctuation, then that will be evidence to prove that there is no new Arabic punctuation system per se. Had it been a new system, translators into Arabic would exhibit behavior similar to that of editorial, book, and abstract writers. We will also consider Arabic texts translated into English in order to verify that stretches of Arabic ending with a full stop occasionally constitute a whole paragraph; they could not be lengthy sentences that are indicative of a new native Arabic punctuation system but rather symptomatic of the confusion in Arabic punctuation. We will analyze below both the English-Arabic and Arabic-English parallel corpora (i.e., En-ArPC and Ar-EnPC) to demonstrate these facts.

It seems that Arabic sentence length changes radically when English-Arabic translated language is considered. Analysis of the English and Arabic components of the En-ArPC (5,124 sentences in each language) showed that English sentences had a significantly higher mean number of words than their Arabic counterparts, a trend that contradicts the norm observed in editorials, books, and abstracts. Although the ideas being communicated in the two languages were identical, the mean number of words per English sentence was 20.38 (SD = 10.589), whereas the mean number of words per Arabic sentence was 16.68 (SD = 8.631). To ensure that English sentences were contrasted directly with their Arabic translations and that the differences between them were significant, a paired-samples t-test was conducted. The results showed that the difference in sentence size between the two languages is significant (t(5123) = 44.638, p < .001). The effect size estimate (Cohen’s d = .624) suggests a medium-to-large difference between sentence size in the two languages by Cohen’s conventional benchmarks. These findings are evidence that sentence length, and hence punctuation behavior, is language-specific even when the communicated ideas are meant to be equivalent.
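
As a rough cross-check, the effect size for a paired-samples design can be recovered from the t statistic as d = t / √n, where n is the number of pairs. The sketch below assumes that this is the variant of Cohen’s d that was reported (the text does not say which variant was used); the function name is hypothetical.

```python
import math


def paired_cohens_d(t_statistic, n_pairs):
    """Cohen's d for paired samples, recovered as t / sqrt(n)."""
    return t_statistic / math.sqrt(n_pairs)


# Values reported above: t(5123) = 44.638 with 5,124 sentence pairs.
d = paired_cohens_d(44.638, 5124)
print(f"d = {d:.3f}")
```

Under that assumption, the result agrees with the reported value of .624 to three decimal places.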

Notwithstanding the differences in sentence length due to language, the results suggest a very strong tendency for Arabic target language sentence size to mirror that of the English source language sentences. In fact, if we correlate source language with target language sentence sizes, we find a strong positive relationship between them, r(5123) = .826, p < .001, as demonstrated in Figure 1 below. This indicates that in the context of translation from English into Arabic, translators tended to mimic sentence length in the English source texts. In fact, the use of sentence terminal punctuation marks in English and Arabic shows nearly a one-to-one correspondence. On the other hand, the use of commas in Arabic is more discretionary than in English.

Figure 1. Arabic vs. English sentence sizes in En-ArPC.

To determine whether sentence length differed across the four text genres (i.e., Editorials, Books, Abstracts, and Parallel English-Arabic Translations), the words-per-sentence variable was analyzed using Welch’s ANOVA, since it does not assume homogeneity of variance in the populations from which the samples were drawn. There was a statistically significant difference in the mean number of words per sentence across the four text genres (F(3, 7165.45) = 800.22, p < .001, η² = .251). Post-hoc pairwise comparisons using the Games-Howell test revealed that all pairwise differences between the four text genres were significant (p < .001). Thus, we can conclude that sentence length differs significantly between the four text genres.
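
For readers unfamiliar with the test, the sketch below computes Welch’s F statistic and its degrees of freedom from scratch. It is only an illustration: the three small groups are made-up numbers rather than data from the corpora, and the function name is hypothetical.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    Each group is a list of observations; group variances may differ.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n / s^2 so that noisier groups count for less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_weight = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    between = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum(
        (1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes)
    )

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Made-up words-per-sentence samples for three hypothetical genres.
genre_a = [44, 51, 38, 60, 47]
genre_b = [40, 35, 42, 39, 45]
genre_c = [16, 18, 15, 20, 17]
f_stat, df1, df2 = welch_anova([genre_a, genre_b, genre_c])
print(f"F({df1}, {df2:.2f}) = {f_stat:.2f}")
```

The second degree of freedom is generally fractional, as in the F(3, 7165.45) reported above, because it is estimated from the group variances rather than fixed by the sample sizes.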

Punctuation survey

To answer the second research question, which asks whether specialists punctuated the same text in a similar manner and whether their punctuation appeared to be governed by the same set of rules, the punctuation survey was conducted. Respondents with a PhD or working toward a PhD in Arabic or Islamic studies filled out the survey independently by choosing one option for each phrase break: ‘nil’ for no punctuation mark, ‘com’ for comma, or ‘per’ for period. Since we did not assume any standards for how the survey passage was to be punctuated, we used Fleiss’ kappa to measure the degree of agreement between respondents on the punctuation of the 53 phrase chunks in the survey passage. We adopted an agnostic attitude to where sentence terminals ought to be. This is a problem akin to inter-rater reliability, where the consistency in raters’ scores is evaluated. We want to find out how consistently the respondents punctuated, that is, whether there is agreement among them beyond what would be expected by chance alone. Fleiss’ multirater kappa would indicate whether or not the respondents were following the same set of punctuation rules. The results are in Table 1.

Table 1. Fleiss’ multirater kappa results.

The Fleiss’ kappa table of results shows an overall agreement of .092, which indicates a very low level of agreement among the 93 survey respondents. The standard error is very low, implying that the Kappa value is a relatively precise estimate. The z-value, which is highly significant, indicates that any agreement between the respondents is not the result of chance but rather due to their common understanding of the text that they punctuated. Furthermore, we are 95% confident that there was only slight agreement among the raters on where sentence boundaries lie. This indicates that consensus on punctuation amongst Arabic experts is weak and that they either had different interpretations of punctuation rules or were operating by different sets of such rules.
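
To make the statistic concrete, the sketch below computes Fleiss’ kappa from a table of category counts. It is a minimal illustration under stated assumptions: the counts are made up for four phrase breaks and ten hypothetical respondents (not the survey data), and the function name is ours.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a table of category counts.

    counts[i][j] is the number of raters who assigned item i to category j;
    every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: share of rater pairs that agree, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Made-up counts; columns follow the survey options: nil, com, per.
toy_counts = [
    [6, 3, 1],
    [2, 5, 3],
    [1, 4, 5],
    [4, 4, 2],
]
kappa = fleiss_kappa(toy_counts)
print(f"kappa = {kappa:.3f}")
```

A value of 1 means that all raters chose the same option at every break, while values near 0 mean that agreement is no better than would be expected by chance.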

Discussion

The problem of punctuation in Arabic lies in the difficulty of identifying sentence boundaries. Consensus on what defines a sentence is lacking, not only in editorials, abstracts, and books from various disciplines but also in our controlled punctuation survey. This lack of clarity in sentence defining rules leads to confusion between sentences and paragraphs.

The results presented in this paper reveal that in a non-specialized writing context, such as editorials, Arabic sentences are lengthy, resulting in an average of only 1.69 sentences per paragraph. This phenomenon is not limited to editorials but is also observed in published books across different fields of specialization. Despite the rigorous nature of book authorship, sentences in books are still lengthy, with an average of 40 words per sentence compared to 44 words per sentence in editorials. Furthermore, the paragraph still consists of no more than two sentences (2.22 to be precise).

It is unclear whether the prevalence of lengthy sentences in Arabic writing is due to insufficient command of the language or inadequate attention to punctuation that is peculiar to editorial writers and book authors. However, even experts in Arabic language and literature write longer sentences, and their paragraphs have an average of two sentences, providing unequivocal evidence that Arabic sentences are inherently lengthy.

In contrast, the corpus of Arabic texts translated from English demonstrates a reversal of this trend. Translators appear to mirror English punctuation, resulting in sentences that are one-third the size of sentences in editorials and books, and only 30% of the size of sentences in abstracts written by Arabic language and literature specialists. It seems that the translation context makes writers more conscious of how to parcel out their ideas into sentences, which reduces the size of sentences to an average of 16.68 words.

The use of the full stop is also at the core of the Arabic punctuation problem. Some Arab writers may reserve the full stop for marking paragraph boundaries rather than sentence boundaries. Even highly acclaimed litterateurs, such as Ahmad Amin, consistently use the full stop as a paragraph marker and the comma as both a sentence terminal and an intra-sentential pause marker. For instance, consider this paragraph-sentence from Faiḍ al-Khāṭir, in which five sentences are recognized in English when the passage is given to the Bard large language model to translate.

كُلُّ شَيْءٍ فِي الْعَالَمِ يَتَقَدَّمُ وَيَتَغَيَّرُ حَسبَ تَطَوُّرِ الْأُمَمِ وَنُظُمِها الِاجْتِمَاعِيَّةِ وَحَاجَاتِها وَأَغْرَاضِها فِي الْحَيَاة،* فَكَما تَغَيَّرَتْ مَصَانِعُ النَّسِيجِ مِنْ مَغَازِلَ يَدْوِيَّةٍ إِلَى مَصَانِعَ مِيْكِانِيكِيَّةٍ تَبَعاً لِتَقَدُّمِ الْأُمَّةِ فِي الصِّنَاعَة، كَذَلِكَ يَجِبُ أَنْ تَتَغَيَّرَ مَصَانِعُ الْأَجْسَامِ وَالْعُقُولِ وَالْأَخْلَاقِ تَبَعًا لِتَقَدُّمِ الزَّمَنِ وَحَاجَاتِ الْأُمَم،* وَكَذَلِكَ كَانَ،* فَالْمَدْرَسَةُ الْقَدِيمَةُ تَطَوَّرَتْ تَطَوُّرَاتٍ مُخْتَلِفَة، وَخَدَمَتْ أَغْرَاضاً مُتَنَوِّعَةً حَسبَ آمالِ الْأُمَّةِ وَظُرُوفِها،* فَالْأُمَّةُ يَجِبُ أَنْ تُحَدِّدَ أَغْرَاضَها الَّتِي تَرْمِي إِلَيْها، ثُمَّ تَصُوغَ مَدَارِسَها عَلَى وَفْقِها. (Amīn, 1947)

kullu shayʾin fī al’ālami yataqaddamu wayataghayyaru ḥasba taṭawwuri alʾumami wanuẓumihā al-ijtimā’iyyati waḥājātihā waʾaghrāḍihā fī alḥayāh,* fakamā taghayyarat maṣāni’u al-nasīji min maghāzila yadwiyyatin ʾilā maṣāni’a mīkānīkiyatin taba’an litaqaddumi alʾummati fī al-ṣinā’ah, kadhālika yajibu ʾan tataghayyara maṣāni’u al-ʾajsāmi wa-l-’uqūli wa-l-ʾakhlāqi taba’an litaqaddumi al-zamani waḥājāti al-ʾumam,* wakadhālika kān,* fa-l-madrasatu al-qadīmatu taṭawwarat taṭawwurātin mukhtalifa, wakhadamat ʾaghrāḍan mutanawwi’atan ḥasba āmāli al-ʾummati waẓurūfihā,* fa-l-ʾummatu yajibu ʾan tuḥaddida ʾaghrāḍahā allatī tarmī ʾilayhā thumma taṣūgha madārisahā ‘alā wafqihā.

Fueled by the advancement of nations, their social structures, needs, and aspirations in life, everything in this world is evolving and changing. Just as textile factories have changed from manual looms to mechanical factories in line with the nation’s progress in industry, so too must the factories of bodies, minds, and morals change in line with the progress of time and the needs of nations. And so it has been: the ancient school has undergone various developments and served a variety of purposes according to the nation’s hopes and circumstances. Therefore, the nation must define its goals, then shape its schools accordingly (Google, 2023).

Paragraph size, as demonstrated in this typical example, violates the upper and lower limits prescribed by publishers’ traditional style guides (Crystal, 2015). Novelists seem to epitomize the problem in their occasional resort to extra-long paragraphs. The narrator, when describing a single moment that requires reflection and contemplation, may use extra-long paragraphs. For example, the Nobel Prize laureate Naguib Mahfouz occasionally writes paragraphs that run to several pages. In Cairo Modern (Maḥfūẓ, 1962), Chapter 3 consists of two paragraphs: the first is 997 words long, while the second is only 40 words long. Although this might be an extreme case, we have found extra-long paragraphs in novels by other authors, such as the Sudanese al-Ṭayib Ṣāliḥ, the Saudi ‘Abd al-Raḥmān Munīf, and the Egyptians Yūsuf Idrīs and ʿAbbās Maḥmūd al-ʿAqqād.

Anecdotal evidence, together with the results of this study, suggests that extra-long paragraphs and sentences are commonplace in Arabic writing, to the point where they no longer raise any eyebrows. This poses the question of whether Arabic is developing its own punctuation system, distinct from the European-influenced system. However, we argue that the plausible conclusion is that Arabic punctuation is in a state of flux, and that the system has not yet settled. This is supported by the results reported from the parallel corpus analysis, but the jury is out. Other researchers may investigate this experimentally and make an empirical statement that either corroborates or refutes our conclusions.

Further research has led us to believe that Arabic punctuation should be grammar-based with emphasis on the roles of theme and rheme in structuring sentences. While we do not advocate the use of grammar metalanguage in the formulation of punctuation rules, we advocate training learners on introspection and the use of mental grammar. We propose two rules:

  1. Sentences should end upon the completion of the ‘theme and rheme’, ‘topic and comment’, ‘musnad and musnad ilayh’, however one may want to refer to the core elements in a sentence.

  2. Conjunctions like ‘wa’ should be disregarded when determining sentence boundaries.

A sentence usually consists of a theme, topic, musnad ilayh (the entity of interest) and a rheme, comment, musnad (the new information that is being attributed to that entity). Sentences can be simple, like ‘Sarah is good’, or complex, such as ‘The blonde Sarah you met downtown is a skilled architect and civil engineer who can design and build the house of your dreams’.

Terminated by a full stop, question mark, or exclamation point, the ‘sentence’ is a group of words that is informative independently of what precedes or follows it. It is autonomous and self-sufficient; it can be uttered in isolation without losing its power of predication, assertion, existentiality, imperativeness, interrogativeness, or exclamation. Commas and other intra-sentential punctuation should be left to writer discretion.

Conclusion

Based on the findings of this investigation, it is clear that there are distinct patterns in the punctuation behavior and sentence structure of Arabic writers. However, these patterns are not necessarily consistent across all types of writing or even within the same language.

Senior journalists, book authors, and Arabic specialists all exhibited the tendency to write in longer sentences and to limit their paragraphs to no more than two sentences. However, the length of sentences and paragraphs varied depending on the type of text being written, with science books and abstracts tending to have longer paragraphs and sentences.

The study also revealed significant differences in sentence length between Arabic and English, even when the ideas being communicated were meant to be equivalent. This suggests that sentence length is language-specific and that writers should be aware of the differences when translating between languages.

Finally, the results of the punctuation survey demonstrate that even well-versed writers of Arabic cannot agree on where to place a sentence terminal. This highlights the need for clearer punctuation rules, more systematic training in schools, and greater consistency in their application.

We suggest that punctuation rules should be based on grammar rather than semantics. One proposed rule is that sentences should be terminated upon the completion of the topic and comment, musnad and musnad ilayh, regardless of conjunctions and discourse markers (Fareh et al., 2020).

This investigation has shed light on important aspects of Arabic writing that can help writers improve their communication and produce better texts for machine learning systems and machine translation tasks. These findings suggest that Arabic punctuation is in a state of flux, and there is much work to be done to develop a more consistent system. Therefore, punctuation should not be treated as a sole indicator of sentence boundaries in automatic text comprehension, disambiguation, and machine translation tasks due to the ambiguity of the full stop. Future research can build upon our findings and explore new avenues to enhance Arabic writing.

Acknowledgments

We would like to thank the University of Sharjah for their generous grant 2003020119 that made the Punctuation Project possible. We would also like to extend our heartfelt gratitude to Prof. MHM Asfour for his inspiration and invaluable guidance since 2003.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Sane Yagi

Sane Yagi received his education in Jordan, the USA, and New Zealand. He is currently a Professor of Linguistics at the University of Sharjah and the University of Jordan. The primary themes of his work are corpus development, computational lexicography and lexicology, computational morphology, syntactic parsing, automatic punctuation, and machine learning. His research interests also include computational linguistics, CMC, CALL, and TEFL.

Shehdeh Fareh

Shehdeh Fareh is a Professor of Linguistics from the University of Kansas in 1988. His research interests include contrastive linguistics, discourse analysis, translation, and TEFL. He has authored a series of books for teaching English as a foreign language and a textbook for teaching English to students of medicine and health sciences, published more than 40 articles in prestigious journals, and translated more than 20 books from English into Arabic and vice versa.

Ashraf Elnagar

Ashraf Elnagar is a Professor of Artificial Intelligence at the Department of Computer Science, University of Sharjah, UAE. During his service at the University of Sharjah, he served as the founding chair of the Dept. of Computer Science, Chair of the MIS Department, and Dean of the Community College. He won a number of teaching, research and community and professional service awards. He is the recipient of the 1999 Shoman’s Best Young Researcher Award in the Arab World in the fields of Mathematics, Computer Science and Statistics. His research interests include artificial intelligence, natural language processing, robotics, pattern analysis and recognition, and IT education.

Mariam Balajeed

Mariam Balajeed is the Head of the Department of Arabic Language and Literature at the University of Sharjah. She specializes in Arabic syntax and has published a number of articles and books in her field. She has received a number of academic awards from different institutions.

Abdalla El-mneizel

Abdalla El-mneizel is a professor at the Department of Education and Psychology. His research interests include intelligence, educational and psychological measurement, attribution, test development, the elderly, mental health and adaptation, applied statistics, and early childhood education. He has published articles in the areas of learning and motivation, child development, special education, university life, classroom climate, and curriculum and instruction.

Mohammad Al-Badawi

Mohammad Al-Badawi is a dedicated researcher in linguistics at Zarqa University in Jordan, whose research expertise covers several subdisciplines of linguistics, with a special focus on Stylistics, Discourse Analysis, Pragmatics, and Sociolinguistics. He has been widening his research sphere to include Arabic structure, translation studies, and foreign language teaching. Dr. Al-Badawi has a wealth of experience in both academic instruction and administrative leadership. His career spans over a decade, during which he has produced several publications in the field of linguistics. He is known for his expertise in syllabus development, innovative teaching methods, and effective communication. His commitment to academic excellence is reflected in the roles he has assumed during his career at the English Department.

References

  • ‘Abdallāh, N. (2001). al-Marji’ fī al-naḥw al-’arabī. (1 ed.). Dār al-wisām li-al-ṭibā’ati wa-al-nashr.
  • ‘Aṣfūr, M., et al. (1999). Mahārāt al-ittiṣāl bi-al-lughah al-’arabiyyah.
  • ʿAqqād, A. M. (1960). Sāra (Ṭabʿa 3 ed.). Dār al-Maʿārif.
  • Abuhamdia, Z. A. (2000). Taqʻīd al-ishārah ilá nihāyat al-jumlah al-ʻArabiyah. Majallat Majmaʻ al-Lughah al-ʻArabiyah al-Urdunī. (Journal of Jordan Academy of Arabic), 58, 1–14.
  • Al-Jawharī, M., Maḥmūd, S., & ʻAdālssamiʻ, M. (2012). ʻAlāmāt al-tarqīm fī al-makhṭūṭāt al-ʻArabīyah: Malḥūẓāt wa-wathāʼiq (Punctuation marks in Arabic manuscripts: Comments and documents). Majallat Maʻhad al-Makhṭūṭāt al-ʻArabīyah. (Journal of the Institute for Arabic Manuscripts), 56(2), 281–340.
  • Alkhatib, M., Monem, A. A., & Shaalan, K. (2020). Deep learning for Arabic error detection and correction. ACM Transactions on Asian and Low-Resource Language Information Processing, 19(5), 1–13. https://doi.org/10.1145/3373266
  • Alkohlani, F. A. (2016, January–March). The Arabic sentence: Towards a clear view. Annals of the Faculty of Arts, Ain Shams University, 44, 559–576.
  • Alshargi, F., Dibas, S., Alkhereyf, S., Faraj, R., Abdulkareem, B., Yagi, S., Kacha, O., Habash, N., & Rambow, O. (2019). Morphologically annotated corpora for seven Arabic dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan. In ACL 2019 - 4th Arabic Natural Language Processing Workshop, WANLP 2019 - Proceedings of the Workshop.
  • Amīn, A. (1947). Fayḍ al-khāṭir: wa-huwa majmūʻ maqālāt adabīyah wa-ijtimāʻīyah. Maktabat al-Nahḍah al-Miṣrīyah.
  • Awad, D. (1970). The evolution of Arabic writing due to European influence: The case of punctuation. Journal of Arabic and Islamic Studies, 15, 117–136. https://doi.org/10.5617/jais.4650
  • Chafe, W. (1988). Punctuation and the prosody of written language. Written Communication, 5(4), 395–426. https://doi.org/10.1177/0741088388005004001
  • Crystal, D. (2015). Making a point: The persnickety story of English punctuation (1st U.S. ed.). St. Martin’s Press.
  • Ditters, E. (1991). A modern standard Arabic sentence grammar. Bulletin d’études orientales, 43, 197–236. http://www.jstor.org/stable/41608975
  • Fareh, S., Jarad, N., & Yagi, S. (2020). How well can Arab EFL learners adequately use discourse markers? International Journal of Arabic-English Studies, 20(2), 85–98. https://doi.org/10.33806/ijaes2000.20.2.4
  • Fawwāz, Z. (1905/2014). Al-rasāʼil al-zaynibbayh. Hindāwī. https://www.hindawi.org/books/86864915/
  • Ghazala, H. S. (2004). Stylistic-semantic and grammatical functions of punctuation in English-Arabic translation. Babel. Revue internationale de la traduction / International Journal of Translation, 50(3), 230–245. https://doi.org/10.1075/babel.50.3.03gha
  • Google. (2023). Bard. Retrieved December 15, 2023, from https://bard.google.com/
  • Holes, C. (2004). Modern Arabic: Structures, functions, and varieties (Rev. ed.). Georgetown University Press.
  • Jaouhari, M. (2009). Notes et documents sur la ponctuation dans les manuscrits arabes. Arabica, 56(4–5), 315–359. https://doi.org/10.1163/057053909X12475581297443
  • Jones, B. E. M. (1994). Exploring the role of punctuation in parsing natural text [Paper presentation]. 15th International Conference on Computational Linguistics, Kyoto, Japan. https://doi.org/10.3115/991886.991960
  • Kaufman, L., & Straus, J. (2021). The blue book of grammar and punctuation: An easy-to-use guide with clear rules, real-world examples, and reproducible quizzes (12th ed.). Jossey-Bass.
  • Keskes, I., Zitoune, F. B., & Belguith, L. H. (2014). Splitting Arabic texts into elementary discourse units. ACM Transactions on Asian Language Information Processing, 13(2), 1–23. https://doi.org/10.1145/2601401
  • Khafaji, R. (2001). Punctuation marks in original Arabic texts. Zeitschrift für Arabische Linguistik, (40), 7–24. http://www.jstor.org/stable/43525749
  • Maḥfūẓ, N. (1962). al-Qāhirah al-jadīdah (Al-ṭabʿah 4 ed.). Maktabat Miṣr.
  • Păiş, V., & Tufiş, D. (2022). Capitalization and punctuation restoration: A survey. Artificial Intelligence Review, 55(3), 1681–1722. https://doi.org/10.1007/s10462-021-10051-x
  • Sawalha, M., Alshargi, F., Alshdaifat, A., Yagi, S., & Qudah, M. A. (2019). Construction and annotation of the Jordan comprehensive contemporary Arabic corpus (JCCA). In ACL 2019 - 4th Arabic Natural Language Processing Workshop, WANLP 2019 - Proceedings of the Workshop.
  • Stetkevych, J. (2006). The modern Arabic literary language: lexical and stylistic developments. Georgetown University Press. http://lib.ugent.be/catalog/rug01:001351694
  • Ūgān, O. (1999). Dalāʼil al-imlāʼ wa-asrār al-tarqīm: [kitāb fī uṣūl al-tarqīm wa-al-naḥw]. Afrīqiyā al-Sharq.
  • Varavs, A., & Salimbajevs, A. (2018). Restoring punctuation and capitalization using transformer models. In 6th International Conference on Statistical Language and Speech Processing, Mons, Belgium.
  • Williams, M. P. (1989). A comparison of the textual structures of Arabic and English written texts: A study in the comparative orality of Arabic (volumes i and ii) (Publication Number D-87016) [PhD diss., University of Leeds]. ProQuest Dissertations & Theses Global. England.
  • Yagi, S. M., & Ali, M. Y. (2008). Arabic conjunction WA: A conflict in pragmatic principles. Poznan Studies in Contemporary Linguistics, 44(4), 617–627. https://doi.org/10.2478/v10010-008-0029-4
  • Yagi, S. M., Mansour, Y., Kamalov, F., & Elnagar, A. (2021). Evaluation of Arabic-based contextualized word embedding models [Paper presentation]. 2021 International Conference on Asian Language Processing, IALP 2021.
  • Zakī, A. (1912/2013). al-Tarqīm wa ʻalāmātuh fī al-lughah al-ʻArabīyah. Hindāwī. https://www.hindawi.org/books/82047270/