
Evaluating LLMs’ grammatical error correction performance in learner Chinese

Abstract

Large language models (LLMs) have recently exhibited significant capabilities in various English NLP tasks. However, their performance in Chinese grammatical error correction (CGEC) remains underexplored. This study evaluates the abilities of state-of-the-art LLMs in correcting learner Chinese errors from a corpus linguistic perspective. The performance of LLMs is assessed using the standard MaxMatch evaluation metric. Keyword and key n-gram analyses are conducted to quantitatively explore linguistic features that differentiate LLM outputs from those of human annotators. LLMs’ performance in the syntactic and semantic dimensions is further qualitatively analyzed based on these keyword and key n-gram probes. Results show that LLMs achieve relatively higher performance on test datasets with multiple annotators and lower performance on those with a single annotator. Even under an explicit “minimal edit” prompt, LLMs tend to overcorrect erroneous sentences, using more linguistic devices to generate fluent and grammatical output. Furthermore, they struggle with under-correction and hallucination in reasoning-dependent situations. These findings highlight the strengths and limitations of LLMs in CGEC, suggesting that future efforts should focus on curbing overcorrection tendencies and improving the handling of complex semantic contexts.

Introduction

A critical aspect of improving learners’ proficiency in Chinese as a foreign language is grammatical error correction (GEC), a specialized area within the domain of both natural language processing and applied linguistics involving the automatic detection and correction of grammatical errors in learners’ written texts. With the current global rise in the number of students learning Chinese, the demand for efficient, automated systems capable of accurately identifying and correcting Chinese grammatical errors is increasing.

Current research in GEC has predominantly featured English, experiencing significant advancements with NLP technologies like sequence-to-sequence (seq2seq) [1–3] and sequence-to-edit (seq2edit) [4–6] models achieving state-of-the-art results on major benchmarks. However, these methods often require extensive labeled datasets, posing difficulties in scenarios where such resources are scarce. Despite considerable advancements, applying these systems to Chinese presents notable challenges, including the scarcity of high-quality parallel data and the intrinsic complexities of the Chinese language [7]. While Chinese GEC (CGEC) has been somewhat explored [8], accurately resolving learner Chinese errors remains a daunting task, underscoring the demand for more refined correction systems. LLMs have shown impressive capabilities in various natural language processing tasks due to their ability to generalize across languages and tasks through in-context learning, thus addressing low-resource challenges effectively. Early explorations of LLMs in CGEC, however, have shown that overcorrection is a major problem [9]. LLMs nevertheless present a promising avenue for enhancing GEC systems, and recent studies have begun leveraging these models to improve the accuracy of corrections. Still, we know little about how errors are over- or under-corrected by LLMs and how well they perform across different CGEC tasks. To the best of our knowledge, several researchers have evaluated ChatGPT’s performance on English GEC [10–15], but no studies have examined LLMs’ CGEC abilities.

When it comes to the ability of LLMs in CGEC, a noticeable gap persists; much like earlier research, current approaches often overlook the necessity of analyzing and evaluating the linguistic qualities of model outputs. Current efforts typically focus on refining algorithms without a deep understanding of the lexical, syntactic, and semantic nuances being processed. This oversight restricts the potential feedback loop in which insights from linguistic analysis could inform and inspire further model improvements. This study evaluates three mainstream LLMs’ performance in CGEC from a corpus linguistics perspective in order to gain a comprehensive understanding of LLMs’ performance in CGEC, extend the scope of CGEC research by focusing on qualitative and quantitative analysis of the generated texts, and enrich analytical methods in corpus linguistics such as keyword and key n-gram analysis. The main contributions of this study are twofold. Firstly, it is the first study to evaluate mainstream LLMs for CGEC. Secondly, it is the first to combine detailed corpus linguistic analysis with qualitative methods to examine how these models perform in lexical, syntactic, and semantic terms.

In the next section, a brief literature review identifies existing gaps in CGEC for learners, particularly emphasizing the limited exploration of large language models in this domain. Following this, the “Methods” section presents the datasets, evaluation metrics, and corpus linguistic analysis of the LLM outputs. The study employs the APIs of leading LLMs, namely ChatGPT, Ernie-4 (Wenxinyiyan from Baidu), and Llama3, to correct grammatical errors and systematically evaluate their performance. Subsequently, the “Results” section presents the keyword and key n-gram analyses to assess the models’ effectiveness across lexical, syntactic, and semantic dimensions. Finally, the study qualitatively analyzes the findings in depth, highlighting the strengths and limitations of LLMs in CGEC and proposing implications and recommendations for future research and advancements in correcting Chinese grammatical errors.

Literature review

LLMs in English GEC

GEC has developed with advanced natural language processing technologies from rule-based, classifier, and statistical machine translation to deep learning models, particularly those based on the transformer architecture. By treating GEC as a machine translation or sequence-to-sequence task, pretrained transformer models map erroneous text to corrected forms using extensive authentic data and complex architectures. Studies [16, 17] have demonstrated state-of-the-art results, highlighting these models’ effectiveness in addressing diverse grammatical errors. Alternatively, studies using the sequence-to-edit approach [6, 7, 18], which converts input sentences into a series of edit operations, have gained popularity and enhanced correction accuracy by focusing directly on errors and their corrections, resulting in faster and more precise outputs.

The emergence of Generative Pre-trained Transformer (GPT) models has led to remarkable progress in the field of large language models (LLMs). ChatGPT can outshine existing models in grammatical error correction for English [10], Chinese [19], Japanese [20], Korean [21], and Arabic [22], implying its potential as a valuable tool for second language learners striving to enhance their writing accuracy and fluency. A current trend in English GEC involves writing prompts for LLMs such as GPT-3, GPT-3.5, or GPT-4 that generate grammatical corrections [10, 23, 24]. Coyne et al. [23] assessed the GEC capabilities of GPT-3.5 (text-DaVinci-003) and GPT-4 (GPT-4-0314) using major benchmarks like BEA-2019 and JFLEG. Their study explores both zero-shot and few-shot prompt settings and highlights that GPT-4 achieves a new high score on the JFLEG benchmark. Through detailed analysis, the study reveals that GPT models tend to make fluency edits and over-corrections. Human evaluations indicate that GPT-4 produces fewer under-corrections but more over-corrections compared to traditional models, showcasing its ability to generate grammatically correct and fluent text. The use of LLMs like GPT-3.5 and GPT-4 has shown promising results, yet the field remains relatively underexplored. Key issues like prompt engineering, model consistency, and thorough performance assessment remain major challenges. In particular, consistency across different LLMs and their effectiveness in correcting grammatical, semantic, and even pragmatic errors require further examination. Further research is needed to explore more models like LLaMA-3, addressing these challenges and expanding the application to ensure robust and reliable performance.

LLMs in Chinese GEC

In the NLPCC-2018 shared task [8], many systems adopt Seq2Seq models based on RNN/CNN [25, 26]. Recent work adopts diverse approaches, including Transformer [27, 28], Seq2Edit [29, 30], and LLMs [19, 29], to achieve state-of-the-art performance. Among the latest LLM-based studies, Fan et al. [19] introduce GrammarGPT to CGEC using a hybrid dataset of ChatGPT-generated and manually annotated data. By leveraging instruction tuning and innovative data augmentation techniques, GrammarGPT demonstrates the potential of open-source language models in specialized linguistic tasks. For learner Chinese GEC, MuCGEC [30] introduces a comprehensive dataset designed to improve the evaluation of CGEC models. Li et al. [31] explore the application of Baichuan-13B-Chat, using instruction tuning to refine its performance on CGEC tasks. The study finds that incorporating universal instructions and leveraging both full-parameter fine-tuning and LoRA (a parameter-efficient fine-tuning technique) can enhance the models’ correction capabilities, underscoring the importance of tailored instruction data. Besides modeling optimization, techniques like data augmentation [27, 28] and model ensemble [29] have proven to be very useful for CGEC.

Recent advancements in CGEC have demonstrated significant and dynamic progress, driven by innovative model architectures, fine-tuning strategies, and comprehensive datasets. This brief overview highlights the current dominance of LLM-based systems. Despite their effectiveness, these systems often depend on substantial computing resources and human-annotated training data. They generally treat CGEC as a translation or text generation task, neglecting the specific aspects of grammatical error correction highlighted by Bryant et al. [32]. With the increasing reliance on neural language models, researchers risk becoming dependent on these less transparent “black box” systems, potentially losing control over specific CGEC tasks. There is a notable lack of focus on the inefficiencies and limitations of these systems, as well as an insufficient understanding of the real performance of other influential LLMs, such as GPT-4, LLaMA-3, and ERNIE-4. To effectively manage CGEC and address the needs of Chinese foreign language learners through a more transparent approach, it is essential for researchers to assess these models from linguistic and language-learning perspectives, explore their operational mechanisms specific to CGEC, and devise strategies to improve their effectiveness.

Error analysis in Chinese GEC

Error analysis in GEC is critical for understanding and improving model performance. Fang et al. [10] evaluated ChatGPT on the CoNLL14 test set, focusing on fluency, minimal edits, over-correction, and under-correction. They found that ChatGPT exhibited superior fluency, correcting sentences freely and diversely rather than using minimal edits. While it had fewer under-corrections, it was prone to over-corrections. Similarly, Wu et al. [14] conducted a human evaluation, revealing that ChatGPT produced the most over-corrections but had the fewest under-corrections and fewer mis-corrections compared to other systems.

In the context of CGEC, error analysis is equally vital. Li et al. [26] examined a CNN model, identifying four main error types: word-level errors in common words, contextual errors, challenging verb corrections, and structural word order errors. They emphasized the need for a flexible evaluation approach, recognizing multiple correct solutions beyond gold-standard sentences. Xu et al. [33] introduced the FCGEC corpus, which annotates grammatical errors such as Incorrect Word Collocation, Component Missing, Component Redundancy, Structure Confusion, Incorrect Word Order, Illogical, and Ambiguity, from operational and semantic perspectives. This fine-grained categorization enhances the understanding of common Chinese errors. Wang et al. [13] analyzed 66 sentences, identifying 100 errors across eight categories from the HSK learner corpus: character-level spelling errors, word selection errors, redundant word errors, missing word errors, and various sentence-level errors. Their evaluation of correction results from pre-trained models showed that BART was the most effective, particularly in handling character-level and miscellaneous sentence-level errors. T5 performed better than BERT on missing word and word order errors, likely due to its pre-trained decoder.

Despite the progress made in these studies, several notable shortcomings remain. First, the human-annotated datasets are relatively small, which limits the robustness and generalizability of the findings. Second, although the error types are described as fine-grained, they still largely correspond to the common operational categories used in general CGEC research: missing, replacing, redundancy, and word order. This categorization might not fully address the unique challenges posed by learning Chinese as a foreign language. A more nuanced corpus linguistics approach, specifically tailored to the intricacies of Chinese grammar and usage, could provide deeper insights and lead to more effective improvements in CGEC model performance.

Methods

Datasets

This study utilizes three publicly available CGEC evaluation datasets: NLPCC18 (https://github.com/zhaoyyoo/NLPCC2018_GEC), CGED-2021 (https://github.com/blcuicall/cged_datasets), and YACLC-Minimal (Track 3 dev) (https://github.com/blcuicall/CCL2022-CLTC). These datasets were contributed by NLPCC-2018, the series of CGED shared tasks, and CCL2022-CLTC, respectively, and are used to evaluate the performance of large language models. The NLPCC18 test data is extracted from the PKU Chinese Learner Corpus, a collection of essays written by foreign college students. It was annotated by two annotators: the first marked the edits alone, and the second checked the annotation and made revisions. This study takes the second annotator’s reference as the gold data to evaluate the LLMs’ performance. CGED-2021 belongs to the series of shared CGEC task datasets, produced from 2020 onwards mainly by the Beijing Advanced Innovation Center for Language Resources at Beijing Language and Culture University. CCL2022-CLTC refers to the Chinese Learner Text Correction (CLTC) task at the Chinese Computational Linguistics conference; this study uses the development dataset of Track 3 in this task, YACLC-Minimal. Sentences in NLPCC18 and CGED have only one reference, whereas CCL2022-CLTC offers multiple corrections per sentence, thereby providing a more comprehensive basis for evaluating CGEC systems. The annotation guidelines follow the general principle of Minimum Edit Distance. Errors are divided into four types: redundant words (denoted by a capital “R”), missing words (“M”), word selection errors (“S”), and word ordering errors (“W”). All the datasets used in the experiment are described in Table 1.

Evaluation

To evaluate the capabilities of LLMs for CGEC, we prompt three top LLMs of varying sizes: GPT-4 (GPT-4-0613, with training data up to Sep 2021), LLaMA-3-70B, and ERNIE-4.0. For experiments with ChatGPT, we use the official OpenAI API to prompt GPT-4. For LLaMA-3-70B and ERNIE-4.0, we use the Qianfan LLM Platform provided by Baidu in China. The prompt is direct and to the point: “请使用中文回答问题。请根据最小编辑原则修改以下句子中的语法错误,并直接仅给出修改后的正确语句, 不需要任何解释:/Please answer the questions in Chinese. According to the principle of minimal edits, please correct the grammatical errors in the following sentence and provide only the corrected sentence directly, without any explanation:”.
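
For illustration, the following sketch shows how such a prompting loop can be implemented in R, assuming the standard OpenAI chat completions endpoint; the function name, the use of the httr package, and the temperature setting are our own illustrative choices rather than details reported above. Calls to LLaMA-3-70B and ERNIE-4.0 on the Qianfan LLM Platform would follow the same pattern with that platform’s endpoint and authentication.

```r
# Minimal sketch of the prompting setup (illustrative; the endpoint and
# decoding parameters are assumptions, not reported settings).
library(httr)

correct_sentence <- function(sentence, api_key) {
  prompt <- paste0(
    "请使用中文回答问题。请根据最小编辑原则修改以下句子中的语法错误,",
    "并直接仅给出修改后的正确语句, 不需要任何解释:", sentence
  )
  resp <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    body = list(
      model = "gpt-4-0613",
      temperature = 0,  # deterministic decoding for reproducibility (assumed)
      messages = list(list(role = "user", content = prompt))
    ),
    encode = "json"
  )
  content(resp)$choices[[1]]$message$content
}
```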

For assessment purposes, we employ the MaxMatch (M2) metric developed by Dahlmeier and Ng [34], which leverages Levenshtein distance to align the original and corrected sentences and identifies the most extensive matching edits to calculate precision, recall, and the F score. For Chinese, we report F0.5, an adaptation of the F1 score that emphasizes precision by weighting it twice as heavily as recall, reflecting the prevailing view in grammatical error correction research that precision is more crucial than recall. To guarantee uniformity in our evaluations, we utilize the publicly available MaxMatch scorer (https://github.com/nusnlp/m2scorer) for CGEC.
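
Concretely, the M2 scorer counts true positives (proposed edits matching gold edits), false positives, and false negatives at the edit level and combines them as sketched below; the counts in the usage line are illustrative numbers, not results from our experiments.

```r
# Edit-level precision, recall, and F-beta as combined by the M2 metric;
# beta = 0.5 weights precision twice as heavily as recall.
f_beta <- function(tp, fp, fn, beta = 0.5) {
  p <- tp / (tp + fp)                      # precision over proposed edits
  r <- tp / (tp + fn)                      # recall over gold edits
  (1 + beta^2) * p * r / (beta^2 * p + r)
}

f_beta(tp = 40, fp = 20, fn = 60)  # P = 0.67, R = 0.40 -> F0.5 = 0.59
```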

Corpus linguistics analysis

In this study, we conducted a one-way Analysis of Variance (ANOVA) to compare the differences in editing distance values among ChatGPT, Llama-3, Ernie-4 (Wenxinyiyan from Baidu), and human annotators. The ANOVA was performed using R (version 4.3.2) with the aov function, ensuring that our statistical analysis is both transparent and reproducible. Prior to the ANOVA, we verified the assumptions of normality and homogeneity of variances using the Shapiro-Wilk test and Levene’s test, respectively. Post-hoc comparisons were conducted using Tukey’s Honest Significant Difference (HSD) test to identify specific group differences where significant effects were found. Detailed scripts and data used for these analyses are provided in the supplementary materials, further enhancing the reproducibility of our findings. This comprehensive statistical approach allows for a robust comparison of the editing behaviors of different LLMs against human performance, thereby reinforcing the validity of our results.
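
A condensed sketch of this pipeline is given below; the data frame and column names (`edits`, `distance`, `group`) are illustrative stand-ins for the scripts in the supplementary materials.

```r
# One-way ANOVA over word-level editing distances, with assumption checks
# and Tukey HSD post-hoc comparisons; `edits` has one row per sentence.
library(car)  # for leveneTest()

edits$group <- factor(edits$group)  # GPT-4, Llama-3, ERNIE-4, Human

fit <- aov(distance ~ group, data = edits)   # one-way ANOVA
shapiro.test(residuals(fit))                 # normality of residuals
leveneTest(distance ~ group, data = edits)   # homogeneity of variances
summary(fit)                                 # overall F test
TukeyHSD(fit)                                # pairwise group differences
```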

Moreover, we conducted a detailed comparative keyword and key n-gram analysis using Quanteda package in R to evaluate the performance of the LLMs from a linguistic perspective. Since the differences between incorrect and corrected sentences in the datasets are primarily grammatical—while content words largely remain unchanged—this analysis effectively focuses on grammatical word usage. We processed the corrected outputs from the LLMs, the corrections made by human annotators, and the original learner sentences to extract word and n-gram frequencies. Using chi-squared (χ2) tests within R, we determined the significance of differences in the frequency of key linguistic devices among learners, annotators, and LLMs, thereby identifying specific grammatical features that distinguish the groups. The combination of corpus analysis with statistical testing provided a robust methodological framework for evaluating the linguistic performance of LLMs. All scripts, frequency lists, and statistical outputs have been documented and are available in the supplementary materials to enhance the transparency and reproducibility of our findings.
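
The sketch below outlines the keyness procedure under illustrative names (`llm_sents` and `annotator_sents` are assumed to be character vectors of corrected sentences); quanteda’s textstat_keyness computes the chi-squared keyness statistics of the kind reported in the tables that follow.

```r
# Keyword and key n-gram (chi-squared keyness) analysis with quanteda.
library(quanteda)
library(quanteda.textstats)

corp <- corpus(c(llm_sents, annotator_sents))
docvars(corp, "group") <- rep(c("LLM", "Human"),
                              c(length(llm_sents), length(annotator_sents)))

toks <- tokens(corp, remove_punct = TRUE)  # Chinese is segmented via ICU
                                           # word boundaries; a dedicated
                                           # segmenter may be used instead

# Keywords over-represented in LLM output relative to human annotators
key_words <- textstat_keyness(dfm_group(dfm(toks), groups = group),
                              target = "LLM", measure = "chi2")
head(key_words, 20)

# Key n-grams (n = 2, 3) via the same procedure
dfm_ngrams <- dfm(tokens_ngrams(toks, n = 2:3))
head(textstat_keyness(dfm_group(dfm_ngrams, groups = group),
                      target = "LLM", measure = "chi2"), 20)
```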

Results

LLMs performance

Table 2 presents the performance of three LLMs (GPT-4, ERNIE-4.0-8k, and LLaMA-3-70B) across different datasets for CGEC. On the YACLC-Minimal dataset with multiple annotators, ERNIE-4.0-8k exhibits notably higher performance. It also achieves the highest precision, recall, and F0.5 on the CGED-2021 dataset, which has a single annotator. ERNIE-4’s robust performance underscores its suitability for precise grammatical error correction tasks in Chinese-as-a-foreign-language contexts. In contrast, GPT-4 and LLaMA-3-70B demonstrate lower scores, highlighting potential challenges in CGEC. ERNIE-4’s stronger performance may be due to its development in China and training on extensive Chinese language data.

Table 2. Performance comparison among LLMs with minimum editing.

https://doi.org/10.1371/journal.pone.0312881.t002

In comparison to previous studies utilizing the character-based M2 metric, Zhang et al. [30] achieved an F0.5 score of 53.81 for NLPCC18 and 47.52 for the CGED test data. For the YACLC-Minimal (Track 3 dev), there are currently no comparative studies available; however, the reported F0.5 score on the YACLC-Minimal test data is also 47.52 [30], which can serve as a rough reference. The F0.5 scores for LLMs are significantly lower than those of state-of-the-art ensemble or complex seq2seq models. This discrepancy may be attributed to the large-scale mixed data used for training, predominantly in English, for models like GPT-4 and LLaMA-3 70B. While LLMs provide a more accessible solution for CGEC applications, their performance in terms of precision and recall is significantly lower and still falls short of human annotators.

Corpus linguistic analysis

Editing distance.

The main goal of the editing distance analysis is to show that LLMs tend to make more changes to sentences than human annotators do. By comparing the Levenshtein distances between the source texts and the predictions, we can determine how much editing each model does. We segment the corrected sentences into words instead of characters, which allows us to calculate the minimum number of word changes needed to transform one string into another using the Levenshtein distance. This method gives us a way to assess the extent of the corrections made. After calculating the editing distances, we conducted a one-way ANOVA to compare the distance values among ChatGPT, Llama-3, Ernie-4, and human annotators.
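
A word-level Levenshtein distance can be computed with standard dynamic programming, as in the sketch below; the function name is ours, and the inputs are sentences already segmented into word vectors.

```r
# Word-level Levenshtein distance between two segmented sentences.
word_levenshtein <- function(a, b) {
  n <- length(a); m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n  # cost of deleting all words of `a`
  d[1, ] <- 0:m  # cost of inserting all words of `b`
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,  # deletion
                             d[i + 1, j] + 1,  # insertion
                             d[i, j] + cost)   # substitution
    }
  }
  d[n + 1, m + 1]
}

word_levenshtein(c("我", "喜欢", "猫"), c("我", "很", "喜欢", "猫"))  # 1
```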

The ANOVA results comparing the editing distances between erroneous and corrected sentences indicate significant differences among the models and groups tested. The analysis shows that the Levenshtein distance is significantly higher for the GPT-4 model (mean difference = 2.13, 95% CI [1.86, 2.41], p < 0.0001) and the Llama-3 model (mean difference = 1.81, 95% CI [1.54, 2.09], p < 0.0001) compared to the ERNIE-4 model. This suggests that GPT-4 and Llama-3 make more substantial changes when correcting sentences. Additionally, human corrections exhibit a significantly lower Levenshtein distance compared to GPT-4 (mean difference = -1.94, 95% CI [-2.22, -1.67], p < 0.0001), indicating that human corrections are more conservative and potentially more precise. Furthermore, there is a slight but significant difference between Llama-3 and GPT-4 (mean difference = -0.32, 95% CI [-0.60, -0.04], p = 0.016), highlighting nuanced variations in the correction approaches of these models. Overall, these results indicate significant differences in the editing distances between most pairs of groups, except for the comparison between human annotators and ERNIE-4 (mean difference = 0.19, 95% CI [-0.08, 0.47], p = 0.2807).

Keyword analysis.

The keyword analysis using Quanteda in R revealed distinct patterns in the original texts produced by Chinese learners and the texts generated by the LLMs. By comparing these texts, we identified key differences that might provide valuable insights into CGEC research. The analysis highlighted several keywords that were significantly more frequent in the learners’ texts, human annotated texts, and LLMs’ texts respectively. These keywords may serve as indicators of common errors made by learners or LLMs, hence of the LLMs’ real performance.

Table 3 presents the top 20 keywords used by LLMs in contrast to their usage in learners’ original texts. It includes chi-squared values, p-values, and the frequency of keyword usage by learners and LLMs. By examining these keywords, we can categorize them based on their grammatical roles and semantic meanings, providing insight into how different groups utilize language elements in grammatical error correction.

From a grammatical perspective, the keywords can be divided into several categories. Modals and particles, such as “会/will”, “将/will”, and “了/aspect marker”, indicating future actions, obligations, or completed actions, are essential for constructing grammatically correct sentences in Chinese. Conjunctions like “和/and”, “并/and”, “且/and”, and “如果/if” link clauses or phrases, facilitating complex sentence structures. Additionally, prepositions and locatives, exemplified by “中/in”, denote positions or locations within sentences. These grammatical roles highlight the structural differences in language use between learners and LLMs.

Semantically, the keywords can be grouped based on their meanings and contextual use. Temporal and conditional words like “已经/already” and “如果/if” provide information about time and conditions, situating actions within temporal and hypothetical contexts. Keywords such as “拥有/possess” and “需要/need” relate to the existence or necessity of objects or states, indicating ownership or requirement. Words like “却/but”, “不仅/not only”, and “仍然/still” express contrasts, additions, or continuity, adding complexity to statements. Keywords like “无法/unable” and “是否/whether” indicate abilities or possibilities, expressing capability or uncertainty. This semantic diversity suggests that LLMs engage with a wide array of contexts and scenarios, albeit with varying degrees of accuracy and sophistication.

Chinese learners often use fewer modals, particles, and conjunctions, which can lead to mistakes in their writing. While LLMs aim to correct these deficiencies by incorporating more of these grammatical elements, their interventions can sometimes result in redundancy and overly complex expressions. This tendency highlights a gap between the concise writing style preferred by human learners and the more complex constructions produced by LLMs.

Table 3. Top 20 keywords featuring LLMs compared with learners.

https://doi.org/10.1371/journal.pone.0312881.t003

Tables 4 and 5 present the comparison between LLMs and human annotators, revealing distinct syntactic and semantic patterns in their keyword usage. A notable difference is that human annotators tend to feature single Chinese characters, while LLMs more frequently use two-character Chinese words. This reflects the minimal-editing strategy of human annotators, who make the smallest edits needed to correct sentences. In contrast, LLMs may generate more substantial changes, using complete words to ensure grammatical correctness. Common keywords such as “和/and”, “并/and”, “无法/unable”, and “如果/if” are more frequently used by LLMs, indicating their tendency to create coherent sentences. LLMs may overuse tense and aspect markers like “将/will”, “已经/already”, “正在/be + V-ing”, and “曾经/once” to clarify the timing of actions and ensure grammaticality even when unnecessary. LLMs tend to use more modals, particles, and conjunctions to produce fluent Chinese, which can result in redundancy and less concise expressions. In contrast, native Chinese speakers often prefer more straightforward constructions, prioritizing clarity and brevity. This difference in writing style suggests that while LLMs may generate grammatically correct sentences, they may lack the conciseness and precision that characterize human communication.

Table 4. Keywords featuring LLMs compared with human annotators.

https://doi.org/10.1371/journal.pone.0312881.t004

Table 5. Keywords featuring human annotators compared with LLMs.

https://doi.org/10.1371/journal.pone.0312881.t005

Table 6 shows the keywords featuring human annotators in contrast to learners. Only five keywords are salient, showing fewer variations as a result of the human minimal-editing strategy. Devices like the modal verb “会/will”, the particle “地/a particle used to form adverbial phrases”, and the inclusive adverb “都/all” show significant differences, with annotators using them more frequently than learners.

Table 6. Keywords featuring human annotators compared with learners.

https://doi.org/10.1371/journal.pone.0312881.t006

Key N-gram analysis.

The key n-gram analysis depicted in Table 7 reveals a significant divergence in the usage patterns of key n-grams (n = 2, 3) between texts produced by learners and those generated by LLMs. Based on their grammatical functions and usage patterns, these n-grams can be grouped into several categories, including frequency modifiers and intensifiers, expressions of possibility and probability, temporal context, actions and outcomes, descriptions and comparisons, and initiating actions. This grouping allows for a detailed examination of how these n-grams are used differently by LLMs and human learners.

Table 7. Key N-grams (n = 2,3) featuring LLMs compared with learners.

https://doi.org/10.1371/journal.pone.0312881.t007

Table 7 shows that LLMs use certain n-grams much more frequently than learners. For instance, the n-grams expressing changes and possibilities, “就会/will”, “变得/become”, and “可能会/may”, show higher usage by LLMs. N-grams indicating quantity or degree, such as “了很多/le + a lot” and “了许多/le + many”, are also used more frequently. This indicates that LLMs are more adept at using phrases that convey intensity or extent. Additionally, n-grams like “是我/is me” and “时候起/from + time” show substantial differences, highlighting LLMs’ ability to generate self-referential constructs and temporal expressions more effectively. The frequent use of “给了/have given” and “会觉得/will feel” by LLMs further underscores their tendency to construct sentences with past actions and future feelings, contributing to richer narrative and descriptive language. This tendency suggests that LLMs excel in generating language that conveys complexity, which may not be as prevalent in the writing of Chinese L2 learners. However, this increased usage of certain n-grams can also lead to expressions that feel overly formal, potentially detracting from clarity.

Tables 8 and 9 list the key n-grams featuring LLMs and human annotators, respectively. It is apparent that LLMs have far fewer key n-grams, whereas human annotators have more featured phrases. For LLMs, the phrases in Table 8, such as “我正在/I am V-ing”, “当我/when I”, “就会/will”, and “变得/become”, indicate a tendency to use specific grammatical structures to express changes, conditions, time-specific actions, and ongoing activities. In contrast, Table 9 reveals that human annotators frequently use n-grams that concern detailed expressions and specific conditions. For example, “很多的/many”, “的所有的/all of”, and “有很多的/there are many” indicate their attention to specificity and possession. Additionally, annotators favor phrases like “起来是/seem to be”, reflecting their focus on clear and descriptive language. This disparity highlights that while LLMs tend to favor more formulaic constructions, human annotators are more adept at crafting nuanced and contextually rich expressions. The variety in human-generated n-grams suggests a greater awareness of subtleties in meaning and context, which may contribute to clearer communication.

Table 8. Key N-grams (n = 2,3) featuring LLMs compared with human annotators.

https://doi.org/10.1371/journal.pone.0312881.t008

Table 9. Key N-grams (n = 2,3) featuring human annotators compared with LLMs.

https://doi.org/10.1371/journal.pone.0312881.t009

Table 10 offers a comparative analysis of key n-grams used by human annotators versus learners. Notably, only two phrases are featured for human annotators, suggesting that they employ minimal edit strategies and concentrate on correcting word errors rather than phrases. This approach contrasts sharply with the editing strategies of LLMs, which tend to generate more complex phrases, highlighting a fundamental difference in how each group addresses grammatical corrections.

Table 10. Key N-grams (n = 2,3) featuring human annotators compared with learners.

https://doi.org/10.1371/journal.pone.0312881.t010

Discussion

This study shows that the CGEC performance of LLMs, measured by the MaxMatch metric, is significantly lower than that of current state-of-the-art models, such as ensemble and sequence-to-edit approaches, which contradicts findings from previous studies [19]. From a corpus linguistic perspective, the underlying reasons for this inferior performance are primarily linguistic in nature. The editing distance analysis shows that LLMs’ editing distances differ significantly from those of human annotators. In addition, the keyword and key n-gram comparisons between LLMs and human annotators provide clear evidence that LLMs tend to over-correct at the word level rather than the phrase level and to over-complicate sentence structures. Using these words and phrases as probes, LLMs’ performance in the syntactic and semantic dimensions is qualitatively analyzed in this section.

Seq2Seq models tend to generate sentences with higher probabilities and replace infrequent words with more frequent ones, leading to overcorrection [9]. Alirector, in Yang and Quan’s study [9], is an example of mitigating overcorrection without deteriorating under-correction by distilling knowledge and fine-tuning LLMs. However, their analysis focuses on four common operational types of errors. In this study, we shifted focus from operational errors to specific linguistic errors and found that conjunctions, particles, and modal verbs are used more frequently by LLMs than by human annotators. Both LLMs and human annotators employ these keywords to establish logical relationships, convey temporal and aspectual information, and indicate degree and intensity. However, LLMs show a strong tendency to overuse them to generate more coherent and complex sentence constructions, as shown in Table 11. This can be beneficial in formal contexts where detailed explanations are required, but it can also lead to overcorrection, which may not always align with the natural flow of conversational or written Chinese as produced by learners. The results align with the word-level analyses of previous studies [13, 23], providing clearer insights into the nature of LLMs. LLMs tend to generate more formal and structured text, likely reflecting the influence of their training data, which often consists of more formal texts. In contrast, human annotators prioritize simplicity and clarity, which are essential for effective communication. Additionally, this tendency for LLMs to overuse certain words does not extend to phrases: the key n-gram analysis reveals that LLMs do not utilize more complex n-grams for error correction compared to annotators, and several featured n-grams resemble the keywords expressing temporal actions.

In terms of under-correction, LLMs struggle with the semantic understanding and usage of modal verbs and collocations. Despite overusing modal verbs, LLMs fail to fully grasp the complex and subtle meanings of different modal verbs. For example, the negative modal verbs “不可以/not to be allowed to”, “不能/unable to”, and “无法/cannot” in Table 12 are misused by both learners and LLMs. The reason is that the word “会/know how to” has a meaning similar to “能/to be able to” and “可以/to be allowed to”, all of which are often translated as “can” in English. This is a useful approximation, but their usage overlaps depending on context. In this context, their negative forms involve subtleties in meaning, together with the co-occurring verb “融化/melt”, that are difficult for LLMs to accurately interpret and correct. Regarding collocational knowledge, researchers hold that word embedding technology offers LLMs a great opportunity to learn natural word co-occurrence. However, in the CGEC task, the original sentences are written by L2 learners and often contain more than one or two errors, causing LLMs to misunderstand the writer’s intended meaning and consequently perform poorly in collocational error correction. For example, the error in Table 12 is the mis-collocation of “组成/constitute” with “人生的小部分/a small part of life”. The head of the noun phrase is “小部分/a small part” in syntax, and “组成/constitute” seems to collocate well with it. However, the real collocate of the verb in this sentence should be “人生/life”, and “完成/accomplish” is the proper verb in this position to form a natural and idiomatic unit of meaning. Another typical collocational error is the missing noun object of the verb “策划/plan”; none of the LLMs grasp the meaning well enough to add the object “活动/event”, though they make other over-corrections. This aligns with previous studies [33], and it is clear that performance on word collocation is inferior and that handling various fine-grained grammatical errors remains very challenging for LLMs [31].

For hallucination, we focus on semantic inconsistency and common-sense errors. Prior studies evaluating LLMs’ hallucination have shown that LLMs provide incorrect or misleading information [35, 36]. In CGEC, semantic inconsistency is one major type of incorrect and misleading information, occurring when the corrected sentence adds unrelated content or changes the original meaning in a way that diverges from the source sentence. For instance, as shown in Table 13, the source sentence has a redundant “曾经刚毕业/once just graduated,” which ERNIE-4, ChatGPT-4, and Llama-3 all simplify to “曾经毕业/once graduated,” thus removing redundancy but altering the intended meaning. In contrast, the human correction retains “刚毕业/just graduated,” clearly indicating a recent event. Another type of hallucination is the common-sense error, which happens when the generated content does not align with real-world knowledge. For example, Llama-3’s corrected sentence contains the awkward phrase “骑着滑板/riding a skateboard,” a mis-collocation that defies common sense. ChatGPT-4 avoids the common-sense error but introduces a new subject, “有人/someone,” thereby changing the original meaning of the source sentence and producing a semantic-consistency hallucination. Our results support the work of other studies [37]. A key reason for this issue seems to be that the source texts consist of isolated sentences, lacking the necessary context from preceding and following sentences. This absence of contextual information restricts LLMs’ ability to accurately interpret and correct errors, often resulting in the generation of irrelevant or incorrect content. Chollampatt et al. [38] showed that incorporating surrounding sentence context significantly improves GEC performance. To tackle the problem of hallucinations, researchers should focus on creating more training data with broader contexts, as well as corresponding test data. By integrating multi-sentence contexts in both training and evaluation phases, LLMs can better understand the relationships and dependencies between sentences, thereby enhancing their accuracy in making corrections.

Conclusion

The corpus linguistic analysis of LLMs’ performance in CGEC has provided valuable results and insights into their strengths and limitations. The editing distance analysis, keyword analysis, and key n-gram analysis demonstrated that LLMs tend to overcorrect the source sentences, often at the word level rather than the phrase level. Further manual analysis identified instances of over-correction, under-correction, and hallucination, highlighting specific areas where LLMs diverge from human annotators.

These findings offer a clear vision for the development of LLM use in CGEC. To enhance performance, it may be essential to fine-tune these models with more diverse and contextually rich training data that emphasize collocational, modal, and semantic corrections. Additionally, implementing feedback loops in which the model learns from human annotators’ corrections can help LLMs develop a deeper understanding of context and improve their overall accuracy. These approaches will likely lead to more natural and effective language corrections, aligning LLM output more closely with that of human annotators.

The results indicate several practical strategies for enhancing LLMs’ performance in CGEC tasks. First, fine-tuning these models with diverse and contextually rich training data that emphasize collocational, modal, and semantic corrections is essential. Additionally, integrating collocational data and knowledge graphs can significantly improve their accuracy and reduce hallucinations. Collocational data provides LLMs with a better grasp of contextually appropriate word combinations, minimizing semantic errors. Meanwhile, knowledge graphs, which encode the relationships between entities, can help keep the models’ responses from drifting into unrelated content. These enhancements could lead to more reliable and context-aware grammatical error corrections.

Regarding the limitations of the study, the qualitative analysis is limited to a small sample of LLM-generated texts, and the analysis of LLMs’ linguistic performance is not comprehensive. This constrained scope means that the findings may not fully capture the breadth of potential errors and hallucinations that LLMs can produce. Additionally, the study primarily focuses on specific types of grammatical errors, potentially overlooking other significant aspects of linguistic performance such as style and pragmatic appropriateness. Future research should aim to include a larger and more diverse dataset of generated texts and consider a wider range of linguistic features to provide a more holistic evaluation of LLMs’ capabilities and limitations.

Supporting information

S1 Data. This zip file contains the R code along with all the related data and tools utilized in this study.

https://doi.org/10.1371/journal.pone.0312881.s001

(RAR)

References

  1. Chollampatt S, Ng HT. A Multilayer Convolutional Encoder-decoder Neural Network for Grammatical Error Correction. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2018.
  2. Ge T, Wei F, Zhou M. Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study. arXiv. 2018; abs/1807.01270.
  3. Makarenkov V, Rokach L, Shapira B. Choosing the Right Word: Using Bidirectional LSTM Tagger for Writing Support Systems. Engineering Applications of Artificial Intelligence. 2019; 84:1–10.
  4. Awasthi A, Sarawagi S, Goyal R, Ghosh S, Piratla V. Parallel Iterative Edit Models for Local Sequence Transduction. In: Inui K, Jiang J, Ng V, Wan X, editors. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 4260–4270. https://doi.org/10.18653/v1/D19-1435
  5. Omelianchuk K, Atrasevych V, Chernodub A, Skurzhanskyi O. GECToR–Grammatical Error Correction: Tag, Not Rewrite. In: Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications. Seattle, USA: Association for Computational Linguistics; 2020. p. 163–170. https://doi.org/10.18653/v1/2020.bea-1.16
  6. Wang Q, Tan Y. Automatic Grammatical Error Correction Based on Edit Operations Information. In: International Conference on Neural Information Processing; 2018. p. 494–505.
  7. Yue T, Liu S, Cai H, Yang T, Song S, Yu T. Improving Chinese Grammatical Error Detection via Data Augmentation by Conditional Error Generation. In: Muresan S, Nakov P, Villavicencio A, editors. Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 2966–2975. https://doi.org/10.18653/v1/2022.findings-acl.233
  8. Zhao Y, Jiang N, Sun W, Wan X. Overview of the NLPCC 2018 Shared Task: Grammatical Error Correction. In: Zhang M, Ng V, Zhao D, Li S, Zan H, editors. Natural Language Processing and Chinese Computing. Lecture Notes in Computer Science (Vol. 11109). Springer International Publishing; 2018. p. 439–445. https://doi.org/10.1007/978-3-319-99501-4_41
  9. Yang H, Quan X. Alirector: Alignment-Enhanced Chinese Grammatical Error Corrector. In: Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics; 2024. p. 2531–2546. Available from: https://aclanthology.org/2024.findings-acl.148/.
  10. Fang T, Yang S, Lan K, Wong DF, Hu J, Chao LS, et al. Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation. arXiv. 2023; abs/2304.01746.
  11. Katinskaia A, Yangarber R. GPT-3.5 for Grammatical Error Correction. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL; 2024. p. 7831–7843. Available from: https://aclanthology.org/2024.lrec-main.692.
  12. Mizumoto A, Shintani N, Sasaki M, Teng MF. Testing the Viability of ChatGPT as a Companion in L2 Writing Accuracy Assessment. Research Methods in Applied Linguistics. 2024; 3(2):100116.
  13. Wang H, Kurosawa M, Katsumata S, Mita M, Komachi M. Chinese Grammatical Error Correction Using Pre-trained Models and Pseudo Data. ACM Transactions on Asian and Low-Resource Language Information Processing. 2023; 22(3), Article 89:1–12.
  14. Wu H, Wang W, Wan Y, Jiao W, Lyu M. ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark. arXiv. 2023; abs/2303.13648.
  15. Zeng M, Kuang J, Qiu M, Song J, Park J. Evaluating Prompting Strategies for Grammatical Error Correction Based on Language Proficiency. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL; 2024. p. 6426–6430. Available from: https://aclanthology.org/2024.lrec-main.569.
  16. Junczys-Dowmunt M, Grundkiewicz R, Guha S, Heafield K. Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, USA: Association for Computational Linguistics; 2018. p. 595–606. https://doi.org/10.18653/v1/N18-1055
  17. Kiyono S, Suzuki J, Mita M, Mizumoto T, Inui K. An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 1236–1242. https://doi.org/10.18653/v1/D19-1119
  18. Mesham S, Bryant C, Rei M, Yuan Z. An Extended Sequence Tagging Vocabulary for Grammatical Error Correction. In: Findings of the Association for Computational Linguistics: EACL 2023. Dubrovnik, Croatia: Association for Computational Linguistics; 2023. p. 1608–1619. https://doi.org/10.18653/v1/2023.findings-eacl.119
  19. Fan Y, Jiang F, Li P, Li H. GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning. In: Liu F, Duan N, Xu Q, Hong Y, editors. Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14304. Cham: Springer; 2023. p. 69–80. https://doi.org/10.1007/978-3-031-44699-3_7
  20. Schmidt-Fajlik R. ChatGPT as a Grammar Checker for Japanese English Language Learners: A Comparison with Grammarly and ProWritingAid. AsiaCALL Online Journal. 2023; 14(1):105–119.
  21. Park C, Koo S, Kim G, Lim H. Towards Harnessing the Most of ChatGPT for Korean Grammatical Error Correction. Applied Sciences. 2024; 14(8):3195.
  22. Kwon S, Bhatia G, Nagoudi EMB, Abdul-Mageed M. Beyond English: Evaluating LLMs for Arabic Grammatical Error Correction. In: Proceedings of ArabicNLP 2023. Singapore: Association for Computational Linguistics; 2023. p. 101–119. https://doi.org/10.18653/v1/2023.arabicnlp-1.9
  23. Coyne S, Sakaguchi K, Galvan-Sosa D, Zock M, Inui K. Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction. arXiv. 2023; abs/2303.14342.
  24. Loem M, Kaneko M, Takase S, Okazaki N. Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods. In: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). Toronto, Canada: Association for Computational Linguistics; 2023. p. 205–219. https://doi.org/10.18653/v1/2023.bea-1.18
  25. Kaneko M, Mita M, Kiyono S, Suzuki J, Inui K. Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction. arXiv. 2020; abs/2005.00987.
  26. Li S, Zhao J, Shi G, Tan Y, Xu H, Chen G, et al. Chinese Grammatical Error Correction Based on Convolutional Sequence to Sequence Model. IEEE Access. 2019; 7:72905–72913.
  27. Tang Z, Ji Y, Zhao Y, Li J. Chinese Grammatical Error Correction Enhanced by Data Augmentation from Word and Character Levels. In: Proceedings of the 20th Chinese National Conference on Computational Linguistics. Hohhot, China; 2021. p. 13–15.
  28. Zhao Z, Wang H. MaskGEC: Improving Neural Grammatical Error Correction via Dynamic Masking. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(01); 2020. p. 1226–1233.
  29. Hinson C, Huang HH, Chen HH. Heterogeneous Recycle Generation for Chinese Grammatical Error Correction. In: Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics; 2020. p. 2191–2201. https://doi.org/10.18653/v1/2020.coling-main.199
  30. Zhang Y, Li Z, Bao Z, Li J, Zhang B, Li C, et al. MuCGEC: A Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics; 2022. p. 3118–3130. https://doi.org/10.18653/v1/2022.naacl-main.227
  31. Li Y, Huang H, Ma S, Jiang Y, Li Y, Zhou F, et al. On the (In)Effectiveness of Large Language Models for Chinese Text Correction. arXiv. 2023; abs/2307.09007.
  32. Bryant C, Yuan Z, Qorib MR, Cao H, Ng HT, Briscoe T. Grammatical Error Correction: A Survey of the State of the Art. Computational Linguistics. 2023; 49(3):643–701.
  33. Xu L, Wu J, Peng J, Fu J, Cai M. FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022. p. 1900–1918. https://doi.org/10.18653/v1/2022.findings-emnlp.137
  34. Dahlmeier D, Ng HT. Better Evaluation for Grammatical Error Correction. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada: Association for Computational Linguistics; 2012. p. 568–572. Available from: https://aclanthology.org/N12-1067.
  35. Li Y, Du Y, Zhou K, Wang J, Zhao WX, Wen JR. Evaluating Object Hallucination in Large Vision-Language Models. arXiv. 2023; abs/2305.10355.
  36. Manakul P, Liusie A, Gales MJ. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv. 2023; abs/2303.08896.
  37. Abedi M, Alshybani I, Shahadat MRB, Murillo MS. Beyond Traditional Teaching: The Potential of Large Language Models and Chatbots in Graduate Engineering Education. arXiv. 2023; abs/2309.13059.
  38. Chollampatt S, Wang W, Ng HT. Cross-Sentence Grammatical Error Correction. In: Korhonen A, Traum D, Màrquez L, editors. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019. p. 435–445. https://doi.org/10.18653/v1/P19-1042