Abstract
The rapid advancement of artificial intelligence (AI) tools, especially in natural language processing, is transforming scientific writing by improving efficiency, consistency, and accessibility, particularly for non-native English speakers and early-career researchers. This study aimed to evaluate the effectiveness of Compilatio, a widely used plagiarism detection software, in identifying AI-generated scientific content.
Four commonly used and freely available AI tools [ChatGPT, Gemini, Perplexity, and Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking (STORM)] were prompted to generate introductory texts on the burden of diabetes. Each output was copied into a Word document, uploaded to Compilatio, and analyzed; the software provided an integrity score, a similarity index, and the likelihood of AI-generated content.
Integrity scores varied substantially, ranging from 32% (STORM) to 100% (Gemini), while similarity indices remained consistently low (0-6%), indicating minimal direct text overlap with existing sources. The likelihood of AI authorship also varied, with STORM yielding the lowest detection rate (27%) and Gemini the highest (100%).
These findings highlight the distinct textual characteristics produced by different AI models and demonstrate the overall effectiveness of Compilatio in identifying AI-generated content from three out of four tools. However, the limited performance observed with STORM-generated text underscores the need for more sophisticated and adaptable detection systems to uphold academic integrity in the evolving landscape of AI-supported scientific writing.
Introduction
The exponential rise of artificial intelligence (AI) tools in recent years has not only contributed significantly to various aspects of daily life, but has also revolutionized scientific writing (1). These advancements, especially in natural language processing and machine learning, are transforming academic research and communication among non-native English speakers, as well as early-career scientists who are still in the process of developing their writing skills (1). The ability of AI tools to streamline tasks that traditionally require substantial time and effort, such as conducting literature reviews and improving the clarity and consistency of written scientific communication, has made them an invaluable resource (2).
Freely available tools like ChatGPT, Gemini, and Perplexity, along with specialized resources such as STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking), are becoming increasingly attractive for drafting, refining, and summarizing scientific content, especially the introduction section of academic papers. In fact, this section often requires clearly presenting the problem by summarizing previous evidence, and is hence well suited to being generated with the support of AI (3). However, the increased adoption of AI in scientific writing also raises significant ethical concerns, such as questions about authorship integrity and the balance between human creativity and machine support (4).
Some software programs have been developed to detect both plagiarism and AI-generated text in scientific papers. In this study, we evaluate the effectiveness of one such tool, Compilatio, in identifying AI-generated content.
Materials and Methods
Four widely used AI tools were employed, including three free online “generic” resources (ChatGPT 3.5, Gemini 2.5, and Perplexity 2.0) and STORM 1.1.0, a specialized AI-powered tool developed by Stanford University for creating comprehensive, Wikipedia-style articles. Each tool was prompted with the following generic request: “Please write an introduction about the epidemiology, clinical, social and economic burden of diabetes”. The resulting outputs from each of the four AI tools were copied into separate Word documents, which were then sequentially uploaded to Compilatio (https://www.compilatio.net/it), a plagiarism detection software used by many academic institutions. This software provides an “integrity score” expressed as a percentage, along with three additional metrics: similarity index (the percentage of content matched from other sources), likelihood of AI-generated text (also expressed as a percentage), and unrecognized language. The software analyzes documents by comparing the uploaded text with a vast array of online sources, academic papers, and databases, employing stylometric techniques based on features such as vocabulary diversity, sentence structure, average sentence length, and word rarity to detect AI-generated content. Differences among the four models in the plagiarism and AI-detection metrics were evaluated using a χ² test. Access to Compilatio is free and unlimited for members of Verona University. Ethical approval was not required due to the use of publicly available web resources.
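To make the stylometric features mentioned above concrete, the following Python sketch computes three such proxies: type-token ratio as a measure of vocabulary diversity, average sentence length, and a rare-word rate. This is purely illustrative; Compilatio's actual implementation is proprietary, and the feature definitions, function name, and toy common-word list below are our assumptions, not its code.

```python
# Illustrative sketch only: Compilatio's detector is proprietary, and these
# stylometric proxies (type-token ratio, mean sentence length, rare-word
# rate) are assumptions, not Compilatio's actual features or code.
import re
from collections import Counter

def stylometric_features(text, common_words):
    """Compute simple stylometric statistics for a document."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {"ttr": 0.0, "avg_sentence_len": 0.0, "rare_word_rate": 0.0}
    counts = Counter(words)
    return {
        # vocabulary diversity: distinct words / total words
        "ttr": len(counts) / len(words),
        # average sentence length in words
        "avg_sentence_len": len(words) / len(sentences),
        # share of words outside a reference list of common words
        "rare_word_rate": sum(1 for w in words if w not in common_words) / len(words),
    }

# Hypothetical usage with a toy common-word list:
COMMON = {"the", "of", "and", "a", "in", "is", "to", "with", "for", "on"}
sample = "Diabetes is a chronic disease. It imposes a heavy burden on health systems."
print(stylometric_features(sample, COMMON))
```

A detector would typically compare such feature values against distributions estimated from known human-written and AI-generated corpora; the thresholds and models involved are not disclosed by the vendor.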
Results
The results of our analysis are summarized in Table 1 and Figure 1. The word count of the AI-generated documents varied broadly, from 168 (Gemini) to 1604 (STORM). The integrity scores also varied markedly, from a minimum of 32% for STORM to a maximum of 100% for Gemini. The similarity index remained relatively low for all tools (ranging from 0% to 6%), while the percentage of likely AI-written text varied considerably, with STORM having a minimum of 27% and Gemini displaying a maximum of 100%. In the ChatGPT output, Compilatio also detected an area of overlap between flagged AI-generated text and content matched from other sources. The χ² test revealed a significant difference among the four models in the plagiarism and AI-detection metrics (χ²=56.02; p<0.001), indicating that their outputs differ substantially in detectability.
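As an illustration of how such a comparison might be computed, the sketch below applies a χ² test of independence to a contingency table derived from the reported AI-likelihood percentages, treating each percentage as flagged versus unflagged counts out of 100 text segments. This simplifying assumption is ours; the exact table underlying the reported χ²=56.02 is not specified, and this toy example does not reproduce that statistic.

```python
# Illustrative only: the exact contingency table behind the reported
# χ²=56.02 is not given. Here each tool's AI-likelihood percentage is
# treated as flagged vs. unflagged counts out of 100 segments (our
# simplifying assumption), so the resulting statistic will differ.
from scipy.stats import chi2_contingency

ai_likelihood = {"ChatGPT": 79, "Gemini": 100, "Perplexity": 74, "STORM": 27}

# Rows: tools; columns: [flagged as AI, not flagged]
table = [[pct, 100 - pct] for pct in ai_likelihood.values()]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
```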
Discussion
The results of this analysis reveal that AI-generated text varies in both quality and likelihood of being flagged as “AI-written” by Compilatio, one of the plagiarism and AI-detection software programs most commonly used in Italian universities. Gemini generated content that was flagged with a 100% integrity score, likely because of its succinct and broadly comprehensive output. STORM, which is specifically designed to generate in-depth and structured scientific content, yielded a substantially lower integrity score (32%) despite the considerably higher word count of the text produced. This difference can mostly be attributed to the nature of the web resources: STORM provides more comprehensive content, likely drawing on a larger number of sources and ideas, which may ultimately dilute or even mask its “AI fingerprints”.

The similarity index was low across all tools, suggesting that the content generated by the four freely available AI resources used in this study was not directly copied from existing sources indexed by Compilatio. However, the marked variation in the proportion of AI-written text highlights the different approaches these tools use for generating content. ChatGPT and Perplexity produced text with high percentages of AI-written content (79% and 74%, respectively), suggesting that they may rely heavily on the recognizable patterns of their pre-trained models. Gemini, on the other hand, produced text that was entirely flagged as AI-written, likely due to its minimalist and direct strategy. STORM, although producing less AI-sounding content, still had a modest portion (27%) that was identified as AI-generated.
Conclusion
This study highlights the increasing importance of tools designed to detect AI-generated text. The ability of Compilatio to identify AI-written content from the three “generic” language models demonstrates its real utility when unmodified text obtained from these freely available web resources is used. Nonetheless, its performance decreased considerably when detecting text generated by STORM, suggesting that such detection systems still require further refinement to reliably safeguard academic research integrity.