How much are AI tools writing out in the wild?

A recent paper suggests that between 6% and 16% of the text of peer reviews for some major recent conferences on machine learning (ML) and artificial intelligence (AI) may have been written with substantial help from large language models (LLMs), such as ChatGPT.

The paper offers these figures to illustrate a method for estimating how much of the text in a corpus (a collection of texts) was produced solely by humans and how much involved substantial input from an LLM. The method does not pinpoint which individual texts in the corpus were generated with substantial help from AI.

The paper, Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews, by Weixin Liang, Zachary Izzo, Yaohui Zhang and 10 other named co-authors, has not (yet?) appeared in a peer-reviewed journal. A preprint is available at https://arxiv.org/pdf/2403.07183.pdf

Estimating how much AI writing is present in a corpus

The authors present an approach for estimating what proportion of the text in a large corpus involves substantial input from an LLM. Their approach involves the following steps:

  1. Gather texts known to have been written by humans without significant involvement of an LLM. In this study, the texts were peer reviews of papers submitted to some machine learning (ML) and artificial intelligence (AI) conferences held before November 2022, when ChatGPT was released.
  2. Use an LLM to generate corresponding texts. An LLM was fed the writing instructions given to the human authors. The LLM used those instructions as prompts to produce peer reviews of papers for the same conferences.
  3. Collect statistics for the human-written and AI-generated corpora. In this study, the authors focused on the frequency of non-technical adjectives, concluding that this would give more stable results than the frequencies of other word classes, such as adverbs, nouns or verbs. (A small sketch of this counting step follows the list.)
  4. Create synthetic corpora containing the human-written and the LLM-generated texts in varying proportions (eg LLM-generated text of 0%, 2.5%, 5%, 10%). Using a statistical model, predict the frequencies of (non-technical) adjectives in the synthetic corpora. Then validate the model by comparing the predicted frequencies with the observed frequencies.
  5. Observe the frequencies in a corpus of texts written for later conferences. Then use the statistical model to estimate the proportion of LLM-generated texts present in that corpus. This study analysed corpora of peer reviews submitted to some AI conferences held after the release of ChatGPT.
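To make step 3 concrete, here is a minimal sketch of how one might tabulate adjective frequencies in a corpus. It uses NLTK's part-of-speech tagger; the choice of tagger and the tiny stoplist standing in for "technical" adjectives are my assumptions, not details taken from the paper.

```python
# Sketch: relative frequencies of non-technical adjectives in a corpus.
# Assumptions (mine, not the paper's): NLTK's default POS tagger and a
# tiny hypothetical stoplist standing in for "technical" adjectives.
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)                       # resource names may
nltk.download("averaged_perceptron_tagger", quiet=True)  # vary by NLTK version

TECHNICAL = {"convolutional", "bayesian", "stochastic"}  # hypothetical stoplist

def adjective_frequencies(documents):
    """Relative frequency of each non-technical adjective among all
    non-technical adjective tokens in the corpus."""
    counts = Counter()
    for doc in documents:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(doc)):
            w = word.lower()
            # JJ, JJR and JJS are the Penn Treebank adjective tags.
            if tag.startswith("JJ") and w.isalpha() and w not in TECHNICAL:
                counts[w] += 1
    total = max(sum(counts.values()), 1)
    return {w: c / total for w, c in counts.items()}
```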

This approach investigates only frequencies for a whole corpus. It does not try to identify which texts within a corpus were generated by (or with the help of) AI tools.
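The corpus-level estimate itself (steps 4 and 5) amounts to fitting a two-component mixture: each review is treated as drawn from the human distribution with probability 1 − α and from the LLM distribution with probability α, and α is chosen to maximise the likelihood of the observed corpus. Below is a minimal sketch of that idea under a deliberately simplified assumption of my own (not the paper's exact model): each document is reduced to the set of marker adjectives it contains, and the per-word occurrence probabilities under the two reference corpora are treated as known and strictly between 0 and 1.

```python
# Sketch: maximum-likelihood estimate of alpha, the fraction of documents
# in a corpus drawn from the LLM distribution. Simplifying assumptions
# (mine): each document is just the set of marker adjectives it contains,
# and p_human / p_llm give each word's occurrence probability, in (0, 1).
import math

def doc_loglik(present, words, probs):
    """Log-probability of a document's occurrence pattern, treating each
    word's presence as an independent Bernoulli draw."""
    return sum(math.log(probs[w] if w in present else 1.0 - probs[w])
               for w in words)

def estimate_alpha(docs, p_human, p_llm, grid=1000):
    """Grid-search MLE for the LLM fraction alpha in [0, 1]."""
    words = list(p_human)
    # Each document's log-likelihood under the two component distributions.
    pairs = [(doc_loglik(d, words, p_human), doc_loglik(d, words, p_llm))
             for d in docs]

    def corpus_loglik(alpha):
        total = 0.0
        for lh, la in pairs:
            m = max(lh, la)  # stabilise the log of the mixture
            mix = (1 - alpha) * math.exp(lh - m) + alpha * math.exp(la - m)
            total += m + math.log(max(mix, 1e-300))
        return total

    return max((i / grid for i in range(grid + 1)), key=corpus_loglik)

# Toy usage with made-up numbers (all values hypothetical):
p_human = {"commendable": 0.02, "meticulous": 0.01}
p_llm = {"commendable": 0.30, "meticulous": 0.25}
docs = [set(), {"commendable"}, set(), {"commendable", "meticulous"}, set()]
print(estimate_alpha(docs, p_human, p_llm))
```

Validating on synthetic blends (step 4) then amounts to mixing known human and LLM reviews at, say, 0%, 2.5%, 5% and 10%, and checking that the estimator recovers those proportions before trusting it on the post-ChatGPT corpora.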

Findings

The study concluded that:

  • between 6% and 16% of the text submitted as peer reviews to these conferences might have been substantially modified by LLMs (that is, beyond mere spell-checking or minor editing).
  • the estimated proportion of LLM-generated text was higher in reviews:
    (a) which were submitted close to the submission deadline (within 3 days);
    (b) which contained fewer scholarly citations (measured by instances of the phrase et al.);
    (c) whose authors reported lower self-confidence; or
    (d) whose authors did not respond to rebuttals by the author(s) of the conference papers reviewed.
  • the estimated proportion of LLM-generated text was higher in reviews containing formulaic responses which might not provide unique or creative input.

The study also looked at peer reviews submitted to the Nature family of journals but found no significant evidence of ChatGPT usage in those reviews. The authors suggest that this shows that the specialised field of machine learning is adopting AI tools faster and more widely than other scientific disciplines are.

AI tools use some adjectives and adverbs more than humans do

The paper notes that some adjectives became much more frequent in ICLR 2024 peer reviews than in peer reviews for the same conference in earlier years. For example, commendable, meticulous and intricate became roughly 10, 35 and 11 times more probable, respectively. There was a similar trend in NeurIPS peer reviews, but not in Nature journals.
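As a rough illustration of how such a shift can be measured, the snippet below computes each word's frequency ratio between two corpora. The add-one smoothing is my own choice to avoid division by zero for unseen words; the paper's exact estimator may differ.

```python
# Sketch: how much more frequent each word became between two corpora.
# The add-one (Laplace) smoothing is my own choice, not the paper's.
from collections import Counter

def frequency_ratios(old_tokens, new_tokens):
    """Ratio of each word's relative frequency in the new corpus to its
    relative frequency in the old corpus."""
    old, new = Counter(old_tokens), Counter(new_tokens)
    n_old, n_new = sum(old.values()), sum(new.values())
    vocab = set(old) | set(new)
    return {w: ((new[w] + 1) / (n_new + len(vocab)))
               / ((old[w] + 1) / (n_old + len(vocab)))
            for w in vocab}

# e.g. the 20 biggest risers between two (hypothetical) token lists:
# sorted(frequency_ratios(tokens_2022, tokens_2024).items(),
#        key=lambda kv: kv[1], reverse=True)[:20]
```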

The paper lists the top 100 adjectives and top 100 adverbs produced disproportionately by AI. I list the first 20 from each list here.

Top 20 adjectives: commendable, innovative, meticulous, intricate, notable, versatile, noteworthy, invaluable, pivotal, potent, fresh, ingenious, cogent, ongoing, tangible, profound, methodical, laudable, lucid, appreciable

Top 20 adverbs: meticulously, reportedly, lucidly, innovatively, aptly, methodically, excellently, compellingly, impressively, undoubtedly, scholarly, strategically, intriguingly, competently, intelligently, hitherto, thoughtfully, profoundly, undeniably, admirably

The full list of top 100 adverbs includes:

  • some that to me seem very stilted and almost archaic (hitherto, thereby, herein);
  • some that British humans hardly use at all, though American humans use them more often (forth, further); and
  • one that feels fairly uncommon to me (alike).

It surprises me to see scholarly listed as an adverb. To my ear, it can only be an adjective.
