Pixels vs. Pens: AI and Human Writers Face Off
- Yves Peirsman
- LLMs, Creative Writing
- September 21, 2024
In recent years, Large Language Models (LLMs) have made significant strides in creative writing. Just five years ago, no language model would have come close to matching the narrative skills of human authors. Today, however, the landscape has dramatically shifted, prompting researchers and writers alike to ask: How do current AI models truly stack up against human storytellers? Let’s take a look at three recent studies that set out to find an answer to this question.
LLMs vs. Creative Writing Students
To compare LLMs with human authors, Gómez-Rodríguez and Williams (2023) pitted five Creative Writing students against twelve LLMs. Their task was to write an epic short story about a combat between a pterodactyl and Ignatius J. Reilly (the protagonist of John Kennedy Toole’s A Confederacy of Dunces) in the author’s distinctive dark, humorous style. This unusual assignment was chosen because the LLMs were unlikely to have seen a similar story in their training data and would thus have to create a genuinely new one.
Ten Creative Writing students rated the resulting stories on ten criteria, including readability, plot, understanding of the epic genre, accurate inclusion of the main characters, and use of the appropriate humorous tone. Surprisingly, the AI outperformed the human writers on nine of the ten criteria; only in creativity did the humans maintain a slight edge. GPT-4 and Claude emerged as the top AI performers, surpassing both the human writers and open-source models such as Koala and Vicuna. The difference between closed-source and open-source models was most pronounced in their ability to generate dark humor: only GPT-4, Claude, Bing (and the humans) achieved scores above 5 out of 10. Given that today’s open-source models have narrowed the performance gap with their closed-source counterparts, it would be fascinating to repeat this experiment.
Comparing LLMs to Top Authors
If students lose out to AI, how would established authors fare? To find out, Chakrabarty et al. (2024) did not ask writers to create new stories; instead, they selected as their benchmark twelve short stories from The New Yorker, by renowned authors like Haruki Murakami and Annie Ernaux. They then set out to have LLMs create similar narratives. To ensure the texts were as comparable as possible, they had GPT-4 create a one-sentence summary of each story, and then prompted three top language models (ChatGPT, GPT-4, and Claude 1.3) to write stories of similar length based on these summaries.
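To make this pipeline concrete, here is a minimal sketch of what the summarize-then-generate step might look like with the OpenAI Python client. The prompts, model name, and word-count heuristic are illustrative assumptions, not the authors’ exact setup.

```python
# Minimal sketch of a summarize-then-generate pipeline.
# Prompts, model name, and word-count heuristic are illustrative
# assumptions, not the exact setup used by Chakrabarty et al. (2024).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def summarize(story: str) -> str:
    """Condense a human-written story into a one-sentence plot summary."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Summarize the plot of this story in one sentence:\n\n{story}",
        }],
    )
    return response.choices[0].message.content


def generate_story(summary: str, target_words: int) -> str:
    """Ask a model to write a new story of comparable length from the summary."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Write a short story of roughly {target_words} words "
                f"based on this plot summary:\n\n{summary}"
            ),
        }],
    )
    return response.choices[0].message.content


# Usage: pair a human-written story with an AI story of similar length.
# ai_story = generate_story(summarize(human_story), len(human_story.split()))
```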
Experts evaluated the results using fourteen binary tests, such as the following:
- Do the different elements of the story work together to form a unified, engaging, and satisfying whole?
- Does the story contain turns that are both surprising and appropriate?
- Will an average reader of this story obtain a unique and original idea from reading it?
- Does each character in the story feel developed at the appropriate complexity level?
In contrast to the study above, the findings revealed a substantial gap between human and AI-generated content. The New Yorker stories passed twelve of the fourteen binary tests on average, while the LLM stories fared far worse: ChatGPT (GPT-3.5) typically passed only one test, and GPT-4 and Claude managed around four. Although the LLMs scored well on fluency, they struggled with originality, just as in the previous study. This underlines that creativity remains a key area for improvement in AI-generated writing.
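For illustration, here is a minimal sketch of how such a binary rubric could be turned into the average pass counts reported above. The test names and verdicts are invented for the example, and only three of the fourteen tests are shown.

```python
# Hypothetical illustration of aggregating binary test results:
# each expert judgement maps a test name to pass/fail, and a story's
# score is simply the number of tests it passed. All data is invented.
from statistics import mean


def passed_tests(verdicts: dict[str, bool]) -> int:
    """Count how many binary tests a single story passed."""
    return sum(verdicts.values())


stories = {
    "new_yorker": [
        {"unified_whole": True, "surprising_turns": True, "original_idea": True},
        {"unified_whole": True, "surprising_turns": False, "original_idea": True},
    ],
    "gpt4": [
        {"unified_whole": True, "surprising_turns": False, "original_idea": False},
        {"unified_whole": False, "surprising_turns": False, "original_idea": False},
    ],
}

for source, judged in stories.items():
    avg = mean(passed_tests(v) for v in judged)
    print(f"{source}: {avg:.1f} tests passed on average")
```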
The Ultimate Showdown: GPT-4 vs. Patricio Pron
In an experiment reminiscent of the Kasparov vs. Deep Blue chess match, Marco et al. (2024) organized a writing contest between GPT-4 and Patricio Pron, a renowned Argentine writer. Each was tasked with creating thirty original titles and then writing short stories for both their own and their competitor’s titles. A group of literary scholars served as judges, scoring every story for attractiveness, originality, creativity, its potential for inclusion in an anthology, and the uniqueness of its voice.
The results of this showdown were clear: Pron’s stories consistently outperformed GPT-4’s across all evaluation criteria. Pron’s titles, too, were judged more attractive, original, and creative than GPT-4’s. They included All love songs are sad songs, I keep trying to forget your promise, The last laugh of that year, and The national red button, whereas GPT-4 fell back on clichés more often, such as Between the lines of fate, Echoes of a lost dream, Shadows in the mist, and The forgotten melody. Interestingly, GPT-4 wrote better stories for Pron’s titles than for its own, which suggests that human input can improve the quality of AI’s creative writing.
Additionally, this study looked at both English and Spanish AI output. The authors found that GPT-4 performed worse when writing in Spanish compared to English, suggesting that AI proficiency declines for languages that are less well-represented in its training data. Moreover, experts quickly learned to distinguish between human and AI-written stories with high accuracy, indicating that AI-generated content relies on recognizable patterns.
Conclusion
Obviously, all the studies above have their limitations. First, they focused primarily on very short stories, typically ranging from a few hundred to 2,000 words. Longer fiction, with its higher demands on plot and character development, is likely to present a much harder challenge. Second, the researchers employed fairly simple prompting strategies to instruct the LLMs and did not experiment with the many parameter settings available. This could have disadvantaged the AI models. Nevertheless, their research convincingly demonstrates that LLMs have made impressive progress in creative writing. Even though they tend to be less creative, very short AI stories in particular can already compete with short fiction by amateur authors. And while AI still falls short of top human authors, its strong performance in aspects such as fluency suggests that language models can serve as a valuable tool for writers.