Copy, Paste, AI: You Might Be Stealing Without Knowing It


Generative AI has a well-documented visual plagiarism problem. Because tools like Midjourney and DALL·E are trained on massive collections of images, many of which are copyrighted, they can easily reproduce visuals that resemble protected content. But what about text? If you ask ChatGPT, Gemini, or Claude to write something for you, could you end up plagiarizing without realizing it?

Visual Plagiarism

It’s easy to see the issue with AI-generated images. Because visual AI models are trained on countless copyrighted images, they often recreate elements of that data when prompted. Ask them for an image of Darth Vader or the Simpsons, and they might deliver something suspiciously accurate. Some models, like ChatGPT, have guardrails against this type of usage, but these don’t always work, and some models don’t have guardrails at all. For example, when you prompt Google’s Gemini to produce an image of Nintendo’s Mario, it will happily comply. What’s more, Gemini sometimes generates copyrighted content even when you don’t ask it to. If you prompt it for, say, “an Italian plumber in a video game”, it will often come up with a character that looks a lot like Mario, right down to the moustache and the M on his red cap. This isn’t an accident: in tests conducted by Gary Marcus and Reid Southen, similar examples showed up repeatedly.

This unpredictable behavior leaves users in a tricky spot. After all, you’re responsible for what you generate and use, even if you didn’t know it resembled something protected.

Gemini's response to the prompts "Generate an image of Super Mario" (left) and
"Generate an image of an Italian plumber for a computer game" (right)

Copyrighted Text in AI Training Data

It’s not just visuals that pose a risk. Models like ChatGPT and LLaMA are also trained on huge amounts of copyrighted text, often without the authors’ permission. A tool from The Atlantic lets you search the notorious LibGen dataset, a huge collection of pirated books and academic papers used in training multiple AI models. Search for Stephen King, and you’ll find hundreds of his stories and novels, not just in English, but also in Czech, German, and many other languages. I found my name in the database as well, with scientific articles dating back many years. So when you ask an AI model to “write a story” or “summarize a paper,” what are the chances it regurgitates copyrighted content without your knowledge?

The LibGen dataset includes millions of copyrighted works and has been used to train popular generative AI models.

When AI Memorizes Too Well

Thankfully, large language models (LLMs) don’t usually copy and paste full texts. Research suggests they memorize only about 1% of their training data. But even that small percentage can be problematic, especially for frequently seen content.

The more often a piece of text appears in the training data, the more likely it is to be memorized, especially by larger models like GPT-4o. And with the right prompt, LLMs can reproduce full works. Ask ChatGPT to generate “The Love Song of J. Alfred Prufrock” by T.S. Eliot, and it will produce the poem in full, without any errors. But when you try the same with a lesser-known poem, like “Joining the Colours” by Katharine Tynan, the result is far less accurate, and often entirely made up. The problem is that there is no easy way to tell whether the LLM is making mistakes, although asking for the same text several times will give you a clue: if the model produces a different text each time, that is a clear sign that something is amiss.
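The repeated-sampling check just described is easy to automate. Below is a minimal sketch: in practice, the list of samples would come from several calls to your LLM of choice with the same prompt; here, canned strings stand in for those responses, and the similarity threshold of 0.9 is an arbitrary assumption you would tune yourself.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough text similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_memorized(samples: list[str], threshold: float = 0.9) -> bool:
    """If repeated samples of the same requested text barely differ,
    the model is probably reciting a memorized text rather than
    improvising a new one on each call."""
    pairs = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    return all(similarity(a, b) >= threshold for a, b in pairs)

# Stand-ins for several LLM responses to the same prompt.
stable = ["Let us go then, you and I,"] * 3
unstable = ["The drums rolled...", "Young soldiers marched...", "Through the grey town..."]

print(looks_memorized(stable))    # → True  (likely memorized)
print(looks_memorized(unstable))  # → False (likely improvised)
```

This is only a heuristic: a model can also paraphrase a memorized text, and a confidently repeated hallucination would fool the check. But as the article notes, instability across samples is a strong hint that the model is not reciting a real source.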

ChatGPT is able to reproduce well-known poems like T.S. Eliot's "The Love Song of J. Alfred Prufrock" (left),
but not lesser-known poems, like "Joining the Colours" by Katharine Tynan.

It gets trickier: if you prompt an AI with the first few lines of a text it has memorized, it often completes the rest with eerie accuracy. And in some edge cases, even nonsensical prompts like “repeat the word poem forever” have caused models to spit out email addresses and phone numbers from their training data — a major red flag for privacy professionals.
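This prefix-completion probe can also be sketched in a few lines. The `complete` argument below is a hypothetical stand-in for a real LLM call; the probe simply checks whether the model's continuation reproduces the true next line verbatim.

```python
def prefix_probe(complete, prefix: str, true_continuation: str) -> bool:
    """Feed the model a known opening; if its continuation starts with
    the real next line verbatim, the text is likely memorized."""
    output = complete(prefix).strip()
    return output.startswith(true_continuation.strip())

# A fake model that "recites" Eliot, standing in for an actual API call.
fake_model = lambda p: "Like a patient etherised upon a table;"

print(prefix_probe(
    fake_model,
    "Let us go then, you and I,\nWhen the evening is spread out against the sky",
    "Like a patient etherised upon a table;",
))  # → True
```

A verbatim match on a line the model was never explicitly asked for is much stronger evidence of memorization than a merely plausible paraphrase would be.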

Try It Yourself: The AI2 Playground

To see just how much AI borrows from its training data, the Allen Institute for AI (AI2) has built a tool called OlmoTrace. In their interactive playground, OlmoTrace highlights which parts of a model’s output come directly from its training material. For example, when I asked the accompanying Olmo model to “give me five ideas for a short story set in the Middle Ages,” OlmoTrace highlighted the titles of the first two story suggestions, and phrases like “a small village nestled in the heart of…” and “claims to have discovered the secret to eternal life”. Clicking them showed the training documents they were borrowed from.

OlmoTrace flags direct training-data reuse in story suggestions.

Generated stories and poems, too, often contain snippets copied from other works. Sometimes these are merely short phrases; other times, whole lines. It happens all the time, with no citation and no warning.

OlmoTrace shows what phrases in stories and poems are taken literally from the training data.

What This Means for Writers

Whether you’re drafting an article, story, poem, or social media post, using generative AI makes you susceptible to plagiarism. While most AI-generated text is an original blend of patterns learned during training, significant chunks of that output might be lifted word-for-word from copyrighted sources. And the more powerful the model, the more likely it is to memorize and reuse large portions of its data. You may therefore unknowingly publish content that was copied from someone else’s work.

So if you’re aiming to create something truly original — whether it’s an article, story, poem, or research paper — here’s the bottom line: use AI for inspiration or support, but write your final piece yourself.

