The Challenge of AI-Generated Components
AI models are trained on vast repositories of public code, which means they frequently produce code that closely resembles, or directly replicates, existing open source implementations. These similarities are often invisible to developers, who assume the generated code is original. The problem is structural: AI models learn patterns from their training data and naturally reproduce those patterns when prompted for similar functionality.

Research demonstrates the scope of this challenge. Studies analysing LLM-generated code found that between 0.8% and 5.3% of it exhibits notable similarity to existing open source implementations (similarity strong enough to indicate copying rather than independent creation). Measured more permissively, approximately 30% of AI-generated code shows at least some degree of overlap with open source codebases. These aren’t edge cases; they are the mainstream output of AI coding tools.

The implications are severe:

- Hidden in plain sight: AI-generated snippets don’t appear in dependency manifests, bypassing traditional Software Composition Analysis (SCA) tools that only scan declared dependencies.
- Scattered across codebases: Rather than being concentrated in a few files, AI code fragments appear throughout projects, embedded in functions developers wrote themselves.
- Licence obligations without awareness: When an AI model generates code matching GPL-licensed source code, developers inherit those licence obligations, even though they never consciously copied anything.
- Undetectable by conventional means: Traditional package-level scanners miss these snippet-level similarities entirely, creating blind spots in compliance programmes (see the sketch after this list for how snippet-level matching works).
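To make that detection gap concrete, here is a minimal sketch of snippet-level fingerprinting in the style of the winnowing algorithm (Schleimer et al., 2003), the technique behind academic tools such as MOSS and a common building block in snippet-aware scanners. It also illustrates what the similarity figures above mean operationally: fingerprint overlap between generated code and a known open source file. The `K` and `W` parameters and the normalisation rules are illustrative assumptions; production scanners use far more aggressive normalisation and pre-indexed corpora.

```python
import hashlib
import re

K = 5  # k-gram length in tokens (illustrative assumption)
W = 4  # winnowing window size (illustrative assumption)

def normalise(source: str) -> list[str]:
    """Reduce source to a token stream so whitespace and layout
    changes don't defeat matching. Real scanners also normalise
    identifiers and handle comments for many languages."""
    source = re.sub(r"#.*", "", source)        # drop Python-style comments
    return re.findall(r"\w+|[^\w\s]", source)  # words and punctuation tokens

def fingerprints(source: str) -> set[int]:
    """Winnowing: hash every K-token shingle, then keep the minimum
    hash in each sliding window of W consecutive hashes. The survivors
    form a compact fingerprint that is robust to small edits."""
    tokens = normalise(source)
    grams = [" ".join(tokens[i:i + K]) for i in range(len(tokens) - K + 1)]
    hashes = [int(hashlib.sha1(g.encode()).hexdigest()[:12], 16) for g in grams]
    picked: set[int] = set()
    for i in range(max(len(hashes) - W + 1, 1)):
        window = hashes[i:i + W]
        if window:
            picked.add(min(window))
    return picked

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of two fingerprint sets: 1.0 means the
    fingerprints are identical, 0.0 means no shared regions."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb) if fa and fb else 0.0
```

Because matching happens on raw source text rather than on declared dependencies, a scanner built this way can flag a copied GPL function even when no manifest entry exists. In practice, the fingerprints of AI-generated functions would be checked against an index built from known open source corpora, with any match above a chosen threshold routed to licence review.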