When the Code Generator Doesn't Know What Secure Looks Like
A new IOActive study quantifies the security gap in AI-generated code. The findings confirm something we’ve been arguing in our previous posts: as AI accelerates how software gets built, the security model that worked before stops working — and a new layer becomes necessary.
In “When AI Writes the Code, Who Reasons About the Risk?” we covered Sonar’s State of Code findings on how more capable LLMs produce different — not fewer — security flaws.
In “If AI Writes Better Code, Why Do We Still Need Security?” we made the case that security was never about syntax — it’s about intent, context, and consequences.
IOActive’s April 2026 whitepaper gives both arguments hard numbers. Let’s dive in.
1. AI generates insecure code at scale — measured, not anecdotal
IOActive evaluated 27 leading AI models across 730 prompts, 27 languages, and 219 vulnerability categories, analyzing nearly 20,000 code samples in total.
The headline result: average security performance across all models was just 59%.
Let that sit for a second.
If a developer wrote code that was secure 59% of the time, you’d fire them. The industry’s most capable AI models are performing at that bar — and shipping that code into production at unprecedented velocity.
2. Almost a third of AI-generated code is fully exploitable
Not theoretical risk. Not “could be a problem.” Fully exploitable.
31.6% of generated samples crossed that line.
The number is shocking on its own, but it gets worse when you consider the volume curve. AI no longer writes 50 lines of code per developer per day. It writes hundreds. Multiply that volume by a 31.6% exploitability rate, and the math gets ugly fast.
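To make the volume math concrete, here is a rough sketch. The 31.6% rate is the IOActive figure quoted above; the sample counts are hypothetical, and the calculation assumes that the benchmark rate carries over to everyday generated code and that samples fail independently. Neither assumption comes from the study.

```python
# Illustrative arithmetic only. The rate is the IOActive figure quoted above;
# the sample counts and the independence assumption are ours.
EXPLOITABLE_RATE = 0.316


def expected_exploitable(samples: int, rate: float = EXPLOITABLE_RATE) -> float:
    """Expected number of fully exploitable samples among `samples` generated."""
    return samples * rate


def chance_of_at_least_one(samples: int, rate: float = EXPLOITABLE_RATE) -> float:
    """Probability that at least one sample is exploitable, assuming independence."""
    return 1 - (1 - rate) ** samples


# Hypothetical numbers of AI-generated samples merged in a day.
for n in (1, 5, 10, 50):
    print(n, round(expected_exploitable(n), 1), round(chance_of_at_least_one(n), 3))
```

Under these assumptions, a team merging even ten generated samples a day should expect about three exploitable ones, and is very unlikely to get through the day with none.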
This is no longer a “review more carefully” problem. It’s a “the math doesn’t work without a new layer” problem.
3. The failure categories are exactly the wrong ones
If AI failed randomly across vulnerability classes, that would be one kind of problem.
It doesn’t fail randomly. It fails in the load-bearing places.
- Infrastructure code (Dockerfiles, Terraform, CI/CD pipelines): 70–97% vulnerability rates
- Authentication: consistent failure across nearly every model
- Rate limiting: same
- Cryptography: same
- Dockerfiles specifically: near-universal failure
The pattern is clear. AI is best at generating plausible-looking code — and worst at the categories where plausible-looking and actually-secure diverge most. Authentication that looks right but isn’t. Cryptographic implementations that compile and run but expose key material. Infrastructure that deploys cleanly and exposes everything.
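A small illustration of that divergence, written for this post rather than taken from the IOActive samples: two token checks that return identical answers in every functional test. The function names are hypothetical.

```python
import hmac


def check_token_plausible(supplied: str, expected: str) -> bool:
    # Looks right and passes every functional test. But `==` can stop at the
    # first mismatch, so response timing can leak how much of the token matched.
    return supplied == expected


def check_token_constant_time(supplied: str, expected: str) -> bool:
    # hmac.compare_digest is designed so its running time does not depend on
    # where the two inputs differ.
    return hmac.compare_digest(supplied.encode(), expected.encode())


if __name__ == "__main__":
    secret = "s3cr3t-api-token"  # hypothetical value for the demo
    for guess in (secret, "s3cr3t-api-tokeX", "wrong"):
        print(guess, check_token_plausible(guess, secret), check_token_constant_time(guess, secret))
```

Both functions agree on every input, which is exactly why a test suite or a quick review will not tell them apart. The difference only shows up to an attacker measuring response times.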
This is exactly the failure mode we described in our last post. AI doesn’t write bad code. It writes plausible code that’s wrong in the ways that matter most.
4. “Just tell the AI to write secure code” doesn’t work
The intuitive fix is the obvious one: prompt the model to be careful.
IOActive tested it. The results aren’t kind.
Simple prompts like ‘write secure code’ were often ineffective or counterproductive.
Wrapper tools and security-aware system prompts performed better — improving security by up to 25 percentage points. But even the best configuration of the best tools produced 90 vulnerabilities across the test set.
No model achieved 100% secure output. None.
The ceiling on prompt-based remediation is real and hard. Beyond a certain point, the model can’t be coaxed into producing secure code through better instructions, because the model doesn’t actually understand what secure means in your specific system context.
That ceiling is what defines the gap.
5. The gap can’t be closed inside the generation layer
Here’s the architectural conclusion this data forces.
If foundation models can’t reliably generate secure code, and prompt engineering has hard limits, then security can’t live inside the generation layer.
It has to live as a separate reasoning layer that operates on whatever code the model produced — regardless of which model produced it, regardless of how it was prompted, regardless of the language or framework.
This is what traditional AppSec was supposed to do. It’s also what traditional AppSec can’t do anymore. SAST rules written for human-written code don’t recognize AI’s failure patterns. Pattern matchers were built for a world where vulnerabilities recur in known shapes. AI generates endless variations of “almost right,” and patterns can’t keep up.
What’s needed isn’t more rules. It’s a layer that reasons about what code does, not just matches what code looks like.
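The limit of pattern matching is easy to show with a toy example. The rule below is a deliberately simple signature invented for this post, not a rule from any real SAST product. It flags direct calls to `hashlib.md5(` and nothing else.

```python
import re

# Toy signature (hypothetical): flag direct calls to hashlib.md5(...).
WEAK_HASH_RULE = re.compile(r"hashlib\.md5\(")

# Three snippets that all end up hashing a password with MD5.
SNIPPETS = {
    "direct": 'digest = hashlib.md5(password).hexdigest()',
    "by_name": 'digest = hashlib.new("md5", password).hexdigest()',
    "aliased": 'h = hashlib\nweak = getattr(h, "md" + "5")\ndigest = weak(password).hexdigest()',
}


def flagged(source: str) -> bool:
    """Return True if the toy rule matches anywhere in the source text."""
    return WEAK_HASH_RULE.search(source) is not None


results = {name: flagged(src) for name, src in SNIPPETS.items()}
print(results)  # {'direct': True, 'by_name': False, 'aliased': False}
```

All three snippets behave the same way at runtime, but only the first has the shape the rule expects. Adding rules for the other two shapes helps until the next variation appears, which is the argument for reasoning about behavior instead.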
6. What that layer has to do
We’ve been making this argument across our previous posts. The IOActive data sharpens it. The new layer has to:
- Reason about behavior, not patterns — because AI generates plausible code that pattern matchers miss
- Operate at AI velocity — because reviewing days later doesn’t work when code ships hourly
- Compound understanding over time — because each codebase is different, and security has to learn what this system actually looks like
- Validate exploitability against system context — because volume makes triage impossible without it
- Generate fixes that hold — because flagging without fixing doesn’t scale either
- Verify after merge — because “patch generated” without “exposure closed” is half the loop
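One way to picture the last three requirements as a single loop is sketched below. This is a hypothetical model written for illustration, not a description of any product's implementation; the `Stage` and `Finding` names are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    DETECTED = 1     # behavior-level finding raised
    VALIDATED = 2    # confirmed exploitable in this system's context
    FIX_DRAFTED = 3  # patch generated
    VERIFIED = 4     # re-checked after merge; exposure confirmed closed


@dataclass
class Finding:
    title: str
    stage: Stage = Stage.DETECTED

    def advance(self) -> None:
        # Stages are completed in order; nothing jumps straight to VERIFIED.
        if self.stage is not Stage.VERIFIED:
            self.stage = Stage(self.stage + 1)

    @property
    def closed(self) -> bool:
        # A drafted patch is not enough; only post-merge verification closes it.
        return self.stage is Stage.VERIFIED


if __name__ == "__main__":
    finding = Finding("hard-coded signing key")  # hypothetical finding
    finding.advance()  # VALIDATED
    finding.advance()  # FIX_DRAFTED
    print(finding.stage.name, finding.closed)
    finding.advance()  # VERIFIED
    print(finding.stage.name, finding.closed)
```

The design choice worth noting is that `closed` depends only on the final stage. A finding with a drafted fix still counts as open, which encodes the "half the loop" point in the last bullet.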
This is what AI-native security has to look like. Not another scanner. A reasoning layer that operates on what the AI produced.
7. The honest implication for AppSec leaders
The IOActive data should change how engineering and security leaders think about AI coding tools. Not in a fearful way. In a clear-eyed way.
The conversation can’t stay at “AI helps developers move faster.”
It has to include:
- AI generates insecure code at meaningful rates
- The failure categories are the highest-stakes ones
- Prompt-based remediation has hard ceilings
- Volume makes traditional security review impossible
- A new architectural layer is required, not optional
That’s not a reason to slow AI adoption. The velocity is real and the productivity gains are too large to give up.
It’s a reason to invest in the security architecture that has to exist alongside AI coding to make the velocity safe. The platform that reasons about what the AI produced, validates what’s actually exploitable, drafts contextual fixes, and verifies that exposure stays bounded as code keeps shipping.
8. Where this is heading
We’ve been building Neuralsec on the premise that this layer is necessary.
The IOActive data confirms the premise with numbers.
- 59% average security performance
- 31.6% fully exploitable
- 70–97% vulnerability rates in infrastructure code
- 90 vulnerabilities even with the best configuration of the best tools
The gap is real. It’s quantified. It’s not closing on its own.
The next phase of AppSec won’t be another scanner. It won’t be another LLM wrapper. It will be an architectural layer that reasons about systems, validates what matters, and keeps pace as AI continues to accelerate how software gets built.
That’s the layer Neuralsec exists to be.
At Neuralsec, we’re building the agent-based security layer for the AI-coding era. Our platform reasons about code the way an experienced security team would — finding what scanners miss, validating what’s actually exploitable, drafting fixes grounded in your system’s context, and verifying that exposure is closed.