In 2023 alone, over 29,000 CVEs were reported. That’s nearly 4,000 more than the previous year, and the trend shows no signs of slowing down. We’ve seen roughly 120,000 CVEs disclosed over the past five years. Traditional static analysis tools like CodeQL, Semgrep, and Bandit still have their place, but they share a fundamental bottleneck: they depend on manually authored rules, hand-labeled specifications, and rigid pattern matching that can’t keep pace with the creativity of real-world exploits.
Large language models are changing the equation. Not as a magic replacement, but as a force multiplier. They augment traditional analysis with natural language reasoning, cross-file context understanding, and the ability to generalize across vulnerability patterns that rule-based systems simply cannot encode. The DARPA AI Cyber Challenge (AIxCC) in 2024 was a turning point: competitors used only general-purpose LLMs (GPT, Claude, Gemini families) for vulnerability detection, reproduction, and patching. That got the security community’s attention.
This post walks through the full stack, from fine-tuning strategies and retrieval-augmented reasoning to hallucination mitigation, evaluation frameworks, and what it actually takes to deploy these systems in production.
1. LLM Fine-Tuning for Vulnerability Detection: QLoRA and Domain Adaptation
The Case for Fine-Tuning Over Prompting
One finding keeps showing up across recent research: zero-shot and few-shot prompting alone just don’t cut it for vulnerability detection. In experiments with Llama-3.1 8B on real-world datasets like BigVul and PrimeVul, simple prompting techniques failed to achieve competitive performance. You need specialized training. The question is how to do it without burning through GPU budgets.
Full fine-tuning of a billion-parameter model requires enormous GPU memory since every weight in the model gets updated. For most security teams and researchers, this is impractical. That’s where parameter-efficient fine-tuning comes in.
QLoRA: Making Fine-Tuning Accessible
QLoRA (Quantized Low-Rank Adaptation), introduced by Dettmers et al. in 2023, has become the go-to approach for efficient LLM fine-tuning in security applications. The technique works through three key innovations:
4-bit NormalFloat (NF4) quantization compresses the pretrained model from full precision (32-bit) down to 4-bit, dramatically reducing memory. A 13B-parameter model like WizardCoder that normally requires around 26 GB of VRAM drops to approximately 7 GB after quantization.
Double Quantization goes a step further by quantizing the quantization constants themselves, saving roughly 0.37 bits per parameter. That works out to about 3 GB saved for a 65B model.
Paged Optimizers use NVIDIA unified memory to handle the memory spikes that occur during gradient checkpointing with long sequences.
With the model frozen in this compressed state, LoRA adapters (small trainable matrices injected into the attention layers) are the only parameters that get updated during training. The result: only a tiny fraction of the model’s total parameters are trainable, making it feasible to fine-tune large models on consumer-grade hardware.
Here’s what a basic QLoRA setup looks like using Hugging Face’s PEFT and bitsandbytes libraries:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# 4-bit quantization config (NF4 + double quantization)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True,
)
model_name = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Prepare model for QLoRA training
model = prepare_model_for_kbit_training(model)
# LoRA adapter configuration
lora_config = LoraConfig(
r=16, # rank of the adapter
lora_alpha=32, # scaling factor
target_modules=[
"q_proj", "k_proj",
"v_proj", "o_proj",
],
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 6,924,046,336 || trainable%: 0.197
Applying QLoRA to Vulnerability Detection
Recent work has explored several fine-tuning strategies for code vulnerability detection (CVD):
Generative fine-tuning trains the model to output text labels like “Vulnerable” or “Safe” given a code snippet. This is intuitive but leaves performance on the table because the model has to generate the right token sequence rather than directly learning a classification boundary.
Classification head fine-tuning adds a feed-forward neural network on top of the LLM’s representations, with a single output neuron returning a vulnerability probability. Research on Llama-3.1 8B shows this approach significantly outperforms the generative approach, achieving an F1-score of 0.95 on BigVul. That’s comparable to or better than fine-tuned CodeBERT and UniXcoder models.
Here’s a simplified example of adding a classification head:
import torch
import torch.nn as nn
from transformers import AutoModel
class VulnClassifier(nn.Module):
def __init__(self, base_model_name, num_labels=2):
super().__init__()
self.encoder = AutoModel.from_pretrained(base_model_name)
hidden_size = self.encoder.config.hidden_size
self.classifier = nn.Sequential(
nn.Dropout(0.1),
nn.Linear(hidden_size, 256),
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(256, num_labels),
)
def forward(self, input_ids, attention_mask):
outputs = self.encoder(
input_ids=input_ids,
attention_mask=attention_mask,
)
# Use the last hidden state of the final token
# (or mean-pool across all tokens)
cls_repr = outputs.last_hidden_state[:, -1, :]
logits = self.classifier(cls_repr)
return logits
# Training loop sketch
model = VulnClassifier("deepseek-ai/deepseek-coder-7b-instruct-v1.5")
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=2e-4)
for batch in train_dataloader:
logits = model(batch["input_ids"], batch["attention_mask"])
loss = criterion(logits, batch["labels"]) # 0 = safe, 1 = vulnerable
loss.backward()
optimizer.step()
optimizer.zero_grad()
Double fine-tuning is a novel technique where the model is first fine-tuned on the full training set via QLoRA, then receives a second round of targeted fine-tuning at test time using RAG-retrieved similar examples. This two-stage approach proved to be the most performant in recent evaluations.
Domain Adaptation Considerations
Fine-tuning for security is fundamentally a domain adaptation problem. A few key decisions:
Base model selection matters more than size. Research shows that smaller but code-specialized models like DeepSeekCoder 7B can outperform much larger general-purpose models. In the IRIS evaluation framework, DeepSeekCoder 7B detected 67 vulnerabilities out of 120, nearly matching GPT-4’s performance despite being a fraction of the size.
Training data quality trumps quantity. QLoRA fine-tuning on small, high-quality datasets consistently produces state-of-the-art results. Focus on curating datasets with verified, real-world vulnerabilities rather than synthetic or ambiguous examples.
Hyperparameter sensitivity requires attention. Experiments show that including large functions in training data is strictly positive for learning, but optimal sequence length remains inconclusive. LoRA adapters are typically applied to attention projections with a rank of 16 to 64, alpha of 32, and dropout of 0.1.
2. Agentic RAG and Reasoning Validation
Beyond Naive Retrieval
Standard RAG (retrieve some documents, stuff them into a prompt, generate an answer) hits a ceiling quickly in the vulnerability detection domain. Security analysis requires multi-step reasoning: tracing data flows across files, correlating CVE descriptions with code patterns, understanding the context of why a code construct is dangerous in one situation but safe in another.
Agentic RAG architectures address this by giving the LLM agency to plan, retrieve, reason, and validate iteratively. Instead of a single retrieval step, the system operates as a reasoning loop:
- Query decomposition: The agent breaks a vulnerability query into sub-problems (e.g., “identify user-controlled input sources,” “trace data flow to sink functions,” “check for sanitization along the path”)
- Selective retrieval: Each sub-problem triggers targeted retrieval against a knowledge base of CVEs, CWE definitions, vulnerability patches, and code context
- Evidence synthesis: Retrieved evidence is integrated with the model’s reasoning to build a chain-of-thought analysis
- Self-reflection: The agent evaluates whether its current analysis is complete and consistent, triggering additional retrieval if gaps are detected
Here’s a conceptual implementation using LangChain:
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
# Build a vector store from CVE/CWE knowledge base
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vuln_knowledge_store = FAISS.load_local("./vuln_kb_index", embeddings)
# Define tools the agent can use
tools = [
Tool(
name="search_cve_database",
func=lambda q: vuln_knowledge_store.similarity_search(q, k=5),
description="Search the CVE/CWE knowledge base for vulnerability patterns, "
"known exploits, and remediation guidance.",
),
Tool(
name="analyze_code_path",
func=analyze_taint_path, # wraps a static analysis engine
description="Trace data flow from a source to a sink in the given code. "
"Input: source function, sink function, file path.",
),
Tool(
name="check_sanitization",
func=check_sanitization_fn, # custom function
description="Check whether a data-flow path includes proper input "
"sanitization or encoding before reaching the sink.",
),
]
# System prompt for the security analysis agent
AGENT_PROMPT = """You are a security vulnerability analyst. Given a code snippet,
your job is to:
1. Identify potential sources of user-controlled input
2. Trace data flows to sensitive sinks (SQL queries, OS commands, file paths)
3. Check whether proper sanitization exists along each path
4. Search the CVE/CWE knowledge base for matching vulnerability patterns
5. Provide a verdict with evidence and confidence level
Always ground your analysis in retrieved evidence. If you are uncertain,
say so and explain what additional information would help."""
agent = create_react_agent(llm, tools, AGENT_PROMPT)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=8)
# Run analysis on a suspicious code snippet
result = executor.invoke({
"input": """Analyze this Java servlet for vulnerabilities:
@WebServlet("/search")
public class SearchServlet extends HttpServlet {
protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
String query = req.getParameter("q");
String sql = "SELECT * FROM products WHERE name LIKE '%" + query + "%'";
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(sql);
// ... render results
}
}"""
})
Structured Knowledge for Security
The most effective security RAG systems go beyond flat vector stores. Recent work on Graph RAG organizes vulnerability knowledge into structured representations: attack technique taxonomies, CWE hierarchies, and code dependency graphs that enable more precise retrieval and richer reasoning.
For example, the AgCyRAG framework for cybersecurity combines semantic search over unstructured threat intelligence with SPARQL queries against structured knowledge graphs. When the initial retrieval detects suspicious patterns but lacks a mapping to known attack techniques, a specialized agent automatically queries the graph for matching TTPs (Tactics, Techniques, and Procedures).
RAG for Example Selection in Fine-Tuning
An underexplored but powerful use of RAG is during the fine-tuning process itself. Rather than randomly selecting few-shot examples, using RAG to retrieve the most semantically similar examples from the training set produces better results than both random selection and same-vulnerability-type selection. This approach, combined with test-time fine-tuning, creates a feedback loop where the model dynamically adapts to each new code sample it encounters.
3. Reducing Hallucinations and Improving Grounding
The Hallucination Problem in Security
Hallucinations in general-purpose chatbots are annoying. Hallucinations in security tools are dangerous. A model that fabricates a CVE identifier, invents a non-existent vulnerability in safe code, or confidently declares vulnerable code as safe can cause real damage. Addressing this isn’t optional. It’s a prerequisite for production deployment.
Research identifies two distinct categories relevant to security:
Knowledge-based hallucinations happen when the model lacks the factual information needed for a correct assessment. It might reference a CWE that doesn’t exist, hallucinate details about a vulnerability’s impact, or generate remediation advice based on incorrect assumptions about a library’s API.
Logic-based hallucinations arise when the model fails to maintain consistent reasoning. It might correctly identify a taint source and a sensitive sink but fail to reason about the sanitization function in between, leading to a false positive. Or worse, it misses a real vulnerability because it incorrectly assumes sanitization is happening.
Mitigation Strategies
Retrieval confidence thresholds reject low-confidence retrievals before they enter the generation prompt. If the retriever can’t find sufficiently relevant evidence (CVE records, code examples, CWE definitions), the system should acknowledge uncertainty rather than hallucinate an answer.
from dataclasses import dataclass
@dataclass
class RetrievalResult:
content: str
score: float
source: str
def grounded_analysis(query: str, retriever, llm, threshold: float = 0.75):
"""Only generate analysis when retrieval confidence is high enough."""
results: list[RetrievalResult] = retriever.search(query, k=10)
# Filter by confidence threshold
grounded_results = [r for r in results if r.score >= threshold]
if not grounded_results:
return {
"verdict": "UNCERTAIN",
"explanation": "Insufficient evidence in the knowledge base to "
"make a confident determination. Manual review "
"recommended.",
"confidence": 0.0,
}
# Build context from high-confidence retrievals only
context = "\n\n".join(
f"[Source: {r.source} | Score: {r.score:.2f}]\n{r.content}"
for r in grounded_results
)
prompt = f"""Based ONLY on the following evidence, analyze the code
for vulnerabilities. Do not speculate beyond what the evidence supports.
Evidence:
{context}
Code to analyze:
{query}
Provide your verdict, the CWE classification (if applicable), and cite
which evidence supports each claim."""
response = llm.generate(prompt)
return {
"verdict": parse_verdict(response),
"explanation": response,
"confidence": sum(r.score for r in grounded_results) / len(grounded_results),
"sources": [r.source for r in grounded_results],
}
Chain-of-thought verification forces the model to externalize its reasoning step by step, making it possible to audit whether each logical step is grounded in retrieved evidence. For security analysis, this looks like: “Step 1: The function processInput() accepts user-controlled data from the HTTP request parameter. Step 2: This data flows through buildQuery() without sanitization. Step 3: The unsanitized input is concatenated into a SQL query string. Conclusion: CWE-89 SQL Injection.”
Monte Carlo dropout provides uncertainty quantification by running inference multiple times with different dropout masks and measuring the variance in predictions. High variance indicates low confidence, a signal that the finding needs human review.
Self-RAG (reflective retrieval) adds a layer of meta-cognition where the model evaluates whether it has enough information before generating, and can trigger additional retrieval on demand. This prevents the model from confidently generating analysis on vulnerability types it hasn’t been trained on.
Human-in-the-loop escalation remains essential. The most mature production systems implement hallucination scoring: if a response’s alignment with retrieved evidence falls below a threshold, the query gets escalated to a human analyst rather than returning a potentially hallucinated result.
4. Evaluation Frameworks and Performance Metrics
Security-Specific Benchmarks
The evaluation landscape for LLM-based vulnerability detection has matured rapidly. Key benchmarks worth knowing:
| Benchmark | Focus | Languages | Granularity |
|---|---|---|---|
| CWE-Bench-Java | 120 real-world Java vulns, avg 300K LOC per project | Java | Repository |
| BigVul | Large-scale function-level CVD | C/C++ | Function |
| PrimeVul | Stricter labeling than BigVul | C/C++ | Function |
| CVE-Bench | AI agents exploiting real web app vulns | Multi | Application |
| SecBench | Multi-dimensional LLM cybersecurity eval | Multi | Varied |
| CYBERSECEVAL 3 | Meta’s broad cybersecurity LLM benchmark | Multi | Varied |
| VADER | Human-evaluated: detect, explain, and remediate | Multi | Function |
Core Metrics
Standard classification metrics (precision, recall, F1-score, accuracy) remain foundational but need careful interpretation in the security context:
Precision measures false positive rate. In security tooling, false positives are the primary driver of “alert fatigue,” where developers start ignoring security findings because too many of them are noise. High precision is critical for adoption.
Recall measures missed vulnerabilities. In security, false negatives are dangerous since a missed vulnerability is a potential breach. The tension between precision and recall is especially acute in security applications.
Beyond classification metrics, security-specific evaluation should include:
from dataclasses import dataclass
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np
@dataclass
class SecurityEvalResults:
precision: float
recall: float
f1: float
false_discovery_rate: float
hallucination_rate: float
avg_latency_ms: float
vulns_detected: int
total_vulns: int
def evaluate_vuln_detector(
y_true: np.ndarray,
y_pred: np.ndarray,
analysis_texts: list[str],
retrieved_contexts: list[str],
latencies_ms: list[float],
) -> SecurityEvalResults:
"""Comprehensive evaluation for a vulnerability detection system."""
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# False Discovery Rate: proportion of reported findings that are FP
fp = np.sum((y_pred == 1) & (y_true == 0))
tp = np.sum((y_pred == 1) & (y_true == 1))
fdr = fp / (fp + tp) if (fp + tp) > 0 else 0.0
# Hallucination rate: how often claims lack grounding in context
hallucination_count = sum(
1 for text, ctx in zip(analysis_texts, retrieved_contexts)
if not is_grounded(text, ctx) # semantic entailment check
)
hallucination_rate = hallucination_count / len(analysis_texts)
return SecurityEvalResults(
precision=prec,
recall=rec,
f1=f1,
false_discovery_rate=fdr,
hallucination_rate=hallucination_rate,
avg_latency_ms=np.mean(latencies_ms),
vulns_detected=int(tp),
total_vulns=int(np.sum(y_true)),
)
End-to-end latency matters for integration into CI/CD pipelines. A tool that takes 30 minutes per scan won’t get adopted, regardless of accuracy.
5. Dynamic vs. Static Analysis for Security Tasks
Static Analysis with LLM Enhancement
The dominant paradigm today is augmenting static analysis with LLM capabilities. IRIS, the most prominent example, uses a neuro-symbolic approach that combines traditional taint analysis with LLM-powered specification inference:
LLM-inferred taint specifications: Instead of relying on manually authored source/sink definitions, the LLM infers which methods accept user-controlled input (sources) and which perform sensitive operations (sinks). This eliminates the biggest bottleneck of traditional SAST tools: the need for human-created rules for every framework and library.
Contextual path analysis: Once static analysis identifies potential data-flow paths from sources to sinks, the LLM performs contextual analysis. It examines whether the path represents a genuine vulnerability or a false positive due to sanitization, access controls, or business logic that makes exploitation infeasible.
The results speak for themselves: on CWE-Bench-Java, CodeQL detected 27 out of 120 vulnerabilities. IRIS with GPT-4 detected 69, a 155% improvement. The approach also discovered 4 previously unknown vulnerabilities that no existing tool could find.
Here’s a simplified sketch of how the LLM-augmented taint specification inference works:
import json
TAINT_SPEC_PROMPT = """You are a security analyst. Given the following Java method
signature and its documentation, determine:
1. Is this method a SOURCE of user-controlled input? (e.g., reads from HTTP
requests, user files, network sockets, environment variables)
2. Is this method a SINK that performs a sensitive operation? (e.g., SQL query
execution, OS command execution, file system writes, HTML rendering)
3. Is this method a SANITIZER that validates or encodes input?
Method signature:
{method_signature}
Documentation/context:
{method_context}
Respond in JSON format:
{{
"is_source": true/false,
"is_sink": true/false,
"is_sanitizer": true/false,
"source_type": "HTTP_PARAM | FILE_READ | ENV_VAR | NETWORK | null",
"sink_type": "SQL_EXEC | OS_CMD | FILE_WRITE | HTML_RENDER | null",
"confidence": 0.0-1.0,
"reasoning": "brief explanation"
}}"""
def infer_taint_specs(codebase_methods: list[dict], llm) -> list[dict]:
"""Use the LLM to classify methods as sources, sinks, or sanitizers."""
specs = []
for method in codebase_methods:
prompt = TAINT_SPEC_PROMPT.format(
method_signature=method["signature"],
method_context=method.get("javadoc") or method["body"][:500],
)
response = llm.generate(prompt)
spec = json.loads(response)
spec["method"] = method["fqn"] # fully qualified name
# Only keep high-confidence specs
if spec["confidence"] >= 0.8:
specs.append(spec)
return specs
However, certain vulnerability classes remain challenging for static analysis even with LLM augmentation. OS command injection (CWE-78) involves highly intricate patterns like gadget chains and external side effects (file writes, environment modifications) that are fundamentally difficult to track without runtime information.
Dynamic Analysis and LLM-Driven Exploitation
Dynamic analysis (actually running code and observing behavior) provides ground truth that static analysis can only approximate. LLM-enhanced dynamic analysis is emerging in several forms:
Automated PoC generation: Tools like POCGEN use LLMs to understand vulnerability reports, generate candidate exploits, and iteratively refine them through dynamic testing. The LLM reasons about the input conditions needed to traverse vulnerable paths and generates executable payloads.
Fuzz testing orchestration: LLMs generate targeted test inputs based on their understanding of code structure and vulnerability patterns, moving beyond random mutation toward semantically informed fuzzing.
Runtime validation: After static analysis flags potential issues, dynamic analysis confirms or refutes them with actual execution, dramatically reducing false positives.
The Hybrid Approach
The most effective production systems will combine both:
- Static analysis (LLM-augmented) for broad coverage, scanning entire repositories to identify candidate vulnerabilities
- Dynamic analysis (LLM-orchestrated) for validation, confirming whether flagged issues are actually exploitable
- LLM synthesis to generate human-readable reports with explanations, severity assessments, and remediation guidance
This pipeline transforms raw security findings from alerts into actionable intelligence.
6. Deployment Implications and Production Readiness
Infrastructure Considerations
Model hosting: QLoRA-fine-tuned models can run on surprisingly modest hardware. A 7B model fine-tuned with 4-bit quantization fits on a single GPU with 8 GB VRAM. For production workloads, the trade-off is between model size (and corresponding accuracy) and inference cost/latency.
Air-gapped and on-premises deployment: Security-sensitive organizations often cannot send code to external APIs. On-premises deployment of fine-tuned open-source models (Llama, DeepSeekCoder, Mistral) running via Ollama or vLLM provides the required privacy guarantees while maintaining strong performance.
Here’s a quick example of serving a QLoRA-fine-tuned model locally with Ollama:
# Create a Modelfile for your fine-tuned vulnerability detector
cat > Modelfile <<EOF
FROM deepseek-coder:7b
ADAPTER ./vuln-detector-qlora-adapter
SYSTEM """You are a security vulnerability detection assistant.
Analyze code for common vulnerability patterns including SQL injection,
XSS, path traversal, command injection, and insecure deserialization.
Always cite the relevant CWE identifier and explain the attack vector."""
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF
# Build and run the model
ollama create vuln-detector -f Modelfile
ollama run vuln-detector "Analyze this code for vulnerabilities: ..."
And for higher throughput in production, here’s how you’d use vLLM with a merged QLoRA model:
from vllm import LLM, SamplingParams
# Load the merged model (base + LoRA adapter merged into one)
llm = LLM(
model="./vuln-detector-merged",
quantization="awq", # or "gptq" for quantized serving
max_model_len=8192,
gpu_memory_utilization=0.85,
)
sampling_params = SamplingParams(
temperature=0.1,
max_tokens=2048,
top_p=0.95,
)
# Batch process multiple code files for vulnerability scanning
code_files = load_changed_files_from_pr(pr_number=1234)
prompts = [build_vuln_analysis_prompt(code) for code in code_files]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
finding = parse_finding(output.outputs[0].text)
if finding.is_vulnerable:
report_to_sarif(finding)
SARIF compatibility: Modern security tools output results in SARIF (Static Analysis Results Interchange Format). LLM-augmented tools should produce SARIF-compatible output to integrate seamlessly with existing security dashboards, GitHub code scanning, and IDE integrations.
import json
def to_sarif(findings: list[dict], tool_name: str = "llm-vuln-detector") -> dict:
"""Convert LLM vulnerability findings to SARIF format."""
return {
"$schema": "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/"
"main/sarif-2.1/schema/sarif-schema-2.1.0.json",
"version": "2.1.0",
"runs": [{
"tool": {
"driver": {
"name": tool_name,
"version": "1.0.0",
"rules": [
{
"id": f["cwe_id"],
"shortDescription": {"text": f["title"]},
"helpUri": (
f"https://cwe.mitre.org/data/definitions/"
f"{f['cwe_id'].split('-')[1]}.html"
),
}
for f in findings
],
}
},
"results": [
{
"ruleId": f["cwe_id"],
"level": f.get("severity", "warning"),
"message": {"text": f["explanation"]},
"locations": [{
"physicalLocation": {
"artifactLocation": {"uri": f["file"]},
"region": {
"startLine": f["line"],
"startColumn": f.get("column", 1),
},
}
}],
"properties": {
"confidence": f.get("confidence", 0.0),
"grounded": f.get("grounded", False),
},
}
for f in findings
],
}],
}
Security of the Security Tool
Deploying an LLM in the security pipeline introduces its own attack surface:
Prompt injection: Malicious code comments could attempt to manipulate the LLM into ignoring vulnerabilities or generating misleading analysis. Input sanitization and output validation are essential.
Data poisoning: If the model’s training data or RAG knowledge base is compromised, it could learn to suppress detection of specific vulnerability patterns. Provenance tracking and integrity checks on all data sources are critical.
Model confidentiality: A fine-tuned vulnerability detection model implicitly encodes knowledge about what types of vulnerabilities an organization is most concerned about, potentially valuable intelligence for adversaries.
7. Dataset Expansion and New Language Support
The Current Data Landscape
The field has a significant data imbalance problem. C/C++ vulnerability datasets dominate, followed by Java, with other languages severely underrepresented. Datasets like BigVul and PrimeVul provide function-level granularity for C/C++, but many real-world vulnerabilities manifest at the file or repository level and involve cross-file data flows.
Scaling to More Languages
Supporting new programming languages requires language-specific training data. The CWE taxonomy is language-agnostic, but the manifestation of each CWE varies dramatically across languages. SQL injection looks very different in Python/Django versus Java/Spring versus Go.
Multi-language benchmarks have started emerging. Recent work reveals that models trained primarily on C/C++ vulnerabilities don’t transfer well to Solidity smart contracts or Rust memory safety issues. Synthetic data generation frameworks like ELTEX use domain-driven approaches to generate vulnerability data for underrepresented languages, though ensuring the examples reflect realistic patterns remains challenging.
Repository-Level Datasets
Function-level datasets are necessary but insufficient. The next frontier is repository-level vulnerability data that captures cross-file data flows (user input enters in one module, reaches a sink in another), configuration vulnerabilities (insecure defaults, missing security headers), supply chain issues (vulnerable dependencies, typosquatting), and business logic flaws that require understanding application semantics.
Building these datasets at scale is one of the hardest open problems in the field.
Looking Ahead
The trajectory is clear: LLMs will become standard components in the vulnerability detection toolkit, not as replacements for existing tools but as a reasoning layer that makes traditional analysis dramatically more effective. The research shows that combining fine-tuned LLMs with static analysis can more than double the number of detected vulnerabilities while reducing false positives.
But the work is far from done. The hallucination problem demands continued attention, particularly in an adversarial domain where attackers may deliberately craft code to confuse LLM-based defenses. Dataset coverage needs to expand beyond C/C++ and Java to match the polyglot reality of modern software. And the gap between research prototypes and production-ready tools needs systematic engineering effort.
For teams looking to get started: begin with QLoRA fine-tuning of a code-specialized model on your domain’s vulnerability data, integrate it as a second-pass filter on existing SAST results, and measure relentlessly. The evaluation frameworks exist. The benchmarks are available. The question is no longer whether LLMs can help with vulnerability detection. It’s how quickly your organization can build the capability to deploy them effectively.
Links and Resources
Key Papers
- QLoRA: Efficient Finetuning of Quantized LLMs – Dettmers et al., 2023
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities – Li et al., 2024
- Llama-based Source Code Vulnerability Detection: Prompt Engineering vs Fine-tuning – 2024
- LLMs in Software Security: A Survey of Vulnerability Detection Techniques – ACM Computing Surveys, 2025
- Generative AI in Cybersecurity: A Comprehensive Review – ScienceDirect, 2025
- Hallucination Mitigation for Retrieval-Augmented LLMs: A Review – MDPI, 2025
- LLM-Driven SAST-Genius: A Hybrid Static Analysis Framework – 2025
- Lightweight LLMs for Network Attack Detection in IoT – ComComAp, 2025
Tools and Frameworks
- IRIS SAST – LLM-assisted static analysis
- CWE-Bench-Java – Benchmark dataset
- Hugging Face PEFT – Parameter-efficient fine-tuning
- bitsandbytes – Quantization library
- QLoRA Reference Implementation
- Awesome-LLM4Cybersecurity – Curated paper list
- LangChain – Agentic RAG framework
- Ollama – Local model serving
- vLLM – High-throughput inference engine
- FAISS – Vector similarity search
Datasets
- BigVul – C/C++ vulnerability dataset
- PrimeVul – Curated C/C++ vulnerabilities
- OWASP Benchmark – SAST evaluation suite
- MITRE CWE Database – Vulnerability classification
- NVD (National Vulnerability Database) – CVE records
Benchmarks for LLM Security Evaluation
- CYBERSECEVAL 3 – Meta’s cybersecurity LLM benchmark
- SecBench – Multi-dimensional cybersecurity benchmark
- CVE-Bench – Agent exploitation benchmark
- VADER – Human-evaluated detection + remediation
- AthenaBench – Dynamic CTI benchmark
- SecureAgentBench – Secure code generation under realistic scenarios