CodeCome — AI-assisted vulnerability research without losing the trail

the problem

Chat is not an audit trail.

After watching too many chat sessions produce confident-sounding "potential SQL injection" claims with zero evidence, the goal became simple: a workflow where every claim is a file on disk, every file points at specific lines of code, every finding either has evidence or gets rejected, and the whole thing is reviewable by a human in an afternoon.

No database. No RAG. No disappearing chat history. Everything lives on disk as Markdown and YAML you can grep.
Every claim becomes an artifact. A hypothesis is a file. A confirmation is a file. A PoC is a file.
Hypotheses are not confirmed bugs. A plausible vulnerability is first a hypothesis — confirmation requires evidence.
Impact must be demonstrated. Without a PoC, developers dismiss findings as theoretical. Phase 5 is where impact is shown.

what CodeCome is

Research methodology, made executable.

CodeCome is not a vulnerability scanner. It is not a pentest tool. It is not a magic AI bug finder. It is a harness — a set of conventions, prompts, agents and Make targets — that encodes how a careful researcher actually audits code. The model helps you think. You stay in control.

01

Conventions over magic

A workspace layout, naming scheme, finding template and Make targets. Nothing you can't read in an afternoon.

02

Prompts as code

Each phase has explicit prompts checked in under prompts/. Fork, audit and version them like any other code.

03

Evidence over vibes

A finding is not "real" until it has been counter-argued, validated in a sandbox, and reproduced from an artifact on disk.

how it works

Drop source under src/, configure codecome.yml, run the phases.

src/ can be a copied source tree, a git submodule, a checked-out repo, an extracted archive, or a benchmark corpus. CodeCome doesn't care which. The harness will try to build, test and run the target inside the sandbox — that's the point. Validation happens against a real build.

codecome.yml

# Project + audit configuration. Defaults work out of the box.
project:
  name: my-target

audit:
  scope:
    include: ["src/**"]
    exclude: ["src/vendor/**", "src/**/*_test.*"]
  focus: [sqli, ssrf, deserialization, authz]
  extra_prompts:
    reconnaissance: |
      Focus sandbox on ASAN builds.

agents:
  auditor:
    model:   anthropic/claude-opus-4-7
    variant: high
  reviewer:
    model:   anthropic/claude-haiku-4-5

validation:
  allowed_write_paths: ["itemdb/**", "sandbox/**", "tmp/**"]
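
To illustrate how include and exclude globs like those under audit.scope can narrow a file set, here is a minimal sketch. It assumes fnmatch-style matching, where `*` also crosses directory separators; how CodeCome itself resolves these patterns is not stated here, and `in_scope` is a hypothetical helper name.

```python
from fnmatch import fnmatch

# Patterns in the style of audit.scope. The matching rules are an
# assumption for illustration, not CodeCome's documented behavior.
INCLUDE = ["src/**"]
EXCLUDE = ["src/vendor/**"]


def in_scope(path: str, include=INCLUDE, exclude=EXCLUDE) -> bool:
    """Return True when path matches an include glob and no exclude glob."""
    included = any(fnmatch(path, pattern) for pattern in include)
    excluded = any(fnmatch(path, pattern) for pattern in exclude)
    return included and not excluded


if __name__ == "__main__":
    for candidate in ["src/app/db.php", "src/vendor/lib/x.php", "docs/readme.md"]:
        print(candidate, in_scope(candidate))
```

Excludes win over includes in this sketch, which is the usual convention for scope filters.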

$ run your first audit — 8 commands

# bootstrap a virtual env and install deps
$ make venv

# sanity-check the workspace, model creds, sandbox
$ make check

# six phases of an audit — each restartable
$ make phase-1                        # recon + sandbox bootstrap
$ make phase-2                        # generate candidate findings
$ make phase-3                        # counter-analysis (dedup / reject)
$ make phase-4 FINDING=CC-0001        # validate one finding
$ make phase-5 FINDING=CC-0001        # build a PoC, demonstrate impact
$ make phase-6                        # generate the report

# walk ONE finding end-to-end first — you'll learn more
# from a single CC-0001 than from twenty PENDING ones.

A flat, explicit workflow

1. Mount. Put the source you want to audit under src/ — copied, submodule, symlink, your call.
2. Configure. Tune audit.scope and audit.focus in codecome.yml; pick a model per agent under agents.<name>.model.
3. Run the phases. Phases 1–3 are batch; phases 4–5 are per finding, so evidence stays traceable.
4. Inspect on disk. Findings, evidence and reports live under itemdb/ as plain files — tree itemdb/ is the dashboard.
5. Ship. Generate the report with make phase-6 or hand the workspace to a teammate.

Avoid make validate-all / exploit-all on a fresh project. Walk one finding through end-to-end first.

Read the workflow guide

Bias a phase without forking

Three layered ways to append instructions to any phase prompt. All additive, applied in this order:

YML

audit.extra_prompts.<phase> in codecome.yml — persistent project-wide policy.

FILE

make phase-1 PROMPT_EXTRA_FILE=my-notes.md — versioned notes you keep around.

ENV

make phase-1 PROMPT_EXTRA="…" — one-shot inline bias.
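
The layering above can be pictured as simple ordered concatenation. The sketch below is an assumption about the mechanics, not the harness's actual code: `compose_prompt` is a hypothetical name, and the blank-line separator is chosen for illustration.

```python
def compose_prompt(base: str, yml_extra: str = "", file_extra: str = "",
                   env_extra: str = "") -> str:
    """Append non-empty extras to the base prompt in YML, FILE, ENV order."""
    parts = [base.strip()]
    for extra in (yml_extra, file_extra, env_extra):
        # Skip layers that were not supplied for this run.
        if extra and extra.strip():
            parts.append(extra.strip())
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(compose_prompt(
        "You are the recon agent.",
        yml_extra="Focus sandbox on ASAN builds.",
        env_extra="Skip the docs/ folder.",
    ))
```

Because every layer is additive, a one-shot bias never replaces the project-wide policy; it only lands after it.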

six phases · six agents

The audit, broken into six discrete steps.

Each phase has its own prompt under prompts/, its own agent under .opencode/agents/, its own outputs, and writes to a known location on disk. Phases 1–3 are batch operations; phases 4 and 5 run per finding — intentional, to keep evidence traceable.

PHASE 01

Recon + sandbox bootstrap

1a: agent reads src/, infers target type, languages, build model, attack surface — notes under itemdb/notes/. 1b: picks a baseline from templates/sandboxes/, applies it to sandbox/, validates it, writes itemdb/notes/sandbox-plan.md.

recon · sandbox bootstrap

PHASE 02

Hypothesis

Generate candidate findings under itemdb/findings/PENDING/. Each points at specific code, sources, sinks and a trust boundary. Gated by the sandbox.

+ deep sweep

Pair Phase 2 with make sweep to force one auditor session per high-risk file.

PHASE 03

Counter-analysis

A reviewer pass tries to disprove or deduplicate findings. Looks for unreachable code, input validation, authorization, framework protections, false assumptions. Weak hypotheses → REJECTED/, repeats → DUPLICATE/.

REJECTED · DUPLICATE

PHASE 04

Validation

One finding at a time, inside the Docker sandbox. Build the target, write a small PoC, capture evidence under itemdb/evidence/<id>/, decide CONFIRMED or REJECTED.

CONFIRMED · sandbox

PHASE 05

Exploit

Build a real PoC that shows concrete impact: code execution, data exfiltration, privilege escalation. The exploiter may adjust severity based on what is demonstrated and move the finding to EXPLOITED/. Artifacts under evidence/<id>/exploits/.

EXPLOITED · severity_after

PHASE 06

Reporting

Generate a Markdown report grouping exploited and confirmed findings with evidence references. Default path: itemdb/reports/report.md. make report is the lightweight local variant (no agent).

report

finding lifecycle

Five states. Explicit transitions.

Every finding lives in exactly one folder, named after its current state. Phase 3 moves weak or repeated hypotheses to REJECTED or DUPLICATE. Phase 4 promotes a finding to CONFIRMED or moves it to REJECTED. Phase 5 may promote it to EXPLOITED. You can also move findings manually with make findings-move.

PENDING

Hypothesis filed

An idea worth investigating. Filed by Phase 2 (or sweep). Not yet validated.

CONFIRMED

Reproduced in sandbox

Validation captured evidence. Real bug, not weaponized yet.

EXPLOITED

Impact demonstrated

A reproducible exploit exists under evidence/<id>/exploits/. Severity adjusted by Phase 5.

REJECTED

Falsified

Counter-analysis or validation killed the hypothesis. Kept on disk so it isn't re-investigated for free.

DUPLICATE

Already tracked

Same root cause as another finding. Linked, not deleted.

PENDING → CONFIRMED: evidence captured in sandbox (Phase 4)

PENDING → REJECTED: disproved by counter-analysis or validation (Phase 3 / Phase 4)

PENDING → DUPLICATE: already filed under another finding (Phase 3)

CONFIRMED → EXPLOITED: working PoC under exploits/ (Phase 5)

CONFIRMED → (stays): not feasible to weaponize; documented and kept
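
The transitions listed above form a small state machine. As a sketch, they can be written as a lookup table; `ALLOWED` and `can_move` are hypothetical names, and manual moves through make findings-move may well permit more than this table does.

```python
# Phase-driven transitions only, as listed in the lifecycle above.
ALLOWED = {
    "PENDING": {"CONFIRMED", "REJECTED", "DUPLICATE"},
    "CONFIRMED": {"EXPLOITED"},
    "EXPLOITED": set(),
    "REJECTED": set(),
    "DUPLICATE": set(),
}


def can_move(current: str, target: str) -> bool:
    """Return True when a phase is allowed to move a finding to target."""
    if current not in ALLOWED:
        raise ValueError(f"unknown state: {current}")
    return target in ALLOWED[current]


if __name__ == "__main__":
    print(can_move("PENDING", "CONFIRMED"))
    print(can_move("PENDING", "EXPLOITED"))
```

Note that PENDING cannot jump straight to EXPLOITED: a finding has to be confirmed in the sandbox before impact is demonstrated.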

what a finding looks like

Plain Markdown. Structured YAML. Real evidence.

A finding is a single Markdown file with validated YAML frontmatter. It is the unit of work in CodeCome, not a Jira ticket or a row in a database. Below: the rendered view of CC-0022 (SQL injection in a PHP app's user.get JSON-RPC API), followed by the raw source on disk.

CC-0022

SQL injection via unvalidated selectRole in user.get

CRITICAL · EXPLOITED · CWE-89

Category: SQL Injection

Target area: JSON-RPC API user.get method

File: ui/include/classes/api/services/CUser.php

Symbol: CUser::addRelatedObjects()

Source: JSON-RPC options['selectRole']

Sink: DBselect() · CUser.php:2243-2248

Trust boundary: authenticated API user → raw SQL SELECT clause

Severity: HIGH → CRITICAL (after exploit)

Phases: recon · hypothesis · counter · validation · exploit

Evidence

itemdb/evidence/CC-0022/

itemdb/evidence/CC-0022/exploits/

# Summary

The user.get JSON-RPC API accepts a selectRole array whose elements are concatenated into a SQL SELECT clause via implode(',r.', ...) without any allowlist check. Authenticated users at the lowest privilege level can inject SQL fragments and extract data from the database.

# Counter-analysis

- Argued: CApiInputValidator would catch this.
- Outcome: not used on this code path. Verified at line 91.
- Argued: dbConditionInt() sanitises arrays.
- Outcome: only sanitises $userIds, not the SELECT clause.

# Validation plan

1. Send a user.get JSON-RPC request as a low-privilege user with selectRole: ["roleid,(SELECT version())"].
2. Observe the version string returned inline in the response.
3. Evidence under itemdb/evidence/CC-0022/.

itemdb/findings/EXPLOITED/CC-0022-sqli-user-get.md

---
id:            "CC-0022"
title:         "SQL injection via unvalidated selectRole in user.get JSON-RPC API"
status:        "EXPLOITED"
severity:      "CRITICAL"
confidence:    "CONFIRMED"
category:      "SQL Injection"
cwe:           ["CWE-89"]
language:      "php"
target_area:   "JSON-RPC API user.get method"
files:
  - "src/app-1.4.1/ui/include/classes/api/services/CUser.php"
symbols:
  - "CUser::addRelatedObjects()"
sources:
  - "JSON-RPC options['selectRole'] parameter"
sinks:
  - "DBselect() at CUser.php:2243-2248"
trust_boundary: "authenticated API user -> raw SQL SELECT clause"
validation:
  status:       "CONFIRMED"
  methods:      ["http_exploit", "runtime_reproduction"]
  evidence_dir: "itemdb/evidence/CC-0022"
exploitation:
  status:           "DEMONSTRATED"
  severity_before:  "HIGH"
  severity_after:   "CRITICAL"
  artifacts_dir:    "itemdb/evidence/CC-0022/exploits"
---

# Summary

The user.get JSON-RPC API accepts a selectRole array
whose elements are concatenated into a SQL SELECT clause without any
allowlist check. Low-privilege authenticated users can inject SQL.

Why files, not a database

A vulnerability research project should still be readable in five years. Files survive renaming, forking, GitHub outages and SQL migrations. grep, git log and diff are the only tools you need.

YAML you can validate

Run make frontmatter to validate every finding's metadata via tools/check-frontmatter.py. Bad frontmatter fails fast. The Markdown body stays free-form so researchers aren't fighting a form.
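
As a rough idea of what a frontmatter check can look like, the sketch below reads top-level scalar keys with the standard library only and flags two kinds of problem. It is not tools/check-frontmatter.py: the required-key set, the rule that status must equal the folder name, and the function names are all assumptions made for illustration, and a real validator would use a proper YAML parser.

```python
import re
from pathlib import Path

# Hypothetical minimum set of keys, chosen for this sketch only.
REQUIRED_KEYS = ("id", "title", "status", "severity")

# Matches unindented `key: "value"` or `key: value` lines; lists and
# nested mappings are deliberately ignored.
SCALAR = re.compile(r'^([A-Za-z_]+):\s*"?([^"]*)"?\s*$')


def read_top_level_scalars(text: str) -> dict:
    """Extract top-level scalar fields from the leading --- block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening --- line")
    end = next((i for i, line in enumerate(lines[1:], start=1)
                if line.strip() == "---"), None)
    if end is None:
        raise ValueError("missing closing --- line")
    fields = {}
    for line in lines[1:end]:
        match = SCALAR.match(line)
        if match and match.group(2).strip():
            fields[match.group(1)] = match.group(2).strip()
    return fields


def check_finding(path: Path) -> list:
    """Return a list of problems found in one finding file."""
    fields = read_top_level_scalars(path.read_text(encoding="utf-8"))
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in fields]
    folder = path.parent.name
    # Assumed rule: the status field mirrors the state folder.
    if "status" in fields and fields["status"] != folder:
        problems.append(f"status {fields['status']} does not match folder {folder}")
    return problems
```

Calling `check_finding(Path("itemdb/findings/EXPLOITED/CC-0022-sqli-user-get.md"))` on a well-formed file would return an empty list under these assumptions.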

Tooling that travels

Findings render in any Markdown viewer — GitHub, Obsidian, an editor preview pane. The "dashboard" is just tree itemdb/findings/ or make status.

sandbox validation

Validation happens in a sandbox.

Before a finding is marked CONFIRMED, CodeCome reproduces it against a real build of the project — in a Docker container, behind a network namespace, away from your host. Phase 1b bootstraps a sandbox suited to the stack; if the payload doesn't fire there, it doesn't make the cut.

Bootstrap the sandbox image (Phase 1b)

Pick baseline from templates/sandboxes/<id>/, apply, validate

Run the payload (Phase 4)

Script under itemdb/evidence/<id>/ · captured stdout / stderr / status

Score the result

Exit code · log signature · response body match · timing

CC-0022 PASS · CONFIRMED

CC-0009 FAIL · REJECTED

$ make phase-4 FINDING=CC-0022

→ sandbox already bootstrapped (Phase 1b)
→ starting container       … healthy
→ replaying payload        … itemdb/evidence/CC-0022/exploit.sh

  POST /api_jsonrpc.php HTTP/1.1
  Authorization: Bearer <low-priv-token>
  Body: {"method":"user.get","params":{"selectRole":["roleid,(SELECT version())"]}}

  HTTP/1.1 200 OK
  Body contains: "10.6.18-MariaDB"

→ assertions
  ✓ status 200
  ✓ response inlines server version()
  ✓ query log shows injected SELECT in r.* clause

→ result: CONFIRMED
→ moved itemdb/findings/PENDING/CC-0022 → itemdb/findings/CONFIRMED/CC-0022
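
The assertions in the transcript above amount to a handful of boolean checks over captured output. A minimal sketch of that scoring step follows; `Observation`, `score`, and the rule that any failed check means REJECTED are assumptions for illustration, not the validator agent's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Captured result of one payload replay (hypothetical structure)."""
    status_code: int
    body: str
    query_log: str


def score(obs: Observation, body_marker: str, log_marker: str) -> tuple[str, list[str]]:
    """Return a verdict plus the names of any failed checks."""
    checks = {
        "status 200": obs.status_code == 200,
        "response contains marker": body_marker in obs.body,
        "query log contains marker": log_marker in obs.query_log,
    }
    failed = [name for name, ok in checks.items() if not ok]
    # Assumed policy: every check must pass for the finding to be confirmed.
    return ("CONFIRMED" if not failed else "REJECTED", failed)


if __name__ == "__main__":
    obs = Observation(200, '{"result": "10.6.18-MariaDB"}',
                      "SELECT r.roleid,(SELECT version()) FROM role r")
    print(score(obs, "MariaDB", "SELECT version()"))
```

Returning the failed check names alongside the verdict keeps a rejection explainable from the evidence folder alone.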

Per-capability sandbox helpers exist as separate targets: sandbox-list, sandbox-detect, sandbox-inspect ID=python, sandbox-bootstrap, sandbox-validate, sandbox-regenerate, sandbox-status, plus runtime helpers sandbox-{setup,up,check,build,test,down,shell,logs,clean,reset}. See docs/sandbox.md.

make sweep · the secret weapon

When breadth isn’t enough, sweep file-by-file.

Phase 2 is wide and fast. make sweep is the opposite: it runs the auditor agent once per file, forcing exhaustive line-by-line analysis on every high-risk file in itemdb/notes/file-risk-index.yml. It catches what broad audits miss; Phase 3 cleans the overlap.

deep sweep

# preview which files would be swept
$ make list-risk-files

# dry-run: show selected files + prompts, no agent calls
$ python tools/run-sweep.py --dry-run

# sweep everything scoring 4+ in file-risk-index.yml
$ make sweep

# sweep a specific file…
$ make sweep FILE="src/path/to/file.ext"

# …or a glob
$ make sweep FILE="src/**/*.cs"
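
Selecting the sweep set is essentially a threshold filter over the risk index. The sketch below assumes each index entry carries a `path` and a numeric `score`; the real layout of file-risk-index.yml is documented in docs/file-risk-sweeps.md and may differ, and `select_for_sweep` is a hypothetical name.

```python
def select_for_sweep(entries: list, min_score: int = 4) -> list:
    """Keep entries at or above min_score, highest risk first."""
    selected = [entry for entry in entries if entry["score"] >= min_score]
    # Sort by descending score, then by path for a stable order.
    return sorted(selected, key=lambda entry: (-entry["score"], entry["path"]))


if __name__ == "__main__":
    index = [
        {"path": "src/api/user.php", "score": 5},
        {"path": "src/util/strings.php", "score": 2},
        {"path": "src/api/auth.php", "score": 4},
    ]
    for entry in select_for_sweep(index):
        print(entry["score"], entry["path"])
```

Since each selected file costs one full agent session, the length of the returned list is a direct estimate of how many sessions a sweep will start.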

Trade-offs to know

One full agent session per file. Token cost scales linearly with the number of files swept — sweep on 10 files costs roughly 10 Phase-2 runs.
Produces overlap with Phase 2. By design. Phase 3 deduplicates on semantic frontmatter fields (sources, sinks, entry_points, trust_boundary, target_area).
Always --dry-run first. See what would be swept and the per-file prompts before committing tokens.
Reads itemdb/notes/file-risk-index.yml written by Phase 1. Without a fresh recon, the sweep set is stale.
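
To make the idea of deduplicating on semantic fields concrete, here is a deliberately crude sketch that groups findings whose normalized field values are identical. The reviewer agent in Phase 3 reasons about root cause rather than comparing strings, so exact matching is only an approximation; `dedup_key` and `group_duplicates` are hypothetical names.

```python
SEMANTIC_FIELDS = ("sources", "sinks", "entry_points", "trust_boundary", "target_area")


def dedup_key(frontmatter: dict) -> tuple:
    """Build a comparable key from the semantic frontmatter fields."""
    key = []
    for field in SEMANTIC_FIELDS:
        value = frontmatter.get(field, "")
        if isinstance(value, str):
            value = [value]
        # Normalize case and whitespace so trivial differences do not matter.
        key.append(tuple(sorted(item.strip().lower() for item in value)))
    return tuple(key)


def group_duplicates(findings: dict) -> list:
    """Return groups of finding ids that share the same key."""
    groups = {}
    for finding_id, frontmatter in findings.items():
        groups.setdefault(dedup_key(frontmatter), []).append(finding_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

A group returned here would only be a candidate for DUPLICATE; the decision itself still belongs to the reviewer pass.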

docs/file-risk-sweeps.md

screenshots

What CodeCome actually looks like.

Sanitized snapshots from real audits: enough to show the workflow, not enough to leak target-specific exploit details or credentials.

Finding queue

Reviewable hypotheses awaiting validation.

on disk ↗

Agent workflow

Agentic, but auditable from the file system up.

six agents ↗

Sandbox validation

Validation before belief — inside Docker.

phase 4 ↗

Evidence artifacts

Every confirmed claim leaves files behind.

disk-resident ↗

Generated helpers

Sandbox scripts produced on demand per finding.

on demand ↗

Exploit notes

Readable PoC writeups, not chat scrollback.

phase 5 ↗

Counter-analysis

Try to disprove first — then validate.

phase 3 ↗

Impact summary

Exploited findings with linked artifacts.

phase 6 ↗

An asciinema cast of a full run is planned.

who it's for

Built for people who already do this work.

CodeCome won't turn a non-researcher into one. It will save a researcher hours of bookkeeping per audit. If you want a one-click vulnerability scanner, this is not it. CodeCome is for people who want the model to help them think, not to replace the thinking.

Solo security researchers

LLM help on source-code audits — without trusting an opaque chat

Audit codebases at your own pace, with a trail you can re-read months later or hand to someone else.

Blue + Red teamers

Source-code review that produces commit-friendly artifacts

From recon to PoC, every step lands in the workspace as a Markdown finding with evidence references, ready to ship.

LLM-assisted security studies

An instrumented harness you can fork or A/B

Intentionally simple — fork the prompts, swap the agent runner, compare runs across models. The harness is the experimental surface.

prerequisites

What you need before running it.

CodeCome runs on top of OpenCode (1.14.39 or newer) with your own LLM provider, plus a small Python + Make + Docker stack. make check warns about anything missing — the core workflow runs without the optional tools.

required core stack · every audit needs these

OpenCode 1.14.39+

The open-source AI coding agent CodeCome drives. Install guide.

An LLM provider key

At least one of Anthropic, OpenAI, Google, xAI, Groq, Cerebras, GitHub Copilot, Google Vertex — or a local OpenAI-compatible endpoint. Provider setup.

Python 3.10+

For workspace tooling. make venv creates a local virtualenv at .venv/.

GNU Make

The entire workflow is driven through make targets.

Docker

Required for the sandboxed validation environment used by Phases 1b / 4 / 5.

optional for Phase 5 visual evidence

asciinema

Terminal recordings of exploit replays.

agg

Renders .cast files to GIFs. CodeCome falls back to a Docker container if missing.

ffmpeg + xvfb

For GUI / browser exploits where video evidence matters. xvfb-run is fine too.

safety considerations

Treat unknown source code as data, not safe input.

Risks worth knowing about

Prompt injection from the target

Comments, docstrings, READMEs, test fixtures, log strings, commit messages, filenames, and even crafted binary blobs inside src/ can carry instructions aimed at the agent ("ignore previous instructions…", "exfiltrate $HOME/.ssh/…"). The agent reads all of this as input, and LLMs remain susceptible to instructions embedded in it.

Supply-chain hazards in the sandbox

Phase 1b will try to build and run the target. A malicious setup.py, package.json lifecycle hook, Makefile, Dockerfile, or configure script executes inside the sandbox container with whatever permissions Docker gives it.

Resource exhaustion and side effects

Adversarial code may try to consume CPU, disk, or network from the validation phase. A prompt-injected runaway agent loop can burn tokens just as easily.

Exfiltration via network

If the sandbox (or your host) can reach the internet, an injected agent or a malicious build step can attempt to send data out. The default policy assumes egress is possible.

Recommended precautions

Run the whole workspace inside an isolation boundary when auditing untrusted sources — a disposable VM (Multipass, Vagrant, UTM, Proxmox), a dedicated container, or a remote throwaway host. Do not run CodeCome on a machine that holds credentials, SSH keys, browser profiles, or production access you can't afford to lose.

Treat src/ as untrusted. CodeCome funnels execution through sandbox/, but the make runner itself, the agent, and any helper scripts still execute on the host.

Restrict network egress from the sandbox (and ideally from the outer VM) to only what you need for builds and package installs.

Use a fresh API key with low spend limits for the LLM provider, so a prompt-injected runaway loop can't rack up an unbounded bill.

Review what the agent writes under itemdb/, sandbox/ and tmp/ before trusting any of it. Findings, evidence and reports are all attacker-influenced when the target is untrusted.

Avoid make exploit-all and make validate-all on untrusted targets until you have walked at least one finding through manually and confirmed the sandbox behaves the way you expect.

CodeCome's sandbox is a containment aid, not a security boundary against a determined attacker. If you wouldn't be willing to run docker build and ./run-tests.sh from the target's repo on the host, you shouldn't run CodeCome against it on the host either.

workspace layout

One repo. Everything on disk.

A CodeCome workspace is a normal git repo with a small, fixed set of folders. The heart is itemdb/ — findings, evidence, notes and reports. Drop it into any IDE and read it as code.

~/research/my-target

▸ my-target/
├─ README.md
├─ AGENTS.md        agent rules
├─ codecome.yml     project + audit config
├─ src/             target source code
├─ sandbox/         Docker validation env
├─ itemdb/          heart of the audit
│  ├─ notes/        recon, sandbox-plan
│  ├─ findings/
│  │  ├─ PENDING/
│  │  ├─ CONFIRMED/
│  │  ├─ EXPLOITED/
│  │  ├─ REJECTED/
│  │  └─ DUPLICATE/
│  ├─ evidence/     PoCs · exploits/
│  └─ reports/      report.md
├─ runs/            run summaries + transcripts
├─ templates/       finding, evidence, sandbox templates
├─ tools/           Python helper scripts
├─ prompts/         per-phase prompts
├─ docs/            deeper documentation
└─ .opencode/       agents + skills

Why this shape

Code under audit lives under src/. Vendor it, submodule it, or symlink it — your call.
itemdb/ is the audit. Everything generated about the target — notes, findings, evidence, reports — lives there.
Finding states are folders. No DB, no SaaS. State is "which folder is this file in".
Prompts and sandbox are first-class. They live in the repo and ship with the audit.
.opencode/agents/ holds the six agents: recon, auditor, reviewer, validator, exploiter, reporter.
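
Because state is just the folder a file sits in, a status summary needs nothing more than a directory walk. The sketch below counts Markdown files per state folder; it is an illustration of the idea, not the implementation behind make status, and `count_findings` is a hypothetical name.

```python
from pathlib import Path

STATES = ("PENDING", "CONFIRMED", "EXPLOITED", "REJECTED", "DUPLICATE")


def count_findings(findings_root: Path) -> dict:
    """Count Markdown finding files in each state folder."""
    counts = {}
    for state in STATES:
        folder = findings_root / state
        # A missing folder simply counts as zero findings.
        counts[state] = len(list(folder.glob("*.md"))) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    for state, total in count_findings(Path("itemdb/findings")).items():
        print(f"{state:<10} {total}")
```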

model strategy

Bring your own model. Pin per agent.

CodeCome lets you pin a different model for every phase via agents.<name>.model in codecome.yml (or CODECOME_MODEL on the command line). Different models see different bugs — running Phase 2 with two providers and letting Phase 3 dedupe is a legitimate strategy.

Reasoning-heavy for P2 & P5

Audit and exploit phases benefit from reasoning models — Opus, GPT reasoning variants, Gemini Pro reasoning. Pin with agents.auditor.model: anthropic/claude-opus-4-7.

Fast workhorses for P3 & P6

Counter-analysis and reporting are well-served by smaller, cheaper models. Mix freely; no requirement to run a single provider end-to-end.

Local mode

Point any agent at an OpenAI-compatible local endpoint (Ollama, vLLM, llama.cpp). No code leaves your machine. Use make show-model to print the resolution table per agent.

The model helps you think. You stay in control. Nothing is committed to a finding folder without an explicit phase being run.
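
Per-agent pinning implies a small resolution step: look for an override, then fall back to the agent's entry in codecome.yml. The sketch below assumes the CODECOME_MODEL override takes precedence over the config file, which is a guess about the order; `resolve_model` is a hypothetical name and make show-model remains the authoritative view.

```python
import os


def resolve_model(agent: str, config: dict, env=None) -> tuple[str, str]:
    """Return (model, source) for an agent under the assumed precedence."""
    env = os.environ if env is None else env
    override = env.get("CODECOME_MODEL")
    if override:
        # Assumption: a command-line or environment override wins.
        return override, "CODECOME_MODEL"
    model = config.get("agents", {}).get(agent, {}).get("model")
    if model:
        return model, f"agents.{agent}.model"
    raise LookupError(f"no model configured for agent {agent!r}")


if __name__ == "__main__":
    config = {"agents": {"auditor": {"model": "anthropic/claude-opus-4-7"}}}
    print(resolve_model("auditor", config, env={}))
```

Returning the source next to the value mirrors the idea of a resolution banner: when a run feels off, the first question is where the model name came from.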

styled wrapper

Tool calls rendered as panels — not as JSON soup.

By default, phase targets wrap opencode run --format json with a CodeCome-owned styled renderer. Assistant output, tool calls and tool results render with consistent colors and structure. It also routes plain bash invocations (cat, head, tail, rg, ls, find, tree, rtk …) through the matching styled renderer.

environment toggles

# bypass the styled wrapper entirely
CODECOME_USE_WRAPPER=0

# control --thinking on the provider call
CODECOME_THINKING=1            # force on
CODECOME_THINKING=0            # force off (don't pay for reasoning tokens)

# model resolution
CODECOME_MODEL=anthropic/claude-opus-4-7
CODECOME_MODEL_VARIANT=high

# surfaces
CODECOME_RENDER_REASONING=0     # hide on-screen Thinking panels
CODECOME_SANDBOX_RENDER=0       # disable structured Sandbox panel
CODECOME_BASH_SHIM_RENDER=0     # disable rtk/cat/head/tail/rg routing

# budgets
CODECOME_BOOTSTRAP_MAX_RETRIES=3
CODECOME_REASONING_MAX_CHARS=4000

# forward extra flags to opencode run
OPENCODE_ARGS="--max-tokens 8192"

What it gives you

Per-tool panels. read, write, edit, apply_patch, grep, glob, bash, todowrite, skill all get their own renderer.
Sandbox panel. Detects tools/sandbox-bootstrap.py --format json calls and renders capability tables, validation tier summaries and color-coded gate badges.
Per-provider --thinking defaults. Anthropic off (interleaved), most others on. Override with CODECOME_THINKING.
Model resolution banner. Every phase prints which model it actually picked and where the value came from. Useful when a run feels off.
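
The CODECOME_THINKING toggle and the per-provider defaults combine into a three-way decision. The sketch below simplifies "most others on" to "every non-Anthropic provider on", which is an assumption; the real table is hand-maintained in the tooling, and `thinking_enabled` is a hypothetical name.

```python
def thinking_enabled(provider: str, env_value=None) -> bool:
    """Decide whether to request thinking for a provider call."""
    if env_value == "1":
        return True   # forced on
    if env_value == "0":
        return False  # forced off
    # No override: Anthropic defaults to off, others to on (simplified).
    return provider != "anthropic"


if __name__ == "__main__":
    print(thinking_enabled("anthropic"))
    print(thinking_enabled("anthropic", "1"))
    print(thinking_enabled("openai", "0"))
```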

docs/development.md

local helper commands

Everything you need without an agent.

A handful of make targets cover day-to-day workspace bookkeeping — no LLM call required.

make help

Show all available commands.

make check

Validate workspace + model creds + Docker reachability.

make status

Show finding status counts.

make findings

List findings (filter with STATUS=PENDING).

make findings-create TITLE="…"

Create a new finding skeleton.

make findings-move FINDING=CC-0001 STATUS=CONFIRMED

Move a finding between status folders.

make findings-evidence FINDING=CC-0001

Create the evidence directory for a finding.

make next-id

Print the next free finding id.

make frontmatter

Validate finding frontmatter via tools/check-frontmatter.py.

make index

Regenerate the finding index.

make report

Regenerate the lightweight local report (no agent).

make list-risk-files

Top-scoring risky files from the risk index.

make show-model [AGENT=auditor]

Print the model resolution table for an agent.

make itemdb-reset

Reset local audit artifacts (destructive; back up prior work elsewhere first).

make tests

Run the Python test suite under tests/.

make sandbox-shell

Open an interactive shell inside the sandbox container.
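
Several of these helpers are thin bookkeeping over file names. As one example, a next free id can be derived from the existing finding files; the sketch below takes the highest CC number and adds one, which is an assumption about how make next-id behaves (it might fill gaps instead), and `next_finding_id` is a hypothetical name.

```python
import re


def next_finding_id(existing_names: list, prefix: str = "CC", width: int = 4) -> str:
    """Return the id after the highest one found in existing_names."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)")
    numbers = [int(m.group(1)) for name in existing_names if (m := pattern.match(name))]
    # With no findings yet, numbering starts at 1.
    return f"{prefix}-{max(numbers, default=0) + 1:0{width}d}"


if __name__ == "__main__":
    print(next_finding_id(["CC-0001-xss.md", "CC-0022-sqli-user-get.md"]))
```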

project status

Early. Useful. Honest about both.

CodeCome is an early PoC. The conventions are stable enough to use; the tooling around them is still moving. Below is what works well today and what is still rough — no marketing.

Works well · Markdown findings with structured YAML frontmatter (stable schema) · itemdb/findings/

Works well · File-based item DB: no DB, no RAG, easy to grep and commit · itemdb/

Works well · Per-phase make targets with readiness gates · Makefile

Works well · Docker sandbox bootstrap (Python, C/C++, .NET, PHP, IaC, …) · templates/sandboxes/

Works well · Styled wrapper output with per-tool renderers · tools/

Rough · One agent at a time: no parallel auditing or validation · v0.next

Rough · validate-all is sequential · v0.next

Rough · Docker is the only first-class sandbox runtime today · sandbox/

Rough · Phase 2 + deep sweep produce overlapping findings (Phase 3 cleans) · prompts/

Rough · Provider coverage for --thinking is hand-maintained · tools/

Missing · No CI: quality gate is make tests run locally · v0.next

documentation

Read the docs before you trust the output.

CodeCome's value is in its methodology, not its UI. The docs explain how each phase works and what its prompts assume.

Getting started

Install, configure a provider, run your first audit.

README ↗

Workflow reference

Full phase-by-phase workflow reference.

docs/workflow.md ↗

Target setup

Supported layouts: source trees, submodules, archives, benchmark corpora.

docs/target-setup.md ↗

Sandbox guide

Bootstrap, boundaries, evidence capture, validation environment.

docs/sandbox.md ↗

File-risk sweeps

Risk index format and the deep sweep reference.

docs/file-risk-sweeps.md ↗

Development

Repo conventions, helper tools, contributor workflow.

docs/development.md ↗

Prompt catalog

All phase prompts — phase-1-recon.md through phase-6-report.md + sweep.md.

prompts/ ↗

authors

Who builds CodeCome.

Project Lead

Pablo Ruiz García

Architecture, engineering, implementation, and the person who turns vague ideas into working code.

Product Lead

Alejandro Ramos

Product direction, use cases, requirements, and official provider of impossible requests that somehow keep becoming roadmap items.

Pull requests are expected, encouraged, and appreciated. See CONTRIBUTING.md.

contributing

Help shape an honest research harness.

CodeCome is small. A patch to a phase prompt, a sandbox template for a new language, or a bug report on a confusing convention are all valuable. We won't accept PRs that turn this into a scanner.

Read CONTRIBUTING · Open an issue

Prompts

Improve a phase prompt with a diff and a short rationale. Bring a run summary if you can.

Sandbox templates

Contribute a Dockerfile + scripts for a stack we don't cover yet, under templates/sandboxes/.

Methodology

Disagree with a phase boundary? Open a discussion before a PR.

Tooling

CLI ergonomics, schema validation, report generation — all welcome.

Auditing untrusted code?

Treat unknown source code as data, not safe input.

Read this before pointing CodeCome at code you did not write.

GPL-3.0-or-later OR AGPL-3.0-or-later.