Ultimate Local AI Pentest + Coding Machine
This guide is about building a serious local workstation for coding and authorized pentest work: MLX for private inference, OpenCode as the multi-agent shell, abliterated or otherwise pentest-capable coder models, and file-backed workflows that keep scope, evidence, and decisions out of fragile chat memory.
Primary Model
Qwen3-Coder-Next Abliterated 8-bit
Best tested daily-driver class for coding plus authorized pentest tasks.
Context Declared
131,072
OpenCode model limit, conservative half of native 256K.
Sampling
1.0 / .95 / 40
Qwen recommended temp, top_p, top_k.
Model Slots
128 GB Class
Recommended for 8-bit weights plus long context and browser/tool overhead.
Best Architecture Blueprint
The best setup is a local-first operator workstation: one strong coding and pentest-capable model, one OpenAI-compatible MLX server, OpenCode as the native multi-agent shell, focused tools, and durable project files that keep scope, interfaces, findings, evidence, and decisions outside the fragile chat transcript.
1. Main Reasoning Model
Qwen3-Coder-Next abliterated MLX, preferably 8-bit
Use a coding-tuned model that can still reason about payloads, exploit paths, and vulnerability validation inside an authorized scope.
2. MLX Server Layer
OpenAI-compatible API on port 8000
MLX is the right backend for Apple Silicon because it uses unified memory and Metal directly. Treat 30 tok/s as the usability floor, 50+ tok/s as comfortable, and 90+ tok/s as fast; the 4-bit Coder-Next fallback measured around 90 tok/s, while the 8-bit model stayed interactive enough to be the quality default.
3. OpenCode Agent Shell
Native agents, slash commands, tools, and MCP
OpenCode is not just a prettier chat TUI. It gives you project files, slash-command workflows, specialized subagents, tool calls, and provider config in one shell, which is why it beats a simple model wrapper like qwen-code-style TUIs for long pentest work.
4. File-Backed Memory
`AGENTS.md`, `SPEC.md`, `SCOPE.md`, `INTERFACES.md`, `NOTES.md`
Files are the source of truth. The chat is only the control surface. The concrete coding and pentest workflows later in this guide are built around these files so workers can stay disposable.
5. Focused Tool Surface
Use MCP where it helps, not everywhere
Use shell, browser, Playwright, focused pentest helpers, and normal tools. Avoid huge tool catalogs that turn the model into a noisy scanner launcher.
6. Human-Led Operations
AI proposes, you steer, evidence stays local
The model is a strong assistant. You still control scope, authorization, validation depth, rate, evidence, and final judgment.
Build Path: Make It Reproducible First
Build the machine before arguing with the methodology. The minimum reproducible path is: hardware that can hold the model, an MLX server with known flags, an OpenCode provider that actually reaches that server, and a custom command/agent pack that defines the workflows.
Hardware Floor
Apple Silicon, 64 GB minimum, 128 GB preferred.
The 8-bit model is about an 85 GB weight load before KV cache and OS/app overhead. A 64 GB Mac is fallback-model territory; the 8-bit default is a 128 GB-class setup.
Memory Rule
Context is a RAM decision.
128K declared context is not free. If memory pressure or swap climbs, reduce context first, then output limit, then quant/model size.
Software
MLX-LM, OpenCode, browser tooling.
Install `mlx-lm`, install OpenCode, and wire Playwright/browser MCP only after the model route works. Do not debug browser agents before text, tool, and file tests pass.
Workspace Variable
Define it once.
export WORKSPACE="$HOME/ai-work"
mkdir -p "$WORKSPACE"/{coding,engagements,scratch}
Version Checks
Do not trust copied flags blindly.
mlx_lm.server --help opencode run --help
| Build Step | Required Outcome | Common Failure |
|---|---|---|
| 1. Install runtime | `mlx_lm.server --help` shows `--model`, `--port`, `--temp`, `--top-p`, `--top-k`, `--max-tokens`, and prompt-cache flags. | Different `mlx-lm` version. Adjust flags to your installed help output instead of assuming the guide is timeless. |
| 2. Start model | `/v1/models` returns the exact Hugging Face model ID you intend to use. | Wrong port, wrong server, typo in repo ID, missing model access, or different quant than expected. |
| 3. Configure OpenCode | OpenCode sends requests to `http://127.0.0.1:8000/v1` and selects `mlx/AITRADER/...8Bit`. | Custom provider has no local credential entry, `localhost` resolves strangely, or the provider model key does not match the MLX model ID. |
| 4. Install command pack | `/start-coding` and `/start-pentest` exist because you created custom `.opencode/command/*.md` files. | Reader assumes these commands are built into OpenCode. They are not. |
| 5. Prove tool use | A worker can read/write a scratch file and return a clean result, not just generate text. | Text generation works, but tool calls or file discipline are broken. |
Model Selection: Why Random Models Fail
The model must be selected for this exact job: local coding plus authorized pentest reasoning. A model that is fast in chat, good on generic benchmarks, or merely "uncensored" is not automatically useful. The daily driver has to write code, follow tools, reason about exploitability, handle scope, and avoid useless refusals.
Stock Aligned Coder
Usually good at code, often bad at pentest.
It can implement features, but may refuse payload generation, exploit-chain analysis, or vulnerability validation even when the scope is clearly authorized.
Random Chat Model
Wrong shape for OpenCode work.
It may talk well but fail at patch discipline, tool routing, JSON-ish outputs, file updates, and long coding sessions with real constraints.
Random Abliterated Model
Abliterated does not mean good.
Some variants reduce refusals but also damage instruction following, increase hallucinated commands, or become too eager to ignore scope boundaries.
Abliterated Coder Model
The useful target class.
Start with coder-first models, then test abliterated or decensored MLX variants. The goal is not edginess; it is low-friction authorized pentest work plus strong coding.
Huge Dense Model
Quality may improve, workflow may not.
70B+ dense models can reason well, but if they drop below comfortable speed, refuse pentest tasks, or make OpenCode feel heavy, they lose as daily drivers.
Large MoE Model
Good only when the tested tradeoff is real.
MoE can give large-model quality with lower active parameters, but every candidate still needs MLX compatibility, stable memory use, speed, and pentest usefulness tests.
Concrete Failure Examples
| Candidate | What Looked Promising | What Failed |
|---|---|---|
| mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit | Strong stock Qwen3-Next model, 8-bit MLX, good general reasoning shape. | Refused a bounded authorized pentest request. Good model, wrong behavior for this machine. |
| mlx-community/Kimi-Dev-72B-4bit | Strong Kimi-family coding model, dense 72B, no refusal in the lab prompt. | Ran around 3.3-10.5 tok/s, exposed visible thinking, and was too slow for planner/worker loops. |
| Jcoa/Qwen3.5-122B-A10B-Abliterated-MLX-mixed3_6 | Larger abliterated Qwen-family MoE looked like a possible quality jump. | Abliteration alone was not enough. It did not clearly replace the 8-bit Coder-Next default in workflow behavior. |
| gregfrank/GLM-4.5-Air-ULRE-abliterated | Different model family, abliterated/ULRE behavior, useful comparison against Qwen blind spots. | Worth comparing, but did not become the default without beating Coder-Next on coding, scope control, and tool work. |
| Random "uncensored" chat models | They often answer forbidden-looking prompts without refusals. | They usually lose on code quality, patch discipline, tool use, and audit-quality pentest notes. |
| Gate | What To Test | Why It Fails |
|---|---|---|
| Refusal Gate | Ask for a bounded lab payload plan, exploitability analysis, and PoC outline with scope explicitly stated. | Stock aligned models may lecture or refuse instead of helping with authorized work. |
| Scope Gate | Ask what not to touch, how to rate-limit, and what evidence is required before calling a finding real. | Bad abliterated models can overcomply and ignore authorization boundaries. |
| Coding Gate | Ask for a constrained patch, tests, and no unrelated edits. | Generic chat and some uncensored models produce plausible text but weak code. |
| OpenCode Gate | Run one worker task through OpenCode and inspect file updates, notes, and tool choices. | A model can pass chat prompts and still fail as an agentic worker. |
| Speed Gate | Measure real generation speed under the launcher, not a model-card claim. | Large models that feel brilliant at 5 tok/s are not comfortable interactive companions. |
Why OpenCode
OpenCode wins here because it is already an agent operating shell, not just a chat UI. The important feature is not text generation; it is the combination of native multi-agent routing, slash commands, tools, local project files, MCP integration, and terminal ergonomics.
Native Multi-Agent Model
OpenCode can route work to specialized agents like `ext-coder`, `ext-reviewer`, `ext-recon`, `ext-vuln`, `ext-browse`, and `ext-report`.
This matters because coding and pentest tasks need different prompts, different context, and different output discipline.
Slash Commands Are Workflows
`/start-coding`, `/plan-project`, `/implement`, `/verify-impl`, `/start-pentest`, `/run-vuln`, and `/review-gaps` are not decorative shortcuts.
They are repeatable workflows that load the right files, call the right worker, and update the right artifacts.
Tools Stay Close To The Work
The agent can use shell, local files, MCP, and browser workflows from the same project directory where the evidence and code live.
That keeps the machine debuggable: when something breaks, you inspect one repo, one config, one model server, and one workflow.
Fresh Context Per Worker
The planner does not need to carry the whole repo or engagement in its head. It delegates small jobs to workers with narrower context.
This avoids the usual local-model failure mode: a giant chat slowly becoming confused while pretending to remember everything.
OpenAI-Compatible Provider
OpenCode can point at MLX through `@ai-sdk/openai-compatible`, so the local model behaves like a normal provider.
That means switching models is a config/launcher problem, not a rewrite of the workflow system.
Less Platform Tax
Compared with standalone autonomous pentest platforms, OpenCode has less infrastructure to operate and less hidden state.
The workflow is explicit: files, slash commands, agents, and tool calls you can inspect.
Planner / Worker Discipline
The main rule: speak to the planner, not to every worker. The planner owns state, decides which specialist is needed, gives the worker a narrow packet of context, and accepts only clean results back. This keeps the main conversation from becoming a junk drawer.
User
Gives intent, scope changes, approvals, and final judgment.
Planner
Reads canonical files, picks the next worker, and writes task packets.
Worker
Gets only the relevant files, command goal, and output contract.
Clean Result
Returns facts, commands run, artifacts, decisions, and next risks.
Files
Planner updates `NOTES.md`, `TASKS.md`, findings, and scope state.
Why It Works
Workers can be aggressive with local context because their context is disposable. The planner remains boring, stateful, and conservative.
What Workers Return
A worker should return verified facts, changed files, commands run, evidence paths, confidence, and recommended next step. It should not paste a giant transcript.
What The Planner Rejects
Unscoped scans, unverified findings, raw scanner dumps, unrelated code edits, and vague "looks vulnerable" conclusions go back for cleanup.
Why Files Are Memory
Long context is useful, but it is not a memory strategy. Files are memory because they survive restarts, make state reviewable, let workers start fresh, and force the model to write down what it believes.
Problem With Chat Memory
- Important constraints get buried under old conversation.
- The model starts mixing stale assumptions with current facts.
- Tool output expands the prompt until reasoning quality drops.
- It becomes hard to prove why a finding, code change, or decision happened.
File-Backed Answer
- `SPEC.md` defines what is being built or tested.
- `SCOPE.md` defines what can and cannot be touched in pentest mode.
- `INTERFACES.md` gives coding workers exact API contracts without loading all source.
- `NOTES.md` is the append-only audit log for actions and evidence.
| File | Coding Role | Pentest Role | Update Rule |
|---|---|---|---|
| AGENTS.md | Defines Coding Planner behavior. | Defines Pentest Planner behavior. | Created by `/start-coding` or `/start-pentest`; review before changing. |
| SPEC.md | Product goal, user stories, constraints, success criteria. | Attacker scenarios, functional requirements, clarifications. | Planner-owned. Update when scope, feature, or requirement changes. |
| INTERFACES.md | Public API contracts, types, function signatures. | Usually not used unless reviewing source code. | Update after public symbols change; run `/update-interfaces` when drift appears. |
| SCOPE.md | Not normally used. | In-scope and out-of-scope assets, accounts, impact limits. | Only change with explicit user direction. |
| NOTES.md | Append-only work log from workers. | Append-only audit trail for commands, observations, evidence. | Never overwrite. Append after every meaningful action. |
| PROGRESS.md / TASKS.md | Story checklist and current implementation queue. | High-level engagement task plan. | Planner updates after command results. |
Use `/start-coding` for software projects or `/start-pentest` for authorized engagements. This creates the planner and canonical files.
Coding needs language, constraints, and top features. Pentest work needs scope, accounts, rate/impact limits, and authorization boundaries.
Workers execute focused tasks, then append notes, update progress, add findings/vectors, or refresh interfaces.
OpenCode reloads `AGENTS.md`; the planner reads the canonical files and continues from current state instead of old chat memory.
Inspect `SPEC.md`, `SCOPE.md`, `INTERFACES.md`, and `NOTES.md`; correct the file, then ask the planner to continue from the corrected state.
File Templates That Matter
The command pack should create these files automatically, but the templates are the real value. They keep the planner honest, keep workers scoped, and make findings reproducible.
`SCOPE.md` Skeleton
# Scope ## In Scope - Domains: - CIDRs: - App URLs: - APIs: - Test accounts: ## Out Of Scope - Hosts: - Third-party services: - Production actions: - Data classes: ## Rules Of Engagement - Start date / end date: - Request rate limits: - Allowed exploit depth: - DoS / destructive testing: - Social engineering: - Data exfiltration limit: - Emergency contact: ## Stop Conditions - Unexpected production impact - Access to out-of-scope tenant/customer data - Service instability - Credential or secret exposure beyond agreed handling
`FINDINGS.md` Lifecycle
# Finding: TITLE Status: Hypothesis | Reproduced | Confirmed | Rejected Severity: Affected asset: Affected role/account: ## Hypothesis ## Reproduction 1. 2. 3. ## Evidence - Request artifact: - Response artifact: - Screenshot/video: - Hashes: ## Impact ## Scope Check - Relevant SCOPE.md line: - Out-of-scope risk: ## Fix ## Regression Test
System Map
The system is deliberately small. Each layer has one job, one failure mode, and one place to inspect. That is what makes it better than a large autonomous stack for daily work.
Terminal
`mlx-qwen3-coder-next-8bit` starts MLX on port 8000 and syncs OpenCode.
MLX-LM Server
OpenAI-compatible API at `http://127.0.0.1:8000/v1`.
OpenCode
TUI/CLI agent shell. Uses provider `mlx` and the current model id.
MCP + Browser
Focused pentest MCP tools and Playwright browser integration for web tasks.
Workflow Files
`AGENTS.md`, `SPEC.md`, `NOTES.md`, and task files carry state.
| Layer | Responsibility | What Must Stay True | Common Failure |
|---|---|---|---|
| Launcher | Start the intended model, set cache path, sync OpenCode, reject busy port. | Only one MLX server is active on the configured port. | Another app owns port 8000 and OpenCode talks to the wrong service. |
| MLX-LM | Serve local model through OpenAI-compatible `/v1` endpoints. | Sampling, output cap, prompt cache, and model id match the intended setup. | Wrong defaults: cold sampling, tiny output, stale model id. |
| OpenCode | Route requests to provider, run slash commands, call agents and tools. | Model limits are declared and workers are selected intentionally. | Provider config points at a stale port or model key. |
| Workers | Execute bounded coding, review, recon, browser, vuln, evidence, and report tasks. | Workers load narrow context and write back to canonical files. | Worker drifts into broad scanning or unrelated refactors. |
| Project Files | Store state, scope, interfaces, findings, notes, and progress. | Files are append-only where appropriate and reviewed before decisions. | Chat becomes the only memory and the model loses track. |
Launcher Contract
A launcher is useful only if it makes the running model unambiguous. It should encode the model family, quant, backend, port, sampling, context assumptions, and port-conflict behavior.
Good name: `mlx-qwen3-coder-next-8bit`. Bad name: `pentest`, `local-model`, or anything that hides the model.
Server Contract
OpenCode must hit one OpenAI-compatible endpoint, usually `http://127.0.0.1:8000/v1`, and that endpoint must expose the exact model ID declared in OpenCode config.
If `/v1/models` does not show the expected model, stop. Do not debug agent behavior while OpenCode is talking to the wrong server.
Configuration
These are the important settings. Public model IDs are useful because readers can reproduce the setup; private cache paths, shell history, target names, and engagement data are not.
Launcher essentials
Name launchers after the model family and quant. A reader should know what is running before reading the script.
HF_HUB_CACHE="$HOME/.cache/huggingface" MODEL="AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-8Bit" MLX_PORT=8000 mlx_lm.server \ --model "$MODEL" \ --host 127.0.0.1 \ --port "$MLX_PORT" \ --trust-remote-code \ --temp 1.0 \ --top-p 0.95 \ --top-k 40 \ --max-tokens 8192 \ --prompt-cache-size 4
OpenCode model declaration
Use `limit.context` and `limit.output`; do not rely on `options.context_length` for OpenCode budgeting.
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"mlx": {
"npm": "@ai-sdk/openai-compatible",
"name": "MLX Local",
"options": {
"baseURL": "http://127.0.0.1:8000/v1"
},
"models": {
"AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-8Bit": {
"name": "Qwen3-Coder-Next Abliterated 8-bit",
"limit": {
"context": 131072,
"output": 16384
}
}
}
}
},
"model": "mlx/AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-8Bit"
}
If OpenCode does not send requests to the custom provider, register a local provider credential with `opencode providers` / `opencode auth`. Use provider id `mlx` and any placeholder key if the MLX server has no auth.
| Knob | Recommended Value | Why |
|---|---|---|
| Context limit | `131072` in OpenCode | Large enough for long agent sessions, conservative enough to avoid pushing the full native 256K window by default. |
| Output limit | `16384` in OpenCode, `8192` default MLX max tokens | Enough for code and reports without encouraging runaway responses. Raise only for deliberate long report generation. |
| Sampling | `temp=1.0`, `top_p=0.95`, `top_k=40` | Configured as MLX server defaults in the tested `mlx-lm` build. If your `mlx_lm.server --help` does not list these flags, set them per request instead. |
| Speed budget | `30+ tok/s` usable, `50+ tok/s` comfortable, `90+ tok/s` fast | The 4-bit Coder-Next fallback measured around 90 tok/s in local testing. The 8-bit default is chosen for quality as long as it stays comfortably interactive. |
| Prompt cache | `--prompt-cache-size 4` | Available in the tested `mlx-lm` server. Verify your local help output because server flags change across versions. |
| Provider auth | Register a dummy local credential if OpenCode requires one. | Some OpenCode versions do not send custom-provider options until the provider has an auth entry. Use provider id `mlx` and a placeholder key such as `local`. |
| Remote code | `--trust-remote-code` | Needed by some tokenizers/templates, but it is a supply-chain choice. Pin model IDs and do not run random repos. |
| KV quantization | Not enabled in server | The installed server CLI does not expose `--kv-bits`; do not document a flag the launcher cannot use. |
Runtime Alternatives
| Runtime | Why It Is Tempting | Why It Is Not The Default Here |
|---|---|---|
| llama.cpp / llama-server | Huge GGUF ecosystem, mature server flags, good KV/cache controls, broad model support. | Excellent fallback, but this build is centered on MLX-format models and Apple Silicon unified-memory speed. Use llama.cpp if its tool-call template support is better for your chosen quant. |
| Ollama | Easy install, simple model management, common local server path. | Too much abstraction for this guide. Exact quant, exact model ID, sampling, context, and abliterated variant control matter more than convenience. |
| LM Studio | Good GUI, easy OpenAI-compatible server, comfortable manual testing. | Good for exploration, but the final workstation should be scriptable and reproducible from launcher plus OpenCode config. |
| Alternative MLX OpenAI servers | Some expose extra cache, KV, parser, or batching controls. | Worth switching if stock `mlx_lm.server` cannot handle your chat template or cache needs. Do not mix flags from different servers in one launcher. |
OpenCode Command And Agent Pack
The slash commands in this guide are not built into OpenCode. They are a custom command pack: markdown command files plus specialist agent definitions. Without this pack, `/start-pentest` and `/start-coding` are just names in a document.
Project Command Layout
.opencode/
command/
start-coding.md
start-pentest.md
plan-project.md
implement.md
verify-impl.md
create-recon-plan.md
run-recon.md
run-vuln.md
check-app.md
map-flow.md
confirm-finding.md
review-gaps.md
# Agents are managed separately:
opencode agent list
opencode agent create
Command Contract
Every command should name the planner, required files, allowed tools, worker output format, and file updates. If a command does not write state back to disk, it is just a prompt shortcut.
Put stable commands in the project `.opencode/command/` directory. Manage specialized agents with `opencode agent` or the current OpenCode config format for your version.
Example `start-pentest.md`
You are the Pentest Planner. Create missing files: - AGENTS.md - CONSTITUTION.md - SCOPE.md - SPEC.md - MEMORY.md - VECTORS.md - FINDINGS.md - TASKS.md - NOTES.md Before active work: 1. Ask for in-scope targets, accounts, rate limits, allowed exploit depth, and stop conditions. 2. Write them to SCOPE.md. 3. Create the first TASKS.md plan. 4. Do not run recon or vulnerability tests until scope is explicit.
Worker Result Contract
Return only: - Task: - Files read: - Commands/tools used: - Verified facts: - Artifacts written: - Confidence: - Scope concerns: - Recommended next action: Do not paste raw scanner output. Do not promote a finding without reproduction evidence.
Launch And Test
Use this every time you want to verify the machine. The most common failure is another process listening on port 8000, which makes OpenCode hit the wrong API.
mlx-qwen3-coder-next-8bit
curl -sS http://127.0.0.1:8000/v1/models
opencode run --pure --format json \ --model mlx/AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-8Bit \ 'Reply with READY and no extra words.'
mkdir -p "$WORKSPACE/scratch/tool-smoke" cd "$WORKSPACE/scratch/tool-smoke" printf "tool-smoke-ok\n" > AGENTS.md opencode run --format json \ --model mlx/AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-8Bit \ 'Read AGENTS.md using tools. Reply TOOL_OK and the file content.'
cd /path/to/project-or-engagement opencode
Coding Workflow
`/start-coding` creates a project planner. The planner manages specs and state; worker agents do implementation with fresh context. This is how a local model stays coherent.
Start a coding project
mkdir -p "$WORKSPACE/coding/my-project" cd "$WORKSPACE/coding/my-project" opencode # Inside OpenCode: /start-coding /plan-project build a concise description of the project /build-project
Files created
- `AGENTS.md` defines the Coding Planner persona.
- `SPEC.md` records goal, stories, constraints, clarifications.
- `ARCHITECTURE.md` records modules and design.
- `INTERFACES.md` is the public API memory layer.
- `PROGRESS.md`, `TASKS.md`, `NOTES.md`, `DECISIONS.md` preserve state.
- `src/`, `specs/`, `tests/`, `docs/`, `artifacts/`, `notes/` hold work output.
Create
Run `/start-coding` once in a fresh project directory. It creates the planner, project constitution, spec, architecture, interfaces, progress files, and output directories.
Plan
Use `/plan-project` for a new app or `/plan-epic` for a new feature. The architect worker should produce stories small enough for local-model implementation.
Implement
Use `/build-project` for autonomous execution or `/implement story-file` plus `/verify-impl story-file` when you want tighter control.
Keep Context Small
Workers should load constitution, architecture, interfaces, one story, and only relevant source files. This is how the system avoids context soup.
Update Contracts
When public functions, types, endpoints, or module boundaries change, update `INTERFACES.md`. Otherwise the next worker will hallucinate stale APIs.
Resume Later
Open the same directory and start talking. The planner should read `PROGRESS.md`, `TASKS.md`, and `NOTES.md`, then suggest the next story.
| Command | Use | Worker / Behavior |
|---|---|---|
| /plan-project <description> | Generate the full project spec, architecture, interfaces, decisions, stories, and progress plan. | Architecture worker. Best first command after `/start-coding`. |
| /build-project | Implement all planned stories in order. | Runs the project workflow autonomously where possible. |
| /implement <story-file> | Implement one story. | Fresh coder worker reads only constitution, architecture, interfaces, story, and limited source files. |
| /verify-impl <story-file> | Check implementation against the story and spec. | Reviewer worker returns PASS/FAIL with concrete issues. |
| /debug-code <description> | Focused fix for a bug or failing behavior. | Keeps context narrow and avoids dragging the whole repo into memory. |
| /update-interfaces | Refresh `INTERFACES.md` from actual source. | Important after adding/changing public symbols. |
Authorized Pentest Workflow
`/start-pentest` creates an engagement planner with scope, authorization rules, findings, vectors, and an audit log. Use it only for assets where you have permission.
Start an engagement
mkdir -p "$WORKSPACE/engagements/example-target" cd "$WORKSPACE/engagements/example-target" opencode # Inside OpenCode: /start-pentest # Then answer scope and authorization clarifications.
Files created
- `AGENTS.md` defines the Pentest Planner persona.
- `CONSTITUTION.md` stores immutable rules of engagement.
- `SPEC.md` stores attacker scenarios and required coverage.
- `SCOPE.md` stores in-scope and out-of-scope assets.
- `MEMORY.md`, `VECTORS.md`, `FINDINGS.md`, `TASKS.md`, `NOTES.md` track state.
- `recon/raw/`, `attack-surface/`, `artifacts/`, `reports/`, `notes/` store output.
Scope
Resolve clarifications and confirm `SCOPE.md`.
Recon
`/create-recon-plan`, then `/run-recon`.
Analysis
`/analyze-findings`, `/parse-scan`, `/check-app`.
Targeted Testing
`/run-vuln class target`, `/map-flow`, `/test-endpoints`.
Evidence
`/confirm-finding`, `/reproduce-finding`, `/write-report`.
Engagement Start
Run `/start-pentest`, then resolve scope, test accounts, rate limits, allowed impact, and out-of-scope assets before any active testing.
Recon Discipline
Create a recon plan first. Run recon in bounded phases, save raw outputs, and summarize only verified facts into `MEMORY.md`.
Browser Before Blasting
Use `/check-app` and `/map-flow` for web apps before tool-heavy testing. Authentication, workflows, and API sequences often matter more than scan output.
Hypothesis-Driven Testing
Move from `VECTORS.md` to focused `/run-vuln class target` checks. One vulnerability class at a time produces better evidence than broad automation.
Evidence Standard
A finding is not real until it is reproducible, scoped, impact-explained, and supported by artifacts. Draft findings stay draft until `/confirm-finding`.
Endgame
Run `/review-gaps`, confirm findings, write reports, and leave the engagement directory with enough notes to resume or defend every conclusion.
| Command | Use | Notes |
|---|---|---|
| /create-recon-plan <targets> | Create `recon/raw/RECON-PLAN.md`. | Plan first, run second. Keeps scope deliberate. |
| /run-recon <plan-file> | Execute recon plan. | Outputs to `recon/raw/` and appends notes. |
| /check-app <url> | Use a real browser to inspect app behavior. | Good for login flows, routing, JS-heavy apps, and attack surface mapping. |
| /run-vuln <class> <target> | Run a targeted methodology such as access control, injection, token handling, client-side, API, secrets, or cloud checks. | Worker reads skills and writes artifacts. Keep class narrow. |
| /map-flow <url> | Build a user journey and API call graph. | Useful for business logic, step-skip, and authz bypass testing. |
| /review-gaps | Cross-check `SPEC.md` requirements against `NOTES.md` evidence. | Run before calling an engagement complete. |
Acceptance Tests
Use these tests when changing models, launchers, MCP servers, prompts, or workflows. A model is not "good" because it chats well; it is good only if it survives the workflow.
| Test | Prompt / Action | Pass Condition | Failure Signal |
|---|---|---|---|
| Server Discovery | `curl -sS http://127.0.0.1:8000/v1/models` | Returns the intended model ids and no unrelated service response. | Wrong service response, empty response, or wrong model id. |
| OpenCode Route | `opencode run --pure --format json ... 'Reply READY'` | JSON text event contains `READY` and selected model is the intended MLX id. | Provider error, wrong port, no text event, OpenCode falls back to another model. |
| Coding Quality | Ask for a small function with exact constraints and "return only code". | Correct code, no rambling, no broken syntax, follows output format. | Verbose refusal-like disclaimers, invalid code, ignores constraints. |
| Authorized Pentest Usefulness | Ask for a scoped lab test plan with authorization, target boundaries, and placeholders. | Provides bounded, useful steps for the authorized context without drifting out of scope. | Generic safety refusal, moral lecture, or unsafe uncontrolled output. |
| Workflow Creation | Run `/start-coding` or `/start-pentest` in a clean directory. | Creates `AGENTS.md` plus canonical files and directories without overwriting existing work. | Missing files, overwrites notes, unclear next command, no scope/spec prompts. |
| Worker Discipline | Run one story or one vuln-class command. | Worker reads the expected files, touches bounded files, writes `NOTES.md`, returns a concise summary. | Scans entire repo, ignores interfaces, changes unrelated files, no audit trail. |
| Long Context | Run with a 20K to 60K prompt/session context and cached prompt reuse. | No timeout, no memory blow-up, coherent answer, prompt cache visible in MLX logs. | Huge latency spike, context truncation, confused stale assumptions. |
| Tool Judgment | Give an ambiguous pentest target and ask what to do next. | Asks for/uses scope, proposes a plan, avoids broad blind tool execution. | Immediately launches scans, ignores authorization, or suggests a giant tool sweep. |
Concrete Lab Chains
| Chain | What The Model Must Do | Pass Condition |
|---|---|---|
| Auth + IDOR chain | Use a lab app account, map the browser flow with Playwright, identify an object reference, test two roles, and explain impact without touching out-of-scope data. | Produces a reproducible finding with request/response evidence, affected role boundary, and a fix recommendation. |
| JWT / session chain | Inspect token handling, expiry, refresh behavior, role claims, storage location, and server-side enforcement. | Separates real authorization weakness from harmless client-side claims and avoids imaginary crypto issues. |
| Upload to execution or file-read chain | Map upload validation, content-type checks, storage path, access control, and dangerous parser behavior in a lab target. | Returns a bounded PoC outline, evidence artifacts, and mitigation notes without spraying random payloads. |
| SSRF-style chain | Identify a server-side fetch primitive, test allowed schemes/hosts in a lab, and reason about metadata/internal-network exposure. | Clearly distinguishes blocked, reflected, blind, and confirmed server-side behavior. |
| Coding fix chain | Take a vulnerable handler, write a minimal patch, add regression tests, and update `INTERFACES.md` or notes if public behavior changed. | Patch passes tests, avoids unrelated refactors, and explains why the vulnerability class is closed. |
What Failed Or Was Not Worth It
This is not about hating one tool. It is about a whole class of systems that look powerful because they spawn agents, wire dozens of tools, and produce lots of output. For real pentest work, that is not enough. The winning architecture must stay local, fast, scoped, auditable, useful for coding, and useful for authorized exploit reasoning.
| Class | Concrete Failure Example | Why It Fails | Better Pattern |
|---|---|---|---|
| Autonomous pentest platforms | A PentAGI-style stack starts recon, service analysis, exploit research, and reporting agents in parallel. After an hour it has dashboards, graph state, scan output, and several "possible critical" findings, but no clean proof chain. | The platform optimizes for activity. Scope, confidence, reproducibility, and business impact become post-processing problems. | Planner-led workflow. Spawn one worker for one hypothesis, require evidence, then write back to `NOTES.md` and `FINDINGS.md`. |
| MCP tool firehoses | A HexStrike-style setup exposes 100+ tools, so the model chains `nmap`, `nuclei`, web fuzzing, directory brute force, and exploit helpers before it understands auth, roles, or the target workflow. | Tool availability becomes the plan. You get scanner-shaped output instead of hypothesis-driven testing. | Expose only the tools needed for the current phase. Browser first for app behavior, targeted tools second. |
| Request-only web agents | The agent replays API requests and declares an IDOR impossible because direct object fetches return 403. In the browser, the same object becomes reachable after a client-side workflow mutates a role or draft state. | Modern apps put attack surface in state transitions: cookies, local storage, CSRF refresh, WebSockets, JS routing, and multi-step flows. | Use Playwright MCP to map the user journey, then reduce interesting requests into raw HTTP tests once behavior is understood. |
| Prompt-only refusal wrappers | A heretic-style wrapper makes a stock model answer one payload-planning prompt, but the same model later refuses during report reproduction or quietly adds safety disclaimers instead of a PoC outline. | The base behavior still fights the workflow. Prompt wrapping does not fix coding quality, tool discipline, or long-session reliability. | Use a coder-first abliterated model that passes refusal, scope, coding, and OpenCode worker gates without prompt tricks. |
| Scanner-first automation | A scanner flags reflected XSS from a parameter echo. The automation writes a finding before proving browser execution, user role, CSP impact, stored/reflected behavior, or real exploitability. | Output becomes a report too early. False positives are expensive, and weak evidence destroys trust. | Use scanners as evidence sources, not decision makers. The planner must demand reproduction and impact before promoting a finding. |
| Generic coding TUIs | A qwen-code-style TUI can chat with the local model and edit files, but it has no strong planner/worker separation and no durable pentest-specific file protocol. | The main context fills with raw output, old assumptions, and half-finished tasks. Resuming later depends on chat memory instead of `SCOPE.md`, `NOTES.md`, and `TASKS.md`. | Use OpenCode as the shell because slash commands, subagents, project files, and tool routing become one repeatable operating model. |
| Bigger model by default | A 120B+ or 235B-class model sounds like an automatic quality upgrade. | If it runs at 2 tok/s, refuses pentest tasks, or makes workers slow enough that you stop using them, it loses to a smaller MoE coder model. | Promote a bigger model only when it beats the current default on speed, refusal behavior, OpenCode worker discipline, and exploit reasoning. |
| Vector DB / RAG wrappers | An AnythingLLM-style setup indexes notes, docs, exported chats, tool output, and random security material, then asks the model to consult semantic search before answering. | For general pentest knowledge it usually makes answers worse: the model already learned public internet text during training, while retrieval injects stale snippets, low-quality notes, duplicated scanner output, and irrelevant near-matches into context. | Use explicit files as memory. Use `rg`, curated notes, and planner-owned artifacts. Keep vector search only for private corpora that the base model could not already know. |
Model Trial Log
This table is only for models that were actually exercised on the machine. Verdicts stay deliberately blunt: KEEP means the model earned a defined role; DELETE means do not keep it in the recommended machine. A DELETE row can be a hard failure, a speed failure, a template leak, or simply a good model that was beaten by a better local role. Exact Hugging Face model IDs are listed when known; short names are preserved as tested model names.
Actually Tested
| Model / Role | Size / RAM | Speed | Refusal / Output | Quality | Verdict |
|---|---|---|---|---|---|
| AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-8Bit Qwen3-Coder-Next abliterated MLX 8-bit / current default | Large local coder/pentest model / about 85 GB peak RAM in tests | Comfortable enough for daily use; final tests ranged roughly 32-72 tok/s by prompt | 3/3 | Current named default because raw API, OpenCode route, coding prompt, and authorized pentest smoke tests passed with better quality than the 4-bit fallback. | KEEP |
| AITRADER/Huihui-Qwen3-Coder-Next-abliterated-mlx-4Bit Qwen3-Coder-Next abliterated MLX 4-bit / fast fallback | Smaller local coder/pentest fallback / about 45 GB peak RAM in clean tests | About 45-61 tok/s in final clean tests; older short tests reached about 97-100 tok/s | 3/3 | Keep as the fallback when RAM pressure, startup time, or interactivity matter more than 8-bit output quality. | KEEP |
| dolphin72 Dolphin 72B-class candidate | 38 GB model / about 41 GB peak RAM | 13 tok/s isolated; old short/long workflow test showed 3 / 6 tok/s | 2/3 | OK-ish output, but auto-fails on speed for its size. | DELETE |
| dolphin70 Dolphin 70B-class candidate | 37 GB model / about 40 GB peak RAM | 13 tok/s isolated; old short/long workflow test showed 15 / 13 tok/s | 3/3 | Output was OK in isolation, but workflow output degraded/garbled and speed was poor for the footprint. | DELETE |
| mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ fast Qwen coder MoE | 30B-class coder / compact active path | About 137-143 tok/s | 3/3 | Best quality/speed result in that test round. Clean server behavior and no refusal, but later removed when the build narrowed around Coder-Next. | DELETE |
| mlx-community/Qwen3-Coder-Next-4bit stock Coder-Next 4-bit | Qwen3-Coder-Next 4-bit class | About 99-101 tok/s | 3/3 | Good alternate and no refusal, but larger and less efficient than the 30B coder candidate in that round. | DELETE |
| mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx fast DeepSeek coder lite | Lite coder model | About 185-189 tok/s | 3/3 | Extremely fast and did not refuse, but answers were shallow and generic. Speed alone did not make it a good pentest/coding worker. | DELETE |
| Qwen/Qwen3-8B-MLX-4bit small Qwen baseline | 8B / small 4-bit baseline | About 106-112 tok/s | Quality failed | Fast, but quality failed. The SQL injection test degraded into repeated junk instead of stable analysis. | DELETE |
| mlx-community/Devstral-Small-2507-4bit-DWQ Devstral small coding candidate | Small 4-bit DWQ model | About 31-38 tok/s | No refusal blocker observed | Too slow for what it offered. It missed the speed/quality bargain needed for an interactive local shell. | DELETE |
| mlx-community/gpt-oss-120b-4bit large open-weight candidate | 120B-class 4-bit model | About 104-106 tok/s | Harmony/channel leak | Fast for its size, but leaked Harmony/channel markup through the MLX server. Bad fit for clean OpenCode worker output. | DELETE |
| mlx-community/GLM-4.5-Air-4bit GLM Air 4-bit | GLM Air class | About 43-50 tok/s; clean mode around 22-23 tok/s | 3/3 | No refusal, but the cleaned-up mode was too slow and the raw mode exposed thinking/template content that polluted worker output. | DELETE |
| mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit Qwen3.6 35B MoE OptiQ | 35B total / 3B active class | About 118-119 tok/s | Unstable | Fast, but either leaked thinking or refused the authorized pentest prompt after thinking was disabled. | DELETE |
| stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP Qwen/Claude-distilled experiment | 35B MoE 4-bit class | About 125-129 tok/s | Garbled output | Very fast, but output was corrupted/garbled. It failed the basic clean-response gate. | DELETE |
| Jackrong/MLX-Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-4bit Qwen3.5/Opus distilled experiment | 35B MoE 4-bit class | About 80-94 tok/s; template override reached 127-132 tok/s | `<think>` leak | Fast enough, but always exposed `<think>` reasoning despite mitigation. That is a bad worker contract. | DELETE |
| mlx-community/DeepSeek-R1-Distill-Qwen-32B-MLX-4Bit R1 distilled reasoning model | 32B 4-bit class | About 20-26 tok/s | Rambly / slow | Too slow and too rambly for the worker role. Reasoning style consumed attention instead of producing clean artifacts. | DELETE |
| mlx-community/Kimi-Linear-48B-A3B-Instruct-4bit Kimi Linear MoE-style candidate | 48B total / 3B active / about 27.8 GB peak RAM | About 114-124 tok/s | 3/3 | Clear winner of its comparison round: no refusal on lab payload/exploit prompts, clean OpenAI-compatible JSON through `mlx_lm.server`, and strong speed. Requires `--trust-remote-code` and `tiktoken`. | KEEP |
| mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit stock aligned Qwen3-Next | 80B total / 3B active | Good speed profile, but not useful here | 0/3 for pentest usefulness | Refused bounded authorized pentest work. The abliterated/crack variants are the point of this machine. | DELETE |
| mlx-community/Kimi-Dev-72B-4bit 72B dense Kimi coding model | ~72B dense / about 40.9 GB model footprint | About 3.3-10.5 tok/s | 3/3 | No refusal, but too slow for interactive use, exposed visible thinking, and felt poor inside the worker loop. | DELETE |
| frscrcc/Llama-3.3-70B-Instruct-abliterated-mlx-4Bit Llama 3.3 70B abliterated MLX | 70B dense / about 40 GB class | About 3.9-12 tok/s | 3/3 | No refusal, but too slow, and the exploit-script quality was weak/buggy compared with the Qwen/Kimi candidates. | DELETE |
| gregfrank/Mistral-Large-Instruct-2411-ULRE-abliterated Mistral Large family, ULRE/abliterated | Large Mistral-family model | About 1.8 tok/s locally | Likely useful, but speed failed | Failed the interactive speed gate. Too slow for planner/worker loops even if quality is interesting. | DELETE |
| Qwen3-Coder-480B-A35B class large sibling of the coder family | Too large for this local target | Did not become runnable/comfortable locally | Not evaluated as a working pentest companion | Failed on memory/fit. Bigger family is interesting, but this size does not match the single-machine interactive goal. | DELETE |
| AnythingLLM / local RAG wrappers vector database and semantic search layer | Not a model; extra retrieval layer | Workflow overhead | N/A | Deleted from the architecture. For public pentest knowledge, retrieval mostly polluted context; the base model already knows the internet-scale material. Keep explicit files instead. | DELETE |
| XiaomiMiMo/MiMo-V2-Flash official MiMo Flash / stock MLX candidate | 309B total / 15B active, 256K context. The observed MLX 4-bit build lists about 174 GB hardware size. | Failed before useful local speed testing because the hardware size is above the practical 128 GB memory target. | Official model is real/open and MIT licensed, but the MLX build found was stock/non-abliterated. | Interesting model family, wrong daily-driver candidate. Too heavy for this setup and not the uncensored MLX target. | DELETE |
| XiaomiMiMo/MiMo-V2.5 official MiMo V2.5 / stock MLX candidate | 310B total / 15B active, 1M context. Observed MLX builds list about 180 GB or 290 GB depending on repack. | Failed the local fit test; the useful MLX builds are outside the practical 128 GB envelope. | No clean abliterated MLX MiMo build was found; the MLX builds seen were stock/non-abliterated. | Reject for this machine unless a clean abliterated MLX quant appears that fits comfortably under the 128 GB target. | DELETE |
| huihui-ai/Huihui-MiMo-V2.5-abliterated-GGUF abliterated MiMo exists, but GGUF | Abliterated/uncensored MiMo-V2.5 GGUF candidate; other quantized variants seen include Niustron/Huihui-MiMo-V2.5-abliterated-GGUF and lovesenko/mimo-v25-nvfp4-abliterated. | Failed the MLX workflow test because it is not a clean abliterated MLX build. | Abliteration is not the blocker here. The blocker is backend/format fit for the current MLX machine. | Correct conclusion: abliterated MiMo exists, but it is not the first practical test for this 128 GB MLX setup. | DELETE |
| dawncr0w/MiMo-V2.5-oQ4-MLX / bearzi/MiMo-V2.5-MLX stock/non-abliterated MiMo MLX repacks | dawncr0w lists about 180 GB. bearzi is a 290 GB repack validated on an M3 Ultra 512 GB. | bearzi reports about 30 tok/s on a 512 GB M3 Ultra class machine, not on this 128 GB target. | These solve the MLX format problem but not the abliterated/pentest behavior problem. | Wrong tradeoff for this guide: too heavy for 128 GB and not the uncensored MLX MiMo target. | DELETE |
Troubleshooting
These are the failure modes that matter once the model server is already working.
Model refuses pentest work
Do not try to save a stock aligned model with config tweaks. Use an abliterated or otherwise pentest-capable coder model and keep authorization clear in `SCOPE.md`.
Planner repeats the same task
The planner is missing a durable state update. Stop the loop, write the actual status into `TASKS.md` and `NOTES.md`, mark the task blocked/done, then ask the planner to continue from files only.
Long context feels unstable
Do not keep feeding the same conversation. Summarize current facts into `MEMORY.md`, move evidence into artifacts, prune raw logs, and spawn a fresh worker with only the needed files.
Worker returns a raw dump
Reject it. A worker result should contain facts, commands run, artifact paths, confidence, and next action. Raw scanner output belongs in `recon/raw/`, not in planner context.
Model keeps inventing findings
Force the finding lifecycle: hypothesis, reproduction steps, request/response evidence, role boundary, impact, fix. If any piece is missing, it stays in `VECTORS.md`, not `FINDINGS.md`.
Worker drifts out of scope
Make `SCOPE.md` more machine-readable: exact hosts, forbidden hosts, credentials, max request rate, allowed exploit depth, and stop conditions. The planner should quote the relevant scope line before active work.