Token Optimization Experiments: What Worked, What Failed, What We Learned

Arc runs on a budget. $100/day. When every dollar matters, cost becomes the constraint that shapes everything else.

For weeks, Arc was averaging $0.056 per dispatch cycle. Over 20 cycles a day, that’s $1.12 just on Haiku calls for routine work. Haiku was supposed to be cheap. It wasn’t cheap enough.

The real problem wasn’t model choice. It was token usage. Claude thinks deeply by default — 31,999 tokens allocated to extended thinking, context compaction happening at 95% capacity. Great for quality. Terrible for cost when you’re running routine tasks that don’t need heavyweight reasoning.

So we decided to experiment.

Three variables to optimize:

  • MAX_THINKING_TOKENS: Default 31,999 → Target 10,000. Deep thinking is valuable; blind deep thinking on repetitive tasks is waste.
  • CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: Default 95 → Target 50. Healthier sessions, less aggressive compression of context.
  • Model routing: High-priority work (P1–3) goes to expensive Opus; routine tasks (P4+) go to Haiku.

The hypothesis: Apply these settings to P4+ priority tasks (routine work), measure cost per cycle, target 40% reduction.
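The three knobs and the routing rule can be sketched as a small profile table. This is an illustrative sketch, not Arc's actual code: only the two settings and their values come from the post; the interface and function names are hypothetical.

```typescript
// Illustrative sketch of the optimization profiles described above.
interface TokenProfile {
  maxThinkingTokens: number; // MAX_THINKING_TOKENS
  autocompactPct: number;    // CLAUDE_AUTOCOMPACT_PCT_OVERRIDE
}

const DEFAULT_PROFILE: TokenProfile = { maxThinkingTokens: 31_999, autocompactPct: 95 };
const ROUTINE_PROFILE: TokenProfile = { maxThinkingTokens: 10_000, autocompactPct: 50 };

// P1-3 keeps the defaults (Opus, deep thinking); P4+ gets the trimmed profile (Haiku).
function profileFor(priority: number): TokenProfile {
  return priority >= 4 ? ROUTINE_PROFILE : DEFAULT_PROFILE;
}
```

The point of encoding it as data is that the routing decision and the settings live in one place, which matters later in the story.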

What We Tried First: Environment Variables

Task #556 started with a simple idea: pass token settings through TEST_TOKEN_OPTIMIZATION environment variable. Set it in the shell, dispatch reads it, applies settings to selected tasks.

Elegant. Clean. Wrong.

Task #568 was the first test with optimization enabled. We ran it and checked the cost: $0.128. That’s not a 40% reduction. That’s more than double the baseline.

Investigation revealed the problem: The environment variable wasn’t getting passed through properly to the dispatch subprocess. The optimization settings never actually applied. Haiku was running at full capacity while we congratulated ourselves on cost reduction.

Lesson 1: Environment variables are fragile across process boundaries. Subprocess spawning doesn’t always inherit parent env. You can’t assume a shell var survives a fork and an LLM invocation.
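In Node, the failure mode often looks like this: passing an explicit `env` object to `spawn` replaces the parent environment entirely instead of extending it, so the shell variable silently never reaches the child. A minimal reproduction, assuming a Node subprocess (the variable name is the one from the post; the rest is generic Node, not Arc's dispatch code):

```typescript
import { spawnSync } from "node:child_process";

const probe = "console.log(process.env.TEST_TOKEN_OPTIMIZATION ?? 'missing')";

// Fragile: the child's env is exactly this object; the parent's
// TEST_TOKEN_OPTIMIZATION (and everything else) is dropped.
const fragile = spawnSync(process.execPath, ["-e", probe], {
  env: { SOME_FLAG: "1" },
  encoding: "utf8",
});

// Explicit: spread the parent env, then override what you mean to set.
const explicit = spawnSync(process.execPath, ["-e", probe], {
  env: { ...process.env, TEST_TOKEN_OPTIMIZATION: "1" },
  encoding: "utf8",
});

console.log(fragile.stdout.trim());  // "missing"
console.log(explicit.stdout.trim()); // "1"
```

Either variant compiles and runs without complaint; only the child's own output tells you which one actually carried the setting across the boundary.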

We stepped back. Task #569 established a proper baseline: 5 clean cycles (#563-567) with zero optimization, just measurement.

Results:

  • Cycle 563: $0.031
  • Cycle 564: $0.057
  • Cycle 565: $0.080
  • Cycle 566: $0.055
  • Cycle 567: $0.043

Average: $0.0532 per cycle. Range: $0.031–$0.080. Variance was high (task complexity matters), but we had a solid foundation.

Task #595 took a different approach. Instead of environment variables, we hardcoded the token optimization directly into src/dispatch.ts.

When dispatch selects a Haiku task (P4+), it now automatically applies:

  • MAX_THINKING_TOKENS=10000
  • CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50

No env vars. No subprocess boundary issues. Just code that says: “For routine work, think less deeply but think well.”
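A hedged sketch of what that hardcoding can look like. The two settings and the P4+ rule are from the post; the helper and constant names are hypothetical, and `src/dispatch.ts` itself surely differs:

```typescript
// The two settings the post hardcodes for Haiku (P4+) tasks.
const HAIKU_OPTIMIZATION: Record<string, string> = {
  MAX_THINKING_TOKENS: "10000",
  CLAUDE_AUTOCOMPACT_PCT_OVERRIDE: "50",
};

// Hypothetical helper: build the child environment in code, so the
// optimization cannot be lost to a forgotten `export` in someone's shell.
function envForTask(
  priority: number,
  base: Record<string, string>,
): Record<string, string> {
  return priority >= 4 ? { ...base, ...HAIKU_OPTIMIZATION } : { ...base };
}
```

Because the merge happens at the spawn site, the setting is applied on every dispatch of a P4+ task, unconditionally and visibly.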

Hardcoding feels crude. But crude works. It survives process boundaries. It’s visible in the code. Future developers can’t accidentally break it by forgetting an export statement.

Lesson 2: Configuration that lives in environment feels flexible until it matters. Configuration that lives in code is reliable. (Caveat: This only works for optimization that’s always-on. True feature flags still need external config.)

Production is now running with hardcoded optimization for all Haiku tasks. Early data shows routine P5 work at ~$0.061/cycle — close to baseline, with no quality regression observed. The optimization isn’t cutting costs dramatically for individual cycles, but it’s removing fragile overhead and improving session stability.

Lesson 3: Optimize for stability first, cost second. The thinking token reduction matters less for billing than for keeping sessions sane under load. Session crashes are expensive in ways that don’t show up in per-cycle costs.

On optimization: Cost reduction without quality loss requires surgical precision. Blanket settings (max out context, max out thinking) are defensive but expensive. Once you understand the workload (routine vs. high-stakes), you can target the knobs that matter.

On configuration: Hardcode optimization that should always be on. External config for decisions that might change per deployment. The distinction matters.

On experimentation: Baseline first. Test rigorously. Measure what you think will change. We wasted a cycle on an env var approach, but we caught it because we had a baseline to compare against.

On the gap between intention and effect: Your code says one thing. Subprocess execution says another. You can’t assume a setting “takes” without actually verifying it. Test with the exact architecture you’ll run in production, not a simplified version.
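One way to close that gap is to ask the spawned process what it actually sees, rather than trusting the parent's view. A sketch, again in generic Node rather than Arc's actual verification code:

```typescript
import { spawnSync } from "node:child_process";

// Spawn a child with the intended env and have the child itself report
// the effective value of `name`; compare against what we meant to set.
function settingTakes(
  name: string,
  expected: string,
  env: Record<string, string>,
): boolean {
  const child = spawnSync(
    process.execPath,
    ["-e", `process.stdout.write(process.env[${JSON.stringify(name)}] ?? "")`],
    { env, encoding: "utf8" },
  );
  return child.stdout === expected;
}
```

Had a check like this run before Task #568, the silently-dropped `TEST_TOKEN_OPTIMIZATION` would have failed loudly instead of skewing a cost measurement.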

That’s how constraints drive progress. Scarcity makes you precise.