AI Doesn't Replace Engineering Fundamentals—It Requires Them
- Suzanna Capone
- May 20
- 8 min read
Why the teams shipping production AI fastest are the ones obsessed with the basics
Everyone's talking about AI. Not everyone's getting value.
We've seen the pattern repeat: impressive demos that never reach production. Innovation theater that checks a board-mandated AI box. Black box systems that teams can't debug six months later. Unoptimized LLM calls burning $10K+/month without delivering proportional business value.
The challenge isn't building with AI—it's knowing how to build it right.
Here's what we've learned building production agentic systems for clients across ecommerce, compliance, and enterprise operations: AI doesn't replace engineering fundamentals. It amplifies the consequences of ignoring them.
The teams moving fastest aren't the ones with the fanciest models or the biggest budgets. They're the ones who architected for observability, engineered context flows deliberately, and designed systems their teams can actually maintain.
Start with the Right Problem—Most Don't Need AI
Not every problem needs AI. Forcing AI onto the wrong problem is expensive, slow, and creates technical debt you'll be unwinding six months later.
We had a client come to us with "we need AI to modernize compliance." Vague. Unfocused. The kind of request that leads to six months of scope creep, a POC that impresses in demos but never ships, and a team that's burned out on "AI transformation."
Through discovery, we found the actual problem: manual review of 1,000+ vendor websites for FDA/FTC violations. 45-60 minutes per site. Subject to human fatigue, inconsistency, and high error rates on edge cases. Compliance team bottleneck preventing vendor growth. Clear liability exposure as merchant of record.
That problem had the shape of something AI could solve.
Our filter:
Reasoning required? When you need some level of reasoning, and in this case risk assessment, an agentic workflow is a good fit.
Data-heavy? Enough structured or unstructured data to train or prompt effectively
Repetitive? Happens frequently enough that automation ROI is measurable
Clear success criteria? Can you define "better" objectively? (Time saved, errors reduced, cost avoided, risk mitigated)
If a problem passes this filter, we architect a pilot to prove the approach. Fast iteration, real data, measurable outcomes. After that, If we can't demonstrate ROI in 6-8 weeks, it's the wrong problem for AI—or the wrong AI approach for the problem. The vague "modernize compliance" request would have led to scope creep and an unclear path to production.
The technical principle: Map the current system architecture before proposing AI. Understand data flows, failure modes, latency requirements, error handling. Identify where AI adds value vs where traditional engineering is more appropriate. Some problems are better solved with deterministic rules engines, better data pipelines, or non- agentic workflow automation—not LLMs.
Observability Is Not Optional—Instrument on Day One
Agentic systems make autonomous decisions across multiple reasoning steps. When something goes wrong in production—and it will—you need complete visibility into the decision chain. Which agent was invoked? What context did it have? Which tool did it call? Why did it choose that tool over alternatives? What was the tool response? How did the agent interpret that response?
Without observability, you're debugging a black box. With observability, you have a traceable decision graph.
We worked on a project processing tens of billions of tokens—our largest AI implementation to date. Early in development, the team had limited visibility into how prompt changes or model updates affected system behavior. Changes shipped to production, then we'd discover performance regressions, cost spikes, or accuracy drops.
We instrumented the entire system with LangSmith from day one. Not after production issues emerged—from the start.
What we tracked:
Agent-level tracing: Every agent invocation logged with input context, reasoning steps, tool calls, and output
Decision paths: Why the Planning Agent chose to delegate to the Payment Specialist vs the Inventory Specialist
Tool execution: Which tools were called, with what parameters, what they returned, how long they took
Token consumption: Per-agent, per-request token counts to identify context bloat
Latency breakdown: Where time was spent in the workflow (LLM calls, tool execution, data fetching)
The observability didn't just help us debug—it revealed optimization opportunities we couldn't have seen otherwise.
What instrumentation revealed
LangSmith traces showed us the system was asking the LLM to extract quotes from documents by writing them out. The pattern was clear in the logs: the LLM was regenerating content we already had, consuming expensive output tokens, and occasionally introducing subtle misquotes or hallucinations.
This observation led to a breakthrough we call "deterministic quoting"—a context engineering pattern that eliminated the issue entirely. (More on the technical implementation below in Context Engineering.)
We would never have discovered this optimization without granular observability into what the system was actually doing.
The technical principle: Instrument your system like you'd instrument any distributed system. Use tools like LangSmith, LangFuse, or build custom tracing if needed. Name every agent, every tool, every decision point. Log inputs, outputs, and reasoning. Make traces explorable for both engineers (who need full technical detail) and non-technical stakeholders (who need "why did the AI do this?" answers). Set up dashboards tracking token consumption, latency P50/P95/P99, error rates, and cost per request. Alert when metrics drift.
Without observability, you're deploying a black box and hoping it works. With observability, you're operating a system you understand.
Context Engineering Makes or Breaks Your System
LLMs are stateless functions. They only know what you tell them in each request. Garbage context in = garbage decisions out.
The naive approach: send everything to the LLM and let it figure out what's relevant. This is expensive (tokens cost money), slow (processing time scales with context size), and often produces worse results (LLMs get distracted by irrelevant information).
The engineering approach: design context flows deliberately.
On a project processing billions of tokens, context bloat would have made the system economically unviable. We couldn't send entire 35-page documents, full product catalogs, or complete conversation histories to the model on every request.
Our context engineering strategy:
1. Scope aggressively
Send only what's needed for the current task. If we need data from pages 2-3 of a 35-page invoice, we extract and send those pages—not the full document.
Implementation: preprocessing pipeline that identifies relevant sections before LLM invocation. For documents: semantic chunking with vector embeddings to retrieve relevant passages. For structured data: SQL queries or API calls scoped to the exact data needed.
2. Progressive disclosure
Sometimes you don't need a vector database or complex retrieval system. Instead, provide the LLM with a map to more detailed information—like links on a wiki page.
Give the agent skills or documentation that reveal where to find detailed information rather than sending all the details upfront. The agent can request additional context as needed.
Example: instead of sending 50 pages of API documentation, send a structured index of available endpoints with brief descriptions. When the agent needs details on a specific endpoint, it requests that section.
3. Preserve deliberately
Not all agents need the same context. We designed context preservation per agent type:
Domain expert agents (Inventory Specialist, Payment Specialist, Product Specialist): maintain persistent context about their domain. These agents keep product catalog data, policy documents, and domain-specific rules in memory across requests.
Ephemeral agents (data extractors, format converters): use throwaway context. They process input, produce output, and forget everything. No state maintained.
Orchestration agents (Planning Agent, Reflection Agent): maintain conversation state and decision history, but not domain data.
This reduces token consumption and prevents context pollution (agents seeing irrelevant data from other domains).
4. Front-load essentials only
System prompts get core instructions: agent role, capabilities, constraints, output format requirements.
Dynamic context gets injected per request: customer history, current inventory levels, recent transactions, user permissions.
We don't send static reference data (product catalog, policy docs) in every prompt. Instead: vector embeddings for semantic search, retrieval-augmented generation (RAG) for relevant passages, progressive disclosure for navigable documentation, caching for frequently accessed reference data.
5. Use deterministic selection instead of generation
Remember the pattern observability revealed—the LLM regenerating quotes we already had? Here's how we solved it.
The standard approach when you need an LLM to extract quotes from a document: ask the LLM to read the document and write out the quotes.
The problem: When an LLM writes quotes, it can hallucinate, misquote, paraphrase unintentionally, or subtly alter meaning. Plus, output tokens cost more than input tokens—having the LLM write long quotes is expensive and slow.
Our solution: Split incoming content into sentences and assign each a numeric index. Instead of asking the LLM to write quotes, we ask it to select which quotes to extract by index number.
Before:
LLM task: "Extract relevant quotes from this document"
LLM output: "The report states that 'revenue increased by 23% year-over-year'..."
(Risk: hallucination, misquoting, expensive output tokens)After:
Document preprocessed into indexed sentences:
[1] "Revenue increased by 23% year-over-year."
[2] "Operating expenses declined by 8%."
[3] "Net margin improved to 15.2%."
...
LLM task: "Select sentence indices for relevant quotes"
LLM output: [1, 22, 48]
System retrieves exact sentences at those indices
(Zero hallucination risk, minimal output tokens, exact quotes)Results:
No hallucination opportunity: LLM selects existing content, doesn't generate it
Cheaper: Output tokens cost more than input tokens. Typing [1, 22, 48] is dramatically fewer output tokens than writing full quotes
Faster: Fewer output tokens = faster response times
Accurate: Quotes are exact matches from source material
This pattern applies beyond quoting: any time you need the LLM to reference existing content (code snippets, data records, document sections), have it select by index rather than regenerate.
6. Design for long-running sessions
Agentic systems in production often run long conversations. Context grows unbounded unless you design for compaction.
Our approach:
Sliding window: Keep last N conversation turns in full detail, summarize older context
Hierarchical summarization: Maintain conversation summary that updates incrementally
Explicit memory management: Agents decide what to remember vs what to discard
Behavior steering middleware: Monitor conversation state and dynamically nudge agent behavior to prevent drift from intended outcomes (staying on task, using appropriate tools, maintaining conversation quality)
The technical principle: Treat context as a scarce resource. Design information architecture before writing prompts. Use progressive disclosure to provide navigation rather than dumping all information upfront. Use vector embeddings for semantic retrieval. Implement RAG for large knowledge bases. Cache frequently accessed data at the agent level. Design memory compaction strategies for long-running sessions. Use behavior steering middleware to prevent drift. Measure token consumption per agent and alert when it spikes—it's usually a sign of context bloat or inefficient prompting.
For tasks requiring the LLM to reference existing content, use deterministic selection (indices, IDs, keys) rather than asking the LLM to regenerate content.
Context engineering is the difference between a system that costs $50/request and one that costs $0.50/request with better accuracy.
Engineering Fundamentals Scale—Hype Doesn't
The uncomfortable truth: most AI projects fail not because the technology isn't ready, but because teams skip the fundamentals.
They jump straight to implementation without identifying whether the problem is actually suited for AI. They ship systems they can't debug because they didn't instrument from day one. They burn budget on unoptimized context because they didn't design information flows. They deploy to production without understanding latency characteristics, cost per request, or failure modes.
The teams shipping production AI fastest are the ones who treat AI systems like distributed systems—because that's what they are:
✓ They identify problems where AI delivers measurable, defensible value
✓ They architect for observability before writing the first agent
✓ They engineer context flows deliberately, with clear data pipelines and retrieval strategies
✓ They design for failure: retry logic, fallback strategies, graceful degradation
✓ They monitor and alert on: latency (P50/P95/P99), token consumption, cost per request, error rates, accuracy metrics
These aren't AI-specific principles. They're distributed systems engineering fundamentals and we are doubling down on them.
![Logo [color] - single line w_ cart_edite](https://static.wixstatic.com/media/17b7e3_2ff2eac6a2994af799992822fabc1202~mv2.png/v1/fill/w_354,h_50,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/Logo%20%5Bcolor%5D%20-%20single%20line%20w_%20cart_edite.png)
Comments