Prompt Engineering — Transformer Math

Module 35 · Applications

✍️ Prompt Engineering

Adding 'think step by step' improves GPT-4 math accuracy by 40%

Status:

Assuming the model and runtime settings are fixed, the prompt is your primary lever. Prompt engineering is the art of structuring inputs to get reliable, high-quality outputs — from role assignment and few-shot examples to chain-of-thought reasoning and schema-enforced structured output.

📝

Anatomy of a Prompt

What you’re seeing: the five concentric layers of every LLM API call — System Prompt, Few-Shot Examples, Context/Documents, User Query, and Output Format — stacked in the order the model processes them. What to try:locate the “Output Format” layer and notice it sits closest to the generation boundary — that’s why schema-enforcement instructions placed there are more reliable than instructions buried in the system prompt.

Dashed border on Output Format = optional but strongly recommended for production. The left arrow marks the direction of information flow — system sets the frame, examples show the pattern, context provides facts, query drives the response.

🎮

Side-by-Side: Naive vs. Engineered Prompt

Compare a vague prompt with a well-structured one. The same model, the same task — vastly different results.

Naive Prompt

Summarize this article.

Typical Output

This article talks about various aspects of climate change and its effects on the environment. It mentions several studies and discusses potential solutions. The author concludes that action is needed.

Vague, no structure, no specifics, unknown length

Engineered Prompt

<system>
You are a research analyst. Summarize
articles into structured briefs.
</system>

<user>
Summarize this article in exactly 3 bullet
points. Each bullet must include:
- One key finding with a specific number
- The source study or dataset cited

Article: {article_text}
</user>

Typical Output

- Global temperatures rose 1.1C since pre-industrial levels (IPCC AR6, 2021)

- Arctic sea ice extent declined ~12.8% per decade since 1979 (NSIDC satellite data)

- Renewable energy investment hit $495B in 2022, up 17% YoY (BloombergNEF)

Structured, specific, verifiable, consistent format

✨ Insight · The difference: role assignment, explicit format constraints, and concrete requirements. The model has the same knowledge in both cases — the engineered prompt just extracts it reliably.

💡

The Intuition

Prompt Anatomy

Every API call has three parts: system prompt (sets role, constraints, and personality), user prompt (the actual request), and optionally assistant prefill(pre-fill the assistant's response to steer format). System prompts have higher priority than user messages — this is the hierarchy that enables guardrails.

Few-Shot Learning

Include 2-5 input/output examples directly in the prompt. The model pattern-matches against these examples to produce consistent outputs. — more examples cost tokens without improving quality. Place examples after instructions but before the actual query.

Chain-of-Thought (CoT)

Adding "Let's think step by step" to a prompt dramatically improves performance on reasoning tasks. Why? It forces the model to generate intermediate tokens that represent reasoning steps, rather than jumping directly to an answer. Each generated token conditions the next, building a reasoning chain. Kojima et al. (2022) showed this zero-shot trigger . (Wei et al., 2022 is the related few-shot CoT paper; both lines of work are complementary.)

Structured Output

Three approaches, from weakest to strongest: (1) Prompt-based — ask for JSON in the prompt (brittle, model may add commentary). (2) JSON mode — API flag that guarantees valid JSON but not a specific schema. (3) tool_use / function calling — define a JSON Schema and the model strongly targets it; guaranteed schema conformance requires strict mode (OpenAI Structured Outputs or Anthropic strict tool use). Combine with Zod (TypeScript) or Pydantic (Python) for end-to-end type safety.

Prompt Caching

Anthropic's prompt caching stores the KV-cache for repeated prompt prefixes. . Design prompts with a long, stable prefix (system prompt + few-shot examples) and variable content at the end. .

Self-Consistency Decoding

Chain-of-thought prompting picks one reasoning path. Self-consistency samples multiplereasoning paths at temperature > 0 and takes the majority-vote answer. The intuition: if many independent reasoning chains converge on the same answer, that answer is more likely correct. Wang et al. (2022) showed . Cost: more inference compute for samples, but each sample is independent and can be batched in parallel.

Temperature Interaction

Temperature and prompts interact: a well-constrained prompt (strict format, examples, tool_use) can tolerate higher temperatures because the output space is already narrowed. A vague prompt at high temperature produces chaos. Rule of thumb: use for deterministic tasks (extraction, classification), for creative tasks with structure, and only for open-ended generation with strong post-filtering.

💡 Tip · The prompt engineering hierarchy: system prompt constrains the space, few-shot examples show the pattern, CoT enables reasoning, tool_use enforces output structure. Layer them for maximum reliability.

Quick check

Derivation

GSM8K accuracy jumps from ~18% to ~57% with zero-shot CoT. What is the causal mechanism?

The phrase activates a CoT fine-tuning head in the weights.It increases sampling diversity by raising temperature internally.It retrieves similar solved problems from the model's memory.Intermediate reasoning tokens become context, constraining the answer token.

Quick Check

Why use tool_use for structured output instead of just asking for JSON?

📐

Token Cost Math

System Prompt Cost Over N Requests

A system prompt of tokens sent with every request accumulates linearly. With prompt caching (90% discount on cached tokens):

Example: tokens, requests, . Without cache: . With cache: . That is a .

Few-Shot Scaling: Diminishing Returns

Accuracy improvement from few-shot examples follows a logarithmic curve — :

Where is number of examples, is zero-shot accuracy, and is a task-dependent constant. Typical values: .

Quick check

Derivation

System prompt = 2,000 tokens, input price = $3/MTok, 10,000 requests. Cached cost is ~$6 vs uncached $60. What fraction of compute was eliminated?

~90% — 9,999 of 10,000 requests skip full prefix computation.~50% — caching halves the per-request prefix cost.~10% — only the output tokens are cached, input still computed.~99% — all 10,000 requests share one cached KV-state.

Code: Prompting Techniques

python

# --- Zero-shot ---
messages = [
    {"role": "system", "content": "You are a sentiment classifier."},
    {"role": "user", "content": "Classify: 'This product is terrible.' → positive/negative"}
]

# --- Few-shot ---
messages = [
    {"role": "system", "content": "Classify sentiment as positive or negative."},
    {"role": "user", "content": "'I love it!' →"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "'Waste of money.' →"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "'This product is terrible.' →"},
]

# --- Chain-of-Thought ---
messages = [
    {"role": "system", "content": "Think step by step before answering."},
    {"role": "user", "content": """Q: If a train travels 120km in 2 hours,
then 90km in 1.5 hours, what is the average speed?
Let's think step by step."""},
]

# --- Structured Output with tool_use ---
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{
        "name": "extract_contact",
        "description": "Extract contact info from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string", "format": "email"},
                "phone": {"type": "string"},
            },
            "required": ["name", "email"],
        },
    }],
    tool_choice={"type": "tool", "name": "extract_contact"},
    messages=[{
        "role": "user",
        "content": "Extract: John Smith, john@example.com, 555-0123"
    }],
)
# Returns structured tool input (strict mode required for guaranteed schema conformance)

🔧

Break It — See What Happens

No system prompt

Too many few-shot examples (context overflow)

Quick check

Trade-off

You remove the system prompt from a customer support bot. Which failure mode is most likely under adversarial user input?

Higher latency — the model spends more tokens self-describing its role.Prompt injection succeeds — no guardrails exist to override malicious instructions.The model refuses all requests — defaulting to the safest possible behavior.Output quality improves — the model has more context window for the user query.

📊

Real-World Numbers

Model	Input $/MTok	Output $/MTok	Context Window
Claude Sonnet 4	$3.00	$15.00	200K tokens
Claude Opus 4	$15.00	$75.00	200K tokens
GPT-4o	$2.50	$10.00	128K tokens
Gemini 2.5 Pro	$1.25	$10.00	1M tokens
Claude Haiku 3.5	$0.80	$4.00	200K tokens

Technique	Typical Token Overhead	Accuracy Impact
System prompt	200-2,000 tokens	Baseline for consistency
Few-shot (3 examples)	300-900 tokens
Chain-of-thought	+50-200% output tokens	+20-40% on reasoning tasks ()
tool_use schema	100-500 tokens	near-100% valid JSON with strict mode (vs ~85% prompt-only)
Prompt caching	0 (same tokens)

✨ Insight · Prompt engineering is about ROI: each technique adds tokens (cost) but improves output quality. The sweet spot for most production systems is a well-crafted system prompt + 3 few-shot examples + tool_use for structured output + prompt caching. CoT only when reasoning is required.

Quick check

Trade-off

Single-path CoT hits 57% on GSM8K. Self-consistency (k=40) hits 78%. What is the cost multiplier and latency impact?

40× cost, 40× latency — samples run serially.2× cost, 1× latency — self-consistency reuses 50% of tokens via caching.40× cost, ~1× latency — parallel independent samples.10× cost, 10× latency — deduplication reduces effective sample count.

🧠

Key Takeaways

What to remember for interviews

1System prompts set role, constraints, and personality and take precedence over user messages — they are the primary lever for reliability and safety in production systems.
2Few-shot examples follow a logarithmic accuracy curve: the largest gains come from 0→3 examples; beyond 5 examples, quality improves marginally while token cost scales linearly.
3Chain-of-thought prompting improves reasoning by forcing the model to generate intermediate tokens before the final answer — each token conditions the next, building a reasoning chain.
4tool_use / function calling enforces a JSON Schema on model output, making structured extraction far more reliable than asking for JSON in plain text (which can produce malformed output).
5Prompt caching stores the KV-cache for repeated prompt prefixes at 90% cost reduction — design prompts with a long, stable prefix (system + few-shot) and variable content at the end to maximize cache hits.

🧠

Recap quiz

📚

Interview Questions

Difficulty:

Company:

Showing 7 of 7

Compare zero-shot, few-shot, and chain-of-thought prompting. When would you use each?

★☆☆

GoogleAnthropic

Why use tool_use / function calling for structured output instead of just asking the model to return JSON?

★★☆

AnthropicOpenAI

When should you use few-shot prompting vs. fine-tuning? What are the tradeoffs?

★★☆

OpenAIGoogle

What is prompt injection and how do you defend against it? Give concrete examples.

★★☆

AnthropicOpenAI

Design a system prompt for a customer support bot that handles refunds. Walk through your design decisions.

★★☆

AnthropicGoogle

Explain prompt caching. How does it work and when does it save money?

★★☆

AnthropicOpenAI

How would you design prompts for strict JSON outputs under adversarial user input?

★★☆

OpenAIGoogle

🤖 Agents & ReAct

→

Transformer Math