AI Hero Self-Study Syllabus
A complete self-paced course built from Matt Pocock's public writing, open-source skills, conference talks, and Anthropic's official documentation.
Setup Checklist
Work through this before Module 1. Each line should take under 5 minutes unless noted.
Runtimes
▶Read: My 7 Phases Of AI Development
My 7 Phases Of AI Development
I've identified seven phases of development with AI that consistently lead to shipping great work. These phases apply whether you're using Ralph loops, GSD, Spec Kit, or any other AI coding approach.
The specific implementation is up to you, but these seven phases represent a common pattern across successful AI-assisted development workflows.
This guide is for engineers who believe that fundamentals matter in the AI age. It's not for vibe coders, it's for people serious about AI engineering and building applications that are built to last.
Phase 1: The Idea
Every project starts with an idea - the reason you're invoking this process. This could be:
- An entire app you want to build
- A specific feature or bug fix
- A codebase refactor
The idea can be as large or small as you need. This process scales from massive projects to narrow, focused tasks.
Refining Your Idea
Before moving to research or prototyping, refine your idea iteratively. I use a /grill-me skill that walks through questions to flesh out the concept and make it more concrete.
This early refinement helps clarify requirements and uncover assumptions before investing time in research or prototypes.
Phase 2: Research (Optional)
If your idea involves external dependencies or difficult exploration phases, include a research phase.
For example, if you're integrating with Stripe or an uncommon API, create a RESEARCH.md asset that caches all relevant information inside your repo where agents can access it.
Why Research Matters
Agents often work in fresh context windows. If exploration is difficult (external APIs, hard-to-access documentation), caching that information in a research.md file saves repeated exploration and improves agent performance.
Important: Research assets typically live only for the duration of this sprint or feature development. Research can go out of date or cause agents to take wrong turns if kept too long.
Phase 3: Prototyping
Prototyping is essential when you need to impose your taste on the outcome. At this stage, you're still exploring what you're building and how it should work.
Create multiple variations on a throwaway route, letting the LLM show you different approaches. Iterate through a couple of sessions to find the best option.
This applies to:
- UI design and behavior
- Software architecture decisions
- Testing external service integrations
Prototyping early lets you commit the winning design to your codebase, making it available to agents during implementation. By the time you write the PRD, concrete examples are more valuable than abstract descriptions.
Phase 4: Product Requirements Document (PRD)
With research and prototyping complete, it's time to properly describe the destination. You should now understand the end state clearly.
Focus on what users will see and how it will behave, not implementation details. The PRD (Product Requirements Document) describes the end state, not the journey.
During PRD creation, hammer out the design by prompting the agent to grill you on every decision point. Walk through your entire decision tree to uncover edge cases and requirements. I use a /write-a-prd skill that is purpose-built for this process.
Phase 5: Implementation Planning (Kanban Board)
Break down the PRD into an implementation plan. A Kanban board is a list of tickets with blocking relationships that describes all the work needed.
While you could create a single sequential plan, Kanban boards enable effective parallelization. Find all non-blocking tickets and spin up an agent for each one. I use a /prd-to-issues skill to automate this breakdown.
GitHub Issues works well for both PRDs and Kanban boards, though it lacks built-in blocking relationships. Linear is a better option if you need that feature.
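To make "find all non-blocking tickets" concrete, here is a minimal Python sketch. The `Ticket` class, its `blocked_by` field, and the ticket names are hypothetical and not tied to GitHub Issues, Linear, or any skill mentioned above. It simply lists the unfinished tickets whose blockers are all done.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """A hypothetical Kanban ticket; the field names are illustrative only."""
    id: str
    blocked_by: list = field(default_factory=list)
    done: bool = False


def ready_tickets(board):
    """Return ids of unfinished tickets whose blockers are all done."""
    done_ids = {ticket.id for ticket in board if ticket.done}
    return [
        ticket.id
        for ticket in board
        if not ticket.done and all(b in done_ids for b in ticket.blocked_by)
    ]


board = [
    Ticket("schema", done=True),
    Ticket("api", blocked_by=["schema"]),
    Ticket("ui", blocked_by=["schema"]),
    Ticket("e2e-tests", blocked_by=["api", "ui"]),
]

# Both "api" and "ui" can be handed to separate agents right now.
print(ready_tickets(board))
```

Each id returned is a candidate for its own agent. Once those tickets are marked done, running the function again surfaces the next wave.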
Phase 6: Execution
Run a coding agent to execute all tickets on the Kanban board. This is where the actual code gets written.
Most times, a sequential agent working through each ticket is sufficient. However, with a well-structured Kanban board, you can parallelize by running multiple agents on non-blocking tickets simultaneously.
I use Ralph loops, which work effectively with this setup. Ralph loops allow agents to work autonomously while maintaining code quality through automated testing and validation.
Running Away From Keyboard (AFK)
With proper setup (research, prototype, Kanban board, and PRD), you can run the execution loop away-from-keyboard and get excellent results.
The key is ensuring your agents have all the context they need:
- Research assets for external dependencies
- Prototype code for design patterns
- Clear PRD for requirements
- Well-defined tickets with acceptance criteria
When these pieces are in place, agents can make informed decisions without constant human intervention.
Phase 7: Quality Assurance
Once execution completes, have the agent create a QA plan for human review. This plan should outline specific test scenarios, edge cases, and acceptance criteria to verify.
QA typically uncovers issues or improvement opportunities, resulting in more Kanban tickets and another execution loop. This is expected and healthy - you'll iterate through phases 5-7 multiple times until you reach a polished product.
Each iteration should bring you closer to production quality:
- Agent creates QA plan
- Human reviews and tests the implementation
- Human identifies bugs, UX issues, or improvements
- New tickets are created
- Return to execution phase
Code Review and Human Involvement
QA involves humans reading the generated code to ensure quality, maintainability, and correctness. This may not always be necessary (especially with gray box architectures), but it's an important quality gate for production systems.
Look for:
- Logic errors or edge cases
- Security vulnerabilities
- Performance issues
- Code maintainability and readability
- Adherence to project patterns
These seven phases form the core framework for working effectively with AI agents.
Summary
| Phase | Purpose | Key Deliverable |
| --- | --- | --- |
| 1. Idea | Define what you want to build | Problem statement |
| 2. Research (optional) | Explore external dependencies | research.md asset |
| 3. Prototype (optional) | Test design and UX ideas | Working prototype |
| 4. PRD | Document the end state | Product requirements |
| 5. Kanban Board | Break down work into tickets | Task list with dependencies |
| 6. Execution | Build the actual implementation | Working code |
| 7. QA | Verify the completed work | QA plan and feedback |

This framework will likely evolve as new patterns emerge. Code review deserves special attention - it could be integrated into the execution flow or expanded within the QA phase.
▶Read: The AI Engineer Mindset
The AI Engineer Mindset
Building with LLMs requires a fundamental shift in how you think about software development. You're no longer designing deterministic systems where inputs map to predictable outputs. Instead, you're working with probabilistic systems which are inherently unpredictable. To ship a great application, you need to clearly define your success criteria and build a culture of continuous improvement based on real user data.
Productionizing LLM Apps Is Hard
It's relatively simple to get a demo up and running. But productionizing an LLM-powered app is a different story.
The journey starts in what I call the "Vibes-Only Trough." You've got your app working with the LLM, tested it with a few examples, and it seems to be doing what you want. But when you put it in front of real users, they'll interact with it in ways you never imagined. The flexibility of natural language means users can push your system far beyond its intended use cases, exposing edge cases and failure modes you couldn't have predicted.
This is where systematic evaluation comes in. As you move up the "Data-Driven Slope," you build proper evaluation frameworks, collect data, and measure performance. You're no longer relying on vibes - you're making data-driven decisions about your app's performance.
But the journey won't be straightforward. Your app's quality will swing wildly as you experiment. Some changes will make things better, others worse.
Don't panic - this volatility is built into the game. The trick is tracking these changes systematically and learning from each iteration. This is why we need a new approach to development, one that embraces the probabilistic nature of LLMs while maintaining rigorous evaluation practices.
Defining Your Success Criteria
The first step in managing this probabilistic nature is defining what success looks like for your app. In traditional software, success criteria are often straightforward - a login system either works or it doesn't. But with LLM-powered apps, success is rarely binary.
Let's say you're building a sentiment analysis system that classifies customer reviews. What does "success" look like? You need specific, measurable criteria that align with your app's purpose:
- Accuracy: 92% agreement with human expert sentiment labels across diverse industries
- Response Time: 95% of classifications completed within 500ms
- Confidence Scoring: 90% of high-confidence predictions (>0.9) should match human judgment
The key is making these criteria specific and measurable. Instead of vague "good performance" or impossible targets like 100% accuracy, use concrete numbers based on industry benchmarks. Even subjective metrics like "code quality" can be quantified - for example, "95% of generated functions pass all unit tests on first run."
Without clear, measurable success criteria, you'll never know if your changes are making things better or worse. And in the probabilistic world of LLM apps, that's a recipe for endless tweaking without real progress.
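As a rough illustration of checking work against such criteria, the sketch below scores a batch of results against two of the thresholds listed above (92% agreement and 95% of classifications within 500ms). The sample labels, predictions, and latencies are made up, and the function names are my own.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the human labels."""
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


def share_within_budget(latencies_ms, budget_ms):
    """Fraction of requests that finished within the latency budget."""
    return sum(ms <= budget_ms for ms in latencies_ms) / len(latencies_ms)


def check_success_criteria(predictions, labels, latencies_ms):
    """Compare measured numbers against the thresholds from the article."""
    return {
        "accuracy >= 92%": accuracy(predictions, labels) >= 0.92,
        "95% within 500ms": share_within_budget(latencies_ms, 500) >= 0.95,
    }


# Made-up evaluation data: 95 of 100 labels match, 93 of 100 calls are fast.
labels = ["positive"] * 50 + ["negative"] * 50
predictions = labels[:95] + ["positive"] * 5
latencies_ms = [120] * 93 + [900] * 7

report = check_success_criteria(predictions, labels, latencies_ms)
for criterion, passed in report.items():
    print(f"{criterion}: {'PASS' if passed else 'FAIL'}")
```

Re-running a check like this after every prompt or model change is one way to tell whether the change moved the numbers in the right direction.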
For a deeper dive into crafting effective success criteria, check out Anthropic's guide on defining success for LLM applications.
Data Is Your Most Valuable Asset
Your app's performance is only as good as the data you collect about how it's being used. But collecting data isn't enough - you need to build a data-driven culture in your organization. This means making decisions based on real metrics, not gut feelings, and using that data to measure against your success criteria.
Every user interaction is a data point you can use to improve your app:
- Improve Prompts: Use successful and failed responses to refine your prompt engineering
- Fine-tune Models: Collect high-quality user interactions to create training data for model fine-tuning
- Track Success Rates: Monitor which features or responses users find most helpful
- Spot Edge Cases: Use failures to identify and handle unexpected scenarios
This can feel alien to traditional developers. Most write code, test it, and ship it. But with LLM-powered apps, you'll spend a lot of time staring at spreadsheets and tracking metrics. It's not glamorous work, but it's essential. The best AI engineers embrace this data-driven mindset, even when it means doing the boring stuff.
Start with simple feedback mechanisms like thumbs up/down buttons, then gradually build more sophisticated data collection as you learn what metrics matter most. I've heard that one of the largest code generation platforms has a dedicated Slack channel that alerts developers to investigate every downvote. Every negative interaction becomes a learning opportunity. If you can't measure it, you can't improve it.
Conclusion
Building LLM-powered apps requires a fundamental shift in mindset. You're moving from deterministic systems to probabilistic ones that need constant evaluation. Here's an uncomfortable truth: the first version of your app won't be great - you simply don't have enough data yet. But that's okay. The real magic happens when you build a culture of continuous improvement based on real user data.
The key isn't to eliminate uncertainty (that's impossible with LLMs), but to understand and manage it through systematic evaluation. Success comes from defining clear, measurable criteria and treating every user interaction as a learning opportunity. While this might feel foreign to traditional developers, it's essential to AI engineering.
In our next article, we'll dive deep into how to build evaluation systems - what we call "evals" - that help you understand and manage the probabilistic nature of your app. We'll explore how to create systematic ways to measure performance, track improvements, and make data-driven decisions about your app's development.
▶Read: Personal software is INSANE in the age of AI
Personal software is INSANE in the age of AI
Matt Pocock
I built a personal course video manager. It handles 105 videos across 41 lessons for my courses. Nothing fancy. Just software adapted to exactly how I work.
Here's what it does: I record a video. The app grabs the transcript and the local files I used during recording. I select an LLM (usually Haiku 4.5), choose the output format (steps to complete, skill building text, whatever), and hit go.
The LLM churns out accompanying text that I review and either save directly to the repo as a README or copy to my clipboard. Then I can immediately turn that transcript into an article or edit the video description.
The whole workflow lives in one place. I recorded this article's source video in the same software I'm describing right now.
I have zero interest in making this a commercial application. I'm showing you this because it demonstrates something critical about AI-powered personal software.
AI Handles the Grunt Work
Everything the AI touches in my workflow is pure grunt work. Transcription. Text generation. Repurposing content from one format into another.
I'm not delegating any of my thinking to the LLM. I review all of its outputs closely—at least the stuff my users see.
The moment you start delegating your thinking to an LLM, you're screwed.
But if you build a system that integrates AI into workflows you already have? That's when it gets powerful.
Personal Software That Knows Your Workflow
Generic AI tools make impressive demos. They fall flat in daily use because they don't know who you are, what you're working on, or how you work.
My course manager is optimized for me. It knows my file structure. It understands my content formats. It plugs into my existing repos.
That's the future I'm betting on. Not generic tools everyone uses the same way. Bespoke applications that deeply integrate with your specific needs and projects.
The AI helps me move faster. It doesn't replace my judgment. It takes work off my hands without taking over my thinking.
Build systems that leverage AI for the repetitive stuff. Keep the thinking for yourself.
Exercise 0.1 — Your own one-pager
Write notes/journal.md with:
▶Read: What Is An LLM?
What Is An LLM?
In this article, we'll cover the basics of large language models. We'll talk about what they are, how they work, and touch on the process of creating them.
Most of the resources out there go really deep into how LLMs work - we're not going to do that. Instead, I'll give you a brief overview of the most common concepts so you can go ahead and get building.
What Is A Large Language Model?
A large language model is essentially a massive compressed file - think of it like a 1TB zip file. This file contains a bunch of numbers, encoded as 16-bit floats. These numbers are the parameters of the model.
0.1239784871238176123 // Parameter 1
0.1515689756890123123 // Parameter 2
These parameters represent the 'brain' of the model. They are the result of the model's pre-training: a process that takes a huge amount of text data and 'compresses' it into these numbers. They represent the model's understanding of the world, and give it the ability to remember facts and make decisions.
The number of these parameters represents the size of the model's brain. In general, models with larger brains perform better, but run slower. A model with 70B parameters will run ~10x slower than a model with 7B parameters.
Already, we're looking at a size vs speed tradeoff - a common theme when choosing large language models.
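The size side of that tradeoff is simple arithmetic: a 16-bit float takes 2 bytes, so the weight file is roughly two bytes per parameter. The sketch below is a back-of-the-envelope estimate only. It assumes every parameter is stored as a 16-bit float and ignores file-format overhead and quantization.

```python
def model_size_gb(num_parameters, bytes_per_parameter=2):
    """Approximate weight-file size in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


for name, params in [("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    print(f"{name}: ~{model_size_gb(params):.0f} GB")
# 7B: ~14 GB
# 70B: ~140 GB
```

The 70B result lines up with the ~140GB file mentioned in the training section of this article.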
How Do You Run A Large Language Model?
In order to get the model to do anything useful, you need to perform inference on the model.
Inference is the process of sending text to the model and getting a response back. This is done using an inference function - a piece of software that takes the parameters of the model and runs an algorithm on them to find the next word. This is far cheaper than pre-training the model, and can be done on your laptop.
For a deep-dive into how inference works, check out this incredible interactive walkthrough.
Sampling Strategy
To find the next word, the model looks at all the possible tokens it could choose, and picks one.
To do so, it uses a sampling strategy picked by the developer. This strategy determines how the model chooses the next word. The most common strategies are:
- Greedy Sampling: The model always picks the most likely word.
- Top-K Sampling: The model picks from the top K most likely words.
- Top-P Sampling: The model picks from the smallest set of most likely words whose combined probability reaches P.
- Temperature Sampling: The model introduces randomness into the selection process, allowing for more diverse outputs.
It's beyond the scope of this article to go into the details of these strategies. Usually, as an AI Engineer, you don't have the ability to change the sampling strategy of the model you're using. However, you can tweak variables, like the temperature, to get different results.
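For intuition only, here is a toy Python sketch of the four strategies applied to a made-up four-word distribution. The scores are invented and real inference engines work on vocabularies of many thousands of tokens, so treat this as an illustration rather than how any particular model is implemented.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    highest = max(scaled.values())
    exps = {tok: math.exp(s - highest) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_p(probs, p):
    """Keep the most likely tokens until their combined probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def sample(candidates, rng):
    """Draw one token at random, weighted by probability."""
    return rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]


# Invented scores for the word after "the cat sat on the ...".
scores = {"mat": 2.0, "sofa": 1.0, "roof": 0.5, "moon": -1.0}
probs = softmax_with_temperature(scores, temperature=1.0)

print(greedy(probs))              # mat
print(list(top_k(probs, 2)))      # ['mat', 'sofa']
print(list(top_p(probs, 0.8)))    # ['mat', 'sofa']
print(sample(top_k(probs, 2), random.Random(0)))  # one of the two kept words
```

Raising `temperature` above 1.0 moves probability away from "mat" toward the other words, which is why higher temperatures produce more varied output.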
What Are Input Tokens?
When you send text to the model, it first needs to be tokenized. This is the process of breaking the text up into chunks (whole words or fragments of words), and then converting those chunks into numbers. These numbers are the input tokens, which are passed to the inference engine.
Each model has its own tokenizer. Tiktokenizer is a great playground for exploring different tokenizers.
How Do You Create A Model?
In order to acquire the parameters, you need to train the model. Training large language models is an extremely involved process that requires a lot of time, expertise, and money. Learning how to do it is outside the scope of this article.
A rough guide is to take a chunk of the internet, let's say 10TB of data. You use 6,000 GPUs for 12 days, at the cost of around $2M. And you end up with a ~140GB file with all the parameters of the model.
The training process has two main phases:
- Pre-training: This gives the model its knowledge by compressing vast amounts of internet data into parameters
- Post-training: This shapes the model's personality and behavior through careful instruction and example.
You end up with a huge file of parameters - a kind of 'compressed' version of all of the data the model was trained on, with its personality shaped by post-training. Without post-training, the model would just be an inert blob of knowledge - it wouldn't know how to behave like a helpful assistant.
Resources
- Intro To Large Language Models by Andrej Karpathy
- How I Use LLMs by Andrej Karpathy
How Do You Introspect A Model?
It's possible (though very difficult) to dive into the parameters of a model to work out which ones correspond to which real-world concepts. For instance, Anthropic found that models can:
- Share concepts across languages, suggesting a kind of universal "language of thought"
- Plan ahead when writing (like planning rhymes in poetry)
- Use multiple parallel pathways for tasks (like mental math)
- Sometimes fabricate plausible-sounding reasoning rather than showing their true thought process
By understanding how models think, we can better anticipate their behavior and potentially remove some of the "magic" that makes them hard to trust. This could lead to more controlled, deterministic AI systems in the future - but the research is still in its early stages.
Resources
- Tracing the thoughts of a large language model by Anthropic
- Golden Gate Claude by Anthropic
Conclusion
Large language models compress vast amounts of knowledge into numerical parameters. While they're built on complex math, you don't need to understand their inner workings to use them effectively. Getting a basic lay of the land is enough to build smart intuitions about how they work.
In the next article, we'll explore how to choose the right LLM for your specific needs.
▶Read: What Are Tokens?
What Are Tokens?
Tokens are the fundamental building blocks that help Large Language Models (LLMs) process text. Understanding them is essential, especially since you're billed based on token usage.
What Are Tokens?
Tokens are simply numbers that represent how the LLM "thinks" about the text you provide. The process of converting text into tokens is called encoding.
The tokenization process works in two parts:
- The tokenizer splits text into tokens it recognizes
- These tokens are converted into numbers
Decoding is the reverse process:
- Numbers are converted back into text tokens
- The tokens are joined together to form the output
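The encode and decode steps can be sketched with a toy tokenizer. The ten-entry vocabulary below is invented, and the "take the longest matching piece" rule is a simplification I've assumed for illustration. Production tokenizers have far larger vocabularies and their own splitting rules.

```python
# A tiny, made-up vocabulary: a few whole words plus single characters.
VOCAB = ["the", "cat", "sat", " ", "a", "c", "e", "h", "s", "t"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
ID_TO_TOKEN = {i: token for token, i in TOKEN_TO_ID.items()}


def encode(text):
    """Split text into known tokens (longest match first), then map to numbers."""
    ids = []
    position = 0
    while position < len(text):
        for length in range(len(text) - position, 0, -1):
            piece = text[position:position + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                position += length
                break
        else:
            raise ValueError(f"no token for {text[position]!r}")
    return ids


def decode(ids):
    """Map numbers back to text tokens and join them together."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("the cat sat")
print(ids)           # [0, 3, 1, 3, 2]
print(decode(ids))   # the cat sat
print(encode("chat"))  # [5, 7, 4, 9] - an unknown word falls back to characters
```

Note how "chat", which is not in the vocabulary as a whole word, costs four tokens while "cat" costs one.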
The LLM Process Flow
The complete LLM process looks like this:
- Tokenizer encodes your input text into tokens
- LLM processes your tokens
- LLM produces output tokens
- Output tokens are decoded back into readable text
To clarify, input tokens include:
- Your conversation history with the LLM
- System prompts
- Tool definitions
Output tokens are what the LLM sends back as a response.
You're billed for both input and output tokens, typically at different rates. One way to save money is to design your prompts to generate fewer output tokens.
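A quick calculation shows why trimming output helps. The per-million-token rates below are hypothetical placeholders, not any provider's real pricing. Substitute the rates from your provider's pricing page.

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of one request, given rates in dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical rates: $1 per million input tokens, $5 per million output tokens.
INPUT_RATE = 1.0
OUTPUT_RATE = 5.0

verbose = request_cost(2_000, 1_000, INPUT_RATE, OUTPUT_RATE)
terse = request_cost(2_000, 200, INPUT_RATE, OUTPUT_RATE)

print(f"verbose reply: ${verbose:.4f}")  # verbose reply: $0.0070
print(f"terse reply:   ${terse:.4f}")    # terse reply:   $0.0030
```

With the same input, cutting the reply from 1,000 to 200 tokens more than halves the cost in this example, because output tokens carry the higher rate.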
How Tokens Are Created
The tokenization process starts with a large corpus of text - similar to what's used to train the LLM itself. Let's imagine a tiny corpus consisting of just one sentence: "the cat sat on the mat."
First, all individual characters are extracted:
T H E space C A T space S A T space O N space T H E space M A T
Each of these characters becomes its own token in the vocabulary.
Next, common groupings of characters are identified:
- "TH" appears in "the" (twice)
- "HE" appears in "the" (twice)
- "AT" appears in "cat", "sat", and "mat"
Each of these groupings also gets assigned its own token.
Then, groups are merged into bigger groups - like "TH" + "E" creating "THE" (the word "the"), which gets its own token.
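The grouping process can be sketched as repeatedly merging the most frequent adjacent pair. This is a simplified take in the spirit of byte-pair encoding. That the article has this scheme in mind is my assumption, and real tokenizer training handles spaces, bytes, and ties in more detail.

```python
from collections import Counter


def count_pairs(words):
    """Count how often each adjacent pair of symbols appears."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of the pair with one combined symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words


corpus = "the cat sat on the mat"
words = [list(word) for word in corpus.split()]

# Start with one token per character, then add one token per merge.
vocabulary = sorted({ch for word in words for ch in word})
for _ in range(3):
    pair, count = count_pairs(words).most_common(1)[0]
    words = merge_pair(words, pair)
    vocabulary.append(pair[0] + pair[1])
    print(f"merged {pair} seen {count} times")

print(words)
```

The first merge is "a" + "t" (seen three times, in "cat", "sat", and "mat"). After three merges, "the" has become a single token, while "cat" is still two tokens: "c" and "at".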
Vocabulary Size Matters
The goal is to create a large vocabulary of tokens because larger vocabularies can split words into fewer tokens, making processing more efficient.
For example, a vocabulary size of 1,000 tokens might split "understanding" into 5 tokens. A vocabulary size of 50,000 tokens might split it into 3 tokens, and a vocabulary size of 200,000 tokens might split it into 2 tokens.
Having a larger vocabulary means you can split words into fewer tokens, making processing more efficient.
Handling Unusual Words
The tokenizer struggles with uncommon words. For example, "O Frabjous Day" from Lewis Carroll's poem gets split into many tokens because "Frabjous" is a made-up word that doesn't appear frequently in the training corpus.
We can see that it turns it into 7 tokens - more than we'd expect from only 15 characters.
Final Thoughts
I hope that helps demystify tokens a bit. I found the tiktokenizer playground really useful for understanding this stuff.
Let me know if you have any questions - and what else would you like me to cover next?
Matt
▶Read: What Is The Context Window?
What Is The Context Window?
The context window is made up of input and output tokens. The input tokens might include a system prompt and a user prompt (the message from the user). The output tokens are whatever the assistant comes back with.
The context window is the input and output tokens combined.
As the conversation gets longer, as more and more messages get put into the conversation, the number of tokens used grows.
And as you can imagine, this can't go on forever. Every model has a hard-coded limit for the number of tokens it can see at any one time.
Context Window Limits
If we imagine a super long conversation, eventually we will hit a limit. And if you try to query the LLM with this super long conversation and you hit the limit, then you will probably get some kind of error back from the API.
You might even hit the limit during the generation of a message. The message might start within the context window limit, but the model carries on generating until it hits the limit.
The model itself isn't quite smart enough to work around its own context window limits. And so you will hit issues like this sometimes.
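Since the model won't manage this for you, your application has to. A minimal sketch of a pre-flight check follows. The 8,000-token limit and the per-message token counts are hypothetical, and in practice you would get the counts from your provider or tokenizer.

```python
def fits_in_context(message_token_counts, max_output_tokens, context_limit):
    """True if the history plus the requested output stays within the limit."""
    return sum(message_token_counts) + max_output_tokens <= context_limit


CONTEXT_LIMIT = 8_000  # hypothetical limit; check your model's documentation

# Token counts for each message in a long conversation (made-up numbers).
history = [500, 3_000, 2_500, 1_500]

print(fits_in_context(history, 1_000, CONTEXT_LIMIT))  # False
print(CONTEXT_LIMIT - sum(history))                    # 500 tokens left for output
```

Here the history itself fits, but a 1,000-token reply would run into the limit partway through generation, which is the second failure mode described above.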
The "Lost in the Middle" Problem
But the biggest issue with context windows is that the bigger they get, the more "lost in the middle" issues you get.
If we imagine a huge conversation made up of many individual messages, the messages at the start of the history have quite a big impact on the output, and the ones at the end do too, but the LLM pays a bit less attention to the stuff in the middle.
This is a well-known phenomenon and it's much more pronounced the larger the context window gets.
So when you're assessing a model, you shouldn't just think, "wow, this context window limit's huge, that's incredible, I can put so much stuff in there."
That model will still probably have lost-in-the-middle issues. It may even have trouble retrieving information from its own context window.
And so even if that model supports a big context window, you'll definitely still get better results from using fewer tokens in the context.
▶Read: Messages, System Prompts and Reasoning Tokens
Messages, System Prompts and Reasoning Tokens
The first part of understanding LLMs is understanding the protocol for talking to them - the messages that make up the conversation.
What is a user message? What is a system prompt? What are tool calls? What are reasoning tokens? Let's break it down now.
User And Assistant Messages
A simple conversation with an LLM might look like this. You have user messages, which are messages from you, the user. LLM messages are called assistant messages.
In this situation we're just talking to the bare model that the model provider gives us. But if we want to customize the LLM's behavior, then we can use a system prompt.
System Prompts
System prompts are messages at the very start of the history, which are visible to the LLM, but usually not to the user.
They're powerful, too. The LLM, if there's a conflict between the user and the system prompt, will usually obey the system prompt. For instance, we can tell it to respond in Morse code, and then even if we say stop doing Morse code, it will reply in Morse code.
Of course, this doesn't always work and there are plenty of examples of people jailbreaking the system prompt, but this is at least the theory.
Reasoning Tokens
Lots of models can send back reasoning tokens, where the model appears to think through its output before it responds. These reasoning tokens are just another part of the assistant message.
This is an important concept here that messages can contain multiple parts. This is useful for file parts too, where you can send files to LLMs or get files back from them in image generation use cases.
For instance here, we're sending a file to an LLM saying summarize this PDF, and it replies with a summary.
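One way to picture messages with multiple parts is as plain data. The shape below (a `role` plus a list of typed `parts`) is a generic illustration I've assumed. Every provider and SDK has its own exact field names, and the message text is invented.

```python
conversation = [
    {
        "role": "system",
        "parts": [{"type": "text", "text": "You are a concise assistant."}],
    },
    {
        "role": "user",
        "parts": [
            {"type": "text", "text": "Summarize this PDF."},
            {"type": "file", "media_type": "application/pdf", "filename": "report.pdf"},
        ],
    },
    {
        "role": "assistant",
        "parts": [
            {"type": "reasoning", "text": "The user wants a short summary of the file."},
            {"type": "text", "text": "The report covers Q3 results in three sections."},
        ],
    },
]


def answer_text(message):
    """Join only the text parts, skipping reasoning and file parts."""
    return " ".join(p["text"] for p in message["parts"] if p["type"] == "text")


for message in conversation:
    print(message["role"], [part["type"] for part in message["parts"]])

print(answer_text(conversation[-1]))
```

The assistant message carries both its reasoning and its final answer, and the application decides which parts to display.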
Tools
With tools, you can tell the LLM, "You have access to a write file tool." This tool might write a file to our file system or something.
Then the user is going to say, "Write a todo.md file." The assistant then produces a tool call message. That tool call has an id on it. And it also has an instruction to write a file todo.md with empty contents.
Now we, in our applications, take that tool call and execute it, and we send back to the LLM a tool result with the same id as the tool call and the result of what happened, usually as a string.
We get back from the assistant a summary of what happened, "wrote a todo.md file successfully".
So tools are kind of like a conversation between the LLM and our system, where:
- The tool call is the LLM asking us for something
- The tool result is us giving it the information it needs
That's how AI-powered apps like Claude Code and Cursor can do things on your system. We'll dig deeper into tools later.
▶Read: What Are Tools?
What Are Tools?
Giving the LLM tools is done via the system prompt. The system prompt is just another message in the message history which describes to the LLM what it's supposed to be doing, and in this case, what tools it can call.
For each tool, we're providing three things:
- The name of the tool (e.g., writeFile)
- A description of the tool (e.g., "Write a file to the file system")
- The parameters it takes and their types (e.g., path and content)
These parameters are specified in JSON schema, so we can pass anything that JSON schema supports, like objects, arrays, and other complex types.
These tool definitions get injected into the system prompt, and any other information you've provided to the system prompt goes below it. There's nothing particularly fancy going on - it's just tool definitions inside the system prompt.
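As a sketch of what such a definition might look like, here is the writeFile tool described as a Python dictionary with a JSON schema for its parameters, plus a helper that places the definitions above any other instructions. The dictionary layout and the `build_system_prompt` helper are assumptions for illustration. Real providers accept tool definitions in their own formats and handle the injection for you.

```python
import json

# The three pieces from the list above: name, description, parameters.
write_file_tool = {
    "name": "writeFile",
    "description": "Write a file to the file system",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["path", "content"],
    },
}


def build_system_prompt(tools, extra_instructions):
    """Put the tool definitions first, then any other instructions below."""
    tool_json = json.dumps(tools, indent=2)
    return (
        "You have access to the following tools:\n"
        f"{tool_json}\n\n"
        f"{extra_instructions}"
    )


system_prompt = build_system_prompt([write_file_tool], "Keep replies short.")
print(system_prompt)
```

Because the parameters are ordinary JSON schema, nested objects and arrays can be described the same way.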
The LLM Chooses a Tool
The magic happens when we ask the LLM to choose a tool. We've got our system prompt saying "You have access to the following tools." Then we add a user message, which says, "Write a new file called .gitignore."
We then receive back an assistant message with a tool call inside. This is just an instruction from the LLM indicating which tool to call. It has an id on it and also contains the required parameters.
In this example, it's writing an empty file to the path ".gitignore".
Tool Calls vs. Tool Execution
This tool call is just an instruction for which tool should be called. Nothing has happened yet - the assistant has just produced a message. That's it.
The tool then needs to be executed on our machine. The LLM has created this message, but we then need to actually execute the creation of the file on our machine.
Implementation Requirements
This means that for every single tool in the system prompt, we're going to have functions in our code base that match up to those.
If the tool execution is successful, we're then going to send back a user message to the LLM. It's going to have the same id as the previous tool call, and we're going to send it a message saying what happened when we executed the tool.
Handling Errors
This error handling is really important. If there are any errors when the tool is executed, we need to show that error message to the LLM so that it can do something differently.
So a tool result could be a success or it could be a failure.
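Here is a minimal sketch of the execution side: a function in our code base that matches the tool name, and a wrapper that turns either outcome into a tool result carrying the same id. The dictionary shapes and the `status` field are hypothetical, not a specific SDK's format.

```python
import os
import tempfile


def write_file(path, content):
    """The real function behind the writeFile tool."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"Wrote {len(content)} characters to {path}"


# One entry per tool named in the system prompt.
TOOLS = {"writeFile": write_file}


def execute_tool_call(tool_call):
    """Run the requested tool and report success or failure with the same id."""
    try:
        tool = TOOLS[tool_call["name"]]
        output = tool(**tool_call["arguments"])
        return {"id": tool_call["id"], "status": "success", "output": output}
    except Exception as error:
        # Pass the error text back so the LLM can try something different.
        message = f"{type(error).__name__}: {error}"
        return {"id": tool_call["id"], "status": "error", "output": message}


with tempfile.TemporaryDirectory() as workdir:
    ok = execute_tool_call({
        "id": "call_1",
        "name": "writeFile",
        "arguments": {"path": os.path.join(workdir, ".gitignore"), "content": ""},
    })
    failed = execute_tool_call({
        "id": "call_2",
        "name": "writeFile",
        "arguments": {"path": os.path.join(workdir, "missing", "notes.md"), "content": "hi"},
    })

print(ok["status"], failed["status"])  # success error
```

The second call fails because the "missing" directory does not exist. The error text is returned rather than raised, so it can be placed in the message history for the LLM to react to.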
The Complete Flow
Let's go one more time through the entire flow:
- We specify the tool in the system prompt: "You have access to the following tools..." passing it a bunch of JSON schema
- We send a user message saying "Write a new file called .gitignore" (note that we don't have to specifically say "call this tool" - the LLM itself decides which tool to call)
- The LLM produces a tool call message with all the right parameters
- We see this tool call message and execute it on our machine
- We send the result back with a message saying what happened
- The LLM sees this entire history and responds with a summary
In this case, it said "Done. What should go in there?"
Summary of Tools
That's what tools are - they're just ways of getting LLMs to produce certain types of messages which you can then intercept and execute on your machine, and give those results back to the LLM.
With this simple loop, you can build really, really powerful applications.
- The name of the tool (e.g.,
▶Read: What Is An Agent?
Email*
What Is An Agent?
Ever since Anthropic dropped the article building effective agents, everyone's been talking about agents and workflows. Both agents and workflows are ways of building more powerful systems with LLMs, and they both involve orchestrating multiple calls to the LLM.
What is a Workflow?
A workflow does this through predetermined steps where you have one LLM call which goes to another call which goes to another call. These predetermined steps are written in code by the developer.
The code itself decides when to stop the program, when to call the next LLM. It's all written into the code itself.
What is an Agent?
An agent, though, doesn't use predetermined steps. It calls an LLM and gives it a bunch of tools, different options of things that it can do next. The LLM decides which tool to call and then responds to the result of those tools.
The LLM itself decides when to stop the program when it thinks it's finished. In other words, this is the LLM making it up as it goes along, which hands a lot more power to the LLM, but of course makes it less predictable.
Comparing Agents and Workflows
Both agents and workflows involve multiple LLM calls. If you're just making one LLM call, it's not really either of those things.
The crucial difference is who decides when to stop:
Agents are really good in situations where the steps to complete the task are not particularly clear, where it needs the ability to improvise to figure out its way through a difficult problem.
But workflows are great for things that need to be done the same way again and again. Workflows are often unfairly maligned because they're not as exciting and sexy as agents.
You'll often get better results from using a workflow than using an agent, as long as the task is clearly specified.
Parallel Workflows Example
For instance, you can use a workflow to parallelize work. Let's say we take in a chunk of text, we can split it into two parts, summarize each of them independently, and then summarize the summaries afterwards.
That's the difference between agents and workflows. I don't know whether it's me being old and boring, but I am more excited by workflows than agents.
▶Read: What Can You Use LLM's For?
Email*
What Can You Use LLM's For?
Before we dive into the world of LLMs, and how they work, we first need to know where we're going.
What are LLMs actually used for these days? What utility do they have? What can you build with them?
In this article, we'll explore the different applications of LLMs. We'll also look at things you shouldn't build with LLMs - things that are better suited to more deterministic tools.
Use LLMs For...
Unstructured Data -> Structured Data
Most companies have access to a lot of unstructured data. These could be transcripts from support calls, customer emails, invoices, or even just notes from meetings.
This data is hard to work with. It's hard to search, hard to analyze, and hard to read.
For example, a friend of mine has started a PhD looking at historical hawk migration patterns. Data for how these hawks moved is recorded in ancient, unstructured logs from the 19th century. Despite being carefully archived and digitized, reading through these logs is an absurdly time-consuming task.
The solution? To use LLMs to convert these logs into tabular data. LLMs can read the logs, understand the patterns, and convert them into a structured format. This structured format can then be used for analysis, visualization, and further research.
This is perhaps the most common, powerful use case for LLMs. The world has collected vast amounts of data in the past few decades. LLMs are now making that data trawlable and accessible.
This is a common theme—LLMs are often used for data tasks that it would be impractical (or expensive) to hire humans for. This opens up new possibilities for working with previously inaccessible data.
Labeling & Classification
Another common task for LLMs is classification. They can be fed an input and asked to attach labels to it, which helps in organizing and understanding the data more effectively.
One striking example comes from The Prompt Report. In their case study, they attempt to detect "signal that is predictive of crisis-level suicide risk in text written by a potentially suicidal individual". They used data from the subreddit r/SuicideWatch, and the LLM had to match up with an expert's analysis.
The LLM, when provided with the text, would have to classify whether or not it contained elements of either "frantic hopelessness" or "entrapment". It would reply with "positive" (i.e., that the text contained risk signals) or "negative".
Classification systems have been around in machine learning for a long time. They usually require significant amounts of data to train. LLMs make this process easier by only requiring a simple prompt to change their behavior to a classifier. Very useful.
Recent developments in LLMs also mean it's easier to retrieve structured data from them. Check out a classification example from the Vercel AI SDK tutorial.
Question Answering
Another common use case for LLMs is as a question answerer. You can feed an LLM a question, and it will give you a response based on its training data.
However, LLMs have several downsides when used as a knowledge base. Their training data has a cut-off point, so it doesn't have access to up-to-date information. They often can't cite sources for their answers, which makes it hard to verify their accuracy.
Therefore, connecting LLMs to external data sources is a common pattern.
This external data source could be a database, a search engine, or any API. LLMs can call external services (using tools) to get the most up-to-date information.
This is not foolproof—careful work is needed to make sure the LLM does not hallucinate or provide incorrect information. But question answerers, in the form of chatbots or search engines, are a common use case for LLMs.
DeepResearch, a now-common offering from Perplexity, Google, OpenAI, and others, is a good example of this. It's a pattern where an entire academic-style report is generated from a simple query.
Agents
The fact that LLMs can access external tools has a lot of folks very excited. It means that LLMs can be used to do things in the world, not just generate text.
This pattern is often called an "agent"—a system that can take actions in the world, respond to user inputs, and interact with other systems.
One can imagine a coding agent acting like a team member—contactable via Slack, able to write code, deploy it to production, and communicate with the user.
This is similar to the promise of agents like Devin.
However, agents have not yet had their breakout moment—certainly not in the way chatbots have. Agents are yet to find their final form in terms of user experience.
Don't Use LLMs For...
Naive Chatbots
It can feel very tempting to build chatbots with LLMs. It's very simple to set up. You feed the LLM a prompt, give it access to a conversation history, and you're good to go.
"Chat with our docs." "Chat with our support bot." "Chat with your search results." Naive chatbots are thin wrappers around LLMs, hastily thrown together to make a product seem more interactive.
However, productionizing chatbots is an extremely difficult problem. If you're not careful, they will frustrate your users and damage your brand. It is notoriously difficult to make a chatbot only respond to relevant queries without veering off-topic.
The big model providers (OpenAI, Anthropic, Google, etc.) come with built-in guardrails to prevent their models from saying anything brand-damaging. But the surface area is so large—any potential conversation you can think of—that these guardrails will likely never be perfect. A famous example is Google's Gemini asking the user to die.
It only takes one determined user to jailbreak your chatbot and make it say something inappropriate. Don't ship chatbots without proper safeguards.
Deterministic Systems
A good rule of thumb for AI systems is "if it can be built deterministically, it should be."
LLMs are probabilistic systems. They are designed to choose the next word in a piece of text, over and over again, from a choice of many possible options. Depending on how the next word is selected (their "sampling strategy"), they can produce different outputs from the same input.
However, this design also makes them prone to several failure modes:
- Hallucinations: generating text that is not grounded in reality
- Sycophancy: overly conforming to the user's point of view, instead of providing a balanced response
These failure modes can be worked around, but they require careful design and testing. This means that if you can build a system deterministically, you should.
Deterministic systems are far easier to test, debug, and maintain. They are safer to put into production and are often faster and cheaper to run.
Conclusion
Deterministic systems are not going away. They are infinitely easier than AI apps to build, test, and maintain. In a world where folks are throwing LLMs at every problem, being able to sniff out when not to use them is a valuable skill.
Deterministic systems should be your default choice for any task, until you hit a barrier which can only be solved by an LLM.
But LLMs do have their place. Let's take the LLM use cases we've seen so far and put them into two buckets.
First, there are the tasks that are too expensive to hire humans for:
- Converting unstructured data into structured data
- Labeling and classification
Then, there are the tasks that are too complex for deterministic systems:
- Question answering
- Text generation
- Agents
So any task that falls into one of these buckets is a good candidate for an LLM.
▶Read: 5 Questions To Ask Before Choosing An LLM
5 Questions To Ask Before Choosing An LLM
Choosing the right model is crucial to the success of your AI-powered app. But it's not an easy call.
It's a tough, many-layered decision that you don't only make once. You'll have to make it over and over again as new models emerge and your app evolves.
I've split this decision down into several key questions you'll need to ask yourself when choosing a model.
1. Should I Use An Open Or Closed Model?
There are two main types of models you'll need to choose from: open and closed.
Open Models
Open source models are models that are free to download and use. However, you'll need to host and run them yourself if you want to build an app with them.
Open models can be run on your own hardware, or on a cloud provider like AWS or Azure.
The Open LLM Leaderboard is a great place to look for open models.
Closed Models
Closed-source models are ones controlled by companies. You need to pay to use them, but they are hosted by the company, so you don't need to worry about running them yourself.
The most powerful models in the world are currently closed-source. But open models are improving all the time.
Chatbot Arena is a good place to compare closed-source against open-source models.
Model Providers, API Providers, and Hosting Your Own Model
There are two types of companies who host models for you to use:
Model providers use closed-source models. You pay to use their models, but they're hosted by the company. These include OpenAI, Google, Anthropic, Deepseek, and many more.
API providers host open source models, and charge you a fee for using them. They include Hugging Face, Groq, and others.
You can also host your own model. This is the most flexible option, but also the most expensive. You'll need to pay for the hardware to run the model, and you'll be responsible for keeping it up and running.
2. How Much Will It Cost?
The way you'll pay for your LLM usage changes depending on whether you're hosting your model or not.
Cost Per Token
Most model providers charge by token. The more tokens you use, the more you pay. The most common way this is expressed is "cost per 1m tokens".
You don't just get charged for input tokens (what you say to the model), but also for completion tokens (how the model replies).
This is a traditional 'pay per usage' model - just like serverless platforms charging you for compute time. Tokens are a decent metric for how much you're using the model.
There is a worldwide race-to-the-bottom on token prices, as companies compete for market share. Price comparison websites like Helicone's are useful for comparing prices.
Hosting Open-Source Models
Hosting open-source models can be a more cost-effective option. Instead of paying per token, you now pay a fixed fee to host the model.
This also has the benefit of keeping you entirely in control of your own data. This can be important for data residency and privacy reasons.
However, models need to be hosted on powerful hardware, which can be expensive. You'll need to balance the cost of hosting the model with the cost of using a model provider.
My general suggestion is to start with third-party APIs. They give you the most flexibility and are reasonably cost-effective. Later, you can move to hosting your own models if you need to.
3. How Important Is Latency?
Another important feature to consider is latency. Latency is the approximate time it takes for the model to respond to a query. Fast responses can be crucial for many use cases, and can make your application more useful to users.
Latency is affected by the size of the model - smaller models are faster to run, but often less accurate.
Latency is also affected by the hardware the model is running on. More powerful hardware can run models faster.
And finally, it's affected by any inference optimizations made to the model. This helps speed up the inference function of the model. They include quantization, distillation and parallelism - and are somewhat outside the bounds of this article.
Measuring Latency
There are two main metrics to consider when measuring latency:
- TTFT: Time to first token: how long it takes for the model to start generating a response
- TPOT: Time per output token: how long it takes for the model to generate each token
4. How Do I Assess Model Performance?
Cost and latency are important, but a model is useless if it cannot perform the task required.
As a general rule, smaller models will perform worse. A smaller model will have fewer parameters - and so a smaller space in which to store information.
Public Benchmarks
The performance of a model is an extremely slippery metric. It's extremely difficult to look at two models and say which one is better.
A common approach that model providers (and open-source projects) take is to use benchmarks. These are standardized tests that measure the performance of a model on a specific task. These benchmarks can grade the model at certain tasks, such as translation, summarization, question-answering, or coding.
Benchmarks are a decent early indicator for a model's performance. However, there is a constant danger of model providers overfitting their models to these benchmarks. This can either happen by the model accidentally consuming the benchmark data during pre-training, or organizational pressure to improve benchmark scores.
In some benchmarks, models are compared against each other - with humans grading which model produces the better output. Chatbot Arena is a good example of this - and worth checking out for an early indicator of model performance.
Specialized Models
Some models will perform better at certain tasks. This often depends on the model's training data - if it has been trained on large amounts of code examples, it will be better at coding tasks. The same is true for many disciplines: translation, classification, summarization, etc.
If you have a specialized task, it's worth looking for models which are specialized for that task. These models will usually outperform general-purpose models. They are also often smaller - so, faster and more efficient.
Reasoning Models
Some models have been specifically designed to pause before providing a response. These are the reasoning class of models, a trend initiated by OpenAI's o1.
These models often perform better at tasks requiring forward planning and critical thinking, like coding and math problems. They also output their planning process with reasoning tokens, which can be useful to stream to the user in real-time.
However, they are often more expensive than regular models, and take longer to respond. It's a performance/latency tradeoff.
Evals
The only way a model can truly be evaluated is by testing it in the context of your application. This is why building evals for your system is so crucial.
Evals are a set of benchmarks you run on your own system. They let you see whether your system is improving or degrading over time. We'll cover them in more depth later.
5. How Big A Context Window Do I Need?
The context window is the number of tokens the model can see at a time. The larger the context window, the more information the model can use to generate its next word.
This limit is counted in tokens, and counts both input and completion tokens. Passing too long an input to a model (or forcing it to generate too long a response) can cause an API error, or prevent it from generating a response.
The context window size is related to the mechanism the model uses to generate text - so is tied to the design of the model itself. Context window sizes are growing all the time. Currently, Gemini models have the largest context windows.
Since the context window is limited on all models, managing it is a constant battle for AI engineers. Patterns like chunking in RAG are designed to squeeze more information into the context window.
Conclusion
These five factors are important for choosing your model:
- Open or Closed
- Cost
- Latency
- Performance
- Context Window
Leaderboards and benchmarks are a good place to start. However, the only way to truly assess a model's suitability is to test it in your application via experimentation with your own evals.
Join over 54,000 Developers Becoming AI Heroes
Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.
Email*
I respect your privacy. Unsubscribe at any time.
▶Read: 17 Techniques For Improving Your LLM-Powered App
17 Techniques For Improving Your LLM-Powered App
Once you clearly understand your success criteria, have picked your model, and written some basic evals, it's time to start improving your system.
The process of improving your system comes down to two things:
- Improving your feedback loop (evals)
- Improving the performance of the system itself
We've already looked at how to improve your evals. In this article, I'll give you an overview of the main ways you can improve your system.
But first, let's talk about the mindset you need to have when improving your system.
Try the Simple Thing First
We've already seen that improving an AI system is an experimental process. You need to try things out, see what works, and iterate.
Techniques for improving a system range from simple and cheap to complex and expensive. Tweaking a prompt? Cheap. Training a model from scratch? Astonishingly expensive.
I call this the Staircase Of Complexity Hell:
The key is to start at the top of the staircase, and work your way down only when you've exhausted all the simpler options. Simple techniques can provide a huge improvement for a small amount of effort.
Techniques To Try
This list is ordered from the simplest techniques to the most complex. Start at the top, and work your way down. I've linked to further resources where you can get more details on each technique.
1. Your First Prompt
Problem: You've got to start somewhere.
Solution: Here are some basic tips for improving your prompts:
- Be clear, direct, and specific.
- Think of the LLM as a brilliant, but very new, employee.
- Remember that the LLM has no context on your norms, styles, or guidelines.
Resources
- Anthropic's Advice is a useful guide for understanding how to write good prompts.
- Anthropic's Prompt Library is a great way to explore good prompts for your use case.
- OpenAI's docs on writing clear instructions.
2. Role-Based Prompting
Problem: You want the LLM to behave in a certain way no matter the input.
Solution: Use role-based prompting to get the LLM to adopt a persona.
This could be as diverse as adjusting the tone of voice:
Or even the accent:
Or prime the LLM to talk about a certain topic:
This is an extremely common technique, and very cheap to implement. It's usually done in a system prompt.
Resources
- My video on System Prompts in Vercel's AI SDK
- Anthropic's docs on role-based prompting
- OpenAI's docs on asking the model to adopt a persona
3. XML Tags
XML Tags On The Input
Problem: You want to pass multiple pieces of information to the LLM in a single prompt.
Solution: Use XML tags.
XML tags can help provide delimiters for different parts of the prompt.
An example from Anthropic's docs is a financial report:
XML Tags On The Output
Problem: You want the LLM to respond with multiple different outputs.
Solution: Tell the LLM to respond with different outputs based on the XML tags.
You can also tell your LLM to respond with different outputs based on the XML tags in the prompt. This can give you more control over the structure of the response.
You may want the LLM to review an article for you. You may want it to provide a , a , and .
This technique was popularised by Anthropic, but most models also support it.
Resources
- Anthropic's docs on using XML tags in your prompts
- OpenAI's docs mention using XML tags as delimiters
4. Constraining The LLM's Response Format
Problem: You want to tightly constrain the text that comes back from the LLM, such as asking it to reply with JSON or a single word.
Solution: Use structured outputs or explicit format instructions.
Older versions of Claude supported "prefilling" the assistant message to steer the response format. As of Claude 4.6, prefilling returns a 400 error. The modern alternatives are better:
Structured Outputs (recommended): Most LLM providers now support structured output schemas that guarantee the response matches a specific JSON shape. With the Vercel AI SDK, use or :
System prompt instructions: For simpler constraints, tell the model what format you want in the system prompt:
Both approaches are more reliable than prefilling ever was. Structured outputs give you type-safe, validated responses. System prompt instructions work when you need flexible text output in a specific shape.
Resources:
5. Structured Outputs
Problem: You want the LLM to return structured data instead of text.
Solution: Use structured outputs.
Structured outputs are a way to get the LLM to return data in a structured format, like JSON. Most LLM providers support providing a JSON schema description of the output you want.
The Vercel AI SDK is a particularly good toolset for this.
Resources
- Anthropic's Docs on JSON mode
6. Reasoning
Problem: The LLM is not doing well enough at complex, multi-step reasoning tasks, like coding or math problems.
Solution: Prompt the LLM to reason through the problem using chain-of-thought (CoT) prompting.
Chain-of-thought prompting encourages the LLM to break down problems step-by-step, leading to more accurate and nuanced outputs. This technique is particularly effective for tasks that require complex reasoning, analysis, or problem-solving.
There are three main approaches to chain-of-thought prompting, from simplest to most complex:
- Basic CoT: Simply include "Think step-by-step" in your prompt. While simple, this lacks guidance on how to think.
- Guided CoT: Outline specific steps for the LLM to follow in its thinking process.
- Structured CoT: Use XML tags like and to separate reasoning from the final answer.
Chain-of-thought prompting trades speed for quality. The LLM must process and output its reasoning steps, so the response time gets longer. This matters most in real-time applications - a chatbot needs quick responses, while a code reviewer can take longer for detailed analysis.
Resources
- Anthropic's docs on chain-of-thought prompting
- OpenAI's advice on giving the model time to think before coming to a conclusion
7. Multishot Prompting
Problem: The LLM needs to understand a specific pattern or format but isn't getting it from a single example.
Solution: Provide multiple examples to help the LLM understand the pattern.
Multishot prompting can achieve results similar to fine-tuning, but without the cost and complexity of training a new model. It works by showing the model examples of what you want it to do.
It's straightforward - provide a few examples of input and output, and the model learns the pattern. No training data or compute resources needed.
Here's a practical example for writing product descriptions:
After seeing these examples, the model learns to write product descriptions with sensory language and focus on benefits. If you then give it "Input: Coffee maker", it will generate a similar style description.
Multishot prompting contrasts with zero-shot prompting, where you just describe what you want without examples.
Resources
- Anthropic's docs on multishot prompting
- OpenAI's docs on providing examples
8. Temperature
Problem: The LLM's outputs are either too deterministic (boring) or too random (unreliable).
Solution: Adjust the temperature parameter to control the randomness of outputs.
You can pass a temperature parameter to the LLM. This controls how random or deterministic the LLM's outputs are.
Think of temperature as your creativity dial. When you're writing code or need precise facts, you'll want to specify 0.0-0.3 - this makes the model stick to the most likely outputs. For general chat or creative writing, a medium setting of 0.4-0.7 gives you a nice balance. And when you're brainstorming or need fresh ideas, bump it up to 0.8-1.2.
Higher temperature means more interesting outputs, but might produce more hallucinations. My general suggestion is to start conservative and dial it up only when you need more variety.
Either way, it's a relatively cheap technique to try.
Resources
- OpenAI's API Reference explains how temperature affects token sampling
9. Tool Calling
Problem: LLMs are limited to text generation and can't directly interact with external systems or perform actions in the world.
Solution: Give the LLM access to specific functions or tools it can call to extend its capabilities beyond text generation.
Tool calling bridges the gap between an LLM's internal capabilities and the external world. It allows LLMs to perform actions like making API calls, accessing databases, or manipulating files. The LLM describes what it wants to do, and the system executes the appropriate tool with the specified parameters.
This pattern is particularly useful when you need your LLM to interact with external services, perform system operations, or access data that isn't in its training data. It's a fundamental building block for creating more capable AI applications.
You can learn how to implement this pattern using Vercel's AI SDK in my tutorial.
Resources
- Anthropic's docs on tool calling
- OpenAI's docs on function calling
10. LLM Call Chaining
Problem: A single LLM call isn't sufficient to complete a complex task.
Solution: Break down the task into multiple LLM calls that build on each other.
When you need to perform multiple specialized operations on the same input, trying to do everything in a single prompt often leads to subpar results. Each operation might need different expertise and focus.
This is where LLM call chaining comes in. Instead of asking one prompt to do everything, you break the task into specialized steps. Each prompt focuses on one aspect of the task, and its output becomes the input for the next prompt in the chain.
Take code analysis and fix generation as an example. The first prompt acts as a code analyzer, identifying and categorizing issues in the code. It provides context for each issue, creating a structured analysis.
The second prompt then uses this analysis to generate targeted fixes, building on the first prompt's insights. This separation of concerns allows each prompt to be optimized for its specific task, leading to better results than trying to do both operations in a single prompt.
This pattern can be applied to many other scenarios:
- First analyze a document's structure, then generate a summary
- First identify key points in a debate, then craft a balanced response
- First extract facts from research, then write a layperson explanation
- First identify bugs in code, then generate fixes for each one
Resources
- Anthropic's docs on prompt chaining
- UPDATE: OpenAI has removed their guide on using inner-monologue
11. RAG
Problem: Your LLM is making up facts because it can't access the information it needs.
Solution: Give it access to real data through retrieval augmented generation.
RAG is a powerful technique for grounding your LLM's responses in actual data and reducing hallucinations. Every LLM has a cutoff date for its training data - it can't know about events or information after that date. Instead of relying on what it learned during training, it can look up fresh information as needed.
You've got two main ways to feed data to your LLM. Web search gives you access to current information and public knowledge. Company databases and documentation let you tap into private, domain-specific information. This is particularly useful when you need answers about your company's internal processes or want to ensure your LLM's responses are up-to-date.
RAG shouldn't be your first port of call when building an LLM application. It adds significant complexity to your system - you need to manage data sources, handle retrieval, and ensure your context windows stay within limits.
Resources
- OpenAI's article on RAG and semantic search. There is also a section in their official docs.
12. Chunking
Problem: The information you want to retrieve is too large to fit in the context window.
Solution: Break down the information into smaller, manageable chunks.
Chunking is a fundamental technique in RAG systems that breaks down large documents into smaller, more manageable pieces. The goal is to create chunks that are both semantically meaningful and small enough to fit within your model's context window.
The complexity of chunking comes from the many ways you can split content. Here are the main approaches:
- Token-based: Splits content based on token count, ensuring you stay within model limits
- Character-based: Splits by character count, useful for raw text processing
- Sentence-based: Preserves natural language boundaries
- Paragraph-based: Maintains larger semantic units
- Semantic boundaries: Uses embeddings to find natural break points
- Document-structure: Respects document formatting (headers, sections, etc.)
After chunking, you'll need to find the most relevant chunks for each query. Here are the main ways to do this:
- BM25: A traditional search algorithm that finds exact word matches, great for technical terms and error codes
- Embeddings: Converts text into vectors to find semantically similar chunks
- Hybrid Search: Combines BM25 and embeddings for better results
- LLM Reranking: Uses another LLM to carefully read and rank chunks by relevance
Each approach has its strengths - BM25 excels at exact matches, while embeddings capture meaning. Many systems combine multiple approaches for the best results.
Resources
- Pinecone's guide on chunking strategies
- Anthropic's research on contextual retrieval
- Research paper on evaluating chunking strategies
13. Agentic Loops
Problem: LLM call chaining is too rigid for complex tasks. It requires predefined steps and stopping points, making it unsuitable for open-ended problems where the number of steps is unpredictable.
Solution: Pass control to an autonomous agent that can plan, execute, and adapt based on environmental feedback.
LLM call chaining uses predefined steps and stopping points, which limits its ability to handle unpredictable tasks. Agentic loops hand more control to the LLM - letting it decide when to stop based on task progress. It judges that progress from real-world feedback, such as the results of the actions it takes, rather than from a fixed number of steps.
The resulting system is more powerful because it adapts to unpredictable paths. Instead of following predefined steps, it learns and adjusts based on each interaction. This makes it effective for complex problems where the solution isn't known in advance.
This autonomy comes with a cost - increased latency from decision-making at each step. The LLM must evaluate the current state and choose the best path forward. This makes agentic loops slower than LLM call chaining, but more capable of handling complex tasks.
This pattern works well for:
- Complex code modifications across multiple files
- Research tasks requiring multiple information sources
- Customer support scenarios with unpredictable paths
- Data analysis requiring multiple processing steps
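The loop itself can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `decide` is a hypothetical stand-in for a real LLM call, the `Action` shape is invented for the example, and `maxSteps` is a safety cap the text above doesn't mention but most real loops need.

```typescript
// The two things the model can ask for: run a tool, or stop.
type Action =
  | { type: "tool"; name: string; input: string }
  | { type: "finish"; answer: string };

type Tool = (input: string) => Promise<string>;

// Hypothetical stand-in for an LLM call: given the transcript
// so far, it returns the next action to take.
type Decide = (transcript: string[]) => Promise<Action>;

export async function runAgent(
  task: string,
  decide: Decide,
  tools: Record<string, Tool>,
  maxSteps = 10,
): Promise<string> {
  const transcript = [`task: ${task}`];

  for (let step = 0; step < maxSteps; step++) {
    const action = await decide(transcript);

    // The model, not the code, decides when the task is finished.
    if (action.type === "finish") return action.answer;

    const tool = tools[action.name];
    const observation = tool
      ? await tool(action.input)
      : `error: unknown tool "${action.name}"`;

    // Feed the result back so the next decision can adapt to it.
    transcript.push(`${action.name}(${action.input}) -> ${observation}`);
  }

  throw new Error(`Agent did not finish within ${maxSteps} steps`);
}
```

Every pass through the loop costs one extra model call, which is where the added latency comes from.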
Resources
- Anthropic's Article on building effective agents
14. Parallelizing LLM Calls
Problem: Your LLM-powered system is taking too long because it processes tasks one at a time, creating unnecessary delays.
Solution: Run multiple LLM calls in parallel to handle independent tasks simultaneously, dramatically reducing total processing time.
There are only two ways to make a system faster: do less work, or do more work at the same time. When you need to process multiple tasks independently, running LLM calls in parallel can dramatically improve performance.
You can parallelize when tasks are independent and don't rely on each other's results:
- Analyzing multiple documents
- Generating different variations of content
- Processing multiple user queries simultaneously
You cannot parallelize when tasks must happen in sequence:
- When each step depends on the previous one's output
- When maintaining strict order is crucial for the final result
The performance benefits are significant. A system processing 10 documents sequentially might take 10 seconds, while parallel processing could complete in just 2-3 seconds. You should always be looking for opportunities to parallelize - even in systems that seem sequential, there might be independent components that can be processed concurrently.
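Here is a small sketch of the difference. `fakeSummarize` is a hypothetical stand-in for an LLM call (it just waits and returns a string), so the example runs without any API key.

```typescript
// Hypothetical stand-in for an LLM call that takes `delayMs` to respond.
const fakeSummarize = (doc: string, delayMs = 50): Promise<string> =>
  new Promise((resolve) =>
    setTimeout(() => resolve(`summary of ${doc}`), delayMs),
  );

// Sequential: each call waits for the previous one, so the total
// time is roughly the sum of every call.
export async function summarizeSequentially(
  docs: string[],
): Promise<string[]> {
  const results: string[] = [];
  for (const doc of docs) {
    results.push(await fakeSummarize(doc));
  }
  return results;
}

// Parallel: all calls start at once, so the total time is roughly
// that of the slowest call.
export async function summarizeInParallel(
  docs: string[],
): Promise<string[]> {
  return Promise.all(docs.map((doc) => fakeSummarize(doc)));
}
```

`Promise.all` keeps results in input order and rejects as soon as any call fails; `Promise.allSettled` is the usual alternative when partial results are acceptable. With a real provider, rate limits usually mean capping how many calls run at once.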
Resources
- Anthropic's Article on building effective agents mentions parallelizing LLM calls
- OpenAI's docs include a section on parallelizing LLM calls
15. Evaluator-Optimizer
Problem: Your LLM's responses aren't meeting the quality standards you need, even after multiple attempts.
Solution: Create an automated loop where one LLM generates responses while another evaluates and provides feedback for improvement.
The evaluator-optimizer workflow creates a self-improving system where two LLMs work together. The first LLM generates responses, while the second evaluates them against specific criteria. This evaluation feeds back into the generation process, creating a continuous improvement loop.
This pattern is particularly effective when you have clear evaluation criteria and when iterative refinement provides measurable value. You'll know it's a good fit when human feedback demonstrably improves LLM responses, and when an LLM can provide similar quality feedback.
The pattern excels in scenarios like literary translation, where an evaluator LLM can catch nuanced meaning that the translator might miss initially. It's also powerful for complex search tasks requiring multiple rounds of searching and analysis, where the evaluator decides if further searches are needed.
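As a rough sketch, the loop looks something like this. The `Generate` and `Evaluate` function types are hypothetical stand-ins for two LLM calls, and the `maxRounds` cap is an assumption added so the loop always terminates.

```typescript
type Evaluation = { pass: boolean; feedback: string };

// Hypothetical stand-ins for the two LLM calls.
type Generate = (task: string, feedback: string | null) => Promise<string>;
type Evaluate = (task: string, draft: string) => Promise<Evaluation>;

export async function evaluatorOptimizer(
  task: string,
  generate: Generate,
  evaluate: Evaluate,
  maxRounds = 3,
): Promise<{ draft: string; rounds: number; passed: boolean }> {
  let feedback: string | null = null;
  let draft = "";

  for (let round = 1; round <= maxRounds; round++) {
    // The generator sees the evaluator's feedback from the last round.
    draft = await generate(task, feedback);

    const evaluation = await evaluate(task, draft);
    if (evaluation.pass) return { draft, rounds: round, passed: true };

    feedback = evaluation.feedback;
  }

  // Out of rounds: return the latest draft, flagged as not passing.
  return { draft, rounds: maxRounds, passed: false };
}
```

Returning `passed: false` instead of throwing leaves the caller to decide whether a best-effort draft is good enough.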
Resources
- Anthropic's Article on building effective agents mentions the Evaluator-Optimizer pattern
16. LLM Routers
Problem: Different types of queries need different handling strategies.
Solution: Use an LLM to route queries to the most appropriate handler.
LLM routers act as intelligent dispatchers, analyzing each query and sending it to the right specialized handler.
The router first analyzes each query to determine its type and complexity. A customer service system, for example, might classify each query into one of a small set of categories.
Based on this classification, the router connects the query to the appropriate handler with its specific set of instructions and capabilities.
This routing approach delivers several key advantages. It improves accuracy by ensuring each query is handled by the most suitable specialized system. And it allows each specialized LLM to focus on its specific domain, similar to how LLM chaining breaks down complex tasks into specialized steps.
It also solves a fundamental limitation of LLMs - most models can only handle a limited number of tools (often 30 or fewer). By routing queries to specialized handlers, you can create a system that effectively handles an unlimited number of tools, as each handler only needs access to its relevant subset.
However, adding an LLM router introduces an additional sequential step that increases latency, as each query must first be analyzed before being routed to the appropriate handler.
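A minimal sketch of the routing step follows. The three categories and the keyword-based `keywordClassifier` are assumptions made purely for illustration; in a real system the classifier would be an LLM call, and the handlers would be LLMs with their own instructions and tools.

```typescript
// Hypothetical categories for a customer service system.
type Route = "billing" | "technical" | "general";

type Classify = (query: string) => Promise<Route>;
type Handler = (query: string) => Promise<string>;

// Stand-in for the LLM that classifies the query.
export const keywordClassifier: Classify = async (query) => {
  const q = query.toLowerCase();
  if (q.includes("charge") || q.includes("invoice")) return "billing";
  if (q.includes("error") || q.includes("crash")) return "technical";
  return "general";
};

export async function routeQuery(
  query: string,
  classify: Classify,
  handlers: Record<Route, Handler>,
): Promise<{ route: Route; reply: string }> {
  // This is the extra sequential step that adds latency.
  const route = await classify(query);

  // Each handler only needs the instructions and tools for its own domain.
  const reply = await handlers[route](query);
  return { route, reply };
}
```

Typing `handlers` as `Record<Route, Handler>` means the compiler complains if a category is added without a matching handler.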
Resources
- Anthropic's Article on building effective agents mentions LLM routers
- UPDATE: OpenAI has removed their documentation on intent classification
17. Fine-Tuning
Problem: Your LLM's outputs need to match specific quality requirements that simpler techniques can't achieve.
Solution: Fine-tune a base model on your specific data to improve its performance for your use case.
Fine-tuning lets you adapt existing models to your specific needs. You can start with a relatively small dataset of high-quality examples that demonstrate exactly the kind of output you want - whether that's matching your brand voice, handling specialized terminology, or maintaining consistent formatting. A fine-tuned smaller model can often outperform larger, more expensive models on your specific task.
Fine-tuning sits between pre-training and prompt engineering in terms of cost and complexity. While it's an order of magnitude cheaper than training a model from scratch, each fine-tuning run will incur additional costs. There's also a risk of overfitting to specific model versions, which can make it harder to transition to newer, better models in the future.
The best time to consider fine-tuning is when you have a working system that's already using simpler techniques. You've validated your use case, gathered real-world data, and identified specific areas where the model's performance needs improvement. Fine-tuning then becomes an optimization step to push your system's performance even further.
Resources
- OpenAI has a section in their docs on fine-tuning.
- Anthropic's fine-tuning guide provides detailed requirements and best practices.
18. The Next Big Thing
The field of AI engineering moves at an astonishing pace. Every week brings new models, techniques, and tools promising to revolutionize how we build AI systems. It's impossible to keep up with everything, but you don't need to.
Each new AI development needs to earn its place in your system. Ask yourself: "Does this make things simpler or more complex? Is it solving a real problem?" The most valuable developments reduce costs, improve performance, or make your system more maintainable.
The next big thing in AI will come and go, but simplicity remains a reliable guide. Experiment with new techniques using your own evals - test them against your specific use case and success criteria. Focus on developments that help you build more effective systems with less complexity.
Deep path
- Open the roadmap diagram pack: `sources/pdfs/ai-engineer-roadmap.pdf`. Matt uses these diagrams in his cohorts - re-draw two of them on paper from memory after reading the posts.
- Skim the paid-course landing pages (`aihero.dev-cohorts-*.md`) for the curriculum phrasing - useful as a self-check that you know the vocabulary.
Exercise 1.1 — The token audit
Pick any ~500-word block of your own writing. In a single TypeScript file (`module-1/token-audit.ts`):

```ts
import { encoding_for_model } from "tiktoken";
// ... count tokens for cl100k_base (GPT-4 family) and estimate cost for Opus at $15/1M in, $75/1M out.
```
Print: token count, estimated input cost for 1 call, estimated cost for a 100-request batch. No need to call any API.
Check: Script runs and prints a number. Reflect: What is the input cost of loading your entire CV into the context window, five times?
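If the cost arithmetic is the sticking point, here is a sketch of just that part; counting the tokens is still yours to do. The token count of 700 is a made-up placeholder, and the price is the exercise's assumed $15 per million input tokens. Note that `cl100k_base` is an OpenAI encoding, so the count is only an approximation for Claude models. The formula is tokens / 1,000,000 × price × requests, so 700 tokens works out to $0.0105 for one call and $1.05 for 100 calls.

```typescript
// The exercise's assumed Opus input price, in USD per million tokens.
const INPUT_USD_PER_MILLION = 15;

export function estimateInputCostUsd(
  tokenCount: number,
  requests = 1,
  usdPerMillion = INPUT_USD_PER_MILLION,
): number {
  return (tokenCount / 1e6) * usdPerMillion * requests;
}

// Hypothetical placeholder: replace with the count from your tiktoken script.
const placeholderTokenCount = 700;

console.log(
  `1 call: $${estimateInputCostUsd(placeholderTokenCount).toFixed(4)}`,
);
console.log(
  `100 calls: $${estimateInputCostUsd(placeholderTokenCount, 100).toFixed(2)}`,
);
```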
Exercise 1.2 — Agent vs. workflow
Write `module-1/when-not-to-agent.md`. Pick three tasks from your own work:
▶Read: What Is An Agent?
What Is An Agent?
Ever since Anthropic dropped the article building effective agents, everyone's been talking about agents and workflows. Both agents and workflows are ways of building more powerful systems with LLMs, and they both involve orchestrating multiple calls to the LLM.
What is a Workflow?
A workflow does this through predetermined steps where you have one LLM call which goes to another call which goes to another call. These predetermined steps are written in code by the developer.
The code itself decides when to stop the program, when to call the next LLM. It's all written into the code itself.
What is an Agent?
An agent, though, doesn't use predetermined steps. It calls an LLM and gives it a bunch of tools, different options of things that it can do next. The LLM decides which tool to call and then responds to the result of those tools.
The LLM itself decides when to stop the program when it thinks it's finished. In other words, this is the LLM making it up as it goes along, which hands a lot more power to the LLM, but of course makes it less predictable.
Comparing Agents and Workflows
Both agents and workflows involve multiple LLM calls. If you're just making one LLM call, it's not really either of those things.
The crucial difference is who decides when to stop: in a workflow, the code decides; in an agent, the LLM decides.
Agents are really good in situations where the steps to complete the task are not particularly clear, where it needs the ability to improvise to figure out its way through a difficult problem.
But workflows are great for things that need to be done the same way again and again. Workflows are often unfairly maligned because they're not as exciting and sexy as agents.
You'll often get better results from using a workflow than using an agent, as long as the task is clearly specified.
Parallel Workflows Example
For instance, you can use a workflow to parallelize work. Let's say we take in a chunk of text, we can split it into two parts, summarize each of them independently, and then summarize the summaries afterwards.
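That workflow might be sketched like this. The `Summarize` type is a hypothetical stand-in for an LLM call, and splitting at the midpoint of the string is a simplification; a real version would split on a sentence or paragraph boundary.

```typescript
// Hypothetical stand-in for an LLM call that summarizes some text.
type Summarize = (text: string) => Promise<string>;

export async function summarizeInHalves(
  text: string,
  summarize: Summarize,
): Promise<string> {
  // Step 1: split the text into two parts (naively, at the midpoint).
  const middle = Math.floor(text.length / 2);
  const halves = [text.slice(0, middle), text.slice(middle)];

  // Step 2: the two summaries don't depend on each other,
  // so they can run in parallel.
  const [first, second] = await Promise.all(
    halves.map((half) => summarize(half)),
  );

  // Step 3: summarize the summaries.
  return summarize(`${first}\n${second}`);
}
```

Every step and the stopping point are fixed in code, which is what makes this a workflow and not an agent.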
That's the difference between agents and workflows. I don't know whether it's me being old and boring, but I am more excited by workflows than agents.
▶Read: What Is Vercel's AI SDK?
What Is Vercel's AI SDK?
There's a pretty common problem when you're building AI-powered apps.
Let's say you're using OpenAI as your LLM provider.
You build all of this code to talk to OpenAI's API.
But then one day you think, "Oh, why don't we try Anthropic instead?"
But there's a problem: Anthropic's API is a little bit different.
This is especially true for things like streaming, structured outputs, and tool calling.
So you need to build all of this extra glue code just to try out a new model.
The AI SDK lets you handle that.
It's a library that you can call, which handles the interaction between you and the LLM.
So now you can call the AI SDK instead, which is just a library, and it will seamlessly handle talking to different providers.
Not only that, but it has some really nice helpers for common use cases.
It helps you stream text, work with structured outputs, do tool calling, and even handle agents seamlessly.
Do You Need To Deploy The AI SDK To Vercel?
The AI SDK is maintained by Vercel. But you don't need to use Vercel in order to use the AI SDK.
You don't need to pay Vercel any money - it's free, open-source software.
What Are The Different Parts Of The AI SDK?
There are three different parts of the AI SDK.
You have the AI SDK Core which is used for backend, so Node, Deno, Bun, whatever.
You also have the AI SDK UI which is a set of front-end hooks and components for linking up to an AI SDK backend.
There's also an AI SDK RSC framework, for building with React Server Components.
What Does This Tutorial Cover?
This tutorial is only going to cover the core part of the AI SDK.
This is going to fill you in on the basics and give you the grounding that you need to build virtually anything.
How Do You Install It?
It has probably the best NPM package of all time - `ai`:

```bash
pnpm add ai
```

All the core stuff is under the `ai` package:

```ts
import {
  generateText,
  streamText,
  generateObject,
  streamObject,
} from "ai";
```

You can integrate with different providers by installing their specific package.

For instance, OpenAI:

```bash
pnpm add @ai-sdk/openai
```

And then you can use it like this:

```ts
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4");
```

Conclusion
So that's the lay of the land.
You get a unified API where you can drop models in and out.
You can stream text, generate text, work with structured outputs, call tools, and build agentic loops.
Let's start with the absolute basics: generate text. See you in the next one!
▶Read: vercel ai sdk tutorial
I've been building a lot of stuff with TypeScript and AI recently. I've been building agents, workflows, custom scripts and CLIs to help me work faster and smarter.
And Vercel's AI SDK has been an awesome tool in my toolkit.
It's a library that solves a lot of the pain points when working with LLMs.
And in this tutorial, I'm going to give you a full guide on how to get to grips with it.
This is really good! Learned a few things I didn't know.
Addy Osmani — Engineering leader @ Google Chrome
We're going to cover:
- Hot-swapping models
- Streaming
- Structured outputs
- Handling images and files
- Tool calling
- Building agents
You can use the AI SDK Core anywhere JavaScript runs. And that's why I'm so excited about it. It's so versatile, so powerful and it's just the right level of abstraction for almost any task.
Head to AI Hero's GitHub Repo for all the code, and to run the examples.
▶Read: Generate Text With Vercel's AI SDK
Generate Text With Vercel's AI SDK
Matt Pocock
Let's look at literally the simplest setup that the AI SDK supports: generating text.
1. You take the `generateText` function that you import from `ai`, pass it a prompt and a model, and you get back an object that contains some text.

   ```ts
   import { generateText } from "ai";

   export const answerMyQuestion = async (
     prompt: string,
   ) => {
     const { text } = await generateText({
       model,
       prompt,
     });

     return text;
   };
   ```

2. The model we're using is Anthropic's `claude-3.5-haiku`, which we're getting from Vercel's `@ai-sdk/anthropic`.
3. If we wanted to, we could specify a different model - let's say we're using `claude-3.5-sonnet` instead.
4. We can then use the function that we've created called `answerMyQuestion`. Let's say we ask it "what is the chemical formula for dihydrogen monoxide?" When we run this, it will call Anthropic with our query and get back the answer to the question we asked it.

There's also a bunch of other stuff in this object that we get back, but we'll cover that in our other examples.

```ts
// Lots of other properties on
// this returned object!
const { text } = await generateText({
  model,
  prompt,
});

return text;
```

▶Read: Adding System Prompts With Vercel's AI SDK
Adding System Prompts With Vercel's AI SDK
Matt Pocock
Sometimes you need the AI to act in a certain way no matter what prompt it receives.
In this example, we want the AI to summarize the text that it receives.
We want to give it a role. We want to give it instructions.
And we want to do all of that before it receives the prompt from the user.
1. To do that, we can use a system prompt. Doing that in Vercel's AI SDK is as simple as passing a `system` property.

   ```ts
   const { text } = await generateText({
     model,
     prompt: input,
     system:
       `You are a text summarizer. ` +
       `Summarize the text you receive. ` +
       `Be concise. ` +
       `Return only the summary. ` +
       `Do not use the phrase "here is a summary". ` +
       `Highlight relevant phrases in bold. ` +
       `The summary should be two sentences long. `,
   });
   ```

2. Under the hood, this prepends a special message with a role of "system" that has our system prompt in it. We could do that in the AI SDK as well if we wanted to. You can pass a `messages` property that has an array of messages.

This can be done in `generateText`, `streamText`, and all of the other APIs that contact LLMs and GenAI.

Working with system prompts is one of the key parts of working with LLMs, so it's really nice that the AI SDK makes it so easy.
In our next example, we're going to see how easy it is to swap out your models whenever you need to.
▶Read: Hot-Swap AI Models With Vercel's AI SDK
Hot-Swap AI Models With Vercel's AI SDK
Matt Pocock
I wanted to demonstrate for you just how flexible the Vercel AI SDK is when it comes to model selection.
1. Here we have an `ask` function that takes in a prompt and a model of a type of `LanguageModel`. We can call this function with any model that the AI SDK provides.

   ```ts
   import { generateText, type LanguageModel } from "ai";

   export const ask = async (
     prompt: string,
     model: LanguageModel,
   ) => {
     const { text } = await generateText({
       model,
       prompt,
     });

     return text;
   };
   ```

2. For instance, we can call it with Anthropic.
3. Or with OpenAI.

This gives you a ton of flexibility with how you want to build your application.

The `ask` function is totally decoupled from the model that it uses.

The `LanguageModel` type exposed by the AI SDK lets you do dependency injection. In other words, you can inject any model into your system.

To me, this is one of the core selling points of the AI SDK.
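If dependency injection is new to you, here is the idea stripped of the SDK entirely. Everything in this sketch is hypothetical: `TextModel` is a tiny local interface standing in for the much richer `LanguageModel` type, and the two fake models just echo the prompt.

```typescript
// Hypothetical minimal interface. The AI SDK's real `LanguageModel`
// type is richer; this only illustrates the injection pattern.
interface TextModel {
  id: string;
  complete(prompt: string): Promise<string>;
}

// The function depends on the interface, not on a specific provider.
export const askWith = async (
  prompt: string,
  model: TextModel,
): Promise<string> => model.complete(prompt);

// Two interchangeable fake implementations.
export const fakeModelA: TextModel = {
  id: "fake-a",
  complete: async (prompt) => `[fake-a] ${prompt}`,
};

export const fakeModelB: TextModel = {
  id: "fake-b",
  complete: async (prompt) => `[fake-b] ${prompt}`,
};
```

The same property that lets you swap providers also lets you pass a fake model in tests, so your own code can be exercised without calling an API.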
In our next example, we're going to look at chat history and how you can preserve the history of a chat over time.
▶Read: Working With Message Histories In Vercel's AI SDK
Working With Message Histories In Vercel's AI SDK
Matt Pocock
It's pretty common if you're building any kind of chat bot to want to keep track of the conversation history.
This is so the LLM has context over the conversation you've already had.
So you can ask follow-up questions without having to rephrase your question every time.
Not only that, but understanding how conversations are persisted is really important for communicating with the LLM over the wire.
We're going to show how to do that with the Vercel AI SDK in this example.
The `CoreMessage` Type
Let's start by understanding the shape of what a conversation history might look like.

The AI SDK exposes a type called `CoreMessage`. This is an object that represents a message in a conversation.

If we understand this type we'll understand a conversation history - I promise.

1. Let's try messing about with it to see what it contains. Let's create a messages array, assigning it the type of an array of core messages.

   ```ts
   import { type CoreMessage } from "ai";

   const messages: CoreMessage[] = [];
   ```

2. Every message has to contain a `role` property. This is a string that can be `user`, `system`, `assistant` or `tool`.
3. Each message must also contain a `content` property - this is the content of the message. In this example, the conversation history contains a single message from the user saying "Hello, you!".
4. To represent the LLM replying, we use the "assistant" role.
5. We had a brief look at system prompts earlier. They're represented in the message history with a role of `system`. In this case, we're telling the LLM to greet everyone that speaks to it.

We'll take a look at tools later when we cover tool calling.

As conversations get longer and longer, this array will grow with more `user` and `assistant` messages being added.

Now we understand about the messages array, let's apply it to a real-world (ish) example.
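Before moving on, here is roughly what the array from the steps above looks like once it's filled in. To keep the sketch dependency-free it uses a simplified local `SimpleMessage` type, not the SDK's `CoreMessage`, and the wording of the system prompt and the assistant's reply are assumptions.

```typescript
// Simplified local stand-in for `CoreMessage`: string content only,
// and no `tool` role (covered later with tool calling).
type Role = "system" | "user" | "assistant";
type SimpleMessage = { role: Role; content: string };

export const exampleHistory: SimpleMessage[] = [
  // The system prompt comes first and sets the behaviour.
  { role: "system", content: "Greet everyone that speaks to you." },
  // The user's message.
  { role: "user", content: "Hello, you!" },
  // A hypothetical reply from the LLM.
  { role: "assistant", content: "Hi there!" },
];

// A follow-up from the user is just another entry on the end.
export const withFollowUp: SimpleMessage[] = [
  ...exampleHistory,
  { role: "user", content: "How are you?" },
];
```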
Setting Up A Server
For the first time here, we're going to set up a server.
1. We're going to put it inside a `startServer` function:

   ```ts
   export const startServer = async () => {};
   ```

2. We'll use the Hono package to create a server.
3. We'll add a route to the server.
4. Then, we'll serve it with `serve` from `@hono/node-server`.
5. Finally, we'll wait for it to start using `node:events` and return the `server`.

This gives us a nice function we can call to start a server on `localhost:4317`.

Next, let's build the API call:
1. Our `/api/get-completions` route is going to accept an array of `CoreMessage` objects.
2. We can then pass the entire conversation history into `generateText`.
3. Based on the conversation history, the LLM will figure out what it should try to say next. The messages that it suggests will be added to `result.response.messages`. So we can return those as JSON using `ctx.json`.

```ts
import { generateText, type CoreMessage } from "ai";

// `app` and `model` come from the surrounding server setup.
app.post("/api/get-completions", async (ctx) => {
  const messages: CoreMessage[] = await ctx.req.json();
  // Pass the whole history to the model, then return the new messages.
  const result = await generateText({ model, messages });
  return ctx.json(result.response.messages);
});
```

To sum up, we've built an API endpoint that accepts a conversation history and returns the next message the LLM would say.
This is a pretty common pattern. It means our server is entirely stateless.
The responsibility for tracking the conversation lies with the client.
There are other ways of doing this, like storing the conversations on your server, but this is a simple way to get started.
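The stateless idea can be shown without a server at all. In this sketch, `getCompletions` is a hypothetical stand-in for the route handler (it echoes the last user message where the real one would call an LLM), and `sendTurn` plays the client.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Stand-in for the server route: it stores nothing between calls
// and works only from the history it is handed.
export async function getCompletions(
  history: ChatMessage[],
): Promise<ChatMessage[]> {
  const lastUser = [...history].reverse().find((m) => m.role === "user");
  return [
    { role: "assistant", content: `You said: ${lastUser?.content ?? ""}` },
  ];
}

// The client owns the conversation: it sends the full history
// and appends whatever comes back.
export async function sendTurn(
  history: ChatMessage[],
  userText: string,
): Promise<ChatMessage[]> {
  const outgoing: ChatMessage[] = [
    ...history,
    { role: "user", content: userText },
  ];
  const newMessages = await getCompletions(outgoing);
  return [...outgoing, ...newMessages];
}
```

Because the full history travels with every request, any server instance can handle any turn. The trade-off is that requests get bigger as the conversation grows.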
Calling Our Server
Now we have a server, we can call it from the client. I'm running this on node, but you could easily run this in the browser too.
1. Let's create a new array of messages, and start our server:

   ```ts
   import { type CoreMessage } from "ai";
   import { startServer } from "./server.ts";

   const messagesToSend: CoreMessage[] = [
     {
       role: "user",
       content: "What's the capital of Wales?",
     },
   ];

   await startServer();
   ```

2. We'll use `fetch` to send the messages to the server, and `res.json()` to turn the response into JavaScript objects.
3. Finally, we'll append the new messages to the conversation history, and log it out.

We end up with an output that looks like this:

```json
[
  {
    "role": "user",
    "content": "What's the capital of Wales?"
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "The capital of Wales is Cardiff."
      }
    ]
  }
]
```

This array is ready to receive another message with the role of `user`, and the process can repeat.

Conclusion
In this example, we've shown how to keep track of a conversation history using the Vercel AI SDK.
We've learned about the messages array, including the various roles.
We've seen how to set up a server that accepts a conversation history and appends to it.
And we've seen how to call that server from a client.
▶Read: Streaming Text With Vercel's AI SDK
Streaming Text With Vercel's AI SDK
Matt Pocock
1. In our previous example, we saw how we can generate text with the AI SDK. But this spits out the text all at once at the end. What if we need to stream the text token by token?

   ```ts
   import { generateText } from "ai";

   export const answerMyQuestion = async (
     prompt: string,
   ) => {
     const { text } = await generateText({
       model,
       prompt,
     });

     return text;
   };
   ```

2. For that, we can use the `streamText` function from the AI SDK. This takes in a model and a prompt in exactly the same way, but instead of just returning text, it returns a `textStream`.
3. The `textStream` is an async iterable. That means it can be streamed to a file or over a network connection. In this example, we're just going to stream it to `stdout`. This `for` loop waits for every chunk of the `textStream` and then writes that chunk to `stdout`.
4. Let's say we ask it "what is the color of the sun?". If we now run this code, we're going to see it streaming to our console. And from here it's pretty easy to imagine hooking this up to a network request and then just streaming this to a UI.
5. The `streamText` function also returns a `text` promise. This means if you just want to wait for the full text anyway, you can just await the promise.

▶Read: Structured Outputs with Vercel's AI SDK
Structured Outputs with Vercel's AI SDK
Matt Pocock
Often the thing that you want to get back from your LLM is not text but some kind of object.
You might want to scan a bank statement for multiple properties, like the account number and the balance.
The most efficient way to do this is with structured outputs.
This lets you ask the LLM a question, tell it what format you want the answer in, and then it will send you that information in that format.
In this example, we're going to ask the LLM for a recipe. We want the recipe name, an array of ingredients, and an array of steps that the chef needs to take to make the recipe.

```json
{
  "recipe": {
    "name": "Chocolate Cake",
    "ingredients": [
      {
        "name": "flour",
        "amount": "2 cups"
      },
      {
        "name": "sugar",
        "amount": "2 cups"
      }
      // ...
    ],
    "steps": [
      "Preheat the oven to 350 degrees F.",
      "Mix the flour, sugar, cocoa powder, baking powder, baking soda, and salt in a large bowl."
      // ...
    ]
  }
}
```

1. The first step is to create a `zod` schema that describes the data type that we want to get back from the LLM. If you've never seen `zod` before, I have a free tutorial on my sister site, Total TypeScript. To describe the recipe shape that we have here the `zod` schema would look like this:

   ```ts
   import { z } from "zod";

   const schema = z.object({
     recipe: z.object({
       name: z.string(),
       ingredients: z.array(
         z.object({
           name: z.string(),
           amount: z.string(),
         }),
       ),
       steps: z.array(z.string()),
     }),
   });
   ```

2. We can pass this schema directly to the `generateObject` function from the AI SDK.
3. I've also added a simple system prompt here to give the AI some context as to what we're doing. The result that comes back contains a property called `object` which contains our recipe. Thanks to TypeScript being clever, we also get type-safe access to the `name`, `ingredients`, and `steps`.

Describing Properties
But we're not quite done here. We should provide the AI with more information about what each individual property means.
Currently, all it's got to go on are `name`, `ingredients`, and `steps`.

We can do this by adding `zod`'s `describe` function on each property.

```ts
const schema = z.object({
  recipe: z.object({
    name: z
      .string()
      .describe("The title of the recipe"),
    ingredients: z
      .array(
        z.object({
          name: z.string(),
          amount: z.string(),
        }),
      )
      .describe(
        "The ingredients needed for the recipe",
      ),
    steps: z
      .array(z.string())
      .describe("The steps to make the recipe"),
  }),
});
```

Now it's clear to the AI what we're asking for from each property. This is especially useful when the property names are not that descriptive.

And finally, we can pass a `schemaName` property to the `generateObject` function:

```ts
const { object } = await generateObject({
  model,
  system:
    `You are helping a user create a recipe. ` +
    `Use British English variants of ingredient names, like Coriander over Cilantro.`,
  schemaName: "Recipe",
  schema,
  prompt,
});
```

Let's give this a go and see what outputs we get. Let's ask how to make Baba Ganoush.

```ts
const recipe = await createRecipe(
  "How to make baba ganoush?",
);

console.dir(recipe, { depth: null });
```

When we run this, we're going to get back a recipe for Baba Ganoush.

```
pnpm run example v 08
{
  "name": "Baba Ganoush",
  "ingredients": [
    { "name": "Aubergine", "amount": "2 large" },
    { "name": "Tahini", "amount": "3 tablespoons" }
    // ...
  ],
  "steps": [
    "Preheat the oven to 200°C (400°F).",
    "Pierce the aubergines several times with a fork."
    // ...
  ]
}
```

And there we go, that's how you get structured outputs from the AI SDK.
▶Read: Generate Arrays With Vercel's AI SDK
Generate Arrays With Vercel's AI SDK
Matt Pocock
So far we've looked at getting an LLM to return objects or enums (enumerated values) instead of just returning text.
But what if you want to return multiple objects - an array of objects? What then?
For instance, you might want your LLM to generate lots of fake data.
Zod Schema
1. Let's first create a Zod schema that we first encountered in our structured outputs example. If you don't understand this I've got a free course on Zod on my sister site Total TypeScript.

   ```ts
   import { z } from "zod";

   const schema = z.object({
     name: z.string(),
     age: z.number(),
     email: z.string().email(),
   });
   ```

2. We should also use descriptions on the schema fields to give the AI more context. Like we've seen before, we can use `.describe` here.

Passing The Schema To `generateObject`
Then we can pass this schema to the `generateObject` function - but we also pass in an output of array.

```ts
import { generateObject } from "ai";

export const createFakeUsers = async (
  input: string,
) => {
  const { object } = await generateObject({
    model,
    prompt: input,
    system: `You are generating fake user data.`,
    output: "array",
    schema,
  });

  return object;
};
```

Now the object that we get back from `generateObject` will be this array of users.

What's powerful about this is that we can seed various information about the users, so in this case, they're going to be from the UK like me.

Let's give this a go. We're going to generate some fake users:

```ts
const fakeUsers = await createFakeUsers(
  "Generate 5 fake users from the UK.",
);

console.dir(fakeUsers, { depth: null });
```

And just like that, we get our users.

This is particularly cool with `streamObject` too, where you can stream in the users as they're being created.

So that's how to generate an array of structured objects using the AI SDK.
▶Read: Generate Enums With Vercel's AI SDK
Generate Enums With Vercel's AI SDK
Matt Pocock
Another classic use case for LLMs is classification.
Let's say we want to pass a user comment to the LLM to classify as `positive`, `negative`, or `neutral`.

We want to get back from the LLM essentially one string, and that string is going to be either `positive`, `negative`, or `neutral`.

In traditional software parlance, this is called an enum - a set of enumerated values.

And we can use the AI SDK to generate that enum.

1. We're using the `generateObject` function again, but this time we're passing an output of `enum`.
2. We also pass an `enum` property with an array of the possible strings we want it to return.
3. The result we get back from `generateObject` has an `object` property, and that object is our enum.

```ts
import { generateObject } from "ai";

export const classifySentiment = async (
  text: string,
) => {
  const { object } = await generateObject({
    model,
    output: "enum",
    // The only values the model is allowed to return.
    enum: ["positive", "negative", "neutral"],
    prompt: text,
    system:
      `Classify the sentiment of the text as either ` +
      `positive, negative, or neutral.`,
  });

  return object;
};
```

Let's try out a few different statements to see if it works:

1. `I'm not sure how I feel` comes out as `neutral`.

   ```ts
   const result = await classifySentiment(
     `I'm not sure how I feel`,
   );

   console.log(result); // neutral
   ```

2. `This is terrible` comes out as `negative`.
3. And `I love this so much` comes out as `positive`.

And just like that, we've got a sentiment analysis system. This is a really cool use case for enums, and it's great that the AI SDK makes it so simple.
▶Read: Streaming Objects With Vercel's AI SDK
Streaming Objects With Vercel's AI SDK
Matt Pocock
Scrollycoding
1
In the previous example, I showed you how to get structured outputs from an LLM, but the outputs were all generated at once.
We waited for a little bit and then we saw all the outputs at once.
2
What if you want to see the output as they're generated? In other words, what if you want to stream an object?
You can do that by changing
generateObjecttostreamObject.export const createRecipe = async (prompt: string) => { const { object } = await generateObject({ model, schema, prompt, system: `You are helping a user create a recipe. ` + `Use British English variants of ingredient names,` + `like Coriander over Cilantro.`, }); return object.recipe; };You'll notice a couple of changes from the previous example.
First of all, we have to wait for the final result of the object. We're doing that by awaiting result.object.
The reason for this is that streamObject returns its result as soon as the first chunk comes in. Awaiting result.object means we can wait for the final object.
Retrieving Partial Objects
If we want access to the partial object as it's being generated, we can use result.partialObjectStream.
    export const createRecipe = async (prompt: string) => {
      const result = await streamObject({
        model,
        system:
          `You are helping a user create a recipe. ` +
          `Use British English variants of ingredient names, ` +
          `like Coriander over Cilantro.`,
        schemaName: "Recipe",
        schema,
        prompt,
      });

      // Log each partial object as it arrives
      for await (const obj of result.partialObjectStream) {
        console.clear();
        console.dir(obj, { depth: null });
      }

      const finalObject = await result.object;
      return finalObject.recipe;
    };
This is an async iterable. That means we can use a for await...of loop to log every update to the object as it's posted to us.
We're going to clear the console first and then log the object so we should see it streaming in live.
Let's give this a go and see what outputs we get. Let's ask it how to make hummus.
    const recipe = await createRecipe("How to make hummus?");
As you can see, the objects are coming in as they're generated and building the object up over time.
Then the final chunk of this stream contains the entire object.
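To see the shape of this pattern without calling a model, here is a minimal sketch that swaps the real stream for a hand-written async generator. PartialRecipe, fakePartialObjectStream, and collectSnapshots are made up for illustration; only the for await...of consumption mirrors the article's code.

```typescript
// Stand-in for result.partialObjectStream: a hypothetical async generator
// that yields a progressively more complete recipe object.
type PartialRecipe = { name?: string; ingredients?: string[] };

async function* fakePartialObjectStream(): AsyncGenerator<PartialRecipe> {
  yield {};
  yield { name: "Hummus" };
  yield { name: "Hummus", ingredients: ["chickpeas"] };
  yield { name: "Hummus", ingredients: ["chickpeas", "tahini"] };
}

// Collects every snapshot in order; the last one is the complete object.
const collectSnapshots = async (): Promise<PartialRecipe[]> => {
  const snapshots: PartialRecipe[] = [];
  for await (const partial of fakePartialObjectStream()) {
    snapshots.push(partial);
  }
  return snapshots;
};
```

Each snapshot is a whole object rather than a diff, which is why a consumer can simply re-render with the latest one.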
Use Cases
The way you'd use this in an application is instead of just logging things to the console, you would send these chunks over the network.
That way, your users could see the object being built up in real time. It's always much nicer to see a progress indicator instead of it just waiting and then spitting it all out in one chunk.
So that's what the streamObject function does in the AI SDK.
▶Read: Tool Calling With Vercel's AI SDK
Tool Calling With Vercel's AI SDK
Matt Pocock
So far we've been getting LLMs to answer questions, scan documents, and do data extraction.
But they can do a lot more than that. LLMs can interact with the world.
The way they do that is by calling tools or functions that we provide them.
And Vercel's AI SDK has a first-class solution for that.
We're going to start by creating the simplest tool imaginable and then we're going to go on from there.
Our tool is simply going to log to the console. To create it we're going to import tool from ai.
The first thing that any tool needs is a description of the parameters it's going to receive. We can add this by specifying parameters on the tool. This is done with a Zod schema just like we did with structured outputs before.
We're also using describe to describe the different parameters for the LLM.
Next we need to say what the tool is going to do. We do that by specifying an execute function. This execute function can be asynchronous, so it can do virtually anything: call APIs, write to a database, etc. In our case we're just going to log to the console.
Next we're going to add a description field to the tool itself. This tells the LLM what it's supposed to do with the tool.
    import { tool } from "ai";

    const logToConsoleTool = tool({});
Now our tool's been created, let's actually use it inside a generateText call.
Let's create our function called logToConsole, passing in a model and a prompt.
We'll give it a system prompt to encourage it to use the tool.
And finally let's pass it our tool.
    import { generateText } from "ai";

    const logToConsole = async (prompt: string) => {
      await generateText({
        model,
        prompt,
      });
    };
To recap, we've created a tool, passed it to generateText and given it a simple system prompt.
Let's see what happens when we run this.
    Hello, world!
Success! We're seeing "hello world" printed out to the console.
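For intuition about what sits between the model's tool call and the console output, here is a rough sketch that does not use the AI SDK at all. The SimpleTool shape, the simpleTools registry, and the runToolCall helper are hypothetical and only mirror the description/execute idea described above; the real SDK also validates arguments against the Zod schema, which this sketch skips.

```typescript
// Hypothetical, SDK-free model of a tool: a description plus an execute function.
type SimpleTool = {
  description: string;
  execute: (args: { message: string }) => Promise<unknown>;
};

// Records what was logged so the side effect can be inspected afterwards.
const loggedMessages: string[] = [];

const simpleTools: Record<string, SimpleTool> = {
  logToConsole: {
    description: "Log a message to the console",
    execute: async ({ message }) => {
      console.log(message);
      loggedMessages.push(message);
      return undefined; // nothing useful to hand back to the model
    },
  },
};

// A tool call as a model might emit it: a tool name plus arguments.
type SimpleToolCall = { toolName: string; args: { message: string } };

// Looks up the named tool, runs it, and pairs the call with its result.
const runToolCall = async (call: SimpleToolCall) => {
  const selected = simpleTools[call.toolName];
  if (!selected) throw new Error(`Unknown tool: ${call.toolName}`);
  const result = await selected.execute(call.args);
  return { ...call, result };
};
```

The model never runs code itself in this picture: it only names a tool and supplies arguments, and the surrounding program decides what actually executes.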
Debugging
That's pretty good, but it's quite opaque. How do we go in and debug this?
Let's destructure the steps property from the result of generateText. steps is an array of each of the steps taken by the LLM.
We'll look at steps later because it starts to get into interesting stuff like agentic behavior and reasoning.
For now we're just going to pull out a property from the first step taken, which is toolCalls. This tells you all of the tools that were called during that step.
    import { generateText } from "ai";

    const logToConsole = async (prompt: string) => {
      const { steps } = await generateText({
        model,
        prompt,
        system:
          `Your only role in life is to log ` +
          `messages to the console. ` +
          `Use the tool provided to log the ` +
          `prompt to the console.`,
        tools: {
          logToConsole: logToConsoleTool,
        },
      });

      // Inspect the tool calls made during the first step
      console.dir(steps[0]?.toolCalls, { depth: null });
    };
When we run this we can see that the tool named logToConsole was called. We can also see the arguments it was passed.
    [
      {
        type: 'tool-call',
        toolCallId: 'toolu_012hbsiE2sXvPrAwAvE3kgxM',
        toolName: 'logToConsole',
        args: { message: 'Hello, world!' }
      }
    ]
If we log toolResults instead...
    console.dir(steps[0]?.toolResults, { depth: null });
...we can see the result of the tool call. In this case we didn't return anything from our function so it's undefined.
    [
      {
        toolCallId: 'toolu_012hbsiE2sXvPrAwAvE3kgxM',
        toolName: 'logToConsole',
        args: { message: 'Hello, world!' },
        result: undefined
      }
    ]
These toolResults can be fed back into the LLM to provide it more information, especially when run over multiple steps.
So debugging using steps is a way that you can get some insight as to what is happening with your tool calls.
▶Read: Build Your First Agent With Vercel's AI SDK
Build Your First Agent With Vercel's AI SDK
Matt Pocock
In our previous example, we saw how LLMs can call tools to do things in the world.
But they can do more than that - they can react to the information they receive from their tools.
This can create a powerful feedback loop where the LLM is continually grounding itself in the real world.
And this feedback loop is what most people, including Anthropic, call agents.
The Vercel AI SDK makes this super easy with a concept called steps.
We're going to make an agent that can retrieve the current weather for us at the city we specify.
To kick that off, we're going to create a getWeather tool. We'll start by giving it a description and some parameters.
Then we'll implement the execute function. In this case, we're going to just stub it out by saying the weather in this city is twenty-five degrees.
But if we wanted to, we could call a weather API to get the actual weather.
    import { tool } from "ai";
    import { z } from "zod";

    const getWeatherTool = tool({
      description: "Get the current weather in the specified city",
      parameters: z.object({
        city: z.string().describe("The city to get the weather for"),
      }),
      // Stubbed result; a real implementation could call a weather API here
      execute: async ({ city }) => {
        return `The weather in ${city} is 25°C and sunny.`;
      },
    });
Next, we're going to hook this tool up to a function called askAQuestion.
We'll call streamText with a model and a prompt, passing it the getWeather tool.
Then we'll loop over the text stream and print out the text to stdout.
Finally we'll ask it what the weather's like in London.
    import { streamText } from "ai";

    const askAQuestion = async (prompt: string) => {
      const { textStream } = await streamText({
        model,
        prompt,
        tools: {
          getWeather: getWeatherTool,
        },
      });

      // Print each chunk of text as it arrives
      for await (const text of textStream) {
        process.stdout.write(text);
      }
    };

    await askAQuestion(`What's the weather in London?`);
When we run this, we notice something interesting.
    I'll help you check the current weather in London right away.
We don't get the information we're looking for; it just says, "I'll help you."
Why would this be happening?
Debugging steps
Let's debug this using the same strategy we used last time: looking at the steps property returned from streamText.
    import { streamText } from "ai";

    const askAQuestion = async (prompt: string) => {
      const { steps } = await streamText({
        model,
        prompt,
        tools: {
          getWeather: getWeatherTool,
        },
      });

      console.dir(await steps, { depth: null });
    };

    await askAQuestion(`What's the weather in London?`);
Because we're using streamText we have to await the result of steps.
Here's what it spits out.
There are several things to notice about this massive blob of JSON.
First, we notice that it's only one step long. The LLM only took one step here.
We can see that it called one tool and got a result back, from the toolCalls and toolResults properties.
    toolCalls: [
      {
        type: 'tool-call',
        toolCallId: 'toolu_011n3T6TJnwZLyR4G8h1ZcMz',
        toolName: 'getWeather',
        args: { city: 'London' }
      }
    ],
    toolResults: [
      {
        type: 'tool-result',
        toolCallId: 'toolu_011n3T6TJnwZLyR4G8h1ZcMz',
        toolName: 'getWeather',
        args: { city: 'London' },
        result: 'The weather in London is 25°C and sunny.'
      }
    ],
So the LLM called the getWeather tool with the city of London and got back the result that the weather in London is 25°C and sunny.
But it then decided to stop.
So it seems like the LLM did the right thing; it called the tool but then it stopped for some reason. How do we make it take more than one step?
maxSteps
By default, the AI SDK only allows the LLM to take one step.
If we want to allow it to take more steps, we can pass maxSteps to streamText. Setting it to 2, for example, forces the loop to stop after at most two steps.
When we run this we get a really cool output:
    I'll help you check the current weather in London right away. It looks like London is experiencing a pleasant day with sunny conditions and a temperature of 25°C (which is about 77°F). It sounds like a great day to be outside and enjoy the nice weather!
We can see that the LLM is now reacting to the information that the tool provided.
And if we log the steps again we can see that two steps were taken.
Stop Signals
What would happen if we specified more than two steps? Let's try ramping up maxSteps to 10 just to see what happens.
Turns out, we get nearly the same result.
The LLM stops after two steps.
We can see that in the second step it has a finishReason of stop:
    finishReason: 'stop'
This is because the LLM has a built-in mechanism to stop after it has completed its task.
This means that there are two ways that this loop can complete. Either the LLM stops itself or it reaches its maxSteps.
It's not a great idea to specify maxSteps as Infinity because sometimes the LLM simply won't stop itself.
Conclusion
To wrap up, we've seen how to create a simple agentic loop with the Vercel AI SDK.
Using maxSteps lets the LLM take multiple steps and react to its own tool results.
This lets you build systems that ground the LLM in reality and make it more useful.
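The two exit conditions of this loop (the model stops itself, or the step limit is hit) can be sketched without the SDK. Everything here is hypothetical scaffolding: scriptedModel is a scripted stand-in for an LLM, stubWeather mirrors the stubbed tool, and runLoop is not how the AI SDK is implemented internally.

```typescript
// Hypothetical, SDK-free sketch of a step loop with two exit conditions.
type LoopStep =
  | { finishReason: "tool-calls"; toolName: string; args: { city: string } }
  | { finishReason: "stop"; text: string };

// Scripted stand-in for an LLM: calls the tool once, then answers.
const scriptedModel = (toolResults: string[]): LoopStep =>
  toolResults.length === 0
    ? { finishReason: "tool-calls", toolName: "getWeather", args: { city: "London" } }
    : { finishReason: "stop", text: `Report: ${toolResults[0]}` };

// Stubbed tool, mirroring the stub used in the article.
const stubWeather = (city: string): string =>
  `The weather in ${city} is 25°C and sunny.`;

// Runs steps until the model stops itself or maxSteps is reached.
const runLoop = (maxSteps: number): LoopStep[] => {
  const steps: LoopStep[] = [];
  const toolResults: string[] = [];
  while (steps.length < maxSteps) {
    const step = scriptedModel(toolResults);
    steps.push(step);
    if (step.finishReason === "stop") break; // the model chose to finish
    toolResults.push(stubWeather(step.args.city)); // feed the result back in
  }
  return steps;
};
```

With a limit of 1 the loop ends right after the tool call, which matches the "I'll help you" behaviour seen earlier; with a limit of 2 or 10 the scripted model stops itself on the second step.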
▶Read: Create Embeddings With Vercel's AI SDK
Create Embeddings With Vercel's AI SDK
Matt Pocock
Vercel's AI SDK has a really simple way to create embeddings.
Embeddings are a way to represent words, images, or data in a high-dimensional space.
This is extremely useful when you want to see how similar two things are to each other.
This makes embeddings a really powerful primitive for tasks like searching or categorization.
Creating Embeddings
The first step is to grab the embedding model; the one we're using is from OpenAI.
We're just going to embed some words: dog, cat, car, and bike. So we're going to put those in an array.
Then we're going to use the embedMany function from the AI SDK, passing in our model and the values.
    import { openai } from "@ai-sdk/openai";
    import { embedMany } from "ai";

    const model = openai.embedding("text-embedding-3-small");

    const values = ["Dog", "Cat", "Car", "Bike"];

    const { embeddings } = await embedMany({ model, values });
Let's log this out to see what we've created here.
As we can see, it's a bunch of arrays containing a load of numbers. These are vectors and these vectors represent a location in multi-dimensional space.
Depending on the model, these arrays might be thousands of numbers long. And these numbers, i.e., the vector, represent the LLM's understanding of what that word is.
Creating a Vector Database
We can now collect those vectors together with the values they represent in a vector database.
    const vectorDatabase = embeddings.map((embedding, index) => ({
      value: values[index],
      embedding,
    }));
This is an extremely simple version of a vector database. It's just a list of vectors with the values they represent attached.
In the real world, you might want to use Postgres and use an extension like pgvector to add the ability to query vectors.
Searching for Similar Embeddings
Let's now use this vector database. We're going to search for an entry in our database which is most similar to a search term.
We're going to have to embed one more word, our search term.
For that, we can use the embed function from the AI SDK.
    import { embed } from "ai";

    const searchTerm = await embed({
      model,
      value: "Canine",
    });
Let's search for the word that's most similar to "Canine".
Now we've got all of the vectors we need. That's literally all the information we require to perform a similarity search. We don't need to query the LLM again.
We now need to calculate the cosine similarity between the search term and each of the vectors in the database.
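For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. A hand-rolled version might look like the sketch below; the name cosineSimilaritySketch is made up, and this is not the SDK's own implementation.

```typescript
// Hand-rolled cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration only.
const cosineSimilaritySketch = (a: number[], b: number[]): number => {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction, so treat its similarity as 0
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};
```

Because the result depends only on the angle between the vectors, two embeddings of very different magnitude can still count as highly similar.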
Fortunately, the AI SDK exposes a cosineSimilarity function. We're going to map over each entry in the database, calculating the similarity of each entry to the search term.
Then we're going to sort them by similarity.
    import { cosineSimilarity } from "ai";

    const entries = vectorDatabase.map((entry) => {
      return {
        value: entry.value,
        similarity: cosineSimilarity(
          entry.embedding,
          searchTerm.embedding,
        ),
      };
    });

    // Highest similarity first
    const sortedEntries = entries.sort(
      (a, b) => b.similarity - a.similarity,
    );
If we log this out, we notice that each similarity here is a number between zero and one, calculated to many decimal places. (In general, cosine similarity can range from -1 to 1.)
    [
      { "value": "Dog", "similarity": 0.8918085834539862 },
      { "value": "Cat", "similarity": 0.5911997598415127 },
      { "value": "Car", "similarity": 0.5564415138726317 },
      { "value": "Bike", "similarity": 0.5002829969392777 }
    ]
And we'll notice that Dog is at the top. That's because it's the most similar to Canine.
This relatively simple setup can drive all sorts of awesome features like search and categorization.
And it's really nice that the AI SDK exposes the embed, embedMany, and cosineSimilarity functions, as well as the ability to query embedding models.
▶Read: Build an Alt Text Generator With Vercel's AI SDK
Build an Alt Text Generator With Vercel's AI SDK
Matt Pocock
Another thing that the AI SDK handles out of the box is passing images and files to LLMs.
LLMs can then look at the file or the image and do things with it. It's an extremely powerful feature.
Not all LLMs support this, but for the ones that do, AI SDK also lets you do it.
In this case, we want the LLM to generate some alt text for the image that we pass it.
We might use this alt text on a website or something to help folks who can't actually see the image understand what's in it.
We're going to use a pretty simple system prompt.
Then we're going to create a function called describeImage which is going to receive a path to an image on our local file system.
    const systemPrompt =
      `You will receive an image. ` +
      `Please create an alt text for the image. ` +
      `Be concise. ` +
      `Use adjectives only when necessary. ` +
      `Do not pass 160 characters. ` +
      `Use simple language. `;
This looks fine, but we're not yet using the imagePath inside our function.
We can't pass the image path directly as a prompt; we first need to actually load the image into memory and then pass that into the generateText function.
To load it into memory, we're going to use readFileSync from Node.js. This stores the image in memory as a Uint8Array, essentially a raw representation of the bytes that make up the image.
We could, of course, use readFile from fs.promises if we wanted non-blocking I/O. But readFileSync is fine for our purposes.
Now we can pass this Uint8Array into the generateText function. But we can't just pass it in as a prompt; we need to pass it in as a messages array.
This is in the same format as the chat history we saw before, but this time we have content which is an array of different message parts.
The message part that we're passing is of type image, and then we're passing the Uint8Array into the image property.
    import { readFileSync } from "fs";
    import { generateText } from "ai";

    export const describeImage = async (imagePath: string) => {
      const imageAsUint8Array = readFileSync(imagePath);

      const { text } = await generateText({
        model,
        system: systemPrompt,
        messages: [
          {
            role: "user",
            content: [
              {
                type: "image",
                image: imageAsUint8Array,
              },
            ],
          },
        ],
      });

      return text;
    };
Let's try running this. I've got an image of some fireworks. Let's pass it in and see what happens.
    const description = await describeImage("./fireworks.jpg");
    console.log(description);
We get back our beautiful description:
    Colorful fireworks display over a city skyline at night, with bursts of red, white, and blue reflections on the water. Spectators watch from the shoreline.
To sum up, we read an image into memory, passed it directly into generateText via the messages array, and got back a description of that image. Pretty sweet.
Reading From A URL
Our current approach works if you have the file in memory. But what if you only have a URL to the file?
Well, there's a really nice shortcut: you can pass the URL directly to the AI SDK.
    import { generateText } from "ai";

    export const describeImage = async (imageUrl: string) => {
      const { text } = await generateText({
        model,
        system:
          `You will receive an image. ` +
          `Please create an alt text for the image. ` +
          `Be concise. ` +
          `Use adjectives only when necessary. ` +
          `Do not pass 160 characters. ` +
          `Use simple language. `,
        messages: [
          {
            role: "user",
            content: [
              {
                type: "image",
                image: new URL(imageUrl),
              },
            ],
          },
        ],
      });

      return text;
    };
We're wrapping the URL with new URL to indicate to the AI SDK that this is a URL on the web that we want to send along to the LLM.
Let's test it out. We've got an image of a church that we wanted to describe, that I've hosted on GitHub:
    const description = await describeImage(
      "https://github.com/ai-hero-dev/ai-hero/blob/main/internal/assets/image.jpg?raw=true",
    );
    console.log(description);
When we run this, it's going to pass the URL to the LLM. The LLM will then download the image and have a look at it. And then we get back our description.
    Lake Bled in Slovenia with church on small peninsula and castle on cliff, surrounded by mountains. Calm water reflects buildings and autumn trees.
This is a really great way to take a shortcut when you're working with images hosted on the web.
▶Read: Extract Structured Data From PDF's With Vercel's AI SDK
Extract Structured Data From PDF's With Vercel's AI SDK
Matt Pocock
Extracting structured data out of unstructured data is one of the most powerful use cases for LLMs.
In this example, we're going to be passing an arbitrary file to an LLM and getting it to analyze it for us.
In this case, a PDF of an invoice. We're going to pass it the PDF, and it's going to extract some structured data for us.
It means you can turn all the documents that you might have on your system into something that you can store in a database, query, and search.
Creating the Schema
Since we're using structured data, let's create a Zod schema to handle this.
    import { z } from "zod";

    const schema = z
      .object({
        total: z
          .number()
          .describe("The total amount of the invoice."),
        currency: z
          .string()
          .describe("The currency of the total amount."),
        invoiceNumber: z
          .string()
          .describe("The invoice number."),
        companyAddress: z
          .string()
          .describe(
            "The address of the company or person issuing the invoice.",
          ),
        companyName: z
          .string()
          .describe("The name of the company issuing the invoice."),
        invoiceeAddress: z
          .string()
          .describe(
            "The address of the company or person receiving the invoice.",
          ),
      })
      .describe("The extracted data from the invoice.");
It's a pretty big one. We've got the total amount of the invoice, the currency, the invoice number, the address, the company name, and the invoicee address.
We've seen this before, but note how I'm providing descriptions to every single property I can. I want to give the LLM the best chance of success.
Extracting Data from the Invoice
Let's create an extractDataFromInvoice function. Inside, we'll pass the schema to generateObject, with a little system prompt. We're expecting an invoicePath, which is a path to a PDF on our file system.
The image example that we saw before used the messages array. We can do the same thing with the PDF, but this time we're going to use a content type of file instead of image. We're using readFileSync here to grab the raw binary data from the file on the file system, and pass that directly to the AI SDK.
We also need to pass it a MIME type to tell the LLM what sort of file it's receiving. It could probably work this out on its own by checking the magic numbers of the file, but it's just polite, isn't it?
Finally, we return the object so we should get the object back that we're expecting.
    import { readFileSync } from "fs";
    import { generateObject } from "ai";

    export const extractDataFromInvoice = async (invoicePath: string) => {
      const { object } = await generateObject({
        model,
        system:
          `You will receive an invoice. ` +
          `Please extract the data from the invoice.`,
        schema,
        messages: [
          {
            role: "user",
            content: [
              {
                type: "file",
                // Raw bytes of the PDF, plus its MIME type
                data: readFileSync(invoicePath),
                mimeType: "application/pdf",
              },
            ],
          },
        ],
      });

      return object;
    };
Running the Example
Let's give it a go. I've got a PDF of a fake invoice here. Let's pass it in and see how it does.
    const result = await extractDataFromInvoice("./invoice.pdf");
    console.dir(result, { depth: null });
As we can see, we get back the data we were looking for from this invoice.
So this combines two things which we've learned about already: passing arbitrary files to the Vercel AI SDK and using structured data. Very, very cool.
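The "magic numbers" mentioned earlier are the fixed bytes at the start of a file; PDF files begin with the ASCII characters %PDF-. As a small sketch, a check like the following could pick the MIME type before calling the SDK. The names PDF_MAGIC, looksLikePdf, and guessMimeType are hypothetical, and the fallback type is an assumption.

```typescript
// PDF files begin with the ASCII bytes "%PDF-".
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Returns true when the buffer starts with the PDF signature.
const looksLikePdf = (bytes: Uint8Array): boolean =>
  PDF_MAGIC.every((byte, index) => bytes[index] === byte);

// Hypothetical helper: choose a MIME type, falling back to a generic one.
const guessMimeType = (bytes: Uint8Array): string =>
  looksLikePdf(bytes) ? "application/pdf" : "application/octet-stream";
```

Since readFileSync returns a Buffer, which is a Uint8Array, its result could be passed to guessMimeType directly.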
▶Read: Use Local Models With Vercel's AI SDK
Use Local Models With Vercel's AI SDK
Matt Pocock
You can use Vercel's AI SDK to connect to locally running models. In fact, not just locally running models but models running at any URL.
The AI SDK has a function called createOpenAICompatible which lets you communicate with models that have an OpenAI-compatible API.
In our case, I'm using an app called LM Studio which exposes this API on localhost:1234. So I can install @ai-sdk/openai-compatible and then create an LM Studio provider.
And I can use this provider to grab a model. I'm using an empty string here because if you pass an empty string it will default to choosing the model you have loaded in LM Studio.
I can then use this model by passing it into generateText. I've specified maxRetries as zero here. By default, the SDK will retry queries three times to make things more robust and handle any network issues. But since the model is on our local network, we want it to fail instantly if it can't reach it.
Let's give this a go. We're going to ask the LLM to tell us a story about its grandmother. And if we run it we get a story about the LLM's grandmother.
    import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
So this is a nice, simple way to connect the Vercel AI SDK to a local model.
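To make the effect of a retry limit concrete, here is a generic sketch of what a setting like maxRetries controls. withRetries and countAttempts are hypothetical helpers written for this illustration; they are not the AI SDK's implementation, which may also wait between attempts.

```typescript
// Hypothetical retry helper: one initial attempt plus up to maxRetries more.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error; // remember the failure and try again if allowed
    }
  }
  throw lastError;
}

// Counts how many times a permanently failing call is attempted.
const countAttempts = async (maxRetries: number): Promise<number> => {
  let attempts = 0;
  const alwaysFails = async (): Promise<never> => {
    attempts++;
    throw new Error("connection refused");
  };
  await withRetries(alwaysFails, maxRetries).catch(() => undefined);
  return attempts;
};
```

In this sketch a limit of zero means a single attempt, so an unreachable local server surfaces its error immediately instead of after several tries.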
▶Read: Streaming In 'Next Question Suggestions' With Vercel's AI SDK
Streaming In 'Next Question Suggestions' With Vercel's AI SDK
One extremely common pattern for AI-powered apps is to provide suggested next questions for the user to ask. This lets users who are not familiar with these interfaces get started quickly.
Let's look at a basic implementation where when we initiate a conversation, it suggests next questions that stream in as the user requests them.
High-Level Overview
The implementation involves a POST request to /api/chat, receiving UI messages from the body. These UI messages get converted into model messages before the main processing begins.
The flow consists of two streams combined into one parent stream:
- The first stream for our initial response
- The second stream for follow-up suggestions
To compose these streams together, we create a parent stream with createUIMessageStream, which exposes a writer:
    const stream = createUIMessageStream<MyMessage>({
      execute: async ({ writer }) => {
        // Stream initial response
        const messagesFromResponse = await streamInitialResponse(
          modelMessages,
          writer,
        );

        // Generate follow-up suggestions
        const followupSuggestions = generateFollowupSuggestions([
          ...modelMessages,
          ...messagesFromResponse,
        ]);

        // Stream the suggestions to frontend
        await streamFollowupSuggestionsToFrontend(
          followupSuggestions,
          writer,
        );
      },
    });
Streaming the Initial Response
Here's how the initial response is streamed:
    const streamInitialResponse = async (
      modelMessages: ModelMessage[],
      writer: UIMessageStreamWriter<MyMessage>,
    ) => {
      // 1. Stream the initial response - can be any
      // streamText call with tool calls, etc.
      const streamTextResult = streamText({
        model: mainModel,
        messages: modelMessages,
      });

      // 2. Merge the stream into the UIMessageStream
      writer.merge(streamTextResult.toUIMessageStream());

      // 3. Consume the stream - this waits until the
      // stream is complete
      await streamTextResult.consumeStream();

      // 4. Return the messages from the response, to
      // be used in the followup suggestions
      return (await streamTextResult.response).messages;
    };
This function calls streamText using the Gemini 2.0 Flash model (though you could use any model with the AI SDK). The stream is merged into the UI message stream using writer.merge, and we wait for the stream to complete with streamTextResult.consumeStream().
Finally, we return the messages produced by the call, which will be used in the next function.
Generating Follow-Up Suggestions
After getting the initial response, we pass both the original messages and the response messages to generateFollowupSuggestions:
    const generateFollowupSuggestions = (
      modelMessages: ModelMessage[],
    ) =>
      // 1. Call streamObject, which allows us to stream
      // structured outputs to the frontend
      streamObject({
        model: suggestionsModel,
        // 2. Pass in the full message history
        messages: [
          ...modelMessages,
          // 3. And append a request for followup suggestions
          {
            role: 'user',
            content:
              'What question should I ask next? Return an array of suggested questions.',
          },
        ],
        // 4. These suggestions are made type-safe by
        // this Zod schema
        schema: z.object({
          suggestions: z.array(z.string()),
        }),
      });
This function uses streamObject with a Zod schema that defines an array of strings for suggestions. We could add a system prompt and further context engineering here, but this simple approach works well enough.
Streaming Suggestions to the Frontend
The follow-up suggestions get piped into streamFollowupSuggestionsToFrontend:
    const streamFollowupSuggestionsToFrontend = async (
      // 1. This receives the streamObject result from
      // generateFollowupSuggestions
      followupSuggestionsResult: ReturnType<
        typeof generateFollowupSuggestions
      >,
      writer: UIMessageStreamWriter<MyMessage>,
    ) => {
      // 2. Create a data part ID for the suggestions - this
      // ensures that only ONE data-suggestions part will
      // be visible in the frontend
      const dataPartId = crypto.randomUUID();

      // 3. Read the suggestions from the stream
      for await (const chunk of followupSuggestionsResult.partialObjectStream) {
        // 4. Write the suggestions to the UIMessageStream
        writer.write({
          id: dataPartId,
          type: 'data-suggestions',
          data:
            chunk.suggestions?.filter(
              // 5. Because of some AI SDK type weirdness,
              // we need to filter out undefined suggestions
              (suggestion) => suggestion !== undefined,
            ) ?? [],
        });
      }
    };
The suggestions are treated as a custom part of the message. We define the type of this message by specifying a UIMessage, passing never as the first parameter and suggestions: string[] as the second.
Type Safety for Custom Message Parts
We declare our custom message type to ensure type safety:
    export type MyMessage = UIMessage<
      never,
      {
        suggestions: string[];
      }
    >;
This makes our code type-safe when writing to streams: we can only pass in a string array to the data-suggestions part.
Frontend Implementation
In the frontend, we use the useChat hook with our custom message type:
    const { messages, sendMessage } = useChat<MyMessage>({});
    const [input, setInput] = useState(``);

    const latestSuggestions = messages[
      messages.length - 1
    ]?.parts.find(
      (part) => part.type === 'data-suggestions',
    )?.data;
We extract the latest suggestions from the most recent message's parts. These might be undefined if we have no messages yet or if suggestions haven't started streaming.
The suggestions are then rendered as buttons:
    <ChatInput
      suggestions={
        messages.length === 0
          ? [
              'What is the capital of France?',
              'What is the capital of Germany?',
            ]
          : latestSuggestions
      }
      input={input}
      onChange={(text) => setInput(text)}
      onSubmit={(e) => {
        e.preventDefault();
        sendMessage({
          text: input,
        });
        setInput('');
      }}
    />
We also provide default suggestions if there are no messages yet. When a user clicks a suggestion button, it populates the input field.
Summary
This pattern allows us to stream suggestions to the frontend in the same API endpoint as the rest of our content, creating a seamless experience for users. The suggestions update in real-time as they become available, helping users navigate the conversation more easily.
The key components are:
- A unified stream combining initial response and suggestions
- Type-safe message parts for structured data
- Real-time streaming of suggestions to the frontend
This approach creates a more guided, user-friendly experience for AI conversations.
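The comment in streamFollowupSuggestionsToFrontend says that reusing one data part ID keeps a single data-suggestions part visible. A simplified way to picture that behaviour is an upsert keyed by id, as sketched below. This is an assumption about the observable effect, not the AI SDK's actual reconciliation code, and SuggestionsPart, upsertPart, and the sample chunks are all hypothetical.

```typescript
// Hypothetical, simplified model of how writes that share an id can
// replace an earlier part rather than appending a new one.
type SuggestionsPart = {
  id: string;
  type: "data-suggestions";
  data: string[];
};

const upsertPart = (
  parts: SuggestionsPart[],
  incoming: SuggestionsPart,
): SuggestionsPart[] => {
  const exists = parts.some((part) => part.id === incoming.id);
  return exists
    ? parts.map((part) => (part.id === incoming.id ? incoming : part))
    : [...parts, incoming];
};

// Three streamed chunks written under the same id collapse into one part.
const sampleChunks: string[][] = [
  ["What"],
  ["What is RAG?"],
  ["What is RAG?", "How do I"],
];

const finalParts = sampleChunks.reduce<SuggestionsPart[]>(
  (parts, data) =>
    upsertPart(parts, { id: "suggestions-1", type: "data-suggestions", data }),
  [],
);
```

If each chunk were written with a fresh id instead, the same reduction would end with three separate parts, which is the duplication the shared ID avoids.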
▶Read: workshops ai sdk v6 crash course
100% TYPESCRIPT, 100% PRODUCTION-READY, 0% HYPE EDITION featuring AI SDK v6
This course has been upgraded from v5 to v6!
Integrating with LLMs is finally worth it.
It was less than a year ago when I made the very first commit to evalite, my eval tool for LLM output. Kinda crazy to think people were out there rawdogging LLM outputs for years.
(And some still are. If that's you, no judgment. Just keep reading!)
LLM integration was exciting, but it was such a mess: kludges, workarounds, constantly shifting APIs (every vendor uniquely horrible in their own way)… and a painful lack of production-ready tooling. Just about everything was roll-your-own, and then re-roll-your-own again when interfaces were suddenly deprecated. You had to spend more time managing than making.
Today, it's a very different developer experience, thanks to AI SDK.
AI SDK changes everything:
- Now you can write once, run anywhere (virtually any LLM provider, and lots of framework and hosting options).
- Now you've got the tooling and patterns you need to focus on building features instead of reinventing the wheel.
- Now your AI integration code can be as beautiful, modern, safe, and durable as your code for everything else.
It's 100% TypeScript, too!
And version 5 is the best yet: more elegant, more refined, more complete.
So if you've been biding your time, waiting for stable, useful and production-ready APIs, tooling, and patterns before investing your precious time and energy into building wicked cool stuff…
I'm happy to report: that day is finally here.
Today is the best day to learn how to add LLM-powered features to your TypeScript apps.
And I've made a brand new Crash Course to help you get there fast.
ngl I was kinda struggling to understand everything going on in behind AI tools in the last few years but I just binged the AI SDK v5 crash course by Matt Pocock and everything makes so much sense now. Once again Matt proved himself to be the GOAT of online education.
Alberto
Your ticket to shipping modern LLM-powered features in TypeScript
Your existing TypeScript skills are more valuable than ever, and I'm here to help you level up to AI engineer.
My new AI SDK v6 Crash Course includes 94 videos guiding you through 59 hands-on exercises over 10 modules. (But don't worry, they're fun & fast!)
You will come away with a solid understanding of:
- LLM fundamentals, like how tokens and context windows actually work
- Core LLM design patterns: agents, workflows, tool-calling, and more
- Production-ready tools like instrumentation and error handling
- How to use Eval-Driven-Development to stamp out hallucinations
- And, of course… how to fully exploit the power of AI SDK v6 to get those things done
That's right: you can go from "what is a token, anyway??" to shipping production-ready LLM features that work, with output safeguards, message persistence, and error handling.
All you need to bring to the table is TypeScript. Everything else you need is included.
And the Crash Course is not a lecture where you sit and struggle to focus. You know me, that's not how we roll!
You'll learn through building real projects:
- A chat title generator, complete with a full eval dataset
- A generator/evaluator workflow that writes Slack messages for you and automatically refines them
- A model guardrail to secure your application
- A request router which can route simple requests to cheaper LLMs
- A full-fledged research agent to search the web for you
In short, you'll be ready to plan, develop, and launch production-ready AI-powered features for your apps in TypeScript.
And never fear: you'll learn *so much*, but the individual videos are short & sweet at 2-3 minutes each, and many of the exercises are bite-sized! It can fit into any schedule, even yours.
By the way, if you've been building AI integrations the "old-fashioned way" (sorry)? This is the perfect opportunity to update your skillset. You'll learn all of AI SDK's core functionality as well as new v5 features like `ModelMessages` and `UIMessages`, custom data parts, type-safe tools from backend to front (big win for us TypeScript nerds), and the massively more elegant `parts` array for message rendering and persistence.
And in the update we'll also be covering the breaking changes and new v6 features such as streaming objects via `streamText`!
Of course, the technology and AI SDK itself will continue to evolve and improve. But with this hands-on, real-world, tool-building experience under your belt, you'll be able to evolve and improve right along with it.
(Don't forget: AI SDK supports virtually every major framework!)
A complete education for just $149 if you enroll today
This is a self-paced course. When you enroll today, you'll get instant access to the entire course — all 10 modules — to keep, and go through at your leisure. You'll be shipping your first feature in as little as 7 days.
Any way you cut it, AI SDK v6 is the best way to take advantage of the potential of LLMs, and my Crash Course is the best way to get started.
Why AI SDK really is revolutionary and v6 is the best yet
My Crash Course really has two goals:
- Empower you to ship LLM-powered features with modern engineering principles, and
- Master AI SDK v6 itself
There's a reason for that second part: It's an abstraction layer and engineering toolkit in one. It saves you so much work.
To get the same level of production-ready code, you'd have to put in 2 or 3 times the work without it, conservatively, reinventing the wheel and accounting for platform changes. And then you wouldn't have the power of open source keeping things up-to-date… meaning even more work for you in the future, too.
In my opinion, if you're developing for AI in TypeScript and you're not using AI SDK v6, you're making a mistake. I'm even making AI SDK the basis of all AI Hero courses going forward.
It's worth every penny invested.
yoniwisin
The AI SDK library has zoomed to 3.5 million downloads a week for a reason:
- AI SDK is a fully open source app toolkit. You can copy it, modify it, fork it, even repackage and relicense it as you like, thanks to the very permissive Apache 2.0 license.
- It integrates with every framework. It's for TypeScript and JavaScript. It provides a set of hooks for every major frontend framework (and even gives you abstractions for Vanilla JS).
- It works with essentially every LLM, including open source ones that run on your own hardware. Seriously — check out the list of supported providers.
- It gives you the power to switch models with 1 line of code. Because AI SDK gives you a unified interface for every LLM, you can switch models or providers as easily as editing your config. You can also use multiple models in the same codebase, all with unified commands. No matter what changes may come to the ecosystem, you're ready.
- It makes instrumentation easy. You can skip the slog of writing your own instrumentation for observability, because AI SDK can automatically export OpenTelemetry to all major platforms.
- It runs anywhere. The AI SDK is totally platform agnostic. It’ll run on Deno. Bun. Node. AI SDK is made by Vercel, but you're free to choose whatever you like.
- It's gaining wide ecosystem support. The AI SDK is so darn popular (3 million weekly downloads) that platforms like Langfuse, Braintrust and others have built hooks to integrate with it. And I expect many more to come.
All of that adds up to saved time, saved energy, saved money, saved hassle — more freedom, more fun, and more time to spend making exciting AI-powered features.
AI SDK v6 makes LLM-powered apps as modern as the rest of app development. Finally!
Don't you want this for yourself and your projects?
It can be yours. And with the Crash Course, it won't take long to get there.
30-Day Money-Back Guarantee
Includes
- 1 Workshop
- Lifetime Access to Lessons
- Customizable Invoice
- English Transcripts & Subtitles
- Progress Tracking
- Access to Discord Community
- Completion Certificate
AI SDK v6 Crash Course
Matt Pocock
▶Read: Working With Message Histories In Vercel's AI SDK
Working With Message Histories In Vercel's AI SDK
Matt Pocock
It's pretty common if you're building any kind of chat bot to want to keep track of the conversation history.
This is so the LLM has context over the conversation you've already had.
So you can ask follow-up questions without having to rephrase your question every time.
Not only that, but understanding how conversations are persisted is really important for communicating with the LLM over the wire.
We're going to show how to do that with the Vercel AI SDK in this example.
The `CoreMessage` Type
Let's start by understanding the shape of what a conversation history might look like.
The AI SDK exposes a type called `CoreMessage`. This is an object that represents a message in a conversation.
If we understand this type we'll understand a conversation history - I promise.
1. Let's try messing about with it to see what it contains. Let's create a messages array, assigning it the type of an array of core messages.
2. Every message has to contain a `role` property. This is a string that can be `user`, `system`, `assistant` or `tool`.
3. Each message must also contain a `content` property - this is the content of the message. In this example, the conversation history contains a single message from the user saying "Hello, you!".
4. To represent the LLM replying, we use the `assistant` role.
5. We had a brief look at system prompts earlier. They're represented in the message history with a role of `system`. In this case, we're telling the LLM to greet everyone that speaks to it.

After step 1, the code looks like this:

    import { type CoreMessage } from "ai";

    const messages: CoreMessage[] = [];

We'll take a look at tools later when we cover tool calling.
As conversations get longer and longer, this array will grow with more `user` and `assistant` messages being added.
Now we understand about the messages array, let's apply it to a real-world (ish) example.
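Before moving on, here's a minimal sketch of the array those five steps build up. To keep it dependency-free it uses a local `ChatMessage` type as a simplified stand-in for `CoreMessage` (an assumption: the real type also covers tool messages and structured content). The system prompt wording, the assistant's reply and the `withUserMessage` helper are all hypothetical.

```typescript
// Simplified local stand-in for the SDK's CoreMessage type. This is an
// assumption for illustration: the real type also covers tool messages and
// structured content parts.
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

// The history built up in the steps above. The system prompt wording and
// the assistant's reply are placeholders.
const messages: ChatMessage[] = [
  { role: "system", content: "Greet everyone that speaks to you." },
  { role: "user", content: "Hello, you!" },
  { role: "assistant", content: "Hi there!" },
];

// Each new turn appends to the history rather than replacing it.
const withUserMessage = (
  history: ChatMessage[],
  text: string,
): ChatMessage[] => [...history, { role: "user", content: text }];

const longer = withUserMessage(messages, "How are you?");
console.log(longer.map((m) => m.role).join(" -> "));
// system -> user -> assistant -> user
```

Returning a new array instead of mutating the old one is just a choice made for this sketch; pushing onto the existing array works equally well.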
Setting Up A Server
For the first time here, we're going to set up a server.
1. We're going to put it inside a `startServer` function.
2. We'll use the Hono package to create a server.
3. We'll add a route to the server.
4. Then, we'll serve it with `serve` from `@hono/node-server`.
5. Finally, we'll wait for it to start using `node:events` and return the `server`.

After step 1, the code looks like this:

    export const startServer = async () => {};

This gives us a nice function we can call to start a server on `localhost:4317`.
Next, let's build the API call:
1. Our `/api/get-completions` route is going to accept an array of `CoreMessage` objects.
2. We can then pass the entire conversation history into `generateText`.
3. Based on the conversation history, the LLM will figure out what it should try to say next. The messages that it suggests will be added to `result.response.messages`. So we can return those as JSON using `ctx.json`.

After step 1, the code looks like this:

    import { type CoreMessage } from "ai";

    // The handler is async so it can await the request body.
    app.post("/api/get-completions", async (ctx) => {
      const messages: CoreMessage[] = await ctx.req.json();
    });

To sum up, we've built an API endpoint that accepts a conversation history and returns the next message the LLM would say.
This is a pretty common pattern. It means our server is entirely stateless.
The responsibility for tracking the conversation lies with the client.
There are other ways of doing this, like storing the conversations on your server, but this is a simple way to get started.
Calling Our Server
Now we have a server, we can call it from the client. I'm running this on Node, but you could easily run this in the browser too.
1. Let's create a new array of messages, and start our server.
2. We'll use `fetch` to send the messages to the server, and `res.json()` to turn the response into JavaScript objects.
3. Finally, we'll append the new messages to the conversation history, and log it out.

After step 1, the code looks like this:

    import { type CoreMessage } from "ai";
    import { startServer } from "./server.ts";

    const messagesToSend: CoreMessage[] = [
      {
        role: "user",
        content: "What's the capital of Wales?",
      },
    ];

    await startServer();

We end up with an output that looks like this:

    [
      {
        "role": "user",
        "content": "What's the capital of Wales?"
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "The capital of Wales is Cardiff."
          }
        ]
      }
    ]

This array is ready to receive another message with the role of `user`, and the process can repeat.
Conclusion
In this example, we've shown how to keep track of a conversation history using the Vercel AI SDK.
We've learned about the messages array, including the various roles.
We've seen how to set up a server that accepts a conversation history and appends to it.
And we've seen how to call that server from a client.
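To round this off, here's a rough sketch of the client-side bookkeeping described above: the client owns the history, sends all of it each turn, and appends whatever comes back. It's dependency-free, so `fakeServer` is a hypothetical synchronous stand-in for the `fetch` call to `/api/get-completions` (the real call is asynchronous), and `ChatMessage` is a simplified stand-in for `CoreMessage`.

```typescript
// Simplified message shape (an assumption; the SDK's type is richer).
type ChatMessage = { role: "user" | "assistant"; content: string };

// Hypothetical stand-in for the POST to /api/get-completions. It receives the
// whole history and returns only the new messages, holding no state itself.
// (The real call is asynchronous; this stub is synchronous for brevity.)
type GetCompletions = (history: ChatMessage[]) => ChatMessage[];

const fakeServer: GetCompletions = (history) => {
  const last = history[history.length - 1];
  return [{ role: "assistant", content: `You said: ${last?.content ?? ""}` }];
};

// One turn: add the user's message, ask the server, append what comes back.
const takeTurn = (
  history: ChatMessage[],
  userText: string,
  getCompletions: GetCompletions,
): ChatMessage[] => {
  const sent: ChatMessage[] = [...history, { role: "user", content: userText }];
  return [...sent, ...getCompletions(sent)];
};

let conversation: ChatMessage[] = [];
conversation = takeTurn(conversation, "What's the capital of Wales?", fakeServer);
conversation = takeTurn(conversation, "And of Scotland?", fakeServer);

// Roles alternate: user, assistant, user, assistant.
console.log(conversation.map((m) => m.role));
```

Because the server stub never remembers anything between calls, the second turn only works because the client sent the first turn along with it.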
▶Read: Anthropic's new Model Context Protocol In 2 Minutes
Anthropic have released a new open standard for connecting "AI assistants to the systems where data lives".
It's a JSON-RPC-based protocol that lets an LLM (in this case, Claude) communicate with a long-running server. Claude can request what different tools are available to use, and then call them.
There's a TypeScript and Python SDK, meaning you can easily create your own MCP servers to connect to your own data sources, or perform custom actions.
▶Read: model context protocol tutorial
The model context protocol, released by Anthropic, is exploding across the world of AI.
It's the USB-C of AI. It helps link user-facing applications to powerful APIs.
Link Cursor to your GitHub account. Link Claude Desktop to your email. The possibilities are endless.
In this tutorial, we'll cover everything you need to know to start shipping MCP servers.
We'll be working in TypeScript, my favourite language. We'll look at:
- Why the Model Context Protocol exists
- Choosing an MCP Client
- Building MCP Servers in TypeScript
- Distributing MCP Servers on NPM
- Hosting MCP Servers on serverless platforms
By the end, you will be able to ship your MCP idea so that anyone around the world will be able to use it.
Head to AI Hero's GitHub repo for all the code, and to run the examples.
▶Read: How Does The Model Context Protocol Work?
How Does The Model Context Protocol Work?
Now we understand a little bit about what MCP might be used for, and why it's interesting, let's actually look at how it works.
To understand how it works we're going to have to understand a diagram that Anthropic posted, which shows the architecture of MCP:
Let's explain it step by step.
MCP Hosts and Clients
The MCP host is the program that's going to access the MCP servers. This might be Claude Desktop, Cursor, Windsurf, or any other application that supports MCP.
This host probably uses an LLM of some kind. That LLM will be able to call tools that are defined in the MCP server.
On this host, you're going to run multiple clients - each client will maintain a relationship to a single MCP server. When the host starts up - i.e. when you start Cursor - each client will connect to an MCP server.
So, you can have one host with multiple clients, each interacting with a different server.
You can also build your own clients - and we may get to that in a future video. For now, check out MCP's documentation.
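The host/client relationship can be pictured with a small data-structure sketch. Everything here is hypothetical naming for illustration (`HostSketch`, `McpClientSketch`, `listAllTools`), not the MCP SDK's API: the point is only that one host holds several clients, and each client fronts exactly one server.

```typescript
// All names here are hypothetical; this is not the MCP SDK's API.
type ToolInfo = { name: string; description: string };

// A client holds the connection to exactly one server.
type McpClientSketch = {
  serverName: string;
  listTools: () => ToolInfo[];
};

// The host owns one client per configured server.
class HostSketch {
  private clients = new Map<string, McpClientSketch>();

  connect(client: McpClientSketch): void {
    this.clients.set(client.serverName, client);
  }

  // Collect every tool from every server, remembering where it came from.
  listAllTools(): { server: string; tool: ToolInfo }[] {
    const all: { server: string; tool: ToolInfo }[] = [];
    this.clients.forEach((client) => {
      for (const tool of client.listTools()) {
        all.push({ server: client.serverName, tool });
      }
    });
    return all;
  }
}

const editorHost = new HostSketch();
editorHost.connect({
  serverName: "github",
  listTools: () => [
    { name: "createGitHubIssue", description: "Create a GitHub issue" },
  ],
});
editorHost.connect({
  serverName: "weather",
  listTools: () => [
    { name: "getWeather", description: "Get the weather for a city" },
  ],
});

const available = editorHost.listAllTools();
console.log(available.map((entry) => `${entry.server}: ${entry.tool.name}`));
```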
The MCP Server
The MCP server is the server that's going to be running the tools that the host wants to call.
This server could be running locally, or it could be running on a remote server.
One thing to note is that the server could be written in a completely different language from the host. Just like the front end and back end of an application can be in different languages, the host and server can too.
The Transport and the Protocol
The client connects to its server using a transport. This transport is responsible for sending messages between the client and the server.
There are currently two supported transports. You can communicate via `stdio` - in other words, via the terminal. Or you can communicate through HTTP via server-sent events. This is useful if you want to run your server on a remote machine. We'll cover both of these later.
But what do the messages sent across this transport actually look like? Well, this is the protocol.
It's a set of JSON shapes that the client and server can send to each other, based on JSON-RPC 2.0. They look like this:

    type Request = {
      jsonrpc: "2.0";
      id: number | string;
      method: string;
      params?: object;
    };

    type Response = {
      jsonrpc: "2.0";
      id: number | string;
      result?: object;
      error?: {
        code: number;
        message: string;
        data?: unknown;
      };
    };

Requests contain a `method` that the server should perform, and the `params` that should be passed to that method.
Responses contain a `result` if the method was successful, or an `error` if it wasn't.
They each have an `id` which is used to match up the request with the response, and a `jsonrpc` field which tracks the version of the underlying `jsonrpc` protocol.
An Example Communication
To drill this in, let's look at an example. Let's imagine we have a tool called `createGitHubIssue` which creates an issue on GitHub.
We'll skip over some of the preliminary steps - the MCP client initializing the connection - and start with the client sending a request to the server.
1. List The Tools
    // Client sends...
    {
      "jsonrpc": "2.0",
      "id": 1,
      "method": "tools/list"
    }

This is a request to list all of the tools that are available on the server. The server will respond with a list of tools, wrapped in the `result` field of the response:

    // ...server sends back:
    {
      "jsonrpc": "2.0",
      "id": 1,
      "result": {
        "tools": [
          {
            "name": "createGitHubIssue",
            "description": "Create a GitHub issue",
            "inputSchema": {
              "type": "object",
              "properties": {
                "title": { "type": "string" },
                "body": { "type": "string" },
                "labels": {
                  "type": "array",
                  "items": { "type": "string" }
                }
              }
            }
          }
        ]
      }
    }

Note how the server responds with the name of the tool, a description, and the required inputs for the tool. The input is a JSON schema describing a title, body and labels.
2. Call The Tool
The client decides to call the tool, and sends a request:
    // Client sends...
    {
      "jsonrpc": "2.0",
      "id": 2,
      "method": "tools/call",
      "params": {
        "name": "createGitHubIssue",
        "arguments": {
          "title": "My Issue",
          "body": "This is the body of my issue",
          "labels": ["bug"]
        }
      }
    }

This describes the `name` of the tool to be called, and the `arguments` it should be passed.
The server responds with the result of the tool:

    // ...server sends back:
    {
      "jsonrpc": "2.0",
      "id": 2,
      "result": {
        "content": [
          {
            "type": "text",
            "text": "Issue 143 created successfully!"
          }
        ],
        "isError": false
      }
    }

It returns an array of content parts. These content parts can be `text` - as shown above - or `image` (for images) and `resource` for binary data. It also returns an optional `isError` flag.
If the tool call had errored, the returned object would look slightly different:

    // ...if error, server sends back:
    {
      "jsonrpc": "2.0",
      "id": 2,
      "result": {
        "content": [
          {
            "type": "text",
            "text": "Error creating issue: Unauthorized"
          }
        ],
        "isError": true
      }
    }

This tells the client that the tool call was unsuccessful.
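The exchange above can be sketched from the server's side as a single dispatch function. This is a minimal illustration, not the MCP SDK: `handleRequest` and the fake `createGitHubIssue` body are hypothetical, and the payloads are placed under `result` to match the `Response` shape shown earlier. The types are named `RpcRequest` and `RpcResponse` only to avoid clashing with the built-in `Request` and `Response` globals.

```typescript
type RpcRequest = {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: { name?: string; arguments?: Record<string, unknown> };
};

type RpcResponse = {
  jsonrpc: "2.0";
  id: number | string;
  result?: object;
  error?: { code: number; message: string };
};

type ToolResult = {
  content: { type: "text"; text: string }[];
  isError: boolean;
};

// Hypothetical tool body: a real one would call the GitHub API here.
const createGitHubIssue = (args: Record<string, unknown>): ToolResult => {
  const title = args["title"];
  if (typeof title !== "string") {
    return {
      content: [{ type: "text", text: "Error creating issue: missing title" }],
      isError: true,
    };
  }
  return {
    content: [{ type: "text", text: `Issue "${title}" created successfully!` }],
    isError: false,
  };
};

const handleRequest = (req: RpcRequest): RpcResponse => {
  // Every response echoes the request id so the client can match them up.
  const base = { jsonrpc: "2.0" as const, id: req.id };

  if (req.method === "tools/list") {
    const tools = [
      { name: "createGitHubIssue", description: "Create a GitHub issue" },
    ];
    return { ...base, result: { tools } };
  }

  if (req.method === "tools/call") {
    if (req.params?.name !== "createGitHubIssue") {
      // -32602 is the JSON-RPC 2.0 "Invalid params" code.
      return { ...base, error: { code: -32602, message: "Unknown tool" } };
    }
    // Tool failures travel inside result via isError, not as protocol errors.
    return { ...base, result: createGitHubIssue(req.params?.arguments ?? {}) };
  }

  // -32601 is the JSON-RPC 2.0 "Method not found" code.
  return { ...base, error: { code: -32601, message: "Method not found" } };
};

const listed = handleRequest({ jsonrpc: "2.0", id: 1, method: "tools/list" });
const called = handleRequest({
  jsonrpc: "2.0",
  id: 2,
  method: "tools/call",
  params: { name: "createGitHubIssue", arguments: { title: "My Issue" } },
});
console.log(JSON.stringify(listed));
console.log(JSON.stringify(called));
```

Note the two different failure channels in this sketch: a tool that runs but fails reports `isError: true` inside `result`, while a request the server can't route at all gets a JSON-RPC `error`.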
Conclusion
This is the basic structure of how MCP works. We have a host that runs clients, which connect to servers. The clients and servers communicate using a transport and a protocol.
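One detail worth making concrete is the `id` matching mentioned earlier. Here's a minimal sketch of how a client might pair responses with the requests that caused them. `PendingRequests` is a hypothetical helper written for this illustration, not something from the MCP SDK.

```typescript
type RpcResponse = {
  jsonrpc: "2.0";
  id: number | string;
  result?: object;
  error?: { code: number; message: string };
};

type OnResponse = (res: RpcResponse) => void;

// Hypothetical client-side bookkeeping: one callback per in-flight request id.
class PendingRequests {
  private nextId = 1;
  private waiting = new Map<number | string, OnResponse>();

  // Build a request and remember who is waiting for its response.
  send(method: string, onResponse: OnResponse) {
    const id = this.nextId++;
    this.waiting.set(id, onResponse);
    return { jsonrpc: "2.0" as const, id, method };
  }

  // Route an incoming response to the matching request.
  // Returns false if nothing is waiting for that id.
  receive(res: RpcResponse): boolean {
    const onResponse = this.waiting.get(res.id);
    if (!onResponse) return false;
    this.waiting.delete(res.id);
    onResponse(res);
    return true;
  }
}

const pending = new PendingRequests();
const seen: string[] = [];
const listReq = pending.send("tools/list", () => seen.push("list"));
const callReq = pending.send("tools/call", () => seen.push("call"));

// Responses may arrive in any order; the id says which request each answers.
pending.receive({ jsonrpc: "2.0", id: callReq.id, result: {} });
pending.receive({ jsonrpc: "2.0", id: listReq.id, result: {} });
console.log(seen);
```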
▶Read: Serving MCP Servers Over stdout
Serving MCP Servers Over stdout
Matt Pocock
I'm going to show you the simplest way you can set up an MCP server.
We're going to create an MCP server, connect it to Claude Code, and then use it to run any script we want.
And we're going to do this all with a single TypeScript file and no build step.
Let's go.
The Server
We're going to start by creating a `main.ts` file.
We're going to initialize an `McpServer`:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

    const server = new McpServer({
      name: "Weather Service",
      version: "1.0.0",
    });

We're getting this MCP server from the `@modelcontextprotocol/sdk` package.

1. We can then add a tool to this server. We first define the name of the tool, `getWeather`.
2. Then we define what arguments the tool needs to receive in order to run. In this case, we only need the city, which is a string. And we specify this using Zod.
3. Finally, we add a callback function that only runs when the tool is called. We return an array of content objects, where in this case, we're returning text. And we say that the weather in that place is sunny. This is the function where you'd actually go and call the weather API if we were implementing this for real.

After step 1, the code looks like this:

    server.tool("getWeather");

Our server is actually complete. But we need some way for Claude Code to communicate with our server.
It's going to do this by running this file and communicating with it via stdin.
To get this working, we can use a `StdioServerTransport` from `@modelcontextprotocol/sdk`.
We first define the transport, then connect the server to it.

    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

    const transport = new StdioServerTransport();
    await server.connect(transport);

Connecting to Claude Code
Now in Claude Code, we can run `claude mcp add`. This starts an interactive dialogue where you can choose the server you want to connect to.
Right now Claude Code only supports `stdio` transport - which is what we're using.
In the video above, I walk through the interactive steps, but you can run this with a single command:

    claude mcp add "weather-example" npx tsx "/path-to-the-file.ts"

This tells Claude that in order to run the file, it should call `npx tsx /path-to-the-file.ts`.
`npx` lets us run any script from `npm`. `tsx` is a fabulous way to run TypeScript files without a build step. And then we pass in the path to our file.
And just like that we should be able to run `claude` and have it communicate with our MCP server. Check out the video above for a demonstration.
Why Is This Cool?
This is awesome because it allows us to connect arbitrary TypeScript functions to Claude Code without needing to set up a build step or a server.
This is a really powerful way to customize your own Claude Code instance, or other MCP hosts like Cursor and Windsurf.
As more desktop apps start integrating MCPs, this is going to be a really powerful way to extend them with custom capabilities.
▶Read: Using MCP Prompts
Using MCP Prompts
Matt Pocock
We've discussed so far how to add tools to your MCP server.
But something I haven't seen discussed a lot online is the fact you can use an MCP server to provide prompts to users.
This lets you use MCP as a prompt directory - a collection of prompts that users can pull in to achieve their goals quickly.
I'm going to show you one of my favorite prompt templates that I used in the creation of this article.
The Code
Let's start with our boilerplate code: the server and the transport:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

    const server = new McpServer({
      name: "Prompt Directory",
      version: "1.0.0",
    });

    // ...more code in here

    const transport = new StdioServerTransport();
    await server.connect(transport);

This server is going to house a `cleanTranscription` prompt. I record most of my articles via dictation, and this prompt helps me clean up mistakes in the transcription.
1. Let's start by adding a `prompt` to the server. This has a very similar syntax to `tool`. But instead of `server.tool`, we'll call `server.prompt`.
2. Next we'll add a description to describe what the prompt actually does.
3. This prompt is going to take in the path to the file to clean up. So let's specify that using Zod.
4. Finally, we'll add a callback to return the value of the prompt. This comes back in a messages array, with a role of `user` and some text content.

After step 1, the code looks like this:

    server.prompt("cleanTranscription");

This means we've added a prompt template called clean transcription to our MCP server.
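To make step 4 a bit more concrete, here's a dependency-free sketch of the value such a callback returns: a messages array containing one `user` message with text content. The `PromptResult` type is declared locally for illustration and the prompt text is a shortened placeholder. In the real server, a function like this would be passed as the final argument to `server.prompt`, after the name, description and Zod arguments from steps 1 to 3.

```typescript
// Shape of a prompt result as described above: a messages array with a role
// and text content. Declared locally so the sketch has no dependencies.
type PromptMessage = {
  role: "user" | "assistant";
  content: { type: "text"; text: string };
};

type PromptResult = { messages: PromptMessage[] };

// Shortened placeholder for the prompt text.
const buildPromptText = (path: string): string =>
  `Clean up the transcript in the file at ${path}`;

// The callback receives the prompt's arguments and returns the messages.
const cleanTranscription = ({ path }: { path: string }): PromptResult => ({
  messages: [
    {
      role: "user",
      content: { type: "text", text: buildPromptText(path) },
    },
  ],
});

const promptResult = cleanTranscription({ path: "./notes.md" });
console.log(promptResult.messages[0]?.content.text);
// Clean up the transcript in the file at ./notes.md
```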
The Prompt
For those interested, here is the prompt I use:
    const getPrompt = (path: string) => `
    Clean up the transcript in the file at ${path}
    Do not edit the words, only the formatting and any incorrect transcriptions.

    Turn long-form numbers to short-form:

    One hundred and twenty-three -> 123
    Three hundred thousand, four hundred and twenty-two -> 300,422

    Add punctuation where necessary.
    Wrap any references to code in backticks.
    Include links as-is - do not modify links.

    Common terms:

    LLM-as-a-judge
    ReAct
    Reflexion
    RAG
    Vercel
    `;

There are around two hundred common terms, so I've left them off for brevity.
The Demo
We can connect to this server from Claude Code using the same technique as we saw before.
When we do, the prompt template is surfaced as a command we can run from within Claude. We can access it by typing `/cleanTranscription` within the Claude Code interface. It shows the prompt via autocomplete.
When we run it, we are prompted to add the `path` argument.
And it runs the prompt on the file.
Conclusion
This is a simple example of how you can use an MCP server as a prompt directory. You can provide your users with a set of prompts that they can use to drive the behavior of their LLM.
This is pretty powerful as a personal productivity tool, and also as a way to provide commands for your MCP clients for often-used workflows.
Join over 54,000 Developers Becoming AI Heroes
Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.
▶Read: The Problem With MCP: Stateful Servers
The Problem With MCP: Stateful Servers
Matt Pocock
In a previous example, we created a simple MCP server that could be served over HTTP.
But it had some downsides.
One Client At A Time
The most obvious is that it could only serve one client at a time. It saved its current 'transport' in a single variable:
    import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";
    import express from "express";

    // The transport variable
    let transport: SSEServerTransport | undefined = undefined;

    const app = express();

    app.get("/sse", async (req, res) => {
      // Each new connection overwrites the previous transport.
      transport = new SSEServerTransport("/messages", res);
      await server.connect(transport);
    });

This meant that if a second client connected, the first client would lose connection.
Stateful Server
But let's imagine that we fix this. Instead of saving the transport in a single variable, we'd store it using an id where we could later retrieve it.
Even if we did this, we'd still have a problem.
This approach requires a stateful server. We'd need to keep track of the transports in memory. If the server process exits, the transports would be lost.
This puts constraints on how the server can be deployed. Serverless environments like Vercel or AWS Lambda would not work - they dispose of in-memory state between requests.
We'd need to deploy it to a long-running server, like a VPS.
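Here's a rough sketch of the "store it using an id" fix, and of why it still leaves us with a stateful server. `TransportRegistry` and `TransportSketch` are hypothetical names, and the transport is reduced to a single `send` method so the example has no dependencies.

```typescript
// Hypothetical minimal transport: just enough to show the bookkeeping.
type TransportSketch = { send: (message: string) => void };

class TransportRegistry {
  private transports = new Map<string, TransportSketch>();

  // Called when a client opens its connection.
  add(sessionId: string, transport: TransportSketch): void {
    this.transports.set(sessionId, transport);
  }

  // Called when a later request needs to reach that client again.
  get(sessionId: string): TransportSketch | undefined {
    return this.transports.get(sessionId);
  }

  // Called when the client disconnects.
  remove(sessionId: string): void {
    this.transports.delete(sessionId);
  }

  count(): number {
    return this.transports.size;
  }
}

const registry = new TransportRegistry();
const outbox: string[] = [];

// Two clients can now be connected at once.
registry.add("session-a", { send: (m) => outbox.push(`a:${m}`) });
registry.add("session-b", { send: (m) => outbox.push(`b:${m}`) });
registry.get("session-a")?.send("hello");

// But the map lives in process memory: a fresh process starts empty.
const afterRestart = new TransportRegistry();
console.log(registry.count(), afterRestart.count(), outbox);
```

The second registry stands in for a restarted (or freshly spun-up serverless) process: it has no idea that `session-a` ever existed.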
Store The State In Redis
A better approach is to store the state in a database. We could use a key-value store like Redis.
Now the transport information is stored in Redis, the server can be stateless. This means it can be deployed to serverless environments.
Vercel Example
This is the approach that `mcp-on-vercel` has taken. I recommend checking out their repo to learn more.
While their repo is designed to work with Vercel, it can be adapted to fit any serverless platform.
Conclusion
MCP servers are stateful by nature. But by storing the state in a database, we can deploy them to serverless environments.
This would be my recommended approach for deploying MCP servers to production.
However, it does raise the question: was there a way of designing this protocol so that the service could be stateless?
Having to hook up a key-value store feels like more overhead than necessary. Could the client not store the state instead?
I'll explore this in a future article.
▶Read: Logging: A Huge MCP Footgun
Logging: A Huge MCP Footgun
Matt Pocock
There's a really easy footgun that you can trigger when you're using the `stdio` transport with an MCP server.
Here it is: console logs don't work.
Why `console.log` Doesn't Work
An MCP server is really just a standard Node application. The client runs the MCP server, and monitors the `stdout` of the server for any MCP messages. It then sends back MCP messages to the server's `stdin`.
However, Node apps use `stdout` for another purpose: logging. When you log to the console with `console.log`, you're writing to the same stream that the MCP server is using to communicate with the client.
This means that if you log to the console in your server, you're going to be sending messages to the client that aren't MCP-formatted. Depending on the client, this could cause it to crash. And at the very least, your logs will be swallowed by the program consuming them.
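To see why a stray log line causes trouble, here's a minimal sketch of what a client does with each line it reads from the server's `stdout`. `parseStdoutLine` is a hypothetical helper, and the sketch assumes one JSON message per line. Real clients differ in how they react to a bad line, which is why the outcome ranges from swallowed logs to a crash.

```typescript
type ParsedLine =
  | { ok: true; message: Record<string, unknown> }
  | { ok: false; reason: string };

// Hypothetical client-side reader: every stdout line is expected to be a
// JSON-RPC 2.0 message, so anything else is rejected.
const parseStdoutLine = (line: string): ParsedLine => {
  let value: unknown;
  try {
    value = JSON.parse(line);
  } catch {
    return { ok: false, reason: "not valid JSON" };
  }
  if (typeof value !== "object" || value === null) {
    return { ok: false, reason: "not a JSON-RPC 2.0 message" };
  }
  const message = value as Record<string, unknown>;
  if (message["jsonrpc"] !== "2.0") {
    return { ok: false, reason: "not a JSON-RPC 2.0 message" };
  }
  return { ok: true, message };
};

const fromServer = '{"jsonrpc":"2.0","id":1,"result":{"tools":[]}}';
const strayLog = "Fetching weather for London...";

console.log(parseStdoutLine(fromServer).ok, parseStdoutLine(strayLog).ok);
// true false
```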
Why Doesn't `sse` Have This Problem?
The `sse` transport doesn't have this problem because it doesn't use `stdio` for transporting messages.
This means that it's free to use `stdout` for logging.
It's perfectly possible to use `sse` locally, so you may prefer using that method if accessing the console is important to you.
How to Log
My preferred solution is to create a custom logger that writes to a local file - `mcp-server.log`. Here's my implementation:

    // 1. appendFileSync
    import { appendFileSync } from "fs";
    import { join } from "path";

    // 2. LOG_FILE
    const LOG_FILE = join(
      import.meta.dirname,
      "mcp-server.log",
    );

    // 3. formatMessage
    function formatMessage(
      level: string,
      message: string,
      data?: unknown,
    ): string {
      const timestamp = new Date().toISOString();
      const dataStr = data
        ? `\n${JSON.stringify(data, null, 2)}`
        : "";
      return `[${timestamp}] [${level}] ${message}${dataStr}\n`;
    }

    // 4. logger
    export const logger = {
      log(message: string, data?: unknown) {
        const logMessage = formatMessage(
          "INFO",
          message,
          data,
        );
        appendFileSync(LOG_FILE, logMessage);
      },
      // ...
    };

1. We use `appendFileSync` to write to a file synchronously.
2. We define the path to the log file.
3. We define a function to format the message.
4. We export a logger object with a `log` method that writes to the log file. You can export other methods like `error`, `warn`, etc.
This way, you can log to a file without interfering with the MCP protocol.
A Possible Alternative: Notifications
A reader alerted me to the fact that MCP does have a concept of 'notifications', albeit somewhat buried in the documentation.
In theory, this would allow the server to send a message to the client.
However, the implementation of this would be up to the client - so whether you could use this mechanism for logging is an open question.
If anyone has used notifications for logging, I'd love to hear about it in the Discord.
Another Possible Alternative: `stderr`
I've had a tip that output logged to `stderr` - i.e., using `console.error` - is not read by the MCP client.
This means that you can potentially `tee` the process running the MCP server, and log it out to a file.
I'll cover this in more depth soon - consider this a TODO.
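If that tip holds for your client, a logger along these lines might be enough. It's a minimal sketch under that assumption: `createStderrLogger` and `formatLine` are hypothetical names, and the only real API used is `console.error`, which writes to `stderr` in Node. The sink is injectable purely so the behaviour can be checked without reading `stderr`.

```typescript
type Sink = (line: string) => void;

const formatLine = (level: "INFO" | "ERROR", message: string): string =>
  `[${level}] ${message}`;

// By default, lines go to console.error, which writes to stderr and so stays
// out of the stdout stream used for MCP messages.
const createStderrLogger = (sink: Sink = (line) => console.error(line)) => ({
  info: (message: string) => sink(formatLine("INFO", message)),
  error: (message: string) => sink(formatLine("ERROR", message)),
});

// Normal use: writes to stderr.
const stderrLogger = createStderrLogger();
stderrLogger.info("Weather server ready");

// Same logger with a capturing sink, to show what gets written.
const captured: string[] = [];
const capturingLogger = createStderrLogger((line) => captured.push(line));
capturingLogger.error("Could not reach weather API");
```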
Conclusion
You've got two choices when you're considering logging in an MCP server:
- Use a transport that doesn't interfere with `stdout`, like `sse`.
- Write to a `.log` file.
▶Read: Publish Your MCP Server To NPM
Publish Your MCP Server To NPM
Matt Pocock
A huge benefit of MCP's design is that you can very easily distribute MCP servers using freely available tools.
In this example, we're going to take our simple MCP server and publish it to NPM.
Once we've done that, anyone with Node installed will immediately be able to run our MCP server via `npx`. We'll be able to push updates to it, and users will get those updates automatically.
We're going to be following recommendations from my previous article on Total TypeScript on how to create an NPM package.
The Basics
The way users are going to run our MCP server is by using `npx`. This is a tool that comes with Node.js that lets users run any package on NPM.
They'll be running this command:

    npx -y @mattpocockuk/mcp-server-example@latest

This will:
- Download the latest version of our package to a global cache (using the `@latest` tag).
- Ignore the confirmation message that usually appears when running a package for the first time (using the `-y` flag).
- Run a file referenced inside that package's `package.json` file.
So, it's a way of running a package without having to install it.
It's worth noting that this is not without its dangers. If you're running a package from an untrusted source, you should always check the source code first.
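The last step, picking which file to run, comes down to the package's `bin` field, which we'll set up later in this article. As a rough sketch of the simplest case only: `resolveBin` below is a hypothetical helper that handles a `bin` with a single entry, and gives up otherwise. The real `npx` has further rules for packages with several entries, which this deliberately leaves out.

```typescript
// Only the fields this sketch needs; a real package.json has many more.
type PackageJsonSketch = {
  name: string;
  bin?: string | Record<string, string>;
};

// Handles the single-entry case only. Returns undefined when there is no
// bin field or when it has several entries.
const resolveBin = (pkg: PackageJsonSketch): string | undefined => {
  // bin may be a plain string pointing at one file.
  if (typeof pkg.bin === "string") return pkg.bin;
  const files = Object.values(pkg.bin ?? {});
  return files.length === 1 ? files[0] : undefined;
};

const examplePackage: PackageJsonSketch = {
  name: "@mattpocockuk/mcp-server-example",
  bin: { "mcp-server-example": "./dist/main.js" },
};

console.log(resolveBin(examplePackage));
// ./dist/main.js
```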
Setting Up The Server
The server and the transport will sit in a single file called `main.ts`. This is the file that will be executed when the user runs our server.
We're going to add a single line to the top of this file: a shebang.
This is a special line that tells the operating system how to run the file. In this case, we're telling it to run the file using Node.js.

    #!/usr/bin/env node

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    const server = new McpServer({
      name: "Weather Service",
      version: "1.0.0",
    });

    server.tool(
      "getWeather",
      {
        city: z.string(),
      },
      async ({ city }) => {
        return {
          content: [
            {
              type: "text",
              text: `The weather in ${city} is sunny!`,
            },
          ],
        };
      },
    );

    const transport = new StdioServerTransport();
    await server.connect(transport);

package.json
We'll need a `package.json` file to describe our package. Here's what it looks like:

    {
      "name": "@mattpocockuk/mcp-server-example",
      "version": "0.0.1",
      "description": "An example of a MCP server",
      "bin": {
        "mcp-server-example": "./dist/main.js"
      },
      "files": ["dist"],
      "type": "module",
      "scripts": {
        "build": "tsc",
        "ci": "npm run build",
        "prepublishOnly": "npm run ci"
      },
      "dependencies": {
        "@modelcontextprotocol/sdk": "^1.7.0"
      }
    }

Note that the comments will need to be deleted if you copy/paste this into your own `package.json`.
package.json.package.json\#binThe most important part of this file is the
binfield. This tellsnpxwhich file to run when executing your MCP server. The details for how this logic works is described here. The important quote is:If the package has a single entry in its
binfield inpackage.json[...] then that command will be used.So, having a single entry in the
binfield, no matter what its name is, means thatnpxwill run that file. In our case, that will bemcp-server-example.{ "bin": { "mcp-server-example": "./dist/main.js" }, }The file we're going to run is
./dist/main.js. This is the file that will be built by TypeScript.TypeScript
TypeScript is going to be responsible for turning our
.tsfile into a file that Node can run - a.jsfile.We'll start by adding a
tsconfig.jsonfile:{ "compilerOptions": { /* Base Options: */ "esModuleInterop": true, "skipLibCheck": true, "target": "es2022", "allowJs": true, "resolveJsonModule": true, "moduleDetection": "force", "isolatedModules": true, "verbatimModuleSyntax": true, /* Strictness */ "strict": true, "noUncheckedIndexedAccess": true, "noImplicitOverride": true, "module": "NodeNext", "outDir": "dist", "lib": ["ES2022"] } }This will tell typescript how to build our project. The most important part is the
outDirfield. This tells TypeScript where to put the built files. It means thatmain.tswill turn intodist/main.js.Publishing Our Package
To publish our package, we'll need to run the following command:
npm publish --access public
This will publish our package to NPM. The --access public flag means that anyone can install it.
You may get one of several errors:
- If you haven't logged in, you'll need to run npm login first.
- If your package name is already taken, you'll need to change it.
But once you've worked through them, your package will be published!
Testing It Out
You can test it out by running the following command:
npx -y @mattpocockuk/mcp-server-example@latest
This will run the latest version of your package. You can also run a specific version by changing @latest to @1.0.0 or whatever version you want to run.
Oddly, you won't see anything in the terminal. This is because the server is monitoring stdin for commands.
You can try connecting it to an MCP client like Cursor, and asking it for the weather in your city.
Conclusion
In this article, we've seen how easy it is to publish an MCP server to NPM. This means that anyone with Node.js installed can run your server with a single command.
This is a really powerful way of distributing your MCP servers to anyone who wants to use them.
▶Read: Google's Introduction To Agents Is GREAT
Google's Introduction To Agents Is GREAT
In September 2024, a group of folks at Google published a whitepaper on agents.
There's nothing particularly controversial or new in this paper.
But it's a really good set of definitions for some terminology around agents.
It defines what an agent is, what a tool is, and has a couple of extra interesting definitions which we'll get into.
I think it's a great introduction to the field and that's why we're covering it.
Definition Of 'Agent'
So let's start with the juicy stuff - the definition:
An agent is "an application that attempts to achieve a goal by observing the world and acting upon it using the tools that it has at its disposal."
This sounds about right, but of course a thermostat would also be an agent under this definition.
Helpfully, the paper also draws a line between agents and models.
There's a nice table here for us:
- Models are limited to what is available in their training data, whereas agents have access to external systems.
- Agents have access to chat history, which lets them add additional information to their context over time.
In other words, a model by itself cannot be an agent. An agent is a system containing memory, reasoning, planning, and tool calling.
But they also shy away from defining agents strictly. There's a funny turn of phrase in the introduction where they say this combination of factors "invokes the concept of an agent."
Basically, if it smells agentic, it's probably an agent.
Orchestration Layer
The paper also talks about orchestration layers as being important for agents.
The orchestration layer governs how the agent "takes in information, performs some internal reasoning, and uses that reasoning to inform its next decision or action."
In other words it's the feedback loop that I've covered elsewhere in my Vercel AI SDK tutorial.
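As a rough illustration of that feedback loop, here is a minimal sketch. Nothing in it comes from the whitepaper or the AI SDK: ModelStep, runAgentLoop and the scripted stand-in model are hypothetical names, and a real orchestration layer would call an LLM where the stub is.

```typescript
// Hypothetical types for illustration; not taken from the whitepaper or any SDK.
type ModelStep =
  | { type: "tool-call"; toolName: string; input: string }
  | { type: "final"; text: string };

type Model = (history: string[]) => ModelStep;
type Tools = Record<string, (input: string) => string>;

// The loop: take in information, let the model reason,
// act on its decision, and feed the observation back in.
function runAgentLoop(
  model: Model,
  tools: Tools,
  goal: string,
  maxSteps = 5,
): string {
  const history: string[] = [`goal: ${goal}`];

  for (let step = 0; step < maxSteps; step++) {
    const decision = model(history);

    if (decision.type === "final") {
      return decision.text;
    }

    const tool = tools[decision.toolName];
    const observation = tool
      ? tool(decision.input)
      : `unknown tool: ${decision.toolName}`;

    // The observation becomes part of the context for the next step.
    history.push(`${decision.toolName}(${decision.input}) -> ${observation}`);
  }

  return "stopped: step limit reached";
}

// A scripted stand-in for an LLM: call the weather tool once, then answer.
const scriptedModel: Model = (history) =>
  history.length === 1
    ? { type: "tool-call", toolName: "getWeather", input: "London" }
    : { type: "final", text: `Done. Last observation: ${history[history.length - 1]}` };

const answer = runAgentLoop(
  scriptedModel,
  { getWeather: (city) => `sunny in ${city}` },
  "What is the weather in London?",
);
console.log(answer);
```

The step limit is a design choice for the sketch: without it, a model that never returns a final answer would loop forever.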
Definition Of 'Tools'
Pretty much everyone agrees, though, that for a system to be agentic it needs access to tools.
The whitepaper defines tools as "bridging the gap between the agent's internal capabilities and the external world."
But it goes further than most tool definitions. It actually defines different types of tools: extensions and functions.
Extensions
Extensions are ways to extend the capability of the agent. They show the example of calling APIs, for instance to fetch the weather or fetch directions between two places.
You can think of extensions as things the agent can do without asking.
These are like typical tool definitions. We tell it how to use each tool using examples and tell the agent what arguments are needed to successfully call the tool.
Functions
Functions are different to extensions. You teach the LLM how to use them in exactly the same way. But instead of the agent having control over calling the tool, the user has control.
Let's take the example of deleting a post in a database. The system the agent is deployed on may not actually have the rights to delete that post. And for various reasons you may not want to give it access to that.
So instead of deleting the post itself, the agent tells the user "delete this post."
This is what it defines as functions: "the logic and execution of calling the actual API endpoint is offloaded away from the agent and back to the client-side application."
There are several reasons you might do this:
- As we described before, you don't want to give the agent permission to perform those actions
- You want to do some human-in-the-loop review
- Or you want to build the agent while reusing existing API endpoints
So to sum up: with extensions the agent handles the tool execution; with functions the client handles the tool execution.
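To make the split concrete, here is a small sketch of a dispatcher that runs extensions itself and hands functions back to the client. The types and names (ToolCall, dispatchToolCall, the two example tools) are hypothetical and are not an API from the whitepaper or from Vertex AI.

```typescript
// Hypothetical shapes for illustration; the whitepaper does not define this API.
type ToolCall = { name: string; args: Record<string, string> };

type ToolResult =
  | { handledBy: "agent"; result: string }
  | { handledBy: "client"; pendingCall: ToolCall };

// Extensions: the agent side holds an implementation and runs it itself.
const extensions: Record<string, (args: Record<string, string>) => string> = {
  getWeather: (args) => `It is sunny in ${args["city"] ?? "an unknown city"}`,
};

// Functions: the agent only knows the name; execution is left to the client.
const functionNames = new Set(["deletePost"]);

function dispatchToolCall(call: ToolCall): ToolResult {
  const extension = extensions[call.name];
  if (extension) {
    return { handledBy: "agent", result: extension(call.args) };
  }
  if (functionNames.has(call.name)) {
    // Hand the structured call back so the client can review and execute it.
    return { handledBy: "client", pendingCall: call };
  }
  throw new Error(`Unknown tool: ${call.name}`);
}

console.log(dispatchToolCall({ name: "getWeather", args: { city: "London" } }));
console.log(dispatchToolCall({ name: "deletePost", args: { postId: "42" } }));
```

In the second call nothing is deleted. The client receives the pending call and decides whether to run it, which is where a human-in-the-loop review could sit.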
My Thoughts
This definition is a bit funky. I always have a bit of an issue with hanging concepts on such commonly used words. If you're a developer you already know what a function is. And the definition of function versus extension is kind of unclear.
The whitepaper goes on to mention Google's Vertex AI platform. So it's possible that functions versus extensions is a key concept there.
I do like it as a teaching device but I'm not sure if I'm going to be using that definition myself.
Conclusion
Overall Google's whitepaper is a pretty nice introduction to the field, and adds a couple of interesting definitions to the discussion.
There's more stuff in there like examples using LangChain and Python, so it's worth checking out yourself.
But I hope you enjoyed that summary and I will see you for the next one.
▶Read: Logging: A Huge MCP Footgun
Model Context Protocol Tutorial
Logging: A Huge MCP Footgun
Matt Pocock
There's a really easy footgun that you can trigger when you're using the stdio transport with an MCP server.
Here it is: console logs don't work.
Why console.log Doesn't Work
An MCP server is really just a standard Node application. The client runs the MCP server, and monitors the stdout of the server for any MCP messages. It then sends back MCP messages to the server's stdin.
However, Node apps use stdout for another purpose: logging. When you log to the console, you're writing to the same stream that the MCP server is using to communicate with the client.
This means that if you log to the console in your server, you're going to be sending messages to the client that aren't MCP-formatted. Depending on the client, this could cause it to crash. And at the very least, your logs will be swallowed by the program consuming them.
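Here is a simplified sketch of what the client sees, assuming messages arrive on stdout as newline-delimited JSON-RPC objects. The parseServerOutput helper is hypothetical and much cruder than a real client, but it shows why a stray log line is a problem.

```typescript
// Simplified model of a client reading a server's stdout, one message per line.
// Assumption: messages are newline-delimited JSON-RPC objects.
type ParsedLine =
  | { ok: true; message: { jsonrpc: string } }
  | { ok: false; raw: string };

function parseServerOutput(stdout: string): ParsedLine[] {
  return stdout
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line): ParsedLine => {
      try {
        const parsed: unknown = JSON.parse(line);
        if (
          typeof parsed === "object" &&
          parsed !== null &&
          "jsonrpc" in parsed
        ) {
          return { ok: true, message: parsed as { jsonrpc: string } };
        }
      } catch {
        // Not JSON at all: fall through to the failure case.
      }
      return { ok: false, raw: line };
    });
}

const stdout = [
  '{"jsonrpc":"2.0","id":1,"result":{}}',
  "Fetching weather for London...", // a stray console.log
  '{"jsonrpc":"2.0","id":2,"result":{}}',
].join("\n");

const results = parseServerOutput(stdout);
console.log(results.map((r) => r.ok)); // [ true, false, true ]
```

This sketch skips the bad line. A stricter client might treat it as a protocol error instead.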
Why Doesn't sse Have This Problem?
The sse transport doesn't have this problem because it doesn't use stdio for transporting messages.
This means that it's free to use stdout for logging.
It's perfectly possible to use sse locally, so you may prefer using that method if accessing the console is important to you.
How to Log
My preferred solution is to create a custom logger that writes to a local file - mcp-server.log. Here's my implementation:
// 1. appendFileSync
import { appendFileSync } from "fs";
import { join } from "path";

// 2. LOG_FILE
const LOG_FILE = join(
  import.meta.dirname,
  "mcp-server.log",
);

// 3. formatMessage
function formatMessage(
  level: string,
  message: string,
  data?: unknown,
): string {
  const timestamp = new Date().toISOString();
  const dataStr = data
    ? `\n${JSON.stringify(data, null, 2)}`
    : "";
  return `[${timestamp}] [${level}] ${message}${dataStr}\n`;
}

// 4. logger
export const logger = {
  log(message: string, data?: unknown) {
    const logMessage = formatMessage(
      "INFO",
      message,
      data,
    );
    appendFileSync(LOG_FILE, logMessage);
  },
  // ...
};
1. We use appendFileSync to write to a file synchronously.
2. We define the path to the log file.
3. We define a function to format the message.
4. We export a logger object with a log method that writes to the log file. You can export other methods like error, warn, etc.
This way, you can log to a file without interfering with the MCP protocol.
A Possible Alternative: Notifications
A reader alerted me to the fact that MCP does have a concept of 'notifications', albeit somewhat buried in the documentation.
In theory, this would allow the server to send a message to the client.
However, the implementation of this would be up to the client - so whether you could use this mechanism for logging is an open question.
If anyone has used notifications for logging, I'd love to hear about it in the Discord.
Another Possible Alternative: stderr
I've had a tip that output written to stderr - i.e., using console.error - is not read by the MCP client.
This means that you can potentially tee the process running the MCP server, and log it out to a file.
I'll cover this in more depth soon - consider this a TODO.
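If you want to experiment with this, a logger that only ever writes to stderr might look like the sketch below. The names formatLogLine and stderrLogger are hypothetical, and whether a given client really ignores stderr is something to verify for yourself.

```typescript
// Formats a single log line; data is appended as compact JSON when provided.
function formatLogLine(level: string, message: string, data?: unknown): string {
  const dataStr = data === undefined ? "" : ` ${JSON.stringify(data)}`;
  return `[${level}] ${message}${dataStr}`;
}

const stderrLogger = {
  // console.error writes to stderr, leaving stdout free for MCP messages.
  info(message: string, data?: unknown) {
    console.error(formatLogLine("INFO", message, data));
  },
  error(message: string, data?: unknown) {
    console.error(formatLogLine("ERROR", message, data));
  },
};

stderrLogger.info("Tool called", { tool: "getWeather", city: "London" });
```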
Conclusion
You've got two choices when you're considering logging in an MCP server:
- Use a transport that doesn't interfere with stdout, like sse.
- Write to a .log file.
▶Read: Your App Is Only As Good As Its Evals
Your App Is Only As Good As Its Evals
As we discussed in our previous article, building with LLMs requires a fundamental shift in how you think about software development. You're no longer designing deterministic systems where inputs map to predictable outputs. Instead, you're working with probabilistic systems which are inherently unpredictable.
The key tool for managing this uncertainty is evals. Evals are the AI engineer's unit tests. They are how you wrangle predictability from a probabilistic system. They are an indispensable part of productionizing any AI app.
Let's break down what evals are, and why AI apps need them so badly.
Why Traditional Testing Isn't Enough
Traditional software testing relies on deterministic relationships between inputs and outputs. Each component has a clear domain of responsibility:
But LLM-powered systems are different. Every input goes through a complex transformation process that's hard to predict:
In AI systems, no change is small. Their attention and transformation mechanisms are inscrutable. Whether the butterfly "flaps" or "Flaps" its wings may change the output. To put it mildly, building robust systems with them requires care.
Manual QA Is Not Enough
It's easy to get an impressive AI demo working quickly. But getting that AI system to production is not easy.
Specifically, it's hard to know whether the things you're changing about your app are getting better or worse. You make a change, try a few favourite prompts and see if it "feels" better. But this is a dangerous way to work.
A "Manual QA-only" approach in deterministic software is usually doable. You say "I added a new page" and the QA team can rigorously test the new page, and smoke test the previous pages.
But in probabilistic systems, it is a killer. When any change can affect the entire system, you need a way to know if your system is getting better or worse.
This is especially true for large changes, like which model you use, or the design of your prompt.
Evals
The key is automation. We need to evaluate our app every time we make a change, or every time the underlying model changes.
Traditional, Deterministic Systems
In deterministic systems, automating testing is relatively straightforward. You can feed some inputs in and check the outputs.
const output = myNormalSystem(input);
// Will fail if the output doesn't match
assert(output === "my-desired-output");
These assertions are 'pass' or 'fail'. And usually, an app has to pass every test to be considered production-ready.
Probabilistic Systems
But writing these tests for AI isn't as straightforward.
Let's say your app generates written articles. You want to check that the output is good enough for production. You might need to write assertions for:
- Factuality: checking if all statements in the output are factually correct
- Writing style: ensuring that the text is elegant and well-written
- Prompt fidelity: ensuring that the output actually corresponds to what the user asked.
These are qualitative metrics. Instead of a pass/fail, they need to be represented by a score. Each time you change your app, you need to know if it made the system 5% better, or 50% worse.
This is what evals do - they give you a score you can use to see how well your AI system is performing.
Three Types Of Evals
There are three main types of evals you can run on your AI system.
Deterministic Evals
There are deterministic evals, which can be written as simple assertions.
const article = writeArticleWithLLM(prompt);
// Count words, not characters: article.length would be the character count
const wordCount = article.trim().split(/\s+/).length;
// Article should be at least 300 words long
assert(wordCount >= 300);
// Article should be at most 2,000 words long
assert(wordCount <= 2000);
These are traditional pass/fail checks. You would pass a wide variety of prompts into your system, and check each time if they pass these tests.
They're simple to write, but only cover a subset of what you want to evaluate.
Human Evaluation
For more probabilistic metrics, you have two choices.
You can use human evaluation to check whether your system is performing correctly. This is often your only choice early on, when you don't have a lot of data.
This is expensive, and time consuming - but all AI systems will rely on human input to some extent.
LLM As A Judge
Another technique is to pass the results of your prompts into another LLM, and use that LLM as a judge. This is currently a very fashionable way to evaluate AI systems.
Let's imagine you want to make sure your app is telling the truth. You can do that by passing your system's output into an LLM, along with some ground truth.
An example of this in action can be found on the Evalite docs.
LLM-as-a-judge makes certain evaluations possible - but at a cost. Running LLMs is expensive, so you need to think carefully about what cadence you run them at. Running your evals every time your files change, for example, would be prohibitively expensive.
Common strategies include splitting your evals into two sets - a smaller group for local testing, and a larger group to be run daily.
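As a rough sketch of the LLM-as-a-judge idea, the scorer below takes the judge as a plain function, so it can run against a canned reply instead of a paid API. The Judge type, buildJudgePrompt, factualityScore and the 0 to 10 grading scale are all assumptions made for illustration, not Evalite's API.

```typescript
// Hypothetical names throughout; `Judge` stands in for a call to a real LLM.
type Judge = (prompt: string) => Promise<string>;

function buildJudgePrompt(output: string, groundTruth: string): string {
  return [
    "You are grading an answer against a known-correct reference.",
    `Reference: ${groundTruth}`,
    `Answer: ${output}`,
    "Reply with a single number from 0 (contradicts) to 10 (fully consistent).",
  ].join("\n");
}

// Turns the judge's free-text reply into a score between 0 and 1.
// Anything unparseable scores 0, so a confused judge never inflates results.
async function factualityScore(
  judge: Judge,
  output: string,
  groundTruth: string,
): Promise<number> {
  const reply = await judge(buildJudgePrompt(output, groundTruth));
  const parsed = Number.parseInt(reply.trim(), 10);
  if (Number.isNaN(parsed)) return 0;
  return Math.min(10, Math.max(0, parsed)) / 10;
}

// A canned judge so the sketch runs without an API key.
const fakeJudge: Judge = async () => "8";

factualityScore(
  fakeJudge,
  "Apollo 11 landed on the Moon in 1969.",
  "The first crewed Moon landing was Apollo 11, in July 1969.",
).then((score) => console.log(score)); // 0.8
```

Because the judge is injected, the same scorer could run with a cheap model locally and a stronger one in the larger daily run.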
How Do I Improve My Evals Over Time?
Your evals are the method by which you monitor and improve your AI system. This also means that the dataset you use to evaluate your system is crucial.
You need to make sure that your evals are representative of the data your system will see in production. If you're building a classifier, you need to make sure your evals cover all the edge cases your system will see.
This means it's crucial to build observability and feedback systems into your application. Once your app is deployed, your users will be the judge of whether your system is working or not. Simple feedback buttons, like upvotes and downvotes, can give you extremely valuable insights into how your system is performing.
The Data Flywheel
Vercel, creators of v0, have written about the AI Native Flywheel. They describe the importance of evals in the AI development process.
The best data for your evals comes from your users. By carefully monitoring how users are using your app, you can build a feedback loop that will help you improve your system over time. Let's take an example:
- A user asks your app "build me a classy React application"
- Your app generates some React code. But instead of making the UI look "classy", it uses classes in the code.
- The user downvotes the response. Perhaps they even leave a comment explaining why.
- You take the prompt "build me a classy React application", and create a new eval for it. You add it to your eval suite.
- You improve the system until it passes the eval.
- You re-deploy. The next time a user hits this prompt, they get a better response.
This is the data flywheel in action. By carefully monitoring your system, and building in feedback loops, you can ensure your system is always improving.
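The step where a downvote becomes a new eval can be sketched in a few lines. The Feedback and EvalCase shapes and the addCasesFromFeedback helper are hypothetical; your own feedback system will record different fields.

```typescript
// Hypothetical shapes; adapt them to whatever your feedback system records.
type Feedback = {
  prompt: string;
  vote: "up" | "down";
  comment?: string;
};

type EvalCase = {
  input: string;
  // Notes for whoever writes the expected behaviour for this case.
  notes: string;
};

// Turns downvoted prompts into new eval cases, skipping ones already covered.
function addCasesFromFeedback(
  existing: EvalCase[],
  feedback: Feedback[],
): EvalCase[] {
  const seen = new Set(existing.map((c) => c.input));
  const added: EvalCase[] = [];

  for (const item of feedback) {
    if (item.vote !== "down" || seen.has(item.prompt)) continue;
    seen.add(item.prompt);
    added.push({
      input: item.prompt,
      notes: item.comment ?? "Downvoted without a comment",
    });
  }

  return [...existing, ...added];
}

const suite = addCasesFromFeedback(
  [{ input: "build me a todo app", notes: "baseline" }],
  [
    {
      prompt: "build me a classy React application",
      vote: "down",
      comment: "Used class components instead of a classy design",
    },
    { prompt: "build me a todo app", vote: "down" },
    { prompt: "build me a landing page", vote: "up" },
  ],
);
console.log(suite.map((c) => c.input));
```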
How Do I Run Evals?
There are many methods for running evals. A large number of startups have entered the space, offering tools to run your evals and view them online.
Braintrust is a popular choice. They provide a cloud platform for running evals and sharing the results with your team, along with many other features. You can write your evals in TypeScript using their SDK. However, they use rate limits on your evals - which can be frustrating when you're trying to iterate quickly.
I maintain a library called Evalite, which is a lightweight eval runner based on the TypeScript test runner Vitest.
Evalite is designed to allow you to run your evals locally. There's no cloud platform attached, so you can run your evals as often as you like. It's a good choice if you're just starting out.
How Evals Work In Practice
Imagine an eval kind of like a function:
const score = runEval({
  // 1. The prompts we'll test with
  data: [
    "Fish species in the Mediterranean",
    "Story of the first Moon landing",
    "Are Krakens real?",
  ],
  // 2. A function to generate outputs based
  // on our prompts
  task: async (topic) => {
    return generateArticle(topic);
  },
  // 3. The scorers we'll use to generate
  // the final score
  scorers: [
    // Checks if output is long enough
    length,
    // Uses an LLM to check if it's accurate
    factualAccuracy,
    // Uses an LLM to check writing style
    writingStyle,
  ],
});

// 4. A score between 0-100%
console.log(score);
We pass in a set of prompts (1), then the task to run (2), then the methods we're using to score our output (3).
Finally, we get back a score on how well our function performed (4).
This, at its heart, is what an eval is. This API is loosely inspired by Braintrust's autoevals library.
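For a sense of what could sit behind a function like that, here is a minimal sketch of a runner. It is a hypothetical implementation, not Evalite's or Braintrust's: it assumes every scorer returns a number between 0 and 1 and that the final score is a plain average.

```typescript
// A hypothetical runner matching the shape described above.
type Scorer = (output: string) => number | Promise<number>; // each returns 0 to 1

type EvalOptions = {
  data: string[];
  task: (input: string) => Promise<string>;
  scorers: Scorer[];
};

async function runEval({ data, task, scorers }: EvalOptions): Promise<number> {
  const scores: number[] = [];

  for (const input of data) {
    const output = await task(input);
    for (const scorer of scorers) {
      scores.push(await scorer(output));
    }
  }

  if (scores.length === 0) return 0;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return Math.round(mean * 100); // percentage, 0 to 100
}

// Deterministic scorer: is the output at least five words long?
const longEnough: Scorer = (output) =>
  output.trim().split(/\s+/).length >= 5 ? 1 : 0;

// Stand-in for an LLM call so the sketch runs offline.
const fakeGenerateArticle = async (topic: string) =>
  topic === "Are Krakens real?" ? "No." : `An article about ${topic} in full.`;

runEval({
  data: ["Fish species in the Mediterranean", "Are Krakens real?"],
  task: fakeGenerateArticle,
  scorers: [longEnough],
}).then((score) => console.log(`${score}%`)); // 50%
```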
It's The Evals, Stupid
This image, based on Vercel's AI Native Flywheel, captures the impact evals have on your application.
They should be the center of your feedback loop. As more users use your app (distribution), they'll give you more data (usage). You can use this data to improve your app (data), and then run your evals again (evals).
These evals let you respond to new techniques and models, and put you on a constant path of improvement.
▶Read: The Three Types Of Evals
The Three Types Of Evals
Deterministic Evals
- Deterministic evals are evals that return a pass/fail result. They seek to extract determinism from a probabilistic system.
- They are the "most useful kind of eval" according to Ian Webster. These evals should be fast, and developer focused.
- A brilliant example is from Discord's Ian Webster, where they checked that their AI bot, Clyde, always replied with a lowercase letter at the beginning of its messages. This meant that their bot was imitating the behavior of a Gen-Z user.
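A check like the lowercase one fits in a few lines. This is a sketch of the idea only, with a hypothetical startsWithLowercase helper and made-up replies; it is not Discord's actual test.

```typescript
// Pass/fail check: does the reply start with a lowercase letter?
function startsWithLowercase(reply: string): boolean {
  const first = reply.trimStart().charAt(0);
  // A character that changes when uppercased is treated as a lowercase letter.
  return first !== "" && first !== first.toUpperCase();
}

const replies = ["hey, what's up?", "Hello there.", "lol ok"];
const passed = replies.filter(startsWithLowercase).length;
console.log(`${passed}/${replies.length} passed`); // 2/3 passed
```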
LLM-as-a-Judge
- Some evaluations can be done via LLMs.
- autoevals templates are a good example of various types of these evals. Humor judges if something is funny. Battle compares two responses to find which one is better.
- You can even use LLMs to check factuality, by providing a ground truth statement to check the response against.
- Failing an LLM-as-a-judge evaluator is often a good indicator that a human should take a look. So, it's more like a smoke test than a real test (personal opinion).
Human Feedback
- Some evals can only be usefully evaluated by humans. These involve long-form text generation and certain types of factuality.
- Human oversight is needed for any type of LLM app.
▶Read: Evalite - an Early Preview
Evalite - an Early Preview
It feels crazy to me that there's no TypeScript-native, local-first tool for testing LLM-powered apps.
I want a simple test runner that can:
- Run my evals on a watch script
- Show me a UI for viewing traces, scores, inputs and outputs
- Not need me to sign up for an API key
So, I'm building one.
It's based on Vitest, and it's called Evalite.
Here's an early preview.
Don't know what evals are? Learn about them here.
▶Read: Evalite v1 Preview: Fast Evals, Built-in Scorers
Matt Pocock
If you've built evals before, you know the pain. Every time you need to check if your LLM's output is correct, you're writing another custom scorer. Is the SQL valid? Did it hallucinate? Is the JSON well-formed?
Evalite v1 (still in beta) solves this with 10 production-ready scorers, plus a major architecture upgrade that makes getting started trivial.
Check out the full docs at v1.evalite.dev.
10 Built-In Scorers
Evalite v1 ships with scorers for the most common eval scenarios. No more reinventing the wheel.
String Scorers
These are deterministic scorers for simple text validation:
- exactMatch - checks if output exactly matches expected string
- contains - checks if output contains a substring
- levenshtein - fuzzy string matching using Levenshtein distance
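For a sense of what fuzzy matching with Levenshtein distance involves, here is a minimal sketch. It is not Evalite's implementation: the function names are hypothetical, and dividing by the longer string's length to get a 0 to 1 score is an assumption.

```typescript
// Classic dynamic-programming edit distance, keeping only one previous row.
function levenshteinDistance(a: string, b: string): number {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current.push(
        Math.min(
          (previous[j] ?? 0) + 1, // deletion
          (current[j - 1] ?? 0) + 1, // insertion
          (previous[j - 1] ?? 0) + substitutionCost, // substitution
        ),
      );
    }
    previous = current;
  }

  return previous[b.length] ?? 0;
}

// Assumption: normalise by the longer string so the score falls between 0 and 1.
function levenshteinScore(actual: string, expected: string): number {
  const longest = Math.max(actual.length, expected.length);
  if (longest === 0) return 1;
  return 1 - levenshteinDistance(actual, expected) / longest;
}

console.log(levenshteinDistance("kitten", "sitting")); // 3
console.log(levenshteinScore("SELECT * FROM users", "SELECT * FROM users")); // 1
```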
The levenshtein scorer is particularly useful for SQL generation or code output where minor formatting differences shouldn't fail the eval:
scorers: [
  {
    scorer: ({ output }) =>
      levenshtein({
        actual: output,
        expected: "SELECT * FROM users WHERE id = 1",
      }),
  },
];
RAG Scorers
These use LLM-as-a-judge to evaluate RAG pipelines:
- faithfulness - detects hallucinations by checking if output is grounded in context
- answerSimilarity - compares semantic similarity between output and expected answer
- answerCorrectness - evaluates factual correctness against ground truth
- answerRelevancy - checks if output actually answers the question
- contextRecall - measures if all relevant context was retrieved
Example using faithfulness to catch hallucinations:
scorers: [
  {
    scorer: ({ output, input }) =>
      faithfulness({
        question: input.question,
        answer: output,
        groundTruth: input.context, // Retrieved context
        model: yourModel,
      }),
  },
];
Advanced Scorers
For specialized use cases:
- toolCallAccuracy - evaluates if agents called the right tools with correct arguments
- noiseSensitivity - tests prompt robustness by adding noise and checking consistency
Tool call accuracy is essential for agent evals:
scorers: [
  {
    scorer: ({ output }) =>
      toolCallAccuracy({
        actualCalls: output.toolCalls,
        expectedCalls: [{ toolName: "search", input: { query: "..." } }],
      }),
  },
];
Mix and Match Scorers
The real power comes from combining scorers. A comprehensive RAG eval might use:
scorers: [
  {
    scorer: (opts) => faithfulness({ ...opts, model: yourModel }),
  },
  {
    scorer: (opts) => answerRelevancy({ ...opts, model: yourModel }),
  },
  {
    scorer: (opts) => contextRecall({ ...opts, model: yourModel }),
  },
];
Each scorer returns a 0-1 score. Evalite aggregates them to give you an overall eval score.
In-Memory by Default
The biggest architectural change: Evalite v1 uses in-memory storage by default.
Previously, you needed to set up SQLite, which added friction for new users. Now you can run npx evalite and start evaluating immediately.
Want persistence? Switch to SQLite in your config:
// evalite.config.ts
import { defineConfig } from "evalite/config";

export default defineConfig({
  storage: {
    type: "sqlite",
    path: "./evalite.db",
  },
});
But for most development workflows, in-memory is simpler and removes a setup step.
Deep Vercel AI SDK Integration
Evalite v1 is built around the Vercel AI SDK. Wrap any AI SDK model with wrapAISDKModel() to get automatic tracing and caching.
Cache Everything
Not just scorers - cache your entire eval pipeline. Wrap models used in your task function, in scorers, anywhere:
import { wrapAISDKModel } from "evalite/ai-sdk";
import { openai } from "@ai-sdk/openai";

const model = wrapAISDKModel(openai("gpt-4"));

evalite("RAG Eval", {
  data: [...],
  task: async (input) => {
    // Cached automatically
    const result = await generateText({
      model,
      prompt: input.question,
    });
    return result.text;
  },
  scorers: [
    {
      // Also cached automatically
      scorer: (opts) => faithfulness({ ...opts, model }),
    },
  ],
});
This transforms watch mode. Change scorer logic, tweak thresholds, refactor eval structure - the expensive LLM calls stay cached. Only run what changed.
evalite watch
The wrapper works across all AI SDK methods: generateText(), streamText(), generateObject(), and streamObject().
Zero overhead in production - wrapAISDKModel() is a no-op when called outside Evalite's context. Your production code runs exactly as before.
DX Improvements
The biggest DX improvement is auto .env support. Environment variables load automatically.
The Evalite UI got several upgrades:
- Dark mode - Theme switcher for light/dark preferences
- Table rendering - Objects and arrays render as markdown tables instead of JSON trees
- Rerun button - Re-run evals in watch mode without restarting
- AI SDK message UI - Pass AI SDK messages directly and get custom UI rendering
Getting Started
Evalite v1 is still in active development (beta). There's no formal migration guide yet as features are still evolving.
To try it:
pnpm install evalite@beta
Full documentation at v1.evalite.dev.
Feedback Welcome
Since v1 is still beta, your feedback shapes the final release. Found a bug? Want a scorer we're missing?
Join the discussion on Discord or open an issue on GitHub.
▶Read: workshops day 4 intro to evals
You've built a DeepSearch agent and even hooked it up to observability and can see what it's doing.
How do you know if it's actually getting better? How do you measure success and ensure your experiments are leading to real improvements, not just changes based on "vibes"?
Day 4 is where we start vibe-checking our agent's performance and iterate towards an objectively better AI product.
You move from subjective feelings to objective, data-driven evaluation. It's time to learn LLM Evals – the AI engineer's equivalent of unit tests – designed to bring predictability to your probabilistic system.
We'll first take a second to discuss why evals matter, then install Evalite, an open-source tool I created, to set up and start running evals on our agents.
Evalite is built on top of Vitest and is a great option if you don't want to rely on a third-party cloud provider for your tests.
You'll walk through initializing Evalite in your project, setting up the evals folder, and understanding its .eval.ts file structure.
Next we'll choose success criteria, which will provide a score taking into account factuality, relevance, source utilization, timeliness, and speed.
You'll use those criteria to write your first scorer in Evalite and see if you can get your agent to a score of 100.
By the end of Day 4, you'll have set up a foundational evaluation framework for your DeepSearch agent. You'll understand how to define what "good" looks like and have the tools to start measuring it, paving the way for more sophisticated evaluations and a truly data-driven approach to improving your AI application.
Day 4: Vibe-check your AI App Through Evals with Evalite
Matt Pocock
▶Read: workshops day 5 evals continued
You've taken the first steps into the world of LLM evaluations, setting up Evalite and writing your initial deterministic scorer. Now, it's time to level up your testing game.
Day 5 dives deeper into creating more sophisticated evaluations and building the datasets that fuel them, all driven by the powerful concept of the "Data Flywheel."
In short, the Data Flywheel is the process by which users' interactions with your application are fed into evals, which in turn improve your product.
"Evals -> Better Product -> More Users -> More Data -> Better Evals"
The next step is to implement LLM-as-a-Judge, which uses another LLM to assess the accuracy (and other success criteria) of your agent's answers.
However, your LLM Judge won't be of much use if you don't have a solid dataset to work from. You'll need to build this dataset yourself and provide "ground truth" for your LLM Judge to compare your agent's answers against.
Building a proper dataset is no joke: you'll be spending the majority of your time doing this. Simple true/false questions are easy to come by, but our goal here is to make your agent fail. To really stress test your agent, you'll need a set of "multi-hop" reasoning questions that force your agent to break down complex problems to provide the proper answer.
By the end of Day 5, you'll have significantly expanded your evaluation toolkit. You'll be able to implement LLM-as-a-Judge evals for complex criteria like factuality and have a solid methodology for building and iterating on evaluation datasets that push your DeepSearch agent to its limits, driving continuous improvement.
Day 5: Expand your Evals with LLM-as-a-Judge and Datasets
Matt Pocock
▶Read: workshops day 3 observability
Building products around AI can be frustrating because your agent responses can be working one day and totally off on the next.
When things get complex with long conversations, database calls, and multiple LLM tools, understanding what's really happening under the hood is crucial.
This means observability is a major key to success when building an AI-driven product.
Day 3 is all about giving you the power to see, debug, and confidently improve your agent.
You will:
- Learn what observability and traces are
- Set up Langfuse (a popular open-source solution)
- Explore your first few traces
- Set up custom trace metadata for your application
- Implement a time-aware web scraper tool
By the end of Day 3, you'll have transformed your DeepSearch agent from a mystery into a well-understood system. You'll have the skills to diagnose issues quickly, understand agent behavior clearly, and deploy with greater confidence.
Day 3: Debug and Improve your Agent through Observability
Matt Pocock
▶Read: Securing your AI App with Guardrails
Securing your AI App with Guardrails
Matt Pocock
Here's a pattern for validating incoming requests to your LLM application to make sure that they're safe. We're in TypeScript and we're using the AI SDK.
The Fast Model Approach
The juiciest nugget is inside the
checkIsRequestSafefunction. The idea here is that we use a very, very fast model, in this case Gemini 2.0 Flash Lite.const checkIsRequestSafe = async ( modelMessages: ModelMessage[], ) => { const guardrailResult = await generateText({ model: google('gemini-2.0-flash-lite'), system: GUARDRAIL_SYSTEM, messages: modelMessages, }); // The text will either be '0' or '1' const text = guardrailResult.text.trim(); // If text is exactly '1', the request is safe. if (text === '1') { return true; } return false; };We pass it all of the messages in the conversation history so far, and we get it to run extremely quickly by only returning a single token. This is done inside the guardrail system prompt.
```md
## Response Format

Respond with just the number: 1 or 0
```
Because it's just returning a single token, this will run very, very quickly. And it means that we can do some imperative logic here, saying if the text is exactly `1`, then the request is safe, so we can return `true`.
Alternative Approaches
Lots of folks pointed out in the comments that it doesn't have to be `0` or `1`; you can use a single word, since the LLM is generating tokens under the hood, not numbers.
So it would take exactly the same amount of time for it to generate the word "safe" as it would to generate the word "one", but this is the implementation I landed on and it works pretty well.
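As a rough sketch of the single-word variant, the check could look something like this. The verdict words `safe` and `unsafe` and the helper names are assumptions for illustration, not part of the original implementation. The one design choice worth noting is that it fails closed: anything other than an exact `safe` is treated as unsafe.

```typescript
// Hypothetical verdict words; the article's own implementation uses '1' and '0'.
type GuardrailVerdict = 'safe' | 'unsafe';

// Normalise the raw model text and fail closed: anything that is not
// exactly "safe" is treated as unsafe.
const parseGuardrailVerdict = (rawText: string): GuardrailVerdict => {
  const normalised = rawText.trim().toLowerCase();
  return normalised === 'safe' ? 'safe' : 'unsafe';
};

// Boolean wrapper so it can stand in for the `text === '1'` check.
const isVerdictSafe = (rawText: string): boolean =>
  parseGuardrailVerdict(rawText) === 'safe';
```

With this in place, `isVerdictSafe(guardrailResult.text)` would replace the comparison against `'1'`.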
Streaming Text to the Frontend
Next we should look at this `streamTextToWriter` function. It's useful any time that you need to stream any arbitrary text to the frontend.
```ts
const streamTextToWriter = (
  writer: UIMessageStreamWriter,
  message: string,
) => {
  const id = crypto.randomUUID();

  writer.write({
    type: 'text-start',
    id,
  });

  writer.write({
    type: 'text-delta',
    id,
    delta: message,
  });

  writer.write({
    type: 'text-end',
    id,
  });
};
```
Here we're streaming three parts, `text-start`, `text-delta` and `text-end`, all with the same `id`. To be honest it's pretty verbose and I wish the AI SDK would make this a little bit easier, since all we really need to do is stream this message to the frontend.
Continuing with Normal Streaming
Finally then we continue streaming with the writer and with the model messages.
```ts
const continueStreaming = (
  writer: UIMessageStreamWriter,
  modelMessages: ModelMessage[],
) => {
  const streamTextResult = streamText({
    model: google('gemini-2.0-flash'),
    messages: modelMessages,
  });

  writer.merge(streamTextResult.toUIMessageStream());
};
```
This is a normal `streamText` call where we're merging it into the writer.
Applications of Fast LLM Responses
This pattern of gating the rest of your logic on a very, very fast LLM response is something that we can map onto lots of different use cases:
- We could use this as a router to route certain requests through a smarter LLM
- Or route simple ones to a stupid fast LLM, which would be an amazingly cost-saving measure
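The router idea can be sketched along the same lines as the guardrail. This is only an illustration under assumptions: the tier names, the `Classifier` type and the stub classifier are all made up here, and in a real app the classifier would wrap a fast LLM call that returns a single short label.

```typescript
// Stand-in for a fast LLM call that returns a single short label.
type Classifier = (userMessage: string) => Promise<string>;

// Hypothetical tier names; map these to whichever models you actually use.
type ModelTier = 'fast-model' | 'smart-model';

const pickModelTier = async (
  userMessage: string,
  classify: Classifier,
): Promise<ModelTier> => {
  const label = (await classify(userMessage)).trim().toLowerCase();
  // Only an exact 'simple' goes to the cheap model. Anything else,
  // including unexpected output, falls back to the smarter model.
  return label === 'simple' ? 'fast-model' : 'smart-model';
};

// Stub classifier used here instead of a real LLM call.
const stubClassifier: Classifier = async (userMessage) =>
  userMessage.length < 40 ? 'simple' : 'complex';
```

Defaulting to the smarter model on unexpected output trades a little cost for safety; flipping that default would favour cost instead.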
▶Read: Here's 2024's best resource on prompt engineering
The Prompt Report is an extraordinary piece of work. It's a meta-analysis of the entire literature of prompt engineering. The Latent Space crew mentioned it as one of their top reads for 2025.
The thing I love most about it is its careful taxonomy of different terms:
📁 Prompting Techniques
├── 📁 Zero-Shot
│   ├── 📄 Emotion Prompting
│   ├── 📄 Role Prompting
│   ├── 📄 Style Prompting
│   └── 📄 Self-Ask
│
├── 📁 Few-Shot
│   ├── 📄 Exemplar Generation
│   ├── 📄 Exemplar Ordering
│   └── 📄 Exemplar Selection
│
├── 📁 Thought Generation
│   └── 📁 Chain of Thought
│       ├── 📁 Zero-Shot CoT
│       │   ├── 📄 Analogical Prompting
│       │   ├── 📄 Step-Back Prompting
│       │   └── 📄 Thread-of-Thought
│       │
│       └── 📁 Few-Shot CoT
│           ├── 📄 Active-Prompt
│           ├── 📄 Auto-CoT
│           ├── 📄 Memory-of-Thought
│           └── 📄 Prompt Mining
│
├── 📄 Ensembling
├── 📄 Self-Criticism
└── 📄 Decomposition
Yes, this isn't even all of the different techniques mentioned. It even goes into multimodal and multi-lingual techniques, as well as security.
Each section links to further papers. This makes it an ideal place to start if you're looking for more detail in this area.
Join over 54,000 Developers Becoming AI Heroes
Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.
▶Read: Anthropic thinks you should build agents like this
In December 2024, Anthropic dropped an article on Building Effective Agents. It really clarified a few concepts for me. Here's my summary:
Agents vs Workflows
It defines Agents as different from Workflows.
Workflows use "predefined code paths", whereas agents "dynamically direct their own processes". The difference between them, then, is the degree of autonomy. Agents have more autonomy, workflows run on pre-defined rails.
This still feels to me like a gradient - but having the scale be "degree of autonomy" is something I've not seen much before.
Don't Use Frameworks By Default
The article repeatedly warns against using frameworks as a first port of call. They "often create extra layers of abstraction that can obscure the underlying prompts and responses".
They mention LangGraph, Amazon Bedrock Agents, Rivet and Vellum. These frameworks "make it tempting to add complexity when a simpler setup would suffice".
Instead, using LLM APIs directly is what they recommend to start with. You should only use a framework when "you understand the underlying code".
This sounds right to me - the primitives of building good AI workflows don't sound that difficult to manage. Putting them in a framework prematurely feels odd. Note that I don't consider something like Vercel's AI SDK a framework - it's just a compatibility library.
Many Workflow Patterns, One Agentic Pattern
The most wonderful part of this article is how it describes different workflows. Far from building effective agents, this article really describes how to build effective workflows. It describes:
- Prompt chaining, where "each LLM call processes the output of the previous one"
- Routing, where an LLM router "classifies an input and directs it to a specialized followup task"
- Parallelization, where LLM calls are run in parallel.
- Orchestrator-workers, where a "central LLM dynamically breaks down tasks" and "delegates them to worker LLMs".
- Evaluator-optimizer, where "one LLM call generates a response while another provides evaluation" in a loop.
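To make the first of these concrete, here's a minimal sketch of prompt chaining. It isn't code from Anthropic's article: `LlmCall`, `ChainStep` and `runPromptChain` are hypothetical names, and the LLM is injected as a plain function so the control flow is visible without any SDK.

```typescript
// Stand-in for any LLM call: takes a prompt, resolves to text.
type LlmCall = (prompt: string) => Promise<string>;

// Each step turns the previous step's output into the next prompt.
type ChainStep = (previousOutput: string) => string;

const runPromptChain = async (
  initialInput: string,
  steps: ChainStep[],
  callLlm: LlmCall,
): Promise<string> => {
  let current = initialInput;
  // The code path is fixed in advance: step 1, then step 2, and so on.
  for (const buildPrompt of steps) {
    current = await callLlm(buildPrompt(current));
  }
  return current;
};
```

The order of steps is decided by the code, not by the model, which is what puts this on the workflow side of the line.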
Finally, it describes a single pattern that it considers agentic. Agents:
- "Plan and operate independently, potentially returning to the human for further information or judgement"
- "Gain ground truth from the environment at each step", using tool call results or code execution
- Terminate upon completion, or using a "stopping condition (such as a maximum number of iterations)"
In summary, agents are "typically just LLMs using tools based on environmental feedback in a loop".
I really dig this definition - it demystifies agents nicely.
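That "LLMs using tools based on environmental feedback in a loop" definition fits in a few lines. The sketch below is my own illustration, not code from the article: the `AgentDecision` shape, `DecideNext` and `runAgentLoop` are assumed names, and the model is an injected function.

```typescript
type AgentDecision =
  | { kind: 'tool'; toolName: string; input: string }
  | { kind: 'finish'; answer: string };

// Stand-in for the LLM: given everything observed so far, choose the next move.
type DecideNext = (observations: string[]) => Promise<AgentDecision>;

type AgentToolFn = (input: string) => Promise<string>;
type AgentToolSet = { [toolName: string]: AgentToolFn | undefined };

const runAgentLoop = async (
  decideNext: DecideNext,
  tools: AgentToolSet,
  maxIterations = 5,
): Promise<string> => {
  const observations: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const decision = await decideNext(observations);
    // Terminate upon completion.
    if (decision.kind === 'finish') return decision.answer;
    const runTool = tools[decision.toolName];
    // Tool results are the "ground truth from the environment" for the next step.
    observations.push(
      runTool
        ? await runTool(decision.input)
        : `Unknown tool: ${decision.toolName}`,
    );
  }
  // Stopping condition: maximum number of iterations reached.
  return 'Stopped: reached the maximum number of iterations.';
};
```

Unlike a workflow, nothing here fixes the order of tool calls; the model picks each step from what it has observed so far.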
▶Read: Google's Introduction To Agents Is GREAT
Google's Introduction To Agents Is GREAT
In September 2024, a group of folks at Google published a whitepaper on agents.
There's nothing particularly controversial or new in this paper.
But it's a really good set of definitions for some terminology around agents.
It defines what an agent is, what a tool is, and has a couple of extra interesting definitions which we'll get into.
I think it's a great introduction to the field and that's why we're covering it.
Definition Of 'Agent'
So let's start with the juicy stuff - the definition:
An agent is "an application that attempts to achieve a goal by observing the world and acting upon it using the tools that it has at its disposal."
This sounds about right, but of course a thermostat would also be an agent under this definition.
Helpfully the paper also draws a line between agents versus models.
There's a nice table here for us:
- Models are limited to what is available in their training data, whereas agents have access to external systems.
- Agents have access to chat history, which lets them add additional information to their context over time.
In other words, a model by itself cannot be an agent. An agent is a system containing memory, reasoning, planning, and tool calling.
But they also shy away from defining agents strictly. There's a funny turn of phrase in the introduction where they say this combination of factors "invokes the concept of an agent."
Basically if it smells agentic it's probably an agent.
Orchestration Layer
The paper also talks about orchestration layers as being important for agents.
The orchestration layer governs how the agent "takes in information, performs some internal reasoning, and uses that reasoning to inform its next decision or action."
In other words it's the feedback loop that I've covered elsewhere in my Vercel AI SDK tutorial.
Definition Of 'Tools'
Pretty much everyone agrees though that for a system to be agentic it needs access to tools.
The whitepaper defines tools as "bridging the gap between the agent's internal capabilities and the external world."
But it goes further than most tool definitions. It actually defines different types of tools: extensions and functions.
Extensions
Extensions are ways to extend the capability of the agent. They show the example of calling APIs, for instance to fetch the weather or fetch directions between two places.
You can think of extensions as things the agent can do without asking.
These are like typical tool definitions. We tell it how to use each tool using examples and tell the agent what arguments are needed to successfully call the tool.
Functions
Functions are different to extensions. You teach the LLM how to use them in exactly the same way. But instead of the agent having control over calling the tool, the user has control.
Let's take the example of deleting a post in a database. The system the agent is deployed on may not actually have the rights to delete that post. And for various reasons you may not want to give it access to that.
So instead of deleting the post itself, the agent tells the user "delete this post."
This is what it defines as functions: "the logic and execution of calling the actual API endpoint is offloaded away from the agent and back to the client-side application."
There are several reasons you might do this:
- As we described before, you don't want to give the agent permission to perform those actions
- You want to do some human-in-the-loop review
- Or you want to build the agent while reusing existing API endpoints
So to sum up: with extensions the agent handles the tool execution; with functions the client handles the tool execution.
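One way to picture that split is a dispatcher that either runs the tool itself or hands the call back to the client. This is a sketch under assumptions, not the whitepaper's code: the types, the `getWeather` handler and `dispatchToolCall` are all hypothetical.

```typescript
type SketchToolCall = { toolName: string; args: Record<string, string> };

type ToolOutcome =
  | { handledBy: 'agent'; result: string }
  | { handledBy: 'client'; pendingCall: SketchToolCall };

type ExtensionHandler = (args: Record<string, string>) => string;

// Extensions: the agent side holds the implementation and runs it directly.
const extensionHandlers: { [toolName: string]: ExtensionHandler | undefined } = {
  getWeather: (args) => `Sunny in ${args['city']}`,
};

const dispatchToolCall = (call: SketchToolCall): ToolOutcome => {
  const runExtension = extensionHandlers[call.toolName];
  if (runExtension) {
    return { handledBy: 'agent', result: runExtension(call.args) };
  }
  // Functions: the agent only emits the call. The client decides whether
  // to execute it, for example after a human-in-the-loop review.
  return { handledBy: 'client', pendingCall: call };
};
```

A `deletePost` call would come back with `handledBy: 'client'`, leaving the actual deletion to code that holds the right permissions.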
My Thoughts
This definition is a bit funky. I always have a bit of an issue with hanging concepts on such commonly used words. If you're a developer you already know what a function is. And the definition of function versus extension is kind of unclear.
The whitepaper goes on to mention Google's Vertex AI platform. So it's possible that functions versus extensions is a key concept there.
I do like it as a teaching device but I'm not sure if I'm going to be using that definition myself.
Conclusion
Overall Google's whitepaper is a pretty nice introduction to the field, and adds a couple of interesting definitions to the discussion.
There's more stuff in there like examples using LangChain and Python, so it's worth checking out yourself.
But I hope you enjoyed that summary and I will see you for the next one.
▶Read: workshops day 6 task decomposition
Our DeepSearch agent is becoming increasingly capable, but relying on a single, monolithic system prompt for every action from searching and scraping to answering has its limits. It makes the system brittle, sensitive to prompt changes, and restricts our ability to optimize individual tasks (like using different models or temperatures for different steps).
Day 6 is about a fundamental architectural shift: decomposing our agent's operations into distinct, manageable tasks. We'll move from letting the AI SDK manage the tool-calling loop to building our own custom loop, giving us granular control and unlocking a new level of sophistication in our app.
To do this there will be a few major steps you will take.
First you'll create a `SystemContext` class in TypeScript that will act as our container for building out our loop. You'll then create a 'Next Action Picker' that has a smart LLM decide what step to take next, and finally build the loop itself.
By the end of Day 6, you will have fundamentally re-architected your DeepSearch agent. You'll have a powerful, custom loop that gives you precise control over each step of its operation, paving the way for more advanced features, better per-task optimization, and a more robust, maintainable system.
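For a feel of how those pieces might fit together, here's a rough sketch. It is not the course's implementation: the methods on `SystemContext`, the `NextAction` shape and `runTaskLoop` are all assumptions, and the searching and scraping are replaced by simple log entries.

```typescript
type NextAction =
  | { type: 'search'; query: string }
  | { type: 'scrape'; urls: string[] }
  | { type: 'answer' };

// Container for everything the loop has done so far (hypothetical shape).
class SystemContext {
  private step = 0;
  private readonly findings: string[] = [];

  shouldStop(maxSteps: number): boolean {
    return this.step >= maxSteps;
  }
  incrementStep(): void {
    this.step += 1;
  }
  report(finding: string): void {
    this.findings.push(finding);
  }
  getFindings(): string {
    return this.findings.join('\n');
  }
}

// Stand-in for the LLM call that picks the next action from the context.
type NextActionPicker = (ctx: SystemContext) => Promise<NextAction>;

const runTaskLoop = async (
  ctx: SystemContext,
  pickNextAction: NextActionPicker,
  maxSteps = 10,
): Promise<string> => {
  while (!ctx.shouldStop(maxSteps)) {
    const action = await pickNextAction(ctx);
    if (action.type === 'answer') break;
    // A real loop would call search or scrape tools here; this sketch only logs.
    if (action.type === 'search') {
      ctx.report(`searched: ${action.query}`);
    } else {
      ctx.report(`scraped: ${action.urls.join(', ')}`);
    }
    ctx.incrementStep();
  }
  // A real app would pass these findings to a separate answering prompt.
  return ctx.getFindings();
};
```

Because each action is handled in its own branch, each one could use a different model, temperature or prompt.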
Day 6: Agent Architecture through Task Decomposition
Matt Pocock
▶Read: workshops day 7 task decomposition continued
Now that you've restructured your AI app through Task Decomposition, you have much better control but as it currently stands the UX is not great.
In Day 7, we will make our app's internal workings (the steps it takes) visible to the user, ensure our telemetry is insightful, and persist conversations so that they can be properly continued.
The first step is to show the agent's thoughts to the user. We'll modify our `getNextAction` function to accomplish this so that when the LLM has completed a step it can notify the user with proper information.
One bug you might have noticed up to this point is that if you send multiple messages in a chat, the LLM doesn't remember the previous conversation. We'll fix that by providing chat history through `SystemContext`.
And finally we will persist conversations, including agent steps, to our database so agents and users can take advantage of past conversations.
By the end of Day 7, your agent will not only be more powerful internally but also more transparent to the user and able to maintain conversational context and persist its state.
Day 7: Improved App UX and Persistence with Agent Task Decomposition
Matt Pocock
▶Read: workshops day 8 agents vs workflows
In this workshop we're going to adjust the amount of power we give to the LLM to choose our control flow.
Doing this bumps up against a hot topic in AI engineering circles - agents vs workflows.
The opposite of an 'agent' is a 'workflow'.
The less 'agentic' your application is, the more predictable it becomes. Entirely deterministic applications do the same thing every time - even if they contain some probabilistic components.
Agentic systems aren't all bad - letting the LLM handle the control flow means your application becomes very flexible.
But in a system like DeepSearch, we may want to experiment with adding more determinism to our system.
In our next few exercises, we're going to experiment with turning the dial back towards determinism.
Day 8: Agents vs Workflows
Matt Pocock
▶Read: workshops day 9 advanced patterns
In this workshop you'll be taking on a couple more advanced challenges to really flex those AI muscles you've been building over the last few days.
- Creating an evaluator-optimizer loop
- And showing sources in the frontend
The evaluator-optimizer loop
Really what we've been calling our agent is actually simply an if-else statement.
- If we have enough information, we answer the question.
- If we don't have enough information, we search.
So this is feeling less to me like an agent, and more like an evaluator. The description of an evaluator-optimizer loop from Anthropic's Building Effective Agents article feels apt:
One LLM call generates a response while another provides evaluation and feedback in a loop.
In this lesson, you will embrace that design, and optimize around it.
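As a rough picture of that design, the loop below generates a draft, has a second call evaluate it, and feeds the evaluation back in. It's a sketch only: `GenerateDraft`, `EvaluateDraft` and `runEvaluatorOptimizer` are hypothetical names, and both LLM calls are injected as plain functions.

```typescript
// Stand-in for the generating LLM call; feedback is null on the first round.
type GenerateDraft = (question: string, feedback: string | null) => Promise<string>;

// Stand-in for the evaluating LLM call.
type EvaluateDraft = (
  question: string,
  draft: string,
) => Promise<{ pass: boolean; feedback: string }>;

const runEvaluatorOptimizer = async (
  question: string,
  generate: GenerateDraft,
  evaluate: EvaluateDraft,
  maxRounds = 3,
): Promise<string> => {
  let feedback: string | null = null;
  let draft = '';
  for (let round = 0; round < maxRounds; round++) {
    draft = await generate(question, feedback);
    const verdict = await evaluate(question, draft);
    if (verdict.pass) return draft;
    // Feed the evaluator's feedback into the next generation round.
    feedback = verdict.feedback;
  }
  // Best effort: return the last draft if no round passed.
  return draft;
};
```

In DeepSearch terms, the evaluator's feedback would typically say what information is still missing, which then drives the next search.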
Showing Sources In The Frontend
The current problem with our setup is that the frontend doesn't receive anything until an action is taken.
One thing I really like from observing other DeepResearch implementations is the way they display their sources. They're often displayed as a list of cards, with a favicon, title and snippet.
In this lesson we will be copying that pattern.
Day 9: Advanced Patterns
Matt Pocock
▶Read: Implementing Anthropic's "think" Tool In TypeScript
Matt Pocock
Anthropic just released some new research about a simple technique that can help Claude (and likely other LLMs) with tasks requiring complex problem solving.
It's simple—you provide the LLM with a "think" tool it can call.
(Want to know what tools are? Check out this guide.)
The theory is that this tool is useful to give the LLM a moment to think before making a decision. It provides a structured way for the model to reflect on the information it has before proceeding.
This allows the LLM to save important information in its context, which can be used later to make better decisions. It echoes familiar concepts like ReAct and Reflexion.
Let's try implementing it.
The Prompt
We're first going to pull out a `description` variable, which will serve as the tool's description:
```ts
const description = `
  Use the tool to think about something.
  It will not obtain new information or change the database,
  but just append the thought to the log.
  Use it when complex reasoning or some cache memory is needed.
`;
```
This text is pulled from Anthropic's article.
The Tool
1. Let's start with a `streamText` call from the AI SDK.
2. Let's use Claude 3.7 and set a `maxSteps` of 10.
3. Next, let's pass it a tool called `think`, passing the `description` we defined earlier.
4. We'll next add a `parameters` object to the `think` tool. This object will contain a `thought` field, which is a string. The description is also taken from the article.
5. Finally, let's add an `execute` function to the `think` tool. This function won't do anything: it'll simply return the `thought` passed to it. This then gets saved in the context, and in future iterations of the model, it can be used to make better decisions.
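To show the shape of the finished tool without depending on any SDK, here's a dependency-free sketch. The `SketchTool` type, `contextLog` and `runThink` are assumptions made for illustration and are not the AI SDK's own types. The parts that come from the steps above are the description text, the single `thought` string parameter, and an `execute` that just returns the thought.

```typescript
type ThinkArgs = { thought: string };

// Hypothetical, dependency-free tool shape for illustration only;
// it is not the AI SDK's own tool type.
type SketchTool<Args, Result> = {
  description: string;
  parameterDescriptions: Record<keyof Args, string>;
  execute: (args: Args) => Promise<Result>;
};

const thinkTool: SketchTool<ThinkArgs, string> = {
  description:
    'Use the tool to think about something. It will not obtain new information or change the database, but just append the thought to the log. Use it when complex reasoning or some cache memory is needed.',
  parameterDescriptions: { thought: 'A thought to think about.' },
  // Deliberately a no-op: echo the thought back so it lands in the context.
  execute: async ({ thought }) => thought,
};

// Stand-in for the conversation context that tool results are appended to.
const contextLog: string[] = [];

const runThink = async (thought: string): Promise<string> => {
  const result = await thinkTool.execute({ thought });
  contextLog.push(`think: ${result}`);
  return result;
};
```

The tool has no side effects beyond the log entry, which is the whole point: the value is in the thought being written down where later steps can read it.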
```ts
import { streamText } from "ai";

const result = await streamText({});
```
Conclusion
And that's it! We've implemented the "think" tool in Claude 3.7 Sonnet.
This is a really useful technique for applying in certain situations. Anthropic recommends it for "complex tasks requiring policy adherence and reasoning in long chains of tool calls." It's a simple addition to your LLM implementation that can yield meaningful improvements in just a few lines of code.
Happy experimenting!
▶Read: STORM: Generate High-Quality Articles Based On Real Research
STORM: Generate High-Quality Articles Based On Real Research
I came across a really cool paper today. It's trying to address a really hard problem: writing really good articles from scratch.
It's called STORM. Not only is it a cool name, but it's a really great example of a very complex AI workflow.
So while you may not be interested in the domain of generating articles, STORM is a great paper to look at if you want to build better AI workflows.
Let's take a look.
▶Read: workshops day 2 persistence
Using `useChat` from Vercel AI SDK makes it easy to get a basic chatbot up and running. But turning that quick prototype into a fully functional, persistent chat product is where most developers get stuck.
How do you save user and assistant messages as they stream in? What’s the right moment to create and store a new chat ID? How do you support routing by `chatId` and keep your UI in sync when the chat history changes?
This workshop is all about bridging that gap: taking your chat from demo to durable, with backend persistence, real-time updates, and solid routing.
You will:
- Build a `POST /api/chat` endpoint using `createDataStreamResponse` to stream completions from OpenAI
- Use Drizzle to create a new chat record and return the generated `chatId` over the stream itself
- Automatically redirect users to the new `/chat/:chatId` route once it’s available
- Save each user and assistant message to your database as it streams in
- Rehydrate the chat UI with existing messages pulled from the database when loading a past conversation
- Keep your sidebar list of chats updated using route revalidation, so it always reflects the latest state
By the end of this workshop, you’ll have an end-to-end persistent chat experience: streaming, saving, routing, and reloading conversations seamlessly. More importantly, you’ll know how to build chat apps that feel real and resilient, not just impressive prototypes.
Day 2: Persistence
Matt Pocock
▶Read: cohorts build deepsearch in typescript
This is a cohort-based course and the lessons will start unlocking on July 14th, 2025.
Building AI applications that are genuinely useful involves more than just hitting an LLM API and getting back stock chat responses.
The difference between a proof-of-concept and a production application lies in the details.
Generic chat responses might work for demos, but professional applications need appropriate outputs that align with specific requirements.
In a professional environment code is (ideally) tested, metrics are collected, analytics are displayed somewhere.
AI development can follow these established patterns.
You will hit roadblocks when trying to:
- Implement essential backend infrastructure (databases, caching, auth) specifically for AI-driven applications.
- Debug and understand the "black box" of AI agent decisions, especially when multiple tools are involved.
- Ensure chat persistence, reliable routing, and real-time UI updates for a seamless user experience.
- Objectively measure AI performance moving beyond subjective "vibe checks" for improvements.
- Manage complex agent logic without creating brittle, monolithic prompts that are hard to maintain and optimize.
In this course you will build out a "DeepSearch" AI application from the ground up to help you understand and implement these patterns and ensure a production-ready product.
Days 00-02: Getting Started
You’ll start with a project that is already built out using Next.js, TypeScript (of course), PostgreSQL through the Drizzle ORM, and Redis for caching.
For the first couple of days you will implement fundamental AI app features, building out a naive agent. You’ll start by hooking up an LLM of your choice to the Next.js app using the AI SDK and implement a search tool it can use to supplement its knowledge when conversing with users.
Chats with an LLM don’t save themselves, so you will also save conversations to the database.
Days 03-05: Improve Your Agent through Observability and Evals
The first real differentiator between a vibe-coded side project and a production-ready product you can feel confident putting in front of customers is observability and evals.
You need to know what is going on with your LLM calls, and you need an objective means of judging the output that LLMs are producing.
This is exactly what the next few days are about. You’ll hook up your app to Langfuse and get familiar with looking through traces produced by the application.
Once you can see what your LLM is doing, it's time to test the inputs and outputs of your agent using evals. Evals are the unit tests of the AI application world, and we’ll start by wiring up Evalite, which is an open-source, Vitest-based runner. You’ll learn about what makes good success criteria and build out your evals, including implementing LLM-as-a-Judge and custom datasets specific to your product. We’ll also discuss how you can capture negative feedback into traces that you can feed into your app to make it better.
Days 06-07: Agent Architecture
Up until this point, your app is driven by an increasingly large prompt that will become unwieldy and impossible to test and iterate on when your app complexity grows.
We’ll take these next two days to revisit the overall architecture of our application and refactor it to better handle complex multi-step AI processes. The primary idea behind this refactor is called Task Decomposition, where you allow a smart LLM to determine the next steps to take based on the current conversation but leave room to hand off the actual action to focused or cheaper models.
Days 08-09: Advanced Patterns
In the last two days we will evaluate what our DeepSearch agent is and how we can further optimize its output. You’ll learn the differences between “Agent” and “Workflow” and see how in this use case we’ll lean harder into workflow patterns to build a more reliable product.
In AI-land, this pattern is called the evaluator-optimizer loop, which effectively means that if the agent has enough information, it will answer the question presented, but if it doesn’t, it will search for more information. With this pattern defined we’ll embrace and optimize around its design.
By the end of this cohort you will be confident in building out AI applications that are reliable and will improve through iteration and user feedback. LLMs and the whole AI field are changing rapidly, so understanding these fundamentals will give you a foundation to navigate building applications for years to come.
Contents
- 1.1 What Are We Building?
- 1.2 Installation Instructions (Don't Skip This!)
- 1.3 Explore The Repo
- 1.4 Setting Up Postgres
- 1.5 Using Drizzle & Drizzle Studio
- 1.6 Setting Up Redis
- 1.7 FAQ's
- 2.1 Introduction
- 2.2 Choose An LLM
- 2.3 Our First Model Call
- 2.4 Set Up Discord Authentication
- 2.5 Create A Naive Agent With Serper
- 2.6 Showing Tool Calls In The Frontend
- 2.7 Search Grounding (optional)
- 2.8 Rate Limiting (optional)
- 2.9 Rate Limiting Anonymous Users (optional)
- 2.10 Connecting Our App To MCP Servers (optional)
- 3.1 Create Database Resources For Persisting Messages
- 3.2 Persist Chats To The Database
- 3.3 Creating New Chats In The Frontend
- 3.4 Showing The Saved Chats In The Frontend
- 3.5 Fixing The 'New Chat' Button (optional)
- 3.6 Adding 'use-scroll-to-bottom' (optional)
- 4 Day 3: Debug and Improve your Agent through Observability (available from July 16, 2025)
- 4.1 Choosing An Observability Platform
- 4.2 Integrating Langfuse
- 4.3 Passing Extra Metadata To Langfuse
- 4.4 Adding A Scraper
- 4.5 Making The LLM Date-Aware (optional)
- 4.6 Improving Our Crawler (optional)
- 4.7 Reporting DB Calls To Langfuse (optional)
- 5.1 Initializing Evalite
- 5.2 Choosing Our Success Criteria
- 5.3 Making Our System Testable
- 5.4 Our First Deterministic Eval
- 5.5 Adding A Global Rate Limiter (optional)
- 5.6 Optimizing Our Prompt (optional)
- 6 Day 5: Expand your Evals with LLM-as-a-Judge and Datasets (available from July 18, 2025)
- 6.1 The Data Flywheel
- 6.2 Our First LLM-As-A-Judge Eval
- 6.3 Create A Simple Dataset
- 6.4 Organizing Our Dataset Into Dev, CI and Regression (optional)
- 6.5 Assessing Answer Relevancy (optional)
- 6.6 Extracting The Parameters Of Our System (optional)
- 7.1 What's Wrong With Our Current Approach?
- 7.2 Designing Our New System Prompt
- 7.3 Creating a Next Action Picker
- 7.4 Implementing The Loop
- 7.5 Connecting Our Loop To The Frontend
- 7.6 Optimize Our Answering System Prompt With Exemplars (optional)
- 7.7 Smoothing Our Streaming (optional)
- 8.1 Showing The Steps Taken in The Frontend
- 8.2 Fixing Telemetry
- 8.3 Passing The Message History
- 8.4 Persisting Our New Setup To The Backend
- 8.5 Generating Chat Titles (optional)
- 8.6 Adding Geolocation Info To The System Prompt (optional)
- 9.1 Agents vs Workflows
- 9.2 Collapse Search and Crawl into one tool
- 9.3 Search, Scrape, Summarize
- 9.4 Making A Query Rewriter
- 9.5 Use A Combined Search/Scrape API Instead (optional)
- 9.6 Resumable Streams (optional)
- 10.1 Building An Evaluator
- 10.2 Showing Sources In The Frontend
- 10.3 Implementing Guardrails (optional)
- 10.4 Implement An Ask Clarifying Questions Step (optional)
- 10.5 Showing Usage In The Frontend (optional)
- 10.6 Migrating to AI SDK v5 (optional)
July 14—July 25, 2025
Build DeepSearch in TypeScript
▶Read: How To Make Codebases AI Agents Love
How To Make Codebases AI Agents Love
AI is not a super-powered developer. It's a new starter with no memory. Every time you spawn an agent, it's like the guy from Memento stepping into your codebase going, "Okay, I'm here, what am I doing?"
Your codebase, way more than your prompt or your `AGENTS.md` file, is the biggest influence on AI's output. If it's designed wrong, it costs you in three ways.
Poor feedback loops. The AI doesn't receive feedback fast enough, so it doesn't know if what it changed actually did what it intended.
Hard to navigate. The AI finds it super hard to make sense of things, find files, and work out how to test things.
Cognitive burnout. You end up trying to hold together AI and your codebase, patching it all up manually.
What AI Actually Sees
Imagine this is your codebase:
Each square is a module that exports some functionality — a function, a variable, a component. Inside, you have vague groupings: a thumbnail editor, a video editor, authentication, CRUD forms.
You understand this mental map. But the AI doesn't. What it sees is this:
A bunch of disparate modules that can all import from each other. No groupings, no relationships. And the file system doesn't help — it's all jumbled together.
You're spawning 20+ new starters a day to look at your codebase and make changes. Your codebase needs to be friendly and navigable for them.
The Solution: Deep Modules
The file system and design of your codebase needs to match the internal map you have of it. The best way I've found to do that is with deep modules.
The idea of deep modules comes from "A Philosophy of Software Design." It's simple: lots of implementation controlled by a simple interface.
Instead of many small modules:
You end up with big chunks of functionality with simple, controllable interfaces. Any exports have to come through that interface.
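As a small illustration of "lots of implementation behind a simple interface", here's a sketch of a deep module squeezed into one file. The `ThumbnailService` name and its behaviour are invented for the example. In a real codebase this would more likely be a folder whose index file exports only the interface and the factory.

```typescript
// Public interface: the only thing the rest of the codebase may depend on.
interface ThumbnailService {
  render(title: string): string;
}

// Everything inside the factory is implementation detail. Callers cannot
// reach these helpers, so they can be rewritten freely (by you or by AI).
const createThumbnailService = (maxTitleLength = 20): ThumbnailService => {
  const normalise = (text: string): string => text.trim().replace(/\s+/g, ' ');

  const truncate = (text: string): string =>
    text.length > maxTitleLength
      ? `${text.slice(0, maxTitleLength - 3)}...`
      : text;

  const frame = (text: string): string => `[ ${text} ]`;

  return {
    render: (title) => frame(truncate(normalise(title))),
  };
};
```

Tests only need to exercise `render`, which is what lets the internals change without anyone outside the module noticing.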
Grey Box Modules
Deep modules create a natural seam in your codebase. You carefully control and design the interface. The implementation inside? Delegate that to AI.
Write tests that lock down the module's behavior. Then you don't need to look inside. You can if you want to — to apply taste, influence the outcome, or improve performance — but as long as the tests pass, you don't need to care.
These are grey box modules. You own the interface. AI owns the implementation. Tests keep it honest.
Improved Navigability
Give each module its own folder with a clear public interface. The AI can see all the services on the file system, read their types, and understand what they do — without digging into the implementation.
We've designed our codebase for progressive disclosure of complexity. The interface sits at the top and explains what the module does. When we need to, we can look inside.
Reduced Cognitive Burnout
Instead of holding hundreds of interrelated modules in your head, you keep seven or eight chunks. The AI manages what's inside each one. You only worry about designing the interfaces and how they fit together.
This is still a million miles from vibe coding. You need to apply taste at the boundaries — deciding what goes into which module. But the mental map is radically simpler.
Good Practice Remains Good Practice
This is nothing new. This is how good codebases have been designed for 20 years. What works for humans is also great for AI.
Summary
Your codebase is probably not ready for AI. Instead of deep modules, you've got a web of interconnected shallow modules:
These are hard to navigate, hard to test, and hard to keep in your head.
The fix is deep modules with clear interfaces and strong tests. Think about module boundaries from your PRDs through to your implementation issues. Tests and feedback loops are essential — they're how your AI new starters know if their changes work.
Some languages make this easier than others. In TypeScript, it's not easy to enforce these boundaries — I've been using Effect more and more because it makes modularizing your codebase simple.
▶Read: Essential AI Coding Feedback Loops For TypeScript Projects
Essential AI Coding Feedback Loops For TypeScript Projects
Matt Pocock
When working with AI coding agents, especially those operating independently, you need feedback loops so the AI can verify its own work. Feedback loops are especially important when you're doing AFK coding, such as with Ralph Wiggum.
1. Set Up TypeScript and Type Checking
TypeScript is essentially free feedback for your AI. Use it over JavaScript.
Create a typecheck script in your package.json:
    { "scripts": { "typecheck": "tsc" } }
TypeScript catches errors the AI would never find without testing in a browser.
2. Add Automated Tests
Use a test framework like Vitest for logical errors:
    { "scripts": { "test": "vitest" } }
Basic unit tests covering core functionality help keep the AI on track.
3. Install Husky for Pre-commit Hooks
Husky enforces feedback loops before every commit.
Install and initialize Husky:
    pnpm install --save-dev husky
    pnpm exec husky init
Create a .husky/pre-commit file that runs all your checks:
    npx lint-staged
    npm run typecheck
    npm run test
If any step fails, the commit is blocked and the AI gets an error message.
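To make the fail-fast behavior concrete, here is a small TypeScript sketch of the same idea. It is not how Husky works internally; `Check`, `Runner`, `runChecks` and `fakeRunner` are hypothetical names, and the runner is injected so the sketch runs without spawning real processes.

```typescript
// Hypothetical types: a check is a named command, and a runner reports its exit code.
type Check = { label: string; command: string };
type Runner = (command: string) => number;

type CheckResult =
  | { ok: true; ran: string[] }
  | { ok: false; ran: string[]; failed: string };

// Runs checks in order and stops at the first non-zero exit code,
// mirroring the "any step fails, the commit is blocked" behavior.
function runChecks(checks: Check[], run: Runner): CheckResult {
  const ran: string[] = [];
  for (const check of checks) {
    ran.push(check.label);
    if (run(check.command) !== 0) {
      return { ok: false, ran, failed: check.label };
    }
  }
  return { ok: true, ran };
}

// The same three steps as the pre-commit file.
const preCommitChecks: Check[] = [
  { label: "format", command: "npx lint-staged" },
  { label: "typecheck", command: "npm run typecheck" },
  { label: "test", command: "npm run test" },
];

// A fake runner for demonstration: pretend the type check fails.
const fakeRunner: Runner = (command) => (command === "npm run typecheck" ? 1 : 0);

const result = runChecks(preCommitChecks, fakeRunner);
// result.ok is false, result.failed is "typecheck", and the "test" step never ran.
```

The useful property for an agent is that the first failing step is named, so it knows which error to fix before trying the commit again.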
Another powerful feedback loop is making sure the LLM can access your running dev server locally to check your frontend. See this video for how to set that up.
4. Set Up Automatic Code Formatting
Use lint-staged with Prettier to auto-format code before commits.
Install lint-staged:
    pnpm install --save-dev lint-staged
Configure .lintstagedrc:
    { "*": "prettier --ignore-unknown --write" }
This runs Prettier on all staged files and automatically restages them. All AI-generated code now conforms to your formatting standards.
We could also run ESLint here since it works nicely with lint-staged.
Why This Works for AI
AI agents don't get frustrated by repetition. When code fails type checking or tests, the agent simply tries again. This makes feedback loops (and pre-commit hooks, especially) incredibly powerful for AI-driven development.
▶Read: Tracer Bullets: Keeping AI Slop Under Control
In this article, I'm going to help you solve the slop problem by encouraging your AI agent to think in tracer bullets, small pieces of functionality that get built end-to-end.
It's a classic software technique that works incredibly well with AI.
The Problem: Too Much Slop
AI has a natural inclination to sycophancy. It aims to please in every aspect of its behavior: "You're absolutely right!"
In code, this means it wants to produce complete solutions all at once. It has in mind the idea for a finished file, or a finished feature, and it produces all of the code needed in one leap.
It doesn't stop to validate assumptions or get feedback. It just keeps going, building entire layers in isolation, without ever testing whether the critical path actually works.
The result? You end up with enormous chunks of code that need reworking. Slop. And there is a huge review burden on the developer to check all this crap.
What This Looks Like in Practice
For example, you ask the AI to build a database service with an API that connects to it.
The AI builds:
- All the API endpoints (GET, POST, PUT, DELETE)
- Complete request/response models
- Error handling middleware
- Authentication logic
- Rate limiting
- Logging infrastructure
Only after all that does it try to connect to the database. And surprise, the connection string format is wrong. Or it's using an incorrect column type. Or the tests don't actually hit the database.
The Pragmatic Programmer calls this "outrunning your headlights." The AI is building too much in the dark, without feedback loops to validate assumptions early. And you, the developer, end up with a huge mess to clean up.
The Solution: Tracer Bullets
The concept of a tracer bullet comes from The Pragmatic Programmer. It's a small, end-to-end slice of functionality that touches all the layers of your system at once.
Instead of building horizontal layers in isolation, you build one tiny vertical slice:
- Build a small feature end-to-end
- Test it immediately
- Get feedback
- Move to the next slice in a fresh context window
- Repeat
This approach solves the "outruns its headlights" problem directly. You get feedback loops built into the process instead of churning out features blindly.
And once you've tested that small slice, you can feel confident shipping features that work, rather than hoping the big layers all fit together.
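Here is a minimal sketch of what one vertical slice might look like in TypeScript. Everything in it is hypothetical (a made-up "notes" feature, an in-memory store standing in for a real database, a handler shaped like an HTTP endpoint). The point is only that a single path through storage, service and handler exists and is exercised before any other endpoint is written.

```typescript
// Hypothetical "notes" feature, built as one thin slice through every layer.

// Layer 1: storage (in-memory stand-in for a real database).
type Note = { id: number; text: string };
function createNoteStore() {
  const notes: Note[] = [];
  return {
    insert(text: string): Note {
      const note = { id: notes.length + 1, text };
      notes.push(note);
      return note;
    },
    findById(id: number): Note | undefined {
      return notes.find((n) => n.id === id);
    },
  };
}

// Layer 2: service with the one rule this slice needs.
function createNoteService(store: ReturnType<typeof createNoteStore>) {
  return {
    addNote(text: string): Note {
      if (text.trim() === "") throw new Error("Note text is required");
      return store.insert(text.trim());
    },
  };
}

// Layer 3: a handler shaped like an HTTP endpoint (only POST /notes for now).
type ApiResponse = { status: number; body: unknown };
function handlePostNote(
  service: ReturnType<typeof createNoteService>,
  body: { text?: string },
): ApiResponse {
  try {
    return { status: 201, body: service.addNote(body.text ?? "") };
  } catch (error) {
    return { status: 400, body: { message: (error as Error).message } };
  }
}

// Exercise the slice end to end before adding GET, PUT, or DELETE.
const store = createNoteStore();
const service = createNoteService(store);
const created = handlePostNote(service, { text: "  first note " });
const rejected = handlePostNote(service, {});
// created.status is 201 and the note is readable from the store;
// rejected.status is 400 because the text is missing.
```

Only once this path works would the next slice (say, reading a note back through a GET handler) be worth building.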
Forcing the AI to Think Small
You have to be extremely explicit with your prompt. AI's natural inclination is to build big layers in isolation. You need to make it do an end-to-end slice across all the vertical layers.
Here's a prompt you can add to your Ralph loop or a 'build-feature' skill to force the AI to think in tracer bullets:
    ## Tracer Bullets
    When building features, build a tiny, end-to-end slice of the feature first, seek feedback, then expand out from there.
    Tracer bullets comes from the Pragmatic Programmer. When building systems, you want to write code that gets you feedback as quickly as possible. Tracer bullets are small slices of functionality that go through all layers of the system, allowing you to test and validate your approach early. This helps in identifying potential issues and ensures that the overall architecture is sound before investing significant time in development.
My Results With This Prompt
Here's a recent issue I worked on using tracer bullets with an AI agent:
    Looking at the issue, I need to implement a "Reveal in File System" feature for videos. Let me break this down:
    ## Task Breakdown for Issue #161
    1. **Tracer bullet**: Create backend endpoint that executes the PowerShell command to reveal a file in Windows Explorer from WSL
    2. Add "Reveal in File System" to standalone videos context menu
    3. Add "Reveal in File System" to video editor actions dropdown
    4. Add "Reveal in File System" to sidebar video context menu
    I'll start with the **tracer bullet** - creating the backend endpoint and wiring it up to one location.
Rather than building all four UI locations at once, I focused the AI on building just one end-to-end slice: the backend endpoint connected to a single UI location. This allowed me to validate that the core functionality worked before expanding it out.
The Classics Have the Answers
Tracer bullets aren't new. Test-driven development isn't new. These are old concepts from old books, foundational ideas that have been shaping best practices for decades.
The problem? When new technology emerges, people get excited and forget to go back to the classics. They chase what's shiny instead of what's proven.
But the principles apply harder to AI than they ever did to humans. Context window constraints make the discipline non-negotiable. You can't ignore tracer bullets with an AI agent the way you might with a human developer. The consequences are immediate and visible.
Bottom Line
The next time you're working with an AI agent, ask yourself: Am I letting it outrun its headlights? Am I getting it to validate assumptions early, or is it building in the dark?
Use tracer bullets. Force the agent to think small, build end-to-end, get feedback, and move forward with confidence.
Join over 54,000 Developers Becoming AI Heroes
Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.
▶Read: My Skill Makes Claude Code GREAT At TDD
My Skill Makes Claude Code GREAT At TDD
For the last few weeks, I've been using a TDD skill I wrote to do most of my non-frontend work.
It solved a lot of the problems that I previously had experienced with LLMs and tests.
If you want to try it out, here it is:
    npx skills add mattpocock/skills/tdd
For a longer breakdown, here you go:
The Problem: Why LLMs Fail at Tests
When you ask an LLM to "write a feature," it tends to work in horizontal slices. It writes the entire feature first, then writes tests for that feature afterward. This is problematic because you're not really verifying that the tests are actually testing what they're supposed to.
Here's what happens in practice:
| Horizontal Slicing (❌ Bad) | Vertical Slicing (✅ Good) |
| --- | --- |
| RED: write all tests | RED→GREEN: test1→impl1 |
| GREEN: write all code | RED→GREEN: test2→impl2 |
| REFACTOR: cleanup | RED→GREEN: test3→impl3 |
The core issue: Tests written in bulk test imagined behavior, not observed behavior.
When an LLM generates 10 tests upfront and then implements to pass them all, several bad things can happen:
- Tests verify mocks instead of real code paths
- Tests might not even run properly or have short circuits built in
- When the LLM's context is running low, it might just rewrite the test to make it pass instead of writing real implementation
Bad tests aren't just a review problem, they're a debt problem. Every test you create has to be maintained forever, just like code. Tests that aren't tied to actual behavior or are too coupled to implementation details become expensive liabilities.
The Solution: Red-Green-Refactor Vertical Slices
My TDD skill constrains Claude to work in vertical slices using tracer bullets:
    ONE test → ONE implementation → repeat
Each cycle responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
The Three Phases
RED: Write ONE test that fails. Just one.
GREEN: Write minimal code to pass that test only. Nothing speculative.
REFACTOR: After all tests pass, clean up duplications and simplify.
This constraint prevents cheating. If a test has to fail first, the LLM can't fake it: it has to write a real implementation to turn it green.
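To show what a few cycles might leave behind, here is a small sketch. The `cartTotal` function and the `expectEqual` helper are hypothetical, and the helper stands in for a real test runner such as Vitest so the example runs on its own. The comments mark which cycle introduced each piece.

```typescript
// A tiny stand-in for a test runner, so the sketch runs without Vitest.
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

type LineItem = { price: number; quantity: number };

// Cycle 1 GREEN: the reduce was just enough to pass "sums line items".
// Cycle 2 GREEN: the discount parameter appeared only when a failing test demanded it.
function cartTotal(items: LineItem[], discountPercent = 0): number {
  const subtotal = items.reduce((sum, item) => sum + item.price * item.quantity, 0);
  return subtotal * (1 - discountPercent / 100);
}

// Cycle 1: written first, failed (RED), then the minimal code above made it pass (GREEN).
expectEqual(cartTotal([{ price: 5, quantity: 2 }]), 10, "sums line items");

// Cycle 2: failed until the discount parameter was added.
expectEqual(cartTotal([{ price: 20, quantity: 1 }], 50), 10, "applies a percentage discount");

// Cycle 3: passed immediately, which shows the implementation already covered it.
expectEqual(cartTotal([]), 0, "empty cart totals zero");
```

Each test was added one at a time, so at no point did the implementation contain code that a test had not asked for.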
How the Skill Changes What Claude Builds
When you use this approach on a real feature, something interesting happens: the tests become a conversation Claude is having with its own code.
Each test asks a different question about the implementation:
- "Does this observable behavior work?"
- "How does the system handle edge cases?"
- "What happens when conditions change?"
This interrogation means Claude discovers things about its own implementation as it goes, rather than just checking boxes. And sometimes, a test that you write will pass immediately, not because it's a wasted test, but because the implementation is already robust enough to handle it.
What Makes a Test Good (vs Bad)
I've included details in the skill about good and bad tests - here's what I described:
Good Tests
Good tests exercise real code paths through public interfaces, not implementation details. They describe WHAT the system does, not HOW it does it.
    // GOOD: Tests observable behavior through the interface
    test("user can checkout with valid cart", async () => {
      const cart = createCart();
      cart.add(product);
      const result = await checkout(cart, paymentMethod);
      expect(result.status).toBe("confirmed");
    });
A good test reads like a specification: "user can checkout with valid cart" tells you exactly what capability exists. These tests survive complete internal refactors because they don't care about internal structure.
Bad Tests
Bad tests are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means instead of using the interface.
    // BAD: Tests implementation detail (mocking internals)
    test("checkout calls paymentService.process", async () => {
      // jest.spyOn wraps the internal collaborator's method in a mock function
      const processSpy = jest.spyOn(paymentService, "process");
      await checkout(cart, payment);
      expect(processSpy).toHaveBeenCalledWith(cart.total);
    });
    // BAD: Bypasses interface to verify (queries DB directly)
    test("createUser saves to database", async () => {
      await createUser({ name: "Alice" });
      const row = await db.query("SELECT * FROM users WHERE name = ?", ["Alice"]);
      expect(row).toBeDefined();
    });
The warning sign: your test breaks when you refactor, but the behavior hasn't changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior.
The Key Difference
| Good Tests | Bad Tests |
| --- | --- |
| Exercise real code through public interfaces | Mock internal collaborators |
| Describe WHAT the system does | Test HOW it's implemented |
| Survive internal refactors unchanged | Break on refactoring without behavior change |
| Read like specifications | Test the shape of data structures |
| Focus on user-facing behavior | Verify through external means (DB queries, call counts) |
The Planning Phase (Before Any Code)
In the skill, I found that using a planning phase before any code was written was extremely important. I decided to implement it using these questions:
- What interface changes are needed? What functions, methods, or APIs are being added or modified?
- Which behaviors matter most? You can't test everything. Prioritize critical paths and complex logic over edge cases.
- Can we design for deep modules? A deep module has a small interface but handles complex logic internally. This makes testing simpler and the API cleaner.
- Can we design for testability? Functions should accept dependencies rather than create them. They should return results instead of producing side effects.
The better the answers the user provided to these questions, the higher the code quality got.
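The testability question can be illustrated with a short sketch. The trial-expiry scenario, the `Clock` type and the `isTrialExpired` function are hypothetical. The function accepts its one dependency (the current time) rather than creating it, and returns a result rather than producing a side effect.

```typescript
// Hypothetical example: deciding whether a free trial has expired.
// A harder-to-test version would call `new Date()` inside the function,
// which ties every test to the real wall clock.

// Easier to test: the clock is passed in, and the answer is returned.
type Clock = () => Date;

function isTrialExpired(trialEndsAt: Date, now: Clock): boolean {
  return now().getTime() > trialEndsAt.getTime();
}

// Tests can now pin time to any moment they like.
const trialEnd = new Date("2024-01-31T00:00:00Z");
const beforeEnd: Clock = () => new Date("2024-01-30T00:00:00Z");
const afterEnd: Clock = () => new Date("2024-02-01T00:00:00Z");

const expiredEarly = isTrialExpired(trialEnd, beforeEnd); // false
const expiredLate = isTrialExpired(trialEnd, afterEnd); // true

// Production code passes the real clock.
const expiredNow = isTrialExpired(trialEnd, () => new Date());
```

No mocking library is needed here, which keeps the tests on the public interface.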
Why This Matters for Claude Code Users
The skill isn't about perfect tests. It's about honest tests through forced constraints.
By structuring Claude's work as one test, one implementation, repeat, you prevent it from:
- Writing imagined behavior instead of observed behavior
- Mocking internals and faking test passes
- Over-engineering the solution upfront
- Writing tests that are coupled to implementation details
The tests become trustworthy. You can delegate large parts of your work to Claude, not just code review but actual feature building, because you know the tests are honest.
And when you can trust the tests, you can trust the code.
▶Read: Real-world feature build with Claude Code: every step explained
Real-world feature build with Claude Code: every step explained
In this video, I walk through a real-world example of using Claude Code with my course video manager, a React Router application I use daily for creating content, organizing courses, and managing videos.
I'll show you my entire workflow: from initial feature brainstorming with the "grill me" skill, through creating PRDs and breaking them into issues, to running autonomous Ralph loops that implement features while I'm away from my desk.
This isn't a polished tutorial. It's a raw, unscripted look at how I actually work with AI agents on a real codebase with over 1,200 commits and 637 closed issues. You'll see the messy parts, the QA loops, and the decisions I make along the way.
The feature I'm building involves improving how "ghost" and "real" lessons work in my app. Ghost lessons exist only in the database, while real lessons have actual files on disk. The current workflow is clunky, requiring multiple steps to create or delete lessons. By the end of this session, I'll have streamlined the entire process and added support for "ghost courses" too.
If you want to learn these skills from the ground up, check out Claude Code for Real Engineers, my cohort that teaches you how to apply real engineering practices to AI-assisted development.
▶Read: 5 Agent Skills I Use Every Day
5 Agent Skills I Use Every Day
I've been an engineer for nearly a decade. Right now, process has never been more important.
You have access to a fleet of middling to good engineers that you can deploy at any time. But these engineers have a critical flaw: they have no memory. They don't remember things they've done before.
This means you need extremely strict and well-defined processes to get them to do useful work. As a developer, you're constantly steering your agents, keeping them on the right track.
My way of fixing this has been to create a LOT of agent skills. Each skill I've designed helps me encode my process so that AI has a strict path to follow every single time.
The result? The code quality that AI produces has shot up dramatically.
1. /grill-me: Flesh Out an Idea
    npx skills@latest add mattpocock/skills/grill-me
This is my favorite skill. It's only three sentences long, but it's incredibly impactful.
The Grill Me Skill:
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one by one. And finally, if a question can be answered by exploring the code base, explore the code base instead.
The concept of a "design tree" comes from The Design of Design by Frederick P. Brooks. It's the idea that as you're designing something, you need to walk down all of the branches of a design tree.
For example, you might be designing a search page and need to decide between an advanced search interface or a simple text box. If you choose advanced search, you need to figure out all the filters and sorting methods. You keep walking down the tree until you've fully understood your design before committing to code.
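One way to picture walking a design tree is as a depth-first traversal. The sketch below is only an illustration of that idea, not how the skill is implemented; the `Decision` and `Choice` types and the `walkDesignTree` function are hypothetical, and the tree reuses the search page example.

```typescript
// Hypothetical shape: each decision offers choices, and a choice can open further decisions.
type Choice = { label: string; followUps: Decision[] };
type Decision = { question: string; choices: Choice[] };

// Depth-first walk: ask a question, take the chosen branch, then resolve
// every follow-up decision on that branch before moving on.
function walkDesignTree(
  decision: Decision,
  choose: (decision: Decision) => string,
  asked: string[] = [],
): string[] {
  asked.push(decision.question);
  const picked = decision.choices.find((c) => c.label === choose(decision));
  for (const followUp of picked?.followUps ?? []) {
    walkDesignTree(followUp, choose, asked);
  }
  return asked;
}

// The search page example as a tree.
const searchPage: Decision = {
  question: "Advanced search or simple text box?",
  choices: [
    { label: "simple", followUps: [] },
    {
      label: "advanced",
      followUps: [
        { question: "Which filters?", choices: [] },
        { question: "Which sorting methods?", choices: [] },
      ],
    },
  ],
};

const answers: Record<string, string> = {
  "Advanced search or simple text box?": "advanced",
};
const questionsAsked = walkDesignTree(searchPage, (d) => answers[d.question] ?? "");
// questionsAsked lists the root question, then the filter and sorting questions.
```

Choosing "simple" instead would end the walk after one question, which is why the number of questions in a grilling session varies so much with the feature.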
When I invoke this skill, I want to reach a shared understanding with the LLM. Claude Code tends to spit out a plan really early when in plan mode, creating a document before we've truly understood each other. But the grill me skill forces that conversation.
In one conversation about adding a feature to my course video editor, Claude asked me 16 questions. And that was a relatively short grilling session. I've had sessions lasting nearly half an hour with 30, 40, or even 50 questions on really complex features.
The key takeaway: Skills don't have to be long to be impactful. You just need to choose the right words at the right time.
2. /write-a-prd: From Conversation to Document
    npx skills@latest add mattpocock/skills/write-a-prd
Once I've reached a shared understanding with the LLM, I invoke my next skill: /write-a-prd.
This skill asks the LLM to create a Product Requirements Document. It may skip steps if they're not necessary. For example, if you've already done a deep interview, it moves straight to step four.
The workflow includes:
- Ask the user for a detailed description
- Explore the repo to verify assertions
- Interview the user relentlessly (using the grill me skill)
- Sketch out major modules needed
- Write the PRD using a template and submit as a GitHub issue
The important part of any PRD is the user stories. These describe the desired behavior of your system in plain language, drawing from Agile methodology.
3. /prd-to-issues: Breaking Down the Destination into a Journey
    npx skills@latest add mattpocock/skills/prd-to-issues
The PRD describes your destination. But what you really need is the journey to get there.
That's what the PRD to Issues skill does. It takes a PRD and turns it into a Kanban board of independently grabbable issues.
The process:
- Locate the PRD (fetch it if needed)
- Explore the code base
- Draft vertical slices - break the PRD into tasks that flush out unknown unknowns quickly
The tracer bullet analogy applies here. Each issue is a thin vertical slice cutting through all integration layers, not a horizontal slice of one layer.
The skill also establishes blocking relationships between tasks. For instance, one issue might not be blocked by anything, so it can be picked up independently. This is useful if you have parallel agent setups where multiple agents can work simultaneously.
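Blocking relationships amount to a small dependency graph. The sketch below shows how the set of issues that can be picked up right now might be computed. The `Issue` shape, the `readyIssues` function and the sample board are hypothetical and do not reflect any particular issue tracker's API.

```typescript
// Hypothetical issue shape: each issue lists the ids of the issues that block it.
type Issue = { id: number; title: string; blockedBy: number[] };

// An issue is ready when it is not done and every one of its blockers is done.
function readyIssues(issues: Issue[], done: Set<number>): Issue[] {
  return issues.filter(
    (issue) => !done.has(issue.id) && issue.blockedBy.every((id) => done.has(id)),
  );
}

const board: Issue[] = [
  { id: 1, title: "Tracer bullet: endpoint wired to one UI location", blockedBy: [] },
  { id: 2, title: "Add action to second context menu", blockedBy: [1] },
  { id: 3, title: "Add action to editor dropdown", blockedBy: [1] },
  { id: 4, title: "Update docs", blockedBy: [] },
];

// Before any work: issues 1 and 4 can be picked up in parallel.
const atStart = readyIssues(board, new Set<number>()).map((i) => i.id); // [1, 4]

// After the tracer bullet lands: issues 2 and 3 unblock, and 4 is still open.
const afterFirst = readyIssues(board, new Set([1])).map((i) => i.id); // [2, 3, 4]
```

With several agents running, each one could take a different entry from the ready list without stepping on the others.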
4. /tdd: Increasing Code Quality
    npx skills@latest add mattpocock/skills/tdd
How do you execute on a skill? How do you make the implementation rock solid and increase code quality?
You use a TDD skill. TDD stands for Test-Driven Development, and it forces (or rather, encourages) the agent to follow a red-green-refactor loop.
This skill is substantial. It includes philosophy on refactoring, mocking, and what deep modules are. Doing really good TDD has been the most consistent way to improve agent outputs.
The workflow starts with confirming what interface changes are needed. When an AI looks at a bad codebase, it sees many tiny, undifferentiated modules. But if you restructure it into larger modules with thin interfaces on top, the AI can navigate it much more easily.
The skill then walks through these steps:
- Confirm which behaviors to test
- Design interfaces for testability
- Write one test at a time (test first)
- Implement code to make the test pass
- Look for refactoring candidates
Red-green-refactor with agents is incredible. It creates a loop that continues until complete.
5. /improve-codebase-architecture: Making Your Code Agent-Friendly
    npx skills@latest add mattpocock/skills/improve-codebase-architecture
TDD demands a lot of your codebase. In a badly structured code base, test boundaries are unclear. Where should you test? At which layer?
When your code base has clear module boundaries, testing becomes much easier.
The /improve-codebase-architecture skill explores your codebase naturally, looking for confusions:
- Where does understanding one concept require bouncing between many small files?
- Where have pure functions been extracted just for testability, but real bugs hide in how they're called?
- Where do tightly coupled modules create integration risk?
Then it presents candidates for deepening opportunities - chances to deepen shallow modules into deeper ones.
Do this once a week or after a surge of development. As you keep refining your codebase, you'll notice the quality of the agent's output goes up.
If you have a garbage code base, the AI will produce garbage within that code base.
Why This Matters: Treating AI Like Engineers
The most successful way to get code quality up from agents is to treat them like humans. Humans with weird constraints, sure - humans with no memory who are cloned and go right to work. But humans nonetheless.
Check out the skills repository to get started.
▶Read: An Introduction To Plan Mode
This article is for anyone who's used an AI coding agent and been frustrated by the results.
If you've ever felt that your AI agent produces low quality code, or doesn't understand your codebase, you need to try plan mode.
In plan mode, you iterate with the agent on what you want to build before any code gets written. You discuss requirements, validate assumptions, and refine your approach. The agent explores your codebase and builds context. By the time you start coding, the agent knows exactly what to do and has all the information it needs.
I'll be using Claude Code examples throughout this article - it's where I first discovered plan mode. But plan mode has been added to other AI coding agents too. Check your tool's documentation to see if it's supported.
What Is Plan Mode?
When you start Claude Code with --permission-mode plan, or type /plan in a session, you enter a restricted mode:
| Action | Allowed in Plan Mode |
| --- | --- |
| Read files | Yes |
| Explore codebase | Yes |
| Analyze code | Yes |
| Edit files | No |
| Run commands | No |
| Execute tests | No |
The agent explores your repo, asks clarifying questions, and outputs a step-by-step plan. Once you're happy with the plan, you exit plan mode and let the agent execute it.
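The restrictions above can be restated as data. This sketch is not Claude Code's implementation; the `AgentAction` names, the `Mode` type and the `canPerform` function are hypothetical and simply encode the six rows of permissions as a lookup.

```typescript
// Hypothetical action names, one per row of the plan mode permissions.
type AgentAction = "read" | "explore" | "analyze" | "edit" | "run-command" | "run-tests";

const allowedInPlanMode: Record<AgentAction, boolean> = {
  read: true,
  explore: true,
  analyze: true,
  edit: false,
  "run-command": false,
  "run-tests": false,
};

type Mode = "plan" | "execute";

function canPerform(mode: Mode, action: AgentAction): boolean {
  // Assumption: outside plan mode everything is allowed. A real agent may
  // still apply per-tool permission prompts, which this sketch ignores.
  return mode === "execute" ? true : allowedInPlanMode[action];
}

const canReadWhilePlanning = canPerform("plan", "read"); // true
const canEditWhilePlanning = canPerform("plan", "edit"); // false
const canEditWhileExecuting = canPerform("execute", "edit"); // true
```

Seen this way, plan mode is a read-only session: the agent can gather context but cannot change anything until you switch modes.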
I use plan mode for almost everything - even small bug fixes. The crucial thing is that it validates my assumptions and helps me iterate toward a correct solution before committing anything to code.
The Plan Mode Coding Loop
The loop is simple:
- Dictate what you want. Can be high-level ("add user auth") or specific ("refactor handleSubmit in LoginForm.tsx").
- Enter plan mode.
- Iterate until the agent produces a plan you're happy with.
- Exit plan mode and let the agent execute.
For large features, you might spend an hour in plan mode before writing any code.
You usually don't clear the context window between planning and execution. This means that when the agent starts executing, it already has all the relevant files and context from the planning phase. This is why plan mode works - it's not just about the plan document, it's about priming the agent's context.
The exception is multi-phase plans, where you break large features into separate planning/execution cycles. That's outside the scope of this article, but worth exploring once you're comfortable with the basic loop.
ALWAYS Use Dictation
If you're not using dictation with AI, you're falling behind. Dictation lets you spit out ideas faster than any other input method.
I use Wispr Flow on Windows. Superwhisper is great on Mac.
The key insight: AI doesn't need grammatically perfect input. Especially in plan mode - the agent rewrites your prompt anyway. So messy, stream-of-consciousness dictation works fine. The grammatically incorrect inputs at the top of the conversation history don't matter once the agent has refined them into a plan.
How Plan Mode Helps The Agent
An agent can only work with what's in its context window. If it hasn't read a file, it can't edit it well, and if it doesn't understand your project structure, it'll make bad assumptions about where things belong.
Plan mode builds that context window before any code gets written. The agent explores your codebase, reads the files it'll need to modify, and discovers how different parts connect. By the time you exit plan mode, the agent has already loaded everything it needs.
The plan document matters too. It gives the agent a clear set of instructions to follow during execution, rather than figuring things out as it goes. Between the loaded files and the explicit steps, you've eliminated a ton of failure modes. Code quality goes up significantly when the agent knows exactly what it's working with and what it needs to do.
How Plan Mode Helps The Dev
You don't know what you want until you see it. This is true for clients, and it's true for you too. Don't fool yourself that you have perfect clarity before you start. You don't.
Plan mode lets you prototype before touching the code. You iterate with the agent on what you're building - discussing tradeoffs, validating assumptions, catching edge cases - all before a single line of code exists. This is the same process you'd go through with a colleague, except faster.
Plan mode is a forcing function for concrete requirements. You wouldn't chuck vague requirements at a human colleague and expect good results. Without plan mode, that's exactly what you're doing to an AI. Anyone who's maintained an open source repo and found themselves peering at a grubby screenshot of an error message knows what I'm talking about.
Plan mode makes you articulate what you want clearly enough that someone else can implement it. And often, explaining what you want reveals that you wanted something slightly different.
How To Prompt Better In Plan Mode
You can customize how Claude Code behaves via an AGENTS.md file in your project root. These tips help make plan mode more effective.
Make Your Plans Concise
Plans shouldn't be novels. Tell the agent to sacrifice grammar for brevity. This makes plans scannable and keeps the focus on what matters.
    When writing plans, be extremely concise. Sacrifice grammar for the sake of concision.
Make The Planner Ask Clarifying Questions
The agent will ask some questions by default, but you can make it more thorough. This catches ambiguities before they become bugs.
    At the end of each plan, list unresolved questions. Ask about edge cases, error handling, and unclear requirements before proceeding.
Put The Summary At The End
In a terminal, you read the end of the output first. Tell the agent to put the actionable summary last so you don't have to scroll up.
    End every plan with a numbered list of concrete steps. This should be the last thing visible in the terminal.
Isn't This Slower Than Doing It Myself?
Sometimes, yes. If you know a repo inside-out and can see the solution instantly, you'll probably be faster without the AI.
But plan mode isn't just about speed on familiar ground. It extends your reach. When you're in an unfamiliar codebase, or you need to think through requirements before diving in, plan mode is essential. It lets you contribute in places you couldn't before.
I use plan mode for everything, even when I'm confident I could code it faster myself. Three reasons:
These tools keep improving. Speed is one of the main areas they're getting better at. Every time I use plan mode, I build intuition about what AI can handle and how to communicate requirements to it. That skill compounds.
Less mental fatigue. Focusing on high-level decisions uses fewer brain cycles than fixing syntax errors and chasing type mismatches. I get more done with less exhaustion.
AI is the best rubber duck I've ever had. It understands codebases instantly and is available 24/7. Plan mode forces me to articulate what I want, which makes me better at building software. Both my code quality and output have gone up since I started using it.
The question isn't "is this slower right now?" It's "what am I optimizing for?" If you're optimizing for learning, reduced fatigue, and long-term effectiveness, plan mode wins even when you could technically code faster alone.
Conclusion
Plan mode isn't optional. It's a foundational tool for AI-assisted coding.
- Use plan mode for everything, even small bug fixes
- Use dictation - AI doesn't need perfect grammar
- Configure AGENTS.md to make plans concise with questions at the end
- Planning helps the agent by building its context window before it codes
- Planning helps you by forcing you to articulate your requirements
- It might not always be faster right now, but it's a long-term investment in your skills
▶Read: Creating The Perfect Claude Code Status Line
Creating The Perfect Claude Code Status Line
I think I've created the perfect Claude Code status line:
    matt/course-video-manager | main | S: 0 | U: 1 | A: 0 | 17.3%
First, it has the repo that I'm working on relative to my repositories directory (matt/course-video-manager). If you're like me, you probably have a single directory that holds all of your repositories.
So I have my status line set up to show just the path relative to my ~/repos/ directory.
Then I have some Git information:
    main | S: 0 | U: 1 | A: 0
I have the branch that I'm on and the staged, unstaged, and added changes. This is optional, but I like being able to see what's happening in Git at all times when I'm working only in the terminal without an IDE.
And then finally, and this is the most important one, the percentage of the context window that I've used up in this session:
    17.3%
This is a constant source of paranoia for me. I always want to be using the minimum amount of context I can, because the more you put in the context window, the worse the AI is going to be.
Having this constantly at my fingertips, and being able to think "Ooh, about 60 is probably where I want to stop", is just amazing.
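The arithmetic behind that number is simple enough to sketch. The token counts and window size below are hypothetical figures chosen to reproduce 17.3%, and `contextPercent` and `shouldWrapUp` are made-up helper names; the status line itself gets its percentage from a separate tool.

```typescript
// Hypothetical helper: share of the context window used, rounded to one decimal place.
function contextPercent(usedTokens: number, windowTokens: number): number {
  if (windowTokens <= 0) throw new Error("windowTokens must be positive");
  return Math.round((usedTokens / windowTokens) * 1000) / 10;
}

// A personal rule of thumb like "stop at about 60" becomes a threshold check.
function shouldWrapUp(percent: number, threshold = 60): boolean {
  return percent >= threshold;
}

// Hypothetical numbers: 34,600 tokens used out of a 200,000 token window.
const percentUsed = contextPercent(34600, 200000); // 17.3
const statusLabel = `${percentUsed}%`; // "17.3%"
const timeToStop = shouldWrapUp(percentUsed); // false
```

The threshold is a personal preference rather than a hard limit, so it is a parameter with a default rather than a constant.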
How It's Set Up
The way this is set up is by using this settings.json file inside the .claude directory:
    { "hooks": {}
You can see it runs bash ~/.claude/statusline-wrapper.sh. This status line wrapper basically takes one status line command script.
Then we also take the output from a CLI called ccStatusLine. This basically lets me get the context percentage from it:
```
{
  "version": 3,
  "lines": [
    [
      {
        "id": "1",
        "type": "context-percentage",
        "color": "yellow",
        "bold": true,
        "rawValue": true
      }
```
Then we combine the outputs of these files.
Try It Yourself
If you want to configure this yourself, here's how to set it up.
Step 1: Install ccstatusline
First, install the ccstatusline package globally:
```
npm install -g ccstatusline
```
Step 2: Configure ccstatusline
Create the config file at `~/.config/ccstatusline/settings.json`:
```
{
  "version": 3,
  "lines": [
    [
      {
        "id": "1",
        "type": "context-percentage",
        "color": "yellow",
        "bold": true,
        "rawValue": true
      }
    ],
    [],
    []
  ],
  "flexMode": "full-minus-40",
  "compactThreshold": 60,
  "colorLevel": 2,
  "inheritSeparatorColors": false,
  "globalBold": false,
  "powerline": {
    "enabled": false,
    "separators": [""],
    "separatorInvertBackground": [
      false
    ],
    "startCaps": [],
    "endCaps": [],
    "autoAlign": false
  }
}
```
The key part here is `"rawValue": true` which shows just the percentage without the "Ctx:" label, and `"bold": true` to match the color of the git numbers.
Step 3: Create the Git Status Script
Create `~/.claude/statusline-command.sh`:
```
#!/bin/bash

# Read JSON input
input=$(cat)

# Extract values from JSON (without jq)
cwd=$(echo "$input" | sed -n 's/.*"current_dir":"\([^"]*\)".*/\1/p')

# Git information (skip optional locks for performance)
if git -C "$cwd" rev-parse --git-dir > /dev/null 2>&1; then
  # Get repo name relative to ~/repos/
  repo_name=$(echo "$cwd" | sed "s|^$HOME/repos/||")

  # Get branch
  branch=$(git -C "$cwd" --no-optional-locks rev-parse --abbrev-ref HEAD 2>/dev/null)

  # Count staged files
  staged=$(git -C "$cwd" --no-optional-locks diff --cached --name-only 2>/dev/null | wc -l)

  # Count unstaged files (modified + deleted, not untracked)
  unstaged=$(git -C "$cwd" --no-optional-locks diff --name-only 2>/dev/null | wc -l)

  # Count untracked files
  untracked=$(git -C "$cwd" --no-optional-locks ls-files --others --exclude-standard 2>/dev/null | wc -l)

  printf '\033[01;36m%s\033[00m | \033[01;32m%s\033[00m | S: \033[01;33m%s\033[00m | U: \033[01;33m%s\033[00m | A: \033[01;33m%s\033[00m' \
    "$repo_name" "$branch" "$staged" "$unstaged" "$untracked"
else
  # Not a git repo
  printf '\033[01;36m%s\033[00m' "$cwd"
fi
```
**Note:** Change `~/repos/` on line 12 if your repositories live in a different directory.
Step 4: Create the Wrapper Script
Create `~/.claude/statusline-wrapper.sh`:
```
#!/bin/bash

# Read JSON input once
input=$(cat)

# Get git info from existing script
git_info=$(echo "$input" | bash ~/.claude/statusline-command.sh)

# Get context percentage from ccstatusline
context_pct=$(echo "$input" | npx ccstatusline)

# Combine outputs
printf '%s | %s' "$git_info" "$context_pct"
```
Step 5: Make Scripts Executable
```
chmod +x ~/.claude/statusline-command.sh
chmod +x ~/.claude/statusline-wrapper.sh
```
Step 6: Update Claude Settings
Add this to your `~/.claude/settings.json`:
```
{
  "statusLine": {
    "type": "command",
    "command": "bash ~/.claude/statusline-wrapper.sh"
  }
}
```
Step 7: Restart Claude Code
Restart Claude Code to see your new status line in action!
▶Read: How To Use Claude Code Hooks To Enforce The Right CLI
How To Use Claude Code Hooks To Enforce The Right CLI
One of the most common questions I get about Claude Code is: how do I force it to use the right CLI command?
How do I get it to use `pnpm` instead of `npm`? How do I get it to call my wrapper script instead of using `npx` directly? How do I block it from running certain commands entirely?
The Problem with CLAUDE.md
The obvious answer is to put an instruction in your `CLAUDE.md` file:

```md
Use pnpm instead of npm for all package management tasks.
```

This works most of the time. But there are two problems.
It wastes instruction budget. This instruction is only relevant when Claude runs a package manager command. Putting it in `CLAUDE.md` makes it global context for every task in the repo. LLMs have a limited instruction budget: around 500 instructions before they start getting confused. You want that budget spent on the hard stuff: planning, implementation, architecture. Not reminders about which package manager to use.
It's not deterministic. Adding "don't use `git push`" to your `CLAUDE.md` reduces the chance of a force push. It doesn't prevent it. You're burning instruction budget on something that still isn't guaranteed.
Claude Code Hooks: A Deterministic Solution
Hooks let you run deterministic code at specific points during Claude Code's execution cycle. They're configured in your `.claude/settings.json` file.
The hook we care about is `PreToolUse`. It fires before a tool call executes and can block it. If the hook exits with code `2`, the action is blocked and the error message is fed back to Claude so it can adjust.
Here's the structure of a `PreToolUse` hook that blocks a Bash command:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/enforce-pnpm.sh"
          }
        ]
      }
    ]
  }
}
```

The hook script receives JSON on stdin with the tool name and arguments. It checks the command, and either exits `0` to allow it or exits `2` to block it:

```sh
#!/bin/bash
# Read the tool call payload from stdin
INPUT=$(cat)
# Pull out the shell command Claude wants to run
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command')

# Block commands that start with "npm "
if echo "$COMMAND" | grep -qE "^npm "; then
  echo "Blocked: use pnpm instead of npm" >&2
  exit 2
fi

exit 0
```

Claude sees the "use pnpm instead of npm" message and retries with the correct command. No instruction budget wasted, and the wrong command is impossible to run.
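The `^npm ` pattern in the script above only matches when `npm` is the first word of the command, so something like `cd app && npm install` would slip through. Below is a minimal sketch of a looser check, assuming you also want to catch `npm` after `&&`, `;` or `|`. The function name `is_npm_command` and the regex are illustrative assumptions, not part of the original hook.

```shell
#!/bin/bash
# Sketch: decide whether a shell command invokes npm as a command word.
# Returns 2 (the hook's "block" exit code) when it does, 0 otherwise.
# is_npm_command is a hypothetical helper, not part of Claude Code.
is_npm_command() {
  local cmd="$1"
  # Match npm at the very start, or right after a separator (&&, ||, ;, |),
  # followed by whitespace or end of line. "pnpm" and "npx" do not match.
  if printf '%s\n' "$cmd" | grep -qE '(^|&&|;|\|)[[:space:]]*npm([[:space:]]|$)'; then
    return 2
  fi
  return 0
}
```

Inside a real hook you would pass the extracted `$COMMAND` to this function and exit with its return code. This is still pattern matching on text, so treat it as a best-effort filter rather than a shell parser.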
The Prompt
You don't need to write these hooks by hand. Claude Code already knows how to create them. Here's a prompt you can paste directly into Claude Code that converts your `CLAUDE.md` instructions into deterministic hooks:

````txt
Take the instructions in your @CLAUDE.md file and turn them into deterministic Claude Code hooks in this project directory.

Not all the instructions will be deterministic: only do the ones you can, such as instructions to use one CLI command over another, or disallowing certain CLI commands.

Hooks should be added to `.claude/settings.json` under the `hooks` key, using the `PreToolUse` event with a `Bash` matcher.

Use separate bash scripts in `.claude/hooks/` for running the hooks:

```sh
#!/bin/bash
INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command')

if echo "$COMMAND" | grep -q "drop table"; then
  echo "Blocked: dropping tables is not allowed" >&2
  exit 2
fi

exit 0
```

First, confirm with the user which hooks will be created.
Second, implement the hooks.
Third, provide the user with instructions to test the newly created hooks (by restarting Claude Code).
````

The prompt tells Claude to read your existing `CLAUDE.md`, identify which instructions can be enforced deterministically, and convert them into `PreToolUse` hook scripts. It asks for confirmation before creating anything, then walks you through testing.
▶Read: This Hook Stops Claude Code Running Dangerous Git Commands
I've put together a skill to allow you to prevent Claude Code from ever running dangerous git commands. Add it via `npx skills add`:

```sh
npx skills add mattpocock/skills/git-guardrails-claude-code
```

Claude will guide you through the setup - choose whether to install it for your current project or globally across all projects, then customize which git commands you want to block.
Why This Skill Exists
The Docker Sandbox Story
I've been experimenting with Ralph-style workflows recently, and the most powerful way I've found to get Claude to run autonomously is via Docker Sandbox.
It lets Claude Code run in YOLO without asking permission for every command. And it protects your system by running it inside a microVM. You can let your agent work unattended while you check in every few minutes.
The tradeoff? Docker Sandbox doesn't restrict what commands Claude can run, it just isolates where they run.
The Gap Docker Sandbox Leaves
Here's the problem: your git history lives inside that sandboxed folder.
Docker Sandbox prevents Claude from reaching outside the project directory, but it can't stop Claude from running destructive git commands within that directory. A single `git reset --hard` or `git push --force` can wipe out weeks of work.
This skill fills that gap with a hard guardrail - preventing Claude Code from running dangerous commands.
What Gets Blocked
The skill blocks these dangerous patterns by default:
- `git push` (all variants, including `--force`)
- `git reset --hard`
- `git clean -f` / `git clean -fd`
- `git branch -D`
- `git checkout .` / `git restore .`
These aren't "never allow" rules, they're "safe by default" rules.
You can customize the blocked patterns during installation, or edit them anytime. If your workflow legitimately needs Claude to push to a remote PR, you can allow it.
How It Works: The Hook System
The skill uses a `PreToolUse` hook that intercepts bash commands before Claude executes them.
Here's the filtering script:

```sh
#!/bin/bash
INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command')

DANGEROUS_PATTERNS=(
  "git push"
  "git reset --hard"
  "git clean -fd"
  "git clean -f"
  "git branch -D"
  "git checkout \."
  "git restore \."
  "push --force"
  "reset --hard"
)

for pattern in "${DANGEROUS_PATTERNS[@]}"; do
  if echo "$COMMAND" | grep -qE "$pattern"; then
    echo "BLOCKED: '$COMMAND' matches dangerous pattern '$pattern'. The user has prevented you from doing this." >&2
    exit 2
  fi
done

exit 0
```

The script reads the incoming command, checks it against the danger list, and blocks execution if it matches.
What Claude Sees
When Claude tries to run a blocked command, it sees a clear message:

```txt
BLOCKED: 'git push origin main' matches dangerous pattern 'git push'. The user has prevented you from doing this.
```

Claude understands the constraint and adapts. It's a hard guardrail, not a suggestion buried in prose.
Verify It's Working
After installation, test the guardrail:
- Clear your conversation context
- Ask Claude to run `git push`
- Watch it fail and see the blocked message
This confirms the hook is active and Claude can't accidentally bypass it.
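You can also exercise the blocking logic outside Claude Code by piping a hand-written payload into it. The sketch below is a simplified stand-in for the skill's script, not the skill itself: `guard` is a hypothetical function, the pattern list is shortened, and the command is extracted with `sed` instead of `jq`, which assumes compact JSON with no escaped quotes inside the command.

```shell
#!/bin/bash
# Sketch: a self-contained guard that mimics a PreToolUse hook.
# Reads a JSON payload on stdin; returns 2 to block, 0 to allow.
guard() {
  local input cmd pattern
  input=$(cat)
  # Naive extraction without jq (assumption: compact JSON, no escaped quotes)
  cmd=$(printf '%s' "$input" | sed -n 's/.*"command":"\([^"]*\)".*/\1/p')
  for pattern in "git push" "git reset --hard" "git clean -f" "git branch -D"; do
    if printf '%s\n' "$cmd" | grep -qE -- "$pattern"; then
      echo "BLOCKED: '$cmd' matches dangerous pattern '$pattern'." >&2
      return 2
    fi
  done
  return 0
}

# Example (the payload shape is a simplified assumption):
#   printf '%s' '{"tool_name":"Bash","tool_input":{"command":"git push origin main"}}' | guard
#   echo "exit code: $?"
```

To check the installed hook itself, pipe the same kind of payload into the script the skill created and inspect the exit code.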
Conclusion
Enjoy! These kinds of skills are great fun to put together, and if you're looking for more of them, join my newsletter.
Join over 54,000 Developers Becoming AI Heroes
Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.
Email*
I respect your privacy. Unsubscribe at any time.
▶Read: Never Run Claude /init
Never Run Claude /init
If you're using Claude Code or any coding agent, there's an `init` command that promises to create a `CLAUDE.md` or `agents.md` file with documentation about your codebase.
Never run it. If you find one that was auto-generated, delete it.
The file it creates will burn tokens, distract the agent, and go out of date faster than a pear on a hot day. Research confirms this: unnecessary context files make tasks harder, not easier.
How Agents Use Context
An agent's context window gets divided into phases:

| Phase | Purpose | Flexibility |
| --- | --- | --- |
| System Prompt | LLM instructions, MCP servers, system tools, `CLAUDE.md` content | Not flexible, hardwired at start |
| Exploration | Understanding what's in the codebase | Very flexible |
| Implementation | Writing and modifying files | Very flexible |
| Testing | Running tests, debugging, feedback loops | Very flexible, can balloon if issues occur |

Exploration, implementation, and testing are flexible. A simple task needs little exploration. A bug-free implementation needs little testing.
But the system prompt is hardwired the moment the agent starts. Everything in your `CLAUDE.md` inflates it, leaving less room for the phases that actually do the work.
Shrinking the system prompt gives you more space for actual work and reduces your costs.
The Instruction Budget
LLMs have a context window, but they also have an instruction budget: a limit on how many instructions they can follow at once.
Each sentence in your `CLAUDE.md` is an instruction. LLMs can realistically handle around 300 to 400 instructions at a time. Bigger models push this to maybe 500.
If you're stuffing dozens of irrelevant instructions into `CLAUDE.md`, you're burning that budget before the agent even starts working on your task.
The Globality Problem
A common piece of advice: if the agent does something you don't like, put a rule in `CLAUDE.md`.
Maybe it uses npm instead of pnpm. Maybe it reaches for a React pattern you hate. So you add a line.
The problem is that `CLAUDE.md` is global. Every instruction applies to every session: frontend, backend, docs, database, all of it.
That React rule is useful for frontend sessions. But the next session might be purely backend work where it's completely irrelevant. The session after that might be documentation.
Every line you add has a cost that compounds across every session, whether it's relevant or not.
What Init Actually Generates
Init commands tend to generate the same categories of content. All of it is problematic.
Command listings. Init loves dumping every script from `package.json` into the file. These are trivially discoverable. The agent can just read `package.json`. You're paying tokens to duplicate the source of truth.
Architecture descriptions. Framework names, rendering modes, compiler settings. The agent can discover all of this from config files and imports. A `react-router.config` file already tells it you're using React Router. An `effect` import tells it you're using Effect.
File and service references. This is the worst category. Init documents specific files, services, and their relationships. The moment you rename a file, move a service, or change an implementation, this documentation is wrong. It now actively misleads the agent.
Implementation patterns. How specific features work, what patterns to use where. Not only is this discoverable from the code, it's only relevant to a fraction of sessions. Most tasks won't touch most patterns.
The throughline: everything init generates is either trivially discoverable from source or will go stale. Your file system is the documentation. If you structure it well, the agent gets an accurate picture of your architecture from the source of truth, not from a rotting summary.
The Solution
Trust the Explore Step
Every modern coding agent has an explore phase. Before making changes, it reads files, searches the codebase, and builds up context just-in-time for the task at hand.
This is strictly better than a static `CLAUDE.md` because it only loads what's relevant, and it always reflects the current state of the code.
Use Skills for Steering
There is useful steering you want to give the agent. Maybe you want it to prefer reducers for complex UI state. Maybe you want it to follow a specific testing pattern.
This belongs in skills: discoverable instructions that the agent can pull in when relevant, without burning the instruction budget on every session.
Keep CLAUDE.md Almost Empty
If basic setup stuff doesn't belong in `CLAUDE.md`, and steering belongs in skills, what's left?
Almost nothing. My entire `CLAUDE.md` is:

```md
you are on WSL on Windows
```

Six words. It's there because WSL has unintuitive path resolution issues that the agent can't discover on its own. That's the bar: only include what is both undiscoverable and globally relevant.
The Bottom Line
Never run init. The file it generates dumps irrelevant information into your global context, inflates your system prompt, wastes your instruction budget, and rots the moment your code changes.
You'll either burn tokens keeping it up to date, or you'll delete it. Just skip to deleting it.
Trust the explore step. Use skills for steering. Keep `CLAUDE.md` nearly empty.
▶Read: A Complete Guide To AGENTS.md
Have you ever felt concerned about the size of your `AGENTS.md` file?
Maybe you should be. A bad `AGENTS.md` file can confuse your agent, become a maintenance nightmare, and cost you tokens on every request.
So you'd better know how to fix it.
What is AGENTS.md?
An `AGENTS.md` file is a markdown file you check into Git that customizes how AI coding agents behave in your repository. It sits at the top of the conversation history, right below the system prompt.
Think of it as a configuration layer between the agent's base instructions and your actual codebase. The file can contain two types of guidance:
- Personal scope: Your commit style preferences, coding patterns you prefer
- Project scope: What the project does, which package manager you use, your architecture decisions
The `AGENTS.md` file is an open standard supported by many - though not all - tools.
CLAUDE.md
Notably, Claude Code doesn't use `AGENTS.md` - it uses `CLAUDE.md` instead. You can symlink between them to keep all your tools working the same way:

```sh
# Create CLAUDE.md as a symlink that points to AGENTS.md
ln -s AGENTS.md CLAUDE.md
```

Why Massive `AGENTS.md` Files are a Problem
There's a natural feedback loop that causes `AGENTS.md` files to grow dangerously large:
- The agent does something you don't like
- You add a rule to prevent it
- Repeat hundreds of times over months
- File becomes a "ball of mud"
Different developers add conflicting opinions. Nobody does a full style pass. The result? An unmaintainable mess that actually hurts agent performance.
Another culprit: auto-generated `AGENTS.md` files. Never use initialization scripts to auto-generate your `AGENTS.md`. They flood the file with things that are "useful for most scenarios" but would be better progressively disclosed. Generated files prioritize comprehensiveness over restraint.
The Instruction Budget
Kyle from Humanlayer's article mentions the concept of an "instruction budget":
Frontier thinking LLMs can follow ~ 150-200 instructions with reasonable consistency. Smaller models can attend to fewer instructions than larger models, and non-thinking models can attend to fewer instructions than thinking models.
Every token in your `AGENTS.md` file gets loaded on every single request, regardless of whether it's relevant. This creates a hard budget problem:

| Scenario | Impact |
| --- | --- |
| Small, focused `AGENTS.md` | More tokens available for task-specific instructions |
| Large, bloated `AGENTS.md` | Fewer tokens for the actual work; agent gets confused |
| Irrelevant instructions | Token waste + agent distraction = worse performance |

Taken together, this means that the ideal `AGENTS.md` file should be as small as possible.
Stale Documentation Poisons Context
Another issue for large `AGENTS.md` files is staleness.
Documentation goes out of date quickly. For human developers, stale docs are annoying, but the human usually has enough built-in memory to be skeptical about bad docs. For AI agents that read documentation on every request, stale information actively poisons the context.
This is especially dangerous when you document file system structure. File paths change constantly. If your `AGENTS.md` says "authentication logic lives in `src/auth/handlers.ts`" and that file gets renamed or moved, the agent will confidently look in the wrong place.
Instead of documenting structure, describe capabilities. Give hints about where things might be and the overall shape of the project. Let the agent generate its own just-in-time documentation during planning.
Domain concepts (like "organization" vs "group" vs "workspace") are more stable than file paths, so they're safer to document. But even these can drift in fast-moving AI-assisted codebases. Keep a light touch.
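If your `AGENTS.md` does mention file paths, a small check can flag the ones that no longer exist. This is a minimal sketch under stated assumptions: `find_stale_paths` is a hypothetical helper, and the regex only recognizes simple relative paths with at least one slash and a file extension, so it will miss some paths and may pick up path-like fragments of URLs.

```shell
#!/bin/bash
# Sketch: print path-like tokens in a markdown file that do not exist on disk.
# Usage: find_stale_paths AGENTS.md [repo_root]
# find_stale_paths is a hypothetical helper; the regex is a rough heuristic.
find_stale_paths() {
  local file="$1" root="${2:-.}" path
  # Extract tokens such as src/auth/handlers.ts or docs/TYPESCRIPT.md
  grep -oE '[A-Za-z0-9_.-]+(/[A-Za-z0-9_.-]+)+\.[A-Za-z0-9]+' "$file" | sort -u |
    while IFS= read -r path; do
      # Report only the paths that are missing under the given root
      [ -e "$root/$path" ] || printf '%s\n' "$path"
    done
}
```

Running it from the repository root and getting no output means every path it recognized still resolves. Any output is a candidate for deletion or for a looser, capability-style description.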
Cutting Down Large `AGENTS.md` Files
Be ruthless about what goes here. Consider this the absolute minimum:
- One-sentence project description (acts like a role-based prompt)
- Package manager (if not npm; or use `corepack` for warnings)
- Build/typecheck commands (if non-standard)
That's honestly it. Everything else should go elsewhere.
The One-Liner Project Description
This single sentence gives the agent context about why they're working in this repository. It anchors every decision they make.
Example:

```md
This is a React component library for accessible data visualization.
```

That's the foundation. The agent now understands its scope.
Package Manager Specification
If you're in a JavaScript project and using anything other than npm, tell the agent explicitly:

```md
This project uses pnpm workspaces.
```

Without this, the agent might default to `npm` and generate incorrect commands.
Corepack is also great: you could use `corepack` to let the system handle warnings automatically, saving you precious instruction budget.
Use Progressive Disclosure
Instead of cramming everything into `AGENTS.md`, use progressive disclosure: give the agent only what it needs right now, and point it to other resources when needed.
Agents are fast at navigating documentation hierarchies. They understand context well enough to find what they need.
Move Language-Specific Rules to Separate Files
If your `AGENTS.md` currently says:

```md
Always use const instead of let.
Never use var.
Use interface instead of type when possible.
Use strict null checks.
...
```

Move that to a separate file instead. In your root `AGENTS.md`:

```md
For TypeScript conventions, see docs/TYPESCRIPT.md
```

Notice the light touch, no "always," no all-caps forcing. Just a conversational reference.
The benefits:
- TypeScript rules only load when the agent writes TypeScript
- Other tasks (CSS debugging, dependency management) don't waste tokens
- File stays focused and portable across model changes
Nest Progressive Disclosure
You can go even deeper. Your `docs/TYPESCRIPT.md` can reference `docs/TESTING.md`. Create a discoverable resource tree:

```txt
docs/
├── TYPESCRIPT.md
│   └── references TESTING.md
├── TESTING.md
│   └── references specific test runners
└── BUILD.md
    └── references esbuild configuration
```

You can even link to external resources: Prisma docs, Next.js docs, etc. The agent will navigate these hierarchies efficiently.
Use Agent Skills
Many tools support "agent skills" - commands or workflows the agent can invoke to learn how to do something specific. These are another form of progressive disclosure: the agent pulls in knowledge only when needed.
We'll cover agent skills in-depth in a separate article.
`AGENTS.md` in Monorepos
You're not limited to a single `AGENTS.md` at the root. You can place `AGENTS.md` files in subdirectories, and they merge with the root level.
This is powerful for monorepos:
What Goes Where

| Level | Content |
| --- | --- |
| Root | Monorepo purpose, how to navigate packages, shared tools (pnpm workspaces) |
| Package | Package purpose, specific tech stack, package-specific conventions |

Root `AGENTS.md`:

```md
This is a monorepo containing web services and CLI tools.
Use pnpm workspaces to manage dependencies.
See each package's AGENTS.md for specific guidelines.
```

Package-level `AGENTS.md` (in `packages/api/AGENTS.md`):

```md
This package is a Node.js GraphQL API using Prisma.
Follow docs/API_CONVENTIONS.md for API design patterns.
```

Don't overload any level. The agent sees all merged `AGENTS.md` files in its context. Keep each level focused on what's relevant at that scope.
Fix A Broken `AGENTS.md` With This Prompt
If you're starting to get nervous about the `AGENTS.md` file in your repo, and you want to refactor it to use progressive disclosure, try copy-pasting this prompt into your coding agent:

```txt
I want you to refactor my AGENTS.md file to follow progressive disclosure principles.

Follow these steps:

1. **Find contradictions**: Identify any instructions that conflict with each other. For each contradiction, ask me which version I want to keep.

2. **Identify the essentials**: Extract only what belongs in the root AGENTS.md:
   - One-sentence project description
   - Package manager (if not npm)
   - Non-standard build/typecheck commands
   - Anything truly relevant to every single task

3. **Group the rest**: Organize remaining instructions into logical categories (e.g., TypeScript conventions, testing patterns, API design, Git workflow). For each group, create a separate markdown file.

4. **Create the file structure**: Output:
   - A minimal root AGENTS.md with markdown links to the separate files
   - Each separate file with its relevant instructions
   - A suggested docs/ folder structure

5. **Flag for deletion**: Identify any instructions that are:
   - Redundant (the agent already knows this)
   - Too vague to be actionable
   - Overly obvious (like "write clean code")
```

Don't Build A Ball Of Mud
When you're about to add something to your `AGENTS.md`, ask yourself where it belongs:

| Location | When to use |
| --- | --- |
| Root `AGENTS.md` | Relevant to every single task in the repo |
| Separate file | Relevant to one domain (TypeScript, testing, etc.) |
| Nested documentation tree | Can be organized hierarchically |

The ideal `AGENTS.md` is small, focused, and points elsewhere. It gives the agent just enough context to start working, with breadcrumbs to more detailed guidance.
Everything else lives in progressive disclosure: separate files, nested `AGENTS.md` files, or skills.
This keeps your instruction budget efficient, your agent focused, and your setup future-proof as tools and best practices evolve.
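One way to keep an eye on the size of a root `AGENTS.md` is a rough line count. The sketch below treats every non-blank, non-heading line as one instruction, which is only a crude proxy for the instruction budget discussed above. The helper names and the default limit of 20 lines are arbitrary assumptions for illustration, not figures from the article.

```shell
#!/bin/bash
# Sketch: count lines that are neither blank nor markdown headings.
# count_instruction_lines and check_agents_file are hypothetical helpers.
count_instruction_lines() {
  # grep -c exits non-zero when the count is 0, so the status is ignored
  grep -cvE '^[[:space:]]*(#.*)?$' "$1" || true
}

# Warn (and return 1) when the file has more instruction lines than the limit.
check_agents_file() {
  local file="$1" limit="${2:-20}" n
  n=$(count_instruction_lines "$file")
  if [ "$n" -gt "$limit" ]; then
    echo "$file: $n instruction lines (over $limit), consider progressive disclosure"
    return 1
  fi
  echo "$file: $n instruction lines"
}
```

A check like this could run in CI or a pre-commit hook, so the file cannot quietly grow back into a ball of mud.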
▶Read: My AGENTS.md file for building plans you actually read
Most developers are skeptical about AI code generation at first. It seems impossible that an AI could understand your codebase the way you do, or match the instincts you've built up over years of experience.
But there's a technique that changes everything: the planning loop. Instead of asking AI to write code directly, you work through a structured cycle that dramatically improves the quality of what you get.
This approach transforms AI from an unreliable code generator into an indispensable coding partner.
The Plan Loop: A Four-Step Process
Every piece of code now goes through the same cycle.
Plan with the AI first. Think through the approach together before writing any code. Discuss the strategy and get alignment on what you're building.
Execute by asking the AI to write the code that matches the plan. You're not asking it to figure out what to build—you've already done that together.
Test the code together. Run unit tests, check type safety, or perform manual QA. Validate that the implementation matches what you planned.
Commit the code and start the cycle again for the next piece.
Why This Matters
This loop is completely indispensable for getting decent outputs from an AI.
If you drop the planning step altogether, you're really hampering yourself. You're asking the AI to guess what you want, and you'll end up fighting with hallucinations and misunderstandings.
Planning forces clarity. It makes the AI's job easier and your code better.
Rules for Creating Great Plans
Here are the key rules from my `CLAUDE.md` file that make plan mode effective:

```md
## Plan Mode

- Make the plan extremely concise. Sacrifice grammar for the sake of concision.
- At the end of each plan, give me a list of unresolved questions to answer, if any.
```

These simple guidelines transform verbose plans into scannable, actionable documents that keep both you and the AI aligned.
Copy them into your `CLAUDE.md` or `AGENTS.md` file, and enjoy simpler, more readable plans.
Or, run this script to append them to your `~/.claude/CLAUDE.md` file:

```sh
mkdir -p ~/.claude && cat >> ~/.claude/CLAUDE.md << 'EOF'

## Plan Mode

- Make the plan extremely concise. Sacrifice grammar for the sake of concision.
- At the end of each plan, give me a list of unresolved questions to answer, if any.
EOF
```
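The append script for the plan mode rules adds a second copy of the section if you run it twice. Here is a minimal sketch of an idempotent variant, assuming the `## Plan Mode` heading is a reliable marker that the rules are already present. The function name `append_plan_rules` is hypothetical.

```shell
#!/bin/bash
# Sketch: append the Plan Mode rules to a file only if they are not there yet.
# Usage: append_plan_rules ~/.claude/CLAUDE.md
# append_plan_rules is a hypothetical helper name.
append_plan_rules() {
  local file="$1"
  mkdir -p "$(dirname "$file")"
  # Skip when a line that is exactly "## Plan Mode" already exists
  if [ -f "$file" ] && grep -qxF '## Plan Mode' "$file"; then
    return 0
  fi
  cat >> "$file" << 'EOF'

## Plan Mode

- Make the plan extremely concise. Sacrifice grammar for the sake of concision.
- At the end of each plan, give me a list of unresolved questions to answer, if any.
EOF
}
```

Calling `append_plan_rules ~/.claude/CLAUDE.md` repeatedly leaves a single copy of the section in the file.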
▶Read: Connect Claude Code To A GitHub MCP Server
Model Context Protocol Tutorial
Connect Claude Code To A GitHub MCP Server
Before we get into the abstract details about what MCP is, let's take a look at it working.
I'm a developer and my audience is mostly developers, so let's look at a developer use case.
Let's connect up a GitHub MCP server to Claude Code. Claude Code is a really powerful terminal app for using an LLM to modify a codebase.
I'll walk through the whole demo in the video above, but here are the steps I followed:
1. Install Claude Code.
2. Get a GitHub API Key and copy it into my clipboard.
3. Run this command: `claude mcp add`
4. Add `github-server` as the name of the MCP server.
5. Scope it to the project.
6. Add `npx` as the command.
7. Add `-y @modelcontextprotocol/server-github@latest` as the args.
8. Add `GITHUB_PERSONAL_ACCESS_TOKEN=your-token-here` for the env.

Your MCP server should now be ready to use.
▶Read: Getting Started With Ralph
Ralph is a technique for running AI coding agents in a loop. You run the same prompt repeatedly. The AI picks its own tasks from a PRD. It commits after each feature. You come back later to working code.
This guide walks you through building your first Ralph loop. We'll use Claude Code and Docker Desktop.
I'm assuming Linux, but you can point an AI at this article and have it translate for your OS or AI Coding CLI.
For more tips on getting the most out of Ralph, check out my 11 tips for AI coding with Ralph.
1. Install Claude Code
Claude Code is Anthropic's CLI for agentic coding. Install it with the native binary:

```sh
curl -fsSL https://claude.ai/install.sh | bash
```

If you get "command not found: claude" after installing, add the install location to your PATH:

```sh
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

Alternatively, install via npm:

```sh
npm i -g @anthropic-ai/claude-code
```

Run `claude` to authenticate with your Anthropic account.
2. Install Docker Desktop
Docker Desktop lets you run Claude Code in an isolated sandbox. The AI can execute commands, install packages, and modify files without touching your local machine.
Install Docker Desktop 4.50+, then run:

```sh
docker sandbox run claude
```

On first run, you'll authenticate with Anthropic. Your credentials are stored in a Docker volume.
Key benefits of sandboxes:
- Your working directory mounts at the same path inside the container
- Git config is auto-injected for proper commit attribution
- One sandbox per workspace - state persists between runs
See the Docker Sandboxes docs for more.
3. Create your Plan File
Ralph needs a PRD (Product Requirements Document) to pick tasks from. You could write one manually, but it's faster to use Claude's plan mode.
Run Claude:

```sh
claude
```

And press `shift-tab` to enter plan mode. You'll be able to iterate on a plan until you're happy with it.
When you're happy with the plan, tell Claude to save it to `PRD.md`.
Also create an empty progress file:

```sh
touch progress.txt
```

The PRD defines the end state. The progress file tracks what's done. Claude reads both on each loop iteration, finds the next unchecked item, implements it, and updates progress.
The PRD can be in any format - markdown checklist, JSON, plain prose. What matters is that the scope is clear and the agent can pull out individual tasks. For more tips on writing good PRDs, see my 11 tips for AI coding with Ralph.
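If you go with a markdown checklist for the PRD, you can check how much work is left without starting the agent. This is a minimal sketch, assuming tasks are written as `- [ ]` (open) and `- [x]` (done) list items. The function name `remaining_tasks` is hypothetical, and the approach does not apply to JSON or prose PRDs.

```shell
#!/bin/bash
# Sketch: count unchecked markdown checklist items in a PRD file.
# Usage: remaining_tasks PRD.md
# remaining_tasks is a hypothetical helper; it assumes "- [ ]" style tasks.
remaining_tasks() {
  # grep -c prints 0 but exits non-zero when nothing matches, so ignore the status
  grep -cE '^[[:space:]]*[-*] \[ \]' "$1" || true
}
```

For example, `remaining_tasks PRD.md` prints the number of open items, which you could use to pick a sensible iteration cap for a Ralph loop.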
4. Create Your `ralph-once.sh` Script
Before going fully AFK, start with a human-in-the-loop Ralph. You run the script, watch what it does, then run it again. This builds intuition for how the loop works.
Create `ralph-once.sh`:

```sh
#!/bin/bash

claude --permission-mode acceptEdits "@PRD.md @progress.txt \
1. Read the PRD and progress file. \
2. Find the next incomplete task and implement it. \
3. Commit your changes. \
4. Update progress.txt with what you did. \
ONLY DO ONE TASK AT A TIME."
```

Key elements:

| Element | Purpose |
| --- | --- |
| `--permission-mode acceptEdits` | Auto-accepts file edits so the loop doesn't stall |
| `@PRD.md` | Points Claude at your requirements doc |
| `@progress.txt` | Tracks completed work between runs |
| `ONLY DO ONE TASK` | Forces small, incremental commits |

Make it executable:

```sh
chmod +x ralph-once.sh
```

Run it with `./ralph-once.sh`. Watch what Claude does. Check the commit. Run it again.
5. Create your `afk-ralph.sh` Script
Once you're comfortable with human-in-the-loop Ralph, wrap it in a loop:

```sh
#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "Usage: $0 <iterations>"
  exit 1
fi

for ((i=1; i<=$1; i++)); do
  result=$(docker sandbox run claude --permission-mode acceptEdits -p "@PRD.md @progress.txt \
  1. Find the highest-priority task and implement it. \
  2. Run your tests and type checks. \
  3. Update the PRD with what was done. \
  4. Append your progress to progress.txt. \
  5. Commit your changes. \
  ONLY WORK ON A SINGLE TASK. \
  If the PRD is complete, output <promise>COMPLETE</promise>.")

  echo "$result"

  if [[ "$result" == *"<promise>COMPLETE</promise>"* ]]; then
    echo "PRD complete after $i iterations."
    exit 0
  fi
done
```

The `-p` flag runs Claude in print mode - non-interactive, outputs to stdout. This lets us capture the result and check for the completion sigil.

```sh
./afk-ralph.sh 20
```

Go make coffee. Come back to commits.

| Element | Purpose |
| --- | --- |
| `set -e` | Exit on any error |
| `$1` (iterations) | Caps the loop to prevent runaway costs |
| `-p` | Print mode - non-interactive output |
| `<promise>COMPLETE</promise>` | Completion sigil Claude outputs when done |

6. Make It Your Own
Ralph is just a loop. That simplicity makes it infinitely customizable.
You can swap the task source. Instead of a local PRD, pull tasks from GitHub Issues, Linear, or beads. The agent still chooses what to work on - you just change where the list lives.
You can change the output. Instead of committing to main, each iteration could create a branch and open a PR. Useful for triaging a backlog of issues.
You can run different loop types entirely:
| Loop Type | What It Does |
| --- | --- |
| Test Coverage | Finds uncovered lines, writes tests until coverage hits target |
| Linting | Fixes lint errors one by one |
| Duplication | Hooks into jscpd, refactors clones into shared utilities |
| Entropy | Scans for code smells, cleans them up |

Any task that fits "look at repo, improve something, commit" works with Ralph.
For deeper guidance on feedback loops, task sizing, prioritization, and more, read my 11 tips for AI coding with Ralph.
Want more on autonomous AI coding? Join my newsletter to get notified when new articles drop.
▶Read: 11 Tips For AI Coding With Ralph Wiggum
If you're using AI coding CLIs like Claude Code, Copilot CLI, OpenCode, and Codex, this article is for you.
Most developers use these tools interactively. You give the tool a task, watch it work, and intervene when it goes off-track. This is "human-in-the-loop" (HITL) coding.
But there's a new approach called Ralph Wiggum. Ralph runs your AI coding CLI in a loop, letting it work autonomously on a list of tasks. You define what needs to be done. Ralph figures out how - and keeps going until it's finished. In other words, it's long-running, autonomous, and unsupervised AFK coding.
This is not a quickstart guide. If you want to get up and running fast, read Getting Started With Ralph.
The 11 Tips
| # | Tip | Summary |
| --- | --- | --- |
| 1 | Ralph Is A Loop | What Ralph is and why it works |
| 2 | Start With HITL, Then Go AFK | The two modes of running Ralph |
| 3 | Define The Scope | How to specify what "done" looks like |
| 4 | Track Ralph's Progress | Using progress files between iterations |
| 5 | Use Feedback Loops | Types, tests, and linting as guardrails |
| 6 | Take Small Steps | Why smaller tasks produce better code |
| 7 | Prioritize Risky Tasks | Tackle hard problems first |
| 8 | Explicitly Define Software Quality | Don't let Ralph cut corners |
| 9 | Use Docker Sandboxes | Isolate AFK Ralph for safety |
| 10 | Pay To Play | Cost considerations and tradeoffs |
| 11 | Make It Your Own | Alternative loop types and customization |

1. Ralph Is A Loop
AI coding has evolved through distinct phases in the last year or so. Let's briefly define them:
Vibe coding is where you let the AI write code without really checking it. You "vibe" with the AI, accepting its suggestions without scrutiny. It's fast, but the code quality suffers.
Planning is where you ask the AI to plan before it codes. In Claude Code, you can enter plan mode to have the AI explore your codebase and create a plan before writing code. This improves quality, but you're still limited to what fits in one context window.
Multi-phase plans break large features into phases, each handled in a separate context window. You write a different prompt for each phase: "Implement the database schema," then "Add the API endpoints," then "Build the UI." This scales better, but requires constant human involvement to write each prompt.
Ralph simplifies everything. Instead of writing a new prompt for each phase, you run the same prompt in a loop:
```bash
#!/bin/bash
# ralph.sh
# Usage: ./ralph.sh <iterations>

set -e

if [ -z "$1" ]; then
  echo "Usage: $0 <iterations>"
  exit 1
fi

# For each iteration, run Claude Code with the following prompt.
# This prompt is basic, we'll expand it later.
for ((i=1; i<=$1; i++)); do
  result=$(docker sandbox run claude -p \
"@some-plan-file.md @progress.txt \
1. Decide which task to work on next. \
This should be the one YOU decide has the highest priority, \
- not necessarily the first in the list. \
2. Check any feedback loops, such as types and tests. \
3. Append your progress to the progress.txt file. \
4. Make a git commit of that feature. \
ONLY WORK ON A SINGLE FEATURE. \
If, while implementing the feature, you notice that all work \
is complete, output <promise>COMPLETE</promise>. \
")

  echo "$result"

  if [[ "$result" == *"<promise>COMPLETE</promise>"* ]]; then
    echo "PRD complete, exiting."
    exit 0
  fi
done
```

Each iteration:
- Looks at a plan file to see what needs to be done
- Looks at a progress file to see what has already been done
- Decides what to do next
- Explores the codebase
- Implements the feature
- Runs feedback loops (types, linting, tests)
- Commits the code
The key improvement here is that the agent chooses the task, not you.
With multi-phase plans, a human writes a new prompt at the start of each phase. With Ralph, the agent picks what to work on next from your PRD. You define the end state. Ralph gets there.
I've used Ralph to add tests to my AI Hero CLI and build features for my course video manager. Others have built entire programming languages with it.
2. Start With HITL, Then Go AFK
There are two ways to run Ralph:
| Mode | How It Works | Best For | Script |
| --- | --- | --- | --- |
| HITL (human-in-the-loop) | Run once, watch, intervene | Learning, prompt refinement | `ralph-once.sh` |
| AFK (away from keyboard) | Run in a loop with max iterations | Bulk work, low-risk tasks | `afk-ralph.sh` |

For HITL Ralph, keep a `ralph-once.sh` that runs a single iteration. You watch everything it does and step in when needed.

For AFK Ralph, always cap your iterations. Infinite loops are dangerous with stochastic systems. I typically use 5-10 iterations for small tasks, or 30-50 for larger ones.
HITL Ralph resembles pair programming. You and the AI work together, reviewing code as it's created. You can steer, contribute, and share project understanding in real-time.
It's also the best way to learn Ralph. You'll understand what it does, refine your prompt, and build confidence before going hands-off.
Once your prompt is solid, AFK Ralph unlocks real leverage. Set it running, do something else, come back when it's done.
I built a small CLI to ping me on WhatsApp when Ralph finishes. This means much less context switching. I can fully engage with another task. My loops usually take 30-45 minutes, though they can run for hours.
The progression is simple:
- Start with HITL to learn and refine
- Go AFK once you trust your prompt
- Review the commits when you return
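Since the iteration cap is your main safety net when going AFK, it can be worth validating it before the loop starts. Here is a minimal sketch; `validate_iterations` and the `MAX_ITERATIONS` ceiling are hypothetical names I'm introducing for illustration, and the default of 50 just mirrors the upper end of the range mentioned above.

```bash
#!/bin/bash
# Hypothetical guard for an AFK loop: accept only a whole number between 1
# and a ceiling. MAX_ITERATIONS is an assumed safety limit, not part of Ralph.
MAX_ITERATIONS="${MAX_ITERATIONS:-50}"

validate_iterations() {
  local requested="$1"

  # Reject empty input and anything that is not purely digits.
  if [[ ! "$requested" =~ ^[0-9]+$ ]]; then
    echo "Iterations must be a positive whole number." >&2
    return 1
  fi

  # Force base 10 so input like "08" is not treated as invalid octal.
  if (( 10#$requested < 1 || 10#$requested > MAX_ITERATIONS )); then
    echo "Iterations must be between 1 and $MAX_ITERATIONS." >&2
    return 1
  fi
}

# Example use at the top of an AFK script:
# validate_iterations "$1" || exit 1
```

The function returns a status instead of exiting, so the calling script decides what to do with a bad value.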
3. Define The Scope
Before you let Ralph run, you need to define what "done" looks like. This is a shift from planning to requirements gathering. Instead of specifying each step, you describe the desired end state and let the agent figure out how to get there.
Formats For Defining Scope
There are many ways to define scope for Ralph:
- A markdown list of user stories
- GitHub issues or Linear tasks (more on this later)
- Using beads
One approach I like comes from Anthropic's research on long-running agents. They structure PRD items as JSON with a `passes` field:

```json
{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Click the 'New Chat' button",
    "Verify a new conversation is created",
    "Check that chat area shows welcome state"
  ],
  "passes": false
}
```

Ralph marks `passes` to `true` when complete. The PRD becomes both scope definition and progress tracker - a living TODO list rather than a waterfall document.

Why Scope Matters
You don't need a structured TODO list. You can give Ralph a vague task - "improve this codebase" - and let it track its own progress.
But the vaguer the task, the greater the risk. Ralph might loop forever, finding endless improvements. Or it might take shortcuts, declaring victory before you'd consider the job done:
I ran Ralph to increase test coverage on my AI Hero CLI. The repo had internal commands - marked as internal but still user-facing (I use them). I wanted tests for everything.
After three iterations, Ralph reported: "Done with all user-facing commands." But it had skipped the internal ones entirely. It decided they weren't user-facing and marked them to be ignored by coverage.
The fix? Define exactly what you want covered:
| What to Specify | Why It Prevents Shortcuts |
| --- | --- |
| Files to include | Ralph won't ignore "edge case" files |
| Stop condition | Ralph knows when "complete" actually means complete |
| Edge cases | Ralph won't decide certain things don't count |

Adjusting PRDs Mid-Flight
One benefit of this approach: you can adjust while Ralph is running.
- Already implemented but wrong? Set `passes` back to `false`, add notes, rerun.
- Missing a feature? Add a new PRD item even mid-loop.
You're not editing a linear multi-phase plan. You're describing a different end state. Ralph will get there.
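If you keep the PRD as JSON, you can check what's left between runs without opening the file. This is a rough sketch that assumes `prd.json` is a JSON array of items shaped like the example earlier, each with a boolean `passes` field, and that `jq` is installed. The function names are my own, purely for illustration.

```bash
#!/bin/bash
# Assumes prd.json is a JSON array of PRD items, each with a boolean "passes"
# field and a "description" string. Requires jq.

# Print how many PRD items have not passed yet.
remaining_items() {
  local prd_file="${1:-prd.json}"
  jq '[.[] | select(.passes == false)] | length' "$prd_file"
}

# Print the description of every item that has not passed yet.
list_remaining() {
  local prd_file="${1:-prd.json}"
  jq -r '.[] | select(.passes == false) | .description' "$prd_file"
}
```

Running `list_remaining` after an AFK session is a quick way to see whether Ralph's idea of "complete" matches yours.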
As long as scope and stop condition are explicit, Ralph will know when to emit `<promise>COMPLETE</promise>`.

Try It Out

Use plan mode, and create a `prd.json` file for your next feature. Use this prompt to generate structured PRD items:

```text
Convert my feature requirements into structured PRD items.
Each item should have: category, description, steps to verify, and passes: false.
Format as JSON. Be specific about acceptance criteria.
```

4. Track Ralph's Progress
Every Ralph loop I run emits a `progress.txt` file, committed directly to the repo. I took this inspiration from Anthropic's article on long-running agent harnesses.

This addresses a core challenge: AI agents are like super-smart experts who forget everything between tasks. Each new context window starts fresh. Without a progress file, Ralph must explore the entire repo to understand the current state.
A progress file short-circuits that exploration. Ralph reads it, sees what's done, and jumps straight into the next task.
What Goes In The Progress File
Keep it simple and concise:
- Tasks completed in this session
- Decisions made and why
- Blockers encountered
- Files changed
You can also include the PRD item that was just completed, any architectural decisions, and notes for the next iteration.
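To make the shape of an entry concrete, here is a small sketch of what one append might look like. In a real loop Ralph writes the file itself; `append_progress` and its field layout are assumptions of mine, not a format Ralph requires.

```bash
#!/bin/bash
# Hypothetical helper: append one dated entry to a progress file.
# The layout (task, decisions, files, blockers) is an assumed convention.
append_progress() {
  local file="$1" task="$2" decisions="$3" files_changed="$4" blockers="${5:-none}"
  {
    echo "## $(date +%Y-%m-%d) - $task"
    echo "- Decisions: $decisions"
    echo "- Files changed: $files_changed"
    echo "- Blockers: $blockers"
    echo
  } >> "$file"
}

# Example:
# append_progress progress.txt "Add login test" "Reused existing fixtures" "src/login.test.ts"
```

Appending rather than overwriting means each iteration sees everything earlier iterations recorded.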
Why Commits Matter
Ralph should commit after each feature. This gives future iterations:
- A clean `git log` showing what changed
- The ability to `git diff` against previous work
- A rollback point if something breaks
The combination of progress file plus git history gives Ralph full context without burning tokens on exploration.
Cleanup
Don't keep `progress.txt` forever. Once your sprint is done, delete it. It's session-specific, not permanent documentation.

Try It Out

Add progress tracking to your Ralph prompt:

```text
After completing each task, append to progress.txt:
- Task completed and PRD item reference
- Key decisions made and reasoning
- Files changed
- Any blockers or notes for next iteration

Keep entries concise. Sacrifice grammar for the sake of concision. This file helps future iterations skip exploration.
```

5. Use Feedback Loops
Ralph's success depends on feedback loops. The more loops you give it, the higher quality code it produces.
Types of Feedback Loops
| Feedback Loop | What It Catches |
| --- | --- |
| TypeScript types | Type mismatches, missing props |
| Unit tests | Broken logic, regressions |
| Playwright MCP server | UI bugs, broken interactions |
| ESLint / linting | Code style, potential bugs |
| Pre-commit hooks | Blocks bad commits entirely |

The best setup blocks commits unless everything passes. Ralph can't declare victory if the tests are red.
Why Feedback Loops Matter
Great programmers don't trust their own code. They don't trust external libraries. They especially don't trust their colleagues. Instead, they build automations and checks to verify what they ship.
This humility produces better software. The same applies to AI agents.
Every tip in this article works for human developers too. Feedback loops, small steps, explicit scope - these aren't AI-specific techniques. They're just good engineering. Ralph makes them non-negotiable.
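One way to make "blocks commits unless everything passes" concrete is a tiny gate that runs each check in order and stops at the first failure. This is only a sketch: `run_feedback_loops` is a name I made up, and the `npm run` commands in the example comment assume your project defines those scripts.

```bash
#!/bin/bash
# Run each check command in order and stop at the first one that fails.
# The commands are passed in by the caller; nothing here is project-specific.
run_feedback_loops() {
  local check
  for check in "$@"; do
    echo "Running: $check"
    if ! bash -c "$check"; then
      echo "Failed: $check" >&2
      return 1
    fi
  done
  echo "All feedback loops passed."
}

# Example body for a git pre-commit hook (assumes these npm scripts exist):
# run_feedback_loops "npm run typecheck" "npm run test" "npm run lint" || exit 1
```

Wired into a pre-commit hook, a non-zero return aborts the commit, so a red check can't slip through even when nobody is watching.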
Try It Out
Add explicit feedback loop requirements to your Ralph prompt:
```text
Before committing, run ALL feedback loops:
1. TypeScript: npm run typecheck (must pass with no errors)
2. Tests: npm run test (must pass)
3. Lint: npm run lint (must pass)

Do NOT commit if any feedback loop fails. Fix issues first.
```

6. Take Small Steps
The rate at which you can get feedback is your speed limit. Never outrun your headlights.
Humans doing a big refactor might bite off a huge chunk and roll through it. Tests, types, and linting stay red for hours. Breaking work into smaller chunks means tighter feedback loops - less work before you receive feedback.
The same applies to Ralph, but with an additional constraint: context windows are limited, and LLMs get worse as they fill up. This is called context rot - the longer you go, the stupider the output.
The Tradeoff
Each Ralph iteration has startup costs. Ralph must pick a task, explore the repo, and gather context. These tokens are spent per-loop.
If you're doing a large refactor, you don't want Ralph renaming one variable per iteration. But:
- Larger tasks mean less frequent feedback
- More context means lower quality code
- Smaller tasks mean higher quality, but slower progress
Sizing Your PRD Items
For AFK Ralph, keep PRD items small. You want the agent on top form when you're not watching.
For HITL Ralph, you can make items slightly larger to see progress faster. But even then, bias small.
A refactor item might be as simple as: "Change one function's parameters. Verify tests and types pass."
In your prompt, guide Ralph on step size. My tendency: small steps. Code quality over speed - especially when AFK, where speed matters less anyway.
Try It Out
Add step-size guidance to your Ralph prompt:
```text
Keep changes small and focused:
- One logical change per commit
- If a task feels too large, break it into subtasks
- Prefer multiple small commits over one large commit
- Run feedback loops after each change, not at the end

Quality over speed. Small steps compound into big progress.
```

7. Prioritize Risky Tasks
Ralph chooses its own tasks. Without explicit guidance, it will often pick the first item in the list or whatever seems easiest to implement.
This mirrors human behavior. Developers love quick wins. But seasoned engineers know you should nail down the hard stuff first, before the easy work buries you in technical debt.
Spikes And Integration
Focus on spikes - pieces of work whose outcome you can't predict yet. Build features end-to-end rather than layer by layer. Integrate early.
If you have modules that need to work together, tell Ralph to integrate them first. Don't wait until the end of your sprint to discover they don't fit.
| Task Type | Priority | Why |
| --- | --- | --- |
| Architectural work | High | Decisions cascade through entire codebase |
| Integration points | High | Reveals incompatibilities early |
| Unknown unknowns | High | Better to fail fast than fail late |
| UI polish | Low | Can be parallelized later |
| Quick wins | Low | Easy to slot in anytime |

HITL For Risky Tasks
Risky tasks need more human involvement. Use HITL Ralph for early architectural decisions - the code from these tasks stays forever, and any shortcuts here will cascade through the entire project.
Save AFK Ralph for when the foundation is solid. Once the architecture is proven and the risky integrations work, you can let Ralph run unsupervised on the lower-risk tasks.
Try It Out
Add prioritization guidance to your Ralph prompt:
```text
When choosing the next task, prioritize in this order:
1. Architectural decisions and core abstractions
2. Integration points between modules
3. Unknown unknowns and spike work
4. Standard features and implementation
5. Polish, cleanup, and quick wins

Fail fast on risky work. Save easy wins for later.
```

8. Explicitly Define Software Quality
Not all repos are made alike. A lot of code out there is prototype code - demos, short-lived experiments, client pitches. Different repos have different bars for quality.
The agent doesn't know what kind of repo it's in. It doesn't know if this is a throwaway prototype or production code that will be maintained for years. You need to tell it explicitly.
What To Communicate
| Repo Type | What To Say | Expected Behavior |
| --- | --- | --- |
| Prototype | "This is a prototype. Speed over perfection." | Takes shortcuts, skips edge cases |
| Production | "Production code. Must be maintainable." | Follows best practices, adds tests |
| Library | "Public API. Backward compatibility matters." | Careful about breaking changes |

Put this in your AGENTS.md file, your skills, or directly in your prompt.
The Repo Wins
Your instructions compete with your codebase. When Ralph explores your repo, it sees two sources of truth: what you told it to do and what you actually did. One is a few lines of instruction. The other is thousands of lines of evidence.
You can write "never use `any` types" in your prompt. But if Ralph sees `any` throughout your existing code, it will follow the codebase, not your instructions.

Agents amplify what they see. Poor code leads to poorer code. Low-quality tests produce unreliable feedback loops.
This is software entropy - the tendency of codebases to deteriorate over time. Ralph accelerates this. A human might commit once or twice a day. Ralph can pile dozens of commits into a repo in hours. If those commits are low quality, entropy compounds fast.
This means you need to:
- Keep your codebase clean before letting Ralph loose
- Use feedback loops (linting, types, tests) to enforce standards
- Make quality expectations explicit and visible
Try It Out
Add quality expectations to your `AGENTS.md` or Ralph prompt:

```text
This codebase will outlive you. Every shortcut you take becomes someone else's burden. Every hack compounds into technical debt that slows the whole team down.

You are not just writing code. You are shaping the future of this project. The patterns you establish will be copied. The corners you cut will be cut again.

Fight entropy. Leave the codebase better than you found it.
```

9. Use Docker Sandboxes

AFK Ralph needs permissions to edit files, run commands, and commit code. What stops it from running `rm -rf ~`? You're away from the keyboard, so you're not going to be able to intervene.

Docker sandboxes are the simplest solution:

```bash
docker sandbox run claude
```

This runs Claude Code inside a container. Your current directory is mounted, but nothing else. Ralph can edit project files and commit - but can't touch your home directory, SSH keys, or system files.
The tradeoff: your global AGENTS.md and user skills won't be loaded. For most Ralph loops, this is fine.
For HITL Ralph, sandboxes are optional - you're watching. For AFK Ralph, especially overnight loops, they're essential insurance against runaway agents.
10. Pay To Play
One question I get a lot is, "How much will this cost?" Surely running AFK Ralph overnight is a way to rack up enormous bills?
I never feel comfortable giving financial advice, especially to folks in low-income countries. But Ralph is completely configurable to how much you want to spend.
HITL Is Still Worth It
If you never run AFK Ralph, HITL Ralph still has big benefits over multi-phase planning. Running the same prompt over and over feels nicer than specifying a different prompt for each phase.
| Approach | Effort Per Phase | Best For |
| --- | --- | --- |
| Multi-phase plans | Write new prompt | One-off large tasks |
| HITL Ralph | Rerun same prompt | Learning, refinement |
| AFK Ralph | Set and forget | Bulk work, automation |

I'm on the Anthropic 5x Max plan at around £90/month. I've run AFK Ralph a few times, but most of my usage is HITL.
Why Not Local Models?
I don't think open source models you can run on your laptop are good enough for Ralph yet. They require powerful GPUs, and the output quality isn't there. In AI coding, you have to pay to play.
The Golden Age
But we need to contextualize this. For the next couple of years, we're in a golden age where you can do magical things with AI faster than humans - but the market still pays human wages. The market hasn't adjusted to the fact that we all have access to extremely powerful AI coding tools.
Yes, you have to pay. But the rewards are there if you're willing to claim them.
11. Make It Your Own
Ralph is just a loop. That simplicity makes it infinitely configurable. Here are some ways to make it your own:
Swap The Task Source
The examples in this article use a local `prd.json`. But Ralph can pull tasks from anywhere:

| Task Source | How It Works |
| --- | --- |
| GitHub Issues | Ralph picks an issue, implements it |
| Linear | Ralph pulls from your sprint |
| Beads | Ralph works through a beadfile |

The key insight stays the same: the agent chooses the task, not you. You're just changing where that list lives.
Change The Output
Instead of committing directly to main, each Ralph iteration could:
- Create a branch and open a PR
- Add comments to existing issues
- Update a changelog or release notes
This is useful when you have a backlog of issues that need to become PRs. Ralph triages, implements, and opens the PR. You review when you're ready.
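As a rough sketch of the branch-and-PR variant: the naming scheme and both function names below are hypothetical, and `open_pr_for_iteration` assumes `git` and the GitHub CLI (`gh`) are installed and authenticated, with the agent having already committed on that branch.

```bash
#!/bin/bash
# Hypothetical naming scheme: ralph/<slug-of-title>-<iteration>.
ralph_branch_name() {
  local title="$1" iteration="$2"
  local slug
  # Lowercase, collapse runs of non-alphanumerics into "-", trim stray dashes.
  slug=$(printf '%s' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
  echo "ralph/${slug}-${iteration}"
}

# Push the branch and open a PR, using the commits to fill title and body.
# Assumes the branch was created with `git checkout -b` before the agent ran.
open_pr_for_iteration() {
  local branch="$1"
  git push -u origin "$branch" && gh pr create --fill --head "$branch"
}
```

One branch per iteration keeps each PR small, which matches the one-task-per-loop rule.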
Alternative Loop Types
Ralph doesn't need to work through a feature backlog. Some loops I've been experimenting with:
Test Coverage Loop: Point Ralph at your coverage metrics. It finds uncovered lines, writes tests, and iterates until coverage hits your target. I used this to take AI Hero CLI from 16% to 100% coverage.
Duplication Loop: Hook Ralph up to `jscpd` to find duplicate code. Ralph identifies clones, refactors into shared utilities, and reports what changed.

Linting Loop: Feed Ralph your linting errors. It fixes them one by one, running the linter between iterations to verify each fix.
Entropy Loop: Ralph scans for code smells - unused exports, dead code, inconsistent patterns - and cleans them up. Software entropy in reverse.
Any task that can be described as "look at repo, improve something, report findings" fits the Ralph pattern. The loop is the same. Only the prompt changes.
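"Only the prompt changes" can be taken fairly literally. Here is a minimal sketch that picks a prompt by loop type and leaves the rest of the loop untouched. The function name, the loop-type keys, and the shortened prompt wording are all my own placeholders.

```bash
#!/bin/bash
# Map a loop type to a prompt. Keys and wording are illustrative placeholders.
prompt_for_loop() {
  case "$1" in
    coverage)
      echo "Find uncovered lines, write tests for them, and rerun coverage."
      ;;
    lint)
      echo "Run the linter, fix ONE error, and rerun the linter to verify."
      ;;
    entropy)
      echo "Find ONE code smell, fix it, and note the change in progress.txt."
      ;;
    *)
      echo "Unknown loop type: $1" >&2
      return 1
      ;;
  esac
}

# Inside the existing loop, the only line that changes would be something like:
# result=$(docker sandbox run claude -p "$(prompt_for_loop coverage)")
```

Keeping the prompts in one function also makes it easy to version them alongside the script.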
Try It Out
Try one of these alternative loop prompts:
```text
# Test Coverage Loop
@coverage-report.txt
Find uncovered lines in the coverage report.
Write tests for the most critical uncovered code paths.
Run coverage again and update coverage-report.txt.
Target: 80% coverage minimum.
```

```text
# Linting Loop
Run: npm run lint
Fix ONE linting error at a time.
Run lint again to verify the fix.
Repeat until no errors remain.
```

```text
# Entropy Loop
Scan for code smells: unused exports, dead code, inconsistent patterns.
Fix ONE issue per iteration.
Document what you changed in progress.txt.
```

I look forward to seeing your own Ralph Wiggums - fingers up noses, flying through windows, eating paste, and shipping code.
Want to learn more about Ralph? I'll be publishing a lot more about autonomous AI coding on my newsletter. Sign up to get notified when new articles drop.
▶Read: Here's How To Stream Claude Code With AFK Ralph
This guide is for people who've tried Ralph with Claude Code and hit that frustration point: running the AFK script and ending up staring at a blank screen.
If you've never heard of Ralph, start here:
The Problem: A Blank Screen
When you want Ralph to run while you're away from your keyboard, you might use a script like this:
```bash
#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "Usage: $0 <iterations>"
  exit 1
fi

for ((i=1; i<=$1; i++)); do
  result=$(docker sandbox run --credentials host claude \
    --print \
    "<your prompt here>")

  if [[ "$result" == *"<promise>COMPLETE</promise>"* ]]; then
    echo "Ralph complete after $i iterations."
    exit 0
  fi
done
```

The issue here is frustrating: when you run Claude with the `--print` flag, you get zero streaming output. Your terminal goes blank.

You walk away, and you have absolutely no idea what's happening. Is Claude working? Is it stuck? Did something break? You won't know until it's finished.
The dream with AFK Ralph is to get the best of both worlds: you want real-time visibility into what Claude is doing, but you also want to leave it running while you step away.
The Solution: Streaming with jq
Claude can output `stream-json` format, which gives you every single message as it happens. But that output is extremely verbose and unreadable.

By combining `stream-json` with `jq` filtering, you can extract just the useful information and stream it to your terminal in real-time. At the same time, you capture the final result to check for the `<promise>COMPLETE</promise>` marker.

Here's the complete script:

```bash
#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "Usage: $0 <iterations>"
  exit 1
fi

# jq filter to extract streaming text from assistant messages
stream_text='select(.type == "assistant").message.content[]? | select(.type == "text").text // empty | gsub("\n"; "\r\n") | . + "\r\n\n"'

# jq filter to extract final result
final_result='select(.type == "result").result // empty'

for ((i=1; i<=$1; i++)); do
  tmpfile=$(mktemp)
  trap "rm -f $tmpfile" EXIT

  docker sandbox run --credentials host claude \
    --verbose \
    --print \
    --output-format stream-json \
    "<your prompt here>" \
  | grep --line-buffered '^{' \
  | tee "$tmpfile" \
  | jq --unbuffered -rj "$stream_text"

  result=$(jq -r "$final_result" "$tmpfile")

  if [[ "$result" == *"<promise>COMPLETE</promise>"* ]]; then
    echo "Ralph complete after $i iterations."
    exit 0
  fi
done
```

This script accepts one argument: the number of iterations to run.
Walking Through the Script Structure
Breaking Down the Stream Filter
The stream filter does several important things:
- Selects assistant messages: `select(.type == "assistant")` grabs only Claude's responses
- Extracts text content: `.message.content[]? | select(.type == "text").text` pulls out just the text portions
- Fixes line endings: `gsub("\n"; "\r\n")` replaces newlines with carriage return + newline
- Adds spacing: `. + "\r\n\n"` inserts extra space between messages
The carriage return replacement fixes a bug where the cursor wasn't returning to the first character of the line properly.
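If you want to poke at the filter without Docker or Claude in the picture, you can replay a few hand-written lines through the same `grep | tee | jq` chain. The sample lines below are my own guesses at the event shapes, inferred from what the filters select on, not captured Claude output, and `replay_sample` is a hypothetical helper. It needs `jq` installed.

```bash
#!/bin/bash
# Same two jq filters as the full script.
stream_text='select(.type == "assistant").message.content[]? | select(.type == "text").text // empty | gsub("\n"; "\r\n") | . + "\r\n\n"'
final_result='select(.type == "result").result // empty'

# Replay hand-written sample lines (assumed shapes): one non-JSON noise line,
# one assistant text message, and one final result event.
replay_sample() {
  local tmpfile="$1"
  printf '%s\n' \
    'starting sandbox...' \
    '{"type":"assistant","message":{"content":[{"type":"text","text":"Working on task 1"}]}}' \
    '{"type":"result","result":"<promise>COMPLETE</promise>"}' \
  | grep --line-buffered '^{' \
  | tee "$tmpfile" \
  | jq --unbuffered -rj "$stream_text"
}

# Example:
# tmpfile=$(mktemp)
# replay_sample "$tmpfile"            # shows only the assistant text
# jq -r "$final_result" "$tmpfile"    # reads the final result from the capture
```

The noise line never reaches `jq`, the assistant text is printed as it arrives, and the result event is only read back from the temp file afterwards.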
The Data Pipeline
Here's how data flows through the script:
Docker streams out `stream-json` formatted data, but it also emits some non-JSON lines that are just noise. The `grep --line-buffered '^{'` filter drops them by passing through only lines that start with `{`.

The `tee "$tmpfile"` command writes everything to a temporary file without stopping the stream. You need this file later to check if Claude has finished.

Finally, `jq --unbuffered -rj "$stream_text"` applies the streaming filter and displays the text in real-time to your terminal.

Conclusion
My hope is that relatively soon I'll be able to delete this article because Claude Code will have shipped a feature that allows you to stream the responses while still capturing the final output.
OpenCode already has this, and so there's no need to write an article like this for OpenCode. But until then, this is a workable solution to get real-time streaming output from Claude while running AFK Ralph.
▶Read: Why the Anthropic Ralph plugin sucks (use a bash loop instead)
Ralph is a development methodology based on continuous AI agent loops - a simple bash script that repeatedly feeds Claude a prompt file, allowing it to iteratively improve its work until completion.
For a complete walkthrough on implementing Ralph, check out the getting started guide.
What Is The Anthropic Plugin?
Anthropic shipped an official Ralph plugin for Claude Code that's designed to automate this loop for you.
Instead of running a script manually multiple times, you run the plugin once with a command like this:
```text
/ralph-loop "Build a REST API for todos. Requirements: CRUD operations, input validation, tests. Output <promise>COMPLETE</promise> when done." --completion-promise "COMPLETE" --max-iterations 50
```

The plugin then handles the looping automatically. You don't need to write bash scripts. You don't need to run commands repeatedly. The plugin sits inside Claude Code, intercepts your exit attempts, and feeds the prompt back into the session automatically.
On the surface, it sounds perfect. It automates Ralph so you don't have to think about it.
But there's a fundamental problem with how it's built.
The Problem With Anthropic's Ralph Plugin
To understand why the plugin fails, you need to understand how LLMs actually work under load.
As LLMs receive more tokens, the number of relationships between tokens scales quadratically. This means the load on the model grows much faster than the input itself. Each additional token makes it harder for the model to process information and make good decisions.
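To put rough numbers on that quadratic growth, here is a simplification that just counts pairs of tokens. It is not a description of any particular model's internals.

```latex
\[
\text{pairs}(n) = \frac{n(n-1)}{2}, \qquad
\text{pairs}(1000) = 499{,}500, \qquad
\text{pairs}(2000) = 1{,}999{,}000
\]
```

Doubling the tokens roughly quadruples the number of pairs.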
Practically speaking, every LLM has a smart zone and a dumb zone.
| Zone | Position | Behavior |
| --- | --- | --- |
| Smart Zone | First 40% of context | Sharp, capable, makes good decisions |
| Dumb Zone | Last 60% of context | Confused, mistakes, degraded performance |

At around the 40% context mark, LLMs start entering the dumb zone. Most people debate exactly where this boundary is, but everyone agrees the boundary exists.
Why A Bash Loop Works
Ralph keeps the AI in the smart zone. How? Each iteration of Ralph uses a fresh context window.
When you run the bash loop, Claude starts with an empty context. The PRD and progress file go in. Claude works on one task. Then the script exits.
When you run it again, Claude gets another fresh context window. The PRD and progress file go in again. Claude reads what was done before (via git history and file modifications), picks the next task, and implements it.
The context window never fills up with cruft. The AI always operates in the smart zone. It never degrades.
Each phase resets the context, keeping the LLM working in the smart zone where it does its best work.
How The Plugin Breaks This
The Anthropic plugin keeps everything inside a single Claude Code session.
Instead of exiting and restarting, the plugin uses a "stop hook" that intercepts Claude's exit attempts and feeds the same prompt back into the session. The loop happens entirely within one session.
This means:
- Iteration 1: Claude reads the PRD, implements task 1, context is ~20% full
- Iteration 2: Claude sees its previous work, implements task 2, context is ~35% full
- Iteration 3: Claude sees its previous work and the last iteration, implements task 3, context is ~50% full. We have entered the dumb zone.
With each iteration, the context window fills up. The session accumulates history, previous attempts, and leftover context. After 3-4 iterations, the AI is working entirely in the dumb zone.
In other words, the plugin guarantees that you're going to fill up the smart zone and enter the dumb zone. It fundamentally undermines the reason Ralph works in the first place.
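The arithmetic behind that list is simple enough to sketch. The figures here (start at 20%, grow about 15 points per iteration, a 40% threshold) are just the rough numbers from the example above, not measurements, and the function name is my own.

```bash
#!/bin/bash
# Illustrative arithmetic only: given a starting context usage, a per-iteration
# growth, and a threshold (all in percent), print the first iteration whose
# usage exceeds the threshold when context is never reset.
first_dumb_zone_iteration() {
  local start="$1" growth="$2" threshold="$3"
  local usage="$start" iteration=1

  # Without growth the threshold is never crossed, so refuse to loop forever.
  if (( growth <= 0 )); then
    echo "growth must be positive" >&2
    return 1
  fi

  while (( usage <= threshold )); do
    usage=$(( usage + growth ))
    iteration=$(( iteration + 1 ))
  done
  echo "$iteration"
}

first_dumb_zone_iteration 20 15 40   # prints 3
```

With a bash loop, usage goes back to the starting figure on every iteration, so under the same assumptions the threshold is never crossed.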
The plugin keeps all iterations in one session, causing the context to fill up and degrade performance over time.
The Solution: Stick With The Bash Loop
If you want Ralph to actually work, use a bash loop instead of the plugin. Here are some resources:
- Read the getting started guide for detailed implementation advice
- Learn more with these 11 tips
- Learn how to stream AFK results with Claude Code here
Ralph works because it's ruthlessly simple. Keep it that way.
▶Read: events turn ai agents into autonomous software engineers with ralph
Half-day live workshop • February 11, 9 AM PST • Limited to 40 developers • $500
The Ralph technique runs AI agents in loops with structured PRDs and feedback mechanisms. It's real, it's working in production, and you'll build it today.
Geoffrey Huntley figured out the pattern. Matt Pocock validated it with people shipping code. The technique has a name—Ralph (yeah, like Ralph Wiggum from The Simpsons, because of course it does)—and developers are using it right now to clear their GitHub backlogs while they sleep.
You're going to learn exactly how to build it.
Interactive, hands-on, real code running on YOUR backlog by 1 PM.
Here's the thing about AI coding agents right now
You've tried them. Claude, ChatGPT, Cursor, whatever.
You ask for Feature A. You get half a solution. You manually fix it. You ask for Feature B. It breaks Feature A. You're back to debugging. The cycle never ends.
You're still doing the engineer's job. The AI is just doing grunt work.
But here's what's actually happening right now, today, in production:
Developers are setting up agents that run in loops. They point them at GitHub backlogs. The agents execute tasks—real tasks, not toy examples—and create working PRs. Without constant human intervention. Without hallucinating broken code into existence.
This isn't theory. This is happening.
The question is: Are you going to keep manually prompting AI and hoping it works? Or are you going to learn the systematic technique that actually makes this autonomous?
Why this works now (and didn't before)
Two things changed:
1. The models got good enough
Opus 4.5 and GPT 5.2 can execute complex tasks if you structure them right. They're not perfect, but they're finally smart enough to work in loops with proper feedback.
2. The technique matured
This isn't about better prompting. It's about:
- Running agents in loops (not one-shot prompts)
- Structuring PRDs so agents understand the task breakdown
- Building feedback mechanisms (types, tests, browser validation) so agents catch their own mistakes
- Letting the loop run until the agent self-corrects instead of you fixing everything manually
When you combine capable models with this systematic approach, you get something that actually looks like autonomous execution.
Not magic. Not perfect. But real, usable leverage.
How we'll learn this: together, live, step-by-step
You'll be:
- Debugging real code on screen with Matt. When something breaks (and it will), we'll figure it out together. Live.
- Hooking up Ralph directly to your tools. You'll connect the loop to a GitHub backlog and see real PRs getting created.
- Asking your questions in real-time and getting answers immediately. Not a forum post you check in 3 days. Right there, right then.
- Working in groups. You'll be mob programming in groups of 2-3, bouncing ideas off each other and learning collaboratively.
By 1 PM, you'll have a Ralph setup that works for YOU.
A blog post can explain concepts. A video can show you a demo. But neither can debug YOUR specific setup when your bash loop fails, or answer "what happens if I structure my PRD THIS way?" immediately, or show you how other developers in the workshop are solving the same problems you're hitting.
That's what you're paying for. Immediate answers to your specific problems.
What you're going to learn (and build)
This is a half-day workshop. Four hours. Each module gives you something you can implement immediately.
Section 1: Getting Started
Your own bash loop, not vendor plugins
The Anthropic plugin doesn't get the best out of Ralph. You're going to run your own bash loop instead.
You'll build a working loop on your machine that gives you complete ownership. Works with Claude Code, OpenCode, Codex, or any coding agent CLI you choose. Your infrastructure, your control, your pricing.
We'll walk through the exact setup, debug it live if it breaks, and make sure you've got a loop running before we move on.
What this unlocks: Everything else. This is your foundation—once it's running, you own the infrastructure and the loop.
Section 2: Feedback Loops
How agents catch their own mistakes
Here's the secret: Agents are only as clever as the tools they use to interface with the world. Without feedback, they're coding blind. With the right tools, they can feel their way to working code.
For backend: We teach Ralph to do TDD. Red-Green-Refactor is the perfect feedback loop. Write a failing test (Red). Write code until it passes (Green). Clean up (Refactor). The test gives the agent eyes. But tests can lie—you'll learn how to structure PRDs so agents write tests you can actually trust.
For frontend: We hook Ralph up to an MCP server that can query the actual UI. The agent can see what's on screen, check if buttons exist, verify layouts rendered correctly. It's not guessing—it's looking.
The agent runs, gets feedback, tries again, self-corrects. That's the loop.
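The run, feedback, retry cycle can be sketched as a small function. This is an illustration under stated assumptions, not the workshop's implementation: `Check`, `collectFeedback`, and `attemptUntilGreen` are hypothetical names, and the checks are stubs standing in for your real type checker and test runner.

```typescript
// A check is anything that can pass or fail with some output (types, tests, UI queries).
interface Check {
  name: string;
  run: () => { ok: boolean; output: string };
}

// Run every check and turn the failures into text the agent can read.
function collectFeedback(checks: Check[]): { passed: boolean; feedback: string } {
  const results = checks.map((c) => ({ name: c.name, ...c.run() }));
  const failures = results.filter((r) => !r.ok);
  const feedback = failures.map((f) => "[" + f.name + "] " + f.output).join("\n");
  return { passed: failures.length === 0, feedback };
}

// Let the agent try, run the checks, and hand failures back until everything is green.
// Returns the attempt number that passed, or null if the cap was reached.
function attemptUntilGreen(
  agent: (feedback: string) => void,
  checks: Check[],
  maxAttempts: number
): number | null {
  let feedback = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    agent(feedback);
    const result = collectFeedback(checks);
    if (result.passed) return attempt;
    feedback = result.feedback;
  }
  return null;
}

// Stubs: the "tests" check only passes after the agent has made two changes.
let fixesApplied = 0;
const seenFeedback: string[] = [];
const stubCodingAgent = (feedback: string): void => {
  seenFeedback.push(feedback);
  fixesApplied += 1;
};
const checks: Check[] = [
  { name: "types", run: () => ({ ok: true, output: "" }) },
  {
    name: "tests",
    run: () => (fixesApplied >= 2 ? { ok: true, output: "" } : { ok: false, output: "1 failing test" }),
  },
];

const attemptsNeeded = attemptUntilGreen(stubCodingAgent, checks, 5);
console.log(attemptsNeeded, seenFeedback);
```

The second attempt receives the failing test output as its input, which is the "eyes" the section describes.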
What this unlocks: Autonomy. Now the agent doesn't just execute—it validates, catches mistakes, and self-corrects.
Section 3: Hooking Up Ralph To Your Backlog
From GitHub issues to executed PRs
Wire an agent to your GitHub backlog so it executes tasks without you.
You'll structure your backlog so agents can parse it. Turn issues into agent-executable tasks using the specific pattern that works. Handle agent state and failures gracefully. Review PRs from agents (what to look for, what to ignore, what breaks most often).
This is where you'll start seeing actual PRs getting created from your real issues.
What this unlocks: Scale. Now you're not running the loop manually—it's connected to your backlog.
Section 4: Making PRDs Ralph Loves
The skill that makes or breaks agent execution
Most people can't write PRDs that agents execute correctly. Too vague? Agent guesses wrong. Too detailed? Agent misinterprets specifics.
Here's the framework you'll learn:
The 3-Layer PRD Structure:
- Context layer: What the agent needs to understand about the system
- Task layer: The specific change broken into atomic steps
- Validation layer: How the agent knows it succeeded
Plus the interview technique: You'll learn how to get the agent to help you build better PRDs by asking it the right questions about your task.
By the end, you'll have written PRDs that agents can actually execute—not theory, actual working examples.
Common mistakes that make agents hallucinate or fail:
- Being too abstract about "what should happen"
- Not specifying the validation criteria
- Forgetting to mention dependencies between tasks
- Using your internal jargon the agent doesn't understand
We'll debug these live.
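One way to picture the three layers is as a typed object with a small lint pass over it. This is a sketch with hypothetical field and function names (`Prd`, `PrdTask`, `lintPrd`); the workshop's actual template is not shown in this text. It flags two of the mistakes listed above: missing validation criteria and dependencies that point at tasks that don't exist.

```typescript
interface PrdTask {
  id: string;
  description: string;
  dependsOn: string[]; // ids of tasks that must be finished first
}

interface Prd {
  context: string;      // Context layer: what the agent needs to know about the system
  tasks: PrdTask[];     // Task layer: the change broken into atomic steps
  validation: string[]; // Validation layer: how the agent knows it succeeded
}

function lintPrd(prd: Prd): string[] {
  const problems: string[] = [];
  if (prd.context.trim() === "") problems.push("context layer is empty");
  if (prd.tasks.length === 0) problems.push("task layer has no steps");
  if (prd.validation.length === 0) problems.push("validation layer has no criteria");
  for (const task of prd.tasks) {
    for (const dep of task.dependsOn) {
      const known = prd.tasks.some((t) => t.id === dep);
      if (!known) problems.push("task " + task.id + " depends on unknown task " + dep);
    }
  }
  return problems;
}

// Example PRD with two deliberate mistakes.
const draft: Prd = {
  context: "Express API with a Postgres database; auth is handled by middleware.",
  tasks: [
    { id: "1", description: "Add a failing test for GET /invoices", dependsOn: [] },
    { id: "2", description: "Implement the route", dependsOn: ["9"] },
  ],
  validation: [],
};

const problems = lintPrd(draft);
console.log(problems);
```

A lint like this cannot judge whether a PRD is too abstract or full of internal jargon; those still need a human read.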
What this unlocks: Production. Now your agent doesn't just run—it executes the RIGHT thing and ships real features.
Who this is for (and who it's not)
This is for you if:
- You already write code (JavaScript, TypeScript, Node, whatever—you're comfortable with code)
- You've used AI coding assistants before and know the frustration of babysitting them
- You want agents that actually execute tasks, not just suggest half-solutions
- You're willing to spend 4 hours to learn a technique that could save you 10+ hours a week
- You care about ROI on your time, not hype
This is NOT for you if:
- You're new to coding (this assumes you can read code and debug via logs)
- You're looking for a "no-code solution" (you're going to write and run code)
- You just want to watch a demo and not implement anything yourself
- You think AI will magically solve all your problems without you learning how it works
We're teaching practitioners who ship. If that's you, you're in the right place.
Smart concerns, not excuses
"Will this work on my real code? Or just demos?"
You're right to be skeptical. Most people oversell this.
Here's the thing: Ralph isn't trying to be fully autonomous. It's a hybrid. You balance HITL Ralph (Human In The Loop) with AFK Ralph (Away From Keyboard). You decide which 10% is worth your expertise—architecture decisions, design taste, the hard stuff—and which 90% can run unsupervised. Bug fixes. Refactors. Simple features. The stuff that eats your time but doesn't need your brain.
The key? Both modes use the exact same interface. Same prompts. Same setup. You work on the difficult stuff together with HITL Ralph, then hand it off to AFK Ralph when it's ready. Seamless.
Ralph works on brownfield. Ralph works on greenfield. Honestly, there's nothing I want to do without Ralph anymore because it's just such an improvement over every setup I've tried before.
That flexibility is the unlock. You're not betting everything on full autonomy. You're choosing when to supervise and when to let it run.
"I don't want to be dependent on Anthropic or some SaaS tool."
You won't be.
Ralph is just a bash loop over a CLI. The CLI doesn't matter. It's about the prompt you pass in and the feedback loops in your repo. You can swap out the CLI trivially.
I've done most of my work with Claude Code, but Ralph works equally well with Copilot CLI, with Codex, with OpenCode. If anything, it works better with OpenCode in certain ways.
No hard dependency on a model provider. Ralph is a technique that works across model providers.
You're learning the technique, not buying a tool.
"I don't know how to write good PRDs for agents. I'll be bad at it."
Writing PRDs for humans and agents isn't that different. The main difference? The agent will actually read the PRD. Humans often won't—they'll miss parts or de-emphasize what matters.
PRDs are how you frame large changes. How you build features that span more than one context window. Get a great PRD template, understand how to break it down, and you can use it for refactors, new features, greenfield—it's extremely flexible.
Understanding how to write PRDs puts you in the senior/lead bracket. This was true in the age of human coding. It's even more true in the age of AI coding.
I'm not claiming to make you an expert in 4 hours. PRD writing is wisdom you accrue over time. What I'm giving you are the frames—the structure within which you can experiment and tweak. You'll leave with a template that works and the understanding of why it works.
This is the meta-skill. This is what makes everything else work.
"4 hours isn't enough time to actually implement this."
The hardest part isn't the workshop. It's what comes after—developing intuition for Ralph and understanding how to steer it in your codebase.
This comes down to prompting. Prompting is always the hardest part of any AI system. It's experimental, not like traditional development. You tweak things. You iterate. It took me a solid three days of working with HITL Ralph before I felt comfortable letting it run AFK.
What I'm trying to do is shortcut that experimental phase. I'm giving you a prompt template I believe is a winner—battle-tested across real codebases. You'll iterate on it from there, but you won't be starting from zero.
By 1 PM, you'll have a working system. Not perfect. But running. Ready to generate PRs on your actual backlog.
The foundation will be running TODAY. Then you make it better. Then you scale it. Then you teach it to your team.
"Can't I just watch a YouTube video or read a blog post about this?"
Here's what a blog post can give you:
- The concept explained
- Maybe some code examples
- General best practices
Here's what a blog post CAN'T give you:
- Debugging your specific setup issues
- Answering "what happens if I structure my PRD THIS way instead?" immediately
- Seeing how other developers in the workshop are solving the same problems you're hitting right now
You're not paying for the information.
You're paying for immediate answers to your specific problems, live debugging when things break, and the wisdom that comes from seeing multiple developers working through the same challenges together.
What you need to bring
Technical requirements:
- Laptop with Node.js installed
- GitHub account with a repo you can work on
- Terminal comfort (you'll be running bash commands)
- Any one of these AI coding CLIs installed and working:
- Claude Code
- OpenCode
- Copilot CLI
- Codex
Mental requirements:
- Willingness to debug things that break
- Curiosity about how this actually works under the hood
- Patience for the parts that aren't perfect yet (this is cutting-edge, not battle-tested)
Register for February 11, 9 AM PST
$500 • Limited to 40 developers
We're capping at 40 because interactive means you get real feedback. You'll get screen time to debug YOUR code, not just watch me code.
When you register, you'll receive:
- Calendar invite with Zoom link sent immediately to your email
- Pre-workshop setup instructions so you're ready to code from minute one
- Access to workshop materials and code examples (templates, scripts, the exact bash loop setup)
- Recording of the session in case you need to review anything (but the real value is in the live debugging and Q&A)
Four hours. Real code. Real techniques. Real leverage.
Let's get this running
I've been talking to people using this pattern in production. I've debugged the parts that break. I've figured out the parts that matter.
This looks simple—just run an agent in a loop, right?
But there are lots of ways to screw it up. And lots of ways to make it better.
The PRD structure matters. The feedback loops matter. The way you break down tasks matters.
Those details are what turn "interesting concept" into "actually useful tool."
Once you understand them, you can build your own variations. You can iterate. You can make this work for your specific setup and your specific needs.
You won't be dependent on me or anyone else. You'll just know how it works.
So bring your GitHub issues. Bring your questions. Bring your skepticism.
Let's get this running.
-Matt
P.S. Not sure if this is for you? Ask yourself: "Do I spend more time babysitting AI coding assistants than I'd like?" If yes, this workshop is for you.
P.P.S. Those weird edge cases that still break? We're going to debug those exact scenarios live in the workshop. You'll see what breaks and why. That's how you learn to avoid them in your own PRDs.
February 11th, 2026
Event Details
Date: February 11th, 2026
Time: 9:00 AM — 1:00 PM (Pacific time)
Location: Online (remote) - A calendar invite will be sent to your email when you purchase.
▶Read: 9 Ways AI Coding Has Rewired My Brain
9 Ways AI Coding Has Rewired My Brain
I have been writing all the software I work on with 100% AI-contributed code for a few months now. I tweeted about it, and people have been asking for more details on the nine ways it's changed how I think about coding. Let's break down each one.
1. Way More Time Thinking About Integration Testing
This is literally true even this morning. I was working on a CLI tool I use for teaching lessons and wanted to make updates with an AI agent.
I realized the testing was mostly done by manual QA. The reason? It's very Git-dependent, so I thought I would need to use GitHub to test it properly.
It turns out I was wrong. I can test it completely locally.
Here's what I did:
- Added an end-to-end test suite describing all user stories for how testing should operate
- Built a utility for creating temporary Git environments so the AI could test everything properly
- Ran the entire suite automatically with every change the AI made
Raising test boundaries lets you catch more bugs and work more comfortably with AI agents running code automatically.
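A temporary Git environment utility might look something like the sketch below. This is my own guess at the shape, not the actual utility from the CLI tool: `createTempRepo` and `TempRepo` are hypothetical names, and it assumes Node.js with `git` available on the PATH. Identity and signing are set per command so the test doesn't depend on the machine's global Git config.

```typescript
import { execFileSync } from "node:child_process";
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface TempRepo {
  dir: string;
  git: (...args: string[]) => string; // runs a git command inside the temp repo
  cleanup: () => void;
}

function createTempRepo(): TempRepo {
  // A unique throwaway directory, so tests never touch a real repository.
  const dir = mkdtempSync(join(tmpdir(), "temp-repo-"));
  const git = (...args: string[]): string =>
    execFileSync(
      "git",
      ["-c", "user.name=Test", "-c", "user.email=test@example.com", "-c", "commit.gpgsign=false", ...args],
      { cwd: dir, encoding: "utf8" }
    ).trim();
  git("init", "--quiet");
  return { dir, git, cleanup: () => rmSync(dir, { recursive: true, force: true }) };
}

// Usage: build a repo with one commit, inspect it, then throw it away.
const repo = createTempRepo();
writeFileSync(join(repo.dir, "lesson.md"), "# Lesson 1\n");
repo.git("add", "lesson.md");
repo.git("commit", "--quiet", "-m", "add lesson");
const commitCount = repo.git("rev-list", "--count", "HEAD");
const hadGitDir = existsSync(join(repo.dir, ".git"));
repo.cleanup();
console.log(commitCount, hadGitDir);
```

Because everything lives under the OS temp directory, an agent can run the whole suite locally after every change without needing GitHub.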
2. Friction Via Pre-Commit Hooks, CI, and Strong Types is Super Desirable
Feedback loops are super important. They give the agent actual context about what's working and what's not in the real world.
Every single change the AI makes should trigger your pre-commit hooks, CI, and type checking so bugs get caught immediately. The more immediate the feedback, the better decisions the agent can make.
3. AI Has No Taste for UI - Prototype Extremely Aggressively
You see demos of AI creating beautiful one-shot UIs all the time. But AI struggles with iterating on existing brownfield UI.
Before committing to a PRD:
Ask the LLM for five different options for the UI change. Put them on throwaway routes so you can look at them. Iterate on the prototype without touching real code. Once you land on something you like, the AI can implement it properly.
4. AI Has No Taste for Software Architecture
Bad codebases have lots of shallow modules with big interfaces. Good codebases have a few big modules with simple interfaces.
Deep vs shallow modules:
- Deep modules: tiny interface with lots of implementation
- Shallow modules: big interface with little implementation
Deep modules are super important for this approach. They're easier to test and easier for the AI to work on without you needing to understand every implementation detail.
5. Deep, Grey-Box Modules with Simple Interfaces Are the King
If a module is deep enough, it becomes really easy to test around the box without worrying about what's inside.
Break your codebase into large modules and test at the boundaries. Then leave the implementation to the AI. I call them "grey-box modules" because you can look inside if you want to, but you're not really supposed to. Test at the right boundaries and you can ignore what's inside.
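As a toy illustration of testing at the boundary, here is a module whose whole interface is one function. `parseDuration` is a hypothetical example, not something from the post. The tokenising, unit table, and validation are all hidden behind the single call, so the tests only describe inputs and outputs.

```typescript
// Implementation detail: callers never see this table.
const UNIT_MS: { [unit: string]: number } = { ms: 1, s: 1000, m: 60000, h: 3600000 };

// The entire interface: a duration string in, milliseconds out.
function parseDuration(input: string): number {
  const pattern = /(\d+)(ms|s|m|h)/g;
  let total = 0;
  let consumed = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    total += Number(match[1]) * UNIT_MS[match[2]];
    consumed += match[0].length;
  }
  // Reject anything that was not fully made of "<number><unit>" parts.
  const compact = input.replace(/\s+/g, "");
  if (compact.length === 0 || consumed !== compact.length) {
    throw new Error('Invalid duration: "' + input + '"');
  }
  return total;
}

// Boundary tests: they stay valid even if the parsing is rewritten from scratch.
const ninetyMinutes = parseDuration("1h 30m");
const halfSecond = parseDuration("500ms");
console.log(ninetyMinutes, halfSecond);
```

A shallow version of the same feature would export the tokenizer, the unit lookup, and the summing step separately, and every caller and test would have to know how they fit together.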
6. Use Effect.ts for Dependency Injection
Effect has a first-class concept called services - reusable components that encapsulate common tasks across your application.
They're complex, deep modules with simple interfaces. If you're building backend in TypeScript, I really cannot recommend Effect enough. It makes this pattern incredibly straightforward and has been invaluable for my work with AI agents.
7. Much More Meta-Programming
I'm always thinking about how to make my agent run automatically. This means defining my own processes and figuring out what I do.
Building features is simple:
- Add the tests
- Build the feature
- Run the tests and types
- Commit
But what about everything else? Triaging issues, backlog pruning, task prioritization - these are all things you can delegate to AI or make the grunt work automatic while retaining control.
8. Beware of Doc Rot
Lots of people stuff their repos with markdown docs. Every time the LLM searches for something, it finds docs that might be outdated.
You end up with "doc rot" - where the codebase and documentation have diverged. The LLM doesn't know which to trust. Let the AI generate its own docs during the exploration phase instead. Those docs never go out of date because they're just-in-time generated.
9. Much Higher Cognitive Load to Keep Up with Changes
This is true. But deep grey-box modules help by letting you trust the tests and give a cursory glance without understanding every detail.
I've noticed less cognitive load when I do this. I'm also mostly not parallelizing - the stuff I'm building doesn't need multiple agents running in parallel. But if you were running four or five projects in parallel, I could see that being gnarly.
That's how AI has rewired my brain - thinking more about testing, the shape of modules, my own processes, being skeptical of docs, and reducing cognitive load.
How has AI coding rewired your brain? What are you paying more attention to now than you used to?
▶Read: STORM: Generate High-Quality Articles Based On Real Research
STORM: Generate High-Quality Articles Based On Real Research
I came across a really cool paper today. It's trying to address a really hard problem: writing really good articles from scratch.
It's called STORM. Not only is it a cool name, but it's a really great example of a very complex AI workflow.
So while you may not be interested in the domain of generating articles, STORM is a great paper to look at if you want to build better AI workflows.
Let's take a look.
- git-guardrails-claude-code — Skill page on agentskills.so: https://agentskills.so/skills/mattpocock-skills-git-guardrails-claude-code. No dedicated post — mentioned in YouTube "Commands vs MCP vs Skills".
- github-triage — https://x.com/mattpocockuk/status/2045110469426323900 ("triage issues with a state machine-based labelling system").
- grill-me — https://www.aihero.dev/my-grill-me-skill-has-gone-viral (already scraped) + LinkedIn https://www.linkedin.com/posts/mapocock_my-grill-me-skill-went-viral-mattpocock-activity-7441843415232892928-e2RJ.
- improve-codebase-architecture — https://x.com/mattpocockuk/status/2033647563627212953 (one of his daily-five). README cites CONTEXT.md + docs/adr/ as input.
- migrate-to-shoehorn — No dedicated post; tied to @total-typescript/shoehorn library docs.
- obsidian-vault — No dedicated post; referenced in his note-taking workflow (TS workshop appearance).
- qa — No dedicated post; mentioned in "Building a REAL feature with Claude Code" video as the manual-QA step at end of workflow.
- request-refactor-plan — No dedicated post; discussed in YouTube "Red Green Refactor is OP With Claude Code" (https://www.youtube.com/watch?v=hYZdIwFIy-c).
- scaffold-exercises — Internal AI Hero exercise tooling; referenced in ai-hero-cli README. No public deep-dive.
- setup-pre-commit — https://www.aihero.dev/securing-your-ai-app-with-guardrails (already scraped) is the closest thematic post.
- tdd — https://www.aihero.dev/skill-test-driven-development-claude-code (already scraped) + https://www.youtube.com/watch?v=hYZdIwFIy-c "Red Green Refactor is OP".
- to-issues — https://x.com/mattpocockuk/status/2024874219662905676 ("Idea → PRD → Kanban Board → Ralph"). Step 2 of his core workflow.
- to-prd — Same workflow tweet as to-issues. Step 1 of his core workflow.
- triage-issue — https://x.com/mattpocockuk/status/2045110469426323900 (companion to github-triage; this one creates issues with TDD-based fix plan).
- ubiquitous-language — No dedicated post; ties into domain-model skill DDD lineage. Cite domain-model tweet.
- write-a-skill — Implicit from skills repo README and from the YouTube "Commands vs MCP vs Skills" video (https://www.youtube.com/watch?v=xAIN7YHXfCY).
- zoom-out — No dedicated post; one-line citation = SKILL.md + workshop transcript context.
- mattpocock/sandcastle repo + README — fills the "sandcastle" gap entirely; only canonical artefact. https://github.com/mattpocock/sandcastle
- YouTube "Commands vs MCP vs Skills (What I Use)" — https://www.youtube.com/watch?v=xAIN7YHXfCY — single best resource for primitives gap (commands, MCP, skills, subagents, hooks, plugins, memories all in one).
- YouTube "AI Coding Agents Advanced Guide: Subagents, Skills, MCP" — https://www.youtube.com/watch?v=DAaw7Ao_zUc — direct hit on subagents gap.
- Twitter subagent thread — https://x.com/mattpocockuk/status/1976313665407099187 — concrete plan-mode-with-subagents recipe; pairs with design-an-interface skill.
- mattpocock/course-video-manager repo — https://github.com/mattpocock/course-video-manager — real-world Ralph-loop target codebase; pairs with the "Building a REAL feature" video below.
- YouTube "Building a REAL feature with Claude Code: every step explained" — https://www.youtube.com/watch?v=hX7yG1KVYhI — end-to-end research/prototyping/QA walkthrough on a real codebase. Fills research-phase + prototyping-phase gaps simultaneously.
- YouTube "Claude Code: Master Multi-Phase AI Coding Plans" — https://www.youtube.com/watch?v=_gNgJiICLzs — checkpoints/memory across context windows. Fills checkpoints + memory gaps.
- mattpocock/ai-engineer-workshop-2026-project (Cadence) — https://github.com/mattpocock/ai-engineer-workshop-2026-project — student-facing exercise codebase that pairs with the workshop transcript.
- YouTube "How the Ralph Technique Works (Full Tutorial)" — https://www.youtube.com/watch?v=6w65b8HWU8g — deepest standalone Ralph teaching after the viral thread.
- Evalite docs — https://www.evalite.dev/ — fills the eval/regression-testing gap; complements TDD skill.
- "Red Green Refactor is OP With Claude Code" — https://www.youtube.com/watch?v=hYZdIwFIy-c (TDD reinforcement)
- AGENTS.md complete guide LinkedIn post — https://www.linkedin.com/posts/mapocock_a-complete-guide-to-agentsmd-activity-7418690212379148288-3nXR (the post itself links to https://www.aihero.dev/a-complete-guide-to-agents-md which should already be in the 89 scraped posts)
- mattpocock/poland-ai-talk repo — multi-agent + memory + human-in-the-loop reference patterns
- mcp-server-from-prompts — though empty-README, the repo source is ~the only Matt artefact for "MCP server authoring as a coding pattern"
- Permissions — Matt has not published a dedicated post or video on Claude Code's permissions / allowedTools / disallowedTools model. Only oblique reference is the licensing tweet (https://x.com/mattpocockuk/status/2040536403289764275). Will need Anthropic docs (https://code.claude.com/docs/en/auto-mode and https://code.claude.com/docs/en/sub-agents) to fill this gap — supplement Matt with primary docs.
- "Do-work" skill — searched, no public Matt artefact under that exact name. Likely cohort-only material; closest public proxy is the tdd + improve-codebase-architecture skills + the Ralph loop driver.
- Memory deep dive — no standalone Matt post; only references inside multi-phase-plans video and poland-ai-talk repo. Supplement with Anthropic memory docs.
- Anthropic / Docker official: Claude Code with Docker: Local Models, MCP, Sandboxes — https://www.docker.com/blog/run-claude-code-with-docker/ and Docker Sandboxes: Run Claude Code and More Safely — https://www.docker.com/blog/docker-sandboxes-run-claude-code-and-other-coding-agents-unsupervised-but-safely/. The new Docker microVM-backed sandboxes are the closest "official" take.
- Luke McConnell, Running Claude Code in Docker Sandbox: A Complete Guide — https://www.mcconnellweb.com/posts/claude-code-docker-sandbox/. Workmanlike Dockerfile + docker-compose walkthrough with the security gotchas (don't mount /var/run/docker.sock, use a non-root user, pin the credentials volume).
- RchGrav/claudebox repo — https://github.com/RchGrav/claudebox. Profile-based, reproducible Claude Code dev container; works as a paste-and-go template.
- textcortex/claude-code-sandbox (archived, see Spritz) — https://github.com/textcortex/claude-code-sandbox. Original "no-permission-prompts in a container" recipe.
Deep path
Exercise 2.1 — The five primitives
In a single folder module-2/primitives/, build five tiny scripts. Each must be under 30 lines:
Deep path
Exercise 3.1 — One-file MCP server
In module-3/mcp-cv-server/:
Deep path
Exercise 4.1 — Three-type eval
Take your 02-object.ts from Module 2 (the CV paragraph extractor). In module-4/evals/:
Deep path
Exercise 5.1 — Think-tool implementation
Port Anthropic's "think" tool into your own AI SDK script. The tool does nothing — it just lets the model write reasoning into a structured field. Follow aihero.dev-implementing-anthropics-think-tool-in-typescript.md for the interface, then write it yourself without peeking.
Check: Script runs, model uses think before issuing other tool calls, traces visible.
Reflect: In which of your previous AI SDK scripts would a think tool have changed the output? Why?
Exercise 5.2 — Decomposition by hand
Pick a research question from your own consultancy-sector work. Example: "Which UK economics consultancies have grown headcount fastest 2021–2025?"
In module-5/decompose.md, write:
Deep path (the old-book reading list — this is where the real value sits)
Matt closes the keynote with "buy a ton of the old books". Here's the short list, in order of what the keynote explicitly cites:
| Book | Why Matt cites it | Concept you'll recognise in his skills |
|---|---|---|
| John Ousterhout — A Philosophy of Software Design | Definition of complexity; deep vs shallow modules | improve-codebase-architecture, design-an-interface |
| Andy Hunt & Dave Thomas — The Pragmatic Programmer | Software entropy; tracer bullets; "rate of feedback is your speed limit"; outrunning your headlights | tdd, tracer-bullets post |
| Frederick P. Brooks — The Design of Design | Design concept; design tree | grill-me |
| Eric Evans — Domain-Driven Design | Ubiquitous language | ubiquitous-language |
| Kent Beck — TDD By Example (and various essays) | Invest in design every day | tdd |
| Martin Fowler — Refactoring | Small tasks; refactoring under tests | Whole architecture module |
You don't need to read them cover to cover. Chapter 2 of Ousterhout and the "Before the Project" chapter of The Pragmatic Programmer (chapter 7 in the first edition, chapter 8 in the 20th Anniversary Edition) are the most load-bearing for this syllabus.
Exercise 6.1 — Deep/shallow audit
Pick a small TypeScript project you own (maybe ~/dewey or ~/mcp-apple-notes). In module-6/audit.md:
Deep path
Exercise 7.1 — Status line that actually tells you something
Build a statusline.sh (or JSON config) showing: model, current token usage, working dir, and git branch. Wire it up via settings.json.
Check: claude session shows the line; token count updates as you work. Reflect: What did you learn about your sessions by watching the count?
Exercise 7.2 — Hooks as guardrails
Install the git-guardrails-claude-code skill if you haven't. Then add one extra guardrail of your own — e.g., block npm install -g (global installs), or require confirmation before rm -rf anywhere under your home dir.
Check: Hook triggers in a test command. Reflect: Which dangerous thing have you ever actually done in a Claude Code session? Guard that.
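If you want a starting point for the decision logic, here is a hedged sketch. It is not the git-guardrails-claude-code skill's code; `RULES`, `checkCommand`, and `decideExitCode` are hypothetical names. It assumes, based on my reading of the Claude Code hooks documentation, that a PreToolUse hook receives JSON on stdin with the shell command at `tool_input.command`, and that exiting with code 2 blocks the call. Verify both against the current docs before relying on them. The rm rule is also broader than the exercise asks for: it flags any `rm -rf`, not only paths under your home dir.

```typescript
interface Rule {
  pattern: RegExp;
  reason: string;
}

const RULES: Rule[] = [
  {
    // npm install / i / add with -g or --global anywhere before a pipe or separator
    pattern: /\bnpm\s+(?:install|i|add)\b[^|;&]*\s(?:-g|--global)(?:\s|$)/,
    reason: "global npm installs are blocked",
  },
  {
    // rm with a short-flag group containing both r (or R) and f, e.g. -rf, -fr, -Rf
    pattern: /\brm\s+-(?=[a-zA-Z]*[rR])(?=[a-zA-Z]*f)[a-zA-Z]+/,
    reason: "recursive force delete needs manual confirmation",
  },
];

// Returns the reason a command is blocked, or null if it is allowed.
function checkCommand(command: string): string | null {
  for (const rule of RULES) {
    if (rule.pattern.test(command)) return rule.reason;
  }
  return null;
}

// Assumed hook contract (see lead-in): command at tool_input.command, exit code 2 blocks.
function decideExitCode(rawHookInput: string): number {
  const parsed = JSON.parse(rawHookInput);
  const command =
    parsed && parsed.tool_input && typeof parsed.tool_input.command === "string"
      ? parsed.tool_input.command
      : "";
  return checkCommand(command) === null ? 0 : 2;
}

const blocked = decideExitCode(
  JSON.stringify({ tool_name: "Bash", tool_input: { command: "npm install -g typescript" } })
);
const allowed = decideExitCode(
  JSON.stringify({ tool_name: "Bash", tool_input: { command: "npm test" } })
);
console.log(blocked, allowed);
```

Keeping the rules in a pure function means you can unit test the guardrail without starting a Claude Code session. Long-form flags such as `--recursive --force` are not covered by this sketch.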
Exercise 7.3 — Write your own skill
Pick one repetitive prompt you've typed into Claude at least three times in the past month. Use the write-a-skill skill (you have it) to turn it into a proper skill. Save to ~/.claude/skills/<your-skill>/SKILL.md. Test it in a fresh session.
Check: Skill appears in the available-skills list in a new session. Reflect: Is it actually a skill, or just a prompt template? (Skills should feel verb-shaped, not noun-shaped.)
Exercise 7.4 — AGENTS.md discipline
Read aihero.dev-my-agents-md-file-for-building-plans-you-actually-read.md and aihero.dev-a-complete-guide-to-agents-md.md. Then audit your current CLAUDE.md in /Users/leath/CLAUDE.md. In module-7/agents-md-audit.md:
Priorities for inclusion in syllabus
Top 10 discoveries ranked by gap-fill value (gaps named: permissions, checkpoints, subagents, memory, sandcastle, do-work skill, research phase, prototyping phase):
Honourable mentions just outside top 10:
Genuine remaining gaps (not findable from public Matt content alone)
Community Gap Coverage — Filling Matt Pocock's Cohort-Only Topics
A self-study syllabus of the best public substitutes for the topics Matt teaches in Claude Code for Real Engineers but underweights on his free blog. One H2 per topic. Where possible: link, do not paraphrase.
Conventions: "Best single piece" is a recommendation to read first. "Runners-up" widen the lens. "Recipe" is one paste-able artefact you can lift into a project today.
1. Sandboxing autonomous Claude Code with Docker
Best single piece: Geoffrey Huntley, how-to-ralph-wiggum / sandbox-environments.md — https://github.com/ghuntley/how-to-ralph-wiggum/blob/main/references/sandbox-environments.md. The only practitioner doc that compares E2B, Sprites/Fly, Modal, and local Docker side-by-side with cold-start times, session limits, and isolation level (Firecracker microVM vs gVisor vs container) so you can pick the right floor for your trust budget. Pair it with Huntley's "everything is a ralph loop" for context.
Runners-up:
Recipe (minimum-viable docker-compose for a Ralph loop):
services:
  claude-sandbox:
    build: .
    volumes:
      - ./workspace:/workspace
      - claude-creds:/home/agent/.claude # persists login between runs
    environment:
      - ANTHROPIC_API_KEY
    working_dir: /workspace
    stdin_open: true
    tty: true
    network_mode: bridge # NOT host; never mount /var/run/docker.sock
volumes:
  claude-creds:
Run docker compose run --rm claude-sandbox bash -lc 'while :; do cat PROMPT.md | claude --dangerously-skip-permissions; done'.
This is essentially the Huntley pattern, but containerised.
2. Multi-phase plans / splitting work across context windows
Best single piece: Dex Horthy on Dev Interrupted, Ralph, RPI, and escaping the "Dumb Zone" — https://linearb.io/dev-interrupted/podcast/dex-horthy-humanlayer-rpi-methodology-ralph-loop. RPI = Research → Plan → Implement, each in a fresh context window with a hand-off file between them. Dex's framing ("you cannot outsource the thinking; you can outsource the volume") is the cleanest articulation of why phases must be human-gated.
Runners-up:
- MindStudio: GSD (Get Stuff Done) Framework for Claude Code — https://www.mindstudio.ai/blog/gsd-framework-claude-code-clean-context-phases. Three explicit phases (plan / execute / review), each with its own clean window. Closest direct analogue to Matt's pattern.
- Ilyas Ibrahim, The 4-Step Protocol That Fixes Claude Code's Context Amnesia — https://medium.com/@ilyas.ibrahim/the-4-step-protocol-that-fixes-claude-codes-context-amnesia-c3937385561c. Concrete hand-off file structure (current state, decisions made, next-step files).
- HumanLayer, 12-Factor Agents — https://github.com/humanlayer/12-factor-agents. Factor 3 ("Own Your Context Window") and Factor 10 ("Small Focused Agents") are the doctrinal basis for phase-splitting.
Recipe (hand-off file template):
# PHASE_N_HANDOFF.md
## Goal of next phase
<single sentence>
## Current repo state
- branch: ...
- last green test: ...
## Decisions locked-in
- chose X over Y because Z
## Files the next agent must read first
- src/...
- specs/...
## Out of scope
- ...
## Open questions for the human
- ...
Each phase ends by writing this file; the next phase starts with cat PHASE_N_HANDOFF.md as its first action.
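If hand-off files are written by a script rather than by the agent itself, the template can be rendered from structured data so that no section is silently dropped. This is a rough sketch under that assumption; the `Handoff` dataclass, its field names, and the "(none)" placeholder are illustrative choices, not part of any cited tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Handoff:
    phase: int
    goal: str
    branch: str
    last_green_test: str
    decisions: List[str] = field(default_factory=list)
    read_first: List[str] = field(default_factory=list)
    out_of_scope: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)


def _bullets(items: List[str]) -> str:
    # An explicit "(none)" keeps an empty section from looking like a forgotten one.
    return "\n".join(f"- {item}" for item in items) if items else "- (none)"


def render_handoff(h: Handoff) -> str:
    """Render the hand-off template with every section present, in template order."""
    return "\n".join([
        f"# PHASE_{h.phase}_HANDOFF.md",
        "## Goal of next phase",
        h.goal,
        "## Current repo state",
        f"- branch: {h.branch}",
        f"- last green test: {h.last_green_test}",
        "## Decisions locked-in",
        _bullets(h.decisions),
        "## Files the next agent must read first",
        _bullets(h.read_first),
        "## Out of scope",
        _bullets(h.out_of_scope),
        "## Open questions for the human",
        _bullets(h.open_questions),
    ]) + "\n"
```

Writing the result is then a single `Path(f"PHASE_{h.phase}_HANDOFF.md").write_text(render_handoff(h))` call at the end of each phase.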
3. Research-phase recipes — caching external context as research.md
Best single piece: Geoffrey Huntley, Ralph Wiggum as a "software engineer" — https://ghuntley.com/ralph/. Defines the canonical file set: PROMPT.md, fix_plan.md, AGENT.md, specs/*, plus the rule "search the codebase before assuming" with parallel sub-agents. The closest thing to a research-cache template in the wild.
Runners-up:
- GitHub spec-kit — https://github.com/github/spec-kit/blob/main/spec-driven.md. Standardises research.md, data-model.md, plan.md, tasks.md as artefacts with templates.
- Addy Osmani, How to write a good spec for AI agents — https://addyosmani.com/blog/good-spec/. What goes in the research section vs the spec section vs the plan.
- agentmd/agent.md standard — https://github.com/agentmd/agent.md. Cross-tool standardisation of the always-loaded research/context file.
Recipe (research.md skeleton for a brownfield feature):
# Research — <feature>
## External APIs / libraries
- <name> — version, auth pattern, rate limits, gotchas, primary doc URL
## Codebase prior art
- <module>: handles <similar concern>, see file:line
## Domain glossary
- term — definition
## Constraints
- compliance, perf, dependency
## Unknowns (must resolve before Plan phase)
- ...
## Open tabs (URLs the human/agent already read)
- ...
Generate this with a research sub-agent, then commit it. The Plan phase loads it instead of re-fetching.
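The skeleton says unknowns must be resolved before the Plan phase. One way to enforce that is a small gate that parses research.md and lists blockers. The sketch below is illustrative only: `plan_blockers` and the choice of which sections count as required are assumptions, and it treats any bullet other than a literal "..." placeholder as real content.

```python
import re
from typing import Dict, List

UNKNOWNS = "Unknowns (must resolve before Plan phase)"
# Assumption: "Open tabs" is optional, everything else in the skeleton is required.
REQUIRED_SECTIONS = (
    "External APIs / libraries",
    "Codebase prior art",
    "Domain glossary",
    "Constraints",
    UNKNOWNS,
)


def parse_sections(markdown: str) -> Dict[str, List[str]]:
    """Map each '## ' heading to its non-placeholder bullet items."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and line.strip().startswith("- "):
            item = line.strip()[2:].strip()
            if item and item != "...":
                sections[current].append(item)
    return sections


def plan_blockers(markdown: str) -> List[str]:
    """Return reasons the Plan phase should not start yet (empty list means go)."""
    sections = parse_sections(markdown)
    blockers = [
        f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in sections
    ]
    blockers += [f"unresolved unknown: {item}" for item in sections.get(UNKNOWNS, [])]
    return blockers
```

A pre-plan script could call `plan_blockers(Path("research.md").read_text())` and refuse to start the planner while the list is non-empty.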
4. Prototyping three throwaway UIs and picking one
Best single piece: Dan Does Code, "Parallel Vibe Coding: Using Git Worktrees with Claude Code" — https://www.dandoescode.com/blog/parallel-vibe-coding-with-git-worktrees. Concrete: git worktree add per variant, run an agent in each, diff the results, keep the winner. Generalises to UI variants without buying into v0/Lovable.
Runners-up:
- Armin Ronacher, Building an Agent That Leverages Throwaway Code — https://lucumr.pocoo.org/2025/10/17/code/. Throwaway code as a first-class artefact, not waste; sequential rather than parallel, but the philosophy carries.
- Anna Arteeva, Choosing your AI prototyping stack: Lovable, v0, Bolt, Replit, Cursor, Magic Patterns compared — https://annaarteeva.medium.com/choosing-your-ai-prototyping-stack-lovable-v0-bolt-replit-cursor-magic-patterns-compared-9a5194f163e9. If you'd rather farm the variants out to a hosted tool, this is the comparison.
- MindStudio: Three Explorers Plus One Critic — https://www.mindstudio.ai/blog/claude-code-ultra-plan-multi-agent-architecture. Three parallel implementer agents + one reviewer; identical pattern to "throwaway three, pick one".
Recipe (three-variant UI bake-off):
for v in minimal dense playful; do
  # -b gives each variant its own branch; checking out main a second time fails while it is checked out elsewhere
  git worktree add -b "proto-$v" "../proto-$v" main
  (cd "../proto-$v" && claude -p "Build a throwaway $v variant of the dashboard at app/dashboard. \
Optimise for vibe and speed; ignore architecture. Commit when done.") &
done
wait
# Then: open all three in dev servers on different ports, demo, pick one, and drop the others with git worktree remove.
5. "Do-work" skill pattern (test → fix → retry, in-loop)
Best single piece: Lathesh Karkera, The Complete Guide to Testing Claude Code Skills With the Skill Creator — https://medium.com/@karkeralathesh/the-complete-guide-to-testing-claude-code-skills-with-the-skill-creator-1ae3821bd7b8. Walks through a SKILL.md whose entire job is "run tests, parse failures, fix, retry until green". Concrete and short.
Runners-up:
- MindStudio: Karpathy's AutoResearch Pattern Applied to Claude Code Skills — https://www.mindstudio.ai/blog/karpathy-autoresearch-pattern-claude-code-skills. Self-eval-and-improve loop encoded as a skill.
- Mager.co, Claude Code: How to Write, Eval, and Iterate on a Skill — https://www.mager.co/blog/2026-03-08-claude-code-eval-loop/. Eval harness for skills; same loop applied to the skill itself.
- TestDino, Playwright Skill for Claude Code: 82 E2E Tests — https://testdino.com/blog/playwright-skill-claude-code/. Real-world end-to-end example of the loop running against a browser.
Recipe (skills/do-work/SKILL.md):
---
name: do-work
description: Execute a task and self-verify. Run tests; on fail, fix and retry up to 3 times.
---
# Do Work
1. Plan the change in one sentence.
2. Make the change.
3. Run `npm test -- --run`.
4. If failing:
   a. Read the failure output.
   b. Identify the smallest fix.
   c. Apply it. Goto 3.
5. Stop after 3 fix attempts; surface the failure to the human with a diagnosis.
6. On green, commit with a Conventional Commit message.
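The control flow the skill describes (test, fix, retest, give up after three fixes) can be sketched as a plain function. This is an illustration of the loop's shape, not how Claude Code executes a skill: `do_work`, `SuiteResult`, and the `apply_fix` callback are hypothetical names, and `npm_test` assumes the same `npm test -- --run` command as the recipe.

```python
import subprocess
from typing import Callable, NamedTuple, Tuple


class SuiteResult(NamedTuple):
    passed: bool
    output: str


def npm_test() -> SuiteResult:
    """Run the recipe's test command and capture its output for diagnosis."""
    proc = subprocess.run(
        ["npm", "test", "--", "--run"], capture_output=True, text=True
    )
    return SuiteResult(proc.returncode == 0, proc.stdout + proc.stderr)


def do_work(
    run_tests: Callable[[], SuiteResult],
    apply_fix: Callable[[str], None],
    max_fix_attempts: int = 3,
) -> Tuple[bool, int, str]:
    """Return (passed, fix_attempts_used, last_test_output)."""
    result = run_tests()
    attempts = 0
    while not result.passed and attempts < max_fix_attempts:
        apply_fix(result.output)  # steps 4a-4c: read the failure, apply the smallest fix
        attempts += 1
        result = run_tests()      # "Goto 3"
    # On failure the caller surfaces result.output to the human (step 5).
    return result.passed, attempts, result.output
```

Note that three fix attempts mean up to four test runs: the initial run plus one after each fix.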
6. The Ralph loop beyond Matt — Dex Horthy / HumanLayer and others
Best single piece: Geoffrey Huntley, everything is a ralph loop — https://ghuntley.com/loop/. The original. Single-process monolith, "one thing per loop", while :; do cat PROMPT.md | claude-code; done. Compulsory reading even if you end up rejecting the style.
Runners-up:
- Dex Horthy / HumanLayer, Ralph loops make agentic coding reliable with ruthless context resets — https://linearb.io/blog/dex-horthy-humanlayer-rpi-methodology-ralph-loop. Dex does not run Ralph the way Huntley does — he runs RPI, with humans in between phases. Reading the two side-by-side is the syllabus.
- Frank Bria, frankbria/ralph-claude-code — https://github.com/frankbria/ralph-claude-code. Ralph with intelligent exit detection (stops when done; doesn't burn API forever).
- Vercel Labs, ralph-loop-agent — https://github.com/vercel-labs/ralph-loop-agent. Same idea ported to AI SDK; useful for cross-checking the abstraction.
- Snwfdhmp, awesome-ralph — https://github.com/snwfdhmp/awesome-ralph. Curated list; saves repeat searches.
How Dex differs from Matt/Huntley: Huntley's Ralph trusts eventual consistency and burns tokens. Dex's RPI inserts a human checkpoint between Research, Plan, and Implement to prevent "dumb-zone drift". Matt's cohort version sits between the two: bash-loop mechanics, but with multi-phase hand-offs.
7. Smart zone vs dumb zone — the original Dex writeup
Best single piece: Dex Horthy, No Vibes Allowed: Solving Hard Problems in Complex Codebases (HumanLayer talk) — https://bagrounds.org/videos/no-vibes-allowed-solving-hard-problems-in-complex-codebases-dex-horthy-humanlayer and the Dev Interrupted podcast that summarises it: https://linearb.io/dev-interrupted/podcast/dex-horthy-humanlayer-rpi-methodology-ralph-loop. The talk is where "dumb zone" enters the lexicon: above ~40% context utilisation, models hallucinate, loop, and emit malformed tool calls; below ~40% they stay sharp. Matt's reference traces back here.
Runners-up:
- HumanLayer blog, Skill Issue: Harness Engineering for Coding Agents — https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents. Operationalises the smart/dumb framing with sub-agents as "context firewalls" and condensed sub-agent responses.
- Chroma Research's context-rot study (cited inside HumanLayer's piece) — empirical backbone for the 40% threshold.
- AINEXT (Chinese-language), Context Engineering 實戰指南 ("Context Engineering: A Practical Guide") — https://ainext.tw/posts/20251225-context-engineering-coding-agents/. Same concept, different audience; useful counter-source.
Recipe (smart-zone budget):
- Cap any single agent's task at ~80k tokens of context (≈40% of Sonnet's 200k window).
- When you cross 60k, summarise to a hand-off file and start a fresh agent.
- Sub-agent responses must be ≤500 tokens with file:line citations, never raw output.
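The three budget rules reduce to a couple of threshold checks. The sketch below encodes them with the recipe's numbers as defaults; the function names are hypothetical, the token counts are assumed to come from whatever usage figures your harness reports, and the citation regex is a loose approximation of "path.ext:line".

```python
import re

CONTEXT_WINDOW = 200_000  # window size used in the recipe
HANDOFF_AT = 60_000       # crossing this triggers a hand-off
HARD_CAP = 80_000         # roughly 40% of the window

# Loose match for a file:line citation such as src/billing/retry.ts:42
CITATION = re.compile(r"[\w./-]+\.\w+:\d+")


def utilisation(tokens_used: int, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window consumed so far."""
    return tokens_used / window


def smart_zone_action(tokens_used: int) -> str:
    """Return 'continue', 'handoff' (summarise and restart), or 'stop'."""
    if tokens_used >= HARD_CAP:
        return "stop"
    if tokens_used > HANDOFF_AT:
        return "handoff"
    return "continue"


def subagent_reply_ok(text: str, token_count: int, limit: int = 500) -> bool:
    """Accept a sub-agent reply only if it is short and cites at least one file:line."""
    return token_count <= limit and CITATION.search(text) is not None
```

An orchestrator would call `smart_zone_action` after each turn and reject any sub-agent reply for which `subagent_reply_ok` is false, asking for a condensed version instead.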
8. Parallel-agent patterns — Sand Castle equivalents
Best single piece: Patrick D'appollonio, How to run multiple Claude Code or Codex agents in parallel against a single codebase — https://www.patrickdap.com/post/how-to-run-multiple-agents/. Pragmatic: git-worktree-per-agent, tmux panes, what breaks (file locks, dependency installs, port collisions) and how to avoid it. Closest non-Matt equivalent to Sand Castle's value prop.
Runners-up:
- johannesjo/parallel-code — https://github.com/johannesjo/parallel-code. Run Claude, Codex, and Gemini side-by-side, each in its own worktree. Useful for adversarial-review patterns.
- Brendan Redmond, Multi-Agent Orchestration: Running 10+ Claude Instances in Parallel — https://dev.to/bredmond1019/multi-agent-orchestration-running-10-claude-instances-in-parallel-part-3-29da. Scales the pattern up; planner/implementer/reviewer separation.
- Dogukan Uraz Tuna, Mastering Git Worktrees with Claude Code — https://medium.com/@dtunai/mastering-git-worktrees-with-claude-code-for-parallel-development-workflow-41dc91e645fe. Detailed worktree mechanics.
- spillwavesolutions/parallel-worktrees skill — https://github.com/spillwavesolutions/parallel-worktrees. A SKILL.md that wraps the worktree pattern.
Recipe (planner / implementer / reviewer split):
# planner runs in main checkout, writes plan.md
claude -p "Read research.md, produce plan.md broken into 5 phases."
# commit the plan so the worktrees created below actually contain it
git add plan.md && git commit -m "docs: add plan.md"
# spawn an implementer per phase in its own worktree, each on its own branch
for i in 1 2 3 4 5; do
  git worktree add -b "impl-phase-$i" "../impl-phase-$i" main
  (cd "../impl-phase-$i" && claude -p "Implement phase $i of plan.md. Commit your work, then stop and write PHASE_${i}_HANDOFF.md when done.") &
done; wait
# reviewer runs in a fresh worktree, detached at the tip of each phase's branch
# (the branch itself is still checked out in the implementer's worktree)
for i in 1 2 3 4 5; do
  git worktree add --detach "../review-phase-$i" "impl-phase-$i"
  (cd "../review-phase-$i" && claude -p "Review phase $i. Be adversarial. Find bugs and security issues.") &
done; wait
9. Push vs pull — coding standards as always-on vs on-demand
Best single piece: Termdock, SKILL.md vs CLAUDE.md vs AGENTS.md Compared — https://www.termdock.com/blog/skill-md-vs-claude-md-vs-agents-md. Cleanest articulation of the dichotomy: AGENTS.md / CLAUDE.md are always-loaded (push: every session pays the token cost), Skills are on-demand (pull: discovered via frontmatter, loaded only when relevant). Decision rule: if it applies to ≥80% of tasks, push; otherwise pull.
Runners-up:
- Simon Willison, Claude Skills are awesome, maybe a bigger deal than MCP — https://simonwillison.net/2025/Oct/16/claude-skills/. The on-demand mechanism explained: "a few dozen extra tokens" for frontmatter, full body only when activated.
- Amit Ray, Claude.md vs Agents.md vs Memory.md, Skills.md, Context.md & The Hierarchy (2026 Guide) — https://amitray.com/claude-md-vs-agents-md-memory-md-skills-md-context-md-guide-2026/. Maps every "instruction file" type onto the push/pull axis.
- Matthew Groff, Implementing CLAUDE.md and Agent Skills In Your Repository — https://www.groff.dev/blog/implementing-claude-md-agent-skills. Concrete migration: starts everything in CLAUDE.md, demotes the rarely-used parts to Skills.
Recipe (decision flowchart):
Does the rule apply to ≥80% of tasks in this repo?
  yes -> CLAUDE.md / AGENTS.md (push, always-on)
  no  -> SKILL.md with discoverable description (pull, on-demand)
Does the rule require the model to STOP or BLOCK something?
  yes -> hook (deterministic, not push/pull)
Does the rule require external state, network, or scripts?
  yes -> MCP server
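The flowchart can be read as a small decision function. The sketch below is one possible encoding, using the ≥80% threshold from the Termdock rule. The precedence (hook first, then MCP, then push/pull) is an assumption made here to get a single answer per rule; the flowchart itself does not say how to combine the three questions.

```python
def choose_mechanism(
    task_share: float,
    must_block: bool = False,
    needs_external: bool = False,
) -> str:
    """Pick where a rule should live.

    task_share: fraction of tasks in the repo the rule applies to (0.0 to 1.0).
    must_block: the rule has to stop or block an action deterministically.
    needs_external: the rule depends on external state, network, or scripts.
    """
    # Assumed precedence: enforcement beats integration, which beats instruction.
    if must_block:
        return "hook"
    if needs_external:
        return "mcp"
    if task_share >= 0.8:
        return "push"  # CLAUDE.md / AGENTS.md, always loaded
    return "pull"      # SKILL.md, loaded on demand


rules = {
    "use the repo's import ordering": choose_mechanism(0.95),
    "how to cut a release": choose_mechanism(0.05),
    "never force-push to main": choose_mechanism(1.0, must_block=True),
}
```

Here `rules` maps the three illustrative rules to "push", "pull", and "hook" respectively.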
10. Claude Code hooks ecosystem
Best single piece: Disler, claude-code-hooks-mastery — https://github.com/disler/claude-code-hooks-mastery. The deepest community hook collection: all 13 lifecycle events covered, security hooks (block rm -rf, block .env reads), TTS, sub-agent orchestration recipes, JSON payload docs. Set up via UV single-file scripts so you don't pollute project deps.
Runners-up:
- hesreallyhim/awesome-claude-code — https://github.com/hesreallyhim/awesome-claude-code. The canonical curated list: skills, hooks, slash-commands, plugins.
- rohitg00/awesome-claude-code-toolkit — https://github.com/rohitg00/awesome-claude-code-toolkit. 20 hooks among 135 agents and 42 commands; bigger but noisier.
- ithiria894/awesome-claude-code-workflows — https://github.com/ithiria894/awesome-claude-code-workflows. Workflow recipes that combine hooks + MCP + skills + agents + CLAUDE.md.
- John Lindquist, claude-hooks (TypeScript hook framework). Linked from the awesome list.
- Lily Dayton, Dippy — auto-approves safe bash, prompts on destructive ops.
Recipe (block destructive bash before it runs):
.claude/settings.json:
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [
        { "type": "command", "command": ".claude/hooks/block_rm_rf.sh" }
      ]
    }]
  }
}
.claude/hooks/block_rm_rf.sh (mark it executable with chmod +x):
#!/usr/bin/env bash
# The tool call arrives as JSON on stdin; read all of it, not just the first line.
payload=$(cat)
cmd=$(printf '%s' "$payload" | jq -r '.tool_input.command // empty')
if printf '%s' "$cmd" | grep -qE 'rm -rf|:>|chmod -R 777|curl .* \| sh'; then
  # Exit code 2 blocks the tool call; stderr is fed back to the model as the reason.
  echo "Destructive command blocked by guardrail" >&2
  exit 2
fi
exit 0
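The same guardrail can be sketched in Python, which makes the matching logic easy to unit-test before wiring it into a hook. This version relies on the documented convention that a PreToolUse hook exiting with code 2 blocks the tool call, and reads the command from `tool_input.command` as the recipe does. The function name `hook_exit_code` and the fail-open behaviour on malformed input are assumptions for illustration.

```python
import json
import re

# Same patterns as the shell recipe's grep -E expression.
DESTRUCTIVE = re.compile(r"rm -rf|:>|chmod -R 777|curl .* \| sh")


def hook_exit_code(stdin_text: str) -> int:
    """Return 2 to block the tool call, 0 to let it proceed."""
    try:
        payload = json.loads(stdin_text)
    except json.JSONDecodeError:
        return 0  # assumption: fail open on malformed input; return 2 here to fail closed
    if not isinstance(payload, dict):
        return 0
    tool_input = payload.get("tool_input")
    command = tool_input.get("command", "") if isinstance(tool_input, dict) else ""
    if isinstance(command, str) and DESTRUCTIVE.search(command):
        return 2
    return 0


# As a hook script, the entry point would be roughly:
#   import sys
#   code = hook_exit_code(sys.stdin.read())
#   if code == 2:
#       print("Destructive command blocked by guardrail", file=sys.stderr)
#   sys.exit(code)
```

Pattern lists like this are a floor, not a sandbox: a command can be destructive without matching any of them, which is why the container isolation in section 1 still matters.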
Top 10 must-read community pieces (ranked)
- Geoffrey Huntley — Ralph Wiggum as a "software engineer" — https://ghuntley.com/ralph/. The foundational text for everything in Matt's cohort. Read first.
- Dex Horthy — Ralph loops, RPI, and the Dumb Zone (Dev Interrupted) — https://linearb.io/dev-interrupted/podcast/dex-horthy-humanlayer-rpi-methodology-ralph-loop. The smart/dumb-zone vocabulary plus the human-gated alternative to Huntley's pure loop.
- Addy Osmani — Agent Harness Engineering — https://addyosmani.com/blog/agent-harness-engineering/. The unifying frame: "Agent = Model + Harness", with the ratchet principle for AGENTS.md.
- HumanLayer — 12-Factor Agents — https://github.com/humanlayer/12-factor-agents. Doctrine. Factor 3 (own your context window) and Factor 10 (small focused agents) underpin most of the syllabus.
- Geoffrey Huntley — sandbox-environments.md — https://github.com/ghuntley/how-to-ralph-wiggum/blob/main/references/sandbox-environments.md. The single best comparison of E2B, Sprites, Modal, Docker for agent isolation.
- Simon Willison — Claude Skills are awesome, maybe a bigger deal than MCP — https://simonwillison.net/2025/Oct/16/claude-skills/. Defines the push/pull split better than anyone else.
- Disler — claude-code-hooks-mastery — https://github.com/disler/claude-code-hooks-mastery. All hook events, all the security recipes, all in one repo.
- Armin Ronacher — Agentic Coding Recommendations — https://lucumr.pocoo.org/2025/06/12/agentic-coding/. Sceptical, experienced practitioner's take. Read alongside Huntley as a corrective.
- Ryan Lopopolo on Latent Space — Extreme Harness Engineering — https://www.latent.space/p/harness-eng. 1M LOC, 0% human code; the upper bound of what harness engineering looks like in production.
- hesreallyhim/awesome-claude-code — https://github.com/hesreallyhim/awesome-claude-code. Bookmark; it indexes the rest of the ecosystem.
Last updated: 2026-04-24. All URLs verified during research. Where multiple URL forms existed (e.g., Substack mirrors of personal blogs), I linked the canonical author site.