AI Hero Self-Study Syllabus

A complete self-paced course built from Matt Pocock's public writing, open-source skills, conference talks, and Anthropic's official documentation.


Setup Checklist

Work through this before Module 1. Each line should take under 5 minutes unless noted.

Runtimes

Exercise 0.1 — Your own one-pager

Write notes/journal.md with:

  1. Read: What Is An LLM?

    The AI Engineer Roadmap

    What Is An LLM?

    In this article, we'll cover the basics of large language models. We'll talk about what they are, how they work, and touch on the process of creating them.

    Most of the resources out there go really deep into how LLMs work - we're not going to do that. Instead, I'll give you a brief overview of the most common concepts so you can go ahead and get building.

    What Is A Large Language Model?

    A large language model is essentially a massive compressed file - think of it like a 1TB zip file. This file contains a bunch of numbers, encoded as 16-bit floats. These numbers are the parameters of the model.

    0.1239784871238176123 // Parameter 1
    
    0.1515689756890123123 // Parameter 2

    These parameters represent the 'brain' of the model. They are the result of the model's pre-training: a process that takes a huge amount of text data and 'compresses' it into these numbers. They represent the model's understanding of the world, and give it the ability to remember facts and make decisions.

    The number of these parameters represents the size of the model's brain. In general, models with larger brains perform better, but run slower. A model with 70B parameters will run ~10x slower than a model with 7B parameters.

    Already, we're looking at a size vs speed tradeoff - a common theme when choosing large language models.

    How Do You Run A Large Language Model?

    In order to get the model to do anything useful, you need to perform inference on the model.

    Inference is the process of sending text to the model and getting a response back. This is done using an inference function - a piece of software that takes the parameters of the model and runs an algorithm on them to find the next word. This is far cheaper than pre-training the model, and can be done on your laptop.

    For a deep-dive into how inference works, check out this incredible interactive walkthrough.

    Sampling Strategy

    To find the next word, the model looks at all the possible tokens it could choose, and picks one.

    To do so, it uses a sampling strategy picked by the developer. This strategy determines how the model chooses the next word. The most common strategies are:

    • Greedy Sampling: The model always picks the most likely word.
    • Top-K Sampling: The model picks from the top K most likely words.
    • Top-P Sampling: The model picks from the words that make up P% of the probability mass.
    • Temperature Sampling: The model introduces randomness into the selection process, allowing for more diverse outputs.

    It's beyond the scope of this article to go into the details of these strategies. Usually, as an AI Engineer, you don't have the ability to change the sampling strategy of the model you're using. However, you can tweak variables, like the temperature, to get different results.

    What Are Input Tokens?

    When you send text to the model, it first needs to be tokenized. This is the process of breaking the text up into individual words, and then converting those words into numbers. These numbers are the input tokens, which are passed to the inference engine.

    Each model has its own tokenizer. Tiktokenizer is a great playground for exploring different tokenizers.

    How Do You Create A Model?

    In order to acquire the parameters, you need to train the model. Training large language models is an extremely involved process that requires a lot of time, expertise, and money. Learning how to do it is outside the scope of this article.

    A rough guide is to take a chunk of the internet, let's say 10TB of data. You use 6,000 GPU's for 12 days, at the cost of around $2M. And you end up with a ~140GB file with all the parameters of the model.

    The training process has two main phases:

    1. Pre-training: This gives the model its knowledge by compressing vast amounts of internet data into parameters
    2. Post-training: This shapes the model's personality and behavior through careful instruction and example.

    You end up with a huge file of parameters - a kind of 'compressed' version of all of the data the model was trained on, with its personality shaped by post-training. Without post-training, the model would just be an inert blob of knowledge - it wouldn't know how to behave like a helpful assistant.

    Resources

    How Do You Introspect A Model?

    It's possible (though very difficult) to dive into the parameters of a model to work out which ones correspond to which real-world concepts. For instance, Anthropic found that models can:

    • Share concepts across languages, suggesting a kind of universal "language of thought"
    • Plan ahead when writing (like planning rhymes in poetry)
    • Use multiple parallel pathways for tasks (like mental math)
    • Sometimes fabricate plausible-sounding reasoning rather than showing their true thought process

    By understanding how models think, we can better anticipate their behavior and potentially remove some of the "magic" that makes them hard to trust. This could lead to more controlled, deterministic AI systems in the future - but the research is still in its early stages.

    Resources

    Conclusion

    Large language models compress vast amounts of knowledge into numerical parameters. While they're built on complex math, you don't need to understand their inner workings to use them effectively. Getting a basic lay of the land is enough to build smart intuitions about how they work.

    In the next article, we'll explore how to choose the right LLM for your specific needs.

    Join over 54,000 Developers Becoming AI Heroes

    Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.

    Email*

    I respect your privacy. Unsubscribe at any time.

  2. Read: What Are Tokens?

    LLM Fundamentals

    Email*

    What Are Tokens?

    Tokens are the fundamental building blocks that help Large Language Models (LLMs) process text. Understanding them is essential, especially since you're billed based on token usage.

    What Are Tokens?

    Tokens are simply numbers that represent how the LLM "thinks" about the text you provide. The process of converting text into tokens is called encoding.

    The tokenization process works in two parts:

    1. The tokenizer splits text into tokens it recognizes
    2. These tokens are converted into numbers

    Encoding

    Decoding is the reverse process:

    1. Numbers are converted back into text tokens
    2. The tokens are joined together to form the output

    Decoding

    The LLM Process Flow

    The complete LLM process looks like this:

    1. Tokenizer encodes your input text into tokens
    2. LLM processes your tokens
    3. LLM produces output tokens
    4. Output tokens are decoded back into readable text

    LLM Process Flow

    To clarify, input tokens include:

    • Your conversation history with the LLM
    • System prompts
    • Tool definitions

    Output tokens are what the LLM sends back as a response.

    You're billed for both input and output tokens, typically at different rates. One way to save money is to design your prompts to generate fewer output tokens.

    How Tokens Are Created

    The tokenization process starts with a large corpus of text - similar to what's used to train the LLM itself. Let's imagine a tiny corpus consisting of just one sentence: "the cat sat on the mat."

    Tokenization

    First, all individual characters are extracted:

    T H E space C A T space S A T space O N space T H E space M A T

    Each of these characters becomes its own token in the vocabulary.

    Next, common groupings of characters are identified:

    • "TH" appears in "the" (twice)
    • "HE" appears in "the" (twice)
    • "AT" appears in "cat", "sat", and "mat"

    Each of these groupings also gets assigned its own token.

    Then, groups of groups are identified - like "TH" + "HE" creating "THE" (the word "the"), which gets its own token.

    Vocabulary Size Matters

    The goal is to create a large vocabulary of tokens because larger vocabularies can split words into fewer tokens, making processing more efficient.

    Vocabulary Size

    For example, a vocabulary size of 1,000 tokens might split "understanding" into 5 tokens. A vocabulary size of 50,000 tokens might split it into 3 tokens, and a vocabulary size of 200,000 tokens might split it into 2 tokens.

    Having a larger vocabulary means you can split words into fewer tokens, making processing more efficient.

    Handling Unusual Words

    The tokenizer struggles with uncommon words. For example, "O Frabjous Day" from Lewis Carroll's poem gets split into many tokens because "Frabjous" is a made-up word that doesn't appear frequently in the training corpus.

    Unusual Words

    We can see that it turns it into 7 tokens - more than we'd expect from only 15 characters.

    Final Thoughts

    I hope that helps demystify tokens a bit. I found the tiktokenizer playground really useful for understanding this stuff.

    Let me know if you have any questions - and what else would you like me to cover next?

    Matt

  3. Read: What Is The Context Window?

    LLM Fundamentals

    Email*

    What Is The Context Window?

    The context window is made up of input and output tokens. The input tokens might include a system prompt and a user prompt (the message from the user). And then the output tokens is whatever the assistant comes back with:

    Input/Output Tokens

    The context window is the input and output tokens combined.

    As the conversation gets longer, as more and more messages get put into the conversation, the number of tokens used grows.

    Long Conversations

    And as you can imagine, this can't go on forever. Every model has a hard-coded limit for the number of tokens it can see at any one time.

    Context Window Limits

    If we imagine a super long conversation, eventually we will hit a limit. And if you try to query the LLM with this super long conversation and you hit the limit, then you will probably get some kind of error back from the API.

    Context Window Limits

    You might even hit the limit during the generation of a message. So this message might start outside the context window limit, but as it goes or it carries on until it hits the context window limit.

    Output Limits

    The model itself isn't quite smart enough to work around its own context window limits. And so you will hit issues like this sometimes.

    The "Lost in the Middle" Problem

    But the biggest issue with context windows is that the bigger they get, the more "lost in the middle" issues you get.

    If we imagine a huge conversation, where these rings are the individual messages, the messages at the start of the history have quite a big impact on the output, and the ones at the end do too, but the stuff in the middle the LLM pays a bit less attention to.

    Lost in the Middle

    This is a well-known phenomenon and it's much more pronounced the larger the context window gets.

    So when you're assessing a model, you shouldn't just think, "wow, this context window limit's huge, that's incredible, I can put so much stuff in there."

    That model will still probably have lost-in-the-middle issues. It may even have trouble retrieving information from its own context window.

    And so even if that model supports a big context window, you'll definitely still get better results from using fewer tokens in the context.

  4. Read: Messages, System Prompts and Reasoning Tokens

    LLM Fundamentals

    Email*

    Messages, System Prompts and Reasoning Tokens

    The first part of understanding LLMs is understanding the protocol for talking to them - the messages that make up the conversation.

    What is a user message? What is a system prompt? What are tool calls? What are reasoning tokens? Let's break it down now.

    User And Assistant Messages

    A simple conversation with an LLM might look like this. You have user messages, which are messages from you, the user. LLM messages are called assistant messages.

    Simple Conversation

    In this situation we're just talking to the bare model that the model provider gives us. But if we want to customize the LLM's behavior, then we can use a system prompt.

    System Prompts

    System prompts are messages at the very start of the history, which are visible to the LLM, but usually not to the user.

    They're powerful, too. The LLM, if there's a conflict between the user and the system prompt, will usually obey the system prompt. For instance, we can tell it to respond in Morse code, and then even if we say stop doing Morse code, it will reply in Morse code.

    System Prompts

    Of course, this doesn't always work and there's plenty of examples of people jailbreaking the system prompt, but this is at least the theory.

    Reasoning Tokens

    Lots of models can send back reasoning tokens, where the model appears to think through its output before it responds. These reasoning tokens are just another part of the assistant message.

    Reasoning Tokens

    This is an important concept here that messages can contain multiple parts. This is useful for file parts too, where you can send files to LLMs or get files back from them in image generation use cases.

    Files

    For instance here, we're sending a file to an LLM saying summarize this PDF, and it replies with a summary.

    Tools

    With tools you have the ability to say you have access to a write file tool to the LLM. This might write a file to our file system or something.

    Tools

    Then the user is going to say, write a todo.md file. The assistant then produces a tool call message. That tool call has an id on it. And it also has an instruction to write a file todo.md with empty contents.

    Now we, in our applications, take that tool call and execute it, and we send back to the LLM a tool result with the same id as the tool call and the result of what happened, usually as a string.

    We get back from the assistant a summary of what happened, "wrote a todo.md file successfully".

    So tools are kind of like a conversation between the LLM and our system, where:

    1. The tool call is the LLM asking us for something
    2. The tool result is us giving it the information it needs

    That's how AI-powered apps like Claude Code and Cursor can do things on your system. We'll dig deeper into tools later.

  5. Read: What Are Tools?

    LLM Fundamentals

    Email*

    What Are Tools?

    Giving the LLM tools is done via the system prompt. The system prompt is just another message in the message history which describes to the LLM what it's supposed to be doing, and in this case, what tools it can call.

    Giving The LLM Tools

    For each tool, we're providing three things:

    1. The name of the tool (e.g., writeFile)
    2. A description of the tool (e.g., "Write a file to the file system")
    3. The parameters it takes and their types (e.g., path and content)

    These parameters are specified in JSON schema, so we can pass anything that JSON schema supports, like objects, arrays, and other complex types.

    These tool definitions get injected into the system prompt, and any other information you've provided to the system prompt goes below it. There's nothing particularly fancy going on - it's just tool definitions inside the system prompt.

    The LLM Chooses a Tool

    The magic happens when we ask the LLM to choose a tool. We've got our system prompt saying "You have access to the following tools." Then we add a user message, which says, "Write a new file called .gitignore."

    Choosing A Tool

    We then receive back an assistant message with a tool call inside. This is just an instruction from the LLM indicating which tool to call. It has an id on it and also contains the required parameters.

    In this example, it's writing an empty file to the path ".gitignore".

    Tool Calls vs. Tool Execution

    This tool call is just an instruction for which tool should be called. Nothing has happened yet - the assistant has just produced a message. That's it.

    The tool then needs to be executed on our machine. The LLM has created this message, but we then need to actually execute the creation of the file on our machine.

    Executing A Tool

    Implementation Requirements

    This means that for every single tool in the system prompt, we're going to have functions in our code base that match up to those.

    If the tool execution is successful, we're then going to send back a user message to the LLM. It's going to have the same id as the previous tool call, and we're going to send it a message saying what happened when we executed the tool.

    Handling Errors

    This error handling is really important. If there are any errors when the tool is executed, we need to show that error message to the LLM so that it can do something differently.

    So a tool result could be a success or it could be a failure.

    The Complete Flow

    The Whole Flow

    Let's go one more time through the entire flow:

    1. We specify the tool in the system prompt: "You have access to the following tools..." passing it a bunch of JSON schema
    2. We send a user message saying "Write a new file called .gitignore" (note that we don't have to specifically say "call this tool" - the LLM itself decides which tool to call)
    3. The LLM produces a tool call message with all the right parameters
    4. We see this tool call message and execute it on our machine
    5. We send the result back with a message saying what happened
    6. The LLM sees this entire history and responds with a summary

    In this case, it said "Done. What should go in there?"

    Summary of Tools

    That's what tools are - they're just ways of getting LLMs to produce certain types of messages which you can then intercept and execute on your machine, and give those results back to the LLM.

    With this simple loop, you can build really, really powerful applications.

  6. Read: What Is An Agent?

    LLM Fundamentals

    Email*

    What Is An Agent?

    Ever since Anthropic dropped the article building effective agents, everyone's been talking about agents and workflows. Both agents and workflows are ways of building more powerful systems with LLMs, and they both involve orchestrating multiple calls to the LLM.

    What is a Workflow?

    A workflow does this through predetermined steps where you have one LLM call which goes to another call which goes to another call. These predetermined steps are written in code by the developer.

    Workflow Diagram

    The code itself decides when to stop the program, when to call the next LLM. It's all written into the code itself.

    What is an Agent?

    An agent, though, doesn't use predetermined steps. It calls an LLM and gives it a bunch of tools, different options of things that it can do next. The LLM decides which tool to call and then responds to the result of those tools.

    Agent Diagram

    The LLM itself decides when to stop the program when it thinks it's finished. In other words, this is the LLM making it up as it goes along, which hands a lot more power to the LLM, but of course makes it less predictable.

    Comparing Agents and Workflows

    Both agents and workflows involve multiple LLM calls. If you're just making one LLM call, it's not really either of those things.

    The crucial difference is who decides when to stop:

    Agents vs Workflows Comparison

    Agents are really good in situations where the steps to complete the task are not particularly clear, where it needs the ability to improvise to figure out its way through a difficult problem.

    But workflows are great for things that need to be done the same way again and again. Workflows are often unfairly maligned because they're not as exciting and sexy as agents.

    You'll often get better results from using a workflow than using an agent, as long as the task is clearly specified.

    Parallel Workflows Example

    For instance, you can use a workflow to parallelize work. Let's say we take in a chunk of text, we can split it into two parts, summarize each of them independently, and then summarize the summaries afterwards.

    Parallel Workflow

    That's the difference between agents and workflows. I don't know whether it's me being old and boring, but I am more excited by workflows than agents.

  7. Read: What Can You Use LLM's For?

    The AI Engineer Roadmap

    Email*

    What Can You Use LLM's For?

    Before we dive into the world of LLMs, and how they work, we first need to know where we're going.

    What are LLMs actually used for these days? What utility do they have? What can you build with them?

    In this article, we'll explore the different applications of LLMs. We'll also look at things you shouldn't build with LLMs - things that are better suited to more deterministic tools.

    Use LLMs For...

    Unstructured Data -> Structured Data

    Most companies have access to a lot of unstructured data. These could be transcripts from support calls, customer emails, invoices, or even just notes from meetings.

    This data is hard to work with. It's hard to search, hard to analyze, and hard to read.

    For example, a friend of mine has started a PhD looking at historical hawk migration patterns. Data for how these hawks moved is recorded in ancient, unstructured logs from the 19th century. Despite being carefully archived and digitized, reading through these logs is an absurdly time-consuming task.

    The solution? To use LLMs to convert these logs into tabular data. LLMs can read the logs, understand the patterns, and convert them into a structured format. This structured format can then be used for analysis, visualization, and further research.

    This is perhaps the most common, powerful use case for LLMs. The world has collected vast amounts of data in the past few decades. LLMs are now making that data trawlable and accessible.

    This is a common theme—LLMs are often used for data tasks that it would be impractical (or expensive) to hire humans for. This opens up new possibilities for working with previously inaccessible data.

    Labeling & Classification

    Another common task for LLMs is classification. They can be fed an input and asked to attach labels to it, which helps in organizing and understanding the data more effectively.

    One striking example comes from The Prompt Report. In their case study, they attempt to detect "signal that is predictive of crisis-level suicide risk in text written by a potentially suicidal individual". They used data from the subreddit r/SuicideWatch, and the LLM had to match up with an expert's analysis.

    The LLM, when provided with the text, would have to classify whether or not it contained elements of either "frantic hopelessness" or "entrapment". It would reply with "positive" (i.e., that the text contained risk signals) or "negative".

    Classification systems have been around in machine learning for a long time. They usually require significant amounts of data to train. LLMs make this process easier by only requiring a simple prompt to change their behavior to a classifier. Very useful.

    Recent developments in LLMs also mean it's easier to retrieve structured data from them. Check out a classification example from the Vercel AI SDK tutorial.

    Question Answering

    Another common use case for LLMs is as a question answerer. You can feed an LLM a question, and it will give you a response based on its training data.

    However, LLMs have several downsides when used as a knowledge base. Their training data has a cut-off point, so it doesn't have access to up-to-date information. They often can't cite sources for their answers, which makes it hard to verify their accuracy.

    Therefore, connecting LLMs to external data sources is a common pattern.

    This external data source could be a database, a search engine, or any API. LLMs can call external services (using tools) to get the most up-to-date information.

    This is not foolproof—careful work is needed to make sure the LLM does not hallucinate or provide incorrect information. But question answerers, in the form of chatbots or search engines, are a common use case for LLMs.

    DeepResearch, a now-common offering from Perplexity, Google, OpenAI, and others, is a good example of this. It's a pattern where an entire academic-style report is generated from a simple query.

    Agents

    The fact that LLMs can access external tools has a lot of folks very excited. It means that LLMs can be used to do things in the world, not just generate text.

    This pattern is often called an "agent"—a system that can take actions in the world, respond to user inputs, and interact with other systems.

    One can imagine a coding agent acting like a team member—contactable via Slack, able to write code, deploy it to production, and communicate with the user.

    This is similar to the promise of agents like Devin.

    However, agents have not yet had their breakout moment—certainly not in the way chatbots have. Agents are yet to find their final form in terms of user experience.

    Don't Use LLMs For...

    Naive Chatbots

    It can feel very tempting to build chatbots with LLMs. It's very simple to set up. You feed the LLM a prompt, give it access to a conversation history, and you're good to go.

    "Chat with our docs." "Chat with our support bot." "Chat with your search results." Naive chatbots are thin wrappers around LLMs, hastily thrown together to make a product seem more interactive.

    However, productionizing chatbots is an extremely difficult problem. If you're not careful, they will frustrate your users and damage your brand. It is notoriously difficult to make a chatbot only respond to relevant queries without veering off-topic.

    The big model providers (OpenAI, Anthropic, Google, etc.) come with built-in guardrails to prevent their models from saying anything brand-damaging. But the surface area is so large—any potential conversation you can think of—that these guardrails will likely never be perfect. A famous example is Google's Gemini asking the user to die.

    Gemini saying "please die, please."

    It only takes one determined user to jailbreak your chatbot and make it say something inappropriate. Don't ship chatbots without proper safeguards.

    Deterministic Systems

    A good rule of thumb for AI systems is "if it can be built deterministically, it should be."

    LLMs are probabilistic systems. They are designed to choose the next word in a piece of text, over and over again, from a choice of many possible options. Depending on how the next word is selected (their "sampling strategy"), they can produce different outputs from the same input.

    However, this design also makes them prone to several failure modes:

    • Hallucinations: generating text that is not grounded in reality
    • Sycophancy: overly conforming to the user's point of view, instead of providing a balanced response

    These failure modes can be worked around, but they require careful design and testing. This means that if you can build a system deterministically, you should.

    Deterministic systems are far easier to test, debug, and maintain. They are safer to put into production and are often faster and cheaper to run.

    Conclusion

    Deterministic systems are not going away. They are infinitely easier than AI apps to build, test, and maintain. In a world where folks are throwing LLMs at every problem, being able to sniff out when not to use them is a valuable skill.

    Deterministic systems should be your default choice for any task, until you hit a barrier which can only be solved by an LLM.

    But LLMs do have their place. Let's take the LLM use cases we've seen so far and put them into two buckets.

    First, there are the tasks that are too expensive to hire humans for:

    • Converting unstructured data into structured data
    • Labeling and classification

    Then, there are the tasks that are too complex for deterministic systems:

    • Question answering
    • Text generation
    • Agents

    So any task that falls into one of these buckets is a good candidate for an LLM.

  8. Read: 5 Questions To Ask Before Choosing An LLM

    The AI Engineer Roadmap

    5 Questions To Ask Before Choosing An LLM

    Choosing the right model is crucial to the success of your AI-powered app. But it's not an easy call.

    It's a tough, many-layered decision that you don't only make once. You'll have to make it over and over again as new models emerge and your app evolves.

    I've split this decision down into several key questions you'll need to ask yourself when choosing a model.

    1. Should I Use An Open Or Closed Model?

    There are two main types of models you'll need to choose from: open and closed.

    Open Models

    Open source models are models that are free to download and use. However, you'll need to host and run them yourself if you want to build an app with them.

    Open models can be run on your own hardware, or on a cloud provider like AWS or Azure.

    The Open LLM Leaderboard is a great place to look for open models.

    Closed Models

    Closed-source models are ones controlled by companies. You need to pay to use them, but they are hosted by the company, so you don't need to worry about running them yourself.

    The most powerful models in the world are currently closed-source. But open models are improving all the time.

    Chatbot Arena is a good place to compare closed-source against open-source models.

    Model Providers, API Providers, and Hosting Your Own Model

    There are two types of companies who host models for you to use:

    Model providers use closed-source models. You pay to use their models, but they're hosted by the company. These include OpenAI, Google, Anthropic, Deepseek, and many more.

    API providers host open source models, and charge you a fee for using them. They include Hugging Face, Groq, and others.

    You can also host your own model. This is the most flexible option, but also the most expensive. You'll need to pay for the hardware to run the model, and you'll be responsible for keeping it up and running.

    2. How Much Will It Cost?

    The way you'll pay for your LLM usage changes depending on whether you're hosting your model or not.

    Cost Per Token

    Most model providers charge by token. The more tokens you use, the more you pay. The most common way this is expressed is "cost per 1m tokens".

    You don't just get charged for input tokens (what you say to the model), but also for completion tokens (how the model replies).

    This is a traditional 'pay per usage' model - just like serverless platforms charging you for compute time. Tokens are a decent metric for how much you're using the model.

    There is a worldwide race-to-the-bottom on token prices, as companies compete for market share. Price comparison websites like Helicone's are useful for comparing prices.

    Hosting Open-Source Models

    Hosting open-source models can be a more cost-effective option. Instead of paying per token, you now pay a fixed fee to host the model.

    This also has the benefit of keeping you entirely in control of your own data. This can be important for data residency and privacy reasons.

    However, models need to be hosted on powerful hardware, which can be expensive. You'll need to balance the cost of hosting the model with the cost of using a model provider.

    My general suggestion is to start with third-party APIs. They give you the most flexibility and are reasonably cost-effective. Later, you can move to hosting your own models if you need to.

    3. How Important Is Latency?

    Another important feature to consider is latency. Latency is the approximate time it takes for the model to respond to a query. Fast responses can be crucial for many use cases, and can make your application more useful to users.

    Latency is affected by the size of the model - smaller models are faster to run, but often less accurate.

    Latency is also affected by the hardware the model is running on. More powerful hardware can run models faster.

    And finally, it's affected by any inference optimizations made to the model. This helps speed up the inference function of the model. They include quantization, distillation and parallelism - and are somewhat outside the bounds of this article.

    Measuring Latency

    There are two main metrics to consider when measuring latency:

    • TTFT: Time to first token: how long it takes for the model to start generating a response
    • TPOT: Time per output token: how long it takes for the model to generate each token

    4. How Do I Assess Model Performance?

    Cost and latency are important, but a model is useless if it cannot perform the task required.

    As a general rule, smaller models will perform worse. A smaller model will have fewer parameters - and so a smaller space in which to store information.

    Public Benchmarks

    The performance of a model is an extremely slippery metric. It's extremely difficult to look at two models and say which one is better.

    A common approach that model providers (and open-source projects) take is to use benchmarks. These are standardized tests that measure the performance of a model on a specific task. These benchmarks can grade the model at certain tasks, such as translation, summarization, question-answering, or coding.

    Benchmarks are a decent early indicator for a model's performance. However, there is a constant danger of model providers overfitting their models to these benchmarks. This can either happen by the model accidentally consuming the benchmark data during pre-training, or organizational pressure to improve benchmark scores.

    In some benchmarks, models are compared against each other - with humans grading which model produces the better output. Chatbot Arena is a good example of this - and worth checking out for an early indicator of model performance.

    Specialized Models

    Some models will perform better at certain tasks. This often depends on the model's training data - if it has been trained on large amounts of code examples, it will be better at coding tasks. The same is true for many disciplines: translation, classification, summarization, etc.

    If you have a specialized task, it's worth looking for models which are specialized for that task. These models will usually outperform general-purpose models. They are also often smaller - so, faster and more efficient.

    Reasoning Models

    Some models have been specifically designed to pause before providing a response. These are the reasoning class of models, a trend initiated by OpenAI's o1.

    These models often perform better at tasks requiring forward planning and critical thinking, like coding and math problems. They also output their planning process with reasoning tokens, which can be useful to stream to the user in real-time.

    However, they are often more expensive than regular models, and take longer to respond. It's a performance/latency tradeoff.

    Evals

    The only way a model can truly be evaluated is by testing it in the context of your application. This is why building evals for your system is so crucial.

    Evals are a set of benchmarks you run on your own system. They let you see whether your system is improving or degrading over time. We'll cover them in more depth later.

    5. How Big A Context Window Do I Need?

    The context window is the number of tokens the model can see at a time. The larger the context window, the more information the model can use to generate its next word.

    This limit is counted in tokens, and counts both input and completion tokens. Passing too long an input to a model (or forcing it to generate too long a response) can cause an API error, or prevent it from generating a response.

    The context window size is related to the mechanism the model uses to generate text - so is tied to the design of the model itself. Context window sizes are growing all the time. Currently, Gemini models have the largest context windows.

    Since the context window is limited on all models, managing it is a constant battle for AI engineers. Patterns like chunking in RAG are designed to squeeze more information into the context window.

    Conclusion

    These five factors are important for choosing your model:

    • Open or Closed
    • Cost
    • Latency
    • Performance
    • Context Window

    Leaderboards and benchmarks are a good place to start. However, the only way to truly assess a model's suitability is to test it in your application via experimentation with your own evals.

    Join over 54,000 Developers Becoming AI Heroes

    Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.

    Email*

    I respect your privacy. Unsubscribe at any time.

  9. Read: 17 Techniques For Improving Your LLM-Powered App

    The AI Engineer Roadmap

    17 Techniques For Improving Your LLM-Powered App

    Once you clearly understand your success criteria, have picked your model, and written some basic evals, it's time to start improving your system.

    The process of improving your system comes down to two things:

    • Improving your feedback loop (evals)
    • Improving the performance of the system itself

    We've already looked at how to improve your evals. In this article, I'll give you an overview of the main ways you can improve your system.

    But first, let's talk about the mindset you need to have when improving your system.

    Try the Simple Thing First

    We've already seen that improving an AI system is an experimental process. You need to try things out, see what works, and iterate.

    Techniques for improving a system range from simple and cheap to complex and expensive. Tweaking a prompt? Cheap. Training a model from scratch? Astonishingly expensive.

    I call this the Staircase Of Complexity Hell:

    Diagram titled 'The Staircase to Optimization Hell' showing a descending staircase with increasing complexity and cost. Steps from top to bottom: Zero-Shot, Few-Shot, Chain of Thought, Temperature, Workflows, Evaluators, Agentic Loops, LLM Routers, Fine-Tuning, Sampling. Gradient background from teal to red indicates rising expense.

    The key is to start at the top of the staircase, and work your way down only when you've exhausted all the simpler options. Simple techniques can provide a huge improvement for a small amount of effort.

    Techniques To Try

    This list is ordered from the simplest techniques to the most complex. Start at the top, and work your way down. I've linked to further resources where you can get more details on each technique.

    1. Your First Prompt

    Problem: You've got to start somewhere.

    Solution: Here are some basic tips for improving your prompts:

    • Be clear, direct, and specific.
    • Think of the LLM as a brilliant, but very new, employee.
    • Remember that the LLM has no context on your norms, styles, or guidelines.

    Resources

    2. Role-Based Prompting

    Problem: You want the LLM to behave in a certain way no matter the input.

    Solution: Use role-based prompting to get the LLM to adopt a persona.

    This could be as diverse as adjusting the tone of voice:

    Or even the accent:

    Or prime the LLM to talk about a certain topic:

    This is an extremely common technique, and very cheap to implement. It's usually done in a system prompt.

    Resources

    3. XML Tags

    XML Tags On The Input

    Problem: You want to pass multiple pieces of information to the LLM in a single prompt.

    Solution: Use XML tags.

    XML tags can help provide delimiters for different parts of the prompt.

    An example from Anthropic's docs is a financial report:

    XML Tags On The Output

    Problem: You want the LLM to respond with multiple different outputs.

    Solution: Tell the LLM to respond with different outputs based on the XML tags.

    You can also tell your LLM to respond with different outputs based on the XML tags in the prompt. This can give you more control over the structure of the response.

    You may want the LLM to review an article for you. You may want it to provide a , a , and .

    This technique was popularised by Anthropic, but most models also support it.

    Resources

    4. Constraining The LLM's Response Format

    Problem: You want to tightly constrain the text that comes back from the LLM, such as asking it to reply with JSON or a single word.

    Solution: Use structured outputs or explicit format instructions.

    Older versions of Claude supported "prefilling" the assistant message to steer the response format. As of Claude 4.6, prefilling returns a 400 error. The modern alternatives are better:

    Structured Outputs (recommended): Most LLM providers now support structured output schemas that guarantee the response matches a specific JSON shape. With the Vercel AI SDK, use or :

    System prompt instructions: For simpler constraints, tell the model what format you want in the system prompt:

    Both approaches are more reliable than prefilling ever was. Structured outputs give you type-safe, validated responses. System prompt instructions work when you need flexible text output in a specific shape.

    Resources:

    5. Structured Outputs

    Problem: You want the LLM to return structured data instead of text.

    Solution: Use structured outputs.

    Structured outputs are a way to get the LLM to return data in a structured format, like JSON. Most LLM providers support providing a JSON schema description of the output you want.

    The Vercel AI SDK is a particularly good toolset for this.

    Resources

    6. Reasoning

    Problem: The LLM is not doing well enough at complex, multi-step reasoning tasks, like coding or math problems.

    Solution: Prompt the LLM to reason through the problem using chain-of-thought (CoT) prompting.

    Chain-of-thought prompting encourages the LLM to break down problems step-by-step, leading to more accurate and nuanced outputs. This technique is particularly effective for tasks that require complex reasoning, analysis, or problem-solving.

    There are three main approaches to chain-of-thought prompting, from simplest to most complex:

    1. Basic CoT: Simply include "Think step-by-step" in your prompt. While simple, this lacks guidance on how to think.
    2. Guided CoT: Outline specific steps for the LLM to follow in its thinking process.
    3. Structured CoT: Use XML tags like and to separate reasoning from the final answer.

    Chain-of-thought prompting trades speed for quality. The LLM must process and output its reasoning steps, so the response time gets longer. This matters most in real-time applications - a chatbot needs quick responses, while a code reviewer can take longer for detailed analysis.

    Resources

    7. Multishot Prompting

    Problem: The LLM needs to understand a specific pattern or format but isn't getting it from a single example.

    Solution: Provide multiple examples to help the LLM understand the pattern.

    Multishot prompting can achieve results similar to fine-tuning, but without the cost and complexity of training a new model. It works by showing the model examples of what you want it to do.

    It's straightforward - provide a few examples of input and output, and the model learns the pattern. No training data or compute resources needed.

    Here's a practical example for writing product descriptions:

    After seeing these examples, the model learns to write product descriptions with sensory language and focus on benefits. If you then give it "Input: Coffee maker", it will generate a similar style description.

    Multishot prompting contrasts with zero-shot prompting, where you just describe what you want without examples.

    Resources

    8. Temperature

    Problem: The LLM's outputs are either too deterministic (boring) or too random (unreliable).

    Solution: Adjust the temperature parameter to control the randomness of outputs.

    You can pass a temperature parameter to the LLM. This controls how random or deterministic the LLM's outputs are.

    Think of temperature as your creativity dial. When you're writing code or need precise facts, you'll want to specify 0.0-0.3 - this makes the model stick to the most likely outputs. For general chat or creative writing, a medium setting of 0.4-0.7 gives you a nice balance. And when you're brainstorming or need fresh ideas, bump it up to 0.8-1.2.

    Higher temperature means more interesting outputs, but might produce more hallucinations. My general suggestion is to start conservative and dial it up only when you need more variety.

    Either way, it's a relatively cheap technique to try.

    Resources

    9. Tool Calling

    Problem: LLMs are limited to text generation and can't directly interact with external systems or perform actions in the world.

    Solution: Give the LLM access to specific functions or tools it can call to extend its capabilities beyond text generation.

    Tool calling bridges the gap between an LLM's internal capabilities and the external world. It allows LLMs to perform actions like making API calls, accessing databases, or manipulating files. The LLM describes what it wants to do, and the system executes the appropriate tool with the specified parameters.

    This pattern is particularly useful when you need your LLM to interact with external services, perform system operations, or access data that isn't in its training data. It's a fundamental building block for creating more capable AI applications.

    You can learn how to implement this pattern using Vercel's AI SDK in my tutorial.

    Resources

    10. LLM Call Chaining

    Problem: A single LLM call isn't sufficient to complete a complex task.

    Solution: Break down the task into multiple LLM calls that build on each other.

    When you need to perform multiple specialized operations on the same input, trying to do everything in a single prompt often leads to subpar results. Each operation might need different expertise and focus.

    This is where LLM call chaining comes in. Instead of asking one prompt to do everything, you break the task into specialized steps. Each prompt focuses on one aspect of the task, and its output becomes the input for the next prompt in the chain.

    Diagram showing sequential LLM calls where output from one step is passed into the next, ending in a fixed stopping point.

    Take code analysis and fix generation as an example. The first prompt acts as a code analyzer, identifying and categorizing issues in the code. It provides context for each issue, creating a structured analysis.

    The second prompt then uses this analysis to generate targeted fixes, building on the first prompt's insights. This separation of concerns allows each prompt to be optimized for its specific task, leading to better results than trying to do both operations in a single prompt.

    This pattern can be applied to many other scenarios:

    • First analyze a document's structure, then generate a summary
    • First identify key points in a debate, then craft a balanced response
    • First extract facts from research, then write a layperson explanation
    • First identify bugs in code, then generate fixes for each one

    Resources

    • Anthropic's docs on prompt chaining
    • UPDATE: OpenAI has removed their guide on using inner-monologue

    11. RAG

    Problem: Your LLM is making up facts because it can't access the information it needs.

    Solution: Give it access to real data through retrieval augmented generation.

    RAG is a powerful technique for grounding your LLM's responses in actual data and reducing hallucinations. Every LLM has a cutoff date for its training data - it can't know about events or information after that date. Instead of relying on what it learned during training, it can look up fresh information as needed.

    Diagram showing a database feeding external information into an LLM prompt to enrich its output.

    You've got two main ways to feed data to your LLM. Web search gives you access to current information and public knowledge. Company databases and documentation let you tap into private, domain-specific information. This is particularly useful when you need answers about your company's internal processes or want to ensure your LLM's responses are up-to-date.

    RAG shouldn't be your first port of call when building an LLM application. It adds significant complexity to your system - you need to manage data sources, handle retrieval, and ensure your context windows stay within limits.

    Resources

    12. Chunking

    Problem: The information you want to retrieve is too large to fit in the context window.

    Solution: Break down the information into smaller, manageable chunks.

    Chunking is a fundamental technique in RAG systems that breaks down large documents into smaller, more manageable pieces. The goal is to create chunks that are both semantically meaningful and small enough to fit within your model's context window.

    The complexity of chunking comes from the many ways you can split content. Here are the main approaches:

    • Token-based: Splits content based on token count, ensuring you stay within model limits
    • Character-based: Splits by character count, useful for raw text processing
    • Sentence-based: Preserves natural language boundaries
    • Paragraph-based: Maintains larger semantic units
    • Semantic boundaries: Uses embeddings to find natural break points
    • Document-structure: Respects document formatting (headers, sections, etc.)

    After chunking, you'll need to find the most relevant chunks for each query. Here are the main ways to do this:

    • BM25: A traditional search algorithm that finds exact word matches, great for technical terms and error codes
    • Embeddings: Converts text into vectors to find semantically similar chunks
    • Hybrid Search: Combines BM25 and embeddings for better results
    • LLM Reranking: Uses another LLM to carefully read and rank chunks by relevance

    Each approach has its strengths - BM25 excels at exact matches, while embeddings capture meaning. Many systems combine multiple approaches for the best results.

    Resources

    13. Agentic Loops

    Problem: LLM call chaining is too rigid for complex tasks. It requires predefined steps and stopping points, making it unsuitable for open-ended problems where the number of steps is unpredictable.

    Solution: Pass control to an autonomous agent that can plan, execute, and adapt based on environmental feedback.

    LLM call chaining uses predefined steps and stopping points, which limits its ability to handle unpredictable tasks. Agentic loops hand more control to the LLM - letting it decide when to stop based on task progress. The agent learns when to stop through real-world feedback.

    Diagram showing LLM interacting with APIs in a loop, feeding results back into prompts until it decides to stop.

    The resulting system is more powerful because it adapts to unpredictable paths. Instead of following predefined steps, it learns and adjusts based on each interaction. This makes it effective for complex problems where the solution isn't known in advance.

    This autonomy comes with a cost - increased latency from decision-making at each step. The LLM must evaluate the current state and choose the best path forward. This makes agentic loops slower than LLM call chaining, but more capable of handling complex tasks.

    This pattern works well for:

    • Complex code modifications across multiple files
    • Research tasks requiring multiple information sources
    • Customer support scenarios with unpredictable paths
    • Data analysis requiring multiple processing steps

    Resources

    14. Parallelizing LLM Calls

    Problem: Your LLM-powered system is taking too long because it processes tasks one at a time, creating unnecessary delays.

    Solution: Run multiple LLM calls in parallel to handle independent tasks simultaneously, dramatically reducing total processing time.

    There are only two ways to make a system faster: do less work, or do more work at the same time. When you need to process multiple tasks independently, running LLM calls in parallel can dramatically improve performance.

    You can parallelize when tasks are independent and don't rely on each other's results:

    • Analyzing multiple documents
    • Generating different variations of content
    • Processing multiple user queries simultaneously

    You cannot parallelize when tasks must happen in sequence:

    • When each step depends on the previous one's output
    • When maintaining strict order is crucial for the final result

    The performance benefits are significant. A system processing 10 documents sequentially might take 10 seconds, while parallel processing could complete in just 2-3 seconds. You should always be looking for opportunities to parallelize - even in systems that seem sequential, there might be independent components that can be processed concurrently.

    Resources

    15. Evaluator-Optimizer

    Problem: Your LLM's responses aren't meeting the quality standards you need, even after multiple attempts.

    Solution: Create an automated loop where one LLM generates responses while another evaluates and provides feedback for improvement.

    The evaluator-optimizer workflow creates a self-improving system where two LLMs work together. The first LLM generates responses, while the second evaluates them against specific criteria. This evaluation feeds back into the generation process, creating a continuous improvement loop.

    This pattern is particularly effective when you have clear evaluation criteria and when iterative refinement provides measurable value. You'll know it's a good fit when human feedback demonstrably improves LLM responses, and when an LLM can provide similar quality feedback.

    The pattern excels in scenarios like literary translation, where an evaluator LLM can catch nuanced meaning that the translator might miss initially. It's also powerful for complex search tasks requiring multiple rounds of searching and analysis, where the evaluator decides if further searches are needed.

    Resources

    16. LLM Routers

    Problem: Different types of queries need different handling strategies.

    Solution: Use an LLM to route queries to the most appropriate handler.

    LLM routers act as intelligent dispatchers, analyzing each query and sending it to the right specialized handler.

    The router first analyzes each query to determine its type and complexity. A customer service system might classify queries into these categories:

    Based on this classification, the router connects the query to the appropriate handler with its specific set of instructions and capabilities.

    This routing approach delivers several key advantages. It improves accuracy by ensuring each query is handled by the most suitable specialized system. And it allows each specialized LLM to focus on its specific domain, similar to how LLM chaining breaks down complex tasks into specialized steps.

    It also solves a fundamental limitation of LLMs - most models can only handle a limited number of tools (often 30 or fewer). By routing queries to specialized handlers, you can create a system that effectively handles an unlimited number of tools, as each handler only needs access to its relevant subset.

    However, adding an LLM router introduces an additional sequential step that increases latency, as each query must first be analyzed before being routed to the appropriate handler.

    Resources

    • Anthropic's Article on building effective agents mentions LLM routers
    • UPDATE: OpenAI has removed their documentation on intent classification

    17. Fine-Tuning

    Problem: Your LLM's outputs need to match specific quality requirements that simpler techniques can't achieve.

    Solution: Fine-tune a base model on your specific data to improve its performance for your use case.

    Fine-tuning lets you adapt existing models to your specific needs. You can start with a relatively small dataset of high-quality examples that demonstrate exactly the kind of output you want - whether that's matching your brand voice, handling specialized terminology, or maintaining consistent formatting. A fine-tuned smaller model can often outperform larger, more expensive models on your specific task.

    Fine-tuning sits between pre-training and prompt engineering in terms of cost and complexity. While it's an order of magnitude cheaper than training a model from scratch, each fine-tuning run will incur additional costs. There's also a risk of overfitting to specific model versions, which can make it harder to transition to newer, better models in the future.

    The best time to consider fine-tuning is when you have a working system that's already using simpler techniques. You've validated your use case, gathered real-world data, and identified specific areas where the model's performance needs improvement. Fine-tuning then becomes an optimization step to push your system's performance even further.

    Resources

    • OpenAI has a section in their docs on fine-tuning.
    • Anthropic's fine-tuning guide provides detailed requirements and best practices.

    18. The Next Big Thing

    The field of AI engineering moves at an astonishing pace. Every week brings new models, techniques, and tools promising to revolutionize how we build AI systems. It's impossible to keep up with everything, but you don't need to.

    Each new AI development needs to earn its place in your system. Ask yourself: "Does this make things simpler or more complex? Is it solving a real problem?" The most valuable developments reduce costs, improve performance, or make your system more maintainable.

    The next big thing in AI will come and go, but simplicity remains a reliable guide. Experiment with new techniques using your own evals - test them against your specific use case and success criteria. Focus on developments that help you build more effective systems with less complexity.

    Join over 54,000 Developers Becoming AI Heroes

    Engineering fundamentals are your biggest advantage. Learn how to leverage them and leave the vibe coders behind.

    Email*

    I respect your privacy. Unsubscribe at any time.

Deep path

Exercise 1.1 — The token audit

Pick any ~500-word block of your own writing. In a single TypeScript file (module-1/token-audit.ts):

import { encoding_for_model } from "tiktoken";
// ... count tokens for cl100k_base (GPT-4 family) and estimate cost for Opus at $15/1M in, $75/1M out.

Print: token count, estimated input cost for 1 call, estimated cost for a 100-request batch. No need to call any API.

Check: Script runs and prints a number. Reflect: What is the input cost of loading your entire CV into the context window, five times?

Exercise 1.2 — Agent vs. workflow

Write module-1/when-not-to-agent.md. Pick three tasks from your own work: