<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/rss-styles.xsl" type="text/xsl"?>
<rss version="2.0">
  <channel>
    <title>oreoro</title>
    <description>This is my personal website, where I keep the code notes, implementation details, and technical ideas I work through outside normal project work.</description>
    <link>https://oreoro.github.io/</link>
    <lastBuildDate>Sun, 07 Jun 2026 03:38:51 GMT</lastBuildDate>
    <item>
      <title>How LLMs Actually Work: A Friendly Map for Humans</title>
      <link>https://oreoro.github.io/posts/how-llms-actually-work-friendly-guide/</link>
      <guid isPermaLink="true">https://oreoro.github.io/posts/how-llms-actually-work-friendly-guide/</guid>
      <description>A plain-English, visual guide to tokenization, embeddings, attention, transformer layers, and next-token prediction, with optional technical notes and tiny code examples.</description>
      <pubDate>Sat, 06 Jun 2026 00:00:00 GMT</pubDate>
      <lastUpdatedTimestamp>Sat Jun 06 2026 07:22:00 GMT+0000 (Coordinated Universal Time)</lastUpdatedTimestamp>
      <category>Personal Notes</category>
      <category>Guide</category>
      <category>Information</category>
      <category>🌲 Evergreen</category>
      <content>&lt;div&gt;
                    &lt;p&gt;
                        &lt;em&gt;Note:&lt;/em&gt; This RSS feed strips out SVGs and embeds. You might want to read the post on the webpage
                        &lt;a href=&quot;https://oreoro.github.io/posts/how-llms-actually-work-friendly-guide/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.
                    &lt;/p&gt;
                    &lt;hr&gt;
                &lt;div&gt;&lt;p&gt;&lt;time&gt; June 6, 2026 &lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f9ed; &lt;/div&gt;&lt;div&gt; LLMs are not magic brains. They are prediction machines built from a few repeatable parts: tokens, vectors, attention, memory-like feed-forward layers, and a loop that keeps choosing the next likely piece of text.  &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &amp;#x270d;&amp;#xfe0f; &lt;/div&gt;&lt;div&gt; Source note: this is an original, beginner-friendly rewrite inspired by Kato&amp;apos;s article &lt;a href=&quot;https://www.0xkato.xyz/how-llms-actually-work/&quot; target=&quot;_blank&quot;&gt;How LLMs Actually Work&lt;/a&gt;, with extra examples, code, tables, and Notion-native structure.  &lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;The whole idea in one minute&lt;/h3&gt;&lt;p&gt;An LLM, or large language model, takes your text, turns it into numbers, runs those numbers through many transformer layers, and predicts what text should come next.&lt;/p&gt;&lt;p&gt;That is the simple version. The useful version is this:&lt;/p&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt; Your prompt is split into &lt;strong&gt;tokens&lt;/strong&gt;, which are small text pieces.  &lt;/li&gt;&lt;li&gt; Each token becomes a &lt;strong&gt;vector&lt;/strong&gt;, which is a list of numbers that carries learned meaning.  &lt;/li&gt;&lt;li&gt; The model adds information about &lt;strong&gt;order&lt;/strong&gt;, because &lt;code&gt;dog bites man&lt;/code&gt; and &lt;code&gt;man bites dog&lt;/code&gt; do not mean the same thing.  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Attention&lt;/strong&gt; lets each token decide which earlier tokens matter.  &lt;/li&gt;&lt;li&gt; A &lt;strong&gt;feed-forward network&lt;/strong&gt; does deeper processing for each token.  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Residual connections and normalization&lt;/strong&gt; keep the many layers stable.  
&lt;/li&gt;&lt;li&gt; The model outputs scores for the next possible token.  &lt;/li&gt;&lt;li&gt; One token is chosen, added to the text, and the loop repeats.  &lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre&gt;flowchart LR
    A[&amp;quot;You type a prompt&amp;quot;] --&amp;gt; B[&amp;quot;Tokenizer&amp;lt;br&amp;gt;text pieces&amp;quot;]
    B --&amp;gt; C[&amp;quot;Embeddings&amp;lt;br&amp;gt;meaning as numbers&amp;quot;]
    C --&amp;gt; D[&amp;quot;Position signal&amp;lt;br&amp;gt;word order&amp;quot;]
    D --&amp;gt; E[&amp;quot;Attention&amp;lt;br&amp;gt;what should matter?&amp;quot;]
    E --&amp;gt; F[&amp;quot;Feed-forward layer&amp;lt;br&amp;gt;deeper processing&amp;quot;]
    F --&amp;gt; G[&amp;quot;Next-token scores&amp;quot;]
    G --&amp;gt; H[&amp;quot;Pick one token&amp;quot;]
    H --&amp;gt; I[&amp;quot;Add it to the text&amp;quot;]
    I --&amp;gt; B&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f4a1; &lt;/div&gt;&lt;div&gt; A good mental model: an LLM is like an autocomplete system that has read a massive library and learned incredibly subtle patterns about what usually follows what.  &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Part &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Plain-English job &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Why it matters &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Tokens &lt;/td&gt;&lt;td&gt; Break text into pieces &lt;/td&gt;&lt;td&gt; The model cannot read raw words or letters directly. &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Embeddings &lt;/td&gt;&lt;td&gt; Turn pieces into meaning-shaped numbers &lt;/td&gt;&lt;td&gt; Similar ideas can sit near each other in number-space. &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Position &lt;/td&gt;&lt;td&gt; Tell the model where each piece appears &lt;/td&gt;&lt;td&gt; Order changes meaning. &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Attention &lt;/td&gt;&lt;td&gt; Let tokens look at useful previous tokens &lt;/td&gt;&lt;td&gt; This is how context flows through the sentence. &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Feed-forward network &lt;/td&gt;&lt;td&gt; Process each token more deeply &lt;/td&gt;&lt;td&gt; A lot of learned structure lives here. &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Next-token prediction &lt;/td&gt;&lt;td&gt; Score likely continuations &lt;/td&gt;&lt;td&gt; This is the generation loop behind every answer. &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;1. Tokens: the model&amp;apos;s alphabet is not your alphabet&lt;/h3&gt;&lt;p&gt;Models do not see your sentence the way you do. You see words. 
The model sees token IDs.&lt;/p&gt;&lt;p&gt;A tokenizer might split a sentence like this:&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;Text:   &amp;quot;The sleepy robot writes poetry.&amp;quot;
Tokens: [&amp;quot;The&amp;quot;, &amp;quot; sleepy&amp;quot;, &amp;quot; robot&amp;quot;, &amp;quot; writes&amp;quot;, &amp;quot; poetry&amp;quot;, &amp;quot;.&amp;quot;]
IDs:    [791, 47823, 11205, 13004, 24465, 13]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;Those ID numbers are what enter the model. The specific numbers differ across model families, but the pattern is the same: text becomes a sequence of integers.&lt;/p&gt;&lt;p&gt;Why not just use whole words? Because language is messy. New names, typos, code, slang, and other languages would explode the vocabulary. Tokens sit between letters and words: flexible enough for rare text, efficient enough for common text.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: why the strawberry counting problem happens&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;When you ask a model how many letters are in a word, the model may not be looking at separate letters. It may see a word as one or a few tokens. 
That means character-level questions can be awkward unless the model deliberately reasons about spelling.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;const vocabulary = {
  &amp;quot;The&amp;quot;: 791,
  &amp;quot; sleepy&amp;quot;: 47823,
  &amp;quot; robot&amp;quot;: 11205,
  &amp;quot; writes&amp;quot;: 13004,
  &amp;quot; poetry&amp;quot;: 24465,
  &amp;quot;.&amp;quot;: 13,
};
const prompt = [&amp;quot;The&amp;quot;, &amp;quot; sleepy&amp;quot;, &amp;quot; robot&amp;quot;, &amp;quot; writes&amp;quot;, &amp;quot; poetry&amp;quot;, &amp;quot;.&amp;quot;];
const tokenIds = prompt.map((piece) =&amp;gt; vocabulary[piece]);
console.log(tokenIds);
// [791, 47823, 11205, 13004, 24465, 13]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;2. Embeddings: IDs become meaning-shaped numbers&lt;/h3&gt;&lt;p&gt;A token ID by itself is just a label. ID &lt;code&gt;11205&lt;/code&gt; does not mean robot unless the model has a learned table that says what vector should represent that token.&lt;/p&gt;&lt;p&gt;That table is called the &lt;strong&gt;embedding matrix&lt;/strong&gt;. Think of it as a huge spreadsheet:&lt;/p&gt;&lt;ul&gt;&lt;li&gt; Every token ID gets one row.  &lt;/li&gt;&lt;li&gt; Every row contains many numbers.  &lt;/li&gt;&lt;li&gt; Those numbers are learned during training.  &lt;/li&gt;&lt;li&gt; The row becomes the token&amp;apos;s starting representation.  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If two tokens are used in similar situations, their vectors often end up close together. Words like &lt;code&gt;doctor&lt;/code&gt;, &lt;code&gt;nurse&lt;/code&gt;, and &lt;code&gt;hospital&lt;/code&gt; tend to live near related medical concepts. 
This was not hand-labeled by a person; it emerges because those relationships help the model predict text.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f9e0; &lt;/div&gt;&lt;div&gt; Embeddings are not definitions. They are coordinates learned from usage. The model learns that concepts are related because they appear in related contexts.  &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: vector arithmetic&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;An embedding is a vector, meaning a list of numbers. With enough training, directions in vector space can behave like meaning shifts. That is why famous examples like &lt;code&gt;king - man + woman &amp;#x2248; queen&lt;/code&gt; can sometimes work. It is geometry, not a dictionary.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;3. Position: the model needs word order&lt;/h3&gt;&lt;p&gt;A bag of tokens is not enough. These two sentences contain almost the same pieces but mean very different things:&lt;/p&gt;&lt;blockquote&gt;&lt;div&gt; The dog chased the boy.  &lt;/div&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;div&gt; The boy chased the dog.  &lt;/div&gt;&lt;/blockquote&gt;&lt;p&gt;The model therefore needs a position signal. Older transformers added a position vector to each token embedding. Many modern LLMs use &lt;strong&gt;RoPE&lt;/strong&gt;, short for Rotary Position Embeddings, where position is represented by rotating parts of the vector.&lt;/p&gt;&lt;p&gt;You do not need the math to understand the purpose: position makes the model aware that one token came before another, and roughly how far apart they are.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f4cc; &lt;/div&gt;&lt;div&gt; Practical takeaway: important context usually works best near the start or end of a long prompt. Many models are weaker at using information buried in the middle.  
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: why long context is still hard&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;Even if a model can accept a huge prompt, that does not mean it uses every part equally well. Attention has to compare many tokens, and retrieval quality can drop when the answer is hidden in the middle of a long context window.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;4. Attention: tokens decide what to pay attention to&lt;/h3&gt;&lt;p&gt;Attention is the heart of the transformer. It lets each token ask: &lt;strong&gt;which previous tokens should shape my current meaning?&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;For each token, the model creates three learned views:&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Name &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Question it answers &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Everyday analogy &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Query &lt;/td&gt;&lt;td&gt; What am I looking for? &lt;/td&gt;&lt;td&gt; A search request &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Key &lt;/td&gt;&lt;td&gt; What do I match with? &lt;/td&gt;&lt;td&gt; A label on stored information &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Value &lt;/td&gt;&lt;td&gt; What information should be passed along? &lt;/td&gt;&lt;td&gt; The content you copy after finding a match &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;p&gt;Imagine the sentence:&lt;/p&gt;&lt;blockquote&gt;&lt;div&gt; The cat that I saw yesterday was sleeping.  &lt;/div&gt;&lt;/blockquote&gt;&lt;p&gt;When the model reaches &lt;code&gt;was&lt;/code&gt;, it needs to know what was sleeping. 
Attention can give more weight to &lt;code&gt;cat&lt;/code&gt; than to &lt;code&gt;yesterday&lt;/code&gt;, because &lt;code&gt;cat&lt;/code&gt; is more useful for understanding the verb.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;import math
scores = {&amp;quot;cat&amp;quot;: 3.0, &amp;quot;yesterday&amp;quot;: 0.2, &amp;quot;saw&amp;quot;: 0.7}
# Softmax turns raw scores into weights that add up to 1.
exp_scores = {word: math.exp(score) for word, score in scores.items()}
total = sum(exp_scores.values())
weights = {word: value / total for word, value in exp_scores.items()}
print(weights)
# cat gets most of the weight&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f512; &lt;/div&gt;&lt;div&gt; GPT-style models use causal masking: while predicting the next token, they can look backward but not forward. Future text is hidden because it has not been generated yet.  &lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;5. Multi-head attention: many views at once&lt;/h3&gt;&lt;p&gt;One attention pattern is not enough for language. A sentence can contain grammar, references, tone, code syntax, and long-range dependencies at the same time.&lt;/p&gt;&lt;p&gt;Multi-head attention runs several attention operations in parallel. One head might track subject-verb relationships. Another might follow quotation marks. Another might notice that a variable name in code was used earlier.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: heads are learned projections, not fixed slices&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;Each head learns its own projections from the full token vector into a smaller query/key/value space. So a head is not simply handed a pre-cut piece of the vector. 
It learns its own way to view the whole token representation.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;The model then combines the outputs from all heads and sends the result onward.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;Token representation
   &amp;#x251c;&amp;#x2500; attention head 1: grammar relationship
   &amp;#x251c;&amp;#x2500; attention head 2: nearby phrase structure
   &amp;#x251c;&amp;#x2500; attention head 3: repeated pattern
   &amp;#x2514;&amp;#x2500; attention head 4: reference or pronoun link
        &amp;#x2193;
Combined into one updated token representation&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;A practical detail: during generation, the model stores old key and value vectors in a &lt;strong&gt;KV cache&lt;/strong&gt;. That way it does not need to recompute the entire conversation every time it adds one new token.&lt;/p&gt;&lt;hr&gt;&lt;h3&gt;6. Feed-forward networks: where a lot of learned structure lives&lt;/h3&gt;&lt;p&gt;After attention mixes information between tokens, each token goes through a feed-forward network.&lt;/p&gt;&lt;p&gt;Attention is about tokens communicating. The feed-forward network is more like each token doing private thinking.&lt;/p&gt;&lt;p&gt;The rough pattern is:&lt;/p&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt; Expand the vector into a larger space.  &lt;/li&gt;&lt;li&gt; Apply a non-linear function.  &lt;/li&gt;&lt;li&gt; Compress it back down.  &lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;p&gt;The non-linear step matters because it lets the model learn richer patterns. Without it, many stacked layers would collapse into something much simpler.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f9f1; &lt;/div&gt;&lt;div&gt; A lot of model parameters live in feed-forward layers. This is one reason they are often discussed as the model&amp;apos;s learned store of patterns, facts, and associations.  
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: dense models vs mixture of experts&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;In a dense transformer, every token uses the same feed-forward network in a layer. In a mixture-of-experts model, a small router chooses only a few expert networks for each token. This can increase total model capacity without making every token run through every parameter.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;7. Residual stream and normalization: keeping deep models trainable&lt;/h3&gt;&lt;p&gt;A modern LLM can have dozens or even hundreds of layers. If each layer simply replaced the previous representation, training would be fragile.&lt;/p&gt;&lt;p&gt;Residual connections solve part of that problem. Instead of replacing the vector, a block adds its output back to the existing vector.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;new_vector = old_vector + block_output&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;This creates a running stream of information through the network. Each layer can add a refinement without destroying everything that came before.&lt;/p&gt;&lt;p&gt;Layer normalization keeps the numbers stable. Without it, values can grow too large or shrink too much as they pass through many layers.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f6e0;&amp;#xfe0f; &lt;/div&gt;&lt;div&gt; The boring-sounding parts matter. Residual connections and normalization are major reasons very deep transformer stacks can actually train.  &lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;8. Next-token prediction: the answer is built one piece at a time&lt;/h3&gt;&lt;p&gt;At the end of the stack, the model turns the final vector into scores for possible next tokens. These raw scores are called logits. 
A softmax converts them into probabilities.&lt;/p&gt;&lt;p&gt;Then a decoding strategy chooses one token.&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Setting &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Plain-English effect &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; When useful &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Temperature &lt;/td&gt;&lt;td&gt; Controls randomness &lt;/td&gt;&lt;td&gt; Lower for precise answers, higher for creative drafts &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Top-k &lt;/td&gt;&lt;td&gt; Only considers the k most likely tokens &lt;/td&gt;&lt;td&gt; Prevents very unlikely choices &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Top-p &lt;/td&gt;&lt;td&gt; Considers the smallest likely group whose probabilities add up to p &lt;/td&gt;&lt;td&gt; Flexible sampling without fixed k &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;text = &amp;quot;The capital of France is&amp;quot;
while not done:
    token_ids = tokenize(text)
    vectors = transformer(token_ids)
    next_token_scores = unembed(vectors[-1])
    next_token = sample(next_token_scores, temperature=0.7)
    text += detokenize(next_token)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;That loop is the machine behind the fluent paragraph. The model writes by repeatedly asking: &lt;strong&gt;given everything so far, what token should come next?&lt;/strong&gt;&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x26a0;&amp;#xfe0f; &lt;/div&gt;&lt;div&gt; This also explains hallucinations. The base training objective rewards plausible continuation, not guaranteed truth. Post-training, retrieval, tool use, and evaluation are added to make outputs more useful and reliable.  &lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;9. 
Architecture vs weights: why models feel different&lt;/h3&gt;&lt;p&gt;Many modern LLMs share the same broad transformer-family shape. What makes them feel different is usually a combination of:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Training data:&lt;/strong&gt; what they learned from.  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Scale:&lt;/strong&gt; how many layers, heads, parameters, and tokens were used.  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Architecture choices:&lt;/strong&gt; dense or mixture-of-experts, attention variants, context length, tokenizer.  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Post-training:&lt;/strong&gt; instruction tuning, preference training, safety behavior, tool use, and product-level rules.  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;So when people compare GPT, Claude, Gemini, Llama, Mistral, Qwen, or Gemma, they are often comparing siblings in a broad transformer family rather than completely unrelated species of model.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: modern transformer vocabulary&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;RoPE:&lt;/strong&gt; position through vector rotation.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;RMSNorm:&lt;/strong&gt; a cheaper normalization variant used in many modern open models.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;SwiGLU:&lt;/strong&gt; a popular activation/feed-forward design.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;GQA:&lt;/strong&gt; grouped-query attention, which reduces KV-cache memory.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;MoE:&lt;/strong&gt; mixture of experts, where only selected expert networks run for each token.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;h3&gt;10. GPT-2 and MoE: two useful milestones&lt;/h3&gt;&lt;p&gt;Two research threads make the mechanics above feel more concrete. &lt;strong&gt;GPT-2&lt;/strong&gt; showed how far plain next-token prediction could go when scaled. 
&lt;strong&gt;Mixture of Experts&lt;/strong&gt; shows how a model can grow more capable without forcing every token to use every parameter.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f9e9; &lt;/div&gt;&lt;div&gt; Plain-English mental model: GPT-2 is like one very large generalist team. MoE is like a building with specialist rooms, where a router sends each token to only the rooms that seem useful.  &lt;/div&gt;&lt;/div&gt;&lt;h4&gt;GPT-2: scaling the next-token game&lt;/h4&gt;&lt;p&gt;OpenAI&amp;apos;s 2019 paper &lt;a href=&quot;https://cdn.openai.com/better-language-models/language-models.pdf&quot; target=&quot;_blank&quot;&gt;&lt;em&gt;Language Models are Unsupervised Multitask Learners&lt;/em&gt;&lt;/a&gt; made a simple bet famous: train a transformer to continue internet text, then test whether that same model can handle many tasks by phrasing them as text continuation.&lt;/p&gt;&lt;ul&gt;&lt;li&gt; It was autoregressive: it generated left to right, one token at a time.  &lt;/li&gt;&lt;li&gt; It was dense: every token passed through the same model weights.  &lt;/li&gt;&lt;li&gt; It helped popularize the idea that scale plus simple training can produce surprisingly general behavior.  &lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;# Simplified GPT-2-style objective
prompt = &amp;quot;Translate to French: hello&amp;quot;
target_next_token = &amp;quot; bon&amp;quot;
# Training nudges the model so this next token becomes more likely.
loss = cross_entropy(model(prompt), target_next_token)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;MoE: not every token needs the whole building&lt;/h4&gt;&lt;p&gt;A dense transformer usually runs every token through the same feed-forward network. In a &lt;strong&gt;Mixture-of-Experts&lt;/strong&gt; model, a small router chooses only a few expert networks for each token. 
The model can have many more total parameters, while each token activates only a subset.&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Concept &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Dense LLM &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; MoE LLM &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Work per token &lt;/td&gt;&lt;td&gt; Uses the same main blocks &lt;/td&gt;&lt;td&gt; Uses selected experts &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Analogy &lt;/td&gt;&lt;td&gt; One big generalist team &lt;/td&gt;&lt;td&gt; Router plus specialist teams &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Tradeoff &lt;/td&gt;&lt;td&gt; Simpler to train and serve &lt;/td&gt;&lt;td&gt; More capacity, more routing complexity &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: where the MoE papers fit&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2101.03961&quot; target=&quot;_blank&quot;&gt;Switch Transformers&lt;/a&gt; simplified MoE routing by sending each token to one expert. &lt;a href=&quot;https://arxiv.org/abs/2112.10684&quot; target=&quot;_blank&quot;&gt;Efficient Large Scale Language Modeling with Mixtures of Experts&lt;/a&gt; studied autoregressive MoE language models at scale. &lt;a href=&quot;https://arxiv.org/abs/2401.04088&quot; target=&quot;_blank&quot;&gt;Mixtral of Experts&lt;/a&gt; is a modern sparse MoE example where each token is routed to two feed-forward experts.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &amp;#x2696;&amp;#xfe0f; &lt;/div&gt;&lt;div&gt; Important nuance: MoE does not automatically mean smarter. Data quality, routing balance, training stability, inference hardware, and post-training still matter.  &lt;/div&gt;&lt;/div&gt;&lt;h3&gt;11. 
The AI ecosystem: MCP, tools, RAG, agents, and evals&lt;/h3&gt;&lt;p&gt;The transformer is the engine, but real AI products usually add a stack around it. That stack gives the model fresh information, lets it take actions, checks its work, and keeps the system observable.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f5fa;&amp;#xfe0f; &lt;/div&gt;&lt;div&gt; Plain-English map: the LLM is the text brain, tools are the hands, RAG is the open-book notes, MCP is a standard plug for external systems, agents are the loop that decides what to do next, and evals are the tests that tell you if any of it works.  &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Term &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Simple meaning &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; What it helps with &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Watch out for &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Prompt &lt;/td&gt;&lt;td&gt; Instructions and context &lt;/td&gt;&lt;td&gt; Steering behavior without changing weights &lt;/td&gt;&lt;td&gt; Vague prompts create vague answers &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Tool calling &lt;/td&gt;&lt;td&gt; The model asks your app to run a function &lt;/td&gt;&lt;td&gt; Weather, search, payments, calendars, databases &lt;/td&gt;&lt;td&gt; Validate every argument before doing anything real &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; MCP &lt;/td&gt;&lt;td&gt; A shared protocol for connecting AI apps to tools/data &lt;/td&gt;&lt;td&gt; Reusable integrations across different hosts &lt;/td&gt;&lt;td&gt; Permissions, auth, and tool descriptions matter &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; RAG &lt;/td&gt;&lt;td&gt; Retrieve relevant documents before answering &lt;/td&gt;&lt;td&gt; Fresh facts and private knowledge &lt;/td&gt;&lt;td&gt; Bad retrieval creates confident wrong answers &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Embeddings &lt;/td&gt;&lt;td&gt; Meaning as searchable 
vectors &lt;/td&gt;&lt;td&gt; Semantic search and clustering &lt;/td&gt;&lt;td&gt; Similar does not always mean correct &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Agent &lt;/td&gt;&lt;td&gt; A model inside a task loop &lt;/td&gt;&lt;td&gt; Planning, tool use, retries, handoffs &lt;/td&gt;&lt;td&gt; Needs limits, logs, and stop conditions &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Fine-tuning &lt;/td&gt;&lt;td&gt; Training on examples of desired behavior &lt;/td&gt;&lt;td&gt; Style, format, classification, repeated edge cases &lt;/td&gt;&lt;td&gt; Do evals first; do not use it as a fact database &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Evals &lt;/td&gt;&lt;td&gt; Tests for model behavior &lt;/td&gt;&lt;td&gt; Comparing prompts, tools, models, and releases &lt;/td&gt;&lt;td&gt; Tiny demo tests miss real-world messiness &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;h4&gt;MCP: the USB-C idea for AI tools&lt;/h4&gt;&lt;p&gt;MCP stands for &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. Instead of every AI app inventing a custom connector for every service, MCP defines a common client-server pattern. An AI app is the host. It creates an MCP client. That client connects to an MCP server, which exposes things like tools, resources, and prompts.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;AI app / host
  &amp;#x2514;&amp;#x2500; MCP client
       &amp;#x2514;&amp;#x2500; MCP server
            &amp;#x251c;&amp;#x2500; tools: actions the model may request
            &amp;#x251c;&amp;#x2500; resources: files, docs, database records, logs
            &amp;#x2514;&amp;#x2500; prompts: reusable instruction templates&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;The key idea is not that MCP makes the model smarter by itself. It makes integrations more standard. 
A coding agent can connect to GitHub, a support assistant can connect to tickets, and a research assistant can connect to document stores using the same basic pattern.&lt;/p&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f510; &lt;/div&gt;&lt;div&gt; Security rule: treat tools like real permissions, not decorations. If a tool can send email, delete files, spend money, or publish content, the app should require clear approval, scoped access, logging, and argument validation.  &lt;/div&gt;&lt;/div&gt;&lt;h4&gt;RAG: giving the model an open book&lt;/h4&gt;&lt;p&gt;RAG means &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;. The model does not rely only on what it learned during training. Your app first searches a knowledge base, pulls the most relevant chunks into the prompt, and asks the model to answer using that context.&lt;/p&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt; Split documents into chunks.  &lt;/li&gt;&lt;li&gt; Turn each chunk into an embedding vector.  &lt;/li&gt;&lt;li&gt; Store those vectors in a search index or vector database.  &lt;/li&gt;&lt;li&gt; When the user asks something, search for similar chunks.  &lt;/li&gt;&lt;li&gt; Put the best chunks into the model context and ask for a grounded answer.  &lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;javascript&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;const question = &amp;quot;What is our refund policy?&amp;quot;;
const hits = await vectorSearch(embed(question), { topK: 5 });
const answer = await llm.generate({
  instructions: &amp;quot;Answer only from the provided policy snippets.&amp;quot;,
  context: hits.map((hit) =&amp;gt; hit.text),
  input: question,
});&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;Agents: the loop around the model&lt;/h4&gt;&lt;p&gt;An agent is not a new kind of brain. 
It is usually an LLM plus an orchestration loop: read the goal, choose a next step, maybe call a tool, inspect the result, update the plan, and continue until done or stopped.&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;goal &amp;#x2192; think about next step &amp;#x2192; call tool &amp;#x2192; observe result &amp;#x2192; adjust plan &amp;#x2192; final answer&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f9ea; &lt;/div&gt;&lt;div&gt; Evals are what turn AI from a cool demo into an engineering system. Before shipping a new prompt, model, tool, or agent flow, test it on examples that represent real users, failure cases, and edge cases.  &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Slightly technical: how these pieces fit in one product&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;A production assistant might use MCP to discover tools, RAG to fetch private documents, tool calling to take controlled actions, structured outputs to return clean JSON, evals to measure quality, tracing to debug failures, and guardrails to block unsafe or unauthorized actions.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;A friendly checklist for understanding any LLM answer&lt;/h3&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&amp;#x2705;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;  Did the model receive the right information in the prompt? &lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&amp;#x2705;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;  Was the important context near the beginning or end? &lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&amp;#x2705;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;  Is the task asking for facts, reasoning, creativity, or formatting? 
&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&amp;#x2705;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;  Would retrieval or a tool make the answer more grounded? &lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&amp;#x2705;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;  Should the output be checked against a source before trusting it? &lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt; &amp;#x2705; &lt;/div&gt;&lt;div&gt; If you remember one thing, remember this: LLMs transform text into numbers, let those numbers exchange context through attention, and then predict the next token again and again until an answer appears.  &lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;Further reading&lt;/h3&gt;&lt;ul&gt;&lt;li&gt; Kato, &lt;a href=&quot;https://www.0xkato.xyz/how-llms-actually-work/&quot; target=&quot;_blank&quot;&gt;How LLMs Actually Work&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Vaswani et al., &lt;a href=&quot;https://arxiv.org/abs/1706.03762&quot; target=&quot;_blank&quot;&gt;Attention Is All You Need&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Su et al., &lt;a href=&quot;https://arxiv.org/abs/2104.09864&quot; target=&quot;_blank&quot;&gt;RoFormer: Enhanced Transformer with Rotary Position Embedding&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Liu et al., &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot; target=&quot;_blank&quot;&gt;Lost in the Middle: How Language Models Use Long Contexts&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Radford et al., &lt;a href=&quot;https://cdn.openai.com/better-language-models/language-models.pdf&quot; target=&quot;_blank&quot;&gt;Language Models are Unsupervised Multitask Learners&lt;/a&gt; (GPT-2)  &lt;/li&gt;&lt;li&gt; Fedus, Zoph, and Shazeer, &lt;a href=&quot;https://arxiv.org/abs/2101.03961&quot; target=&quot;_blank&quot;&gt;Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient 
Sparsity&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Artetxe et al., &lt;a href=&quot;https://arxiv.org/abs/2112.10684&quot; target=&quot;_blank&quot;&gt;Efficient Large Scale Language Modeling with Mixtures of Experts&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Jiang et al., &lt;a href=&quot;https://arxiv.org/abs/2401.04088&quot; target=&quot;_blank&quot;&gt;Mixtral of Experts&lt;/a&gt;&lt;/li&gt;&lt;li&gt; Model Context Protocol, &lt;a href=&quot;https://modelcontextprotocol.io/docs/learn/architecture&quot; target=&quot;_blank&quot;&gt;Architecture overview&lt;/a&gt;&lt;/li&gt;&lt;li&gt; OpenAI, &lt;a href=&quot;https://developers.openai.com/api/docs/guides/function-calling&quot; target=&quot;_blank&quot;&gt;Function calling / tool calling guide&lt;/a&gt;&lt;/li&gt;&lt;li&gt; OpenAI, &lt;a href=&quot;https://openai.com/index/introducing-text-and-code-embeddings/&quot; target=&quot;_blank&quot;&gt;Introducing text and code embeddings&lt;/a&gt;&lt;/li&gt;&lt;li&gt; OpenAI Agents SDK, &lt;a href=&quot;https://openai.github.io/openai-agents-python/agents/&quot; target=&quot;_blank&quot;&gt;Agents guide&lt;/a&gt;&lt;/li&gt;&lt;li&gt; OpenAI, &lt;a href=&quot;https://developers.openai.com/api/docs/guides/supervised-fine-tuning&quot; target=&quot;_blank&quot;&gt;Supervised fine-tuning guide&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;&lt;div&gt; Polished enough to read like an essay, structured enough to use as a reference, and simple enough that you can explain it to a friend after one pass.  &lt;/div&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;/div&gt;</content>
    </item>
    <item>
      <title>The 2026 AI Landscape: A Hacker&apos;s Deep Dive 🤖</title>
      <link>https://oreoro.github.io/posts/2026-ai-landscape/</link>
      <guid isPermaLink="true">https://oreoro.github.io/posts/2026-ai-landscape/</guid>
      <description>Everything you need to understand the current AI moment, from transformers to agents, RAG pipelines to MCP, and every buzzword in between.</description>
      <pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate>
      <lastUpdatedTimestamp>Thu Jun 04 2026 15:22:00 GMT+0000 (Coordinated Universal Time)</lastUpdatedTimestamp>
      <category>Personal Notes</category>
      <category>Guide</category>
      <category>Information</category>
      <category>Tools</category>
      <content>&lt;div&gt;
                    &lt;p&gt;
                        &lt;em&gt;Note:&lt;/em&gt; This RSS feed strips out SVGs and embeds. You might want to read the post on the webpage
                        &lt;a href=&quot;https://oreoro.github.io/posts/2026-ai-landscape/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.
                    &lt;/p&gt;
                    &lt;hr&gt;
                &lt;div&gt;&lt;p&gt;&lt;time&gt; June 4, 2026 &lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt; &amp;#x1f916; &lt;/div&gt;&lt;div&gt; This deep dive is structured as a native Notion article: use the table of contents below to jump between architecture, agents, RAG, protocols, frameworks, prompting, vector databases, glossary, and code appendix.  &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Layer &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Native Notion treatment &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Navigation &lt;/td&gt;&lt;td&gt; Built-in table of contents plus semantic headings &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Reference material &lt;/td&gt;&lt;td&gt; Native tables and collapsible glossary sections &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Implementation detail &lt;/td&gt;&lt;td&gt; Language-aware code blocks and equation blocks &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Long-form reading &lt;/td&gt;&lt;td&gt; Callouts, dividers, and structured sections &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;hr&gt;&lt;div&gt;&lt;div&gt; &amp;#x2615; &lt;/div&gt;&lt;div&gt; Everything you need to understand the current AI moment &amp;#x2014; from transformers to agents, RAG pipelines to MCP, and every buzzword in between. Grab a coffee.  &lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;1. The Big Picture: Where AI Stands in 2026&lt;/h3&gt;&lt;p&gt;The AI landscape of 2026 is defined by a single, sweeping shift: &lt;strong&gt;from chat to action&lt;/strong&gt;. The previous era was dominated by raw model intelligence &amp;#x2014; who had the biggest, smartest LLM. The current era prioritizes &lt;strong&gt;orchestration layers&lt;/strong&gt; that unify multiple models and tools to automate complex, end-to-end business workflows. 
We&amp;apos;ve moved from &amp;quot;AI that talks&amp;quot; to &amp;quot;AI that does.&amp;quot;[1][2]&lt;/p&gt;&lt;p&gt;A few landmark data points paint the picture clearly:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;700 million people&lt;/strong&gt; use ChatGPT weekly as of mid-2025[3]  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Anthropic&lt;/strong&gt; now holds ~40% of enterprise LLM API spend; &lt;strong&gt;OpenAI has dropped to 27%&lt;/strong&gt;, down from ~50% in 2023[4]  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;2026 is the year of autonomous AI agents&lt;/strong&gt; &amp;#x2014; goal-driven systems combining reasoning, planning, and tool use, marking the biggest functional jump since GPT-3[1]  &lt;/li&gt;&lt;li&gt; Open-source models (Meta&amp;apos;s Llama 4 family with 10M token context windows) have &lt;strong&gt;narrowed the gap&lt;/strong&gt; with proprietary models dramatically[4]  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The defining theme is the &lt;strong&gt;Agentic Web&lt;/strong&gt; &amp;#x2014; where AI agents serve as the primary gateway to the internet, navigating backends through APIs rather than humans switching between websites.[2]&lt;/p&gt;&lt;hr&gt;&lt;h3&gt;2. 
The GPT Evolution &amp;#x2014; A Complete Timeline&lt;/h3&gt;&lt;p&gt;The journey from GPT-1 to GPT-5.5 is arguably the fastest capability evolution in computing history &amp;#x2014; parameters grew from 117 million to 175+ billion, a &lt;strong&gt;1,495&amp;#xd7; increase in two years&lt;/strong&gt; (GPT-1 to GPT-3).[3]&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Model &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Date &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Params &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Key Leap &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-1&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; June 2018 &lt;/td&gt;&lt;td&gt; 117M &lt;/td&gt;&lt;td&gt; Proved unsupervised pre-training works[3] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-2&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Feb 2019 &lt;/td&gt;&lt;td&gt; 1.5B &lt;/td&gt;&lt;td&gt; Coherent long-form text; initially &amp;quot;too dangerous to release&amp;quot;[5] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-3&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; May 2020 &lt;/td&gt;&lt;td&gt; 175B &lt;/td&gt;&lt;td&gt; First commercially viable model; in-context learning[5] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-3.5 / InstructGPT&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; 2022 &lt;/td&gt;&lt;td&gt; ~175B &lt;/td&gt;&lt;td&gt; RLHF introduced; gave us ChatGPT[5] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-4&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; March 2023 &lt;/td&gt;&lt;td&gt; Undisclosed &lt;/td&gt;&lt;td&gt; Multimodal (text + image); reasoning at scale[3] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; May 2024 &lt;/td&gt;&lt;td&gt; Undisclosed &lt;/td&gt;&lt;td&gt; Omnimodal (text, image, audio natively); 2&amp;#xd7; speed[6] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-4.5&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Feb 
2025 &lt;/td&gt;&lt;td&gt; Undisclosed &lt;/td&gt;&lt;td&gt; Stronger world knowledge, fewer hallucinations[6] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Aug 2025 &lt;/td&gt;&lt;td&gt; Undisclosed &lt;/td&gt;&lt;td&gt; 94.6% on AIME 2025 (competition math); 45% fewer hallucinations vs GPT-4o[3] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; April 2026 &lt;/td&gt;&lt;td&gt; Undisclosed &lt;/td&gt;&lt;td&gt; Native omnimodal; autonomous computer use; agentic coding[6] &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;h4&gt;The o-Series: Reasoning Models&lt;/h4&gt;&lt;p&gt;Alongside the GPT-series, OpenAI launched a separate line of &lt;strong&gt;reasoning-first models&lt;/strong&gt;: o1, o3, and o4-mini. They still generate text one token at a time, but they are trained to &amp;quot;think before they speak,&amp;quot; producing an internal chain of thought before the visible answer. The o3 and o4-mini models launched in April 2025 with significantly enhanced reasoning, particularly in STEM.[5][6]&lt;/p&gt;&lt;hr&gt;&lt;h3&gt;3. How LLMs Actually Work: The Transformer Architecture&lt;/h3&gt;&lt;p&gt;Nearly every modern LLM is built on the &lt;strong&gt;Transformer&lt;/strong&gt;, introduced in the landmark 2017 paper &lt;em&gt;&amp;quot;Attention Is All You Need&amp;quot;&lt;/em&gt;. 
Here&amp;apos;s the architecture unwrapped:[7]&lt;/p&gt;&lt;h4&gt;3.1 The Four Building Blocks&lt;/h4&gt;&lt;p&gt;A transformer has four core building blocks. The first two are applied once at the input; the last two repeat in every layer:[8]&lt;/p&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Token Embeddings&lt;/strong&gt; &amp;#x2014; Convert words/subwords into numerical vectors in high-dimensional space  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Positional Encoding&lt;/strong&gt; &amp;#x2014; Inject information about the order of tokens (since attention has no built-in notion of sequence)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Attention&lt;/strong&gt; &amp;#x2014; The magic: lets each token &amp;quot;look at&amp;quot; every other token  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Feed-Forward Block&lt;/strong&gt; &amp;#x2014; A pair of linear transformations with a nonlinearity in between, applied position-wise  &lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;h4&gt;3.2 Self-Attention: The Core Insight&lt;/h4&gt;&lt;p&gt;Self-attention answers the question: &lt;em&gt;&amp;quot;Which other words should I focus on to understand my own meaning?&amp;quot;&lt;/em&gt; For every token, three vectors are computed:[9][10]&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Q (Query)&lt;/strong&gt; &amp;#x2014; &amp;quot;What am I looking for?&amp;quot;  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;K (Key)&lt;/strong&gt; &amp;#x2014; &amp;quot;What do I contain?&amp;quot;  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;V (Value)&lt;/strong&gt; &amp;#x2014; &amp;quot;What information do I carry?&amp;quot;  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The attention output is computed as:&lt;/p&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;p&gt;where &lt;span&gt;\(d_k\)&lt;/span&gt; is the key dimension. The &lt;span&gt;\(\sqrt{d_k}\)&lt;/span&gt; scaling keeps the dot products from growing with the dimension, which would saturate the softmax and make its gradients vanish.[11]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    &amp;quot;&amp;quot;&amp;quot;
    Core self-attention mechanism.
    Q, K, V: (batch_size, seq_len, d_k); extra leading dims (e.g. heads) also work.
    &amp;quot;&amp;quot;&amp;quot;
    d_k = Q.size(-1)
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Apply optional mask (for decoder / causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float(&amp;apos;-inf&amp;apos;))
    # Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

# Example: 2 sentences, 4 tokens, 8-dim embeddings
batch_size, seq_len, d_model = 2, 4, 8
d_k = 8
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)
output, weights = scaled_dot_product_attention(Q, K, V)
print(f&amp;quot;Output shape: {output.shape}&amp;quot;)        # torch.Size([2, 4, 8])
print(f&amp;quot;Attention weights: {weights.shape}&amp;quot;)  # torch.Size([2, 4, 4])&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;3.3 Multi-Head Attention&lt;/h4&gt;&lt;p&gt;Instead of computing attention once, 
transformers run &lt;strong&gt;multiple attention heads in parallel&lt;/strong&gt; &amp;#x2014; GPT-3 uses 96 attention heads per block. Each head learns a different &amp;quot;relevance function.&amp;quot; The outputs are concatenated and projected:[7]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        B, T, d = x.shape
        # (B, T, d) -&amp;gt; (B, num_heads, T, d_k)
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x, mask=None):
        # mask, if given, must broadcast against (B, num_heads, T, T)
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        # Reuses scaled_dot_product_attention from section 3.2
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads: (B, num_heads, T, d_k) -&amp;gt; (B, T, d_model)
        B, H, T, d_k = attn_out.shape
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, H * d_k)
        return self.W_o(attn_out)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;3.4 Modern Architecture Improvements&lt;/h4&gt;&lt;p&gt;The vanilla transformer has been significantly optimized:[11]&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Innovation &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; What It Does &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;FlashAttention&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Reduces memory traffic 2&amp;#x2013;4&amp;#xd7; by optimizing GPU SRAM access patterns 
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;RoPE (Rotary Position Embeddings)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Better position encoding enabling longer context windows &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;MoE (Mixture of Experts)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Activates only a subset of parameters per token &amp;#x2014; enables huge models at lower compute cost &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GQA (Grouped Query Attention)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Multiple query heads share key/value heads, reducing KV-cache memory &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Linear Attention&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Reduces complexity from O(n&amp;#xb2;) to O(n) for long documents &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;4. The Agentic AI Revolution&lt;/h3&gt;&lt;p&gt;The single biggest shift in 2026 is from &lt;strong&gt;generative AI&lt;/strong&gt; (creates content) to &lt;strong&gt;agentic AI&lt;/strong&gt; (autonomous systems that plan, decide, and execute). 
Where a generative model answers your question, an agentic AI accomplishes your goal.[12]&lt;/p&gt;&lt;h4&gt;4.1 What Makes an AI Agent?&lt;/h4&gt;&lt;p&gt;An agent has four capabilities that a plain chatbot lacks:[12]&lt;/p&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Goal Understanding&lt;/strong&gt; &amp;#x2014; Decompose a complex objective into sub-tasks  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multi-Step Planning&lt;/strong&gt; &amp;#x2014; Create and revise a plan of action  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tool Use&lt;/strong&gt; &amp;#x2014; Execute functions, call APIs, browse the web, write code  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Self-Correction&lt;/strong&gt; &amp;#x2014; Observe outcomes and adjust behavior in a loop  &lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;h4&gt;4.2 The ReAct Pattern&lt;/h4&gt;&lt;p&gt;&lt;strong&gt;ReAct (Reasoning + Acting)&lt;/strong&gt; is the foundational pattern for agents, introduced by Yao et al. The model interleaves reasoning traces with actions:[13]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;Thought: I need to find the current population of Karachi.
Action: search(&amp;quot;Karachi population 2026&amp;quot;)
Observation: Karachi population is approximately 16.5 million.
Thought: Now I can answer the question.
Answer: Karachi has approximately 16.5 million people.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;from openai import OpenAI
import json

client = OpenAI()

# Define tools the agent can use
tools = [
    {
        &amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
        &amp;quot;function&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;web_search&amp;quot;,
            &amp;quot;description&amp;quot;: &amp;quot;Search the web for current information&amp;quot;,
            &amp;quot;parameters&amp;quot;: {
                &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
                &amp;quot;properties&amp;quot;: {
                    &amp;quot;query&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;Search query&amp;quot;}
                },
                &amp;quot;required&amp;quot;: [&amp;quot;query&amp;quot;]
            }
        }
    },
    {
        &amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
        &amp;quot;function&amp;quot;: {
            &amp;quot;name&amp;quot;: &amp;quot;run_python&amp;quot;,
            &amp;quot;description&amp;quot;: &amp;quot;Execute Python code and return the result&amp;quot;,
            &amp;quot;parameters&amp;quot;: {
                &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
                &amp;quot;properties&amp;quot;: {
                    &amp;quot;code&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;Python code to run&amp;quot;}
                },
                &amp;quot;required&amp;quot;: [&amp;quot;code&amp;quot;]
            }
        }
    }
]

def execute_tool(name: str, args: dict) -&amp;gt; str:
    # Placeholder dispatcher (not part of any library): plug real tool implementations in here
    raise NotImplementedError(f&amp;quot;No implementation registered for tool {name!r}&amp;quot;)

def run_react_agent(task: str, max_steps: int = 5):
    messages = [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: task}]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model=&amp;quot;gpt-4o&amp;quot;,
            messages=messages,
            tools=tools,
            tool_choice=&amp;quot;auto&amp;quot;
        )
        msg = response.choices[0].message
        # No tool call = final answer
        if not msg.tool_calls:
            return msg.content
        # Execute tool calls and feed each result back to the model
        messages.append(msg)
        for tool_call in msg.tool_calls:
            result = execute_tool(tool_call.function.name,
                                  json.loads(tool_call.function.arguments))
            messages.append({
                &amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
                &amp;quot;tool_call_id&amp;quot;: tool_call.id,
                &amp;quot;content&amp;quot;: str(result)
            })
    return &amp;quot;Max steps reached&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;4.3 Agentic Patterns&lt;/h4&gt;&lt;p&gt;Four core patterns drive agent behavior:[14]&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Pattern &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Description &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Example &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Reflection&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Agent critiques its own output and revises &lt;/td&gt;&lt;td&gt; Code reviewer that re-checks generated code &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Planning&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Decompose goal into ordered sub-tasks &lt;/td&gt;&lt;td&gt; Research agent building a structured outline &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Tool Use&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Call external APIs and functions &lt;/td&gt;&lt;td&gt; Weather agent calling a weather API &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Multi-Agent&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Spawn specialized sub-agents &lt;/td&gt;&lt;td&gt; Orchestrator delegates to coder + tester agents &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;h4&gt;4.4 Multi-Agent Systems (MAS)&lt;/h4&gt;&lt;p&gt;By 2026, the field has moved beyond single-purpose agents to &lt;strong&gt;Multi-Agent Systems&lt;/strong&gt; &amp;#x2014; AI &amp;quot;teams&amp;quot; where specialized agents collaborate to achieve a shared objective, mirroring microservice architecture in traditional software.[12]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;                   &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;
                   &amp;#x2502;  Orchestrator   &amp;#x2502;
                   &amp;#x2502;  (Planner LLM)  &amp;#x2502;
                   &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x252c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;
            &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x253c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;
            &amp;#x25bc;               &amp;#x25bc;               &amp;#x25bc;
    &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510; &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510; &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;
    &amp;#x2502;  Researcher  &amp;#x2502; &amp;#x2502;   Coder     &amp;#x2502; &amp;#x2502;   Reviewer   &amp;#x2502;
    &amp;#x2502;   Agent      &amp;#x2502; &amp;#x2502;   Agent     &amp;#x2502; &amp;#x2502;   Agent      &amp;#x2502;
    &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518; &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518; &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;5. RAG: Retrieval-Augmented Generation Deep Dive&lt;/h3&gt;&lt;p&gt;RAG is how you give an LLM access to your private data without retraining it. In 2026, &lt;strong&gt;RAG is default infrastructure&lt;/strong&gt; for enterprise LLM applications.[1]&lt;/p&gt;&lt;h4&gt;5.1 Naive RAG Pipeline&lt;/h4&gt;&lt;p&gt;The vanilla RAG flow is simple:[15]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;User Query
    &amp;#x2502;
    &amp;#x25bc;
Embed Query (vector)
    &amp;#x2502;
    &amp;#x25bc;
Similarity Search &amp;#x2192; Vector DB &amp;#x2192; Top-K Documents
    &amp;#x2502;
    &amp;#x25bc;
Inject Context into LLM Prompt
    &amp;#x2502;
    &amp;#x25bc;
LLM generates grounded response&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;from openai import OpenAI
import numpy as np

client = OpenAI()

# Step 1: Embed documents at ingestion time
def embed_text(text: str) -&amp;gt; list[float]:
    response = client.embeddings.create(
        model=&amp;quot;text-embedding-3-small&amp;quot;,
        input=text
    )
    return response.data[0].embedding

# Step 2: Simple cosine similarity search
def cosine_similarity(a: list, b: list) -&amp;gt; float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, documents: list[dict], top_k: int = 3):
    # Each document dict is expected to carry a precomputed &amp;quot;embedding&amp;quot; vector
    query_vec = embed_text(query)
    scored = [
        (doc, cosine_similarity(query_vec, doc[&amp;quot;embedding&amp;quot;]))
        for doc in documents
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Step 3: Generate grounded answer
def rag_answer(query: str, docs: list[str]) -&amp;gt; str:
    context = &amp;quot;\n\n&amp;quot;.join(docs)
    prompt = f&amp;quot;&amp;quot;&amp;quot;Answer the question using ONLY the context below.
If the answer isn&amp;apos;t in the context, say &amp;quot;I don&amp;apos;t know.&amp;quot;

Context:
{context}

Question: {query}&amp;quot;&amp;quot;&amp;quot;
    response = client.chat.completions.create(
        model=&amp;quot;gpt-4o-mini&amp;quot;,
        messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: prompt}]
    )
    return response.choices[0].message.content&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;5.2 Advanced RAG Techniques&lt;/h4&gt;&lt;p&gt;Naive RAG breaks at scale. 
Production systems use:[16]&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Technique &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; How It Helps &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Chunking strategies&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Semantic splitting preserves context better than fixed-size chunks &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; BM25 (keyword) + vector search for better recall &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Reranking&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Cross-encoder reranks top-K results for precision &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;HyDE&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Generate a hypothetical answer, embed it, then search &amp;#x2014; better for abstract queries &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;MMR (Maximal Marginal Relevance)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Diversify retrieved documents to avoid redundancy &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Multi-vector retrieval&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Store summary + detailed chunks separately &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;h4&gt;5.3 Agentic RAG&lt;/h4&gt;&lt;p&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt; extends RAG with an agent layer that can iterate, re-retrieve, and validate before answering:[15]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;import json

class AgenticRAGPipeline:
    &amp;quot;&amp;quot;&amp;quot;
    Agentic RAG: Agent decides WHEN and WHAT to retrieve,
    can iterate multiple retrieval rounds, and validates output.
    &amp;quot;&amp;quot;&amp;quot;
    def __init__(self, retriever, llm_client):
        # retriever is assumed to expose search(query, top_k) and return a list of strings
        self.retriever = retriever
        self.client = llm_client
        self.retrieved_docs = []

    def should_retrieve_more(self, current_answer: str, query: str) -&amp;gt; tuple[bool, str]:
        &amp;quot;&amp;quot;&amp;quot;Ask the LLM if it needs more context. Returns (needs_more, missing).&amp;quot;&amp;quot;&amp;quot;
        check_prompt = f&amp;quot;&amp;quot;&amp;quot;Query: {query}
Current draft answer: {current_answer}
Is this answer complete and well-supported? Reply with JSON: {{&amp;quot;complete&amp;quot;: true/false, &amp;quot;missing&amp;quot;: &amp;quot;what&amp;apos;s missing&amp;quot;}}&amp;quot;&amp;quot;&amp;quot;
        response = self.client.chat.completions.create(
            model=&amp;quot;gpt-4o-mini&amp;quot;,
            messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: check_prompt}],
            response_format={&amp;quot;type&amp;quot;: &amp;quot;json_object&amp;quot;}
        )
        result = json.loads(response.choices[0].message.content)
        return not result[&amp;quot;complete&amp;quot;], result.get(&amp;quot;missing&amp;quot;, &amp;quot;&amp;quot;)

    def run(self, query: str, max_iterations: int = 3) -&amp;gt; str:
        answer = &amp;quot;&amp;quot;
        missing = &amp;quot;&amp;quot;
        self.retrieved_docs = []  # Start each run with a clean context
        for i in range(max_iterations):
            # Retrieve relevant docs; after the first round, target the reported gap
            search_query = query if i == 0 else f&amp;quot;{query} - focusing on: {missing}&amp;quot;
            new_docs = self.retriever.search(search_query, top_k=5)
            self.retrieved_docs.extend(new_docs)
            # Generate answer with all accumulated context
            context = &amp;quot;\n---\n&amp;quot;.join(self.retrieved_docs)
            answer = self._generate(query, context)
            # Check if we need more info
            needs_more, missing = self.should_retrieve_more(answer, query)
            if not needs_more:
                break
        return answer

    def _generate(self, query: str, context: str) -&amp;gt; str:
        response = self.client.chat.completions.create(
            model=&amp;quot;gpt-4o&amp;quot;,
            messages=[{
                &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
                &amp;quot;content&amp;quot;: f&amp;quot;Context:\n{context}\n\nAnswer: {query}&amp;quot;
            }]
        )
        return response.choices[0].message.content&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;6. MCP: The USB-C for AI Tools&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is an open standard introduced by Anthropic in November 2024 to standardize how AI systems integrate with external tools, data sources, and services. Think of it as the USB-C port for AI &amp;#x2014; one standard connector for everything.[17]&lt;/p&gt;&lt;h4&gt;6.1 Why MCP Matters&lt;/h4&gt;&lt;p&gt;Before MCP, every AI-tool integration was a custom one-off. MCP provides:[18]&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Tools&lt;/strong&gt; &amp;#x2014; Functions the AI can call (e.g., &lt;code&gt;run_sql&lt;/code&gt;, &lt;code&gt;read_file&lt;/code&gt;)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Resources&lt;/strong&gt; &amp;#x2014; Data the AI can read (files, database records, API responses)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Prompts&lt;/strong&gt; &amp;#x2014; Reusable prompt templates  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Sampling&lt;/strong&gt; &amp;#x2014; The server can ask the client to run an LLM query  &lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div&gt;&lt;pre 
data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;&amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;&amp;#x2502;                  MCP Architecture                &amp;#x2502;&amp;#x2502;                                                  &amp;#x2502;&amp;#x2502;  &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;      MCP Protocol              &amp;#x2502;&amp;#x2502;  &amp;#x2502;  AI Client   &amp;#x2502;&amp;#x25c4;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x25ba;&amp;#x2510;           &amp;#x2502;&amp;#x2502;  &amp;#x2502; (Claude/GPT) &amp;#x2502;                    &amp;#x2502;           &amp;#x2502;&amp;#x2502;  &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;         &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2534;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;  
&amp;#x2502;&amp;#x2502;                            &amp;#x2502;    MCP Server     &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;  (your tools)     &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;                   &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;  &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;  &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;  &amp;#x2502;  Tools      &amp;#x2502;  &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;  &amp;#x2502;  Resources  &amp;#x2502;  &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;  &amp;#x2502;  Prompts    &amp;#x2502;  &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2502;  &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;  &amp;#x2502;  &amp;#x2502;&amp;#x2502;                            &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;  
&amp;#x2502;&amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;6.2 Building an MCP Server&lt;/h4&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;# Minimal MCP server using the official Python SDKfrom mcp.server import Serverfrom mcp.server.stdio import stdio_serverfrom mcp.types import Tool, TextContentimport mcp.types as typesapp = Server(&amp;quot;my-mcp-server&amp;quot;)@app.list_tools()async def list_tools() -&amp;gt; list[Tool]:    return [        Tool(            name=&amp;quot;get_weather&amp;quot;,            description=&amp;quot;Get the current weather for a city&amp;quot;,            inputSchema={                &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,                &amp;quot;properties&amp;quot;: {                    &amp;quot;city&amp;quot;: {                        &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,                        &amp;quot;description&amp;quot;: &amp;quot;City name&amp;quot;                    }                },                &amp;quot;required&amp;quot;: [&amp;quot;city&amp;quot;]            }        ),        Tool(            name=&amp;quot;run_sql&amp;quot;,            description=&amp;quot;Execute a read-only SQL query against our DB&amp;quot;,            inputSchema={                &amp;quot;type&amp;quot;: 
&amp;quot;object&amp;quot;,                &amp;quot;properties&amp;quot;: {                    &amp;quot;query&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;}                },                &amp;quot;required&amp;quot;: [&amp;quot;query&amp;quot;]            }        )    ]@app.call_tool()async def call_tool(name: str, arguments: dict) -&amp;gt; list[TextContent]:    if name == &amp;quot;get_weather&amp;quot;:        city = arguments[&amp;quot;city&amp;quot;]        # Call your actual weather API here        return [TextContent(type=&amp;quot;text&amp;quot;, text=f&amp;quot;Weather in {city}: 28&amp;#xb0;C, sunny&amp;quot;)]    elif name == &amp;quot;run_sql&amp;quot;:        query = arguments[&amp;quot;query&amp;quot;]        # Execute query safely        results = execute_readonly_query(query)        return [TextContent(type=&amp;quot;text&amp;quot;, text=str(results))]async def main():    async with stdio_server() as (read_stream, write_stream):        await app.run(read_stream, write_stream, app.create_initialization_options())if __name__ == &amp;quot;__main__&amp;quot;:    import asyncio    asyncio.run(main())&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;6.3 MCP vs Traditional APIs&lt;/h4&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Dimension &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; REST API &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; MCP Server &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; Discovery &lt;/td&gt;&lt;td&gt; Manual (read docs) &lt;/td&gt;&lt;td&gt; Auto-discovery via &lt;code&gt;list_tools()&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Schema &lt;/td&gt;&lt;td&gt; OpenAPI/Swagger &lt;/td&gt;&lt;td&gt; JSON Schema, AI-readable &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Context sharing &lt;/td&gt;&lt;td&gt; Per-request &lt;/td&gt;&lt;td&gt; Stateful sessions with context &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; AI 
integration &lt;/td&gt;&lt;td&gt; Custom glue code &lt;/td&gt;&lt;td&gt; Native, standardized &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt; Multi-tool &lt;/td&gt;&lt;td&gt; N integrations &lt;/td&gt;&lt;td&gt; One MCP layer &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;7. A2A: The Protocol for Agent Collaboration&lt;/h3&gt;&lt;p&gt;While MCP handles &lt;strong&gt;agent-to-tool&lt;/strong&gt; communication, &lt;strong&gt;A2A (Agent2Agent)&lt;/strong&gt; &amp;#x2014; announced by Google in April 2025 &amp;#x2014; handles &lt;strong&gt;agent-to-agent&lt;/strong&gt; communication.[19]&lt;/p&gt;&lt;h4&gt;7.1 The Problem A2A Solves&lt;/h4&gt;&lt;p&gt;Imagine a travel booking agent that needs to coordinate with a payment agent from a different company, a hotel API agent, and an airline agent &amp;#x2014; all built on different frameworks. A2A enables them to discover each other, understand capabilities, and coordinate tasks without sharing internals.[19]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;MCP Layer (Vertical):          A2A Layer (Horizontal):
Agent &amp;#x25c4;&amp;#x2500;&amp;#x2500;tools&amp;#x2500;&amp;#x2500;&amp;#x25ba; APIs         Agent &amp;#x25c4;&amp;#x2500;&amp;#x2500;collaborate&amp;#x2500;&amp;#x2500;&amp;#x25ba; Agent
Agent &amp;#x25c4;&amp;#x2500;&amp;#x2500;tools&amp;#x2500;&amp;#x2500;&amp;#x25ba; Databases    Agent &amp;#x25c4;&amp;#x2500;&amp;#x2500;delegate&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x25ba; Agent
Agent &amp;#x25c4;&amp;#x2500;&amp;#x2500;tools&amp;#x2500;&amp;#x2500;&amp;#x25ba; Files        Agent &amp;#x25c4;&amp;#x2500;&amp;#x2500;coordinate&amp;#x2500;&amp;#x2500;&amp;#x25ba; Agent&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;7.2 Agent Cards: The Discovery Mechanism&lt;/h4&gt;&lt;p&gt;Every A2A agent publishes an &lt;strong&gt;Agent Card&lt;/strong&gt; &amp;#x2014; a JSON document at a well-known URL that describes the agent&amp;apos;s capabilities:[20]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;json&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;payment-processor-agent&amp;quot;,
  &amp;quot;version&amp;quot;: &amp;quot;1.2.0&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Handles payment processing for e-commerce workflows&amp;quot;,
  &amp;quot;endpoint&amp;quot;: &amp;quot;https://payments.example.com/a2a&amp;quot;,
  &amp;quot;skills&amp;quot;: [
    {
      &amp;quot;id&amp;quot;: &amp;quot;process_payment&amp;quot;,
      &amp;quot;name&amp;quot;: &amp;quot;Process Payment&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Charge a customer for a transaction&amp;quot;,
      &amp;quot;inputModes&amp;quot;: [&amp;quot;text&amp;quot;, &amp;quot;json&amp;quot;],
      &amp;quot;outputModes&amp;quot;: [&amp;quot;json&amp;quot;]
    },
    {
      &amp;quot;id&amp;quot;: &amp;quot;refund&amp;quot;,
      &amp;quot;name&amp;quot;: &amp;quot;Issue Refund&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Refund a previously processed payment&amp;quot;
    }
  ],
  &amp;quot;authentication&amp;quot;: {
    &amp;quot;schemes&amp;quot;: [&amp;quot;Bearer&amp;quot;]
  }
}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;7.3 MCP + A2A: The Full Stack&lt;/h4&gt;&lt;p&gt;Google positioned A2A as &lt;strong&gt;complementary to MCP&lt;/strong&gt;, not competitive:[20]&lt;/p&gt;&lt;div&gt;&lt;div&gt;&lt;pre 
data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;&amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;&amp;#x2502;              Enterprise AI Architecture           &amp;#x2502;&amp;#x2502;                                                   &amp;#x2502;&amp;#x2502;    &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;   A2A   &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;               &amp;#x2502;&amp;#x2502;    &amp;#x2502; Agent A &amp;#x2502;&amp;#x25c4;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x25ba;&amp;#x2502; Agent B &amp;#x2502;               &amp;#x2502;&amp;#x2502;    &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x252c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;         &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x252c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;               &amp;#x2502;&amp;#x2502;         &amp;#x2502; MCP               &amp;#x2502; MCP                 &amp;#x2502;&amp;#x2502;    &amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2534;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;         
&amp;#x250c;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2534;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2510;               &amp;#x2502;&amp;#x2502;    &amp;#x2502; Tools &amp;amp; &amp;#x2502;         &amp;#x2502; Tools &amp;amp; &amp;#x2502;               &amp;#x2502;&amp;#x2502;    &amp;#x2502;  Data   &amp;#x2502;         &amp;#x2502;  Data   &amp;#x2502;               &amp;#x2502;&amp;#x2502;    &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;         &amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;               &amp;#x2502;&amp;#x2502;                                                   &amp;#x2502;&amp;#x2502;  MCP = vertical (agent &amp;#x2194; tools)                  &amp;#x2502;&amp;#x2502;  A2A = horizontal (agent &amp;#x2194; agent)                &amp;#x2502;&amp;#x2514;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2518;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;8. AI Agent Frameworks Compared&lt;/h3&gt;&lt;p&gt;Choosing the wrong framework costs weeks. 
Here&amp;apos;s the production-tested ranking for 2026:[21]&lt;/p&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Framework &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Best For &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Architecture Style &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; MCP/A2A Support &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Complex stateful production workflows &lt;/td&gt;&lt;td&gt; Graph-based, explicit state machines &lt;/td&gt;&lt;td&gt; &amp;#x2705; MCP &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Claude Agent SDK&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Anthropic-native production agents &lt;/td&gt;&lt;td&gt; Native Claude hooks + subagents &lt;/td&gt;&lt;td&gt; &amp;#x2705; MCP native &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Fast multi-agent prototypes &lt;/td&gt;&lt;td&gt; Role-based crews &lt;/td&gt;&lt;td&gt; &amp;#x2705; MCP &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;AutoGen / AG2&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Research-style conversational agents &lt;/td&gt;&lt;td&gt; Conversational multi-agent &lt;/td&gt;&lt;td&gt; &amp;#x2705; MCP &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Enterprise / .NET stacks &lt;/td&gt;&lt;td&gt; Plugin-based, Azure-first &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;LlamaIndex&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; RAG-grounded agents &lt;/td&gt;&lt;td&gt; Data-layer first &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Pydantic AI&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Type-safe Python agents &lt;/td&gt;&lt;td&gt; Pydantic validation throughout &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;h4&gt;8.1 LangGraph: 
Production-Ready State Machines&lt;/h4&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;from langgraph.graph import StateGraph, ENDfrom typing import TypedDict, Annotatedimport operatorclass AgentState(TypedDict):    messages: Annotated[list, operator.add]    query: str    documents: list[str]    answer: str    needs_more_info: booldef retrieve_node(state: AgentState) -&amp;gt; AgentState:    &amp;quot;&amp;quot;&amp;quot;Retrieve relevant documents.&amp;quot;&amp;quot;&amp;quot;    docs = vector_store.search(state[&amp;quot;query&amp;quot;], k=5)    return {&amp;quot;documents&amp;quot;: docs}def generate_node(state: AgentState) -&amp;gt; AgentState:    &amp;quot;&amp;quot;&amp;quot;Generate answer from retrieved docs.&amp;quot;&amp;quot;&amp;quot;    context = &amp;quot;\n&amp;quot;.join(state[&amp;quot;documents&amp;quot;])    answer = llm.invoke(f&amp;quot;Context: {context}\nQuestion: {state[&amp;apos;query&amp;apos;]}&amp;quot;)    return {&amp;quot;answer&amp;quot;: answer, &amp;quot;needs_more_info&amp;quot;: False}def check_node(state: AgentState) -&amp;gt; str:    &amp;quot;&amp;quot;&amp;quot;Route: done or need more retrieval?&amp;quot;&amp;quot;&amp;quot;    return &amp;quot;done&amp;quot; if not state[&amp;quot;needs_more_info&amp;quot;] else &amp;quot;retrieve&amp;quot;# Build the graphworkflow = StateGraph(AgentState)workflow.add_node(&amp;quot;retrieve&amp;quot;, retrieve_node)workflow.add_node(&amp;quot;generate&amp;quot;, generate_node)workflow.set_entry_point(&amp;quot;retrieve&amp;quot;)workflow.add_edge(&amp;quot;retrieve&amp;quot;, &amp;quot;generate&amp;quot;)workflow.add_conditional_edges(&amp;quot;generate&amp;quot;, check_node, {    &amp;quot;done&amp;quot;: END,    &amp;quot;retrieve&amp;quot;: &amp;quot;retrieve&amp;quot;  # Loop back if needed})app = workflow.compile()# Run itresult = app.invoke({&amp;quot;query&amp;quot;: &amp;quot;What is the capital of Punjab?&amp;quot;, 
&amp;quot;messages&amp;quot;: []})print(result[&amp;quot;answer&amp;quot;])&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;8.2 CrewAI: Role-Based Multi-Agent Teams&lt;/h4&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;from crewai import Agent, Task, Crew, Process# Define specialized agentsresearcher = Agent(    role=&amp;quot;Senior Research Analyst&amp;quot;,    goal=&amp;quot;Find accurate, up-to-date information on the given topic&amp;quot;,    backstory=&amp;quot;Expert researcher with access to web search and databases&amp;quot;,    verbose=True,    allow_delegation=False,    tools=[web_search_tool, wikipedia_tool])writer = Agent(    role=&amp;quot;Technical Writer&amp;quot;,    goal=&amp;quot;Write clear, engaging blog posts from research findings&amp;quot;,    backstory=&amp;quot;Experienced tech blogger who makes complex topics accessible&amp;quot;,    verbose=True,    allow_delegation=False)# Define tasksresearch_task = Task(    description=&amp;quot;Research the latest developments in {topic}. &amp;quot;                &amp;quot;Find key facts, statistics, and expert opinions.&amp;quot;,    expected_output=&amp;quot;A structured research brief with citations&amp;quot;,    agent=researcher)writing_task = Task(    description=&amp;quot;Write a 1000-word blog post based on the research brief. 
&amp;quot;                &amp;quot;Make it engaging for a technical audience.&amp;quot;,    expected_output=&amp;quot;A complete, publication-ready blog post in Markdown&amp;quot;,    agent=writer,    context=[research_task]  # Uses output from research_task)# Assemble and run the crewcrew = Crew(    agents=[researcher, writer],    tasks=[research_task, writing_task],    process=Process.sequential,    verbose=2)result = crew.kickoff(inputs={&amp;quot;topic&amp;quot;: &amp;quot;MCP protocol for AI agents&amp;quot;})print(result)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;9. Prompt Engineering Playbook&lt;/h3&gt;&lt;p&gt;Prompt engineering is the art of communicating precisely with LLMs. Here are the techniques every practitioner needs:[22]&lt;/p&gt;&lt;h4&gt;9.1 Core Techniques&lt;/h4&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;# &amp;#x2500;&amp;#x2500;&amp;#x2500; Zero-Shot &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;# No examples. 
Rely on the model&amp;apos;s training.zero_shot = &amp;quot;Classify the sentiment of this review: &amp;apos;The app crashes constantly.&amp;apos;&amp;quot;# &amp;#x2500;&amp;#x2500;&amp;#x2500; Few-Shot &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;# Provide examples to guide the modelfew_shot = &amp;quot;&amp;quot;&amp;quot;Classify sentiment. 
Examples:Input: &amp;quot;Love this product!&amp;quot; &amp;#x2192; PositiveInput: &amp;quot;Terrible experience.&amp;quot; &amp;#x2192; NegativeInput: &amp;quot;It&amp;apos;s okay, nothing special.&amp;quot; &amp;#x2192; NeutralNow classify: &amp;quot;The battery life is surprisingly good.&amp;quot;&amp;quot;&amp;quot;&amp;quot;# &amp;#x2500;&amp;#x2500;&amp;#x2500; Chain-of-Thought (CoT) &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;# Force step-by-step reasoning for complex taskscot = &amp;quot;&amp;quot;&amp;quot;Solve this step by step:A store sells apples for Rs. 50 each. If Ali buys 12 apples with Rs. 
700, how much change does he get?Think through it step by step before giving the final answer.&amp;quot;&amp;quot;&amp;quot;# &amp;#x2500;&amp;#x2500;&amp;#x2500; ReAct Pattern &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;# Alternate reasoning and actionreact_system = &amp;quot;&amp;quot;&amp;quot;You are an agent. For each task:1. Thought: reason about what to do next2. Action: choose a tool [search | calculate | respond]3. Observation: note what the tool returned4. 
Repeat until you have the final answer.&amp;quot;&amp;quot;&amp;quot;# &amp;#x2500;&amp;#x2500;&amp;#x2500; Self-Consistency &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;# Sample multiple reasoning paths, pick the majority answerimport refrom collections import Counterdef self_consistent_answer(question: str, client, n_samples: int = 5) -&amp;gt; str:    answers = []    for _ in range(n_samples):        response = client.chat.completions.create(            model=&amp;quot;gpt-4o-mini&amp;quot;,            messages=[{                &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,                 &amp;quot;content&amp;quot;: f&amp;quot;{question}\nThink step by step.&amp;quot;            }],            temperature=0.7  # Some variability for diversity        )        # Extract final answer (last number or key phrase)        text = response.choices[0].message.content        answers.append(text.strip().split(&amp;quot;\n&amp;quot;)[-1])    # Return most common answer    return Counter(answers).most_common(1)[0][0]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;9.2 System Prompt Architecture&lt;/h4&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;SYSTEM_PROMPT_TEMPLATE = &amp;quot;&amp;quot;&amp;quot;## RoleYou are {role_name}, a {expertise_level} specialist 
in {domain}.## Objective  {primary_objective}## Constraints- Always cite sources when making factual claims- If uncertain, say &amp;quot;I&amp;apos;m not sure&amp;quot; rather than guessing  - Keep responses under {max_length} words unless asked for detail- Output format: {output_format}## ContextToday&amp;apos;s date: {date}User&amp;apos;s technical level: {user_level}## Examples{few_shot_examples}&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;10. Vector Databases: The Memory Layer&lt;/h3&gt;&lt;p&gt;Vector databases store embeddings &amp;#x2014; dense numerical representations of meaning &amp;#x2014; enabling &lt;strong&gt;semantic search&lt;/strong&gt; (search by meaning, not keywords).[23]&lt;/p&gt;&lt;h4&gt;10.1 How Embeddings Work&lt;/h4&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;from openai import OpenAIimport numpy as npclient = OpenAI()def get_embedding(text: str) -&amp;gt; list[float]:    return client.embeddings.create(        model=&amp;quot;text-embedding-3-small&amp;quot;,        input=text    ).data[0].embedding# Semantic similarity demosentences = [    &amp;quot;Karachi is the largest city in Pakistan&amp;quot;,    &amp;quot;The metropolitan area of Karachi has 16 million people&amp;quot;,    &amp;quot;I like to eat biryani&amp;quot;,    &amp;quot;Python is a programming language&amp;quot;]embeddings = [get_embedding(s) for s in sentences]def cosine_sim(a, b):    a, b = np.array(a), np.array(b)    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))query_emb = get_embedding(&amp;quot;What is the population of Karachi?&amp;quot;)for s, emb in zip(sentences, embeddings):    score = cosine_sim(query_emb, emb)    print(f&amp;quot;{score:.3f} | {s}&amp;quot;)# Output:# 0.812 | Karachi is the largest city in Pakistan  &amp;#x2190; high# 0.798 | The metropolitan area of Karachi...      
&amp;#x2190; high  # 0.312 | I like to eat biryani                    &amp;#x2190; low# 0.289 | Python is a programming language         &amp;#x2190; low&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h4&gt;10.2 Vector Database Comparison&lt;/h4&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; DB &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Best For &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Hosting &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Hybrid Search &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Notes &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Pinecone&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Managed, production scale &lt;/td&gt;&lt;td&gt; Cloud-only &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;td&gt; Easiest setup &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Weaviate&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; GraphQL + semantic queries &lt;/td&gt;&lt;td&gt; Self/Cloud &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;td&gt; MCP support in v3.0 &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Qdrant&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; High-performance, Rust core &lt;/td&gt;&lt;td&gt; Self/Cloud &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;td&gt; Best perf/&amp;#x24; &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Chroma&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Local dev &amp;amp; testing &lt;/td&gt;&lt;td&gt; Self-host &lt;/td&gt;&lt;td&gt; Limited &lt;/td&gt;&lt;td&gt; Dead-simple Python API &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Milvus&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Massive scale (billions) &lt;/td&gt;&lt;td&gt; Self/Cloud &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;td&gt; GPU-accelerated &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;pgvector&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Already using Postgres &lt;/td&gt;&lt;td&gt; Self-host &lt;/td&gt;&lt;td&gt; &amp;#x2705; &lt;/td&gt;&lt;td&gt; No new infra 
needed &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;# Quick start: Chroma (local, perfect for prototyping)
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=&amp;quot;YOUR_KEY&amp;quot;,
    model_name=&amp;quot;text-embedding-3-small&amp;quot;
)
collection = client.create_collection(
    name=&amp;quot;knowledge_base&amp;quot;,
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        &amp;quot;MCP is an open standard by Anthropic for AI tool integration&amp;quot;,
        &amp;quot;RAG stands for Retrieval-Augmented Generation&amp;quot;,
        &amp;quot;LangGraph is a framework for building stateful agent workflows&amp;quot;
    ],
    ids=[&amp;quot;doc1&amp;quot;, &amp;quot;doc2&amp;quot;, &amp;quot;doc3&amp;quot;]
)

# Query
results = collection.query(
    query_texts=[&amp;quot;How do AI agents connect to external tools?&amp;quot;],
    n_results=2
)
print(results[&amp;quot;documents&amp;quot;])
# [[&amp;apos;MCP is an open standard by Anthropic...&amp;apos;,
#   &amp;apos;LangGraph is a framework...&amp;apos;]]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;h3&gt;11. 
The Complete AI Dictionary&lt;/h3&gt;&lt;p&gt;A reference for the terms you&amp;apos;ll encounter most often, from beginner to deep-technical.&lt;/p&gt;&lt;hr&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f524; Foundational Concepts&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;Artificial Intelligence (AI)&lt;/strong&gt; The broad field of building systems that perform tasks that typically require human intelligence: reasoning, learning, perception, and language understanding.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Machine Learning (ML)&lt;/strong&gt; A subfield of AI where systems learn from data rather than being explicitly programmed. The model improves with experience.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Deep Learning (DL)&lt;/strong&gt; ML using neural networks with many layers (&amp;quot;deep&amp;quot;). Powers modern LLMs, image models, and speech systems.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Neural Network&lt;/strong&gt; A computational model loosely inspired by biological neurons. Consists of layers of mathematical functions that transform inputs into outputs.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Parameters / Weights&lt;/strong&gt; The learnable numerical values in a neural network. GPT-3 has 175 billion parameters. More parameters do not always make a better model, but capability generally grows with parameter count.[3]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt; The process of adjusting a model&amp;apos;s parameters on a large dataset to minimize prediction error. For LLMs this requires massive compute (GPU clusters).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt; Running a trained model to generate outputs. This is what happens when you type a prompt into ChatGPT.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Token&lt;/strong&gt; The basic unit of text for LLMs. One token averages roughly 0.75 English words; for example, &amp;quot;Hello, world!&amp;quot; is 4 tokens in OpenAI&amp;apos;s GPT tokenizers. 
LLMs process and generate text as token sequences.[16]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Context Window&lt;/strong&gt; The maximum number of tokens an LLM can &amp;quot;see&amp;quot; at once. GPT-4 Turbo supports 128K tokens; Llama 4 Scout advertises up to 10 million tokens. A larger window fits more context but costs more to run.[4]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Embedding&lt;/strong&gt; A dense numerical vector (array of floats) representing the semantic meaning of text, images, or other data. Similar meanings cluster together in embedding space.[23]&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f9e0; LLM Architecture Terms&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;Transformer&lt;/strong&gt; The neural network architecture underlying all major LLMs, introduced in 2017. Key innovation: attention lets the whole sequence be processed in parallel during training, instead of one step at a time as in recurrent networks.[7]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Attention Mechanism&lt;/strong&gt; The core innovation of transformers. Lets each token draw on other tokens in the context regardless of distance (in decoder-only LLMs, only the tokens that come before it). It compares query (Q) vectors with key (K) vectors to decide how much of each value (V) vector to mix in.[10]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Self-Attention&lt;/strong&gt; Attention where the query, key, and value all come from the same sequence. Enables a model to understand words in the context of each other.[10]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Multi-Head Attention&lt;/strong&gt; Running multiple attention operations in parallel, each learning different relationships. 
GPT-3 uses 96 attention heads per layer.[7]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Positional Encoding&lt;/strong&gt; A mechanism to inject token position information into embeddings, since attention is position-agnostic by default.[11]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;RoPE (Rotary Position Embeddings)&lt;/strong&gt; A modern positional encoding scheme that encodes position through rotation matrices, enabling better generalization to contexts longer than those the model was trained on.[11]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;MoE (Mixture of Experts)&lt;/strong&gt; Architecture where only a subset of model parameters (&amp;quot;experts&amp;quot;) activate per token, enabling models to have far more total parameters at similar inference cost.[11]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Encoder / Decoder&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Encoder-only&lt;/strong&gt; (e.g., BERT): Builds rich representations; best for classification and named-entity recognition (NER)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Decoder-only&lt;/strong&gt; (e.g., GPT): Generates text autoregressively; best for generation tasks  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Encoder-Decoder&lt;/strong&gt; (e.g., T5): Good for translation and summarization  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Autoregressive Generation&lt;/strong&gt; How LLMs generate text: one token at a time, each new token conditioned on all previous tokens. This is why they cannot go back and &amp;quot;edit&amp;quot; earlier output: they always predict left to right.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Temperature&lt;/strong&gt; Controls randomness in generation. Temperature=0: always pick the most likely token (greedy, and in principle deterministic). Temperature=1: sample from the model&amp;apos;s unmodified probabilities. Temperature&amp;gt;1: more random/creative.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Top-P (Nucleus Sampling)&lt;/strong&gt; Restricts sampling to the smallest set of tokens whose cumulative probability reaches or exceeds P. 
Often more robust than temperature alone for controlling output quality, because the candidate set adapts to how confident the model is.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;KV Cache&lt;/strong&gt; Stores the key and value vectors already computed for previously processed tokens so they don&amp;apos;t need to be recomputed during autoregressive generation. Critical for inference efficiency.&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f3cb;&amp;#xfe0f; Training &amp;amp; Alignment&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;Pre-training&lt;/strong&gt; Initial training of an LLM on massive unlabeled text data (web, books, code). The model learns statistical patterns of language by predicting the next token. Requires enormous compute.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; Further training on a smaller, task-specific dataset to specialize a pre-trained model. Far cheaper than pre-training.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;SFT (Supervised Fine-Tuning)&lt;/strong&gt; Fine-tuning on human-curated input-output pairs. &amp;quot;Given this input, produce this output.&amp;quot;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;RLHF (Reinforcement Learning from Human Feedback)&lt;/strong&gt; A key technique behind turning GPT-3-family models into ChatGPT. Human evaluators rate or rank outputs; a reward model is trained on those judgments; the LLM is fine-tuned to maximize the reward. Improves alignment and reduces harmful outputs.[24]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;PEFT (Parameter-Efficient Fine-Tuning)&lt;/strong&gt; Fine-tuning techniques that update only a small fraction of parameters (e.g., LoRA, QLoRA), making fine-tuning feasible on consumer hardware.[1]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;LoRA (Low-Rank Adaptation)&lt;/strong&gt; Popular PEFT method that freezes the existing weight matrices and adds small trainable low-rank matrices alongside them. 
Often approaches full fine-tuning quality while training only a small fraction of the parameters, which cuts memory and compute sharply; exact figures depend on the task and the chosen rank.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;DPO (Direct Preference Optimization)&lt;/strong&gt; A simpler alternative to RLHF that directly optimizes the model on human preference data without a separate reward model.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Constitutional AI&lt;/strong&gt; Anthropic&amp;apos;s technique for alignment: the model critiques its own responses against a set of principles and revises them.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Hallucination&lt;/strong&gt; When an LLM generates confident, fluent, but factually wrong information. Contributing causes include noisy training data and the fact that generation optimizes for plausible next tokens rather than verified facts. Major active research area.[25]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Alignment&lt;/strong&gt; Ensuring AI systems behave according to human values and intentions. Includes safety, helpfulness, and harmlessness.&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f916; Agentic AI Terms&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;AI Agent&lt;/strong&gt; An AI system that autonomously perceives its environment, makes decisions, executes actions, and pursues goals over multiple steps, rather than just responding to a single prompt.[12]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Agentic AI&lt;/strong&gt; The paradigm where LLMs act as autonomous agents that plan, use tools, and complete multi-step tasks without constant human direction.[12]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Tool Use / Function Calling&lt;/strong&gt; The ability of an LLM to request calls to external functions, APIs, and services based on user requests; the surrounding application executes them and returns the results. Core capability enabling agents to &amp;quot;do things&amp;quot; rather than just &amp;quot;say things.&amp;quot;[26]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Orchestrator&lt;/strong&gt; The &amp;quot;brain&amp;quot; of a multi-agent system: the component that plans tasks, delegates to sub-agents, and synthesizes results. 
Often a more powerful LLM.[2]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Sub-agent / Worker Agent&lt;/strong&gt; Specialized agents that execute specific tasks delegated by an orchestrator. Examples: a web-search agent, a code-execution agent, a database agent.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Human-in-the-Loop (HITL)&lt;/strong&gt; A system design where a human can intervene, approve, or redirect an agent at key decision points. Critical for high-stakes workflows.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;ReAct (Reasoning + Acting)&lt;/strong&gt; A foundational prompting/architecture pattern for agents where the model interleaves reasoning (&amp;quot;Thought:&amp;quot;) and actions (&amp;quot;Action:&amp;quot;) in a loop.[13]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Planning&lt;/strong&gt; The agent&amp;apos;s ability to decompose a goal into an ordered sequence of sub-tasks. Types include: Plan-then-Execute, ReAct (interleaved), and Tree-of-Thought (branching).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Reflection&lt;/strong&gt; An agentic pattern where the agent reviews and critiques its own previous output to improve it &amp;#x2014; a form of self-correction.[14]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Memory (Agent Memory)&lt;/strong&gt; How agents retain information:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;In-context&lt;/strong&gt;: Within the current prompt window (ephemeral)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;External&lt;/strong&gt;: Stored in vector DBs or traditional DBs (persistent)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Episodic&lt;/strong&gt;: Records of past interactions  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Semantic&lt;/strong&gt;: General knowledge/facts  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Multi-Agent System (MAS)&lt;/strong&gt; A network of specialized agents collaborating to solve problems that exceed any single agent&amp;apos;s capability.[12]&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f4e1; Protocols &amp;amp; 
Infrastructure&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; Open standard by Anthropic (Nov 2024) for connecting AI agents to external tools and data sources through a standardized, AI-readable interface. The &amp;quot;USB-C for AI.&amp;quot;[17]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;A2A (Agent2Agent Protocol)&lt;/strong&gt; Open protocol by Google (April 2025) enabling AI agents from different vendors to discover, communicate, and collaborate with each other.[19]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Agent Card&lt;/strong&gt; A JSON document published by an A2A agent that describes its capabilities, endpoint, and authentication requirements &amp;#x2014; enabling other agents to discover it.[20]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; Grounding LLM outputs in relevant documents retrieved from a knowledge base at inference time, reducing hallucination and enabling access to private/current data.[15]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt; RAG enhanced with agent capabilities &amp;#x2014; the agent can iteratively retrieve, evaluate, and re-retrieve context before generating the final answer.[15]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Vector Database&lt;/strong&gt; A database optimized for storing and querying high-dimensional vector embeddings via similarity search (ANN algorithms).[23]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Semantic Search&lt;/strong&gt; Search that finds results by meaning rather than keyword matching, using embedding similarity.[27]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Hybrid Search&lt;/strong&gt; Combining vector similarity search (semantic) with keyword-based search (BM25) in a single query for better recall and precision.[27]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;HNSW (Hierarchical Navigable Small World)&lt;/strong&gt; A graph-based indexing algorithm used in vector databases for fast approximate nearest neighbor (ANN) 
search.[23]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Chunking&lt;/strong&gt; Breaking documents into smaller pieces before indexing in RAG systems. Semantic chunking (splitting at meaning boundaries) often outperforms fixed-size chunking.[16]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Reranking&lt;/strong&gt; A second-pass step in RAG pipelines where retrieved documents are rescored using a more accurate (but slower) cross-encoder model to improve precision.&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f4dd; Prompting &amp;amp; Generation&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;Prompt Engineering&lt;/strong&gt; The practice of designing input prompts to maximize LLM output quality. A rapidly evolving discipline with significant impact on model performance.[22]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Zero-Shot Prompting&lt;/strong&gt; Instructing a model to perform a task with no examples, relying purely on the model&amp;apos;s pre-trained knowledge.[22]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Few-Shot Prompting&lt;/strong&gt; Providing a small number of input-output examples in the prompt to guide the model&amp;apos;s behavior.[22]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Chain-of-Thought (CoT) Prompting&lt;/strong&gt; Instructing the model to show its reasoning step by step before giving a final answer. 
Often improves performance substantially on math, logic, and multi-step tasks.[22]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;System Prompt&lt;/strong&gt; A special prompt (usually hidden from the end user) that sets the model&amp;apos;s role, persona, constraints, and behavior for an entire conversation.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Temperature / Sampling&lt;/strong&gt; Parameters controlling the randomness and diversity of LLM outputs, such as temperature and top-p.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Prompt Injection&lt;/strong&gt; An attack where malicious content in the user input or the environment (e.g., a webpage an agent reads) attempts to override the agent&amp;apos;s instructions.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Jailbreaking&lt;/strong&gt; Attempts to bypass an LLM&amp;apos;s safety guardrails through cleverly crafted prompts.&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f4ca; Evaluation &amp;amp; Safety&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;strong&gt;Benchmark&lt;/strong&gt; A standardized test for measuring model capability. 
Examples: MMLU (knowledge), HumanEval (coding), MATH (mathematics), GPQA (PhD-level science).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Hallucination&lt;/strong&gt; LLM-generated content that is factually incorrect but stated with confidence.[25]&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Groundedness&lt;/strong&gt; The extent to which an LLM&amp;apos;s outputs are supported by provided context (e.g., retrieved documents in RAG).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Red-teaming&lt;/strong&gt; Adversarially probing an AI system to find safety vulnerabilities, jailbreaks, and failure modes before deployment.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;AI Safety&lt;/strong&gt; The field studying how to build AI systems that reliably do what humans intend and avoid unintended harmful behaviors.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt; Programmatic constraints applied to LLM inputs and outputs to enforce safety, content policies, and format requirements.&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;&amp;#x1f3e2; Model Families (2026 Landscape)&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th scope=&quot;col&quot;&gt; Family &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Company &lt;/th&gt;&lt;th scope=&quot;col&quot;&gt; Notable Models &lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;GPT&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; OpenAI &lt;/td&gt;&lt;td&gt; GPT-5, GPT-5.5, o3, o4-mini[6] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Claude&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Anthropic &lt;/td&gt;&lt;td&gt; Claude 3.5 Sonnet, Claude 4[28] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Gemini&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Google &lt;/td&gt;&lt;td&gt; Gemini 2.0 Flash, Gemini Ultra 2[28] &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Llama&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Meta &lt;/td&gt;&lt;td&gt; Llama 4 (10M context, multimodal)[4] 
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Mistral&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Mistral AI &lt;/td&gt;&lt;td&gt; Mistral Large, Codestral, Mixtral MoE &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Phi&lt;/strong&gt;&lt;/td&gt;&lt;td&gt; Microsoft &lt;/td&gt;&lt;td&gt; Phi-4 (small, surprisingly capable)[4] &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;h3&gt;12. Code Appendix: Build It Yourself&lt;/h3&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;12.1 A Complete RAG + Agent System&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;&amp;quot;&amp;quot;&amp;quot;Full-stack RAG + Agent system.Stack: OpenAI GPT-4o + ChromaDB + Function Calling&amp;quot;&amp;quot;&amp;quot;import osimport jsonimport chromadbfrom openai import OpenAIfrom chromadb.utils.embedding_functions import OpenAIEmbeddingFunctionclient = OpenAI(api_key=os.environ[&amp;quot;OPENAI_API_KEY&amp;quot;])# &amp;#x2500;&amp;#x2500;&amp;#x2500; 1. 
Setup Vector Store &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;ef = OpenAIEmbeddingFunction(    api_key=os.environ[&amp;quot;OPENAI_API_KEY&amp;quot;],    model_name=&amp;quot;text-embedding-3-small&amp;quot;)chroma = chromadb.Client()kb = chroma.get_or_create_collection(&amp;quot;knowledge&amp;quot;, embedding_function=ef)def ingest_documents(docs: list[dict]):    &amp;quot;&amp;quot;&amp;quot;docs: [{&amp;quot;id&amp;quot;: &amp;quot;...&amp;quot;, &amp;quot;text&amp;quot;: &amp;quot;...&amp;quot;, &amp;quot;metadata&amp;quot;: {...}}]&amp;quot;&amp;quot;&amp;quot;    kb.upsert(        ids=[d[&amp;quot;id&amp;quot;] for d in docs],        documents=[d[&amp;quot;text&amp;quot;] for d in docs],        metadatas=[d.get(&amp;quot;metadata&amp;quot;, {}) for d in docs]    )def search_knowledge_base(query: str, n_results: int = 5) -&amp;gt; str:    results = kb.query(query_texts=[query], n_results=n_results)    docs = results[&amp;quot;documents&amp;quot;][0]    return &amp;quot;\n\n---\n\n&amp;quot;.join(docs) if docs else &amp;quot;No relevant documents found.&amp;quot;# &amp;#x2500;&amp;#x2500;&amp;#x2500; 2. 
Define Agent Tools &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;TOOLS = [    {        &amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,        &amp;quot;function&amp;quot;: {            &amp;quot;name&amp;quot;: &amp;quot;search_knowledge_base&amp;quot;,            &amp;quot;description&amp;quot;: &amp;quot;Search internal knowledge base for relevant information&amp;quot;,            &amp;quot;parameters&amp;quot;: {                &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,                &amp;quot;properties&amp;quot;: {                    &amp;quot;query&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;},                    &amp;quot;n_results&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;integer&amp;quot;, &amp;quot;default&amp;quot;: 5}                },                &amp;quot;required&amp;quot;: [&amp;quot;query&amp;quot;]            }        }    },    {        &amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,        &amp;quot;function&amp;quot;: {            &amp;quot;name&amp;quot;: &amp;quot;calculate&amp;quot;,            &amp;quot;description&amp;quot;: &amp;quot;Evaluate a mathematical expression&amp;quot;,            &amp;quot;parameters&amp;quot;: {                &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,                &amp;quot;properties&amp;quot;: {                    &amp;quot;expression&amp;quot;: {                        &amp;quot;type&amp;quot;: 
&amp;quot;string&amp;quot;,                        &amp;quot;description&amp;quot;: &amp;quot;Python math expression, e.g. &amp;apos;2 ** 10&amp;apos;&amp;quot;                    }                },                &amp;quot;required&amp;quot;: [&amp;quot;expression&amp;quot;]            }        }    }]def execute_tool(name: str, args: dict) -&amp;gt; str:    if name == &amp;quot;search_knowledge_base&amp;quot;:        return search_knowledge_base(args[&amp;quot;query&amp;quot;], args.get(&amp;quot;n_results&amp;quot;, 5))    elif name == &amp;quot;calculate&amp;quot;:        try:            return str(eval(args[&amp;quot;expression&amp;quot;], {&amp;quot;__builtins__&amp;quot;: {}}, {}))        except Exception as e:            return f&amp;quot;Error: {e}&amp;quot;    return f&amp;quot;Unknown tool: {name}&amp;quot;# &amp;#x2500;&amp;#x2500;&amp;#x2500; 3. ReAct Agent Loop &amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;SYSTEM = &amp;quot;&amp;quot;&amp;quot;You are a helpful AI assistant with access to a knowledge base.Use the search_knowledge_base tool to look up relevant information before answering.Think through problems step by step. 
Always cite which documents informed your answer.&amp;quot;&amp;quot;&amp;quot;def chat(user_message: str, history: list = None) -&amp;gt; str:    if history is None:        history = []    messages = [{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: SYSTEM}]    messages.extend(history)    messages.append({&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: user_message})    while True:        response = client.chat.completions.create(            model=&amp;quot;gpt-4o&amp;quot;,            messages=messages,            tools=TOOLS,            tool_choice=&amp;quot;auto&amp;quot;        )        msg = response.choices[0].message        messages.append(msg)        # No tool calls = final answer        if not msg.tool_calls:            return msg.content        # Execute each tool call        for tc in msg.tool_calls:            args = json.loads(tc.function.arguments)            result = execute_tool(tc.function.name, args)            print(f&amp;quot;[Tool: {tc.function.name}] &amp;#x2192; {result[:100]}...&amp;quot;)            messages.append({                &amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,                &amp;quot;tool_call_id&amp;quot;: tc.id,                &amp;quot;content&amp;quot;: result            })# &amp;#x2500;&amp;#x2500;&amp;#x2500; Usage 
&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;&amp;#x2500;if __name__ == &amp;quot;__main__&amp;quot;:    # Ingest some documents    ingest_documents([        {&amp;quot;id&amp;quot;: &amp;quot;1&amp;quot;, &amp;quot;text&amp;quot;: &amp;quot;MCP (Model Context Protocol) was released by Anthropic in November 2024.&amp;quot;},        {&amp;quot;id&amp;quot;: &amp;quot;2&amp;quot;, &amp;quot;text&amp;quot;: &amp;quot;LangGraph is the #1 ranked AI agent framework for production stateful workflows in 2026.&amp;quot;},        {&amp;quot;id&amp;quot;: &amp;quot;3&amp;quot;, &amp;quot;text&amp;quot;: &amp;quot;GPT-5 was launched on August 7, 2025, achieving 94.6% on advanced math benchmarks.&amp;quot;},    ])    answer = chat(&amp;quot;What agent framework should I use for a production workflow, and when was GPT-5 released?&amp;quot;)    print(f&amp;quot;\nAnswer:\n{answer}&amp;quot;)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;12.2 Minimal MCP Client&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;&amp;quot;&amp;quot;&amp;quot;Consuming an MCP server from a Python 
client.
Requires: pip install mcp anthropic
&amp;quot;&amp;quot;&amp;quot;
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import anthropic


async def run_with_mcp(user_query: str):
    # Connect to an MCP server (e.g., filesystem, database)
    server_params = StdioServerParameters(
        command=&amp;quot;python&amp;quot;,
        args=[&amp;quot;my_mcp_server.py&amp;quot;]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # List available tools
            tools_response = await session.list_tools()
            tools = [
                {
                    &amp;quot;name&amp;quot;: t.name,
                    &amp;quot;description&amp;quot;: t.description,
                    &amp;quot;input_schema&amp;quot;: t.inputSchema
                }
                for t in tools_response.tools
            ]
            print(f&amp;quot;Available tools: {[t[&amp;apos;name&amp;apos;] for t in tools]}&amp;quot;)
            # Use Claude with MCP tools
            anthropic_client = anthropic.Anthropic()
            messages = [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: user_query}]
            while True:
                response = anthropic_client.messages.create(
                    model=&amp;quot;claude-3-5-sonnet-20241022&amp;quot;,
                    max_tokens=4096,
                    tools=tools,
                    messages=messages
                )
                # Any stop reason other than tool_use (end_turn, max_tokens, ...)
                # ends the loop, so it cannot spin forever
                if response.stop_reason != &amp;quot;tool_use&amp;quot;:
                    # Join the text blocks into the final answer
                    return &amp;quot;&amp;quot;.join(
                        block.text for block in response.content
                        if block.type == &amp;quot;text&amp;quot;
                    )
                # Record the assistant turn once, then answer every tool call in it
                messages.append({
                    &amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
                    &amp;quot;content&amp;quot;: response.content
                })
                tool_results = []
                for block in response.content:
                    if block.type == &amp;quot;tool_use&amp;quot;:
                        # Execute tool via MCP
                        result = await session.call_tool(block.name, block.input)
                        tool_results.append({
                            &amp;quot;type&amp;quot;: &amp;quot;tool_result&amp;quot;,
                            &amp;quot;tool_use_id&amp;quot;: block.id,
                            &amp;quot;content&amp;quot;: str(result.content)
                        })
                # All tool results go back in a single user message
                messages.append({
                    &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
                    &amp;quot;content&amp;quot;: tool_results
                })


asyncio.run(run_with_mcp(&amp;quot;List all files in the current directory and summarize their contents&amp;quot;))&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;h4&gt;12.3 A2A Agent Discovery &amp;amp; Collaboration&lt;/h4&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;python&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;&amp;quot;&amp;quot;&amp;quot;Simplified A2A protocol implementation.
Real A2A uses JSON-RPC 2.0 over HTTP with SSE for streaming.
&amp;quot;&amp;quot;&amp;quot;
import httpx
import asyncio
from dataclasses import dataclass


@dataclass
class AgentCard:
    name: str
    endpoint: str
    skills: list[dict]
    version: str = &amp;quot;1.0&amp;quot;


class A2AClient:
    &amp;quot;&amp;quot;&amp;quot;Discovers and calls remote A2A agents.&amp;quot;&amp;quot;&amp;quot;

    async def discover_agent(self, agent_url: str) -&amp;gt; AgentCard:
        &amp;quot;&amp;quot;&amp;quot;Fetch agent card from a known URL.&amp;quot;&amp;quot;&amp;quot;
        async with httpx.AsyncClient() as client:
            response = await client.get(f&amp;quot;{agent_url}/.well-known/agent.json&amp;quot;)
            response.raise_for_status()  # fail loudly on HTTP errors
            data = response.json()
            return AgentCard(
                name=data[&amp;quot;name&amp;quot;],
                endpoint=data[&amp;quot;endpoint&amp;quot;],
                skills=data[&amp;quot;skills&amp;quot;],
                version=data.get(&amp;quot;version&amp;quot;, &amp;quot;1.0&amp;quot;)
            )

    async def send_task(self, agent: AgentCard, skill_id: str,
                        message: str) -&amp;gt; str:
        &amp;quot;&amp;quot;&amp;quot;Send a task to a remote agent and get the result.&amp;quot;&amp;quot;&amp;quot;
        payload = {
            &amp;quot;jsonrpc&amp;quot;: &amp;quot;2.0&amp;quot;,
            &amp;quot;method&amp;quot;: &amp;quot;tasks/send&amp;quot;,
            &amp;quot;id&amp;quot;: &amp;quot;req-1&amp;quot;,
            &amp;quot;params&amp;quot;: {
                &amp;quot;skill&amp;quot;: skill_id,
                &amp;quot;message&amp;quot;: {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;parts&amp;quot;: [{&amp;quot;text&amp;quot;: message}]}
            }
        }
        async with httpx.AsyncClient() as client:
            response = await client.post(
                agent.endpoint,
                json=payload,
                headers={&amp;quot;Authorization&amp;quot;: &amp;quot;Bearer my-token&amp;quot;}
            )
            response.raise_for_status()  # fail loudly on HTTP errors
            result = response.json()
            return result[&amp;quot;result&amp;quot;][&amp;quot;output&amp;quot;][&amp;quot;message&amp;quot;][&amp;quot;parts&amp;quot;][0][&amp;quot;text&amp;quot;]


class OrchestratorAgent:
    &amp;quot;&amp;quot;&amp;quot;
    An orchestrator that discovers and delegates to specialist agents.
    This is the heart of a multi-agent A2A system.
    &amp;quot;&amp;quot;&amp;quot;

    def __init__(self):
        self.a2a = A2AClient()
        self.registry: dict[str, AgentCard] = {}

    async def register_agent(self, url: str):
        card = await self.a2a.discover_agent(url)
        self.registry[card.name] = card
        print(f&amp;quot;Registered agent: {card.name} with skills: {[s[&amp;apos;id&amp;apos;] for s in card.skills]}&amp;quot;)

    async def handle_request(self, user_request: str) -&amp;gt; str:
        # In a real system, an LLM would decide which agent to use
        # Here we do simple keyword routing
        text = user_request.lower()
        if &amp;quot;payment&amp;quot; in text:
            agent, skill = self.registry.get(&amp;quot;payment-agent&amp;quot;), &amp;quot;process_payment&amp;quot;
        elif &amp;quot;weather&amp;quot; in text:
            agent, skill = self.registry.get(&amp;quot;weather-agent&amp;quot;), &amp;quot;get_forecast&amp;quot;
        else:
            agent, skill = None, &amp;quot;&amp;quot;
        # Also covers a keyword match whose agent was never registered
        if agent is None:
            return &amp;quot;I don&amp;apos;t have a specialist agent for this request.&amp;quot;
        return await self.a2a.send_task(agent, skill, user_request)


async def main():
    orchestrator = OrchestratorAgent()
    # Discover available agents (in production: from a registry service)
    await orchestrator.register_agent(&amp;quot;https://payments.example.com&amp;quot;)
    await orchestrator.register_agent(&amp;quot;https://weather.example.com&amp;quot;)
    result = await orchestrator.handle_request(
        &amp;quot;Process a &amp;#x24;50 payment for order #1234&amp;quot;
    )
    print(f&amp;quot;Result: {result}&amp;quot;)


asyncio.run(main())&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;p&gt;&lt;em&gt;Built with care for the Rawalpindi/Islamabad dev community &amp;#x2014; and every hacker reading this on a Thursday afternoon. Go build something.&lt;/em&gt;&lt;/p&gt;&lt;hr&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;</content>
    </item>
    <item>
      <title>API pagination: cursor vs offset</title>
      <link>https://oreoro.github.io/posts/api-pagination-cursor-vs-offset/</link>
      <guid isPermaLink="true">https://oreoro.github.io/posts/api-pagination-cursor-vs-offset/</guid>
      <description>When to use cursor pagination and how to implement it safely.</description>
      <pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate>
      <lastUpdatedTimestamp>Thu Jun 04 2026 12:17:00 GMT+0000 (Coordinated Universal Time)</lastUpdatedTimestamp>
      <category>Guide</category>
      <category>Tools</category>
      <content>&lt;div&gt;
                    &lt;p&gt;
                        &lt;em&gt;Note:&lt;/em&gt; This RSS feed strips out SVGs and embeds. You might want to read the post on the webpage
                        &lt;a href=&quot;https://oreoro.github.io/posts/api-pagination-cursor-vs-offset/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.
                    &lt;/p&gt;
                    &lt;hr&gt;
                &lt;div&gt;&lt;p&gt;&lt;time&gt; June 4, 2026 &lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;h3&gt;Cursor pagination (recommended)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt; Stable under inserts/deletes  &lt;/li&gt;&lt;li&gt; Uses an opaque cursor (e.g., last seen id + sort key)  &lt;/li&gt;&lt;li&gt; Easy to cache and resume  &lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;sql&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;SELECT *
FROM items
WHERE (created_at, id) &amp;lt; (:created_at, :id)
ORDER BY created_at DESC, id DESC
LIMIT 50;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;h3&gt;Offset pagination (avoid at scale)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt; Can skip/duplicate rows when data changes  &lt;/li&gt;&lt;li&gt; Gets slower as offset grows  &lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div&gt; If you need &amp;#x201c;page numbers&amp;#x201d;, store cursors per page server-side.  &lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;</content>
    </item>
    <item>
      <title>TLS in plain English</title>
      <link>https://oreoro.github.io/posts/tls-in-plain-english/</link>
      <guid isPermaLink="true">https://oreoro.github.io/posts/tls-in-plain-english/</guid>
      <description>What happens during a TLS handshake, without the math.</description>
      <pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate>
      <lastUpdatedTimestamp>Thu Jun 04 2026 12:17:00 GMT+0000 (Coordinated Universal Time)</lastUpdatedTimestamp>
      <category>Information</category>
      <content>&lt;div&gt;
                    &lt;p&gt;
                        &lt;em&gt;Note:&lt;/em&gt; This RSS feed strips out SVGs and embeds. You might want to read the post on the webpage
                        &lt;a href=&quot;https://oreoro.github.io/posts/tls-in-plain-english/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.
                    &lt;/p&gt;
                    &lt;hr&gt;
                &lt;div&gt;&lt;p&gt;&lt;time&gt; June 4, 2026 &lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;h3&gt;The handshake&lt;/h3&gt;&lt;ul&gt;&lt;li&gt; Client says: &amp;#x201c;Here are the cipher suites I support&amp;#x201d;  &lt;/li&gt;&lt;li&gt; Server picks a cipher suite and replies with its certificate (which carries its public key)  &lt;/li&gt;&lt;li&gt; Client verifies the certificate chain  &lt;/li&gt;&lt;li&gt; They agree on session keys (usually via ECDHE)  &lt;/li&gt;&lt;li&gt; After that: traffic is encrypted + authenticated  &lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;What you get&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Confidentiality&lt;/strong&gt; (encryption)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Integrity&lt;/strong&gt; (tamper detection)  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Authenticity&lt;/strong&gt; (you&amp;#x2019;re talking to the right server)  &lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Common gotchas&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt; Expired certs  &lt;/li&gt;&lt;li&gt; Wrong hostname (SAN mismatch)  &lt;/li&gt;&lt;li&gt; Missing intermediate certs  &lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;</content>
    </item>
    <item>
      <title>Docker layers: 6 rules for smaller images</title>
      <link>https://oreoro.github.io/posts/docker-layers-6-rules-for-smaller-images/</link>
      <guid isPermaLink="true">https://oreoro.github.io/posts/docker-layers-6-rules-for-smaller-images/</guid>
      <description>A tiny checklist to cut build time and image size.</description>
      <pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate>
      <lastUpdatedTimestamp>Thu Jun 04 2026 12:17:00 GMT+0000 (Coordinated Universal Time)</lastUpdatedTimestamp>
      <category>Tools</category>
      <content>&lt;div&gt;
                    &lt;p&gt;
                        &lt;em&gt;Note:&lt;/em&gt; This RSS feed strips out SVGs and embeds. You might want to read the post on the webpage
                        &lt;a href=&quot;https://oreoro.github.io/posts/docker-layers-6-rules-for-smaller-images/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.
                    &lt;/p&gt;
                    &lt;hr&gt;
                &lt;div&gt;&lt;p&gt;&lt;time&gt; June 4, 2026 &lt;/time&gt;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;h3&gt;6 rules&lt;/h3&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt; Use a small base (alpine/distroless when possible)  &lt;/li&gt;&lt;li&gt; Copy only what you need (use &lt;code&gt;.dockerignore&lt;/code&gt;)  &lt;/li&gt;&lt;li&gt; Install deps before copying app source  &lt;/li&gt;&lt;li&gt; Combine commands to reduce layers  &lt;/li&gt;&lt;li&gt; Use multi-stage builds  &lt;/li&gt;&lt;li&gt; Pin versions to avoid surprise rebuilds  &lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;pre data-language=&quot;docker&quot;&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;FROM node:24-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM gcr.io/distroless/nodejs24-debian12
COPY --from=build /app/dist /app
CMD [&amp;quot;/app/index.js&amp;quot;]&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;hr&gt;&lt;/div&gt;</content>
    </item>
  </channel>
</rss>
