{"id":477241,"date":"2026-05-10T04:00:17","date_gmt":"2026-05-10T04:00:17","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/477241\/"},"modified":"2026-05-10T04:00:17","modified_gmt":"2026-05-10T04:00:17","slug":"the-must-know-topics-for-an-llm-engineer","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/477241\/","title":{"rendered":"The Must-Know Topics for an LLM\u00a0Engineer"},"content":{"rendered":"<p class=\"wp-block-paragraph\"> (LLMs) have quickly become the foundation of modern AI systems\u200a\u2014\u200afrom chatbots and copilots to search, coding, and automation. But for engineers transitioning into this space, the learning curve can feel steep and fragmented. Concepts like tokenization, attention, fine-tuning, and evaluation are often explained in isolation, making it hard to form a coherent mental model of how everything fits together.<\/p>\n<p class=\"wp-block-paragraph\">I ran into this firsthand when moving from computer vision to LLMs. In a short span of time, I had to understand not just the theory behind transformers, but also the practical realities: training trade-offs, inference bottlenecks, alignment challenges, and evaluation pitfalls.<\/p>\n<p class=\"wp-block-paragraph\">This article is designed to bridge that gap.<\/p>\n<p class=\"wp-block-paragraph\">Rather than diving deep into a single component, it provides a\u00a0<strong>structured map of the LLM engineering landscape<\/strong>\u200a\u2014\u200acovering the key building blocks you need to understand to design, train, and deploy real-world LLM systems.<\/p>\n<p class=\"wp-block-paragraph\">We\u2019ll move from the fundamentals of how text is represented, through model architectures and training strategies, all the way to inference optimization, evaluation, and system-level considerations and practical consideration like prompt engineering and reducing hallucinations.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2026\/05\/Gemini_Generated_Image_f0e12gf0e12gf0e1-1024x318.png\" alt=\"\" class=\"wp-image-658600\"\/>Image by the Author.<\/p>\n<p class=\"wp-block-paragraph\">By the end, you should have a\u00a0<strong>clear mental framework<\/strong>\u00a0for how modern LLM systems are built\u200a\u2014\u200aand where each concept fits in practice.<\/p>\n<p>Converting letters to\u00a0numbers<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.europesays.com\/ie\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-22.25.32-1024x530.png\" alt=\"\" class=\"wp-image-658602\"\/>Stages transforming text into the vectors that are fed into the LLMs. Image by the Author.<\/p>\n<p>Tokenisation<\/p>\n<p class=\"wp-block-paragraph\">When feeding data to a model, we can\u2019t just feed it letters or words directly\u200a\u2014\u200awe need a way to convert text into numbers. Intuitively, we might think of assigning each word in the language a unique number and feeding those numbers to the model. However, there are hundreds of thousands of words in the English language, and training on such a vast vocabulary would be infeasible in terms of memory and efficiency.<\/p>\n<p class=\"wp-block-paragraph\">So what can be done instead? Well, we could try encoding letters, since there are only 26 in the English alphabet. 
### Embeddings

After we have tokenized the data and assigned token IDs, we need to attach **semantic meaning** to these IDs. This is achieved through **text embeddings**: mappings from discrete token IDs into continuous vector spaces. In this space, words or tokens with similar meanings are placed close together, and even algebraic operations can capture semantic relationships (for example: embedding(queen) − embedding(woman) + embedding(man) ≈ embedding(king)).

Generally, **embedding layers** are trained to take token IDs as input and produce dense vectors as output. These vectors are optimized jointly with the model's training objective (e.g., next-token prediction). Over time, the model learns embeddings that encode both syntactic and semantic information about words, subwords, or tokens. Popular embedding models include word2vec, GloVe, and BERT.
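In frameworks like PyTorch, an embedding layer is just a trainable lookup table. A minimal sketch, with illustrative vocabulary size, dimension, and token IDs:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512              # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # trained jointly with the model

token_ids = torch.tensor([[17, 923, 4051]])    # IDs produced by the tokenizer
vectors = embedding(token_ids)                 # dense vectors, shape (1, 3, 512)
print(vectors.shape)
```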
### Positional encoding

Generally, LLMs are not inherently aware of the structure of language. Natural language has a **sequential nature** (word order matters), but at the same time, tokens that are far apart in a sentence may still be strongly related. To capture both local order and long-range dependencies, we inject **positional information about the tokens** into each embedding.

There are several common approaches to positional encoding:

- **Absolute positional encodings**: Fixed patterns, such as [sine and cosine](https://apxml.com/courses/foundations-transformers-architecture/chapter-4-positional-encoding-embedding-layer/practice-positional-encodings) functions at different frequencies, are added to token embeddings (see the sketch after this list). This is simple and effective but may struggle to represent very long sequences, since it does not explicitly model relative distances.
- **Relative positional encodings**: These represent the distance between tokens instead of their absolute positions. A popular method is **RoPE (Rotary Positional Embeddings)**, which encodes position as vector rotations. This approach scales better to long sequences and captures relationships between distant tokens more naturally.
- **Learned positional encodings**: Instead of relying on fixed mathematical functions, the model directly learns position embeddings during training. This allows flexibility but can be less generalizable to sequence lengths not seen in training.
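A minimal NumPy sketch of the absolute sine/cosine encodings from the original transformer paper; the sequence length and model dimension below are arbitrary choices:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Absolute sine/cosine encodings from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = positions / np.power(10_000, dims / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                        # even dimensions
    enc[:, 1::2] = np.cos(angles)                        # odd dimensions
    return enc  # added element-wise to the token embeddings

print(sinusoidal_positions(128, 512).shape)  # (128, 512)
```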
## Model Architecture

![Encoder-Decoder architecture](https://www.europesays.com/ie/wp-content/uploads/2026/05/Screenshot-2026-05-06-at-23.18.23-1024x854.png) *Encoder-Decoder architecture. Image by the Author.*

After the data is tokenized, embedded, and enriched with positional encodings, it is passed through the model. The current state-of-the-art architecture for processing textual data is the [**transformer**](https://jalammar.github.io/illustrated-transformer/) architecture, whose core is based on the **attention mechanism**. A transformer typically consists of a stack of transformer blocks:

- **Multi-Head Attention:** Enables the model to focus on different parts of the input sequence simultaneously, capturing diverse context. It calculates Queries (Q), Keys (K), and Values (V) to define word relationships.
- **Position-wise Feed-Forward Network (FFN):** A fully connected network applied to each position independently, adding non-linearity.
- **Residual Connections:** Shortcut connections that help gradients flow during training, preventing information loss.
- **Layer Normalization:** Normalizes the input to stabilize training.

### [Attention](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)

![Attention Mechanism](https://www.europesays.com/ie/wp-content/uploads/2026/05/Screenshot-2026-05-06-at-23.38.28-1024x511.png) *Attention Mechanism. Image by the Author.*

Introduced in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762), attention projects every token into three vectors: a query (what it's looking for), a key (what it offers), and a value (the actual information it carries). Attention works by comparing queries to keys (via similarity scores) to decide how much of each value to aggregate. This lets the model dynamically pull in relevant context based on content, not position.

**Multi-head attention** runs several attention mechanisms in parallel, each with its own learned projections. Think of each "head" as focusing on a different relationship (e.g., syntax, coreference, semantics). Combining them gives the model a richer, more nuanced understanding than a single attention pass.

There are several types of attention mechanism that vary based on purpose: self-attention, masked self-attention, and cross-attention.

- **Self-attention** operates within a single sequence, letting tokens attend to each other (e.g., understanding a sentence). **Masked self-attention** is similar, with the key difference that attention only sees past tokens, without observing future ones.
- **Cross-attention** connects two sequences, where one provides queries and the other provides keys/values (e.g., a decoder attending to an encoded input in translation). The key difference is whether context comes from the same source or an external one.

Standard attention compares every token with every other token, leading to **quadratic complexity** O(n²). As sequence length grows, computation and memory usage increase rapidly, making very long contexts expensive and slow. This is one of the main bottlenecks in scaling LLMs and an active field of research, for example through [being selective about which tokens attend to which tokens](https://arxiv.org/abs/2411.17116).
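Here is a compact NumPy sketch of scaled dot-product attention, with an optional causal mask to illustrate the difference between self-attention and masked self-attention. The single-head, unbatched setup is a simplification; real implementations run this across batches and heads:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; causal=True masks future tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query-key similarity, (n, n)
    if causal:
        # Replace scores above the diagonal so tokens cannot see the future.
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ V               # weighted sum of values

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V, causal=True).shape)  # (4, 8)
```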
### Architecture types

Language modeling tasks are built using one of the following transformer architectures:

- **Encoder-only models**: Each token can attend to every other token in the sequence (bidirectional attention). These models are typically trained with masked language modeling (MLM), where some tokens in the input are hidden, and the task is to predict them. This setup is well-suited for classification and understanding tasks (e.g., **BERT**).
- **Decoder-only models**: Each token can attend only to the tokens that come before it in the sequence (causal or unidirectional attention). These models are trained with causal language modeling, i.e., predicting the next token given all previous ones. This setup is ideal for text generation (e.g., **GPT**).
- **Encoder–Decoder models**: The input sequence is first processed by the encoder, and the resulting representations are then fed into the decoder through cross-attention layers. The decoder generates an output sequence one token at a time, conditioned both on the encoder's representations and its own previous outputs. This setup is common for sequence-to-sequence tasks like machine translation (e.g., **T5**, **BART**).

### [Next token prediction and output decoding](https://huggingface.co/blog/mlabonne/decoding-strategies)

Models are trained to predict the **next token** by outputting a probability distribution over all possible tokens in the vocabulary. The model's raw output is a vector of [logits](https://en.wikipedia.org/wiki/Logit), which is passed through a softmax to produce the probability of each token in the vocabulary being next.

In the most straightforward approach, we could always choose the token with the highest probability (this is called greedy decoding). However, this strategy is often suboptimal, since the locally most likely token does not always lead to the globally most coherent or natural sentence.

To improve generation, we can sample from the probability distribution. This introduces diversity and allows the model to explore different continuations. Moreover, we can branch the generation process by considering multiple candidate tokens and expanding them in parallel.

Several popular **decoding strategies** used in practice are:

- **Beam search:** Instead of following a single greedy path, beam search keeps track of the top n candidate sequences (beams) at each step, expanding them in parallel and ultimately selecting the sequence with the highest overall probability.
- **Top-k sampling:** At each step, only the k most probable tokens are considered, and one is sampled according to their probabilities. This avoids sampling from the long tail of very unlikely tokens.
- **Top-p sampling (nucleus sampling):** Instead of fixing k, we select the smallest set of tokens whose cumulative probability is at least p (e.g., 0.9). Then we sample from this set, dynamically adjusting how many tokens are considered depending on the shape of the distribution.

To control how "flat" or "peaked" the probability distribution is, LLMs use a **temperature** parameter. A low temperature (<1) makes the model more deterministic, concentrating probability mass on the most likely tokens. A high temperature (>1) makes the distribution more uniform, increasing randomness and diversity in the generated output.
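A minimal sketch of temperature, top-k, and top-p filtering applied to a toy logit vector. The vocabulary size and values are made up; real decoders operate on tens of thousands of logits per step:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature, top-k, and top-p (nucleus) sampling over vocab logits."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                       # keep the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                       # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask
    probs /= probs.sum()                        # renormalize after filtering
    return np.random.default_rng().choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0, -3.0]            # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```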
## Training stages

![Training stages](https://www.europesays.com/ie/wp-content/uploads/2026/05/Gemini_Generated_Image_kv0c8mkv0c8mkv0c-1024x558.png) *Image generated with Gemini.*

LLM training typically has two stages: pre-training, where the model learns general language patterns such as grammar, syntax, and meaning from large-scale data, and fine-tuning, where it is adapted to perform specific tasks, such as following instructions or answering questions in a desired format, and later refined so that outputs align with human preferences and safety constraints.

This progression moves from capability (what the model can do) to alignment (what the model should do).

### Pre-training

Pre-training is the most computationally expensive stage of LLM training because the model must learn from extremely large and diverse datasets. This typically involves hundreds of billions to trillions of tokens drawn from sources such as web pages, books, articles, code, and conversations.

To guide decisions about model size, training time, and dataset scale, researchers use [**LLM scaling laws**](https://arxiv.org/pdf/2001.08361), which describe how these factors relate and help estimate the optimal setup for achieving strong performance (see the parametric form sketched at the end of this subsection).

**Data [pre-processing](https://aws.amazon.com/blogs/machine-learning/an-introduction-to-preparing-your-own-dataset-for-llm-training/)** is a crucial step because raw text can significantly degrade LLM performance if used directly. Training data comes from many sources, each with its own challenges that must be cleaned and filtered.

- Web pages often contain boilerplate content such as ads, navigation menus, headers, and footers, along with formatting noise from HTML, CSS, and JavaScript. They may also include duplicated pages, spam, low-quality text, or even harmful content.
- Books can introduce issues like metadata (publisher details, page numbers, footnotes), OCR errors from digitization, and repetitive or stylistically inconsistent passages. In addition, copyright restrictions require careful filtering and licensing compliance.
- Code datasets may include auto-generated files, duplicated repositories, excessive comments, or boilerplate code. Licensing constraints are also important, and low-quality or buggy code can negatively impact training if not removed.

To address these challenges, datasets are typically filtered by language and quality, and imbalances across sources are corrected through data augmentation or re-weighting.
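For reference, the parametric form popularized by the compute-optimal scaling work (Hoffmann et al., 2022) models pre-training loss as a function of parameter count $N$ and training tokens $D$. The constants are fit empirically per setup, so treat this as the shape of the law rather than exact values:

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

Here $E$ is the irreducible loss of the data itself, and the two remaining terms shrink as the model and the dataset grow.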
### Supervised fine-tuning

In supervised fine-tuning, we typically do not update all model parameters. Instead, most of the pretrained weights are kept frozen, and only a small number of additional parameters are trained. This is done either by adding lightweight adapter modules or by using parameter-efficient methods such as LoRA, while training on a small, filtered, and clean subset of data.

- [Low-Rank Adaptation (LoRA)](https://cameronrwolfe.substack.com/p/easily-train-a-specialized-llm-peft) is one of the most widely used approaches. Instead of updating the full weight matrix, LoRA learns two smaller low-rank matrices, A and B, whose product approximates the update to the original weights. The pretrained weights remain fixed, and only A and B are trained. This makes fine-tuning far more efficient in terms of memory and compute while still preserving performance; a minimal sketch appears at the end of this subsection. (See also: [practical LoRA training techniques and best practices](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms).)
- Beyond LoRA, other parameter-efficient methods include prefix tuning, where a small set of trainable "virtual tokens" is added to the input and optimized during training, and adapter layers, which are small trainable modules inserted between existing transformer blocks while the rest of the model remains frozen.

At a higher level, supervised fine-tuning itself is the stage where we teach the model how to behave on a specific task using high-quality labeled examples. This typically includes:

- **Dialogue data**: curated human–human or human–AI conversations that teach the model how to respond naturally in interactive settings.
- **Instruction data**: prompt–response pairs that train the model to follow instructions, answer questions, and perform reasoning or task-specific outputs.

Together, these techniques align a pretrained model with the behavior we actually want at inference time.
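A minimal PyTorch sketch of the LoRA idea described above, wrapping a frozen linear layer with a trainable low-rank update. The rank, scaling, and initialization follow common conventions but are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank            # common scaling convention

    def forward(self, x):
        # B is zero-initialized, so the layer starts identical to the base.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable values vs. ~262k frozen in the base layer
```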
### Reinforcement learning

After supervised fine-tuning teaches the model what to do, [reinforcement learning](https://huggingface.co/blog/NormalUhr/rlhf-pipeline) is used to refine how well it does it, especially in open-ended or subjective tasks like dialogue, reasoning, and safety.

Unlike supervised learning with fixed targets, RL introduces a feedback loop: model outputs are evaluated, scored, and improved over time. This makes RL a key tool for aligning models with human preferences. In practice, it helps encourage helpful, harmless, and honest behaviour; reduce toxic, biased, or unsafe outputs; and improve instruction-following and conversational quality.

Because alignment data is smaller but higher quality than pre-training data, RL acts as a fine-grained steering mechanism, not a source of new knowledge.

A common paradigm is [**Reinforcement Learning from Human Feedback (RLHF)**](https://huggingface.co/blog/rlhf), which typically involves three steps:

1. **Collect preference data:** As the gold standard, humans rank multiple model responses to the same prompt (e.g., which is more helpful or safe), producing relative preferences rather than absolute labels. In some cases, stronger models are used to generate preference data or critique weaker models, reducing reliance on expensive human labeling. In practice, combining human and automated feedback allows scaling while maintaining quality.
2. **Train a reward model (RM):** A separate model is trained to score responses according to human preferences. Given a prompt and a candidate response, the reward model assigns a scalar score representing how good the response is according to human judgment.
3. **Optimize the policy (the LLM):** The language model is then trained to maximize the reward signal, i.e., to generate outputs humans are more likely to prefer.

Optimizing the policy (LLM) is often tricky: RL might destroy learnt knowledge, or the model might collapse to predicting one plausible output that maximizes reward without diversity. Several algorithms are used to perform this optimization and address these issues:

- [**Proximal Policy Optimization (PPO)**](https://www.adaptive-ml.com/post/from-zero-to-ppo)**:** PPO updates the model while constraining how far it can move from the original policy in a single step, preventing instability or degradation of language quality. An excellent video explanation of PPO can be [found here](https://www.youtube.com/watch?v=8jtAzxUwDj0).
- **Direct Preference Optimization (DPO):** bypasses the need for an explicit reward model. It directly optimizes the model to prefer chosen responses over rejected ones using a classification-style objective, simplifying the pipeline and reducing training complexity (a minimal sketch of the loss follows at the end of this subsection).
- [**Group Relative Policy Optimization (GRPO)**](https://yugeten.github.io/posts/2025/01/ppogrpo/)**:** A variant that compares groups of outputs rather than pairs, improving stability and sample efficiency by leveraging richer comparative signals.
- **Kahneman-Tversky Optimization (KTO):** KTO incorporates asymmetric preferences (e.g., penalizing bad outputs more strongly than rewarding good ones), which can better reflect human judgment in safety-critical scenarios.

RL for language models can be broadly [categorized into online and offline](https://cameronrwolfe.substack.com/p/online-rl) based on how data is collected and used during training:

- **Offline RL (dominant today):** The model is trained on a fixed dataset of interactions. There is no further interaction with humans or the environment during optimization: once preference data is collected and the reward model is trained, policy optimization (e.g., PPO or DPO) is performed on this static dataset.
- **Online RL:** The model continuously interacts with the environment (e.g., users or human annotators), generating new outputs and receiving fresh feedback that is incorporated into training. This creates a dynamic feedback loop where the model can explore and improve iteratively.

**Reasoning-aware RL (e.g., RL through Chain-of-Thought):** RL can also be applied to improve reasoning. Instead of only rewarding final answers, the model can be rewarded for producing high-quality intermediate reasoning steps (chain-of-thought). This encourages more structured, interpretable, and reliable problem-solving behavior.
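To make the preference-optimization objective concrete, here is a minimal sketch of the DPO loss from the list above. The log-probabilities are toy numbers; in practice they are summed per-token log-probs of full responses under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of summed sequence log-probs (illustrative numbers only).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # backpropagated through the policy model only
```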
## Hallucination in LLMs

![Hallucination in LLMs](https://www.europesays.com/ie/wp-content/uploads/2026/05/Gemini_Generated_Image_axa8sfaxa8sfaxa8-1024x558.png) *Image generated with Gemini.*

Even LLMs trained on factually correct data have a tendency to produce non-factual completions, also known as hallucinations. This happens because LLMs are probabilistic models: they predict the next token conditioned on the training corpus and the tokens generated so far, and are not guaranteed to reproduce their training data exactly. There are, however, ways to minimise the effect of hallucinations in LLMs.

[**Retrieval Augmented Generation**](https://www.promptingguide.ai/research/rag) **(RAG):** Incorporate external knowledge sources at inference time so the model can retrieve relevant, factual information and ground its responses in verified data, reducing reliance on potentially outdated or incomplete internal knowledge. RAG can be fairly complex from an engineering perspective and typically consists of (see the sketch at the end of this section):

- **Chunking:** splitting documents into smaller, manageable pieces before indexing them for retrieval. Good chunking balances context and precision: chunks that are too large dilute relevance, while chunks that are too small lose important context.
- **Embedding:** converting chunks of text into dense vector representations that capture semantic meaning. In RAG, both queries and documents are embedded into the same vector space, allowing similarity search to retrieve relevant content even when exact keywords don't match.
- **Retrieval:** high-quality retrieval ensures that relevant, diverse, and non-redundant chunks are passed to the model, reducing hallucinations and improving factual accuracy. It depends on factors like embedding quality, chunking strategy, indexing method, and search parameters.
- **Reranking:** a second-stage filtering step that reorders retrieved chunks using a more precise (often more expensive) model. While initial retrieval is optimized for speed, rerankers focus on relevance, helping prioritize the most useful context for generation.

**Training to say "I don't know":** Explicitly teach the model to acknowledge uncertainty when it lacks sufficient information, discouraging it from generating plausible-sounding but incorrect statements.

**Exact matching and [post-evaluation](https://arxiv.org/abs/2402.01817):** Use strict matching or verification against trusted sources, or external model-based verifiers and critics, during completion or post-processing to ensure generated content aligns with factual references, particularly for sensitive or precise information.
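A minimal end-to-end sketch of the chunk, embed, retrieve loop. The hashed bag-of-words `embed` function is a stand-in for a real embedding model, and the document and query are toy examples:

```python
import numpy as np

def embed(text):
    """Stand-in embedding (hashed bag of words); real systems use a trained model."""
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)   # unit-normalize for cosine similarity

def chunk(document, size=40):
    """Split a document into fixed-size word windows."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

documents = ["RAG grounds model answers in retrieved text. " * 30]
chunks = [c for doc in documents for c in chunk(doc)]
index = np.stack([embed(c) for c in chunks])           # one vector per chunk

query = "how does retrieval grounding work?"
scores = index @ embed(query)                          # cosine similarity
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]
prompt = "Answer using only this context:\n" + "\n---\n".join(top_chunks)
```

A production pipeline would swap in a trained embedding model, an approximate-nearest-neighbour index, and a reranker on top of this skeleton.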
## Optimization

![Optimization](https://www.europesays.com/ie/wp-content/uploads/2026/05/Gemini_Generated_Image_jwiu32jwiu32jwiu-1024x558.png) *Image generated with Gemini.*

Training LLMs is a challenge in itself: it requires a huge number of GPUs, as we need to store the model, the gradients, and the optimizer's parameters. However, inference is also a challenge. Imagine having to serve millions of requests: user retention is higher if models can generate text fast and with high quality.

### Training optimization

Training large models is typically done using **stochastic gradient descent (SGD)** or one of its variants. Instead of updating model parameters after every single example, we compute gradients on **batches of data**, which makes training more stable and efficient. In general, the larger the batch size, the more accurate the gradient estimate, though extremely large batches can also slow convergence or require tuning.

For very large models such as LLMs, a single GPU cannot store all the parameters or process large batches on its own. To address this, training is distributed across multiple GPUs or even across clusters of machines. This requires carefully deciding how to split the workload: by dividing the **data**, the **model parameters**, or the **computation pipeline**.

While [distributed training](https://medium.com/data-science/scientific-computing-lessons-learned-the-hard-way-db651f8f643a) has been studied extensively in deep learning, LLMs introduce unique challenges due to their enormous parameter counts and memory requirements. Several [strategies](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/app-notes/parallelism.html) have been developed to overcome these:

- **Data parallelism**: Each GPU holds a copy of the model but processes different batches of data, with gradients averaged across GPUs (a minimal sketch follows this list).
- **Model parallelism**: The model's parameters are split across multiple GPUs, so each GPU is responsible for a part of the model.
- **Pipeline parallelism**: Different layers of the model are assigned to different GPUs, and data flows through them like stages in a pipeline.
- **Tensor parallelism**: Individual tensor operations (e.g., large matrix multiplications) are themselves split across multiple GPUs.
- [**DeepSpeed**](https://www.deepspeed.ai/training/) / [**ZeRO**](https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/): A library and set of optimization techniques for training large models efficiently, including partitioning optimizer states, gradients, and parameters to reduce memory usage.
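A minimal data-parallel training step with PyTorch's `DistributedDataParallel`; this sketch assumes multiple GPUs and a launch via `torchrun`, and the model and loss are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> script.py
dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])

model = torch.nn.Linear(1024, 1024).to(rank)
model = DDP(model, device_ids=[rank])    # replicas kept in sync automatically

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device=rank)   # each rank sees a different batch
loss = model(x).square().mean()
loss.backward()                          # gradients are all-reduced across GPUs
opt.step()
dist.destroy_process_group()
```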
Across all of these strategies, there are two quantities we are trying to balance: reducing cross-GPU communication (e.g., for gradient exchange) while still fitting meaningful amounts of data on each GPU. Other techniques to reduce memory during training and gain some speedups include:

- **Gradient checkpointing:** A memory-saving training technique that stores only a subset of intermediate activations during the forward pass and recomputes the rest during backpropagation. This trades extra compute for significantly lower GPU memory usage, enabling training of larger models or longer sequences.
- **Mixed precision training:** Uses lower-precision formats (e.g., FP16 or BF16) for most computations while keeping critical values (like master weights or accumulations) in higher precision (FP32). This reduces memory usage and speeds up training, especially on modern GPUs with specialized hardware, with minimal impact on accuracy (a minimal step is sketched after this list).
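A minimal mixed-precision step with PyTorch's automatic mixed precision utilities; the sketch assumes a CUDA device, and the model and loss are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # rescales FP16 gradients to avoid underflow

x = torch.randn(32, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()    # forward pass runs mostly in FP16

scaler.scale(loss).backward()          # backward on the scaled loss
scaler.step(optimizer)                 # unscales gradients, then optimizer step
scaler.update()
optimizer.zero_grad()
```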
### Inference Optimization

- **Distillation:** Large models are often overparameterized, so we can train a smaller student model to mimic a larger teacher. Instead of learning only the correct outputs, the student matches the teacher's full probability distribution, including less likely tokens, capturing richer relationships. This yields near-teacher performance in a much smaller, faster model.
- [**Flash-attention**](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)**:** An optimized attention algorithm that computes exact attention while dramatically reducing memory usage. It avoids materializing the full attention matrix by tiling computations and fusing operations into a single GPU kernel, keeping data in fast on-chip memory. The result: significantly faster training and inference, especially for long sequences, and support for longer context lengths without changing the model.
- [**KV-caching**](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms)**:** During autoregressive generation, recomputing attention over past tokens is wasteful. KV-caching stores previously computed keys and values and reuses them for future tokens. This reduces generation complexity from quadratic to linear in sequence length, greatly speeding up long-form text generation.
- **Pruning:** Neural networks are often overparameterized, so pruning removes redundant weights. This can be **structured** (removing entire neurons, heads, or layers) or **unstructured** (removing individual weights). In practice, structured pruning is preferred because it aligns better with hardware, making the speedups actually realizable.
- [**Quantization**](https://huggingface.co/docs/optimum/en/concept_guides/quantization)**:** Reduces numerical precision (e.g., from 32-bit floats to 8-bit integers) to shrink models and speed up computation. It lowers memory usage and improves efficiency on specialized hardware. Applied either after training or during training, it may slightly impact accuracy, but careful calibration minimizes this. Effective quantization also requires controlling value ranges (e.g., small activation magnitudes) to avoid information loss (see the sketch at the end of this subsection).
- **Speculative decoding:** Speeds up generation using two models: a small, fast draft model and a larger, accurate target model. The draft proposes multiple tokens ahead, and the target verifies them in parallel, accepting matches and recomputing mismatches. This allows generating multiple tokens per step instead of one.
- [**Mixture of Experts**](https://huggingface.co/blog/moe) **(MoE):** Instead of activating all parameters for every token, MoE models use many [specialized "experts"](https://www.youtube.com/watch?v=sOPDGQjFcuM) and a gating mechanism to select only a few per input. This enables massive model capacity without proportional compute cost. Notable examples include Switch Transformer, GLaM, and Mixtral.

A more detailed blog from NVIDIA on [inference optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/) is certainly a great read if you would like to use some more advanced techniques.
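A minimal NumPy sketch of symmetric per-tensor int8 quantization, the simplest form of the idea in the list above; real schemes add per-channel scales, calibration data, and activation handling:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # small round-trip error, 4x smaller storage
```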
## Prompt engineering

![Prompt engineering](https://www.europesays.com/ie/wp-content/uploads/2026/05/Gemini_Generated_Image_bat2ztbat2ztbat2-1024x558.png) *Image generated with Gemini.*

Prompt engineering is a core part of working with LLMs because, in practice, the model's behavior is not just determined by its weights but by how it is conditioned at inference time. The same model can produce dramatically different results depending on how instructions, context, and constraints are written.

Prompt engineering is not one-shot design; it's iteration. Small changes in wording, ordering, or constraints can produce large behavior shifts. Treat prompts like code: test, measure, refine, and version-control them as part of your system.

### What makes a strong prompt

- **Be explicit about the task, not just the topic:** A weak prompt asks what you want ("Explain RAG"). A strong prompt specifies how you want it ("Explain RAG in 5 bullet points, focusing on failure modes, for a technical blog audience").
- **Separate instruction, context, and format:** Clear prompts distinguish between what the model should do, what information it should use, and how the output should look. For example: instructions ("summarize"), context (retrieved text), and format ("JSON with fields X, Y, Z"). A template sketch follows at the end of this section.
- **Use examples (few-shot prompting):** Providing 1–3 examples of desired input-output behavior significantly improves reliability for complex tasks. This is especially useful for classification or formatting.
- **Constrain output structure aggressively:** If you need machine-readable or consistent output, define strict formats (e.g., JSON, schemas).
- **Control context quality:** More context isn't always better. Irrelevant or noisy inputs degrade performance. Prioritize high-signal information, and in RAG systems, ensure retrieval is precise and filtered.

### Practical considerations

- **Track prompt changes like code.** Know who changed what, when, and why. This makes debugging and rollback possible.
- **Use templates where possible.** Break prompts into reusable components (instructions, context slots, formatting rules).
- [**Use routing systems.**](https://medium.com/data-science/llm-routing-the-heart-of-any-practical-ai-chatbot-application-892e88d4a80d) Adjust both the model selection and the prompt depending on the user request.
- **Have structured testing.** Run prompts against a fixed dataset and compare outputs using metrics or structured rubrics (correctness, completeness, style).
- **Keep a human in the loop.** For subjective qualities like clarity or reasoning, human reviewers are still the most reliable signal, especially for edge cases.
- **Maintain a test suite of critical examples, especially around safety.**
- Red-teaming, deliberately trying to break the defences you have built, is now an industry norm.
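A small illustration of the instruction/context/format separation recommended above, written as a reusable Python template; the field names and wording are hypothetical:

```python
# Reusable prompt template with explicit slots for each concern.
PROMPT_TEMPLATE = """### Instruction
{instruction}

### Context
{context}

### Output format
Return JSON with fields: {fields}"""

prompt = PROMPT_TEMPLATE.format(
    instruction="Summarize the incident report in 3 bullet points.",
    context="(retrieved text goes here)",
    fields='"summary", "severity", "owner"',
)
print(prompt)
```

Keeping the slots separate makes prompts easy to version, test against a fixed dataset, and reuse across tasks.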
## Evaluation

![Evaluation](https://www.europesays.com/ie/wp-content/uploads/2026/05/Gemini_Generated_Image_pj77q3pj77q3pj77-1024x558.png) *Image generated with Gemini.*

Large language models are used across a wide range of tasks, from structured question answering to open-ended generation, so no single metric can capture performance in every case. In practice, evaluation depends heavily on the problem you're solving. That said, [most approaches fall into a few clear categories](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation), spanning both traditional metrics and LLM-based evaluators.

Regardless of the metrics used, the most important part of evaluation is the reference anchor for what counts as good model performance: the evaluation dataset. It needs to be diverse, clean, grounded in reality, and representative of the target tasks for your model.

### Conventional

These metrics typically collect word-level statistics; they are simple to implement and quick, but have a limitation: they do not understand semantics. (Two of them are sketched in code after this list.)

- [**Levenshtein distance**](https://en.wikipedia.org/wiki/Levenshtein_distance): measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another.
- [**Perplexity**](https://medium.com/nlplanet/two-minutes-nlp-perplexity-explained-with-simple-probabilities-6cdc46884584): measures how well a language model predicts a sequence, with lower values indicating the model assigns higher probability to the observed text.
- [**BLEU**](https://aclanthology.org/P02-1040.pdf): evaluates machine-translated text by measuring n-gram overlap between a candidate translation and one or more reference translations, emphasizing precision.
- [**ROUGE**](https://aclanthology.org/W04-1013.pdf): evaluates text summarization (and generation) by measuring n-gram and sequence overlap between a generated text and reference texts, emphasizing recall.
- [**METEOR**](https://en.wikipedia.org/wiki/METEOR): evaluates generated text by aligning it with reference texts using exact, stemmed, and synonym matches, balancing precision and recall.
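Two of the conventional metrics above are easy to compute from scratch; a short sketch of Levenshtein distance and perplexity, with toy inputs:

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(levenshtein("kitten", "sitting"))   # 3
print(perplexity([math.log(0.5)] * 10))   # 2.0: as uncertain as a coin flip
```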
### LLM-based

- [**BERTScore**](https://arxiv.org/abs/1904.09675)**:** compares generated text to a reference using contextual embeddings from BERT. Instead of matching exact words, it measures semantic similarity in embedding space (how close the meanings are), making it strong at recognizing paraphrases and subtle wording differences. It is a good choice for summarization and translation tasks.
- [**GPTScore**](https://arxiv.org/pdf/2302.04166)**:** uses a large language model to evaluate outputs based on reasoning, scoring things like correctness, relevance, coherence, or even style, without relying on a reference. Its flexibility makes it ideal for subjective tasks without clear ground truth.
- [**SelfCheckGPT**](https://arxiv.org/abs/2303.08896)**:** prompts the same model to critique its own output, surfacing hallucinations, logical inconsistencies, or misleading claims. Useful in knowledge-heavy or reasoning tasks, where correctness matters but external verification may be expensive or slow.
- [**BLEURT**](https://arxiv.org/abs/2004.04696)**:** a BERT-based metric fine-tuned for evaluation. It compares text using learned semantic representations and outputs a single quality score reflecting fluency, meaning preservation, and paraphrasing.
- [**G-Eval**](https://www.confident-ai.com/blog/g-eval-the-definitive-guide)**:** in G-Eval you prompt the model with a rubric (e.g., judge factuality or clarity), and it returns a score or detailed feedback. This makes it especially useful for subjective tasks where traditional metrics fail, offering evaluations that feel closer to human judgment.
- [**Directed Acyclic Graph**](https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method) **(DAG):** this approach breaks evaluation into a sequence of smaller, rule-based checks. Each node is an LLM judge responsible for one criterion, and the flow between nodes defines how decisions are made. This structure reduces ambiguity and improves consistency, especially when the task can be checked step by step.

**LLM-based evaluation isn't foolproof**; it comes with its own quirks:

- **Bias:** Judge models may favor longer answers, certain writing styles, or outputs that resemble their training data.
- **Variance:** Because models are stochastic, small changes (like temperature) can lead to different scores for the same input.
- **Prompt sensitivity:** Even minor tweaks to your evaluation prompt or rubric can shift results significantly, making comparisons unreliable.

Treat LLM evaluation as a system that needs calibration: standardize prompts, test them rigorously, and watch for hidden biases.
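A sketch of an LLM-as-a-judge call that applies the calibration advice above: a fixed rubric, structured output, and averaging over several samples to dampen variance. `call_llm` is a placeholder for whatever client you use; the stub below returns canned scores so the snippet runs:

```python
import json

JUDGE_PROMPT = """You are grading a model answer.
Score each criterion from 1 to 5: accuracy, clarity, instruction_following.
Question: {question}
Answer: {answer}
Respond only with JSON: {{"accuracy": ..., "clarity": ..., "instruction_following": ...}}"""

def judge(question, answer, call_llm, n_samples=3):
    """Average several judge calls to dampen sampling variance."""
    scores = [json.loads(call_llm(JUDGE_PROMPT.format(question=question, answer=answer)))
              for _ in range(n_samples)]
    return {k: sum(s[k] for s in scores) / n_samples for k in scores[0]}

# Stand-in client so the example is self-contained.
call_llm = lambda prompt: '{"accuracy": 4, "clarity": 5, "instruction_following": 4}'
print(judge("Explain RAG.", "RAG retrieves documents before generating.", call_llm))
```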
Looking beyond traditional tasks, one class of metrics evaluates [**RAG pipelines**](https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more), splitting the process into retrieval and generation steps and relying on metrics specific to each, and another class covers [**summarization metrics**](https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task).

If you would like to go deeper on LLM model evaluation, I would recommend this [survey paper](https://arxiv.org/pdf/2307.03109) covering multiple methods.

### When to use LLM-as-a-judge vs traditional metrics?

Not every output can be neatly scored with rules. If you're evaluating things like summarization quality, tone, helpfulness, or how well instructions are followed, rigid metrics fall short. This is where LLM-as-a-judge shines: instead of checking for exact matches, you ask another model to grade responses against a rubric.

That said, don't throw out traditional metrics. When there's a clear ground truth, like factual accuracy or exact answers, they're fast, cheap, and consistent.

**The best setups combine both:** use traditional metrics for objective correctness, and LLM judges for subjective or open-ended quality.

### Evaluation loops in production

Strong evaluation doesn't rely on a single method; it's layered:

1. **Offline metrics:** Start with labeled datasets and automated scoring to quickly filter out weak model versions.
2. **Human evaluation:** Bring in annotators or experts to assess nuance: realism, usefulness, safety, and edge cases that metrics miss.
3. **Online A/B testing:** Finally, measure real-world impact: clicks, retention, satisfaction.

Once your system is live, evaluation doesn't stop; it evolves. User interactions should be continuously logged, sampled, and reviewed. These real-world examples reveal failure cases and shifts in usage patterns. The more data you log from the model, the more tools you have for diagnostics: model embeddings, responses, response times, and so on.

Even if your model itself remains unchanged, its behavior and performance can still shift over time. This phenomenon, known as behaviour drift, typically emerges gradually as external factors evolve, such as changes in user queries, the introduction of new slang, shifts in domain focus, or even small adjustments to prompts and templates. The challenge is that this degradation is often subtle and silent, making it easy to miss until it begins affecting user experience.
To catch drift early, pay close attention to both inputs and outputs.

- **Input:** Track changes in embedding distributions, query lengths, topic patterns, or the appearance of previously unseen tokens (a minimal check is sketched after this list).
- **Output:** Track shifts in tone, verbosity, refusal rates, or safety-related flags. Beyond these direct signals, it's also useful to monitor evaluation proxies over time, such as LLM-as-a-judge scores, user feedback (thumbs up or down), and task-specific heuristics over extended periods, taking into account seasonality in user behaviour and triggering alerts when statistical differences exceed defined thresholds.
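A minimal sketch of one such input-side check: comparing this week's query-length distribution against a frozen baseline with a two-sample Kolmogorov–Smirnov test. The data here is simulated, and the alert threshold is a tuning choice:

```python
import numpy as np
from scipy.stats import ks_2samp

# Simulated samples of user-query lengths (in tokens).
baseline = np.random.default_rng(0).poisson(25, size=5000)   # frozen reference
this_week = np.random.default_rng(1).poisson(31, size=5000)  # fresh traffic

stat, p_value = ks_2samp(baseline, this_week)
if p_value < 0.01:  # alert threshold is a tuning choice
    print(f"Input drift suspected (KS={stat:.3f}, p={p_value:.1e})")
```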
<p>LLM Criticism<\/p>\n<p class=\"wp-block-paragraph\">A common criticism of LLMs is that they behave like\u00a0<strong>\u201cinformation averages\u201d<\/strong>: instead of storing or retrieving discrete facts, they learn a smoothed statistical distribution over text. This means their outputs often reflect the\u00a0most likely blend\u00a0of many possible continuations rather than a grounded, single \u201ctrue\u201d statement. In practice, this can lead to overly generic answers or confident-sounding statements that are actually just high-probability linguistic patterns.<\/p>\n<p class=\"wp-block-paragraph\">At the core of this behavior is the\u00a0<strong>cross-entropy objective<\/strong>, which trains models to minimize the distance between predicted token probabilities and the observed next token in the data. While effective for learning fluent language, cross-entropy only rewards\u00a0likelihood matching, not truth, causality, or consistency across contexts. It does not distinguish between \u201cplausible wording\u201d and \u201ccorrect reasoning\u201d; it only checks whether the next token matches the training distribution.<\/p>\n<p class=\"wp-block-paragraph\">This limitation has practical consequences: optimizing for cross-entropy encourages\u00a0<strong>mode-averaging<\/strong>, where the model prefers safe, central predictions over sharp, verifiable ones. This is why LLMs can be excellent at fluent synthesis but fragile at tasks requiring precise symbolic reasoning, long-horizon consistency, or factual grounding without external systems like retrieval or verification.<\/p>
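<p class=\"wp-block-paragraph\">A toy computation makes the objective concrete. The loss below depends only on how much probability the model assigns to the token that actually followed; nothing in it rewards the continuation for being true or consistent. The logits are made up for illustration.<\/p>\n<pre class=\"wp-block-code\"><code>import numpy as np

# Next-token cross-entropy on a toy five-token vocabulary. The model is
# rewarded purely for likelihood matching: put mass on the observed token
# and the loss drops, regardless of whether that token is factually right.

logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])  # model scores for the next token
target_id = 0                                  # the token that actually followed

probs = np.exp(logits - logits.max())          # numerically stable softmax
probs = probs \/ probs.sum()
loss = -np.log(probs[target_id])               # cross-entropy at this position

print(round(float(loss), 4))                   # lower when probs[target_id] is higher<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Averaged over many plausible continuations in the training data, this objective is exactly what pushes the model toward the safe, central predictions described above.<\/p>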
<p>Summary<\/p>\n<p class=\"wp-block-paragraph\">Building and deploying large language models is not about mastering a single breakthrough idea, but about understanding how many interdependent systems come together to produce coherent intelligence. From tokenization and embeddings, through attention-based architectures, to training strategies like pre-training, fine-tuning, and reinforcement learning, each layer contributes a specific function in turning raw text into capable, controllable models.<\/p>\n<p class=\"wp-block-paragraph\">What makes LLM engineering both challenging and exciting is that performance is rarely determined by one component in isolation. Efficiency tricks like KV-caching, FlashAttention, and quantization matter just as much as high-level choices like model architecture or alignment strategy. Similarly, success in production depends not only on training quality, but also on inference optimization, evaluation rigor, prompt design, and continuous monitoring for drift and failure modes.<\/p>\n<p class=\"wp-block-paragraph\">Seen together, LLM systems are less like a single model and more like an evolving stack: data pipelines, training objectives, retrieval systems, decoding strategies, and feedback loops all working in concert. Engineers who develop a mental map of this stack are able to move beyond \u201cusing models\u201d and start designing systems that are reliable, scalable, and aligned with real-world constraints.<\/p>\n<p class=\"wp-block-paragraph\">As the field continues to evolve toward longer context windows, more efficient architectures, stronger reasoning abilities, and tighter human alignment, the core challenge remains the same: bridging statistical learning with practical intelligence. Mastering that bridge is what shapes the work of an LLM engineer.<\/p>\n<p>Notable models in chronological order<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">BERT<\/a>\u00a0(2018), GPT-1 (2018),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1907.11692\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">RoBERTa<\/a>\u00a0(2019),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1907.10529\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">SpanBERT<\/a>\u00a0(2019), GPT-2 (2019),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1910.10683\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">T5<\/a>\u00a0(2019), GPT-3 (2020),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2112.11446\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Gopher<\/a>\u00a0(2021),\u00a0<a href=\"https:\/\/uploads-ssl.webflow.com\/60fd4503684b466578c0d307\/61138924626a6981ee09caf6_jurassic_tech_paper.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Jurassic-1<\/a>\u00a0(2021),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2203.15556\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Chinchilla<\/a>\u00a0(2022),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2201.08239\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">LaMDA<\/a>\u00a0(2022),\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2302.13971\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">LLaMA<\/a>\u00a0(2023)<\/p>\n<p>Liked the author? Stay connected!<\/p>\n<p class=\"wp-block-paragraph\">If you liked this article, share it with a friend! To read more on machine learning and image processing topics,\u00a0<strong>press subscribe<\/strong>!<\/p>\n<p class=\"wp-block-paragraph\">Have I missed anything? Do not hesitate to leave a note, comment, or message me directly on\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/aliakseimikhailiuk\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">LinkedIn<\/a>\u00a0or\u00a0<a href=\"https:\/\/twitter.com\/mikhailiuka\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Twitter<\/a>!<\/p>