AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs
Every time you prompt an LLM, it doesn't generate an entire reply at once; it builds the response one word (or token) at a time. At each step, the model predicts a probability for every candidate next token based on everything written so far. But probabilities alone aren't enough: the model also needs a strategy to decide which token to actually pick next.
Different strategies can completely change how the final output looks: some make it more focused and precise, while others make it more creative or varied. In this article, we'll explore four popular text generation strategies used in LLMs: Greedy Search, Beam Search, Nucleus Sampling, and Temperature Sampling, explaining how each works.
Greedy Search
Greedy Search is the simplest decoding strategy: at each step, the model picks the token with the highest probability given the current context. While it's fast and easy to implement, it doesn't always produce the most coherent or meaningful sequence, much like making the best local choice without considering the overall outcome. Because it follows only one path in the probability tree, it can miss better sequences that require short-term trade-offs. As a result, greedy search often leads to repetitive, generic, or bland text, making it poorly suited for open-ended text generation tasks.
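Greedy decoding can be sketched in a few lines of Python. The probability table below is purely hypothetical (a stand-in for a real model's predicted next-token distributions), chosen to match the toy example discussed later:

```python
# Hypothetical conditional probabilities: context -> {next token: prob}.
# A real LLM would compute these with a forward pass; the numbers here
# are illustrative only.
PROBS = {
    ("The",): {"slow": 0.6, "fast": 0.4},
    ("The", "slow"): {"dog": 0.7, "cat": 0.3},
    ("The", "fast"): {"cat": 0.9, "dog": 0.1},
    ("The", "slow", "dog"): {"barks.": 0.40, "sleeps.": 0.35, "growls.": 0.25},
    ("The", "fast", "cat"): {"purrs.": 0.5, "runs.": 0.3, "hisses.": 0.2},
}

def greedy_decode(context, steps):
    tokens = list(context)
    for _ in range(steps):
        dist = PROBS[tuple(tokens)]
        # Always take the single highest-probability token at this step.
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

print(greedy_decode(("The",), 3))  # The slow dog barks.
```

Greedy commits to "slow" at the second step and never reconsiders, ending with a sequence probability of 0.6 × 0.7 × 0.4 = 0.168.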
Beam Search
Beam Search improves on greedy search by keeping track of multiple candidate sequences (called beams) at each generation step instead of just one. It expands the top K most probable sequences, allowing the model to explore several promising paths in the probability tree and potentially discover higher-quality completions that greedy search would miss. The parameter K (beam width) controls the trade-off between quality and computation: larger beams produce better text but are slower.
While beam search works well in structured tasks like machine translation, where accuracy matters more than creativity, it tends to produce repetitive, predictable, and less diverse text in open-ended generation. This happens because the algorithm favors high-probability continuations, leading to less variation and "neural text degeneration," where the model overuses certain words or phrases.

- Greedy Search (K=1) always takes the highest local probability:
- T2: Chooses "slow" (0.6) over "fast" (0.4).
- Resulting path: "The slow dog barks." (Final probability: 0.1680)
- Beam Search (K=2) keeps both the "slow" and "fast" paths alive:
- At T3, it finds that the path starting with "fast" leads toward a better-scoring ending.
- Resulting path: "The fast cat purrs." (Final probability: 0.1800)
Beam search successfully explores a path that had a slightly lower probability early on, leading to a better overall sentence score.
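The walkthrough above can be reproduced with a small beam search over a hypothetical probability table (the numbers mirror the example and are illustrative, not from a real model):

```python
# Hypothetical conditional probabilities matching the toy example.
PROBS = {
    ("The",): {"slow": 0.6, "fast": 0.4},
    ("The", "slow"): {"dog": 0.7, "cat": 0.3},
    ("The", "fast"): {"cat": 0.9, "dog": 0.1},
    ("The", "slow", "dog"): {"barks.": 0.40, "sleeps.": 0.35, "growls.": 0.25},
    ("The", "fast", "cat"): {"purrs.": 0.5, "runs.": 0.3, "hisses.": 0.2},
}

def beam_search(context, steps, k=2):
    beams = [(list(context), 1.0)]  # (tokens, cumulative probability)
    for _ in range(steps):
        # Expand every beam by every possible next token.
        candidates = [
            (tokens + [tok], score * p)
            for tokens, score in beams
            for tok, p in PROBS[tuple(tokens)].items()
        ]
        # Keep only the K highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]

tokens, prob = beam_search(("The",), 3, k=2)
print(" ".join(tokens), round(prob, 4))  # The fast cat purrs. 0.18
```

With K=2, the "fast" branch (probability 0.4 at T2) survives long enough for its strong continuation (0.9 × 0.5) to overtake greedy's path; greedy, with K=1, had already discarded it.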
Top-p Sampling (Nucleus Sampling)
Top-p Sampling (Nucleus Sampling) is a probabilistic decoding strategy that dynamically adjusts how many tokens are considered at each generation step. Instead of choosing from a fixed number of top tokens as in top-k sampling, top-p sampling selects the smallest set of tokens whose cumulative probability reaches a chosen threshold p (for example, 0.7). These tokens form the "nucleus," from which the next token is randomly sampled after their probabilities are renormalized.
This lets the model balance diversity and coherence: it samples from a broader range when many tokens have similar probabilities (a flat distribution) and narrows to the most likely tokens when the distribution is sharp (peaky). As a result, top-p sampling produces more natural, varied, and contextually appropriate text than fixed-size methods like greedy or beam search.
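A minimal sketch of one nucleus-sampling step, using a made-up next-token distribution (in a real model this would come from the softmax output):

```python
import random

def top_p_sample(dist, p=0.7, rng=random):
    # Sort tokens by descending probability.
    items = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for tok, prob in items:
        nucleus.append((tok, prob))
        mass += prob
        if mass >= p:  # smallest set whose cumulative mass reaches p
            break
    tokens, weights = zip(*nucleus)
    # Renormalize within the nucleus, then sample from it.
    return rng.choices(tokens, weights=[w / mass for w in weights], k=1)[0]

dist = {"cat": 0.50, "dog": 0.30, "bird": 0.15, "fish": 0.05}
# With p=0.7 the nucleus is {"cat", "dog"} (0.50 + 0.30 >= 0.7),
# so "bird" and "fish" can never be sampled here.
print(top_p_sample(dist, p=0.7))
```

Note how the nucleus size adapts: with a flatter distribution, more tokens would be needed to reach the threshold, widening the sampling pool.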

Temperature Sampling
Temperature Sampling controls the level of randomness in text generation by adjusting the temperature parameter (t) in the softmax function that converts logits into probabilities. A lower temperature (t < 1) makes the distribution sharper, increasing the chance of picking the most probable tokens and resulting in more focused but often repetitive text. At t = 1, the model samples directly from its natural probability distribution, known as pure or ancestral sampling.
Higher temperatures (t > 1) flatten the distribution, introducing more randomness and diversity at the cost of coherence. In practice, temperature sampling allows fine-tuning the balance between creativity and precision: low temperatures yield deterministic, predictable outputs, while higher ones generate more varied and imaginative text.
The optimal temperature often depends on the task: for instance, creative writing benefits from higher values, while technical or factual responses do better with lower ones.
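Temperature scaling is just a division of the logits before the softmax; the logits below are made up for illustration:

```python
import math

def temperature_probs(logits, t=1.0):
    # Divide logits by t before the softmax: t < 1 sharpens the
    # distribution, t > 1 flattens it toward uniform.
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical model outputs
for t in (0.5, 1.0, 2.0):
    print(t, [round(q, 3) for q in temperature_probs(logits, t)])
# At t=0.5 the top token's probability grows toward certainty;
# at t=2.0 the three probabilities move closer to uniform.
```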

The post AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs appeared first on MarkTechPost.
