March 22, 2026 · Eba

AI value generation

A transformer

A transformer reads text as a sequence of tokens and works out which earlier parts matter most for predicting the next part. The key trick is attention: the model can “look at” different words in the input and weigh how relevant they are to each other. The core ideas are:

  - Tokens: text split into small pieces
  - Embeddings: tokens turned into vectors
  - Attention / self-attention: each token checks which other tokens matter
  - Layers: this process repeats many times, building richer understanding
  - Next-token prediction: the model learns by predicting what comes next
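To make the attention idea concrete, here is a minimal sketch in plain Python. It assumes a toy setup that is not from any real model: three tokens with made-up 2-dimensional embeddings, a single attention head, and no learned query/key/value projections (a real transformer learns those). Each token scores itself and the earlier tokens, turns the scores into weights with a softmax, and takes a weighted average.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(embeddings):
    """Toy single-head attention where queries, keys, and values are the raw
    embeddings (assumption: a real model applies learned projections first)."""
    d = len(embeddings[0])
    outputs, weights = [], []
    for i, query in enumerate(embeddings):
        # Causal mask: token i may only look at tokens 0..i.
        visible = embeddings[: i + 1]
        scores = [dot(query, key) / math.sqrt(d) for key in visible]
        w = softmax(scores)
        # The output is the attention-weighted average of the visible vectors.
        out = [sum(w[j] * visible[j][k] for j in range(len(visible)))
               for k in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical embeddings for three tokens, chosen only for illustration.
tokens = ["the", "cat", "sat"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(embeddings)

for token, w in zip(tokens, weights):
    print(token, [round(x, 3) for x in w])
```

Each row of `weights` sums to 1, and the first token can only attend to itself. Stacking many such steps, with learned projections in between, is roughly what the "layers" idea refers to.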

Reasoning models are LLMs tuned to do more deliberate multi-step thinking before answering.

They tend to be better at math, logic, coding, and similar multi-step tasks.

In practice, “reasoning model” can involve a mix of:

  - training methods that reward stepwise problem solving
  - inference-time techniques that allocate more compute
  - tool use, verification, or self-checking
  - architectures or prompting patterns that improve multi-step accuracy
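One well-known way to spend more compute at inference time is self-consistency: sample several answers and keep the one that appears most often. The sketch below illustrates that one technique and is not a description of how any particular reasoning model works. The `fake_sampler` function is a hypothetical stand-in for a real model call and simply replays canned answers.

```python
from collections import Counter

def self_consistency(sample_fn, question, n_samples):
    """Sample n_samples answers and return the majority answer together with
    the fraction of samples that agreed with it."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical stand-in for a model call: replays a fixed list of answers.
canned = iter(["42", "41", "42", "43", "42"])

def fake_sampler(question):
    return next(canned)

answer, agreement = self_consistency(fake_sampler, "What is 6 * 7?", n_samples=5)
print(answer, agreement)
```

The agreement fraction doubles as a rough self-check: low agreement suggests the question may deserve more samples or a different approach.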

How does an LLM decide whether a problem needs extra internal computation?

  1. The system or product chooses the mode (router)
  2. The model learns patterns that correlate with hard problems
  3. It may generate internal “deliberation signals”. If early internal passes show uncertainty, conflict, or many constraints, it may continue spending compute.
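The first option, a router in front of the model, can be sketched with a simple heuristic. Everything here is an assumption for illustration: the hint phrases, the weights, the threshold, and the mode names `"thinking"` and `"fast"` are made up, and production routers are typically learned rather than hand-written.

```python
import re

# Hypothetical phrases that often show up in harder requests.
HARD_HINTS = ("prove", "derive", "debug", "step by step", "optimize")

def estimate_difficulty(prompt):
    """Return a rough difficulty score between 0 and 1 from surface features."""
    text = prompt.lower()
    score = 0.0
    if any(hint in text for hint in HARD_HINTS):
        score += 0.4  # explicit request for multi-step work
    if re.search(r"\d+\s*[-+*/^=]\s*\d+", text):
        score += 0.3  # looks like arithmetic or an equation
    if len(text.split()) > 40:
        score += 0.3  # long prompts tend to carry more constraints
    return score

def choose_mode(prompt, threshold=0.3):
    """Route to the slower thinking mode only when the score is high enough."""
    return "thinking" if estimate_difficulty(prompt) >= threshold else "fast"

print(choose_mode("Hi, how are you?"))
print(choose_mode("Prove that 3 + 4 = 7 step by step"))
```

The design trade-off is latency against accuracy: a lower threshold sends more traffic to the slow mode, a higher one risks answering hard questions too quickly.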

Gemini 3 Flash has a thinking latency of 7 seconds, whereas its non-thinking latency is 1 second.

gpt-oss-120B is a strong option, and it is available on Vertex AI and Amazon Bedrock.

SELECT
  runname,
  ROUND(total_score::numeric, 2) AS total_score,
  model_size,
  to_timestamp(model_release_date / 1000)::date AS model_release_date
FROM wandb_llm
LIMIT 10;

name                       thinking  intelligence_index  price_usd  speed  latency_ms
MiMo-V2-Flash (Feb 2026)   true      41                  0.15       144    1.99
gpt-oss-120B (high)        true      33                  0.26       282    0.78
Qwen3.5 9B                 true      32                  0.11       63     0.61
Mistral Small 4            true      27                  0.26       135    0.62
gpt-oss-120B (low)         true      24                  0.26       287    0.75
gpt-oss-20B (high)         true      24                  0.09       297    0.67
NVIDIA Nemotron 3 Nanotrue true      24                  0.10       165    1.49
gpt-oss-20B (low)          true      21                  0.09       302    0.66
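
One rough way to read these rows is intelligence index per dollar. The sketch below copies the `intelligence_index` and `price_usd` values from the rows above and ranks the models by their ratio. The ratio is an assumption made for illustration, not a metric defined by the source, and the pricing unit behind `price_usd` is not stated here.

```python
# (name, intelligence_index, price_usd) copied from the table rows above.
rows = [
    ("MiMo-V2-Flash (Feb 2026)", 41, 0.15),
    ("gpt-oss-120B (high)", 33, 0.26),
    ("Qwen3.5 9B", 32, 0.11),
    ("Mistral Small 4", 27, 0.26),
    ("gpt-oss-120B (low)", 24, 0.26),
    ("gpt-oss-20B (high)", 24, 0.09),
    ("NVIDIA Nemotron 3 Nano", 24, 0.10),
    ("gpt-oss-20B (low)", 21, 0.09),
]

def value_ratio(row):
    # Illustrative metric (an assumption): index points per unit of price.
    _, index, price = row
    return index / price

ranked = sorted(rows, key=value_ratio, reverse=True)

for name, index, price in ranked:
    print(f"{name:26s} {index / price:7.1f}")
```

A ranking like this ignores speed and latency, so it is best treated as one input alongside the other columns.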