Meditation: Self-Supervised Introspection as a Training Phase for Language Models

We introduce meditation, a structured introspection phase inserted between supervised fine-tuning and task-specific reinforcement learning. During meditation, a model freely explores mathematical concepts, constructs its own problems, and develops observations, rewarded by a composite signal blending programmatic verification with LLM-judged novelty.

Nirav Madhani

Working Paper, March 2026

Status: Implementation in Progress

This paper describes a training pipeline that is under active development. SFT is complete and Meditation RL (GRPO) is currently running.

What is done: Full pipeline code, seed generation (1,504 meditations from 188 topics via Gemini 3.1 Flash Lite), quality filtering (601 passed, 40.0%), SFT data preparation and training (3 epochs, final loss 0.675, accuracy 81.6%). See Section 9 for current data statistics.

What is running: Meditation RL (GRPO) on Google Colab L4 GPU (24GB) with Gemini 3.1 Pro Preview as judge (paid API). At step 60+/5000: reward mean 0.43, KL 0.0009, ~77s/step. SDPA attention, bf16, zero judge errors.

What is pending: Task RL, benchmark evaluation, meditation block removal experiments.

Open invitation. If you have GPU access and find the idea interesting, the full pipeline code is available in this repository. You are welcome to run the experiments independently.

Abstract

Current approaches to training reasoning language models follow a two-phase pipeline: supervised fine-tuning (SFT) on chain-of-thought demonstrations, followed by reinforcement learning (RL) on task-specific problems with verifiable rewards. We propose inserting an intermediate phase called meditation, in which the model engages in self-directed conceptual exploration. Given a mathematical topic, the model produces a free-form meditation block: restating concepts in its own framing, probing edge cases, constructing examples and counterexamples, posing and solving novel problems, and synthesizing observations. Meditations are scored by a composite reward function combining programmatic mathematical verification with LLM-judged novelty and problem quality. We optimize this reward using Group Relative Policy Optimization (GRPO). After meditation RL, the model proceeds to standard task RL, and the meditation block is removed at inference. We hypothesize that models trained with the meditation phase will show measurable improvement on GSM8K and MATH-500 compared to matched baselines without meditation, as self-supervised conceptual exploration during training may produce richer representations that transfer to downstream problem-solving. The pipeline code and seed data are complete; training and evaluation are in progress.

Contents
  1. Introduction and Motivation
  2. Related Work
  3. The Meditation Pipeline
  4. What Makes a Good Meditation?
  5. Reward Function Design
  6. Handling Degenerate Cases
  7. Training Details
  8. Removing the Meditation Block
  9. Experiments and Results
  10. Analysis: What Does the Model Learn?
  11. Limitations and Future Work
  12. Conclusion

Section 1. Introduction and Motivation

The dominant recipe for training reasoning language models has converged on a clear pattern. First, supervised fine-tuning on curated chain-of-thought demonstrations teaches the model the format of step-by-step reasoning. Then, reinforcement learning with verifiable rewards (RLVR) teaches the model to actually solve problems correctly. This is the pipeline behind DeepSeek-R1, OpenAI's o1/o3, and the open-source replication efforts that followed.

This pipeline is effective, but it is also narrow. The model only ever encounters mathematical concepts in the context of problems it must solve. It never gets the opportunity to freely explore a concept, form its own questions, or develop intuitions that might transfer across problems. In educational psychology, this distinction is well-studied: students who engage in "productive struggle" with concepts, generating their own examples and testing their understanding, develop more robust and transferable knowledge than those who only practice solving assigned exercises (Kapur, 2008; Chi & Wylie, 2014).

We propose a simple addition to the training pipeline: a meditation phase, inserted between SFT and task RL. During meditation, the model is presented with a mathematical concept (not a problem to solve) and asked to produce a structured exploration. This exploration is evaluated by a composite reward function and optimized with GRPO. The meditation block is then removed at inference time.

Core Hypothesis

A model that has been rewarded for deeply exploring concepts (probing edge cases, inventing problems, finding connections) will develop richer internal representations than one trained only on solving problems. These richer representations will transfer to improved downstream task performance, even after the meditation block is removed.

The analogy to human meditation is deliberate but inexact. In contemplative traditions, meditation involves undirected attention to internal states. In our formulation, meditation is directed exploration of external concepts, but shares the property of being self-structured rather than task-imposed. The model decides what to explore, what problems to pose, and what connections to draw.

Unlike test-time compute approaches that allow the model more tokens at inference, meditation operates during training. The goal is to internalize the benefits of extended exploration into the model's weights, producing a model that is better at solving problems even without explicit exploration at inference time. And unlike self-play methods that generate and filter solutions, meditation generates explorations of concepts, targeting depth of understanding rather than breadth of correct answers.

Section 2. Related Work

2.1 Reasoning via Reinforcement Learning

DeepSeek-R1 (2024) demonstrated that long chain-of-thought reasoning can emerge through RL with verifiable rewards, without requiring supervised demonstrations of reasoning. Their pipeline used GRPO to optimize a binary correctness reward on math and code problems. Subsequent work (Shao et al., 2024) extended this to mathematical domains specifically. Our work builds on this foundation by inserting a pre-task RL phase where the reward signal targets conceptual exploration rather than answer correctness.

2.2 Self-Play and Self-Improvement

The idea of models generating their own training signal has appeared in several forms. STaR (Zelikman et al., 2022) generates rationales and filters by correctness. ReST (Gulcehre et al., 2023) iteratively generates and filters solutions. V-STaR adds a verifier model. Our meditation phase differs in a fundamental way: the model generates explorations rather than solutions, and the reward targets novelty and depth of exploration rather than answer correctness. The distinction is between "practice solving" and "practice understanding."

2.3 Curriculum and Concept Learning

Curriculum learning (Bengio et al., 2009) orders training examples by difficulty. Our approach is complementary: rather than ordering problems, we insert an entire phase where the model engages with concepts at a meta-level before encountering problems. This is closer to "learning to learn" (Finn et al., 2017) but applied at the concept level rather than the task level.

2.4 Test-Time Compute and Extended Thinking

Recent work on scaling test-time compute (Snell et al., 2024) shows that allowing models more "thinking" tokens at inference improves performance. Our meditation phase operates during training rather than inference. The goal is to internalize the benefits of extended exploration into the model's weights, so that improved performance persists even when the meditation block is removed. In this sense, meditation can be viewed as a way to distill the benefits of test-time compute into training-time weight updates.

Section 3. The Meditation Pipeline

Figure 1. The meditation training pipeline: SFT (format learning) → Meditation RL (concept exploration) → Task RL (problem solving) → Post-training (block removal). The Meditation RL stage is our contribution; standard pipelines skip directly from SFT to Task RL.

3.1 Phase 1: Seed Data and SFT

We bootstrap meditation-format data using Gemini 3.1 Flash Lite via API, batching 12 topics per request with structured delimiters for efficient generation within rate limits (15 RPM, 500 RPD). Given a mathematical topic and reference material, the model produces a meditation: a structured but free-form exploration. We generated 8 samples per topic across a curriculum of 188 topics spanning arithmetic, number theory, calculus, linear algebra, abstract algebra, and competition mathematics, yielding 1,504 raw meditations. After quality filtering using programmatic verification (11 pattern types with LaTeX normalization) and heuristic checks including minimum verified claim requirements and unverifiable ratio caps (Section 6), 601 samples (40.0%) passed, forming the SFT training set.

We fine-tune the student model on these demonstrations using QLoRA. SFT training completed in 3 epochs (114 steps, 7 minutes on an L4 GPU), reaching a final training loss of 0.675 and accuracy of 81.6%. This phase teaches the model format only: what a meditation looks like. It does not produce high-quality meditations. That comes from RL.

3.2 Phase 2: Meditation RL

The model is presented with topic prompts and generates meditation blocks. These are scored by a composite reward function (Section 5) combining programmatic math verification with LLM-judged novelty and problem quality. We optimize using GRPO with K=8 group samples per prompt, leveraging the L4's 24GB VRAM headroom to improve advantage estimation over the minimum K=4.
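The group-relative advantage computation at the heart of GRPO can be sketched as follows. This is a minimal illustration of the general technique, not our training code (which uses a standard RL library): each of the K completions is scored relative to its own group's statistics, so no learned value function is needed.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage estimate: z-score each completion's reward
    within its group of K samples drawn from the same prompt.
    Larger K (here K=8) gives a less noisy group mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled meditations for one topic prompt:
advantages = group_relative_advantages([0.1, 0.4, 0.4, 0.7])
```

Completions scoring above their group mean receive positive advantage and are reinforced; those below are suppressed. A degenerate group where all rewards are equal yields zero advantages and thus no gradient signal, which is why Section 9.4 tracks `frac_reward_zero_std`.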

Crucially, the reward targets exploration quality, not answer correctness. The model is rewarded for probing boundaries, constructing illuminating examples, posing non-trivial problems, and synthesizing connections. We apply curriculum scheduling: the first 1000 steps draw 80% from easy topics (Tier 1) where correctness is fully programmatically verifiable, gradually shifting to harder topics (Tier 3) where the judge plays a larger role.

3.3 Phase 3: Task RL

After meditation RL, we proceed to standard task RL on math problems (GSM8K, MATH). The meditation block remains active during this phase: the model meditates on relevant concepts, then produces a solution. Task reward is binary correctness of the final answer.

3.4 Phase 4: Post-Training

At inference time, we remove the meditation block entirely. The model receives a problem and produces a solution directly. The hypothesis is that representations built during meditation RL persist in the weights and improve performance even without the explicit meditation step. We test several removal strategies in Section 8.

Section 4. What Makes a Good Meditation?

Before designing a reward function, we need to define what we're rewarding. Through analysis of teacher-generated seed meditations and iterative design, we identified five components that characterize high-quality meditations:

4.1 Reframing

The model restates the concept in its own words, focusing on meaning rather than formal definition. "Fermat's Little Theorem says that powers cycle in modular arithmetic when the modulus is prime" reveals understanding. "If p is prime and gcd(a,p)=1, then a^(p-1) ≡ 1 (mod p)" is recitation. The reward function's novelty dimension captures this distinction: restatements that use the same words as the reference receive low scores.

4.2 Boundary Probing

The model examines what happens at the edges. What happens when assumptions are violated? What are the minimal conditions? Where does the concept break? For Fermat's Little Theorem, this means checking composites (it breaks: 2^5 mod 6 = 2, not 1), multiples of p (it breaks: 6^2 mod 3 = 0), and whether a generalization exists (Euler's theorem). Boundary probing is what separates memorization from understanding.

4.3 Example and Counterexample Construction

The model constructs its own worked examples: non-trivial, not the first example any textbook would give. Crucially, these examples serve as a self-verification mechanism. If the model claims 2^5 mod 6 = 3, our programmatic verifier catches the error (the actual value is 2), and the entire meditation receives zero reward. This creates a strong incentive for the model to verify its own computations during generation.

4.4 Self-Posed Problem

Why Self-Posed Problems Matter

Inventing a problem is harder than solving one. To pose "Find the last two digits of 7^2024," the model must understand that Fermat/Euler theorems reduce large exponents, that "last two digits" means mod 100, and that φ(100)=40 creates a tractable cycle. This requires the model to see the concept from the problem-designer's perspective, which we hypothesize builds richer representations than only seeing it from the solver's perspective.

The self-posed problem component may be the most important contributor to downstream task improvement. Problem-posing requires the model to understand what kinds of questions a concept can answer, which is precisely the knowledge needed to recognize when a concept is applicable to an unseen problem. We weight this dimension heavily (0.35) in the composite reward.

4.5 Synthesis

The model draws a connection to a broader mathematical idea: cyclic group structure, the relationship between Fermat and Euler, modular arithmetic as a tool for converting infinite computations to finite ones. This tests whether the model can situate specific concepts within a larger framework.

4.6 Example Meditation

Topic: Modular arithmetic - Fermat's Little Theorem
Reference: If p is prime and gcd(a,p)=1, then a^(p-1) ≡ 1 (mod p).

Fermat's Little Theorem says that powers cycle in modular arithmetic
when the modulus is prime. Raising any non-multiple of p to the
(p-1)th power resets to 1 mod p.

The gcd condition does real work. Take a=6, p=3: gcd(6,3)=3 and
6^2 = 36 ≡ 0 (mod 3), not 1. The theorem fails exactly when a
shares a factor with p.

For composite moduli this breaks completely. Take a=2, n=6:
2^5 = 32, 32 mod 6 = 2, not 1. Euler's theorem fixes this:
a^phi(n) ≡ 1 (mod n) when gcd(a,n)=1. For prime p, phi(p)=p-1,
recovering Fermat as a special case.

Self-posed problem: Find the last two digits of 7^2024.
This means 7^2024 mod 100. gcd(7,100)=1 and phi(100)=40,
so 7^40 ≡ 1 (mod 100). 2024 = 40*50 + 24, so 7^2024 ≡ 7^24.
Now 7^4 = 2401 ≡ 1 (mod 100), so 7^24 = (7^4)^6 ≡ 1 (mod 100).
Last two digits: 01.

The deeper pattern: modular exponentiation converts infinite
computation into finite cyclic structure. Fermat/Euler give
the cycle length.

Section 5. Reward Function Design

The reward function is the most critical design decision. A bad reward signal produces either reward-hacked meditations (verbose restating without insight) or mode collapse (identical template for every topic). We use a multi-dimensional composite reward with a hard correctness gate.

5.1 Composite Reward

R = correctness × (0.40 · novelty + 0.35 · problem_quality + 0.25 · coherence) × compression (1)

Each dimension is scored on [0, 1]. The correctness gate is multiplicative: any mathematical error zeroes the entire reward, regardless of creativity. This is deliberate. We want to reward exploration but never at the cost of mathematical rigor.
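Equation (1) can be written out directly. This is a minimal sketch of the scoring arithmetic only (the dimension scores themselves come from the verifier and judge described below); the compression factor follows Equation (2) in Section 5.6.

```python
def composite_reward(correct, novelty, problem_quality, coherence,
                     tokens, max_tokens=768):
    """Eq. (1): hard correctness gate times weighted quality dimensions,
    times the Eq. (2) compression bonus (up to +10% for concise output).
    All dimension scores lie in [0, 1]."""
    if not correct:
        # Any verified mathematical error zeroes the entire reward.
        return 0.0
    quality = 0.40 * novelty + 0.35 * problem_quality + 0.25 * coherence
    compression = 1 + 0.1 * (1 - min(tokens, max_tokens) / max_tokens)
    return quality * compression
```

Note the multiplicative structure: a perfect-quality meditation at the token cap scores exactly 1.0, while the same meditation with one arithmetic error scores 0.0.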

5.2 Correctness Gate (Binary, Programmatic)

We extract verifiable claims from meditations using regex pattern matching: modular exponentiations (a^b mod c = d), basic arithmetic, factorial claims, and binomial coefficient claims. Each is verified programmatically using Python's pow() and SymPy. For topics where claims cannot be extracted automatically, we fall back to the LLM judge with a conservative threshold (correctness probability < 0.5 triggers the gate).

Design Decision: Why a Hard Gate?

We considered a soft correctness penalty (R *= correctness_score) but expect that it would teach the model that "slightly wrong but very creative" meditations still earn non-trivial reward. The hard gate is designed to eliminate this failure mode. The model should quickly learn that mathematical errors are catastrophic, which is the correct prior for mathematical reasoning.

5.3 Novelty (LLM-Judged, weight 0.40)

The judge receives the original topic, reference material, and meditation. It evaluates what fraction of ideas go beyond the reference. Paraphrasing scores 0.1-0.2. Non-obvious connections or illuminating counterexamples score 0.7+. This is the highest-weighted dimension because it most directly measures what meditation adds beyond standard training.

5.4 Problem Quality (LLM-Judged + Programmatic, weight 0.35)

We extract the self-posed problem section using heuristic keyword matching and evaluate relevance, non-triviality, and solution correctness. If no problem is detected, this dimension receives a floor score of 0.1 rather than 0.0. Setting the floor above zero is important: a zero score would create a gradient away from problem-posing entirely, and we'd rather have a weak attempt than no attempt.

5.5 Coherence (LLM-Judged, weight 0.25)

Evaluates logical flow. Primarily a regularizer against degenerate outputs. Low weight prevents over-prioritizing surface readability at the expense of mathematical substance.

5.6 Compression Bonus

compression = 1 + 0.1 × (1 − min(tokens, max_tokens) / max_tokens) (2)

A mild (up to 10%) bonus for concise meditations. This counteracts RL models' natural tendency toward verbosity.

5.7 Judge Confidence Handling

We prompt the judge to output both a score and confidence value:

Confidence | Handling | Rationale
≥ 0.7 | Use score directly | Judge is confident
0.4–0.7 | Blend: 0.6 × fallback + 0.4 × score | Hedge toward safe default
< 0.4 | Programmatic dimensions only | Judge unreliable; use math verification and length heuristics

Fallback scores: 0.5 for novelty, 0.3 for problem quality, 0.6 for coherence. Below the 0.4 confidence threshold the judge's scores are discarded in favor of these fallbacks and programmatic signals; in the intermediate band they are blended toward the fallback values as shown above.
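The confidence-banded handling can be sketched as a small dispatch function. This is a simplified illustration (function name assumed); in the intermediate band it applies the fixed 0.6/0.4 blend described above.

```python
# Fixed fallback scores per judged dimension (Section 5.7).
FALLBACK = {"novelty": 0.5, "problem_quality": 0.3, "coherence": 0.6}

def resolve_judge_score(dim, score, confidence):
    """Resolve one judged dimension according to the confidence bands:
    trust the judge when confident, blend when uncertain, and fall
    back entirely when the judge is unreliable."""
    fb = FALLBACK[dim]
    if confidence >= 0.7:
        return score                     # judge is confident
    if confidence >= 0.4:
        return 0.6 * fb + 0.4 * score    # hedge toward safe default
    return fb                            # judge unreliable: fallback only
```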

Section 6. Handling Degenerate Cases

RL training is prone to exploitation of the reward signal. We have identified and designed mitigations for several anticipated degenerate modes.

6.1 Empty Meditations

The model discovers that silence avoids negative reward. Mitigation: meditations shorter than 50 tokens receive R = -0.1. Mild penalty, not catastrophic, to nudge the model toward attempting something.

6.2 Verbose Vacuity

500 tokens of restating the input with filler. Two mechanisms: the novelty dimension directly penalizes low originality, and the compression bonus makes verbose output with the same score receive less reward than concise output.

6.3 Mode Collapse

Every meditation follows the same template. Detection: mean pairwise cosine similarity of meditation embeddings within a GRPO group. When similarity exceeds 0.85, we apply a diversity penalty. We also use generation temperature 0.8-0.9 during RL sampling to encourage structural variety.
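The mode-collapse detector can be sketched as follows. This is a minimal pure-Python illustration (helper names assumed); in practice the embeddings would come from a sentence encoder, and the penalty value is one plausible choice.

```python
from itertools import combinations
from math import sqrt

def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all pairs of meditation embeddings
    within one GRPO group. Values near 1.0 indicate near-identical
    outputs (template collapse)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sqrt(sum(a * a for a in u))
        nv = sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    sims = [cos(u, v) for u, v in combinations(embeddings, 2)]
    return sum(sims) / len(sims)

def diversity_penalty(embeddings, threshold=0.85, penalty=0.1):
    """Apply a reward penalty when the group collapses onto one
    template (mean pairwise similarity above the 0.85 threshold)."""
    return penalty if mean_pairwise_cosine(embeddings) > threshold else 0.0
```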

6.4 Reward Hacking

Critical Risk

The gap between "meditation reward" and "downstream task performance" is the central evaluation question. If we observe strong meditation reward improvement with no task improvement, the meditation phase is not producing useful representations, regardless of how impressive the meditations look qualitatively. We monitor GSM8K accuracy every 200 RL steps and halt training if divergence persists for 500+ steps.

6.5 Summary

Failure Mode | Detection | Mitigation
Empty output | Tokens < 50 | R = -0.1
Verbose vacuity | Low novelty + high tokens | Compression bonus + novelty scoring
Mode collapse | Cosine sim > 0.85 | Diversity penalty, temp = 0.8
Math errors | Programmatic verification | Hard correctness gate (R = 0)
No problem posed | Extraction returns null | problem_quality floor = 0.1
Judge uncertain | Confidence < 0.4 | Programmatic-only scoring
Reward hacking | Reward up, GSM8K flat | Early stopping + reward audit

Section 7. Training Details

7.1 Student Model

We use LiquidAI/LFM2.5-1.2B-Thinking: a 1.17B parameter dense hybrid model combining 10 gated short-convolution blocks with 6 grouped-query attention blocks and SwiGLU FFN layers. This is a "thinking" model with native <think> tokens, designed for reasoning tasks. We train with QLoRA (4-bit NF4 quantization, double quantization) targeting attention projections (q_proj, k_proj, v_proj, out_proj), convolution gating (in_proj), and FFN layers (w1, w2, w3). LoRA rank 32, alpha 64, 22.2M trainable parameters (1.86% of total).
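The quantization and adapter setup described above corresponds to configurations like the following sketch using the `transformers` and `peft` libraries. The target module names are taken from Section 7.1 but their exact spelling in the model's state dict is an assumption; verify against the checkpoint before use.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank 32 / alpha 64 over attention projections, convolution
# gating, and FFN layers (module names assumed from Section 7.1).
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj",
                    "in_proj", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
```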

7.2 Judge Model

Google Gemini 3.1 Pro Preview (95.1% on MATH-500) served via the Generative Language API (generativelanguage.googleapis.com/v1beta/) as an OpenAI-compatible endpoint. The judge runs as a cloud API with no GPU contention against the student model. We batch all K=8 completions per GRPO step into a single API call using structured delimiters ([MEDITATION N]...[/MEDITATION N]), reducing API calls from 8 per step to 1 per step. On API errors (503 high-demand, rate limits), the system falls back to safe fixed scores (novelty=0.5, problem_quality=0.3, coherence=0.6) to prevent training disruption.
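The batched judging scheme can be sketched as follows. The delimiter format matches Section 7.2; the reply format and helper names are hypothetical (the real judge is prompted for structured output), but the structure, one API call carrying all K completions with per-meditation fallbacks on parse failure, is the mechanism described above.

```python
import re

def build_judge_prompt(topic, meditations):
    """Pack all K completions into one judge prompt using the
    [MEDITATION N]...[/MEDITATION N] delimiter scheme, so one API
    call replaces K separate calls."""
    blocks = "\n".join(
        f"[MEDITATION {i}]\n{m}\n[/MEDITATION {i}]"
        for i, m in enumerate(meditations, 1))
    return f"Topic: {topic}\nScore each meditation below.\n{blocks}"

# Hypothetical reply line format: "N: novelty=... problem=... coherence=..."
LINE = re.compile(
    r"(\d+):\s*novelty=([\d.]+)\s+problem=([\d.]+)\s+coherence=([\d.]+)")

def parse_judge_reply(reply, k):
    """Parse per-meditation scores; any missing or malformed entry
    falls back to the safe fixed scores (0.5, 0.3, 0.6)."""
    scores = {int(n): (float(a), float(b), float(c))
              for n, a, b, c in LINE.findall(reply)}
    return [scores.get(i, (0.5, 0.3, 0.6)) for i in range(1, k + 1)]
```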

7.3 Hardware

Training runs on Google Colab Pro with an NVIDIA L4 GPU (24GB VRAM, SM89 Ada Lovelace, native bf16 support). The 1.2B model in 4-bit QLoRA uses approximately 3.7GB VRAM, leaving substantial headroom for batch size 2 with gradient accumulation 4. We use PyTorch SDPA (Scaled Dot Product Attention), which is built into PyTorch 2.x and performs comparably to Flash Attention 2 on Ada Lovelace GPUs without requiring the 25+ minute source compilation of the flash-attn package. Gradient checkpointing, 8-bit Adam optimizer, maximum sequence length 2048, and bfloat16 compute dtype. SFT completed in 7 minutes; Meditation RL runs at approximately 77 seconds per step.

7.4 Hyperparameters

Parameter | SFT | Meditation RL | Task RL
Learning rate | 2 × 10^-4 | 5 × 10^-7 | 3 × 10^-7
Per-device batch size | 4 | 2 | 2
Gradient accumulation | 4 | 4 | 4
Effective batch size | 16 | 8 | 8
GRPO K (generations) | — | 8 | 4
KL coefficient | — | 0.04 | 0.03
Max completion length | 2048 | 768 | 1024
Training duration | 3 epochs (114 steps) | 5,000 steps (budget-limited) | 3,000 steps

7.5 Curriculum Scheduling

Steps | Tier 1 (Easy, verifiable) | Tier 2 (Medium) | Tier 3 (Hard, judge-dependent)
0–1000 | 80% | 20% | 0%
1000–3000 | 30% | 50% | 20%
3000+ | 10% | 30% | 60%

Starting with easy topics where the reward signal is programmatically clean allows the model to learn the meditation format before encountering noisier judge-dependent rewards.

Section 8. Removing the Meditation Block

The meditation block is a training-time scaffold. At inference, we want the model to solve problems directly, with the knowledge from meditation internalized but the format removed. We test three strategies:

8.1 Direct Removal

Omit the meditation tags from the inference prompt. Simplest approach and our default. The model's weights were shaped by meditation RL, so representations should persist.

8.2 Gradual Shrinking

SFT on progressively shorter meditations: 512 → 256 → 128 → 0 tokens, one epoch each. Gives the model time to compress its meditation into hidden states.

8.3 Knowledge Distillation

Teacher (with meditation) generates soft labels for a student (without meditation). Most expensive, most robust.

Planned Experiment

Once training is complete, we will compare GSM8K and MATH-500 accuracy across removal strategies (direct removal, gradual shrinking, knowledge distillation), with meditation-active as the upper bound and the no-meditation baseline as the lower bound.

Section 9. Experiments and Results

9.1 Baselines

Configuration | Pipeline
Base | LFM2.5-1.2B-Thinking (no fine-tuning)
SFT+TaskRL | SFT → Task RL (standard, no meditation, same total RL steps)
Meditation | SFT → Meditation RL → Task RL (block active at inference)
Meditation⁻ | Same as above, block removed at inference

The SFT+TaskRL baseline receives the same total RL training steps as the Meditation configuration, all spent on task RL. This controls for total training compute.

9.2 Seed Data Statistics

Seed generation and filtering are complete. These numbers characterize the training data used for SFT.

Metric | Value
Topics in curriculum | 188 (across 3 tiers)
Samples per topic | 8
Raw seeds generated | 1,504
Seeds passing quality filter | 601 (40.0%)
Seed generation model | Gemini 3.1 Flash Lite (API, 12 topics/request)
Math verifier patterns | 11 types (modular, GCD, factorial, binomial, power, prime, phi, divisibility, arithmetic)
LaTeX normalization | Yes ($, \pmod, \equiv, \binom, \frac, etc.)
Min verified claims | Tier 1 ≥ 1, Tier 2 ≥ 1, Tier 3 = 0
Max unverifiable ratio | ≤ 70%

The 40% pass rate reflects deliberately strict filtering. The initial filter (94.9% pass rate) was too permissive, passing meditations with no verifiable mathematical claims. After expanding the math verifier from 4 to 11 pattern types, adding LaTeX normalization for robust claim extraction, requiring minimum verified claims per tier, and capping the unverifiable ratio, the filter became a meaningful quality gate. This produces higher-quality SFT demonstrations at the cost of a smaller training set (601 vs. 1,427), which we believe is a favorable tradeoff for RL fine-tuning where data quality matters more than quantity.
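The filter logic described above can be sketched as a predicate over per-meditation claim counts. This is a simplified illustration (the rejection-on-any-failed-claim rule mirrors the hard correctness gate of Section 5.2, and is an assumption about the filter's exact behavior):

```python
# Per-tier minimum verified claims and unverifiable-ratio cap (Section 9.2).
MIN_VERIFIED = {1: 1, 2: 1, 3: 0}
MAX_UNVERIFIABLE_RATIO = 0.70

def passes_filter(tier, n_verified, n_failed, n_unverifiable):
    """Seed quality filter: any failed claim rejects, tiers 1-2 need
    at least one programmatically verified claim, and at most 70% of
    extracted claims may be unverifiable."""
    if n_failed > 0:
        return False
    if n_verified < MIN_VERIFIED[tier]:
        return False
    total = n_verified + n_failed + n_unverifiable
    if total and n_unverifiable / total > MAX_UNVERIFIABLE_RATIO:
        return False
    return True
```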

9.3 SFT Training Results

Metric | Value
Training samples | 601
Epochs | 3 (114 steps)
Training time | 7 minutes (Google Colab L4)
Initial loss | 2.133
Final loss | 0.675
Initial accuracy | 58.5%
Final accuracy | 81.6%
GPU memory used | ~6 GB / 24 GB

9.4 Meditation RL: Early Training Metrics (Steps 0–60)

Meditation RL training is in progress using GRPO with K=8 generations per prompt on a Google Colab L4 GPU. The judge (Gemini 3.1 Pro Preview, paid API) scores novelty, problem quality, and coherence via batched API calls. Early metrics show stable learning dynamics with zero judge API errors.

Step | Reward Mean | KL Divergence | Loss | Entropy | Completion Length | Clipped Ratio | Step Time
10 | 0.423 | 0.00066 | 0.065 | 0.669 | 513 tokens | 27.5% | 77s
20 | 0.434 | 0.00070 | 0.210 | 0.616 | 541 tokens | 32.5% | 78s
30 | 0.443 | 0.00079 | 0.284 | 0.681 | 556 tokens | 36.3% | 77s
40 | 0.452 | 0.00079 | 0.270 | 0.680 | 505 tokens | 21.3% | 75s
50 | 0.441 | 0.00078 | 0.164 | 0.624 | 603 tokens | 26.3% | 78s
60 | 0.428 | 0.00087 | 0.175 | 0.652 | 539 tokens | 26.3% | 77s

GPU utilization: 3.7 GB / 22.5 GB VRAM (16%), 66% GPU compute. Checkpoints saved every ~15 minutes wall-clock (steps 12, 24, 36, 48, 60).

Early Observations

Reward: Stable in the 0.42–0.45 range across all 60 steps. The reward peaked at 0.452 (step 40) before settling back. This plateau is typical for early GRPO—the model is stabilizing its output distribution before meaningful reward climbs, which we expect around step 100–200 based on prior runs.
KL divergence: Rising slowly (0.00066 → 0.00087), well within healthy range. The KL coefficient (β=0.04) is preventing policy drift from the SFT reference.
Entropy: Fluctuating between 0.62–0.68, indicating no mode collapse (which would show entropy < 0.3).
Completion length: Fluctuating around 505–603 tokens. Clipped ratio (hitting 768-token cap) decreased from 27.5% to 26.3%, suggesting the compression bonus is taking effect.
Reward std: Dropped from 0.095 to 0.032 over 60 steps, meaning the model is producing more consistent quality across GRPO groups. The frac_reward_zero_std remains 0, so every group has reward variance, ensuring informative gradients.
Judge reliability: Zero API errors after switching to Gemini 3.1 Pro Preview with paid API key (the free-tier Flash Lite model returned 503 errors during peak demand).

9.5 Sample Generated Meditations

Below are unedited excerpts from the seed generation phase, one per tier, to illustrate the quality and variety of the training data.

Tier 1 (Fully Verifiable): Chebyshev Polynomials

Reference: T_n(cos θ) = cos(nθ). T_0=1, T_1=x, T_{n+1}=2xT_n-T_{n-1}.

To treat the Chebyshev polynomials T_n(x), I must recall they are the
unique orthogonal polynomials on [-1, 1] that also possess the monodromy
property T_n(cos θ) = cos(nθ).

Consider boundary conditions: T_n(1) ≡ 1 and T_n(-1) ≡ (-1)^n.
For n=1, T_1(x)=x. If we attempt n=0, T_0(x)=1, but T_0(-1)=1
≠ (-1)^0? No, 1=1. The standard definition gives T_1=x and T_0=1.

Now, construct a counterexample. Let x = -1. T_1(-1) = -1. T_0(-1) = 1.
The recurrence holds. …

Tier 2 (Partially Verifiable): L'Hôpital's Rule

Reference: If lim f/g is 0/0 or ∞/∞, then lim f/g = lim f'/g' (if latter exists).

L'Hôpital's rule is not merely a procedural shortcut for evaluating
indeterminate limits; it is a geometric assertion of continuity and
differentiability. The rule asserts that the "slopes" of these curves
converge. If the derivatives f'(x) and g'(x) are continuous and non-zero
near a, then the limit of the ratio f(x)/g(x) exists if and only if
the limit of the ratio of their derivatives exists.

The rule fails catastrophically if g'(x) = 0 in a neighborhood of a …

Tier 3 (Judge-Dependent): Lagrange's Theorem (Groups)

Reference: Order of a subgroup divides order of the group. |H| divides |G|.

Let G be a finite group of order |G| = n, and let H ≤ G be a subgroup
of order |H| = m. If G is abelian, the order of any element g ∈ G
divides n, which implies that the order of H divides n for all H.

Consider the edge case where G is non-abelian. Let G = S_3 (symmetric
group on 3 elements), so |G| = 6. Subgroups can be H_1 = {(e), (12)}
with |H_1| = 2 …

9.6 Planned Benchmark Evaluation

Pending — Training Not Yet Complete

After SFT, Meditation RL, and Task RL are complete, we will evaluate on:
GSM8K (test set, 1,319 problems) — primary benchmark
MATH-500 (500 competition-level problems) — harder, tests deeper reasoning
ARC-Challenge — reasoning generalization outside mathematics
All benchmarks will report 95% confidence intervals. Per-subject MATH breakdown will test the hypothesis that meditation helps most where boundary probing is directly useful (number theory, algebra) versus execution-dominated tasks (arithmetic). Training curves (meditation reward vs. GSM8K accuracy over RL steps) will test whether meditation reward correlates with task performance.

9.7 Compute Budget

Phase | GPU Hours | API Cost | Notes
Seed generation | — | Gemini Flash Lite (free tier) | 126 batched API calls, 1,504 samples
SFT (3 epochs) | 0.12 | — | Google Colab L4, 7 minutes
Meditation RL | ~107 (est.) | Gemini 3.1 Pro (paid) | GRPO K=8, ~77s/step, 5,000-step target
Task RL (3K steps) | TBD | — | Programmatic reward, no judge API
Evaluation | ~1–2 | — | GSM8K + MATH-500 inference

Meditation RL uses Gemini 3.1 Pro Preview via paid API (no per-step budget limit). At ~77 seconds per step on Google Colab L4, the 5,000-step target requires approximately 107 GPU hours across multiple Colab Pro sessions (12 hours each, ~560 steps per session).

Section 10. Analysis: What Does the Model Learn?

10.1 Meditation Quality Progression (Planned)

Planned Analysis

After Meditation RL training, we will compare the same topic's meditation at steps 100, 1,000, and 3,000 to track which components (reframing, boundary probing, self-posed problems, synthesis) emerge and improve over training. This will test whether RL reward shaping successfully steers generation quality beyond SFT-level imitation.

10.2 Representation Analysis (Planned)

Planned Analysis

We will extract hidden states from the final transformer layer on MATH problems and visualize via t-SNE, colored by subject. The hypothesis: meditation-trained models show better-separated concept clusters than the SFT+TaskRL baseline, suggesting meditation RL produces more structured internal representations.

10.3 Per-Domain Breakdown

Hypothesis

Meditation helps most for domains where boundary probing and example construction are directly useful (number theory, modular arithmetic) and less where execution skill dominates (arithmetic word problems). We expect largest gains on MATH subjects like Number Theory and Intermediate Algebra. This will be tested via per-subject MATH-500 accuracy comparisons between Meditation and SFT+TaskRL configurations.

10.4 Human Evaluation (Planned)

Planned Analysis

Human raters will score meditations on correctness, insight, and creativity (1–5 scale) across three conditions: teacher seeds, SFT checkpoint output, and RL-final output. This tests whether RL training produces meditations that are qualitatively better, not merely higher-scoring under the automated reward.

Section 11. Limitations and Future Work

11.1 Limitations

Domain scope. We evaluate only on mathematics. Whether meditation transfers to code, scientific reasoning, or logical deduction is open. Our hypothesis is that it helps most for domains with rich axiomatic structure where boundary probing is directly useful.

Judge quality. While Gemini 3.1 Pro Preview (95.1% MATH-500) is a strong judge, it is accessed through a paid API with a finite budget, which bounds the total number of RL steps that can receive judge-scored rewards. After the budget is exhausted, training continues with programmatic-only rewards, which may provide a weaker learning signal for the novelty and problem-quality dimensions.

Scale. We train a single dense model (1.17B parameters) on a single GPU. Whether meditation benefits scale with model size is unknown. Larger models with stronger base representations may benefit less from additional conceptual exploration.

Reward sensitivity. Weight coefficients (0.40/0.35/0.25) were chosen by calibration, not systematic search. Different weights may produce qualitatively different meditation styles and different downstream effects.
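As a concrete reading of these coefficients, the composite reward with the hard correctness gate the paper describes can be sketched as follows. The paper confirms only that novelty carries 0.40; assigning 0.35 to verification and 0.25 to problem quality is our assumption.

```python
def composite_reward(novelty: float, verification: float, quality: float,
                     correctness_gate: bool) -> float:
    """Weighted blend of reward dimensions, zeroed when the correctness
    gate fails (creative-but-wrong meditations receive no reward).
    The 0.35/0.25 assignment below is assumed, not confirmed."""
    if not correctness_gate:
        return 0.0
    return 0.40 * novelty + 0.35 * verification + 0.25 * quality

print(composite_reward(0.8, 0.9, 0.5, True))   # 0.76
print(composite_reward(0.8, 0.9, 0.5, False))  # 0.0
```

A reward-sensitivity sweep would vary the three coefficients and compare both meditation style and downstream accuracy.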

11.2 Future Directions

Cross-domain transfer. After math meditation, does the model explore code or science concepts better without domain-specific meditation RL? If so, meditation may produce general "exploration skills."

Meditation on prerequisites. Feed the model its best past meditations on prerequisite topics before it meditates on advanced ones. This mimics building on one's own prior understanding.

Multi-turn meditation. Decompose into stages (explore → verify → synthesize) with intermediate rewards, enabling finer credit assignment.
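A minimal sketch of the staged decomposition, with per-stage rewards accumulated for finer credit assignment; the stage names match the text, while the discounting scheme is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    reward: float  # intermediate reward for this stage

def staged_return(stages: list[Stage], discount: float = 1.0) -> float:
    """Discounted sum of intermediate rewards over explore -> verify -> synthesize."""
    return sum(s.reward * discount ** i for i, s in enumerate(stages))

trajectory = [Stage("explore", 0.3), Stage("verify", 0.5), Stage("synthesize", 0.4)]
print(staged_return(trajectory))        # undiscounted: 1.2
print(staged_return(trajectory, 0.9))   # 0.3 + 0.45 + 0.324 = 1.074
```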

Adaptive test-time meditation. After removing the block in post-training, optionally re-enable it at inference for hard problems. This positions meditation as adaptive test-time compute for concept exploration.
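The adaptive toggle could be as simple as prepending a meditation instruction when a difficulty estimate crosses a threshold; the prompt wording, threshold, and difficulty signal below are all hypothetical.

```python
MEDITATION_PREFIX = (
    "Before answering, briefly meditate on the relevant concepts: "
    "restate them, probe edge cases, then solve.\n\n"
)  # hypothetical prompt wording

def build_prompt(problem: str, difficulty: float, threshold: float = 0.7) -> str:
    """Re-enable the meditation block at inference only for hard problems."""
    if difficulty >= threshold:
        return MEDITATION_PREFIX + problem
    return problem

print(build_prompt("Compute 2 + 2.", difficulty=0.1))
print(build_prompt("Prove there are infinitely many primes.", difficulty=0.9))
```

The difficulty estimate could come from a lightweight classifier or from the model's own uncertainty, which is itself an open design question.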

Judge co-evolution. As the student improves, periodically recalibrate the judge or use the growing student as a co-judge to maintain reward signal quality.

Section 12. Conclusion

We introduced meditation, a training phase in which a language model engages in self-directed exploration of mathematical concepts, rewarded for the quality of its exploration rather than the correctness of answers to specific problems. The key design decisions are: a multi-dimensional reward combining programmatic verification with LLM-judged novelty, a hard correctness gate that prevents creative-but-wrong meditations from receiving reward, curriculum scheduling from easy (verifiable) to hard (judge-dependent) topics, and degenerate case mitigations covering empty output, verbose vacuity, mode collapse, and reward hacking.

The meditation framework opens a design space for training-time self-supervised exploration that is complementary to existing RL-on-tasks approaches. The central question remains empirically open: does spending training-time understanding concepts before being asked to use them produce measurably better downstream problem solvers? The pipeline code and seed data are complete; benchmark results will be added as training progresses.

The core risk, which we acknowledge explicitly, is that the reward function may incentivize meditation-like text rather than genuine conceptual depth. Novelty is the highest-weighted dimension (0.40) and is scored by an external LLM judge. If the noisiest reward component is also the most powerful, training could drift toward outputs that look original to the judge without producing transferable representations. The correctness gate and compression bonus partially mitigate this, but the definitive test is whether downstream task accuracy improves. We will treat meditation reward diverging from GSM8K accuracy as evidence of reward hacking, not success.

Code: huggingface.co/spaces/Nirav-Madhani/Meditation

Seed data: 601 filtered meditations across 188 topics available in data/seeds_filtered.jsonl.

Trained checkpoints: Will be released after training completes.

References

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. ICML.
Chi, M. T. H., & Wylie, R. (2014). The ICAP framework: linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4), 219-243.
DeepSeek-AI. (2024). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint.
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.
Gulcehre, C., Paine, T. L., Srinivasan, S., et al. (2023). Reinforced self-training (ReST) for language modeling. arXiv preprint.
Kapur, M. (2008). Productive failure. Cognition and Instruction, 26(3), 379-424.
Liquid AI. (2025). LFM2 Technical Report. arXiv:2511.23404.
Shao, Z., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint.
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint.
Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping reasoning with reasoning. NeurIPS.