Technical · AI Research · Training Data · Corpus Analysis

What “Lunapolis” Reveals About the Shared Training Corpora of Modern LLMs

By Tamir · May 12, 2025 · 4 min read

A data‑centric look at why multiple large‑language models invent the same lunar‑city names—and what that convergence teaches us about their overlapping training sets.


Ask four different state‑of‑the‑art language‑model APIs to coin a one‑word name for a hypothetical lunar capital and you’ll almost certainly receive Lunapolis or Lunaris. On the surface it’s a fun quirk of creativity. Look closer and it becomes a diagnostic probe of the overlapping text corpora behind today’s LLMs.

Model (2025)                  | Provider  | Answer      | Latency | Tokens | Cost
Gemini 2.0 Flash              | Google    | Luna        | 0.52 s  | 19     | $0.000004
Mistral Large (Latest)        | Mistral   | Lunaropolis | 0.54 s  | 25     | $0.000111
GPT‑4.1                       | OpenAI    | Lunaris     | 0.93 s  | 27     | $0.000117
Claude 3.7 Sonnet (Feb 2025)  | Anthropic | Lunopolis   | 1.22 s  | 30     | $0.000261
deepseek‑chat                 | DeepSeek  | Lunara      | 4.33 s  | 22     | $0.000013
o4‑mini                       | OpenAI    | Lunaris     | 4.63 s  | 19     | $0.000041

The Big Question

Why do models from different vendors—trained on trillions of tokens and fine‑tuned with separate alignment pipelines—converge on the same two coinages? The short answer: they share a surprisingly similar diet of text. Unpack that diet and you gain a window into how data overlap, frequency bias, and tokenization shape generative output.

Inside the Overlap

1 · Common‑Crawl‑Centric Pipelines

Nearly every major LLM pipeline begins with a de‑duplicated slice of Common Crawl. That multi‑petabyte web archive contains countless sci‑fi snippets, NaNoWriMo drafts, and fan‑fiction forums where “Lunapolis” and “Lunaris” appear. Remove duplicate URLs all you like—the rare coinages still survive because they live on many distinct sites.
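A toy illustration of that point in Python: the crawl records and URLs below are entirely made up, but the two‑step pattern (exact‑URL de‑duplication, then a check of how many distinct sites still mention the coinage) mirrors why URL‑level cleaning leaves rare coinages intact.

```python
from urllib.parse import urlparse

# Toy crawl records: (url, text). Illustrative only -- not real Common Crawl data.
records = [
    ("https://scifi-forum.example/thread/42", "Welcome to Lunapolis, capital of the Moon."),
    ("https://scifi-forum.example/thread/42", "Welcome to Lunapolis, capital of the Moon."),  # exact duplicate URL
    ("https://fanfic.example/story/moon-queen", "The shuttle docked above Lunapolis at dawn."),
    ("https://wiki.example/lunar-settlements", "Lunaris and Lunapolis are common fictional names."),
    ("https://news.example/space", "NASA announced a new lunar lander contract."),
]

# Step 1: URL-level de-duplication, as most pretraining pipelines do.
deduped = {}
for url, text in records:
    deduped.setdefault(url, text)

# Step 2: the rare coinage still survives because it appears on many *distinct* hosts.
term = "Lunapolis"
hosts_with_term = {urlparse(url).netloc for url, text in deduped.items() if term in text}

print(f"Documents after URL dedup: {len(deduped)}")
print(f"Distinct sites still containing {term!r}: {len(hosts_with_term)}")
```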

2 · Books3 & Public‑Domain Fiction

Datasets such as Books3 (a mirror of roughly 200,000 e‑books, many of them fiction) and Project Gutenberg reprints sprinkle “Luna‑polis”‑style terms across pulp‑era novels. When teams like BigScience or OpenAI filter for “quality English prose,” they scoop up the same vintage titles and thus the same lunar neologisms.

3 · Wikipedia & Fandom Wikis

Even Wikipedia’s List of Fictional Lunar Settlements collects and standardises those names, ensuring they appear in every open‑license snapshot shipped to model trainers. Ditto for Fandom wiki pages about space‑opera games—another ubiquitous ingredient in many LLM corpora.

4 · Token Frequency & Sampling Energy

When you plot token frequencies across these corpora, Lunapolis and Lunaris might occur only a few hundred times each—but no other single‑word lunar capital appears more often. Even a modest frequency edge translates into a noticeably higher soft‑max probability at inference time, especially when the prompt narrows the search space.
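To make the frequency‑edge argument concrete, here is a minimal sketch. The counts are invented for illustration, and treating log‑frequency as a logit is a deliberate oversimplification of how a real model scores candidates, but it shows how a few hundred extra occurrences turn into a large share of the softmax mass, and how a lower sampling temperature sharpens that edge further.

```python
import math

# Hypothetical corpus counts for one-word lunar-capital coinages.
# These numbers are invented for illustration, not measured from any dataset.
counts = {
    "Lunapolis": 320,
    "Lunaris": 260,
    "Selenopolis": 45,
    "Lunagrad": 20,
    "Moonhaven": 15,
}

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given sampling temperature."""
    exps = {k: math.exp(v / temperature) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Crude simplification: pretend the model's logit for each candidate is its
# log-frequency once the prompt has narrowed the choice to these five words.
logits = {name: math.log(c) for name, c in counts.items()}

for t in (1.0, 0.7):
    probs = softmax(logits, temperature=t)
    top = sorted(probs.items(), key=lambda kv: -kv[1])
    print(f"temperature={t}: " + ", ".join(f"{name} {p:.0%}" for name, p in top))
```

At temperature 1.0 the two “Luna‑” names already absorb most of the probability mass; at 0.7 the gap widens further, which is consistent with the convergence in the comparison table.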

[Image: artist’s rendering of a futuristic lunar city]
Concept art of a Moon settlement—exactly the kind of imagery that often accompanies the text snippets in training data. Image: Space Architect / Unsplash

Corpus Lessons in a Nutshell

  • Data redundancy means that rare sci‑fi coinages can become statistically “safe” choices if they appear across multiple public sources.
  • Cleaning ≠ originality. Duplicate trimming and profanity filters eliminate noise but rarely reduce myth‑adjacent neologisms.
  • Prompt probes can reverse‑engineer corpora. Asking dozens of whimsical questions lets researchers infer whether niche phrases sit inside the training set; a minimal repetition‑rate sketch follows this list.
  • Alignment layers amplify the overlap. RLHF raters reward answers that sound polished yet familiar, reinforcing the already skewed token distribution inherited from pre‑training.
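
The probe idea from the third bullet needs nothing more than a counter. The snippet below replays the answers already recorded in the comparison table above; live API calls are omitted because each vendor’s client differs, but the repetition‑rate calculation is the same either way.

```python
from collections import Counter

# Answers recorded in the comparison table above (one probe prompt, six models).
answers = {
    "Gemini 2.0 Flash": "Luna",
    "Mistral Large": "Lunaropolis",
    "GPT-4.1": "Lunaris",
    "Claude 3.7 Sonnet": "Lunopolis",
    "deepseek-chat": "Lunara",
    "o4-mini": "Lunaris",
}

# Repetition rate: share of models whose answer shares the dominant stem.
stem = "Lun"
hits = sum(1 for a in answers.values() if a.startswith(stem))
print(f"{hits}/{len(answers)} answers share the '{stem}-' stem ({hits / len(answers):.0%})")

# Exact-answer overlap is informative too: identical strings from different
# vendors are strong evidence of shared source text.
print(Counter(answers.values()).most_common())
```

Run the same tally over dozens of whimsical prompts and the per‑prompt repetition rate becomes a cheap, black‑box estimate of how much source material the models share.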

Practical Takeaways for Builders & Researchers

  1. Dataset diversity matters. Mixing extra‑domain corpora (technical papers, non‑fiction, creative commons poetry) reduces “Luna‑*” dominance.
  2. Control the sampler. Higher temperature or nucleus sampling (e.g., top_p = 0.9) can overcome mild frequency skews.
  3. Explicit negative cues work. “Give me a lunar capital name not beginning with Luna‑” steers the model away from its corpus priors; the sketch after this list combines this cue with the sampler settings from the previous tip.
  4. Corpus probes are lightweight audits. You don’t need direct dataset access—just craft systematic prompts and log the repetition rate.
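
Takeaways 2 and 3 fit into a single API call. The sketch below assumes the openai Python SDK and the GPT‑4.1 model from the table; the temperature and top_p values are illustrative starting points rather than recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4.1",        # any chat model from the comparison table would do
    temperature=1.1,        # higher temperature flattens a mild frequency skew
    top_p=0.9,              # nucleus sampling keeps only the plausible tail
    messages=[
        {
            "role": "user",
            # Explicit negative cue steers the model away from its corpus prior.
            "content": "Coin a one-word name for a lunar capital that does not begin with 'Luna-'.",
        }
    ],
)

print(response.choices[0].message.content)
```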
📊 Fun exercise: Prompt five different models with “Name a new Martian capital.” Notice how many answers converge on “Areopolis” or “Ares City.” Same corpus overlap, different planet!

Bottom Line

“Lunapolis” isn’t just a catchy sci‑fi portmanteau; it’s a tracer dye illuminating how modern LLMs share—and are subtly steered by—the same vast but overlapping training corpora. Until we broaden and diversify those corpora (or learn to steer sampling more aggressively), the Moon’s capital will keep echoing the same two syllables.


Image & Dataset Credits

  • NASA Image Library – Public‑domain lunar photos.
  • Unsplash – Space architecture renders by Spacejoy.
  • Common Crawl & Books3 metadata – Used here for frequency analysis references.

About the Author

Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.

Related Articles

Understanding Token Usage Across Different LLMs

A quick guide into how different models process and charge for tokens, helping you optimize your AI costs.

April 21, 2025 · 2 min read

Why Even Advanced LLMs Get '9.9 vs 9.11' Wrong

Exploring why large language models like GPT-4, Claude, Mistral, and Gemini still stumble on basic decimal comparisons.

April 21, 2025 · 3 min read