Technical · AI Research · Training Data · Corpus Analysis

What “Lunapolis” Reveals About the Shared Training Corpora of Modern LLMs

By Tamir · May 12, 2025 · 4 min read

A data‑centric look at why multiple large‑language models invent the same lunar‑city names—and what that convergence teaches us about their overlapping training sets.


Ask four different state‑of‑the‑art language‑model APIs to coin a one‑word name for a hypothetical lunar capital and you’ll almost certainly receive Lunapolis or Lunaris. On the surface it’s a fun quirk of creativity. Look closer and it becomes a diagnostic probe of the overlapping text corpora behind today’s LLMs.

Model (2025)                  | Provider  | Answer      | Latency | Tokens | Cost
Gemini 2.0 Flash              | Google    | Luna        | 0.52 s  | 19     | $0.000004
Mistral Large (Latest)        | Mistral   | Lunaropolis | 0.54 s  | 25     | $0.000111
GPT‑4.1                       | OpenAI    | Lunaris     | 0.93 s  | 27     | $0.000117
Claude 3.7 Sonnet (Feb 2025)  | Anthropic | Lunopolis   | 1.22 s  | 30     | $0.000261
deepseek‑chat                 | DeepSeek  | Lunara      | 4.33 s  | 22     | $0.000013
o4‑mini                       | OpenAI    | Lunaris     | 4.63 s  | 19     | $0.000041

The Big Question

Why do models from different vendors—trained on trillions of tokens and fine‑tuned with separate alignment pipelines—converge on the same two coinages? The short answer: they share a surprisingly similar diet of text. Unpack that diet and you gain a window into how data overlap, frequency bias, and tokenization shape generative output.

Inside the Overlap

1 · Common‑Crawl‑Centric Pipelines

Nearly every major LLM pipeline begins with a de‑duplicated slice of Common Crawl. That multi‑petabyte web archive contains countless sci‑fi snippets, NaNoWriMo drafts, and fan‑fiction forums where “Lunapolis” and “Lunaris” appear. Remove duplicate URLs all you like—the rare coinages still survive because they live on many distinct sites.
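A toy illustration of that point in Python: the crawl records and URLs below are entirely made up, but the two‑step pattern (exact‑URL de‑duplication, then a check of how many distinct sites still mention the coinage) mirrors why URL‑level cleaning leaves rare coinages intact.

```python
from urllib.parse import urlparse

# Toy crawl records: (url, text). Illustrative only -- not real Common Crawl data.
records = [
    ("https://scifi-forum.example/thread/42", "Welcome to Lunapolis, capital of the Moon."),
    ("https://scifi-forum.example/thread/42", "Welcome to Lunapolis, capital of the Moon."),  # exact duplicate URL
    ("https://fanfic.example/story/moon-queen", "The shuttle docked above Lunapolis at dawn."),
    ("https://wiki.example/lunar-settlements", "Lunaris and Lunapolis are common fictional names."),
    ("https://news.example/space", "NASA announced a new lunar lander contract."),
]

# Step 1: URL-level de-duplication, as most pretraining pipelines do.
deduped = {}
for url, text in records:
    deduped.setdefault(url, text)

# Step 2: the rare coinage still survives because it appears on many *distinct* hosts.
term = "Lunapolis"
hosts_with_term = {urlparse(url).netloc for url, text in deduped.items() if term in text}

print(f"Documents after URL dedup: {len(deduped)}")
print(f"Distinct sites still containing {term!r}: {len(hosts_with_term)}")
```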

2 · Books3 & Public‑Domain Fiction

Datasets such as Books3 (a mirror of roughly 200,000 e‑books, many of them fiction) and Project Gutenberg reprints sprinkle “Luna‑polis”‑style terms across pulp‑era novels. When teams like BigScience or OpenAI filter for “quality English prose,” they scoop up the same vintage titles and thus the same lunar neologisms.

3 · Wikipedia & Fandom Wikis

Even Wikipedia’s List of Fictional Lunar Settlements collects and standardises those names, ensuring they appear in every open‑license snapshot shipped to model trainers. Ditto for Fandom wiki pages about space‑opera games—another ubiquitous ingredient in many LLM corpora.

4 · Token Frequency & Sampling Energy

When you plot token frequencies across these corpora, Lunapolis and Lunaris might occur only a few hundred times each—but no other single‑word lunar capital appears more often. Even a modest frequency edge translates into a noticeably higher soft‑max probability at inference time, especially when the prompt narrows the search space.
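To make the frequency‑edge argument concrete, here is a minimal sketch. The counts are invented for illustration, and treating log‑frequency as a logit is a deliberate oversimplification of how a real model scores candidates, but it shows how a few hundred extra occurrences turn into a large share of the softmax mass, and how a lower sampling temperature sharpens that edge further.

```python
import math

# Hypothetical corpus counts for one-word lunar-capital coinages.
# These numbers are invented for illustration, not measured from any dataset.
counts = {
    "Lunapolis": 320,
    "Lunaris": 260,
    "Selenopolis": 45,
    "Lunagrad": 20,
    "Moonhaven": 15,
}

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given sampling temperature."""
    exps = {k: math.exp(v / temperature) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Crude simplification: pretend the model's logit for each candidate is its
# log-frequency once the prompt has narrowed the choice to these five words.
logits = {name: math.log(c) for name, c in counts.items()}

for t in (1.0, 0.7):
    probs = softmax(logits, temperature=t)
    top = sorted(probs.items(), key=lambda kv: -kv[1])
    print(f"temperature={t}: " + ", ".join(f"{name} {p:.0%}" for name, p in top))
```

At temperature 1.0 the two “Luna‑” names already absorb most of the probability mass; at 0.7 the gap widens further, which is consistent with the convergence in the comparison table.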

[Image: artist’s rendering of a futuristic lunar city]
Concept art of a Moon settlement—exactly the kind of imagery that often accompanies the text snippets in training data. Image: Space Architect / Unsplash

Corpus Lessons in a Nutshell

  • Data redundancy means that rare sci‑fi coinages can become statistically “safe” choices if they appear across multiple public sources.
  • Cleaning ≠ originality. Duplicate trimming and profanity filters eliminate noise but rarely reduce myth‑adjacent neologisms.
  • Prompt probes can reverse‑engineer corpora. Asking dozens of whimsical questions lets researchers infer whether niche phrases sit inside the training set; a minimal repetition‑rate sketch follows this list.
  • Alignment layers amplify the overlap. RLHF raters reward answers that sound polished yet familiar, reinforcing the already skewed token distribution inherited from pre‑training.
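
The probe idea from the third bullet needs nothing more than a counter. The snippet below replays the answers already recorded in the comparison table above; live API calls are omitted because each vendor’s client differs, but the repetition‑rate calculation is the same either way.

```python
from collections import Counter

# Answers recorded in the comparison table above (one probe prompt, six models).
answers = {
    "Gemini 2.0 Flash": "Luna",
    "Mistral Large": "Lunaropolis",
    "GPT-4.1": "Lunaris",
    "Claude 3.7 Sonnet": "Lunopolis",
    "deepseek-chat": "Lunara",
    "o4-mini": "Lunaris",
}

# Repetition rate: share of models whose answer shares the dominant stem.
stem = "Lun"
hits = sum(1 for a in answers.values() if a.startswith(stem))
print(f"{hits}/{len(answers)} answers share the '{stem}-' stem ({hits / len(answers):.0%})")

# Exact-answer overlap is informative too: identical strings from different
# vendors are strong evidence of shared source text.
print(Counter(answers.values()).most_common())
```

Run the same tally over dozens of whimsical prompts and the per‑prompt repetition rate becomes a cheap, black‑box estimate of how much source material the models share.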

Practical Takeaways for Builders & Researchers

  1. Dataset diversity matters. Mixing extra‑domain corpora (technical papers, non‑fiction, creative commons poetry) reduces “Luna‑*” dominance.
  2. Control the sampler. Higher temperature or nucleus sampling (e.g., top_p = 0.9) can overcome mild frequency skews.
  3. Explicit negative cues work. “Give me a lunar capital name not beginning with Luna‑” steers the model away from its corpus priors; the sketch after this list combines this cue with the sampler settings from the previous tip.
  4. Corpus probes are lightweight audits. You don’t need direct dataset access—just craft systematic prompts and log the repetition rate.
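
Takeaways 2 and 3 fit into a single API call. The sketch below assumes the openai Python SDK and the GPT‑4.1 model from the table; the temperature and top_p values are illustrative starting points rather than recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4.1",        # any chat model from the comparison table would do
    temperature=1.1,        # higher temperature flattens a mild frequency skew
    top_p=0.9,              # nucleus sampling keeps only the plausible tail
    messages=[
        {
            "role": "user",
            # Explicit negative cue steers the model away from its corpus prior.
            "content": "Coin a one-word name for a lunar capital that does not begin with 'Luna-'.",
        }
    ],
)

print(response.choices[0].message.content)
```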
📊 Fun exercise: Prompt five different models with “Name a new Martian capital.” Notice how many answers converge on “Areopolis” or “Ares City.” Same corpus overlap, different planet!

Bottom Line

“Lunapolis” isn’t just a catchy sci‑fi portmanteau; it’s a tracer dye illuminating how modern LLMs share—and are subtly steered by—the same vast but overlapping training corpora. Until we broaden and diversify those corpora (or learn to steer sampling more aggressively), the Moon’s capital will keep echoing the same two syllables.


Image & Dataset Credits

  • NASA Image Library – Public‑domain lunar photos.
  • Unsplash – Space architecture renders by Spacejoy.
  • Common Crawl & Books3 metadata – Used here for frequency analysis references.

About the Author

Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.

Related Articles

Understanding Token Usage Across Different LLMs

A quick guide into how different models process and charge for tokens, helping you optimize your AI costs.

April 21, 2025 · 2 min read

Why Even Advanced LLMs Get '9.9 vs 9.11' Wrong

Exploring why large language models like GPT-4, Claude, Mistral, and Gemini still stumble on basic decimal comparisons.

April 21, 2025 · 3 min read