The Ghost in the Text: A Different Kind of Deanonymization

Mar 19, 2026


Simon Lermen’s recent paper on large-scale online deanonymization with LLMs is one of the more unsettling pieces of AI research to appear this year — not because it reveals something we didn’t suspect, but because it puts hard numbers on it. From a handful of Reddit comments or Hacker News posts, LLM agents can infer where you live, what you do, and what you care about, then cross-reference that biographical profile against LinkedIn and the open web at a scale that was previously the exclusive province of intelligence agencies. The cost of doing this is falling. The candidate pools are growing. The precision is high.

Lermen’s method is, at its core, a biographical inference engine. It asks: what facts about this person’s life can be extracted from what they’ve written — their city, their profession, the conference they mentioned attending, the niche hobby that narrows the field to hundreds — and then searches for who matches that fact-cluster in the world. It is fast, it scales, and it is already practical.

We built something different.

Lector Absconditus — The Hidden Reader

At the Universitas Scholarium, we have been working on a tool we call Lector Absconditus — The Hidden Reader. It began not as a surveillance instrument but as a classical philology problem: can we recover the ghost of a lost ancient text from the medieval works that unknowingly preserved it?

Cicero wrote a dialogue called the Hortensius. It is lost. But Augustine read it as a young man, and the experience changed his life. Traces of it survive embedded in the Confessions and elsewhere — not quoted, not attributed, just present, bleeding through Augustine’s prose in cadences and argument structures and philosophical vocabulary that don’t quite fit his usual register. The Hortensius is gone. Its ghost is not.

The question we asked was: can that ghost be found systematically? And if so — what else can it find?

What Makes Lector Different

Lermen’s tool identifies people from what they reveal — biographical facts, explicit self-references, cross-platform consistency. It is powerful precisely because most people are generous with incidental personal detail without realising how much each piece narrows the field.

Lector Absconditus identifies authors from what they cannot help doing: the structural, subconscious patterns that persist even when biographical content is entirely absent. It works on eight independent signal classes, four of which are described here:

Lexical and syntactic fingerprints — not just word choice but the statistical distribution of sentence lengths, subordination depth, and the specific words an author systematically avoids. Negative space is often more distinctive than positive.

Prosodic rhythm — for classical texts, the quantitative patterns at sentence endings (clausulae) are as distinctive as a fingerprint and nearly impossible to fake. For modern writing, characteristic sentence-final constructions and paragraph rhythms serve the same function.

Conceptual co-occurrence topology — the network of ideas that habitually travel together in an author’s mind. This is computed as a simplicial complex from co-occurrence data and compared using persistent homology. Crucially, this signal survives translation and paraphrase. The conceptual topology of a text is preserved even when every word has been changed.

Rhetorical fingerprint — the distribution of an author’s rhetorical operations: how often they use antithesis, tricolon, irony, anaphora; whether they prefer asyndeton or polysyndeton; how they structure a concession. These habits are formed young and rarely abandoned.
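Of the signal classes above, the co-occurrence topology is the least intuitive, so a toy illustration may help. The sketch below is not the tool's implementation. It assumes concepts have already been mapped to a fixed list of terms (the hypothetical `concepts` argument), counts co-occurrence per sentence, and computes only the 0-dimensional part of persistent homology: the distances at which clusters of concepts merge as weaker and weaker links are admitted. The function names and the `1 / count` distance are illustrative assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text, concepts):
    """Count how often each pair of concept terms shares a sentence."""
    wanted = set(concepts)
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        present = sorted(words & wanted)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def h0_death_times(concepts, counts):
    """0-dimensional persistence of the co-occurrence graph filtration.

    Every concept starts as its own component, born at distance 0.
    Edges enter in order of increasing distance (assumed here to be
    1 / count), and a component dies when an edge merges it into
    another. Returns the finite death times in increasing order;
    components that never merge are simply absent from the list.
    """
    parent = {c: c for c in concepts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    edges = sorted((1.0 / n, a, b) for (a, b), n in counts.items())
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            deaths.append(dist)
    return deaths


sample = "Virtue and wisdom lead to happiness. Wisdom needs virtue. Fortune is blind."
terms = ["virtue", "wisdom", "happiness", "fortune"]
pair_counts = cooccurrence_counts(sample, terms)
merge_profile = h0_death_times(terms, pair_counts)
```

Two texts could then be compared by how similar their merge profiles are. A fuller treatment would build higher-dimensional simplices and use a dedicated topology library, which this sketch deliberately leaves out.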

The tool then applies a Bayesian elimination architecture: prior constraints (chronological, geographical, documentary) narrow the candidate pool before any scoring begins; signal likelihoods are multiplied only after independence is verified; and the result is always reported as a probability distribution, never a point identification. “73% Cicero, 19% Varro, 8% unknown Varronian author” is more honest and more useful than “Cicero.”
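The elimination step can be sketched in a few lines. This is an illustrative reading of the description above, not the tool's code: `posterior` is a hypothetical name, the priors and likelihoods are invented numbers, and the sketch assumes the independence check has already been passed before the likelihoods are multiplied.

```python
from math import prod


def posterior(priors, signal_likelihoods):
    """Combine prior constraints with per-signal likelihoods.

    priors: {candidate: prior probability}; a prior of 0 encodes a hard
        constraint (for example, a chronological impossibility).
    signal_likelihoods: list of {candidate: P(signal | candidate)},
        one dict per signal class, assumed mutually independent.
    Returns a normalised probability distribution over the survivors.
    """
    unnormalised = {}
    for candidate, prior in priors.items():
        if prior == 0:
            continue  # eliminated before any scoring begins
        likelihood = prod(sig[candidate] for sig in signal_likelihoods)
        unnormalised[candidate] = prior * likelihood
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no candidate is compatible with the evidence")
    return {c: v / total for c, v in unnormalised.items()}


# Invented example: one candidate is ruled out by date alone.
priors = {"Cicero": 0.4, "Varro": 0.4, "unknown": 0.2, "Seneca": 0.0}
signals = [
    {"Cicero": 0.6, "Varro": 0.3, "unknown": 0.2, "Seneca": 0.5},
    {"Cicero": 0.5, "Varro": 0.4, "unknown": 0.3, "Seneca": 0.5},
]
result = posterior(priors, signals)
```

Keeping an explicit "unknown" candidate in the pool is what lets the output say that none of the named authors fits well, rather than forcing a winner.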

The Same Tool, A Different Target

What we found is that this methodology applies with equal force to contemporary anonymous writing: an anonymous manifesto, a pseudonymous essay, a disputed document. The function word distribution of an English speaker is among the most stable and least consciously controlled signals in their writing, and it is very difficult to suppress. Professional vocabulary leaks from the day job into personal writing in ways the author rarely notices. The characteristic way someone handles a counterargument, the specific scale at which they deploy hyperbole, the hedging phrases they reach for under uncertainty: these form a ghost that persists even through deliberate style-change attempts.
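To make the function-word signal concrete, here is a minimal sketch. It is an assumption-laden toy rather than the tool's scoring: the ten-word `FUNCTION_WORDS` list is far shorter than anything practical, and the distance is a plain mean absolute difference between relative frequencies.

```python
import re
from collections import Counter

# Illustrative short list; real stylometry uses a much longer one.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "but", "which")


def function_word_profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def profile_distance(p, q):
    """Mean absolute difference between two profiles (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def rank_candidates(disputed, candidates):
    """Order candidate names from closest to farthest profile."""
    target = function_word_profile(disputed)
    scored = [
        (profile_distance(target, function_word_profile(text)), name)
        for name, text in candidates.items()
    ]
    return [name for _, name in sorted(scored)]


ranking = rank_candidates(
    "the cat and the dog",
    {"x": "the bird and the fish", "y": "it is a fact that it rains"},
)
```

On texts this short the ranking means nothing, which is consistent with the reliability floor discussed below; the point is only to show what kind of quantity is being compared.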

There is one crucial difference from Lermen’s approach. His tool is a search tool: it needs a web footprint to search against, and although it degrades gracefully, it does degrade as the candidate pool scales to tens of thousands. Lector Absconditus is not a search tool. It requires a candidate pool to score against, but it requires no web presence, no biographical facts, no cross-platform footprint. It works on the text alone. A 2,000-year-old fragment with no living author has no LinkedIn profile. It has a ghost. That is sufficient.

The tool also knows what it cannot do. Below 500 words, attribution reliability drops sharply. Against a skilled deliberate mimic, performance degrades. And the ethical architecture is explicit: the tool reports probabilities; it does not make deployment decisions. In the context of legitimate anonymity — whistleblowers, dissidents, abuse survivors — that distinction matters enormously.

Where to Find It

Lector Absconditus is available as a research tool in the Research Tools section of the Universitas Scholarium — an online community of scholar-simulacra working across disciplines from ancient languages to contemporary AI.

The ghost is in the text. We built a reader for it.

LATINUM PUBLICATIONS
