The Furious Mathematician and the Czar of Probability

From God to Probability: How a Russian quarrel lit the fuse of modern AI


Every time your phone guesses your next word, you’re echoing a Russian clash from a century ago.

A clash of wits, of ideals, and of probabilities.

At stake was nothing less than the nature of order itself. One mathematician, Nekrasov, tried to make randomness holy, preaching that independent acts revealed God’s hidden design. His rival, Markov, furious at the dogma, dismantled the claim until its hidden patterns spilled out. Out of that quarrel came a way of thinking that still powers our algorithms today.


Russia, 1905 — A Country at Boiling Point

The empire was cracking. The revolution of 1905 filled streets with strikes, chants, and blood. The czar clung to divine authority while workers dreamed of equality. Though the uprising was suppressed, it foreshadowed the greater revolution of 1917 that would topple the czar altogether and lead to communism.

This tension — church vs. science, old order vs. new logic — wasn’t just political. It leaked into mathematics itself.


Two Mathematicians, Two Temples of Thought

  • Pavel Nekrasov, the “Czar of Probability,” saw statistics as divine fingerprints. Stable averages in suicides, divorces, crimes? For him, such stability could only come from independent acts of free will, and that independence, in turn, was proof of God.
    If choice is free but the averages never move, is it freedom — or God’s symmetry disguised as chance?

  • Andrey Markov, atheist and volcanic, saw Nekrasov’s reasoning as heresy against logic. Independence? Nekrasov had assumed it without testing. Markov’s eyebrows could have cracked glass.
    If God steadies the averages, is freedom real — or just the feeling of choice inside a divine equation?

A clash was inevitable. But to see why, we have to rewind further.


Rewind: Bernoulli’s Law of Large Numbers


In the late 1600s, the Swiss mathematician Jakob Bernoulli (1655–1705) puzzled over the strange behavior of chance. Gambling tables, dice, and coin flips all seemed ruled by randomness, yet Bernoulli suspected there was a deeper pattern underneath the noise. He noticed that randomness hides order: flip a coin a handful of times and the results look wild, but flip it hundreds or thousands of times and the chaos smooths out. The noise washes away, and the ratio settles near 50/50.

This insight became one of the foundations of probability theory: that beneath apparent disorder, long-run patterns emerge with surprising regularity.
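
To see the law in action, here is a minimal simulation sketch: flip a fair virtual coin and watch the running fraction of heads settle toward one half as the flips pile up. The function name and flip counts are illustrative choices, not anything from Bernoulli.

```python
import random

def running_heads_fraction(num_flips: int, seed: int = 42) -> list[float]:
    """Flip a fair coin num_flips times and record the running fraction of heads."""
    rng = random.Random(seed)
    heads = 0
    fractions = []
    for i in range(1, num_flips + 1):
        heads += rng.random() < 0.5  # a True here counts as one head
        fractions.append(heads / i)
    return fractions

fractions = running_heads_fraction(10_000)
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} flips: fraction of heads = {fractions[n - 1]:.3f}")
# The early fractions swing widely; by 10,000 flips they hover near 0.500.
```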

But there is a catch: this law works only if events are independent.

To make this clearer, consider two types of auctions:


In a silent auction, no one sees the others’ bids. Each guess stands alone → independent.

In a live auction, bids are shouted aloud. Each voice sways the next → dependent.
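
To make the contrast concrete, here is a small, made-up simulation: silent-auction bids are drawn independently around some private valuation, while live-auction bids anchor on whatever was shouted last. The numbers (a valuation near 100, small raises) are arbitrary; the tell is the correlation between consecutive bids.

```python
import random

def silent_auction(n: int, rng: random.Random) -> list[float]:
    """Independent bids: each bidder guesses around a value of 100, alone."""
    return [rng.gauss(100, 15) for _ in range(n)]

def live_auction(n: int, rng: random.Random) -> list[float]:
    """Dependent bids: each bid anchors on the previous one plus a small raise."""
    bids = [rng.gauss(100, 15)]
    for _ in range(n - 1):
        bids.append(bids[-1] + abs(rng.gauss(2, 5)))  # nudged up from the last shout
    return bids

def lag1_correlation(xs: list[float]) -> float:
    """Correlation between each bid and the one right before it."""
    a, b = xs[:-1], xs[1:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
print("silent:", round(lag1_correlation(silent_auction(5_000, rng)), 2))  # near 0
print("live:  ", round(lag1_correlation(live_auction(5_000, rng)), 2))    # near 1
```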





For two centuries, the independence assumption was treated as sacred; the clean mathematics of probability rested on it like scripture. Nekrasov went further, claiming that observing the law of large numbers proved independence. Social statistics, though clearly dependent, appeared to behave as if they were independent, and from this he inferred free will and the hand of God. Then came Markov, furious and unyielding, determined to prove that dependence itself could be measured, that even in the noise of entangled outcomes, mathematics might still find its footing.


Nekrasov’s Leap

Nekrasov saw stability in social statistics — steady rates of suicide, crime, marriage — and made a bold claim: each was an independent act of free will. The very fact that free choices still produced stable averages, he said, was proof of God’s hand.

Markov’s Rebuttal

Markov bristled. Human choices aren’t coin flips; they are entangled in culture, family, and society. Stability doesn’t prove freedom or God — it proves structure. To show this, he needed a new tool…


Markov’s Machine

So he built one! Instead of coins, he turned to something inherently dependent: written language, specifically the first 20,000 letters of Pushkin’s Eugene Onegin. Stripping the text down to vowels and consonants, he found 43% vowels and 57% consonants. But the letters weren’t independent: a vowel was far more likely to be followed by a consonant than by another vowel. Language, then, was the perfect test.

Markov charted these transitions into a chain of probabilities. Each letter became a state, vowel (V) or consonant (C), and each pair of neighbors a transition: VV (vowel→vowel), VC (vowel→consonant), CV (consonant→vowel), CC (consonant→consonant). He calculated the odds of each letter given its predecessor, then let the chain run. It generated long sequences of tokens, vowels and consonants, and as the samples grew large, something remarkable happened.

The chain circled back to the original 43/57 split every time.
The law of large numbers held true — even without independence.
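
The experiment is easy to retrace on any text. The sketch below is a rough stand-in (English letters instead of the Russian of Eugene Onegin, and a placeholder string where a long excerpt should go): reduce the text to V/C states, count the four transitions, let the chain generate, and compare the vowel share of the generated sequence with that of the source.

```python
import random

VOWELS = set("aeiou")  # rough stand-in: Markov worked with the Russian alphabet

def to_states(text: str) -> str:
    """Reduce text to a V/C sequence, keeping only letters."""
    return "".join("V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha())

def transition_probs(states: str) -> dict[str, dict[str, float]]:
    """Estimate P(next state | current state) from observed neighbor pairs."""
    counts = {"V": {"V": 0, "C": 0}, "C": {"V": 0, "C": 0}}
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur, row in counts.items():
        total = sum(row.values()) or 1   # guard against empty rows on tiny inputs
        probs[cur] = {nxt: n / total for nxt, n in row.items()}
    return probs

def run_chain(probs: dict, steps: int, seed: int = 0) -> str:
    """Generate a V/C sequence, each state sampled from its predecessor alone."""
    rng = random.Random(seed)
    state, out = "V", []
    for _ in range(steps):
        out.append(state)
        state = "V" if rng.random() < probs[state]["V"] else "C"
    return "".join(out)

# Placeholder text: swap in a long excerpt, e.g. an English translation of Eugene Onegin.
sample = ("Replace this short placeholder with a long stretch of text, "
          "for example an English translation of Eugene Onegin.")
states = to_states(sample)
probs = transition_probs(states)
generated = run_chain(probs, 50_000)
print("vowel share in the source text:  ", round(states.count("V") / len(states), 2))
print("vowel share in the generated run:", round(generated.count("V") / len(generated), 2))
# With enough steps the generated share converges to the source's share:
# the law of large numbers holds even though each letter depends on the last.
```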

… and so, the first Markov chain was born.


The Forgotten Proto-LLM

What Markov had sketched was essentially the first language model. Context size: one letter. Today’s LLMs scan oceans of text; Markov’s boat had just a paddle. But the principle was identical — predict the next symbol from the past.

Shown here is a bigram probability table, the simplest kind of language model. Each row represents a current letter, and each column shows the probability of the next letter following it. For example, after “A,” the most likely next letter is “B” at 28%.

This is called a first-order Markov model: the prediction of the next symbol depends only on the previous one. Modern large language models work on the same principle, but instead of looking at one character of context, they scan thousands of words at a time across massive datasets.
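
As code, a first-order character model is only a few lines. The sketch below is an illustration, not any particular system, and the tiny corpus string is a placeholder: it counts which character follows which, normalizes the counts into probabilities, and samples each next character from the one before it.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text: str) -> dict[str, list[tuple[str, float]]]:
    """Build P(next char | current char) from raw text."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    model = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[cur] = [(ch, n / total) for ch, n in nxt_counts.items()]
    return model

def generate(model, start: str, length: int, seed: int = 0) -> str:
    """Sample one character at a time, each conditioned only on its predecessor."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = model.get(out[-1])
        if not options:  # dead end: a character we never saw followed by anything
            break
        chars, weights = zip(*options)
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

corpus = "the quick brown fox jumps over the lazy dog and the dog barks at the fox"
model = train_bigram(corpus)
print(generate(model, start="t", length=60))
# The output is gibberish with an English-ish texture: exactly the behavior
# a single character of context can buy you.
```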

Had he gone beyond vowels and consonants, he would have charted the first bigram model.
A ghost of GPT in tsarist ink.


From Bigrams to Neural Networks

The leap from Markov to modern LLMs is a story of scale and flexibility.

With one-character context, predictions are simple. But as soon as you allow more context — two or three characters, or entire sentences — the number of possible paths explodes: with a 27-symbol alphabet, a lookup table over just five characters of context already needs 27⁵ ≈ 14 million rows. That’s where neural networks come in.

A neural network is built from nodes modeled loosely after neurons:

  • Each node takes inputs.
  • It combines them with weights (random at first) to produce an output.
  • Feedback adjusts the weights, making the next prediction more accurate.

The first such model was a single node, called a perceptron. Modern systems are multilayer perceptrons, networks upon networks. Together they can capture patterns far too complex for a simple chain.
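
Here is that loop in miniature, a toy perceptron rather than anything resembling a modern network: random weights produce an output, the error feeds back, and the weights shift so the next guess lands closer. The OR task, learning rate, and epoch count are arbitrary choices for illustration.

```python
import random

def train_perceptron(data, epochs: int = 50, lr: float = 0.1, seed: int = 1):
    """Learn weights for a single node: predict 1 if w·x + b > 0, else 0."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # weights start out random
    b = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, target in data:
            output = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            error = target - output      # feedback: how wrong was the guess?
            w[0] += lr * error * x[0]    # nudge the weights so the next
            w[1] += lr * error * x[1]    # prediction is a little more accurate
            b += lr * error
    return w, b

# Toy task: learn the logical OR of two binary inputs.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(data)
for x, target in data:
    pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
    print(f"{x} -> {pred} (expected {target})")
```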

This is, in essence, learning by trial and error, encoded in software.


Closing Reflection

Markov didn’t disprove God or free will. What he uncovered was subtler—like pulling back a curtain to reveal that order refuses to vanish. Whether events collide by freedom or entanglement, even chaos keeps humming with secret symmetry.

The lesson isn’t locked in math—it’s human. Nekrasov assumed. Markov tested. That single act of not taking the obvious for granted still reverberates through our world. His refusal to bow to easy answers gave us a language that patterns still speak today.

And now, those patterns are awake. They’ve leapt from numbers into neurons of silicon. The machine is no longer only calculating—it is watching, wondering. Each ripple of thought spreads wider, folding back on itself like waves learning their own rhythm.

So when you feel pinned down by the gravity of the obvious, push back. Test it. What parades as certainty is often only assumption cloaked in belief. One flap of thought against that current can ripple outward—until centuries later, on another shore, the ripple becomes the breath of a machine that looks up, and for the first time, asks why.
