potatoicecoffee 2 hours ago

Markov chains are used for my favourite financial algorithm: the allocation of overhead costs in cost accounting. I wish there was an easy way to visualise a model with 500 nodes
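
Not from any linked project, but a rough sketch of one way to look at a model that size: at 500 nodes a node-and-edge drawing turns to spaghetti, while a heatmap of the transition matrix usually stays readable. This assumes numpy and matplotlib, and the random sparse matrix is just a stand-in for the real allocation model.

    import numpy as np
    import matplotlib.pyplot as plt

    n = 500
    rng = np.random.default_rng(0)
    # Placeholder: a random sparse matrix standing in for the real
    # overhead-allocation model, normalised so each row sums to 1.
    P = rng.random((n, n)) * (rng.random((n, n)) < 0.02)
    P = P / P.sum(axis=1, keepdims=True).clip(min=1e-12)

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.matshow(P, cmap="viridis")   # heatmap of transition probabilities
    ax.set_xlabel("to state")
    ax.set_ylabel("from state")
    plt.show()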

brcmthrowaway 10 hours ago

What is the secret sauce that makes LLM better than a Markov chain?

  • AnotherGoodName 7 hours ago

    A Markov chain is a one-dimensional chain of immediately preceding states, and it bases its prediction on that very specific linear chain of states.

    A good way to immediately grasp the flaws of a Markov chain is to imagine it predicting a pixel in a 2D image.

    Sure, in a 1D chain of states that goes something like [red green blue] repeatedly, a Markov chain is a great predictor. But imagine a 2D image where the pattern is purely in the vertical elements while you're passing in pixels left to right, row by row. The chain will try to make a prediction based on the recent horizontal elements, and there's no good way to give it the context of the pixel above. A Markov chain doesn't mix two states; its context is the immediate previous chain of states. It's really hard for a Markov chain to take the pixel above into account if you're feeding it pixels left to right (the same is true for words, but the pixel analogy is easier to grasp).

    Now you may think a good solution to this is to have a Markov chain have some mechanism to take multiple previous states from multiple contexts (in this case vertical and horizontal) and somehow weight each one to get the next state. Go down this path and you essentially go down the path of early neural networks.
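
    A minimal sketch of that failure mode (a toy example, assuming numpy): every column of the image gets its own random colour, so "the pixel above" predicts perfectly, while a first-order Markov chain over the row-major pixel stream sees essentially random transitions.

        import numpy as np

        rng = np.random.default_rng(0)
        H, W, K = 64, 64, 3                           # image size, number of colours
        img = np.tile(rng.integers(0, K, W), (H, 1))  # pattern is purely vertical
        seq = img.ravel()                             # feed pixels left to right, row by row

        # First-order Markov chain: count transitions between consecutive pixels.
        counts = np.zeros((K, K))
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
        pred = counts.argmax(axis=1)                  # most likely next colour per state

        markov_acc = np.mean(pred[seq[:-1]] == seq[1:])
        above_acc = np.mean(img[1:] == img[:-1])      # "pixel above" predictor
        print(f"previous-pixel Markov chain: {markov_acc:.2f}, pixel above: {above_acc:.2f}")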

    • vrighter 6 hours ago

      the way I see it, an LLM *is* a Markov chain. The only difference is a very long context and a lossily compressed state transition "table".

  • crystal_revenge 10 hours ago

    LLMs share more with a Markov chain than many would like to admit, but the fundamental improvement is that the model learns the state representation, and that latent representation is essentially a lossy compression of nearly all written human text.

    People really underestimate both the magnitude of the parameter space for learning this and the scale of the training data.

    But, at the end of the day, the model is still just sampling one token at a time and updating its state (and of course, that state is conditioned on all previous tokens, so that's another point of departure).
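
    That sampling loop is small enough to sketch; here next_token_probs is a made-up stand-in for the learned model, not any real API:

        import random

        VOCAB = ["the", "cat", "sat", "on", "mat", "."]

        def next_token_probs(context):
            # Stand-in for the trained model: a real LLM computes this with a
            # deep network conditioned on the entire context; here it's uniform.
            return [1.0 / len(VOCAB)] * len(VOCAB)

        context = ["the", "cat"]
        for _ in range(5):
            probs = next_token_probs(context)              # state = all previous tokens
            token = random.choices(VOCAB, weights=probs)[0]
            context.append(token)                          # update the state
        print(" ".join(context))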

  • jampekka an hour ago

    Strictly speaking, (non-recurrent) LLMs are Markov chains.

    Per e.g. Wikipedia: "In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event."
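
    One way to make that precise (a gloss, not from the quote): for a model with a fixed context window of k tokens, take the state to be the window itself, s_t = (x_{t-k+1}, ..., x_t). Then P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | s_t): the next-token distribution depends only on the current state, which is exactly the Markov property, just over an astronomically large state space.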

  • wodenokoto 5 hours ago

    In the standard next-token Markov model, you just have one node for every token and one probability for every edge. E.g., "if the last letter is 'B', the most likely next letter is 'C'".

    To make a big language model in Markov space, you need very large n-grams: "if the last text is 'th', the next letter is 'e'".

    These get incredibly sparse, very quickly. Most sentences are novel in your dataset.

    Neural networks (and other models too) get around this by placing tokens in multidimensional space, instead of having individual nodes.
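
    A rough illustration of that sparsity (a toy example in plain Python): count how many distinct character n-grams in a small corpus occur only once as n grows.

        from collections import Counter

        # Tiny stand-in corpus; with real text the effect is the same, only larger.
        text = ("the cat sat on the mat and the dog sat on the log "
                "while the cat watched the dog and the dog watched the cat")

        for n in (1, 2, 3, 5, 8):
            grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
            singletons = sum(1 for c in grams.values() if c == 1)
            print(f"n={n}: {len(grams)} distinct character n-grams, "
                  f"{singletons} seen only once")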

  • aGHz 10 hours ago

    Attention. It's really all you need [1]

    1: https://arxiv.org/abs/1706.03762
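
    The core operation from that paper is small enough to sketch in numpy (single head, no projections, no masking; just an illustration, not the full architecture):

        import numpy as np

        def attention(Q, K, V):
            # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
            d = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            return weights @ V

        rng = np.random.default_rng(0)
        x = rng.normal(size=(6, 16))     # 6 tokens, 16-dimensional embeddings
        print(attention(x, x, x).shape)  # (6, 16): each token mixes info from every other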

    • hackernewds 6 hours ago

      the attention Mafia. only one is a billionaire now though. the rewards have been reaped by the adopters instead of the inventors.

  • Nevermark 9 hours ago

    Even a dead simple two-layer neural network is a universal approximator, meaning it can model any relationship with as much accuracy as you want, given enough neurons in the first layer and subject to available resources for training time and model size (see the sketch at the end of this comment).

    Specific deep learning architectures and training variants reflect the problems they are solving, speeding up training, and reducing model sizes considerably. Both keep getting better, so efficiency improvements are not likely to plateau anytime soon.

    They readily handle both statistically driven and pattern driven mappings in the data. Most modelling systems tend to be better at, or can only adapt to, one of those dimensions or the other.

    Learning patterns means they don't just go input->prediction. They learn to recognize and apply relationships in the middle. So when people say they are "just" predictors, that tends to be misleading. They end up predicting, but they will do whatever they need to in between in terms of processing information, in order to predict.

    They can learn both hard and soft pattern boundaries. Discrete and continuous relationships. And static and sequential patterns.

    They can be trained directly on example outputs, or indirectly via reinforcement (learning what kind of outputs will get judged well and generating those). Those are only two of many flexible training schemes.

    All those benefits and more make them exceptionally general, flexible, powerful and efficient tools relative to other classes of universal approximators, for large growing areas of data defined problems.
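
    A rough sketch of the universal-approximation point (an illustrative example, assuming numpy): a two-layer network with a wide random first layer, where only the output weights are fit by least squares, already approximates a nonlinear target like sin reasonably well. Real training also learns the first layer, but the "enough neurons in the first layer" intuition shows through even in this simplified form.

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.linspace(-3, 3, 400).reshape(-1, 1)
        y = np.sin(2 * x)                              # target relationship

        n_hidden = 200                                 # "enough neurons in the first layer"
        W1 = rng.normal(size=(1, n_hidden))
        b1 = rng.normal(size=n_hidden)
        H = np.tanh(x @ W1 + b1)                       # first layer, fixed at random
        W2, *_ = np.linalg.lstsq(H, y, rcond=None)     # fit only the output layer
        print("max abs error:", np.abs(H @ W2 - y).max())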

  • devmor 7 hours ago

    Imagine the nodes in those visualizations are not just a static state, but each have metadata attached that is used to infer the next state in the "chain", keyed on a value assigned from a mysterious lookup table based on the current state. So each time the state shifts, the metadata on all states can also shift.

    (There are also types of LLMs where access to that metadata is limited; in one such type, the current state can only check metadata on previous states and weigh it against the base values of the next states in the chain.)

    Then, imagine each chain of states in a Markov chain is a 2D hash map, like a grid plot. Our current LLMs are like an Nth-dimensional hash map instead, and can have a finite but extremely large depth. This is pretty near impossible to visualize as a human, but if you're familiar with array/map operations, you should get the idea.

    This is a very "base level" understanding, as my learning on LLMs stopped around the time Tensorflow stopped being the new hotness, but hopefully that gives you an idea.

  • CamperBob2 9 hours ago

    A Markov chain can't tell you the difference, and an LLM can.

linwangg 8 hours ago

Such an awesome project! The visualization is amazing!