potatoicecoffee 2 hours ago

Markov chains are used for my favourite financial algorithm: the allocation of overhead costs in cost accounting. I wish there was an easy way to visualise a model with 500 nodes
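
Not from any linked project, but a rough sketch of one way to look at a model that size: at 500 nodes a node-and-edge drawing turns to spaghetti, while a heatmap of the transition matrix usually stays readable. This assumes numpy and matplotlib, and the random sparse matrix is just a stand-in for the real allocation model.

    import numpy as np
    import matplotlib.pyplot as plt

    n = 500
    rng = np.random.default_rng(0)
    # Placeholder: a random sparse matrix standing in for the real
    # overhead-allocation model, normalised so each row sums to 1.
    P = rng.random((n, n)) * (rng.random((n, n)) < 0.02)
    P = P / P.sum(axis=1, keepdims=True).clip(min=1e-12)

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.matshow(P, cmap="viridis")   # heatmap of transition probabilities
    ax.set_xlabel("to state")
    ax.set_ylabel("from state")
    plt.show()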

brcmthrowaway 10 hours ago

What is the secret sauce that makes LLM better than a Markov chain?

  • AnotherGoodName 7 hours ago

    A Markov chain is a one-dimensional chain of immediately preceding states, and it bases its prediction on that very specific linear chain of states.

    A good way to immediately grasp the flaws of a Markov chain is to imagine it predicting a pixel in a 2D image.

    Sure, in a 1D chain of states that goes something like [red green blue] repeatedly, a Markov chain is a great predictor. But imagine a 2D image where the pattern is purely in the vertical elements while you're passing in pixels left to right, row by row. The chain will try to make a prediction based on the recent horizontal elements, and there's no good way to give it the context of the pixel above. A Markov chain doesn't mix two states; its context is the immediate previous chain of states. It's really hard for a Markov chain to take the pixel above into account if you're feeding it pixels left to right (the same is true for words, but the pixel analogy is easier to grasp).

    Now you may think a good solution to this is to have a Markov chain have some mechanism to take multiple previous states from multiple contexts (in this case vertical and horizontal) and somehow weight each one to get the next state. Go down this path and you essentially go down the path of early neural networks.
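
    A minimal sketch of that failure mode (a toy example, assuming numpy): every column of the image gets its own random colour, so "the pixel above" predicts perfectly, while a first-order Markov chain over the row-major pixel stream sees essentially random transitions.

        import numpy as np

        rng = np.random.default_rng(0)
        H, W, K = 64, 64, 3                           # image size, number of colours
        img = np.tile(rng.integers(0, K, W), (H, 1))  # pattern is purely vertical
        seq = img.ravel()                             # feed pixels left to right, row by row

        # First-order Markov chain: count transitions between consecutive pixels.
        counts = np.zeros((K, K))
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
        pred = counts.argmax(axis=1)                  # most likely next colour per state

        markov_acc = np.mean(pred[seq[:-1]] == seq[1:])
        above_acc = np.mean(img[1:] == img[:-1])      # "pixel above" predictor
        print(f"previous-pixel Markov chain: {markov_acc:.2f}, pixel above: {above_acc:.2f}")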

    • vrighter 6 hours ago

      the way I see it, an LLM *is* a Markov chain. The only difference is a very long context and a lossily compressed state transition "table".

  • crystal_revenge 10 hours ago

    LLMs share more with a Markov chain than many would like to admit, but the fundamental improvement is that the model learns the state representation, and that latent representation is essentially a lossy compression of nearly all written human text.

    People really underestimate both the magnitude of the parameter space for learning this and the scale of the training data.

    But, at the end of the day, the model is still just sampling one token at a time and updating its state (and of course, that state is conditioned on all previous tokens, so that's another point of departure).
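
    That sampling loop is small enough to sketch; here next_token_probs is a made-up stand-in for the learned model, not any real API:

        import random

        VOCAB = ["the", "cat", "sat", "on", "mat", "."]

        def next_token_probs(context):
            # Stand-in for the trained model: a real LLM computes this with a
            # deep network conditioned on the entire context; here it's uniform.
            return [1.0 / len(VOCAB)] * len(VOCAB)

        context = ["the", "cat"]
        for _ in range(5):
            probs = next_token_probs(context)              # state = all previous tokens
            token = random.choices(VOCAB, weights=probs)[0]
            context.append(token)                          # update the state
        print(" ".join(context))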

  • jampekka an hour ago

    Strictly speaking, (non-recurrent) LLMs are Markov chains.

    Per e.g. Wikipedia: "In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event."
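
    One way to make that precise (a gloss, not from the quote): for a model with a fixed context window of k tokens, take the state to be the window itself, s_t = (x_{t-k+1}, ..., x_t). Then P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | s_t): the next-token distribution depends only on the current state, which is exactly the Markov property, just over an astronomically large state space.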

  • wodenokoto 5 hours ago

    In the standard next-token Markov model, you just have one node for every token and one probability for every edge. E.g., "if the last letter is 'B', the most likely next letter is 'C'".

    To make a big language model in Markov space, you need very large n-grams: "if the last text is 'th', the next letter is 'e'".

    These get incredibly sparse, very quickly. Most sentences are novel in your dataset.

    Neural networks (and other models too) get around this by placing tokens in multidimensional space, instead of having individual nodes.
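
    A rough illustration of that sparsity (a toy example in plain Python): count how many distinct character n-grams in a small corpus occur only once as n grows.

        from collections import Counter

        # Tiny stand-in corpus; with real text the effect is the same, only larger.
        text = ("the cat sat on the mat and the dog sat on the log "
                "while the cat watched the dog and the dog watched the cat")

        for n in (1, 2, 3, 5, 8):
            grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
            singletons = sum(1 for c in grams.values() if c == 1)
            print(f"n={n}: {len(grams)} distinct character n-grams, "
                  f"{singletons} seen only once")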

  • aGHz 10 hours ago

    Attention. It's really all you need [1]

    1: https://arxiv.org/abs/1706.03762
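
    The core operation from that paper is small enough to sketch in numpy (single head, no projections, no masking; just an illustration, not the full architecture):

        import numpy as np

        def attention(Q, K, V):
            # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
            d = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)
            return weights @ V

        rng = np.random.default_rng(0)
        x = rng.normal(size=(6, 16))     # 6 tokens, 16-dimensional embeddings
        print(attention(x, x, x).shape)  # (6, 16): each token mixes info from every other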

    • hackernewds 6 hours ago

      the attention Mafia. only one is a billionaire now though. the rewards have been reaped by the adopters instead of the inventors.

  • Nevermark 9 hours ago

    Even a dead simple two-layer neural network is a universal approximator, meaning it can model any relationship with as much accuracy as you want, given enough neurons in the first layer and subject to available resources for training time and model size (see the sketch at the end of this comment).

    Specific deep learning architectures and training variants reflect the problems they are solving, speeding up training, and reducing model sizes considerably. Both keep getting better, so efficiency improvements are not likely to plateau anytime soon.

    They readily handle both statistically driven and pattern driven mappings in the data. Most modelling systems tend to be better at, or can only adapt to, one of those dimensions or the other.

    Learning patterns means they don't just go input->prediction. They learn to recognize and apply relationships in the middle. So when people say they are "just" predictors, that tends to be misleading. They end up predicting, but they will do whatever they need to in between in terms of processing information, in order to predict.

    They can learn both hard and soft pattern boundaries. Discrete and continuous relationships. And static and sequential patterns.

    They can be trained directly on example outputs, or indirectly via reinforcement (learning what kind of outputs will get judged well and generating those). Those are only two of many flexible training schemes.

    All those benefits and more make them exceptionally general, flexible, powerful and efficient tools relative to other classes of universal approximators, for large growing areas of data defined problems.
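
    A rough sketch of the universal-approximation point (an illustrative example, assuming numpy): a two-layer network with a wide random first layer, where only the output weights are fit by least squares, already approximates a nonlinear target like sin reasonably well. Real training also learns the first layer, but the "enough neurons in the first layer" intuition shows through even in this simplified form.

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.linspace(-3, 3, 400).reshape(-1, 1)
        y = np.sin(2 * x)                              # target relationship

        n_hidden = 200                                 # "enough neurons in the first layer"
        W1 = rng.normal(size=(1, n_hidden))
        b1 = rng.normal(size=n_hidden)
        H = np.tanh(x @ W1 + b1)                       # first layer, fixed at random
        W2, *_ = np.linalg.lstsq(H, y, rcond=None)     # fit only the output layer
        print("max abs error:", np.abs(H @ W2 - y).max())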

  • devmor 7 hours ago

    Imagine the nodes in those visualizations are not just a static state, but each have metadata attached that is used to infer the next state in the "chain", keyed on a value assigned from a mysterious lookup table based on the current state. So each time the state shifts, the metadata on all states can also shift.

    (There are also types of LLMs where access to that metadata is limited; in one such type, the current state can only check metadata on previous states and weigh it against the base values of the next states in the chain.)

    Then, imagine each chain of states in a Markov chain is a 2D hash map, like a grid plot. Our current LLMs are like an Nth-dimensional hash map instead, and can have a finite but extremely large depth. This is pretty near impossible to visualize as a human, but if you're familiar with array/map operations, you should get the idea.

    This is a very "base level" understanding, as my learning on LLMs stopped around the time Tensorflow stopped being the new hotness, but hopefully that gives you an idea.

  • CamperBob2 9 hours ago

    A Markov chain can't tell you the difference, and an LLM can.

linwangg 8 hours ago

Such an awesome project! The visualization is amazing!