Ask HN: How Does DeepSeek "Think"?
DeepSeek has a useful feature that other commercial LLMs don't expose: it displays its internal "thinking" process. I wonder what technological aspect makes this possible. Do several LLMs communicate with each other before producing a solution? Are there different roles within these LLMs, such as some proposing solutions, others contradicting them or offering alternative viewpoints, or pointing out overlooked aspects?
>Do several LLMs communicate with each other before providing a solution?
No.
>I wonder what technological aspect makes this possible.
One of its training datasets (weighted heavily relative to the others) contains a large number of examples that walk through a reasoning process inside <think></think> tags before giving the final answer. At inference time the single model simply reproduces that pattern: it generates the "thinking" tokens first, then the answer.
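To make that concrete: the client sees one stream of tokens and splits it on the tags. A minimal parsing sketch, assuming the <think></think> convention described above (the helper name and example string are made up for illustration):

    import re

    def split_reasoning(completion: str) -> tuple[str, str]:
        """Split a raw completion into (thinking, answer).

        Assumes the model emits its chain-of-thought inside a single
        <think>...</think> block before the final answer, as R1-style
        reasoning models are trained to do.
        """
        match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
        if match is None:
            return "", completion.strip()  # model skipped the thinking block
        thinking = match.group(1).strip()
        answer = completion[match.end():].strip()
        return thinking, answer

    raw = "<think>The user asks for 2+2. That is 4.</think>The answer is 4."
    thinking, answer = split_reasoning(raw)
    print(thinking)  # -> The user asks for 2+2. That is 4.
    print(answer)    # -> The answer is 4.

The point is that there's no second model involved: the "thinking" is just ordinary output tokens that the UI chooses to render separately.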
Thank you for taking the time to answer. However, I'm not sure the answer is "no", because DeepSeek uses a particular technique in its architecture. To quote this blog [0]:
"Modern large language models (LLMs) started introducing a layer called “Mixture of Experts” (MoE) in their Transformer blocks to scale parameter count without linearly increasing compute. This is typically done through top-k (often k=2) “expert routing”, where each token is dispatched to two specialized feed-forward networks (experts) out of a large pool.
A naive GPU cluster implementation would be to place each expert on a separate device and have the router dispatch to the selected experts during inference. But this would have all the non-active experts idle on the expensive GPUs.
GShard, 2021 introduced the concept of sharding these feed-forward (FF) experts across multiple devices, so that each device …"
[0] https://www.kernyan.com/hpc,/cuda/2025/02/26/Deepseek_V3_R1_...
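For reference, the top-k routing the quoted passage describes can be sketched in a few lines. A toy numpy version, assuming a linear router over token hidden states and k=2 experts per token (all names and shapes here are illustrative, not DeepSeek-V3's actual implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, k = 16, 8, 2
    n_tokens = 4

    # Toy parameters: a linear router and one 2-layer ReLU FFN per expert.
    W_router = rng.normal(size=(d_model, n_experts))
    experts = [
        (rng.normal(size=(d_model, 4 * d_model)),
         rng.normal(size=(4 * d_model, d_model)))
        for _ in range(n_experts)
    ]

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def moe_layer(x):
        """Route each token to its top-k experts and mix their outputs."""
        logits = x @ W_router                       # (tokens, experts)
        topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            weights = softmax(logits[t, topk[t]])   # renormalize over chosen experts
            for w, e in zip(weights, topk[t]):
                W1, W2 = experts[e]
                out[t] += w * (np.maximum(x[t] @ W1, 0) @ W2)  # expert FFN
        return out

    x = rng.normal(size=(n_tokens, d_model))
    print(moe_layer(x).shape)  # (4, 16): only 2 of 8 experts run per token

In a real deployment each expert's weights live on different devices and tokens are dispatched over the network, which is the sharding problem the quoted passage goes on to describe.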