Ask HN: How Does DeepSeek "Think"?
DeepSeek has a useful feature that other commercial LLMs don't expose: it displays its internal "thinking" process. I wonder what technological aspect makes this possible. Do several LLMs communicate with each other before producing a solution? Are there different roles within these LLMs, such as some proposing solutions, others contradicting them or offering alternative viewpoints, or pointing out overlooked aspects?
>Do several LLMs communicate with each other before providing a solution?
No.
>I wonder what technological aspect makes this possible.
One of its training datasets (weighted heavily relative to the others) contains a large number of examples that walk through a reasoning process inside <think></think> tags before giving the final answer. At inference time the single model simply reproduces that pattern: it generates the "thinking" tokens first, then the answer.
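To make that concrete: the client sees one stream of tokens and splits it on the tags. A minimal parsing sketch, assuming the <think></think> convention described above (the helper name and example string are made up for illustration):

    import re

    def split_reasoning(completion: str) -> tuple[str, str]:
        """Split a raw completion into (thinking, answer).

        Assumes the model emits its chain-of-thought inside a single
        <think>...</think> block before the final answer, as R1-style
        reasoning models are trained to do.
        """
        match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
        if match is None:
            return "", completion.strip()  # model skipped the thinking block
        thinking = match.group(1).strip()
        answer = completion[match.end():].strip()
        return thinking, answer

    raw = "<think>The user asks for 2+2. That is 4.</think>The answer is 4."
    thinking, answer = split_reasoning(raw)
    print(thinking)  # -> The user asks for 2+2. That is 4.
    print(answer)    # -> The answer is 4.

The point is that there's no second model involved: the "thinking" is just ordinary output tokens that the UI chooses to render separately.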
Thank you for taking the time to answer. However, I'm not sure the answer is "no", because DeepSeek uses a particular technique in its architecture. To quote this blog [0]:
"Modern large language models (LLMs) started introducing a layer called “Mixture of Experts” (MoE) in their Transformer blocks to scale parameter count without linearly increasing compute. This is typically done through top-k (often k=2) “expert routing”, where each token is dispatched to two specialized feed-forward networks (experts) out of a large pool.
A naive GPU cluster implementation would be to place each expert on a separate device and have the router dispatch to the selected experts during inference. But this would have all the non-active experts idle on the expensive GPUs.
GShard, 2021 introduced the concept of sharding these feed-forward (FF) experts across multiple devices, so that each device …"
[0] https://www.kernyan.com/hpc,/cuda/2025/02/26/Deepseek_V3_R1_...
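For reference, the top-k routing the quoted passage describes can be sketched in a few lines. A toy numpy version, assuming a linear router over token hidden states and k=2 experts per token (all names and shapes here are illustrative, not DeepSeek-V3's actual implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, k = 16, 8, 2
    n_tokens = 4

    # Toy parameters: a linear router and one 2-layer ReLU FFN per expert.
    W_router = rng.normal(size=(d_model, n_experts))
    experts = [
        (rng.normal(size=(d_model, 4 * d_model)),
         rng.normal(size=(4 * d_model, d_model)))
        for _ in range(n_experts)
    ]

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def moe_layer(x):
        """Route each token to its top-k experts and mix their outputs."""
        logits = x @ W_router                       # (tokens, experts)
        topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            weights = softmax(logits[t, topk[t]])   # renormalize over chosen experts
            for w, e in zip(weights, topk[t]):
                W1, W2 = experts[e]
                out[t] += w * (np.maximum(x[t] @ W1, 0) @ W2)  # expert FFN
        return out

    x = rng.normal(size=(n_tokens, d_model))
    print(moe_layer(x).shape)  # (4, 16): only 2 of 8 experts run per token

In a real deployment each expert's weights live on different devices and tokens are dispatched over the network, which is the sharding problem the quoted passage goes on to describe.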