> SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses.
I am very interested in some adjacent applications of this idea: using a deterministic, fast RNG to generate the full genome for each candidate in an evolutionary algorithm.

Memory bandwidth tends to be the king of constraints in these gigantic information-theory problems. Maintaining a very large population (which dramatically improves your odds of success) would otherwise mean pushing members out to DRAM, at which point every access is 2-3 orders of magnitude slower. A modern CPU with all dependent data already sitting in L1/L2/L3 might as well have already finished whatever calculation; getting the data into the caches is the real struggle. So what if we didn't have to move it there? You have upward of a hundred megabytes of cache available to each core on the higher-end parts. If we can achieve something like a 1:10 ratio of RNG seed bytes to projected parameter bytes, you could have effectively gigabytes of parameters per CPU, running in a latency domain bounded by L3.
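A minimal sketch of what that could look like (the names and the toy sphere objective are mine, and splitmix64 is just one reasonable choice of small, fast, deterministic generator): each candidate is stored as a bare 64-bit seed, and its genome is streamed out of the PRNG directly into the fitness loop, so nothing genome-sized ever has to live in memory at all.

```cpp
// Hypothetical sketch: a candidate is just an 8-byte seed; its genome is
// re-generated on demand from a fast deterministic PRNG (splitmix64),
// so the full genome never has to be materialized in DRAM.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// splitmix64: tiny, fast, deterministic (well-known public-domain algorithm).
static inline uint64_t splitmix64(uint64_t &state) {
    uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Stream the genome straight into the fitness computation: only the seed,
// the PRNG state, and the accumulator exist at any moment.
double evaluate(uint64_t seed, size_t genome_len) {
    uint64_t state = seed;
    double fitness = 0.0;
    for (size_t i = 0; i < genome_len; ++i) {
        // Map each 64-bit draw to a gene in [-1, 1).
        double gene = (double)(int64_t)splitmix64(state) / 9223372036854775808.0;
        fitness += gene * gene;  // placeholder objective (sphere function)
    }
    return fitness;
}

int main() {
    const size_t genome_len = 1 << 16;       // 64k genes per candidate
    std::vector<uint64_t> population(4096);  // 4096 candidates resident in 32 KB
    for (size_t i = 0; i < population.size(); ++i)
        population[i] = 0x9E3779B97F4A7C15ULL * (i + 1);  // arbitrary distinct seeds

    double best = 1e300;
    for (uint64_t seed : population)
        best = std::min(best, evaluate(seed, genome_len));
    std::printf("best fitness: %f\n", best);
}
```

Here a single 8-byte seed stands in for a 512 KB genome, well past the 1:10 ratio. The obvious wrinkle is variation: mutating a raw seed scrambles the entire genome, so in practice you'd likely evolve something more structured, e.g. a list of (seed, segment) pairs or a seed plus a small sparse delta per candidate, in the same spirit as SeedLM keeping per-block coefficients alongside its seeds.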