In Transformers, the KV cache is a bottleneck during inference. We once tried setting k=v in order to save 50% of the cache memory, in the hope that performance would remain unaffected. Think of this as computing self_attn(q=q, k=k, v=k) instead of self_attn(q=q, k=k, v=v). About an hour later, it became clear this was a pretty dumb idea for a theoretical reason (not only an experimental one). What could this theoretical reason have been?
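To make the setup concrete, here is a minimal NumPy sketch of scaled dot-product attention and the tied variant described above. The function name `self_attn` mirrors the pseudocode in the question; everything else (shapes, random inputs) is illustrative, not the actual experiment.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))

standard = self_attn(q, k, v)  # cache must store both k and v
tied = self_attn(q, k, k)      # k doubles as v: cache stores only k
```

With tied weights, the per-layer cache shrinks from two tensors (k and v) to one, which is where the 50% saving comes from; the question is why this restriction is harmful in principle.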