In Transformers, the KV cache is a bottleneck during inference. We once tried setting k=v in order to save 50% of the cache memory, in the hope that performance would remain unaffected. Think of this as computing self_attn(q=q, k=k, v=k) instead of self_attn(q=q, k=k, v=v). About an hour later, it became clear this was a pretty dumb idea for a theoretical reason (not only an experimental one). What could this theoretical reason have been?
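To make the setup concrete, here is a minimal NumPy sketch of scaled dot-product attention and the tied variant described above. The function name `self_attn` mirrors the pseudocode in the question; everything else (shapes, random inputs) is illustrative, not the actual experiment.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))

standard = self_attn(q, k, v)  # cache must store both k and v
tied = self_attn(q, k, k)      # k doubles as v: cache stores only k
```

With tied weights, the per-layer cache shrinks from two tensors (k and v) to one, which is where the 50% saving comes from; the question is why this restriction is harmful in principle.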