Microsoft’s LLMA Accelerates LLM Generations via an ‘Inference-With-Reference’ Decoding Approach | Synced
Source: Synced | AI Technology & Industry Review
In the new paper Inference with Reference: Lossless Acceleration of Large Language Models, a Microsoft research team proposes LLMA, an inference-with-reference decoding mechanism that exploits the overlaps between an LLM's outputs and the reference texts available in many practical settings. By copying overlapping spans from the reference and verifying them with the LLM, LLMA achieves up to 2x speed-ups while producing generation results identical to standard decoding.
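The copy-then-verify idea can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the trigger heuristic (matching the last few generated tokens against the reference), and the span length are illustrative assumptions, and a plain Python callable stands in for the LLM. In the real method the copied tokens are checked in a single batched forward pass, which is what the `steps` counter models here; because every emitted token is still the model's own choice, the output matches greedy decoding exactly.

```python
# Toy sketch of "inference with reference" (copy-then-verify) decoding.
# A callable `next_token(context) -> token or None` stands in for the LLM;
# one `step` models one (possibly batched) forward pass.

def greedy_decode(next_token, prompt, max_new):
    """Baseline: one forward pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        tok = next_token(out)
        if tok is None:
            break
        out.append(tok)
    return out

def llma_decode(next_token, prompt, reference, max_new,
                match_len=2, copy_len=4):
    """If the last `match_len` generated tokens occur in the reference,
    copy up to `copy_len` following tokens and verify them in one batched
    step, keeping only the prefix the model agrees with (lossless)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        # Look for the recent n-gram inside the reference text.
        pos = None
        if len(out) >= match_len:
            tail = tuple(out[-match_len:])
            for i in range(len(reference) - match_len + 1):
                if tuple(reference[i:i + match_len]) == tail:
                    pos = i + match_len
                    break
        draft = reference[pos:pos + copy_len] if pos is not None else []
        steps += 1  # one (batched) forward pass
        if not draft:
            tok = next_token(out)
            if tok is None:
                return out, steps
            out.append(tok)
            continue
        for tok in draft:
            true_tok = next_token(out)  # in reality: read from the batch
            if true_tok is None:
                return out, steps
            out.append(true_tok)       # always keep the model's token
            if true_tok != tok:
                break                  # reject the rest of the copied span
            if len(out) - len(prompt) >= max_new:
                break
    return out, steps

if __name__ == "__main__":
    # Hypothetical toy model that deterministically emits `target`.
    prompt = ["<s>"]
    reference = "the quick brown fox jumps over the lazy dog".split()
    target = "the quick brown fox jumps over a lazy dog".split()

    def next_token(ctx):
        i = len(ctx) - len(prompt)
        return target[i] if i < len(target) else None

    base = greedy_decode(next_token, prompt, 20)
    out, steps = llma_decode(next_token, prompt, reference, 20)
    print(out == base, steps, len(target))  # identical output, fewer passes
```

Because the model verifies every copied token, a mismatch (here "a" vs. the reference's "the") simply falls back to the model's own token, so the speed-up comes purely from the overlapping spans, with no change in output.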