Microsoft’s LLMA Accelerates LLM Generations via an ‘Inference-With-Reference’ Decoding Approach | Synced
Source: Synced | AI Technology & Industry Review
In the new paper Inference with Reference: Lossless Acceleration of Large Language Models, a Microsoft research team proposes LLMA, an inference-with-reference decoding mechanism that exploits the overlaps between an LLM's outputs and the reference texts available in many practical settings. By copying overlapping spans from the reference and verifying them with the LLM, LLMA achieves up to 2x speed-ups while producing generation results identical to standard decoding.
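The copy-then-verify idea can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the trigger heuristic (matching the last few generated tokens against the reference), and the span length are illustrative assumptions, and a plain Python callable stands in for the LLM. In the real method the copied tokens are checked in a single batched forward pass, which is what the `steps` counter models here; because every emitted token is still the model's own choice, the output matches greedy decoding exactly.

```python
# Toy sketch of "inference with reference" (copy-then-verify) decoding.
# A callable `next_token(context) -> token or None` stands in for the LLM;
# one `step` models one (possibly batched) forward pass.

def greedy_decode(next_token, prompt, max_new):
    """Baseline: one forward pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        tok = next_token(out)
        if tok is None:
            break
        out.append(tok)
    return out

def llma_decode(next_token, prompt, reference, max_new,
                match_len=2, copy_len=4):
    """If the last `match_len` generated tokens occur in the reference,
    copy up to `copy_len` following tokens and verify them in one batched
    step, keeping only the prefix the model agrees with (lossless)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        # Look for the recent n-gram inside the reference text.
        pos = None
        if len(out) >= match_len:
            tail = tuple(out[-match_len:])
            for i in range(len(reference) - match_len + 1):
                if tuple(reference[i:i + match_len]) == tail:
                    pos = i + match_len
                    break
        draft = reference[pos:pos + copy_len] if pos is not None else []
        steps += 1  # one (batched) forward pass
        if not draft:
            tok = next_token(out)
            if tok is None:
                return out, steps
            out.append(tok)
            continue
        for tok in draft:
            true_tok = next_token(out)  # in reality: read from the batch
            if true_tok is None:
                return out, steps
            out.append(true_tok)       # always keep the model's token
            if true_tok != tok:
                break                  # reject the rest of the copied span
            if len(out) - len(prompt) >= max_new:
                break
    return out, steps

if __name__ == "__main__":
    # Hypothetical toy model that deterministically emits `target`.
    prompt = ["<s>"]
    reference = "the quick brown fox jumps over the lazy dog".split()
    target = "the quick brown fox jumps over a lazy dog".split()

    def next_token(ctx):
        i = len(ctx) - len(prompt)
        return target[i] if i < len(target) else None

    base = greedy_decode(next_token, prompt, 20)
    out, steps = llma_decode(next_token, prompt, reference, 20)
    print(out == base, steps, len(target))  # identical output, fewer passes
```

Because the model verifies every copied token, a mismatch (here "a" vs. the reference's "the") simply falls back to the model's own token, so the speed-up comes purely from the overlapping spans, with no change in output.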