Recurrent drafting (referred as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM)…
Recurrent drafting (referred as ReDrafter) is a novel speculative decoding technique developed and open-sourced by Apple for large language model (LLM) inference now available with NVIDIA TensorRT-LLM. ReDrafter helps developers significantly boost LLM workload performance on NVIDIA GPUs. NVIDIA TensorRT-LLM is a library for optimizing LLM inference. It provides an easy-to-use Python API to defineā¦