The SlimInfer paper has been accepted to AAAI 2026!
👏 Paper title: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning. We identify an information diffusion phenomenon in LLMs: critical token information spreads across the sequence as it passes through the layers, which allows aggressive token pruning in later layers. Building on this observation, we propose SlimInfer, which performs dynamic block-wise pruning together with a predictor-free asynchronous KV cache manager. On LLaMA-3.1-8B-Instruct, it achieves up to a 2.53× time-to-first-token (TTFT) speedup and a 1.88× latency reduction. [related project]