Efficient Memory Management Techniques for LLM Inference in Mobile System

Hyunjeong Shim, Seoyoung Ko, Wanju Doh, Jung Ho Ahn

http://doi.org/10.5626/JOK.2025.52.8.637

On-device LLMs have gained attention because cloud-based LLMs raise privacy concerns and incur network latency. However, the memory management policies of mobile operating systems handle memory resources inefficiently during LLM inference. In this paper, we propose two techniques, Initial KV Cache Swap and Deferred Weight Reclamation. The first leverages zRAM for the preallocated KV cache, and the second reduces storage I/O by deferring weight eviction; together they improve LLM inference performance. Our approach reduces memory usage by up to 27% compared to the default Linux kernel in memory-constrained mobile environments. Moreover, the memory savings grow as the number of candidate paths increases in inference techniques such as speculative decoding, demonstrating that the approach can support diverse LLM decoding techniques on mobile devices.
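
The abstract does not spell out the mechanisms, so the following is only a back-of-the-envelope sketch, not the authors' implementation. It uses the standard KV cache size arithmetic (keys plus values, per layer, per KV head, per token) to illustrate why a preallocated KV cache is a large memory consumer and why the reserved amount scales with the number of candidate paths. All model dimensions, the function names `kv_cache_bytes` and `idle_bytes`, and the assumption that each candidate path reserves its own full-length cache are hypothetical choices made for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, num_paths=1):
    """Bytes needed to hold keys and values for seq_len tokens.

    The leading factor 2 accounts for storing both K and V.
    num_paths models several candidate sequences, each assumed
    (hypothetically) to reserve its own cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * num_paths


def idle_bytes(max_seq_len, used_tokens, **model):
    """Reserved-but-unwritten part of a cache preallocated for max_seq_len."""
    reserved = kv_cache_bytes(seq_len=max_seq_len, **model)
    written = kv_cache_bytes(seq_len=used_tokens, **model)
    return reserved - written


# Hypothetical model configuration (not taken from the paper), fp16 elements.
model = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

MiB = 1024 ** 2
one_path = kv_cache_bytes(seq_len=4096, **model)
four_paths = kv_cache_bytes(seq_len=4096, num_paths=4, **model)
idle = idle_bytes(max_seq_len=4096, used_tokens=256, **model)

print(f"preallocated, 1 path : {one_path // MiB} MiB")
print(f"preallocated, 4 paths: {four_paths // MiB} MiB")
print(f"idle after 256 tokens: {idle // MiB} MiB")
```

Under these assumed dimensions, most of the reserved cache holds no data early in generation, which is the kind of memory a compressed swap device such as zRAM can store cheaply. How the paper actually selects and swaps those pages is not described in the abstract.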


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal
