Journal of KIISE

Search : [ author: Seoyoung Ko ] (1)

On-device LLMs have gained attention because cloud-based LLMs raise privacy and network latency concerns. However, the memory management policies of mobile operating systems handle memory resources inefficiently during LLM inference. In this paper, we propose two techniques: Initial KV Cache Swap, which leverages zRAM for the preallocated KV cache, and Deferred Weight Reclamation, which reduces storage I/O by deferring weight eviction. Together they improve LLM inference performance. Our approach reduces memory usage by up to 27% compared to the default Linux kernel, optimizing LLM inference in memory-constrained mobile environments. Moreover, it yields greater memory savings as the number of candidate paths increases in inference techniques such as speculative decoding, demonstrating its effectiveness in supporting diverse LLM decoding techniques on mobile devices.

ISSN : 2383-630X(Print)
ISSN : 2383-6296(Electronic)
KCI Accredited Journal

Editorial Office

Tel. +82-2-588-9240
Fax. +82-2-521-1352
E-mail. chwoo@kiise.or.kr

Efficient Memory Management Techniques for LLM Inference in Mobile System
