TPI-LLM runs Llama 2-70B in only 3.1 GB of memory across eight low-resource devices
Paper: TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices (19 Pages)
Researchers from MBZUAI and UESTC are interested in running LLMs on edge devices. Existing LLM serving systems, primarily designed for high-performance GPU clusters, are not directly suitable for resource-constrained edge environments. Recent efforts have focused on partitioning models across multiple edge devices and optimizing schedulers to improve throughput. However, these approaches often lead to resource underutilization in single-user scenarios, which are common on edge devices such as smart speakers.
Hmm.. What’s the background?
Traditionally, LLMs (Large Language Models) have been deployed on powerful cloud servers for inference tasks. However, this approach raises privacy concerns because user inputs (prompts) are sent to and processed on remote servers, potentially exposing sensitive information during transmission and storage.
Ok, so what is proposed in the research paper?
The paper introduces TPI-LLM, a tensor parallel inference system specifically designed for resource-constrained edge devices. It addresses the limitations of existing approaches by distributing both the computational load and the memory requirements across multiple devices, enabling efficient inference of large-scale LLMs (up to 70B parameters) on devices with limited resources.
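To make the core idea concrete, here is a minimal NumPy sketch of tensor parallelism for one feed-forward block: the weight matrices are sharded across workers, each worker computes a partial result independently, and an allreduce (here just a sum) combines them. The toy sizes, the single-process simulation, and the variable names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of tensor parallelism for one MLP block, simulated locally.
import numpy as np

NUM_DEVICES = 8
HIDDEN, FFN = 512, 2048          # toy sizes; Llama 2-70B is far larger

rng = np.random.default_rng(0)
x = rng.standard_normal((1, HIDDEN))        # one token's activation
W1 = rng.standard_normal((HIDDEN, FFN))     # up-projection
W2 = rng.standard_normal((FFN, HIDDEN))     # down-projection

# Column-parallel first layer: each device holds a slice of W1's columns.
W1_shards = np.split(W1, NUM_DEVICES, axis=1)
# Row-parallel second layer: each device holds the matching rows of W2.
W2_shards = np.split(W2, NUM_DEVICES, axis=0)

# Each device computes its partial output independently; the elementwise
# activation (ReLU here) can be applied per shard with no communication.
partials = [np.maximum(x @ w1, 0.0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# A single allreduce (sum) per block combines the partial results.
y_parallel = np.sum(partials, axis=0)

# Sanity check against the unsharded computation.
y_reference = np.maximum(x @ W1, 0.0) @ W2
assert np.allclose(y_parallel, y_reference)
```

The key point is that memory for the weights is split roughly evenly across devices, and only one small allreduce per block is needed, which is what makes the approach viable on devices that cannot hold the full model.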
The paper proposes a sliding window memory scheduler that dynamically manages model weights during inference. This scheduler asynchronously preloads weights for upcoming layers while unloading those that have been processed, effectively hiding disk I/O latency and enabling larger models to run smoothly on devices with limited memory.
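A toy sketch of the sliding-window idea is shown below: a background thread prefetches upcoming layers' weights from disk while the current layer computes, and layers that have already run are evicted. The helper names (load_layer_weights, run_layer), the window size, and the layer count are hypothetical placeholders, not the paper's code.

```python
# Sketch of a sliding-window weight scheduler (illustration, not the paper's code).
import threading
from collections import OrderedDict

WINDOW_SIZE = 4      # number of layers kept in memory at once (assumed)
NUM_LAYERS = 80

def load_layer_weights(i):
    # Placeholder for reading layer i's tensors from disk.
    return f"weights-of-layer-{i}"

def run_layer(i, weights, hidden):
    # Placeholder for the actual layer computation.
    return hidden

class SlidingWindowScheduler:
    def __init__(self):
        self.window = OrderedDict()   # layer index -> weights
        self.ready = {}               # layer index -> threading.Event

    def prefetch(self, i):
        if i >= NUM_LAYERS or i in self.ready:
            return
        self.ready[i] = threading.Event()
        def _load():
            self.window[i] = load_layer_weights(i)   # asynchronous disk read
            self.ready[i].set()
        threading.Thread(target=_load, daemon=True).start()

    def get(self, i):
        self.ready[i].wait()          # blocks only if disk I/O is slower than compute
        return self.window[i]

    def evict_through(self, i):
        for j in list(self.window):
            if j <= i:
                del self.window[j]    # free memory of layers already processed

def forward(hidden):
    sched = SlidingWindowScheduler()
    for i in range(WINDOW_SIZE):      # warm up the window
        sched.prefetch(i)
    for i in range(NUM_LAYERS):
        weights = sched.get(i)
        sched.prefetch(i + WINDOW_SIZE)   # preload a future layer in the background
        hidden = run_layer(i, weights, hidden)
        sched.evict_through(i)            # slide the window forward
    return hidden

print(forward("token-hidden-state"))
```

Because loading and unloading overlap with computation, the resident memory stays bounded by the window size rather than the full model, which is what lets a 70B model fit on devices with only a few gigabytes of RAM.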
Recognizing link latency as a significant bottleneck in edge networks, the paper analyzes various allreduce algorithms and identifies the star-based approach as the most efficient for TPI-LLM. This algorithm minimizes the number of hops and cumulative link latency, leading to faster inference compared to ring- or tree-based methods.
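A rough alpha-beta cost model illustrates why the star layout helps when per-hop latency dominates. The constants, message size, and the model itself below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope link-latency comparison: ring vs. star allreduce.
ALPHA = 0.03        # per-hop link latency in seconds (~30 ms, assumed)
BETA = 1 / 12.5e6   # seconds per byte on a ~100 Mbps link (assumed)
P = 8               # number of devices
MSG = 16 * 1024     # bytes reduced per step (made-up size)

def ring_allreduce_time(p, n):
    # Reduce-scatter + allgather: 2(p-1) steps, each moving n/p bytes.
    return 2 * (p - 1) * (ALPHA + (n / p) * BETA)

def star_allreduce_time(p, n):
    # Workers send to a central node, which reduces and broadcasts back:
    # only 2 hops on the critical path, but the hub moves (p-1) messages per hop.
    return 2 * (ALPHA + (p - 1) * n * BETA)

print(f"ring: {ring_allreduce_time(P, MSG) * 1e3:.1f} ms per allreduce")
print(f"star: {star_allreduce_time(P, MSG) * 1e3:.1f} ms per allreduce")
# With high per-hop latency and small per-token messages, the star layout wins
# because its critical path crosses far fewer links.
```

Under these assumed numbers the ring's 2(p-1) latency hops dominate its cost, while the star pays only two hops, which matches the paper's argument that cumulative link latency, not bandwidth, is the bottleneck in edge networks.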
What’s next?
While TPI-LLM addresses memory and communication bottlenecks, the research indicates that computation remains a primary constraint on inference speed. Future work should focus on exploring techniques to accelerate computation on edge devices, such as model compression, efficient kernel implementations, and hardware acceleration.
So essentially,
TPI-LLM runs Llama 2-70B in only 3.1 GB of memory across eight low-resource devices.
Learned something new? Consider sharing it!