Efficient Deep Learning Inference on Edge Devices

Abstract

Deploying deep learning (DL) models on edge devices is getting popular nowadays. The huge diversity of edge devices, with both computation and memory constraints, however, make efficient deployment challenging. In this paper, we propose a two-stage pipeline that optimizes DL models on target devices. The first stage optimizes the inference workloads, and the second stage searches optimal kernel implementations on the target device. We implemented this pipeline with the TVM stack. Our contributions include new algorithmic optimization that is crucial to edge devices, such as quantization and joint kernel turning. On Raspberry Pi, compared to manually optimized frameworks, we will demonstrate our pipeline improves inference latency by 3x for ResNet-18 and by 10x for MobileNet, and generates compact runtime library with size less than 1MB.

Publication
ACM Conference on Systems and Machine Learning