As large language models (LLMs) are increasingly deployed in vision–language agents, embodied intelligence, and edge systems, their practical use is constrained by limited compute and memory and by strict latency requirements. This talk explores how efficiency-aware model design and hardware–software co-design can enable high-performance LLM inference on resource-constrained devices. I will present a series of approaches that reduce the computational and memory bottlenecks of modern LLMs, focusing on efficient attention mechanisms, edge-oriented model architectures, and principled hardware-aware optimization. These methods exploit redundancy in attention computation and KV caching, and co-design model architectures around hardware constraints, significantly improving inference speed and memory efficiency without sacrificing accuracy. I will also introduce a hardware co-design scaling framework that directly relates model accuracy to inference latency, enabling rapid identification of Pareto-optimal architectures under realistic deployment budgets. Together, these results demonstrate a practical pathway toward scalable, efficient LLMs and vision–language action models that can operate directly on edge and embedded platforms.
Invited Speaker: Cheng Deng (University of Edinburgh)
Short Bio: Cheng is a Research Fellow at the Bayes Centre, University of Edinburgh, where he works in collaboration with Prof. Luo Mai, Prof. Jeff Pan, and Prof. Jun Wang. His research focuses on efficient LLMs. He also serves as a Guest Researcher at Li Auto. Prior to joining the University of Edinburgh, he was a Research Scientist at Huawei London Research Centre. Before that, Cheng worked as a Research Assistant at the Hong Kong University of Science and Technology (Guangzhou), collaborating with Prof. Lei Chen and Prof. Lionel M. Ni. He obtained his Ph.D. in Computer Science from Shanghai Jiao Tong University, where he worked in the Acemap IIoT Lab under the supervision of Prof. Xinbing Wang. During his early research career, he interned with the Data Team at TikTok and served as an Applied Scientist at the Amazon Shanghai AI Lab. In 2021, he was selected for the Wenjun Wu Honored Ph.D. Class. Cheng has published over 30 papers in top-tier conferences and journals, including ICML, NeurIPS, ICLR, MLSys, and IEEE TMC, and currently serves as a reviewer for venues such as ICLR, ACL, NeurIPS, ICML, and CVPR.