MediaTek NPU and LiteRT: Powering the next generation of on-device AI

LiteRT and MediaTek are announcing the new LiteRT NeuroPilot Accelerator, a ground-up successor to the TFLite NeuroPilot delegate that brings a seamless deployment experience, state-of-the-art LLM support, and advanced performance to millions of devices worldwide.

Lu Wang, Arian Arfaian, Luke Boyer

Overview

The article discusses the advancements in on-device AI powered by MediaTek's Neural Processing Unit (NPU) and the introduction of the LiteRT NeuroPilot Accelerator. It highlights the challenges developers face in deploying AI on NPUs and presents solutions to streamline the development process, enabling sophisticated generative AI models to run efficiently on various devices.

What You'll Learn

1. How to deploy AI models using the LiteRT NeuroPilot Accelerator
2. Why Ahead-of-Time (AOT) compilation is beneficial for large models
3. How to leverage Native Hardware Buffer Interoperability for efficient data processing

Prerequisites & Requirements

  • Understanding of machine learning model deployment
  • Familiarity with LiteRT and MediaTek NPUs (optional)

Key Questions Answered

What is the LiteRT NeuroPilot Accelerator and its key features?
The LiteRT NeuroPilot Accelerator is a tool designed to simplify the deployment of AI models on MediaTek NPUs. It offers a unified API for various NPUs, supports both AOT and on-device compilation, and enhances performance for generative AI applications, making it easier for developers to implement complex AI functionalities.
How does AOT compilation improve model performance?
AOT compilation allows developers to compile models ahead of time for specific SoCs, significantly reducing initialization costs and memory usage. This is particularly beneficial for large models, as it minimizes the time taken for the model to be ready for inference, enhancing the user experience.
What are the benefits of using Native Hardware Buffer Interoperability?
Native Hardware Buffer Interoperability enables zero-copy data passing between components, which enhances efficiency by allowing direct data transfer from GPU outputs to the NPU. This is crucial for applications requiring high throughput, such as real-time video processing, as it reduces latency and resource consumption.
What generative AI capabilities are supported by the LiteRT NeuroPilot Accelerator?
The LiteRT NeuroPilot Accelerator supports advanced generative AI capabilities through models such as the Gemma family. It lets developers run sophisticated applications, including text generation and multimodal use cases, directly on the NPU, improving the performance of AI-driven features on devices.

Key Statistics & Figures

Performance improvement of Gemma models
up to 12x compared to CPU and 10x compared to GPU
This performance boost is achieved through optimizations specifically targeting the MediaTek NPU.
Gemma 3n E2B model prefill speed
over 1600 tokens/sec
This speed is achieved on the NPU, enabling sophisticated multimodal use cases.
Gemma 3n E2B model decode speed
28 tokens/sec with 4K context
This performance metric highlights the model's efficiency in processing large context inputs.
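
To put the Gemma 3n E2B figures in perspective, here is a rough back-of-the-envelope estimate in Python. It assumes throughput stays constant at the quoted rates, which is a simplification, and it ignores model load time and sampling overhead. The prompt and output sizes (4096 and 128 tokens) are illustrative choices, not values from the article.

```python
# Back-of-the-envelope latency from the throughput figures quoted above.
PREFILL_TOKENS_PER_SEC = 1600.0  # quoted as "over 1600", so prefill time is an upper bound
DECODE_TOKENS_PER_SEC = 28.0     # quoted with a 4K context


def estimate_latency_seconds(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (prefill time, total time) in seconds, assuming constant throughput."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode_s = output_tokens / DECODE_TOKENS_PER_SEC
    return prefill_s, prefill_s + decode_s


# Illustrative sizes only: a 4096-token prompt and a 128-token reply.
ttft, total = estimate_latency_seconds(prompt_tokens=4096, output_tokens=128)
print(f"prefill ~{ttft:.2f} s, total ~{total:.2f} s")
```

Under these assumptions, prefill dominates time-to-first-token for long prompts, while decode speed determines how quickly the rest of the reply streams out.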

Technologies & Tools

Software
LiteRT NeuroPilot Accelerator
Facilitates deployment of AI models on MediaTek NPUs.
AI Model
Gemma
Provides generative AI capabilities optimized for on-device use.

Key Actionable Insights

1. Utilize the LiteRT NeuroPilot Accelerator to streamline your AI model deployment process.
This tool abstracts the complexities of working with various NPUs, allowing you to focus on building your application rather than managing hardware-specific details.
2. Consider implementing AOT compilation for larger models to improve initialization times.
By compiling your models ahead of time, you can significantly reduce the time it takes for your application to become responsive, enhancing user satisfaction.
3. Leverage Native Hardware Buffer Interoperability for efficient data handling in your applications.
This feature allows for direct data transfer between GPU and NPU, which is essential for applications that require real-time processing, such as video analytics.
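
The zero-copy idea behind Native Hardware Buffer Interoperability can be illustrated with a small, purely conceptual Python sketch. It does not use LiteRT or any Android buffer API: a `bytearray` stands in for a shared hardware buffer, and the functions `gpu_stage_write` and `npu_stage_read` are hypothetical stand-ins for a GPU producer and an NPU consumer.

```python
# Conceptual illustration only: copy vs. zero-copy hand-off between two stages.
# The bytearray stands in for a shared hardware buffer; no real GPU or NPU is involved.


def gpu_stage_write(frame: bytearray) -> None:
    """Hypothetical producer: fills the buffer in place (stand-in for a GPU pass)."""
    for i in range(len(frame)):
        frame[i] = i % 256


def npu_stage_read(buffer) -> int:
    """Hypothetical consumer: reads whatever it was handed (stand-in for NPU input)."""
    return sum(buffer)


frame = bytearray(1024)

copied = bytes(frame)       # copy path: duplicates all 1024 bytes up front
shared = memoryview(frame)  # zero-copy path: a view over the same memory

gpu_stage_write(frame)

# The copy is a separate allocation and never sees the producer's writes.
print(npu_stage_read(copied))
# The view reads the producer's output directly, with no data moved.
print(npu_stage_read(shared))
```

The point of the sketch is that the consumer holding a view reads the producer's output in place, whereas the copy path allocates and fills a second buffer. On a real device the saving applies per frame, which is why it matters for high-throughput pipelines such as video.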

Common Pitfalls

1. Failing to utilize AOT compilation for large models can lead to a poor user experience due to high initialization times.
Developers often overlook the benefits of AOT compilation, which can significantly enhance the responsiveness of applications. Without it, the compilation cost is paid on the device at startup, and users see the delay.
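
The trade-off described above can be modelled with a minimal Python sketch. Every name in it is hypothetical (`compile_for_soc`, `load_model`, and the `soc_*` identifiers), and the compile cost is simulated with a short sleep. It is not the LiteRT API, and the fallback from a missing precompiled artifact to on-device compilation is an assumption made for illustration.

```python
import time

# All names here are hypothetical; this models the trade-off, not the LiteRT API.


def compile_for_soc(model_name: str, soc: str) -> str:
    """Stand-in for an expensive NPU compilation step."""
    time.sleep(0.05)  # simulated compile cost
    return f"{model_name}@{soc}"


# Ahead of time (for example on a build machine): compile once per target SoC.
TARGET_SOCS = ["soc_a", "soc_b"]  # placeholder identifiers, not real chip names
aot_artifacts = {soc: compile_for_soc("model", soc) for soc in TARGET_SOCS}


def load_model(soc: str) -> tuple[str, float]:
    """Return (compiled artifact, seconds spent before the model is ready)."""
    start = time.perf_counter()
    artifact = aot_artifacts.get(soc)
    if artifact is None:
        # No precompiled artifact for this SoC: compile on the device instead.
        artifact = compile_for_soc("model", soc)
    return artifact, time.perf_counter() - start


_, aot_ready = load_model("soc_a")  # precompiled: only a dictionary lookup
_, jit_ready = load_model("soc_c")  # not precompiled: pays the compile cost at startup
print(f"AOT: {aot_ready:.4f} s, on-device: {jit_ready:.4f} s")
```

In this toy model the precompiled path skips the compile step entirely at load time, which mirrors why AOT compilation is recommended for large models whose compilation would otherwise delay the first inference.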

Related Concepts

On-device AI Deployment Strategies
Generative AI Model Optimization
Native Hardware Buffer Interoperability