LiteRT: The Universal Framework for On-Device AI

LiteRT, the evolution of TFLite, is now the universal framework for on-device AI. It delivers an average of 1.4x faster GPU performance than TFLite, new NPU support, and streamlined GenAI deployment for models like Gemma.

Lu Wang, Chintan Parikh, Jingjiang Li, Terry Heo
9 min read | advanced

Overview

LiteRT has evolved from its TensorFlow Lite foundation into a universal on-device AI inference framework, now offering production-ready GPU acceleration across six platforms and streamlined NPU integration with MediaTek and Qualcomm. The framework delivers significant performance improvements over both TFLite and competitors like llama.cpp, with seamless model conversion from PyTorch, TensorFlow, and JAX, and optimized support for popular open models including the Gemma family.

What You'll Learn

1. How to use LiteRT's CompiledModel API for GPU-accelerated on-device inference across Android, iOS, macOS, Windows, Linux, and Web
2. How to deploy ML models to NPUs using LiteRT's three-step workflow with AOT or JIT compilation
3. Why LiteRT outperforms llama.cpp for on-device GenAI deployment and how to use the integrated LiteRT-LM stack
4. How to convert PyTorch, TensorFlow, and JAX models to the .tflite format for edge deployment
5. When to choose AOT compilation versus on-device JIT compilation for NPU acceleration

Prerequisites & Requirements

  • Understanding of ML model inference concepts (CPU, GPU, NPU execution)
  • Familiarity with at least one ML framework (PyTorch, TensorFlow, or JAX)
  • Basic understanding of on-device/edge AI deployment challenges
  • Experience with C++ or Python for model conversion and integration (optional)
  • Familiarity with the TensorFlow Lite (.tflite) model format (optional)

Key Questions Answered

What platforms does LiteRT support for GPU acceleration?
LiteRT provides full GPU acceleration across Android, iOS, macOS, Windows, Linux, and Web platforms. It supports OpenCL, OpenGL, Metal, and WebGPU through its ML Drift GPU engine. On Android, it automatically prioritizes OpenCL when available for peak performance while falling back to OpenGL for broader device coverage.
How much faster is LiteRT compared to TensorFlow Lite for GPU inference?
LiteRT delivers an average of 1.4x faster GPU performance compared to the legacy TFLite GPU delegate, powered by its next-generation ML Drift GPU engine. With additional optimizations like asynchronous execution and zero-copy buffer interoperability, LiteRT can achieve up to 2x faster performance in real-time use cases like background segmentation.
How does LiteRT handle NPU fragmentation across different chip vendors?
LiteRT provides a unified, simplified NPU deployment workflow that abstracts away low-level, vendor-specific SDKs and handles fragmentation across numerous SoC variants. It streamlines deployment into three steps: AOT compilation for target SoCs (optional), deployment via Google Play for On-device AI (PODAI) on Android, and inference using the LiteRT Runtime with automatic NPU delegation and fallback to GPU or CPU.
How does LiteRT performance compare to llama.cpp for running Gemma 3 on mobile?
Benchmarked with Gemma 3 1B on Samsung Galaxy S25 Ultra, LiteRT outperforms llama.cpp by 3x on CPU, 7x on GPU for decode (memory-bound operations), and 19x on GPU for prefill (compute-bound operations). LiteRT's NPU acceleration delivers an additional 2x performance gain over GPU for prefill, maximizing compute hardware potential.
What ML frameworks can convert models to LiteRT format?
LiteRT supports seamless model conversion from PyTorch, TensorFlow, and JAX. PyTorch models can be converted directly to .tflite format in a single step using the LiteRT Torch library. TensorFlow models have best-in-class native support, while JAX models are converted via the jax2tf bridge. This enables high research-to-production velocity regardless of training framework.
What is the difference between AOT and on-device JIT compilation in LiteRT?
AOT (ahead-of-time) compilation is optimal for complex models with known target SoCs, minimizing initialization time and memory footprint for an instant-start experience. On-device JIT compilation is best for distributing small models across various platforms without preparation, though it has higher first-run initialization costs. Developers choose based on their application's deployment requirements.
How fast is LiteRT NPU acceleration compared to CPU and GPU?
LiteRT's NPU acceleration with MediaTek and Qualcomm chipsets reaches speeds up to 100x faster than CPU and 10x faster than GPU. These are production-ready integrations available now, with additional hardware support being actively expanded. The NPU unlocks smooth, responsive, high-speed AI experiences that modern applications demand.
What open models does LiteRT support for on-device deployment?
LiteRT supports an extensive collection of popular open-weight models pre-converted for immediate deployment, including the Gemma family (Gemma 3 270M and 1B, Gemma 3n, EmbeddingGemma, FunctionGemma), Qwen, Phi, and FastVLM. These models are available on the LiteRT Hugging Face Community and can be explored via the Google AI Edge Gallery app on Android and iOS.
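
The fallback behavior described in the answers above (NPU first, then GPU or CPU; on Android, OpenCL before OpenGL) can be pictured as an ordered preference list. The sketch below is plain Python and does not call any LiteRT API. The function and constant names are hypothetical, and the real runtime performs this selection internally.

```python
from typing import Iterable, Sequence

# Preference orders taken from the text above; the names are illustrative only.
ACCELERATOR_ORDER = ("npu", "gpu", "cpu")
ANDROID_GPU_API_ORDER = ("opencl", "opengl")


def pick_first_available(preferred: Sequence[str], available: Iterable[str]) -> str:
    """Return the first entry of `preferred` that is present in `available`."""
    available_set = {name.lower() for name in available}
    for name in preferred:
        if name in available_set:
            return name
    raise LookupError(f"none of {list(preferred)} is available")


def plan_execution(device_accelerators: Iterable[str],
                   gpu_apis: Iterable[str] = ()) -> dict:
    """Choose an accelerator and, if it is the GPU, a GPU API to go with it."""
    accelerator = pick_first_available(ACCELERATOR_ORDER, device_accelerators)
    plan = {"accelerator": accelerator}
    if accelerator == "gpu":
        plan["gpu_api"] = pick_first_available(ANDROID_GPU_API_ORDER, gpu_apis)
    return plan


# A hypothetical Android device with no NPU and only an OpenGL driver.
print(plan_execution(["cpu", "gpu"], gpu_apis=["opengl"]))
```

Keeping the preference order as data rather than nested conditionals makes it easy to see which backend wins on a given device.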

Key Statistics & Figures

GPU performance improvement over TFLite
1.4x faster
average
Performance with async execution and zero-copy buffers
Up to 2x faster
Demonstrated in the Segmentation sample app for real-time use cases
NPU vs CPU performance
Up to 100x faster
With MediaTek and Qualcomm NPU integrations
NPU vs GPU performance
Up to 10x faster
With MediaTek and Qualcomm NPU integrations
LiteRT vs llama.cpp CPU performance (Gemma 3 1B)
3x faster
Benchmarked on Samsung Galaxy S25 Ultra
LiteRT vs llama.cpp GPU decode performance
7x faster
Memory-bound decode operations, Gemma 3 1B on Samsung Galaxy S25 Ultra
LiteRT vs llama.cpp GPU prefill performance
19x faster
Compute-bound prefill operations, Gemma 3 1B on Samsung Galaxy S25 Ultra
NPU vs GPU prefill performance for LiteRT
2x additional gain
NPU acceleration over GPU for compute-bound prefill operations
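
Because the prefill and decode speedups differ, the end-to-end gain for a given request depends on how its time splits between the two phases. The sketch below works through that arithmetic using the 19x and 7x GPU figures listed above. The baseline timings are made-up placeholders, not measurements, and the function name is hypothetical.

```python
def end_to_end_speedup(baseline_prefill_s, baseline_decode_s,
                       prefill_speedup, decode_speedup):
    """Overall speedup when each phase is accelerated by its own factor."""
    baseline_total = baseline_prefill_s + baseline_decode_s
    accelerated_total = (baseline_prefill_s / prefill_speedup
                         + baseline_decode_s / decode_speedup)
    return baseline_total / accelerated_total


# Hypothetical baseline: 2 s of prefill and 8 s of decode (long answer, short prompt).
decode_heavy = end_to_end_speedup(2.0, 8.0, prefill_speedup=19, decode_speedup=7)
# Hypothetical baseline: 8 s of prefill and 2 s of decode (long prompt, short answer).
prefill_heavy = end_to_end_speedup(8.0, 2.0, prefill_speedup=19, decode_speedup=7)

print(f"decode-heavy request:  {decode_heavy:.1f}x overall")
print(f"prefill-heavy request: {prefill_heavy:.1f}x overall")
```

The overall figure always falls between the two per-phase factors, so a workload dominated by decode will see a gain closer to 7x than to 19x.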

Technologies & Tools


ML Inference Framework
LiteRT
Universal on-device AI inference framework, evolved from TensorFlow Lite
ML Inference Framework
TensorFlow Lite
Legacy foundation that LiteRT evolved from, with .tflite model format
GPU Engine
ML Drift
Next-generation GPU engine powering LiteRT's cross-platform GPU acceleration
LLM Orchestration
LiteRT-LM
Specialized orchestration layer for managing LLM-specific complexities on-device
ML Framework
PyTorch
Supported for direct model conversion to .tflite via LiteRT Torch library
ML Framework
TensorFlow
Native best-in-class model conversion support to LiteRT format
ML Framework
JAX
Model conversion via jax2tf bridge for edge deployment
GPU API
OpenCL
Primary GPU acceleration API on Android, prioritized when available
GPU API
OpenGL
Fallback GPU acceleration on Android for broader device coverage
GPU API
Metal
GPU acceleration on iOS and macOS platforms
GPU API
WebGPU
GPU acceleration for web platform deployment
AI Model Family
Gemma
Open model family supported by LiteRT including Gemma 3, Gemma 3n, EmbeddingGemma, FunctionGemma
ML Inference Framework
llama.cpp
Competitor used as benchmark comparison for Gemma 3 1B performance
Distribution Platform
Google Play for On-device AI (PODAI)
Automatic delivery of models and runtime to compatible Android devices
Programming Language
C++
Primary language for CompiledModel API integration examples
Model Hub
Hugging Face
Distribution platform for LiteRT-optimized open models

Key Actionable Insights

1. Migrate from the TFLite GPU delegate to LiteRT's new CompiledModel API for an average 1.4x GPU performance improvement. The new API provides a modern interface that unlocks full GPU and NPU acceleration, while the legacy interpreter API remains available for existing production models.
   This matters most for latency-sensitive applications like real-time segmentation and speech recognition, where combining asynchronous execution with zero-copy buffer interoperability can yield up to 2x speedups.
2. Use AOT compilation for production deployments targeting known device SoCs to minimize initialization time and memory footprint. Pre-compile your .tflite models for specific target NPUs using the LiteRT Python library, then use Google Play for On-device AI (PODAI) for automatic delivery of the model and runtime.
   AOT compilation is particularly important for complex models where first-run latency matters. For simpler models distributed across many device types, on-device JIT compilation may be more practical despite higher first-run costs.
3. Use LiteRT's NPU acceleration for compute-intensive GenAI workloads to reach up to 100x faster performance than CPU. MediaTek and Qualcomm NPU integrations are now production-ready, with LiteRT handling automatic delegation and falling back to GPU or CPU when the NPU is unavailable.
   This is critical for deploying large language models like Gemma on-device, where NPU acceleration provides an additional 2x gain over GPU for compute-bound prefill operations.
4. Use the LiteRT Torch library to convert PyTorch models directly to .tflite format in a single step, eliminating complex intermediate translation workflows. PyTorch-based architectures can then benefit from LiteRT's hardware acceleration across CPU, GPU, and NPU backends.
   This is valuable for teams that train in PyTorch and want to deploy to edge devices without rewriting models in TensorFlow, which shortens the path from research to production.
5. Implement zero-copy buffer interoperability when building GPU-accelerated pipelines to eliminate unnecessary CPU overhead. Use TensorBuffer::CreateFromGlBuffer to wrap existing OpenGL buffers directly, and access outputs as AHardwareBuffers for efficient downstream processing.
   This optimization is demonstrated in the LiteRT Segmentation sample app and is essential for real-time use cases like background segmentation and automatic speech recognition, where end-to-end latency is critical.
6. Evaluate LiteRT as a replacement for llama.cpp when deploying GenAI models on mobile devices, given the reported advantages of 3x on CPU and up to 19x on GPU. The LiteRT-LM orchestration layer handles LLM-specific complexities and is the same infrastructure powering Gemini Nano in Google products.
   These figures were benchmarked with Gemma 3 1B on a Samsung Galaxy S25 Ultra. The gains are largest for prefill (compute-bound) operations, where GPU acceleration shows the biggest improvement.
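
The single-step PyTorch conversion mentioned in insight 4 can be sketched roughly as follows. This is a minimal sketch, not the article's own code: it assumes the converter is importable under the package name `ai_edge_torch` (the name the library has shipped under; your installation may use a different module name), and the wrapper function `convert_to_tflite` is a hypothetical helper introduced here for illustration.

```python
def convert_to_tflite(model, sample_inputs, out_path="model.tflite"):
    """Convert a PyTorch module to a .tflite file and return the output path.

    `sample_inputs` is a tuple of example tensors used to trace the model.
    """
    # Assumption: the converter is importable as `ai_edge_torch`.
    # Imported lazily so this helper can be defined without the package installed.
    import ai_edge_torch

    # Trace the model in eval mode with the example inputs, then write the file.
    edge_model = ai_edge_torch.convert(model.eval(), sample_inputs)
    edge_model.export(out_path)
    return out_path


# Example call (requires torch and the converter package to be installed):
#   import torch
#   toy = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU())
#   convert_to_tflite(toy, (torch.randn(1, 4),), "toy.tflite")
```

The sample inputs only fix shapes and dtypes for tracing, so random tensors of the right shape are usually enough.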

Common Pitfalls

1. Using the legacy interpreter API with the TFLite GPU delegate instead of LiteRT's new CompiledModel API when building new applications. The interpreter API is maintained for backward compatibility, but developers miss out on the 1.4x average GPU performance improvement and the full GPU and NPU acceleration that the CompiledModel API provides.
   The interpreter API should only be used for existing production models that need stability. For new AI features, use the CompiledModel API to access next-generation acceleration capabilities.
2. Attempting to navigate vendor-specific NPU SDKs directly rather than using LiteRT's unified abstraction layer. With hundreds of NPU SoC variants, building ad-hoc deployment workflows for each vendor creates complex, hard-to-maintain production systems that don't scale across device types.
   LiteRT's three-step NPU workflow abstracts away low-level SDK details and provides automatic fallback to GPU or CPU, making NPU deployment manageable even across fragmented hardware.
3. Choosing on-device JIT compilation for complex models where target SoCs are known in advance. JIT compilation requires no preparation, but it incurs higher first-run initialization costs that can degrade the user experience for large models, whereas AOT compilation provides an instant-start experience.
   Reserve on-device compilation for small models distributed across many unknown device types. For complex models targeting specific chipsets, AOT compilation minimizes initialization time and memory footprint.
4. Not using asynchronous execution and zero-copy buffer interoperability for real-time inference pipelines. Without these optimizations, CPU overhead from synchronous execution and data copying can negate the performance benefits of GPU acceleration, especially in latency-critical applications.
   These techniques are essential for real-time use cases like background segmentation and speech recognition, where they can yield up to 2x performance improvements by eliminating CPU bottlenecks.
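
The AOT-versus-JIT guidance in pitfall 3 can be written down as a small decision helper. This is only a sketch of the rule of thumb stated above. The class and function names are hypothetical, and the "benchmark both" outcome for cases the text does not cover is an assumption added here.

```python
from dataclasses import dataclass


@dataclass
class DeploymentProfile:
    model_is_complex: bool   # e.g. a large GenAI model rather than a small classifier
    target_socs_known: bool  # True if the set of target chipsets is fixed in advance


def choose_npu_compilation(profile: DeploymentProfile) -> str:
    """Return a compilation recommendation following the guidance above."""
    if profile.model_is_complex and profile.target_socs_known:
        # Complex model, known chipsets: pre-compile to avoid first-run cost.
        return "aot"
    if not profile.model_is_complex and not profile.target_socs_known:
        # Small model, many unknown devices: compile on the device.
        return "jit"
    # Assumption: the text gives no rule for the mixed cases, so measure.
    return "benchmark both"


print(choose_npu_compilation(DeploymentProfile(model_is_complex=True,
                                               target_socs_known=True)))
```

In practice the thresholds for "complex" and "known" depend on the application, so treat the two booleans as placeholders for whatever criteria your team settles on.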

Related Concepts

On-device AI Inference
Hardware Acceleration (GPU, NPU, CPU)
Model Quantization And Optimization
Edge Computing
Neural Processing Units (NPU)
Ahead-of-Time (AOT) Compilation
Just-in-Time (JIT) Compilation
Zero-copy Buffer Interoperability
Asynchronous Execution
Model Conversion Pipelines
Large Language Model Deployment
Cross-platform ML Deployment
TensorFlow Lite Migration
On-device Generative AI