LiteRT, the evolution of TFLite, is now the universal framework for on-device AI. It delivers up to 1.4x faster GPU performance than TFLite, new NPU support, and streamlined GenAI deployment for models like Gemma.
Overview
LiteRT has evolved from its TensorFlow Lite foundation into a universal on-device AI inference framework, now offering production-ready GPU acceleration across six platforms and streamlined NPU integration with MediaTek and Qualcomm. The framework delivers significant performance improvements over both TFLite and competitors like llama.cpp, with seamless model conversion from PyTorch, TensorFlow, and JAX, and optimized support for popular open models including the Gemma family.
What You'll Learn
How to leverage LiteRT's CompiledModel API for GPU-accelerated on-device inference across Android, iOS, macOS, Windows, Linux, and Web
How to deploy ML models to NPUs using LiteRT's three-step workflow with AOT or JIT compilation
Why LiteRT outperforms llama.cpp for on-device GenAI deployment and how to use the integrated LiteRT-LM stack
How to convert PyTorch, TensorFlow, and JAX models to the .tflite format for edge deployment
When to choose AOT compilation versus on-device JIT compilation for NPU acceleration
Prerequisites & Requirements
- Understanding of ML model inference concepts (CPU, GPU, NPU execution)
- Familiarity with at least one ML framework (PyTorch, TensorFlow, or JAX)
- Basic understanding of on-device/edge AI deployment challenges
- Experience with C++ or Python for model conversion and integration (optional)
- Familiarity with TensorFlow Lite (.tflite) model format (optional)
Key Questions Answered
What platforms does LiteRT support for GPU acceleration?
How much faster is LiteRT compared to TensorFlow Lite for GPU inference?
How does LiteRT handle NPU fragmentation across different chip vendors?
How does LiteRT performance compare to llama.cpp for running Gemma 3 on mobile?
What ML frameworks can convert models to LiteRT format?
What is the difference between AOT and on-device JIT compilation in LiteRT?
How fast is LiteRT NPU acceleration compared to CPU and GPU?
What open models does LiteRT support for on-device deployment?
Key Actionable Insights
1. Migrate from TFLite GPU delegate to LiteRT's new CompiledModel API to achieve 1.4x average GPU performance improvement. The new API provides a modern interface that unlocks full GPU and NPU acceleration potential, while the legacy interpreter API remains available for existing production models. This is especially impactful for latency-sensitive applications like real-time segmentation and speech recognition, where the combination of asynchronous execution and zero-copy buffer interoperability can yield up to 2x speedups.
2. Use AOT compilation for production deployments targeting known device SoCs to minimize initialization time and memory footprint. Pre-compile your .tflite models for specific target NPUs using the LiteRT Python library, then leverage Google Play for On-device AI (PODAI) for automatic delivery of the model and runtime. AOT compilation is particularly important for complex models where first-run latency matters. For simpler models distributed across many device types, on-device JIT compilation may be more practical despite higher first-run costs.
3. Leverage LiteRT's NPU acceleration for compute-intensive GenAI workloads to unlock up to 100x faster performance than CPU. MediaTek and Qualcomm NPU integrations are now production-ready, with LiteRT handling automatic delegation and providing robust fallback to GPU or CPU when NPU is unavailable. This is critical for deploying large language models like Gemma on-device, where NPU acceleration provides an additional 2x gain over GPU for compute-bound prefill operations.
4. Use the LiteRT Torch library to convert PyTorch models directly to .tflite format in a single step, eliminating complex intermediate translation workflows. This enables PyTorch-based architectures to immediately benefit from LiteRT's advanced hardware acceleration across CPU, GPU, and NPU backends. This is valuable for teams training models in PyTorch who want to deploy to edge devices without rewriting models in TensorFlow, shortening the path from research to production.
5. Implement zero-copy buffer interoperability when building GPU-accelerated pipelines to eliminate unnecessary CPU overhead. Use TensorBuffer::CreateFromGlBuffer to wrap existing OpenGL buffers directly, and access outputs as AHardwareBuffers for efficient downstream processing. This optimization is demonstrated in the LiteRT Segmentation sample app and is essential for real-time use cases like background segmentation and automatic speech recognition where end-to-end latency is critical.
6. Evaluate LiteRT as a replacement for llama.cpp when deploying GenAI models on mobile devices, given the substantial performance advantages of 3x on CPU and up to 19x on GPU. The LiteRT-LM orchestration layer handles LLM-specific complexities and is the same infrastructure powering Gemini Nano in Google products. Benchmarked with Gemma 3 1B on Samsung Galaxy S25 Ultra, these gains are particularly significant for prefill (compute-bound) operations where GPU acceleration shows the largest improvement.
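The decision logic behind insights 2 and 3 (prefer NPU, fall back to GPU or CPU, and pick AOT when the target SoCs are known) can be sketched in a few lines of plain Python. This is only an illustration of the reasoning, not LiteRT code: LiteRT performs delegation and fallback itself, and the names `Accelerator`, `pick_accelerator`, and `pick_npu_compilation` are hypothetical helpers invented for this example.

```python
from enum import Enum


class Accelerator(Enum):
    """Hypothetical labels for the three execution targets named in the text."""
    NPU = "npu"
    GPU = "gpu"
    CPU = "cpu"


# Preference order taken from the text: NPU first, then GPU, then CPU.
PREFERENCE = (Accelerator.NPU, Accelerator.GPU, Accelerator.CPU)


def pick_accelerator(available):
    """Return the most preferred accelerator present in `available`.

    Falls back to CPU when nothing else is reported (assumption: a CPU
    path always exists on the device).
    """
    for accelerator in PREFERENCE:
        if accelerator in available:
            return accelerator
    return Accelerator.CPU


def pick_npu_compilation(target_socs_known):
    """Hypothetical rule of thumb: AOT when target SoCs are known, else JIT."""
    return "aot" if target_socs_known else "jit"


if __name__ == "__main__":
    # A device that reports no NPU should land on the GPU.
    print(pick_accelerator({Accelerator.GPU, Accelerator.CPU}).value)
    # A model shipped to an unknown mix of devices compiles on-device.
    print(pick_npu_compilation(target_socs_known=False))
```

A real deployment would weigh more than one flag, for example model complexity and first-run latency budget, so treat the single-boolean rule as a starting point rather than a policy.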