On-device small language models with multimodality, RAG, and Function Calling

Mark Sherwood, Matthew Chan, Marissa Ikonomidis

Google AI Edge advancements include new Gemma 3 models, broader model support, and features such as on-device RAG and Function Calling that enhance on-device generative AI capabilities.

Google

Overview

The article discusses the expansion of Google's AI Edge platform to support on-device small language models (SLMs) with multimodal capabilities, including the introduction of the Gemma 3 and Gemma 3n models. It highlights the integration of Retrieval Augmented Generation (RAG) and Function Calling libraries to enhance the functionality of these models for developers.

What You'll Learn

1. How to utilize on-device small language models for multimodal inputs
2. Why Retrieval Augmented Generation (RAG) enhances language model performance
3. How to implement function calling with on-device language models

Key Questions Answered

What are the capabilities of the Gemma 3n model?

The Gemma 3n model accepts multimodal input: text, image, video, and audio. It is available in 2B and 4B parameter variants, so developers can choose between a smaller footprint and greater capability for on-device use.

How does Retrieval Augmented Generation (RAG) work?

RAG lets a small language model use application-specific data without fine-tuning. The data can be large, for example 1,000 pages or photos; at query time only the most relevant pieces are retrieved and passed to the model as input.
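The retrieval step can be illustrated with a minimal sketch. It is not the AI Edge RAG library's API: the word-count "embedding", the function names (`embed`, `retrieve`, `build_prompt`), and the sample chunks are all assumptions made for illustration, and a real pipeline would use a learned embedding model and a vector store.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Place the retrieved chunks ahead of the question as context for the model."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Use only the context below to answer.\nContext:\n{context}\nQuestion: {query}"

# Hypothetical application data, already split into chunks.
CHUNKS = [
    "The Austin warehouse is open from 8 am on weekdays.",
    "Returns are accepted within 30 days with a receipt.",
    "The Austin warehouse is closed on public holidays.",
]
```

With these sample chunks, `build_prompt("When does the Austin warehouse open?", CHUNKS)` places the opening-hours chunk first and leaves the returns policy out of the prompt, so the model sees only the relevant data.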

What is the purpose of the Function Calling library?

The Function Calling library enables on-device language models to interact with application functions, allowing them to execute predefined actions based on user input. This feature enhances the interactivity of applications using language models.
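As a rough sketch of the idea, the application registers the functions it is willing to expose, the model emits a structured call, and the application validates and runs it. The JSON call format, the `tool` decorator, and the two example functions below are hypothetical and do not reflect the Function Calling library's actual interface.

```python
import json
from typing import Any, Callable

# Registry of functions the application exposes to the model (hypothetical names).
TOOLS: dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so the model can request it."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def set_alarm(hour: int, minute: int) -> str:
    return f"Alarm set for {hour:02d}:{minute:02d}"

@tool
def get_stock(item: str) -> str:
    inventory = {"bandages": 120, "gloves": 45}  # stand-in for real app data
    return f"{item}: {inventory.get(item, 0)} in stock"

def dispatch(model_output: str) -> str:
    """Run a JSON tool call if the output is one; otherwise pass the text through.

    Assumed call format: {"name": "<function>", "args": {...}}.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # ordinary text answer, nothing to execute
    if not isinstance(call, dict):
        return model_output
    name = call.get("name")
    if name not in TOOLS:
        return f"Unknown function: {name}"  # never run unregistered code
    return TOOLS[name](**call.get("args", {}))
```

For example, `dispatch('{"name": "set_alarm", "args": {"hour": 7, "minute": 5}}')` runs the registered function, while plain text from the model is returned unchanged. Restricting execution to the registry keeps the model limited to predefined actions.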

Key Statistics & Figures

Gemma 3 1B model size

529MB

This compact size allows the model to run efficiently on mobile devices.

Tokens processed per second by Gemma 3 1B

2,585 tokens per second

This performance enables the model to process a page of content in under a second.
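The "under a second" claim can be checked with simple arithmetic. The page length and tokens-per-word ratio below are assumptions chosen for illustration; only the 2,585 tokens per second figure comes from the text.

```python
TOKENS_PER_SECOND = 2585  # throughput quoted above for Gemma 3 1B

def seconds_to_process(num_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Time needed to process num_tokens at a fixed throughput."""
    return num_tokens / tokens_per_second

# Assumption: a page holds about 500 words at roughly 1.3 tokens per word.
PAGE_TOKENS = round(500 * 1.3)
PAGE_SECONDS = seconds_to_process(PAGE_TOKENS)
```

Under these assumptions a page is about 650 tokens and takes roughly a quarter of a second, so even a page several times denser would still fit within one second.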

Reduction factor in model size with int4 quantization

2.5-4X

This quantization scheme significantly decreases latency and memory consumption compared to the default bf16 data type.
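To see where a factor close to 4X can come from, here is a generic sketch of symmetric block-wise int4 quantization. It is not the scheme Google AI Edge uses; the block layout, the 16-bit scale, and the function names are assumptions for illustration.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map one block of weights to integers in [-8, 7] with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the int4 values and the block scale."""
    return [v * scale for v in q]

def size_ratio(block_size: int, scale_bits: int = 16) -> float:
    """Size of a bf16 block divided by its int4 size including one scale per block."""
    bf16_bits = 16 * block_size
    int4_bits = 4 * block_size + scale_bits
    return bf16_bits / int4_bits
```

Going from 16 bits to 4 bits per weight gives at most 4X. Storing one scale per block lowers that, for example to about 3.6X with 32-weight blocks in this sketch. Tensors left at higher precision would lower it further, which may be one reason the reported range starts at 2.5X, though the source does not say.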

Technologies & Tools

AI/ML Model

Gemma 3

Used for on-device processing of language tasks.

AI/ML Model

Gemma 3n

Supports multimodal inputs for enhanced interaction.

AI/ML Library

Retrieval Augmented Generation (RAG)

Augments language models with application-specific data.

AI/ML Library

Function Calling

Enables interaction between language models and application functions.

Key Actionable Insights

1. Leverage the Gemma 3n model for enterprise applications that require multimodal input processing.
Its ability to handle text, image, video, and audio input suits scenarios where users need to interact with applications hands-free or in low-connectivity environments.

2. Utilize the RAG library to enhance the relevance of responses from your language model.
With RAG, responses are grounded in the application's own data, so the information users receive is contextually appropriate.

3. Implement the Function Calling library to create more interactive applications.
Applications can then respond dynamically to user commands, which makes them more intuitive, particularly in fields like healthcare or inventory management.

Common Pitfalls

1. Failing to optimize models for on-device use can lead to performance issues.
Developers should ensure that models are appropriately quantized and optimized for the specific hardware they will run on to avoid latency and memory issues.
