Rosetta: Understanding text in images and videos with machine learning

Understanding the text that appears on images is important for improving experiences, such as a more relevant photo search or the incorporation of text into screen readers that make Facebook more a…

Viswanath Sivakumar
14 min read, intermediate

Overview

The article describes Rosetta, a large-scale machine learning system developed by Facebook to understand text in images and videos. It explains where traditional optical character recognition (OCR) systems fall short and details how Rosetta extracts text from over a billion images daily, using techniques such as Faster R-CNN for detection and connectionist temporal classification (CTC) loss for recognition.

What You'll Learn

1. How to implement a text extraction model using machine learning techniques
2. Why traditional OCR systems are insufficient for understanding text in images
3. How to optimize machine learning models for real-time text detection and recognition

Prerequisites & Requirements

  • Understanding of machine learning concepts and neural networks
  • Familiarity with the Detectron framework and Caffe2 (optional)

Key Questions Answered

How does Rosetta extract text from images and videos?
Rosetta extracts text through a two-step process: detection and recognition. It uses Faster R-CNN for detecting text regions and a CNN model for recognizing the text, enabling real-time processing of over a billion images daily.
What challenges does Rosetta face with multilingual text recognition?
Rosetta supports multiple languages, including Arabic and Hindi, which present their own challenges, such as the right-to-left reading order of Arabic. For right-to-left text, the system applies a post-processing trick that reverses the predicted characters so the output reads in the correct order.
What techniques are used to optimize the text detection model's performance?
The text detection model is optimized using techniques like quantization to reduce memory bandwidth requirements and net-aware quantization to minimize accuracy loss. These methods help maintain efficiency while processing large volumes of images.
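
As a rough illustration of the quantization idea above, the sketch below maps floating-point values to signed 8-bit integers with a single scale and zero point. It is plain linear quantization in standard-library Python, not Rosetta's implementation and not the net-aware variant. The 8-bit width and the function names `quantize_int8` and `dequantize_int8` are assumptions chosen for the example.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed 8-bit range (assumed bit width)


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Linearly quantize floats to int8; returns (quantized, scale, zero_point)."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to a scale of 1.0 when every value is zero.
    scale = (hi - lo) / (QMAX - QMIN) or 1.0
    zero_point = round(QMIN - lo / scale)
    quantized = [
        max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_int8(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.5, 2.0]
    q, scale, zp = quantize_int8(weights)
    print(q, scale, zp)
    print(dequantize_int8(q, scale, zp))
```

Each value now occupies one byte instead of four, which is where the memory bandwidth saving comes from. The price is a rounding error of roughly one quantization step per value.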

Key Statistics & Figures

Images processed daily
over a billion
Rosetta extracts text from this volume of public Facebook and Instagram images every day, in real time.

Technologies & Tools

Object Detection Model
Faster R-CNN
Used for detecting text regions in images.
Machine Learning Framework
Caffe2
Used for running the models during inference.
Machine Learning Framework
Detectron
Utilized for training the text detection model.
Loss Function
Connectionist Temporal Classification (CTC)
Employed for training the text recognition model.
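
A model trained with CTC loss emits one score vector per horizontal position, and those per-frame predictions have to be collapsed into a word. The sketch below shows greedy (best-path) decoding, a common way to do this, together with the character reversal that the article describes for right-to-left languages. It is a minimal illustration, not Rosetta's decoder: the blank index of 0, the `alphabet` layout, and the `right_to_left` flag are all assumptions made for the example.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: str,
    right_to_left: bool = False,
) -> str:
    """Greedy CTC decoding: pick the best class per frame, merge repeats, drop blanks.

    alphabet[i] is the character for class index i + 1 (index 0 is the blank).
    """
    # Step 1: highest-scoring class index for each frame.
    best = [max(range(len(f)), key=lambda i: f[i]) for f in frame_scores]

    # Step 2: collapse consecutive repeats, then remove blanks.
    chars: List[str] = []
    prev = BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            chars.append(alphabet[idx - 1])
        prev = idx

    # Step 3 (assumed post-processing): reverse for right-to-left scripts.
    if right_to_left:
        chars.reverse()
    return "".join(chars)


if __name__ == "__main__":
    # Five frames over classes [blank, 'a', 'b'].
    frames = [
        [0.1, 0.8, 0.1],
        [0.1, 0.8, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
    ]
    print(ctc_greedy_decode(frames, "ab"))
    print(ctc_greedy_decode(frames, "ab", right_to_left=True))
```

The blank between the second and fourth frames is what lets the decoder keep two separate "a" characters instead of merging them into one.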

Key Actionable Insights

1. Implementing a two-step text extraction process can significantly enhance the accuracy of text recognition in images.
   By separating detection and recognition, systems can better handle the complexities of varied text appearances, leading to improved performance in applications like photo search and content moderation.
2. Utilizing synthetic data generation can alleviate the challenges of manual data annotation for training models.
   As the distribution of textual images changes rapidly, synthetic data can help maintain a robust training set, allowing for quick adaptation to new languages and text styles.
3. Adopting curriculum learning strategies can improve model training efficiency and accuracy.
   By starting with simpler tasks and gradually increasing complexity, models can better learn to handle longer and more complex words, which is crucial for effective text recognition.
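
One simple way to picture the curriculum idea in the last insight is to order training by word length: begin with short words only, then raise the length cap stage by stage until the full dataset is in play. The sketch below does this for a list of labelled samples. It is an illustration under assumptions, not the article's training code: `curriculum_stages`, the starting cap of 5 characters, and the step of 3 are hypothetical choices.

```python
from typing import Iterator, List, Sequence, Tuple

# (image identifier, ground-truth word); both fields are placeholders.
Sample = Tuple[str, str]


def curriculum_stages(
    samples: Sequence[Sample],
    start_len: int = 5,  # assumed initial cap on word length
    step: int = 3,       # assumed growth of the cap per stage
) -> Iterator[Tuple[int, List[Sample]]]:
    """Yield (max_word_length, training_subset) pairs of increasing difficulty."""
    longest = max((len(word) for _, word in samples), default=0)
    max_len = start_len
    while True:
        # Keep only the words short enough for the current stage.
        stage = [s for s in samples if len(s[1]) <= max_len]
        yield max_len, stage
        if max_len >= longest:
            break  # the last stage already covers every sample
        max_len += step


if __name__ == "__main__":
    data = [
        ("img1", "cat"),
        ("img2", "pizza"),
        ("img3", "birthday"),
        ("img4", "congratulations"),
    ]
    for cap, subset in curriculum_stages(data):
        # A real training loop would run one or more epochs on `subset` here.
        print(cap, [word for _, word in subset])
```

Each stage is a superset of the previous one, so the model keeps seeing the easy examples while longer words are introduced.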

Common Pitfalls

1. Relying solely on traditional OCR systems can lead to inadequate text recognition in complex images.
   Traditional OCR often fails to understand the context or variations in text presentation, making it less effective for modern applications that require nuanced understanding.

Related Concepts

Machine Learning Techniques for Image Processing
Optical Character Recognition Advancements
Multilingual Text Recognition Challenges