Nemo: Data discovery at Facebook

Haran Talmon

Large-scale companies serve millions or even billions of people who depend on the services these companies provide for their everyday needs. To keep these services running and delivering meaningful…

Overview

The article discusses Nemo, Facebook's internal data discovery engine designed to enhance the efficiency of data retrieval for engineers. It highlights the challenges of data discovery in large organizations and explains how Nemo improves search capabilities, accuracy, and user experience.

What You'll Learn

1

How to utilize Nemo for efficient data discovery at scale

2

Why effective data indexing is crucial for large organizations

3

How to implement natural language queries in data search systems

Prerequisites & Requirements

Understanding of data artifacts and their metadata
Familiarity with Elasticsearch and data indexing concepts (optional)

Key Questions Answered

How does Nemo improve data discovery at Facebook?

Nemo indexes metadata about data artifacts and serves searches through a dedicated search infrastructure, so engineers can find relevant data quickly and accurately. It also incorporates personalization signals and natural language processing. After its introduction, the data search success rate increased by more than 50 percent.

What are the main components of Nemo's architecture?

Nemo consists of two major components: indexing and serving. Indexing includes bulk and instant updates to ensure data artifacts are searchable immediately, while serving utilizes a spaCy-based NLP library and Unicorn for retrieval and ranking, enhancing the overall search quality.
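The difference between bulk and instant indexing can be illustrated with a small sketch. Everything in it is hypothetical: the `ArtifactIndex` class, its method names, and the sample table names are made up for illustration and are not Nemo's or Unicorn's API. It assumes that bulk indexing rebuilds the index from a full metadata snapshot, while an instant update patches a single artifact so it is searchable right away.

```python
from collections import defaultdict


class ArtifactIndex:
    """Toy in-memory inverted index over data-artifact metadata (hypothetical)."""

    def __init__(self):
        self._docs = {}                    # artifact name -> description
        self._postings = defaultdict(set)  # token -> set of artifact names

    @staticmethod
    def _tokens(text):
        # Treat underscores as word breaks so table names become searchable words.
        return set(text.lower().replace("_", " ").split())

    def bulk_index(self, snapshot):
        """Rebuild the whole index from a full metadata snapshot."""
        self._docs.clear()
        self._postings.clear()
        for name, description in snapshot.items():
            self.instant_update(name, description)

    def instant_update(self, name, description):
        """Add or refresh one artifact without rebuilding the index."""
        self.remove(name)
        self._docs[name] = description
        for token in self._tokens(name + " " + description):
            self._postings[token].add(name)

    def remove(self, name):
        """Drop one artifact and its postings, if it is indexed."""
        old = self._docs.pop(name, None)
        if old is not None:
            for token in self._tokens(name + " " + old):
                self._postings[token].discard(name)

    def search(self, query):
        """Return artifacts that match every query token, sorted by name."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = ArtifactIndex()
index.bulk_index({
    "dim_users": "user dimension table",
    "fct_events": "event fact table",
})
# A table created after the last bulk run becomes searchable immediately.
index.instant_update("weekly_active_users", "weekly active users by app")
print(index.search("weekly users"))
```

In a production system the serving side would add ranking on top of retrieval; this sketch only covers exact token matching.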

What challenges does Nemo address in data discovery?

Nemo addresses challenges such as obscure naming conventions for data tables, overlapping datasets from different teams, and the need for confidence in data accuracy. By streamlining the search process and improving metadata indexing, it significantly reduces the pain points faced by data engineers.

How does Nemo handle natural language queries?

Nemo can parse and respond to natural language queries, allowing users to ask questions like 'How many weekly active users are there on Instagram?' and receive pointers to the relevant data tables. This feature enhances user accessibility and reduces reliance on technical jargon.
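As a rough illustration of how a question can be mapped to table pointers, the sketch below strips filler words from the question and scores catalog entries by keyword overlap. It is a deliberately simple stand-in and not Nemo's spaCy-based parser: the stopword list, the function names, and the catalog entries are all assumptions made for this example.

```python
import re

# Hypothetical stopword list; a real system would rely on an NLP library instead.
STOPWORDS = {
    "how", "many", "are", "there", "on", "the", "a", "an",
    "of", "is", "what", "in", "for", "do", "we", "have",
}


def extract_keywords(question):
    """Lowercase, tokenize, and drop stopwords, preserving word order."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOPWORDS]


def find_tables(question, catalog):
    """Rank catalog entries by how many question keywords their metadata contains."""
    keywords = extract_keywords(question)
    scored = []
    for name, description in catalog.items():
        text = set(re.findall(r"[a-z0-9]+", (name + " " + description).lower()))
        score = sum(1 for k in keywords if k in text)
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Made-up catalog entries for illustration only.
catalog = {
    "ig_weekly_active_users": "Weekly active users on Instagram",
    "fb_daily_sessions": "Daily sessions per user on Facebook",
    "ig_ad_clicks": "Ad clicks on Instagram",
}
question = "How many weekly active users are there on Instagram?"
print(extract_keywords(question))
print(find_tables(question, catalog))
```

Keyword overlap ignores word forms ("user" does not match "users") and synonyms, which is one reason a real system would use proper linguistic parsing.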

Key Statistics & Figures

Data search success rate increase

more than 50 percent

This improvement was observed despite the total number of data artifacts tripling and queries per second doubling.

Types of data artifacts

more than a dozen different types

These include Hive tables, Scuba tables, dashboards, AI data sets, and Cubrick.

Queries per second (QPS) increase

more than doubled

This reflects the enhanced capability of Nemo to handle increased data retrieval demands.

Technologies & Tools


Data Discovery Engine

Nemo

Used for improving data retrieval efficiency and accuracy.

Search Engine

Elasticsearch

Originally used for indexing metadata about data artifacts.

Search Infrastructure

Unicorn

Used for retrieving and ranking data artifacts at serving time.

NLP Library

spaCy

Used for text parsing in the search engine.

Query Engine

Presto

Used for querying data artifacts and noting access patterns.

Key Actionable Insights

1
Implementing a robust data indexing system like Nemo can drastically improve data retrieval times and accuracy.
As organizations grow, the complexity of data increases. A well-structured indexing system ensures that engineers can find the necessary data quickly, which is crucial for timely decision-making.

2
Incorporating natural language processing capabilities into data search systems can enhance user experience.
By allowing users to query data in natural language, organizations can reduce the learning curve for new employees and improve overall productivity.

3
Utilizing trust signals in data ranking can help prioritize high-quality data artifacts.
By focusing on data quality and recency, teams can avoid outdated or irrelevant data, ensuring that decisions are based on the most accurate information available.
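One way to picture trust-based ranking from the third insight is to scale a text-match score by a quality signal and a recency decay. The sketch below is an assumption-laden illustration, not Nemo's ranking function: the `Artifact` fields, the 30-day half-life, and the weighting formula are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Artifact:
    name: str
    relevance: float    # text-match score from retrieval, assumed in [0, 1]
    quality: float      # hypothetical trust signal in [0, 1], e.g. documented and owned
    last_updated: date


def trust_score(artifact, today, half_life_days=30.0):
    """Combine relevance with quality and an exponential recency decay."""
    age_days = max((today - artifact.last_updated).days, 0)
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    # Quality scales the score between 50% and 100% of its relevance.
    return artifact.relevance * (0.5 + 0.5 * artifact.quality) * recency


def rank(artifacts, today):
    """Order artifacts from most to least trustworthy match."""
    return sorted(artifacts, key=lambda a: trust_score(a, today), reverse=True)


today = date(2024, 1, 31)
candidates = [
    Artifact("stale_but_exact_match", 0.9, 0.2, date(2023, 1, 31)),
    Artifact("fresh_and_documented", 0.7, 0.9, date(2024, 1, 30)),
]
for artifact in rank(candidates, today):
    print(artifact.name, round(trust_score(artifact, today), 4))
```

Here the fresher, better-documented table outranks the stale one even though its text match is weaker, which is the behavior the insight describes.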

Common Pitfalls

1

Relying solely on grassroots solutions for data discovery can lead to scalability issues.

As organizations grow, the number of data artifacts increases, making it difficult for ad-hoc solutions to keep up with demand. A centralized, scalable solution like Nemo is essential for maintaining efficiency.

2

Failing to maintain up-to-date metadata can result in engineers using outdated data artifacts.

Without a robust indexing and updating mechanism, users may inadvertently rely on stale data, leading to poor decision-making. Nemo addresses this by ensuring real-time updates to its index.

Related Concepts

Data Indexing Techniques

Natural Language Processing In Search

Scalable Data Architectures