Fetching and serving billions of URLs with Aragog

Pinterest Engineering

Overview

The article discusses the development and implementation of Aragog, a system designed by Pinterest to efficiently fetch, store, process, and serve billions of URLs at low latencies. It highlights the architecture of Aragog, including its components like the Aragog Fetcher and UrlStore, and addresses key considerations for managing large-scale URL data.

What You'll Learn

1. How to implement URL normalization and canonicalization in a large-scale system
2. Why crawl politeness is crucial when fetching URLs at scale
3. How to design a federated storage system for URL metadata

Prerequisites & Requirements

Understanding of web crawling and data storage concepts
Familiarity with Thrift and HBase (optional)

Key Questions Answered

What is Aragog and what does it do?

Aragog is a suite of systems developed by Pinterest to fetch, store, process, and serve billions of URLs efficiently. It enables the company to create a rich user experience by leveraging the metadata and signals from the fetched web pages, all while maintaining low latency.

How does the Aragog Fetcher ensure crawl politeness?

The Aragog Fetcher respects the rules in robots.txt and rate-limits the traffic it sends to each domain. It caches each robots.txt file for seven days and uses a rate limiter to cap requests to a single domain at 10 queries per second (QPS).
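
The robots.txt check with a seven-day cache can be sketched as follows. This is a minimal illustration, not Aragog's implementation: the RobotsCache class, the load_robots_txt callable, and the "example-bot" user agent are hypothetical names introduced here. Only the seven-day duration comes from the article. Parsing uses Python's standard urllib.robotparser module.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 7 * 24 * 60 * 60  # seven days, the duration given in the article


class RobotsCache:
    """Caches parsed robots.txt rules per domain (hypothetical helper)."""

    def __init__(self, load_robots_txt, ttl=ROBOTS_TTL_SECONDS, clock=time.time):
        self._load = load_robots_txt  # assumed callable: domain -> robots.txt text
        self._ttl = ttl
        self._clock = clock
        self._entries = {}            # domain -> (loaded_at, parser)

    def allowed(self, domain, url, user_agent="example-bot"):
        """Return True if the cached rules for the domain permit fetching the URL."""
        now = self._clock()
        entry = self._entries.get(domain)
        if entry is None or now - entry[0] >= self._ttl:
            # Cache miss or expired entry: reload and re-parse the rules.
            parser = RobotFileParser()
            parser.parse(self._load(domain).splitlines())
            entry = (now, parser)
            self._entries[domain] = entry
        return entry[1].can_fetch(user_agent, url)
```

Passing the loader and the clock in as parameters keeps the sketch free of network access and makes the expiry behavior easy to exercise in isolation.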

What types of data does the Aragog UrlStore manage?

The Aragog UrlStore manages metadata extracted from fetched pages, including the full page content, semi-structured data, and web graph metadata such as inlinks and outlinks. This allows product teams to build functionalities without needing their own scalable infrastructure.
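
One way to picture a store that fronts several kinds of URL data is a thin facade that routes each kind to its own backend. The sketch below is an assumption-laden illustration: FederatedUrlStore, the kind names, and the in-memory dictionaries standing in for real backends are all hypothetical and do not describe the UrlStore's actual API.

```python
class FederatedUrlStore:
    """Routes each kind of URL data to its own backend (hypothetical facade)."""

    def __init__(self, backends):
        # backends: kind name -> dict-like store keyed by URL
        self._backends = backends

    def put(self, kind, url, value):
        self._backends[kind][url] = value

    def get(self, url, kinds=None):
        """Return the requested kinds of data for a URL, skipping kinds with no entry."""
        kinds = kinds if kinds is not None else list(self._backends)
        result = {}
        for kind in kinds:
            value = self._backends[kind].get(url)
            if value is not None:
                result[kind] = value
        return result


# In-memory stand-ins; a real deployment would plug in separate storage systems.
store = FederatedUrlStore({
    "page_content": {},   # stand-in for a blob store
    "page_metadata": {},  # stand-in for a key-value store
    "web_graph": {},      # stand-in for a graph store
})
store.put("page_metadata", "https://example.com/", {"title": "Example"})
store.put("web_graph", "https://example.com/", {"outlinks": ["https://example.org/"]})
```

Callers ask for a URL and the kinds of data they need, without knowing which backend holds each kind.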

How does Aragog handle URL normalization?

Aragog performs URL normalization and canonicalization to deduplicate different representations of the same URL. This process is crucial for reducing storage requirements and ensuring accurate data retrieval from the UrlStore.
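
A minimal sketch of URL normalization, using only Python's standard urllib.parse module, is shown below. The specific rules (lowercasing the scheme and host, dropping default ports and fragments, sorting query parameters, and removing an assumed list of tracking parameters) are common choices rather than Aragog's documented rule set. Canonicalization that depends on page content is not covered.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Assumed list of parameters that do not change the page being addressed.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url):
    """Return one consistent spelling for equivalent forms of a URL."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: any userinfo is dropped
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With these rules, "HTTP://Example.COM:80/a?b=2&a=1&utm_source=x#frag" and "http://example.com/a?a=1&b=2" map to the same string, so both can share a single stored record.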

Key Statistics & Figures

Queries per second allowed to a single domain

10 QPS

This limit is enforced by the rate limiter to prevent overloading domains during URL fetching.

Caching duration for robots.txt

7 days

The Aragog Fetcher caches each domain's robots.txt file for this duration so that it does not have to re-fetch the file before every request.
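
The per-domain limit listed above could be enforced in several ways. The sketch below uses a token bucket, which is an assumption: the article gives the 10 QPS figure but not the algorithm. DomainRateLimiter and try_acquire are hypothetical names.

```python
import time
from collections import defaultdict


class DomainRateLimiter:
    """Token bucket per domain: at most max_qps requests per second on average."""

    def __init__(self, max_qps=10, clock=time.monotonic):
        self._max_qps = max_qps
        self._clock = clock
        # Each domain starts with a full bucket.
        self._tokens = defaultdict(lambda: float(max_qps))
        self._last_seen = {}

    def try_acquire(self, domain):
        """Return True if a request to the domain may be sent now."""
        now = self._clock()
        elapsed = now - self._last_seen.get(domain, now)
        self._last_seen[domain] = now
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(self._max_qps, self._tokens[domain] + elapsed * self._max_qps)
        if tokens >= 1:
            self._tokens[domain] = tokens - 1
            return True
        self._tokens[domain] = tokens
        return False
```

A token bucket permits short bursts of up to max_qps requests and then settles to the configured rate. A caller that receives False would typically delay or requeue the URL rather than drop it.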

Technologies & Tools

Backend

Thrift

The Aragog Fetcher is exposed as a Thrift service. The service issues the HTTP requests and retrieves page content on behalf of its callers.

Database

HBase

Serves as the underlying storage system for the Zen graph storage service used in Aragog.

Storage

S3

Used to store the full content of web pages fetched by Aragog.

Key Actionable Insights

1. Implementing URL normalization can significantly reduce data storage needs in large-scale systems.
Consolidating multiple representations of the same URL into one record saves storage costs and makes data retrieval more reliable.

2. Establishing a robust rate limiting mechanism is essential for maintaining crawl politeness.
This helps prevent overwhelming web servers and, together with honoring robots.txt, is critical for ethical web scraping practices.

3. Creating a federated storage system can streamline access to URL metadata across different teams.
This approach lets teams share one data layer instead of each building and operating its own storage.

Common Pitfalls

1. Failing to respect robots.txt can lead to being blocked by web servers.
It's crucial to implement checks for robots.txt rules to avoid legal and ethical issues in web scraping.

2. Poorly distributed S3 keys can cause performance degradation.
Keys that share a long common prefix, such as raw URLs from the same domain, can concentrate load and lead to hotspotting. Using a hash of the URL as the key spreads objects evenly across the key space and avoids this.
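
Deriving a storage key from a hash of the normalized URL can be sketched as follows. The choice of SHA-1 and the function name content_key are assumptions made for illustration. Any hash with uniformly distributed output would spread keys in the same way.

```python
import hashlib


def content_key(normalized_url):
    """Return a fixed-length hex key derived from the URL (hypothetical helper)."""
    return hashlib.sha1(normalized_url.encode("utf-8")).hexdigest()


# Two pages from the same domain share a long URL prefix,
# but their hashed keys are unrelated strings.
key_a = content_key("http://example.com/page/1")
key_b = content_key("http://example.com/page/2")
```

Because the key is computed from the normalized URL, every representation of the same page resolves to the same stored object.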

Related Concepts

Web Crawling Best Practices

Data Storage Optimization Techniques

Graph Data Modeling