Fighting spam with Guardian, a real-time analytics and rules engine

Pinterest Engineering
12 min read · advanced

Overview

The article discusses Guardian, a real-time analytics and rules engine developed by Pinterest's Trust & Safety team to combat spam. It details the evolution from a Python-based system to a more efficient Elixir-based architecture, highlighting improvements in rule creation, query processing, and overall system efficiency.

What You'll Learn

1. How to streamline spam detection using a real-time analytics engine
2. Why using SQL-like queries improves rule creation efficiency
3. How to implement feature enrichment for better spam detection

Prerequisites & Requirements

  • Understanding of real-time analytics and rules engines
  • Familiarity with SQL and data querying (optional)
  • Experience with Elixir or similar programming languages (optional)

Key Questions Answered

How does Guardian improve spam detection at Pinterest?
Guardian unifies data into a single denormalized table, which allows faster queries and real-time analytics. It replaces the previous Python-based engine and speeds up rule creation and back-testing, cutting the time to create a new rule from days to less than an hour.
What are the main features of the Guardian query engine?
The Guardian query engine supports a SQL-like syntax, includes custom functions for analytics, and returns query results instantaneously. It also integrates with a Presto Connector so that Guardian data can be joined with additional data.
What challenges did Pinterest face with its previous spam detection system?
The previous Python-based system was not scalable, and creating a rule took multiple steps: logging data, querying it, and translating the results into Python code. This complexity caused delays and increased the risk of bugs affecting legitimate users.
How does Guardian handle real-time data processing?
Guardian processes events in real time by grouping them into segments and storing them in a columnar format. A query reads only the columns it needs, which significantly improves performance compared to traditional systems like Hive and Presto.
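
The segment-and-column layout described in the last answer can be illustrated with a small sketch. It is not Guardian's implementation (which the article describes as Elixir-based); the `Segment` class, the `SEGMENT_SIZE` value, and the event fields are all hypothetical, chosen only to show why a query that filters on one column never has to touch the others.

```python
from typing import Any, Dict, List

SEGMENT_SIZE = 3  # hypothetical; kept tiny so the example is easy to follow


class Segment:
    """A batch of events stored column by column instead of row by row."""

    def __init__(self, events: List[Dict[str, Any]]) -> None:
        self.num_rows = len(events)
        # Collect every field name seen in this batch, then build one list per field.
        names = sorted({name for event in events for name in event})
        self.columns: Dict[str, List[Any]] = {
            name: [event.get(name) for event in events] for name in names
        }


def build_segments(events: List[Dict[str, Any]], size: int = SEGMENT_SIZE) -> List[Segment]:
    """Group incoming events into fixed-size segments, in arrival order."""
    return [Segment(events[i:i + size]) for i in range(0, len(events), size)]


def count_where(segments: List[Segment], column: str, value: Any) -> int:
    """Count matching rows while reading only the one column the query needs."""
    total = 0
    for segment in segments:
        total += sum(1 for v in segment.columns.get(column, []) if v == value)
    return total


# Illustrative events; the field names are assumptions, not Guardian's schema.
EVENTS = [
    {"user_id": 1, "action": "pin_create", "country": "US"},
    {"user_id": 2, "action": "signup", "country": "BR"},
    {"user_id": 1, "action": "pin_create", "country": "US"},
    {"user_id": 3, "action": "pin_create", "country": "DE"},
]

segments = build_segments(EVENTS)
pin_creates = count_where(segments, "action", "pin_create")
```

Here `count_where` scans only the `action` lists; the `user_id` and `country` columns are never read, which is the property the columnar format is meant to provide.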

Key Statistics & Figures

Number of rows in Guardian's dataset
10B+ rows
Guardian is designed to query a dataset of this size efficiently, which is essential for effective spam detection.
Time to create a new rule
Less than an hour
Guardian's streamlined process allows rapid rule creation; in the previous system, creating a rule took days.

Key Actionable Insights

1. Utilize a denormalized data structure to improve query performance.
   By consolidating data into a single table, you can reduce the complexity of queries and speed up the data retrieval process, which is crucial for real-time applications like spam detection.
2. Implement a real-time analytics engine to streamline rule creation.
   Transitioning from a traditional batch processing system to a real-time engine can drastically reduce the time needed for rule deployment, enhancing your ability to respond to spam threats quickly.
3. Incorporate feature enrichment to enhance data quality for analysis.
   By enriching incoming data with additional context from various sources, you can improve the accuracy of your spam detection algorithms, leading to better outcomes in identifying malicious activity.
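
As a rough illustration of feature enrichment feeding a denormalized row, the sketch below merges context from two lookups into each incoming event before it is stored. Everything in it is an assumption made for the example: the enricher functions, the lookup tables standing in for external services, and the field names are hypothetical and do not come from the article.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Enricher = Callable[[Event], Dict[str, Any]]

# Hypothetical lookup tables standing in for account and IP metadata services.
ACCOUNT_AGE_DAYS = {1: 400, 2: 0}
IP_COUNTRY = {"203.0.113.7": "US", "198.51.100.9": "BR"}


def account_features(event: Event) -> Dict[str, Any]:
    """Attach account-level context keyed by the event's user_id."""
    return {"account_age_days": ACCOUNT_AGE_DAYS.get(event["user_id"])}


def ip_features(event: Event) -> Dict[str, Any]:
    """Attach network-level context keyed by the event's IP address."""
    return {"ip_country": IP_COUNTRY.get(event["ip"])}


def enrich(event: Event, enrichers: List[Enricher]) -> Event:
    """Return one flat, denormalized row: the raw event plus all derived features."""
    row = dict(event)  # copy so the original event is left untouched
    for enricher in enrichers:
        row.update(enricher(event))
    return row


raw_event = {"user_id": 2, "ip": "198.51.100.9", "action": "signup"}
row = enrich(raw_event, [account_features, ip_features])
```

Because every feature ends up as a column on the same row, a later rule can filter on `account_age_days` or `ip_country` without joining against other tables at query time.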

Common Pitfalls

1. Overcomplicating rule creation processes can lead to inefficiencies.
   When rules are too complex or require excessive translation between languages, it can slow down deployment and increase the risk of errors. Simplifying the rule creation process, as done in Guardian, can mitigate these issues.
2. Neglecting real-time data processing capabilities can hinder responsiveness.
   Failing to implement a system that can handle real-time data can result in delayed responses to spam threats, allowing malicious activity to proliferate. Guardian's design addresses this by enabling immediate feedback on queries.
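
To make the first pitfall concrete, the sketch below treats a rule as a single SQL query that can be back-tested directly against historical events, with no translation step into application code. SQLite is used purely as a stand-in so the example runs anywhere; Guardian's own SQL-like dialect and custom functions are not shown in this summary, and the table schema, rule threshold, and `backtest` helper are hypothetical.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical rule: flag brand-new accounts that create 3 or more pins.
RULE = """
SELECT user_id, COUNT(*) AS pins
FROM events
WHERE action = 'pin_create' AND account_age_days < 1
GROUP BY user_id
HAVING COUNT(*) >= 3
ORDER BY user_id
"""


def backtest(rule: str, rows: List[Tuple[int, str, int]]) -> List[Tuple[int, int]]:
    """Run a rule query over historical events held in an in-memory table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (user_id INTEGER, action TEXT, account_age_days INTEGER)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        return conn.execute(rule).fetchall()
    finally:
        conn.close()


# Illustrative history: (user_id, action, account_age_days).
HISTORY = [
    (1, "pin_create", 400),
    (2, "pin_create", 0),
    (2, "pin_create", 0),
    (2, "pin_create", 0),
    (3, "pin_create", 0),
]

flagged = backtest(RULE, HISTORY)
```

The same query text serves as both the exploratory analysis and the rule definition, so checking how many legitimate users a rule would catch is a matter of rerunning it on past data.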

Related Concepts

Real-time Analytics
Rules Engines
Data Enrichment
Spam Detection Techniques