How we built Text-to-SQL at Pinterest

Overview

The article discusses Pinterest's development of a Text-to-SQL feature that utilizes Large Language Models (LLMs) to assist data users in generating SQL queries from natural language questions. It covers the architecture, implementation challenges, and improvements made over time to enhance user productivity and query accuracy.

What You'll Learn

1

How to implement a Text-to-SQL feature using Large Language Models

2

Why incorporating Retrieval Augmented Generation (RAG) improves table selection in SQL queries

3

How to enhance SQL query accuracy by processing low-cardinality columns

4

How to evaluate the performance of a Text-to-SQL system against real-world user interactions

Prerequisites & Requirements

  • Understanding of SQL and database schemas
  • Familiarity with WebSocket for streaming responses (optional)
  • Experience with AI/ML concepts and implementation (optional)

Key Questions Answered

How does Pinterest's Text-to-SQL feature assist data users?
Pinterest's Text-to-SQL feature transforms natural language analytical questions into SQL queries using Large Language Models. It retrieves relevant table schemas and compiles them into prompts for the LLM, which generates SQL code, streamlining the query writing process for users.
What challenges did Pinterest face when implementing Text-to-SQL?
Pinterest faced challenges such as ensuring accurate SQL generation for low-cardinality columns and managing large table schemas that could exceed context window limits. Techniques like column pruning and metadata processing were implemented to address these issues.
What improvements were made in the second iteration of Text-to-SQL?
In the second iteration, Pinterest integrated Retrieval Augmented Generation (RAG) to help users select the correct tables from a vast number of options. This involved creating a vector index of table summaries and using embeddings for similarity searches.
What was the impact of the Text-to-SQL feature on user productivity?
The Text-to-SQL feature resulted in a 35% improvement in task completion speed for writing SQL queries. The first-shot acceptance rate of generated SQL started at 20% and rose to over 40% as users became more familiar with the tool.
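
The prompt-assembly step described in the first answer can be sketched as follows. This is a minimal illustration, not Pinterest's implementation: the `Column` and `Table` classes, the prompt wording, and the `max_columns` truncation (a crude stand-in for column pruning on wide tables) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: str
    type: str
    description: str = ""


@dataclass
class Table:
    name: str
    columns: List[Column]


def render_schema(table: Table, max_columns: Optional[int] = None) -> str:
    """Render one table as text, optionally keeping only the first N columns."""
    cols = table.columns if max_columns is None else table.columns[:max_columns]
    lines = [f"Table: {table.name}"]
    for col in cols:
        suffix = f" -- {col.description}" if col.description else ""
        lines.append(f"  {col.name} {col.type}{suffix}")
    return "\n".join(lines)


def build_prompt(question: str, tables: List[Table],
                 max_columns: Optional[int] = None) -> str:
    """Combine the selected table schemas and the user's question into one prompt."""
    schemas = "\n\n".join(render_schema(t, max_columns) for t in tables)
    return (
        "You are a SQL assistant. Using only the tables below, "
        "write one SQL query that answers the question.\n\n"
        f"{schemas}\n\nQuestion: {question}\nSQL:"
    )
```

The returned string would be sent to whichever LLM the system uses; that call is deliberately left out because the summary does not name a specific model or client library.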

Key Statistics & Figures

Initial first-shot acceptance rate
20%
This was the acceptance rate for generated SQL queries before users became familiar with the Text-to-SQL feature.
Improvement in task completion speed
35%
This statistic reflects how much faster users completed SQL-writing tasks with AI assistance after the Text-to-SQL feature was introduced.

Technologies & Tools

AI/ML
Large Language Models
Used to transform analytical questions into SQL queries.
Communication
WebSocket
Employed to stream responses from the LLM to users.
Library
LangChain
Utilized for partial JSON parsing during response streaming.
Database
OpenSearch
Used as the vector store for conducting similarity searches.
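
To make the similarity search concrete, here is a small in-memory sketch of ranking table summaries against a question. It is only an illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the dictionary of summaries stands in for an OpenSearch vector index, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_tables(question: str, summaries: Dict[str, str], k: int = 2) -> List[str]:
    """Return the k table names whose summaries are most similar to the question."""
    q = embed(question)
    ranked = sorted(
        summaries,
        key=lambda name: cosine(q, embed(summaries[name])),
        reverse=True,
    )
    return ranked[:k]
```

In a production setup the ranking would be delegated to the vector store's nearest-neighbor query rather than computed in application code; the sketch only shows the shape of the lookup.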

Key Actionable Insights

1
Integrate Retrieval Augmented Generation (RAG) to enhance table selection for users.
With RAG, users can more easily identify the relevant tables among a large number of candidates, improving the accuracy of their SQL queries and reducing the time spent searching for the right data.
2
Implement a feedback mechanism to gather user insights on SQL query generation.
Collecting user feedback can help refine the Text-to-SQL feature, allowing for continuous improvement based on actual user experiences and needs.
3
Focus on processing low-cardinality columns to improve SQL query accuracy.
By ensuring that the generated SQL respects the actual values in low-cardinality columns, the system can produce more reliable and accurate queries, enhancing overall user trust in the tool.
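
The low-cardinality idea in the third insight can be sketched as follows: look up the distinct values of a column and, when there are few enough, attach them to the column's entry in the prompt so the model filters on values that actually exist. This is a hedged sketch, not the article's code: SQLite stands in for the real warehouse, and the function names and the threshold of 20 distinct values are assumptions.

```python
import sqlite3


def low_cardinality_values(conn, table, column, max_distinct=20):
    """Return sorted distinct values, or None if the column has more than max_distinct."""
    # Identifiers cannot be bound as parameters, so they are quoted inline.
    # This assumes table and column names come from trusted metadata.
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT ?',
        (max_distinct + 1,),
    ).fetchall()
    if len(rows) > max_distinct:
        return None
    return sorted(r[0] for r in rows if r[0] is not None)


def annotate_column(conn, table, column):
    """Describe a column for the prompt, listing its values when there are few."""
    values = low_cardinality_values(conn, table, column)
    if values is None:
        return column
    listed = ", ".join(repr(v) for v in values)
    return f"{column} (allowed values: {listed})"
```

With the values listed, a question about web traffic is more likely to produce `platform = 'WEB'` than a guessed `platform = 'web'` that matches no rows.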

Common Pitfalls

1
Failing to validate SQL queries generated by the LLM can lead to execution errors.
Without proper validation, users may run queries that do not function as intended, resulting in wasted time and resources. Implementing a validation step can help mitigate this risk.
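
One possible shape for such a validation step is to compile the generated query against an empty copy of the schema before showing it to the user. The sketch below uses SQLite's `EXPLAIN` as a stand-in for the production engine, so it catches syntax errors and unknown tables or columns but not dialect differences; `validate_sql` and its parameters are hypothetical names for this example.

```python
import sqlite3
from typing import List, Tuple


def validate_sql(sql: str, schema_ddl: List[str]) -> Tuple[bool, str]:
    """Compile sql against an empty in-memory schema; return (ok, error message)."""
    conn = sqlite3.connect(":memory:")
    try:
        for ddl in schema_ddl:
            conn.execute(ddl)
        # EXPLAIN makes SQLite compile the statement without running it.
        conn.execute(f"EXPLAIN {sql}")
        return True, ""
    except (sqlite3.Error, sqlite3.Warning) as exc:
        # sqlite3.Warning covers multi-statement input on older Python versions.
        return False, str(exc)
    finally:
        conn.close()
```

The error message can be surfaced to the user or fed back to the LLM in a retry prompt; which of these the original system does is not stated in this summary.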

Related Concepts

Natural Language Processing
SQL Query Optimization
AI/ML In Data Analysis
Data Warehouse Management