How We Improved Data Discovery for Data Scientists at Spotify

Andrew Maher
15 min read · intermediate

Overview

The article describes how Spotify improved data discovery for data scientists by developing and iterating on Lexikon, a library that centralizes data and insights. The key improvements focused on understanding user intent, facilitating knowledge exchange, and helping users get started with the datasets they discover.

What You'll Learn

1. How to improve data discovery experiences for data scientists using user intent analysis
2. Why personalized dataset recommendations enhance user engagement
3. How to facilitate knowledge exchange among data scientists through collaboration tools
4. When to implement schema-field consumption statistics for better dataset usability

Prerequisites & Requirements

  • Understanding of data science concepts and practices
  • Familiarity with BigQuery and data management tools (optional)

Key Questions Answered

What challenges did Spotify face in data discovery before Lexikon?
Spotify faced challenges such as the lack of a centralized data catalog, unclear dataset ownership, and insufficient documentation, all of which kept data scientists from finding and using datasets efficiently. As a result, data scientists spent significant time on data discovery, which slowed the production of insights.
How did Spotify improve the data discovery experience with Lexikon?
Spotify improved the data discovery experience by understanding user intent, enabling knowledge exchange among data scientists, and providing features like personalized dataset recommendations and schema-field consumption statistics. These changes led to increased engagement and reduced pain points in data discovery.
What impact did the changes in Lexikon have on user engagement?
The changes in Lexikon increased its adoption among data scientists from 75% to 95%, with the user base growing from approximately 550 to 870 monthly active users. Additionally, the average number of sessions per user increased from about 3 to 9, indicating higher engagement.
What features were added to help data scientists get started with discovered datasets?
New features such as schema-field consumption statistics, recent queries, and tables commonly joined were added to Lexikon to assist data scientists in understanding how to effectively use datasets they discovered, thereby improving their workflow.
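
A "tables commonly joined" feature like the one mentioned above can be approximated by counting which tables appear together in logged queries. The following is a minimal sketch, not Lexikon's actual implementation: the query log, the table names, and the simplified regex-based SQL parsing are all assumptions made for illustration.

```python
import re
from collections import Counter

# Simplified, hypothetical parsing: grab the identifier after FROM or JOIN.
# Real SQL (subqueries, CTEs, comments) would need a proper parser.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+`?([\w.\-]+)`?", re.IGNORECASE)


def tables_in(query):
    """Return the set of table names referenced in a single query."""
    return {name.lower() for name in TABLE_RE.findall(query)}


def commonly_joined(queries, table, top_n=3):
    """Count tables that co-occur with `table` across the query log."""
    table = table.lower()
    counts = Counter()
    for query in queries:
        tables = tables_in(query)
        if table in tables:
            counts.update(tables - {table})
    return counts.most_common(top_n)


# Hypothetical query log for illustration only.
query_log = [
    "SELECT * FROM plays p JOIN users u ON p.user_id = u.id",
    "SELECT * FROM plays p JOIN tracks t ON p.track_id = t.id",
    "SELECT * FROM plays JOIN users ON plays.user_id = users.id",
    "SELECT * FROM tracks",
]

print(commonly_joined(query_log, "plays"))  # [('users', 2), ('tracks', 1)]
```

Counting co-occurrence per query keeps the sketch simple; a production version would likely weight by recency or by the number of distinct users running each query.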

Key Statistics & Figures

Adoption rate of Lexikon among data scientists: 95%
This increase from 75% indicates a significant improvement in user engagement after enhancements were made.

Monthly active users of Lexikon: 870
The user base grew from approximately 550, showing the tool's expanded utility beyond just insights specialists.

Average sessions per monthly active user: 9
This increase from about 3 sessions indicates higher engagement with the tool after improvements were implemented.

Percentage of users navigating to BigQuery tables through personalized recommendations: 20%
This reflects the effectiveness of the revamped homepage in guiding users to relevant datasets.

Percentage of Lexikon's monthly active users visiting new entity pages: 44%
This shows the importance of new entity types in enhancing the data discovery process.

Key Actionable Insights

1. Implement personalized dataset recommendations in your data tools to enhance user engagement.
By analyzing user behavior and preferences, you can provide tailored suggestions that help users discover relevant datasets, similar to how Spotify improved Lexikon's homepage.

2. Encourage knowledge exchange among team members to improve data discovery.
Facilitating conversations and connections between data scientists can lead to more effective data utilization, as demonstrated by Spotify's mapping of expertise within the insights community.

3. Utilize schema-field consumption statistics to guide users in selecting relevant data fields.
This feature helps users identify which fields are most frequently used, streamlining their data exploration process and enhancing their ability to derive insights.
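
To make the idea of schema-field consumption statistics concrete, the sketch below computes, for each field in a table's schema, the share of logged queries that mention it. It is an illustration under stated assumptions rather than Spotify's method: the query log and field names are hypothetical, and fields are matched by simple token lookup, so `SELECT *` is not expanded into individual fields.

```python
import re
from collections import Counter


def field_consumption(queries, schema_fields):
    """Return, per schema field, the fraction of queries that mention it."""
    counts = Counter()
    for query in queries:
        # Tokenize into identifiers; a real system would parse the SQL instead.
        tokens = set(re.findall(r"[A-Za-z_]\w*", query.lower()))
        counts.update(f for f in schema_fields if f.lower() in tokens)
    total = len(queries)
    return {f: (counts[f] / total if total else 0.0) for f in schema_fields}


# Hypothetical query log and schema for illustration only.
plays_queries = [
    "SELECT user_id, country FROM plays",
    "SELECT user_id FROM plays WHERE ms_played > 30000",
    "SELECT COUNT(*) FROM plays",
]
plays_fields = ["user_id", "country", "ms_played", "device"]

stats = field_consumption(plays_queries, plays_fields)
# Rank fields from most to least consumed.
for field, share in sorted(stats.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field}: {share:.0%}")
```

Sorting fields by this share gives users a quick signal of which columns their colleagues actually rely on, and which ones are rarely touched.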

Common Pitfalls

1. Relying solely on documentation can hinder effective data discovery.
Many data scientists may overlook the value of personal connections and discussions with colleagues, which can provide insights that documentation alone cannot offer.

2. Neglecting to update example queries can lead to confusion and inefficiency.
Outdated example queries may mislead users; letting users view recent queries instead keeps the examples relevant.
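
One way to avoid stale example queries is to surface the most recent queries run against a table. The sketch below assumes a hypothetical in-memory log of `QueryRecord` entries (the record shape and field names are invented for illustration); a real catalog would read this from query audit logs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical log entry: which table was queried, when, and how."""
    table: str
    ran_at: datetime
    sql: str


def recent_queries(log, table, limit=5):
    """Return up to `limit` queries against `table`, newest first."""
    matching = [record for record in log if record.table == table]
    matching.sort(key=lambda record: record.ran_at, reverse=True)
    return matching[:limit]


# Hypothetical log for illustration only.
audit_log = [
    QueryRecord("plays", datetime(2024, 1, 5), "SELECT user_id FROM plays"),
    QueryRecord("users", datetime(2024, 1, 6), "SELECT id FROM users"),
    QueryRecord("plays", datetime(2024, 2, 1), "SELECT country FROM plays"),
    QueryRecord("plays", datetime(2023, 12, 1), "SELECT * FROM plays"),
]

for record in recent_queries(audit_log, "plays", limit=2):
    print(record.ran_at.date(), record.sql)
```

Because the list is derived from actual usage, it stays current without anyone having to maintain it by hand.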

Related Concepts

Data Discovery
Knowledge Management
User Experience Design