Categorizing Products at Scale

Jeet Mehta

Shopify

13 min read • advanced

Artificial Intelligence, Computer Vision, Golang, GPT, HTML, PySpark

Overview

This article covers the challenges and methods involved in categorizing products at scale on Shopify, a platform with over 1 million business owners and billions of products. It describes a product categorization model built on the Google Product Taxonomy, explains how the model addresses the problems of scale and hierarchical structure, and details the featurization and modeling steps used to power personalized insights for business owners.

What You'll Learn

1. How to implement a product categorization model using Google Product Taxonomy
2. Why hierarchical classification presents unique challenges in machine learning
3. How to leverage Kesler’s Construction for scaling binary classifiers
4. When to apply hierarchical evaluation metrics for model performance assessment

Prerequisites & Requirements

Understanding of machine learning concepts and classification techniques
Familiarity with PySpark for data processing (optional)

Key Questions Answered

What are the main challenges in categorizing products at scale?

Categorizing products at scale means handling a very large number of categories (over 5,000 in the Google Product Taxonomy) and respecting the hierarchical structure that relates them. Traditional multi-class classification methods struggle to scale as the number of product classes grows.

How does Shopify implement product categorization?

Shopify's model classifies products using features such as product title, description, and tags. It employs Kesler’s Construction to convert the multi-class problem into a binary classification task, which lets it scale across thousands of categories.
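To make the multi-class-to-binary conversion concrete, here is a minimal sketch of the textbook form of Kesler's construction for a linear classifier. This summary does not describe how Shopify's variant builds its examples, so the function names, the toy feature vector, and the weight vector below are all illustrative assumptions.

```python
# Textbook Kesler's construction (illustrative sketch, not Shopify's code).
# A sample x with d features and true class i, in a problem with c classes,
# is expanded into (c - 1) vectors of length c * d. For each other class j,
# the vector holds x in block i and -x in block j. A single weight vector w
# of length c * d classifies x correctly when dot(w, v) > 0 for every
# expanded vector v, which is a plain binary (positive/negative) condition.

def kesler_expand(x, true_class, num_classes):
    """Return the (num_classes - 1) expanded vectors for one sample."""
    d = len(x)
    expanded = []
    for other in range(num_classes):
        if other == true_class:
            continue
        v = [0.0] * (d * num_classes)
        for k, value in enumerate(x):
            v[true_class * d + k] = value   # +x in the true-class block
            v[other * d + k] = -value       # -x in the competing block
        expanded.append(v)
    return expanded


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict(w, x, num_classes):
    """Score each class with its block of w and return the best class."""
    d = len(x)
    scores = [dot(w[c * d:(c + 1) * d], x) for c in range(num_classes)]
    return max(range(num_classes), key=scores.__getitem__)


# Hypothetical toy data: 2 features, 3 classes, true class 1.
x = [1.0, 2.0]
vectors = kesler_expand(x, true_class=1, num_classes=3)
w = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # hand-picked weights for illustration
all_positive = all(dot(w, v) > 0 for v in vectors)
```

When every expanded vector scores positive under `w`, `predict(w, x, 3)` returns the true class, so any binary learner that finds such a `w` also solves the original multi-class problem.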

What evaluation metrics are used for hierarchical classification models?

For hierarchical classification models, Shopify uses metrics such as hierarchical accuracy, precision, recall, and F1 score. These metrics account for the structure of the taxonomy, allowing for a more nuanced evaluation of model performance based on the distance to the nearest common ancestor in the taxonomy.
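One common way to define hierarchical precision, recall, and F1 compares the ancestor sets of the true and predicted categories: the deeper their nearest common ancestor, the larger the overlap. The sketch below uses that definition. Shopify's exact formulas are not given here, and the category paths are made-up examples, so treat this as an assumption-laden illustration.

```python
# Ancestor-set hierarchical metrics (illustrative; not Shopify's exact formulas).
# Categories are written as "Root > Child > Grandchild" paths.

def ancestors(path):
    """Return the set of all path prefixes, including the node itself."""
    parts = [p.strip() for p in path.split(">")]
    return {" > ".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_prf(true_path, pred_path):
    """Hierarchical precision, recall, and F1 for a single prediction."""
    true_set = ancestors(true_path)
    pred_set = ancestors(pred_path)
    overlap = len(true_set & pred_set)  # shared nodes down to the common ancestor
    precision = overlap / len(pred_set)
    recall = overlap / len(true_set)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical paths: the prediction is wrong but shares two levels with the truth.
p, r, f1 = hierarchical_prf("Pets > Dogs > Beds > Orthopedic", "Pets > Dogs > Toys")
```

Here the prediction earns partial credit (precision 2/3, recall 1/2) because it lands in the right subtree, whereas a flat metric would score it as simply wrong.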

What feedback mechanisms are in place for incorrect classifications?

Shopify has implemented a feedback mechanism using schematized Kafka events and an in-house annotation platform to capture misclassifications. This human-in-the-loop setup allows for continuous improvement of the model by incorporating user feedback on incorrect predictions.

Key Statistics & Figures

Number of business owners on Shopify

over 1 million

This statistic highlights the scale at which Shopify operates and the diversity of products being categorized.

Number of categories in Google Product Taxonomy

over 5000

The extensive number of categories presents significant challenges for product classification.

Number of teams using the product categorization system at Shopify

20+

This indicates the widespread application and importance of the categorization engine across the organization.

Technologies & Tools

Data Processing

PySpark

Used for implementing the term-frequency hashing featurizer in the product categorization model.
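Term-frequency hashing maps each token to a bucket in a fixed-size vector and counts occurrences, so no vocabulary has to be stored. PySpark provides this through its `HashingTF` transformer. The pure-Python sketch below only illustrates the idea: the MD5-based bucket function, the vector size, and the sample title are assumptions chosen for a deterministic, dependency-free example, and Spark's own implementation uses a different hash function.

```python
import hashlib


def hashing_tf(tokens, num_features=16):
    """Count tokens into a fixed-length vector using the hashing trick."""
    vec = [0] * num_features
    for tok in tokens:
        # md5 gives a hash that is stable across runs (unlike built-in hash()).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % num_features
        vec[idx] += 1  # distinct tokens may collide in the same bucket
    return vec


# Hypothetical product title, lowercased and split on whitespace.
title_tokens = "red cotton shirt red".split()
features = hashing_tf(title_tokens, num_features=16)
```

A larger `num_features` reduces collisions at the cost of a wider, sparser vector, which is the main tuning decision for this kind of featurizer.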

Messaging

Kafka

Utilized for the feedback mechanism to capture misclassifications and improve the model.

Key Actionable Insights

1. Implement a product categorization model using a hierarchical taxonomy to improve product discovery.
   Utilizing a structured taxonomy like Google Product Taxonomy can streamline the categorization process, making it easier for business owners to manage diverse product offerings and enhance customer experience.

2. Leverage Kesler’s Construction to simplify the scaling of classification models.
   This approach allows for the efficient handling of large datasets by transforming multi-class classification into binary classification, which is crucial for managing the complexity of product categorization at scale.

3. Adopt hierarchical evaluation metrics to better assess model performance.
   Using hierarchical metrics provides a more accurate representation of model effectiveness, particularly in cases where misclassifications occur within the same category tree, allowing for targeted improvements.

Common Pitfalls

1. Relying solely on flat evaluation metrics for hierarchical classification can misrepresent model performance.
   Flat metrics do not account for the hierarchical structure of categories, leading to a lack of insight into how closely predictions align with actual categories. It's essential to use hierarchical metrics to capture the nuances of classification accuracy.

Related Concepts

Hierarchical Classification

Machine Learning Model Evaluation

Product Taxonomy