Using Rich Image and Text Data to Categorize Products at Scale

We reevaluated our existing product categorization model to better understand what our merchants are selling, so that we can build products that help power their sales.

Kshetrajna Raghavan
14 min read · Advanced

Overview

This article discusses the evolution of Shopify's product categorization model, which has been enhanced to improve accuracy and coverage in classifying products sold by millions of merchants. Key improvements include the integration of rich image and text data, the use of advanced machine learning techniques, and the implementation of a hierarchical classification approach.

What You'll Learn

1. How to implement a hierarchical classification model for product categorization
2. Why using both text and image features improves model performance
3. How to optimize model training using data parallelization techniques
4. When to apply smart thresholding to filter predictions

Prerequisites & Requirements

  • Understanding of machine learning concepts and classification models
  • Familiarity with TensorFlow and Keras for model implementation (optional)

Key Questions Answered

How does Shopify categorize products at scale?
Shopify categorizes products at scale by using a hierarchical classification model that incorporates both text and image features. The model leverages pre-trained neural networks for embedding these features, allowing it to make accurate predictions across multiple languages and product categories.
What metrics are used to evaluate the performance of the categorization model?
The performance of the categorization model is evaluated using metrics such as hierarchical accuracy, precision, recall, F1 score, and coverage. These metrics help ensure that the model provides high-quality predictions while maintaining a balance between precision and coverage.
What improvements were made to the product categorization model?
The new model improved leaf precision by 8% and doubled coverage. It also expanded capabilities to classify products in multiple languages, addressing previous limitations of the model that only supported English text.
What challenges are associated with training a large-scale classification model?
Training a large-scale classification model presents challenges such as managing over 250 million parameters and handling a massive dataset. To address these issues, Shopify employed data parallelization techniques using distributed TensorFlow on Google Cloud Platform to speed up the training process.
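
The data parallelization mentioned in the last answer can be illustrated without any framework. The sketch below is not Shopify's training code. It is a toy, pure-Python version of synchronous data parallelism for a one-parameter linear model, and every name in it (`gradient`, `split_into_shards`, `data_parallel_step`) is hypothetical. Each simulated worker computes a gradient on its own shard of the batch, and the gradients are averaged before a single shared update. With equal-sized shards, the averaged gradient equals the full-batch gradient, so the work is split without changing the update itself. In distributed TensorFlow, the `tf.distribute` strategies play this role.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (feature x, target y)


def gradient(w: float, shard: List[Example]) -> float:
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split_into_shards(batch: List[Example], num_workers: int) -> List[List[Example]]:
    """Split a batch into equal-sized shards, one per simulated worker."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w: float, batch: List[Example], num_workers: int, lr: float) -> float:
    """One synchronous step: per-shard gradients, averaged, then one update."""
    shards = split_into_shards(batch, num_workers)
    # On real hardware these gradients would be computed in parallel.
    worker_grads = [gradient(w, shard) for shard in shards]
    avg_grad = sum(worker_grads) / num_workers
    return w - lr * avg_grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy data, y = 2x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, num_workers=2, lr=0.01)
print(f"learned w = {w:.4f}")
```

The equal-shard requirement is a simplification chosen for this sketch: with unequal shards, the per-worker gradients would need to be weighted by shard size to match the full-batch gradient.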

Key Statistics & Figures

Leaf precision improvement: 8%
This improvement was achieved through the implementation of the new categorization model.
Coverage increase: Doubled
The new model significantly increased the number of products that received predictions.
Model parameters: Over 250 million
The complexity of the model necessitated advanced training techniques to manage its size.
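
To make the two headline metrics concrete, here is a minimal sketch of how leaf precision and coverage could be computed. The definitions are assumptions, since the summary does not spell them out: coverage is taken as the share of products that receive any prediction, and leaf precision as the share of emitted leaf-level predictions that match the true leaf category. The function names and sample categories are hypothetical.

```python
from typing import List, Optional


def coverage(predictions: List[Optional[str]]) -> float:
    """Share of products that received any prediction (None = withheld)."""
    if not predictions:
        return 0.0
    return sum(p is not None for p in predictions) / len(predictions)


def leaf_precision(predictions: List[Optional[str]], labels: List[str]) -> float:
    """Share of emitted leaf predictions that match the true leaf category."""
    emitted = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    if not emitted:
        return 0.0
    return sum(p == y for p, y in emitted) / len(emitted)


# Hypothetical leaf categories; None means the model withheld a prediction.
labels = ["Shoes", "Shirts", "Mugs", "Shoes", "Lamps"]
predictions = ["Shoes", "Shirts", None, "Boots", "Lamps"]

print(f"coverage = {coverage(predictions):.2f}")
print(f"leaf precision = {leaf_precision(predictions, labels):.2f}")
```

Under these definitions the two metrics pull against each other: withholding uncertain predictions tends to raise precision and lower coverage, which is why they are reported together.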

Technologies & Tools

Backend
TensorFlow
Used for building and training the hierarchical classification model.
Backend
Keras
Utilized for creating the model architecture and managing the training process.
Cloud
Google Cloud Platform
Provided the infrastructure for distributed training of the model.
Machine Learning
Multilingual BERT
Used for text feature embedding in the categorization model.
Machine Learning
MobileNetV2
Employed for image feature embedding in the categorization model.

Key Actionable Insights

1. Incorporate both text and image features in your classification models to enhance accuracy.
Using a combination of features allows the model to leverage rich data, leading to better predictions and improved performance across diverse product categories.
2. Utilize hierarchical classification to manage complex categorization tasks effectively.
Hierarchical classification helps in preserving relationships between categories, making it easier to predict child nodes based on parent node predictions.
3. Implement smart thresholding to filter out low-confidence predictions.
This practice ensures that only high-confidence predictions are presented, improving the overall quality of the model's output and enhancing user experience.
4. Adopt data parallelization techniques to optimize model training times.
By distributing the training workload across multiple machines, you can significantly reduce training times while maintaining model performance.
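
Insight 3 can be sketched in a few lines. The summary does not describe how the thresholds are chosen or applied, so the scheme below is an assumption: one confidence threshold per taxonomy level, with the prediction falling back to the deepest ancestor that still clears its threshold. The function name, thresholds, and categories are all hypothetical.

```python
from typing import List, Optional, Tuple

Path = List[Tuple[str, float]]  # (category name, model confidence), root first


def filter_by_threshold(path: Path, thresholds: List[float]) -> Optional[str]:
    """Return the deepest category whose confidence, and that of every
    ancestor, meets the threshold for its level; None if the root fails."""
    accepted: Optional[str] = None
    for (category, confidence), threshold in zip(path, thresholds):
        if confidence < threshold:
            break  # stop descending; keep the last accepted ancestor
        accepted = category
    return accepted


thresholds = [0.5, 0.6, 0.8]  # hypothetical per-level thresholds, root first

confident = [("Apparel", 0.97), ("Shoes", 0.91), ("Sneakers", 0.85)]
unsure_leaf = [("Apparel", 0.97), ("Shoes", 0.91), ("Sneakers", 0.40)]
unsure_root = [("Apparel", 0.30), ("Shoes", 0.91), ("Sneakers", 0.85)]

print(filter_by_threshold(confident, thresholds))
print(filter_by_threshold(unsure_leaf, thresholds))
print(filter_by_threshold(unsure_root, thresholds))
```

Falling back to a parent category rather than dropping the prediction entirely is one way to trade a little specificity for coverage. A stricter variant could return `None` whenever the leaf fails.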

Common Pitfalls

1. Neglecting to incorporate diverse data sources can lead to biased predictions.
When models are trained on limited datasets, they may not generalize well across different product categories, resulting in poor performance in real-world applications.
2. Ignoring the importance of hierarchical relationships in classification tasks.
Failing to account for the parent-child relationships in a taxonomy can lead to inaccurate predictions and a lack of coherence in categorization.
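
The second pitfall is easier to see with a small example. The sketch below assumes a hypothetical two-level taxonomy and a set of per-category scores, and uses greedy top-down decoding: at each level it only considers children of the category chosen one level up. This is one simple way to keep predictions coherent, not necessarily how the model in the article decodes its outputs.

```python
from typing import Dict, List

# Hypothetical two-level taxonomy: parent -> children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["Apparel", "Home"],
    "Apparel": ["Shoes", "Shirts"],
    "Home": ["Lamps", "Mugs"],
}


def predict_path(scores: Dict[str, float], taxonomy: Dict[str, List[str]]) -> List[str]:
    """Greedy top-down decoding: at each level, choose the best-scoring child
    of the category chosen one level up, so the result is a valid path."""
    path: List[str] = []
    node = "root"
    while node in taxonomy:  # stop once a leaf is reached
        node = max(taxonomy[node], key=lambda child: scores.get(child, 0.0))
        path.append(node)
    return path


# "Mugs" has the highest leaf score overall, but it is not a child of the
# winning parent "Apparel", so a flat argmax over leaves would contradict
# the parent-level prediction.
scores = {
    "Apparel": 0.8, "Home": 0.2,
    "Shoes": 0.4, "Shirts": 0.1, "Lamps": 0.05, "Mugs": 0.45,
}
print(predict_path(scores, TAXONOMY))
```

Here the decoder returns the path Apparel, then Shoes, even though "Mugs" scores highest among the leaves, because the leaf choice is restricted to children of the selected parent.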

Related Concepts

Hierarchical Classification
Machine Learning Model Training
Feature Vectorization Techniques
Data Parallelization Strategies