Apr 4, 2022 · 6 min read

Topological Data Analysis in E-Commerce

by Data Science

Introduction
Product search in e-commerce is a complex and fascinating problem with two main aspects: user needs on one side and a large collection of products on the other. Building the bridge connecting the two, and allowing the user to have a rewarding and positive experience while crossing it, is part of our mission. As data scientists at Klarna we are constantly looking for new ways to improve user experience, putting the customers at the center of our problem space.

In this post I will describe an approach we propose for a multi-class text classification problem. These problems arise from the need to understand product categories and to add structure to the product catalog (the set of products the user can buy), so the user can exploit that structure and enjoy a better experience. The solution we have developed combines several domains of knowledge: Natural Language Processing (NLP), Topological Data Analysis (TDA) and Network Theory.

The problem
Products are typically organized in a catalog containing all the information one has about each single item, for example: title: “Air Jordan 1”; brand: “Nike”; description: “Inspired by…”; image: “image.jpg”; and category: “sneakers”.

An additional source of data is provided by products (not necessarily present in our catalog) that were bought using our Buy Now Pay Later (BNPL) service, and in this latter case we might only have access to the title of the product.

A common need for these different data sources is the ability to infer the product category from some textual information, and this is precisely what we are going to address. We can phrase the problem as a typical multiclass classification problem: train a model that can infer from a product title the category to which the product belongs. A first challenge is related to the available taxonomy structure: in our case a large flat set of category labels (possibly more than 500) without hierarchical structure. A second challenge is the imbalance between different categories in the training dataset: a few categories have a lot of training examples, while most categories have just a few.

To tackle these issues we decided to take a closer look at the category space of the problem, trying to understand the structure of the relations between different categories.

The analysis presented in this post has been developed using internal data.

Unsupervised learning in category space
Starting from the reasonable assumption that the set of categories has some structure (e.g. boots and sneakers are more similar, from a semantic point of view, than boots and televisions), we aim to reduce the complexity of the categorization problem by identifying relevant groups of categories (let’s call them super-categories) and then to reformulate the initial problem with respect to the smaller set of super-categories.

Figure 1: Products distribution across some initial categories.

 

But how to identify super-categories? Here is where we are going to be brave and open the door to some of the most recent tools provided to data scientists by our fellow mathematicians: Topological Data Analysis (TDA) and network theory. We will talk later about network theory, for now let’s start with TDA.

Topological Data Analysis
TDA identifies structure in the data with the help of topology, a branch of mathematics concerned with the qualitative understanding of proximity or similarity patterns between data points rather than with the details of the data configuration. Quoting a nerdy joke: a topologist is someone who cannot tell the difference between a donut and a coffee mug. The surfaces of these two objects (the donut and the mug) share the property of having a single hole, which, from a topological point of view, makes them belong to the same topological species. One reason to use a topological approach in data analysis is that data is often noisy and, especially in high dimensions, a small perturbation in the way we assign numerical features to our data can have important consequences for the exact distances between data points. TDA claims to be more robust to this kind of noise.

To start using TDA we need to represent our data (the product titles and the category names) as a set of data points, or a point cloud, in a vector space. A lot of research has been devoted to finding ways to convert textual data into vectors, which can then be used to calculate similarity. Here we exploit a lighter, sentence-level variant of the BERT model, which maps full sentences into vectors, preserving their semantic similarity.

Going back to the category space, we now map each category name into a 768-dimensional vector space with the BERT model. The next step is to apply TDA to this high-dimensional point cloud to detect clusters of similar categories. To do this we use UMAP (Uniform Manifold Approximation and Projection), which provides a great tool to perform unsupervised dimensionality reduction and to detect similarities between the data points in a way that is robust from a topological point of view.

Figure 2: UMAP 2D representation of ~500 different categories (the dots in the figure). For some of them we explicitly display the name of the category.

 

From the reduced-dimensionality representation it is possible to visually spot some regions with a high density of categories. Still, this is not enough to call them clusters of data points. To do that we need to exploit an intermediate representation that UMAP builds before performing dimensionality reduction. UMAP works by first representing the point cloud as a weighted graph (or network), and then solving the graph-layout problem of placing each node of the graph (a data point) in a 2- or 3-dimensional space that can be visualized. This intermediate network representation is the key to obtaining our category clusters.

Figure 3: The network representation of the categories.


Community detection

Network theory provides a framework to approach a wide range of problems on real or synthetic networks. An important example is community detection, where the goal is to identify groups of network elements that share more connections inside the group than outside it: a network-theoretic formulation of the classical clustering problem. In the previous section we used UMAP to construct a graph representation of the category point cloud. What we now want to do is apply one of the best-known community detection algorithms, the Louvain algorithm, to separate categories into different communities, or super-categories. The outcome of this clustering step is visualized in the figures below.
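A minimal sketch of Louvain community detection, using networkx's built-in implementation on a toy graph made of two dense groups (think "footwear" vs "electronics") joined by a single weak link:

```python
# Sketch: Louvain community detection on a toy graph with two dense
# groups of nodes connected by one weak bridge.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])  # group A
G.add_edges_from([(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)])  # group B
G.add_edge(3, 4)  # weak bridge between the groups

communities = louvain_communities(G, seed=42)
print([sorted(c) for c in communities])  # the two dense groups emerge
```

In our case the input is not a toy graph but the UMAP neighbour graph over the category embeddings, and each detected community becomes one super-category.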

Figure 4: The clusters obtained by community detection. Different colors are associated with different super-categories. The numerical labels can be later manually converted into category names summarizing the content of the cluster.

 

We can summarize the logical steps done so far (and the tools used) as follows:

  1. Represent category names as 768-dim vectors (NLP and BERT)
  2. Map each vector into a graph node, then link pairs of nodes with an edge based on topological considerations (TDA and UMAP)
  3. Use network theory to identify communities of nodes in the graph and consider these communities as super-categories (Louvain algorithm)

Supervised learning of product categories
With the previous steps we have pre-processed our data into a more manageable problem. Its nature has not changed, though: we still want to infer product (super-)categories from product titles, but we have now reduced the number of categories from roughly 500 to 17.

To solve the initial multi-class text classification problem we are going to use UMAP again, but in a different way. UMAP is very flexible: it also allows supervised dimensionality reduction using categorical labels, and that is exactly what we are going to do. Our training data consists of a list of product titles and their respective super-category labels. Once we get the vector embeddings of each product title, we can train UMAP to learn a low-dimensional latent space in which each point is labeled with the super-category information.

Figure 5: The super-category labels learned and associated with product titles (the dots in the figure), semantically embedded in 2 dimensions.

The learned latent space can then be used to train a classifier (such as KNN) on the transformed data. Predictions on new data are obtained by first mapping the data into the learned latent space and then calling the classifier.

In conclusion, we have presented a proof of concept for a multiclass classification problem: Topological Data Analysis and Network Theory are used to learn the structure of the available product category space in an unsupervised fashion; then TDA is adopted again to infer product categories in a supervised way. I believe this approach is extremely powerful, especially in an exploratory phase, where there is the need to learn more about the taxonomy structure of the products and possibly to adapt that structure to company-specific needs. Although the implementation details of the UMAP algorithm can make it sub-optimal for a real-time production environment, the method can still be extremely useful in an offline context. Moreover, the family of TDA algorithms is growing fast, and it may soon be possible to find a more efficient implementation of, or alternative to, UMAP that would allow the proposed approach to be deployed in real-time production environments as well.

Author: Silvano Garnerone