Introduction
In any business, the quality and reliability of your data are paramount. One critical aspect of maintaining data integrity is data rationalization: a technique used to eliminate redundant or duplicate data from your systems. This not only streamlines business intelligence analysis but also enhances the accuracy of insights derived from the data.
In this blog post, we'll embark on a journey through the process of data rationalization – showing how data can be transformed from messy into magical. From taking stock of your data inventory to identifying key fields for processing, and from utilizing various matching techniques—including exact, semantic, and AI-assisted matches—to effectively packaging the results for use, we'll touch on each step. This post aims to provide you with a general roadmap for repeatably executing successful data rationalization initiatives. We’re going to focus on product data rationalization, but the approach can be applied to any data category where duplication runs rampant (yes, even customer / contact data!). Let's dive in and explore how to transform a sea of data (complete with floating plastic) into a streamlined reservoir of valuable information.
Set-up
As with most transformative change, the first step is to admit you have a problem. Since you’re here… Congrats! We can safely assume you’ve conquered denial.
Your next step is to determine the subset of data you can clean-up that will add business value. For instance, cleaning up product or service data would enable a BI tool to show profitability analysis with full drill-down capabilities.
Now that you know your data subset, the last set-up task is to gather, extract, download, cajole, or otherwise wrangle the data into one spot. At Attainable Edge, we’re old enough to have grown up with relational databases, so that’s our tool of choice. However, use whatever is convenient for you and your team to work with: CSV files, JSON, XML, etc. As long as the data is in one spot, your journey is off to a great start.
Inventory Data
With the data in a single spot, take inventory of what’s available to you and what will be pertinent for data rationalization. If you’re inventorying product data, you’ll likely be keying on Names, Descriptions, Manufacturer numbers, UPC codes, Taxonomy, etc.
During this process, also look for distinctions in data patterns between categories. For instance, you may note differences in the reliability of catalogue numbers between private label and manufacturer products. Note these differences and come up with a plan to make use of the most reliable pieces of data within each category. Yes, we’re proposing you embrace these differences rather than lament that you can’t rely on a specific set of fields for all records!
Rationalization
We’ll break the rationalization, or “matching”, process into three main categories:
- Exact matches
- Semantic matches
- AI-assisted matches
We’ll jump into explanations of each method below, but please note the order of execution matters for two reasons:
- Each successive method is more expensive to run and less certain than the one before it.
- The later methods build on the output of the earlier ones (the AI-assisted step, for example, works from the semantic match candidates).
Exact Matches
Exact matches are the easiest, cheapest, most reliable, and (typically) least frequently found matches for data rationalization. They may require a combination of fields (e.g., manufacturer name + manufacturer number) to find unique matches, but exact matches are worth spending the time and effort to find.
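To make this concrete, here’s a minimal Python sketch of keying on a field combination. The records-as-dictionaries shape and the manufacturer_name / manufacturer_number field names are illustrative; substitute whatever combination your own data supports.

```python
from collections import defaultdict

def find_exact_matches(records, key_fields=("manufacturer_name", "manufacturer_number")):
    """Group records that share identical values across the chosen key fields."""
    groups = defaultdict(list)
    for record in records:
        # Light normalization only (trim whitespace, ignore case); anything
        # fancier belongs to the semantic and AI-assisted steps below.
        key = tuple(str(record.get(field, "")).strip().lower() for field in key_fields)
        if all(key):  # skip records missing one of the key fields
            groups[key].append(record)

    # Any key shared by more than one record is an exact-match cluster.
    return [cluster for cluster in groups.values() if len(cluster) > 1]
```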
The trouble with exact matches is that they only work if fields are exactly the same. For instance, a product description containing “3.5 feet” is not an exact match to “3.5 ft” or “three and a half feet” or “42 inches” – even though a human can tell they have the exact same meaning. If your development team has some masochistic tendencies, they may attempt to handle these discrepancies with libraries of common abbreviations, complicated regular expressions, or maybe the purchase of specialized APIs or MDM tools.
At Attainable Edge, we recommend a different approach for non-exact matches that we’ll cover in the next section. Our preference is to take advantage of exact matches where they exist (be generous in offering your “thanks” to any data-related deities you believe in) and not force them where they don’t exist.
Semantic Matches
We’ll breeze past this section, since you’re already an expert on Semantic matches from our previous blog found here.
Wait… you haven’t read it? Check it out… we’ll wait.
Fine. Let’s review the cliff notes.
- Semantic vectors use an embedding model to generate a numeric representation of the meaning behind a word/phrase/sentence/acronym/etc.
- The semantic vectors for 0.5’, ½ foot, 6 in., six inches, and 6” are all nearly identical even though the plain text representation varies wildly.
- You can use measures like cosine distance to determine the difference between two semantic vectors… the closer to zero your result is, the closer the semantic match.
It’s pretty easy to see where this is going now… you generate embeddings for the key fields you identified during your inventory exercise, then compare the vectors to find close matches. This comparison can be accomplished in many ways. You can store the embeddings in CSV or text files, then read them into a program and compute cosine distance or Euclidean distance yourself. Alternatively, you can use a specialized vector database with built-in search capabilities for running the comparisons. Both options can be run locally or in the cloud, each with the requisite pros and cons.
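Here’s a minimal sketch of the do-it-yourself route, assuming a locally available sentence-transformers model; the model choice, sample descriptions, and cut-off value are all illustrative and worth tuning against duplicates you already know about.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model or API will do

# Model choice is illustrative; swap in whichever embedding model you prefer.
model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "Widget bracket, 3.5 ft, galvanized",
    "Galvanized widget bracket - 42 inches",
    "Deluxe sprocket, 10 pack",
]
embeddings = model.encode(descriptions)  # one vector per description

def cosine_distance(a, b):
    """0 means identical meaning (per the model); larger means less similar."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Brute-force pairwise comparison: fine for small sets, slow at scale.
for i in range(len(descriptions)):
    for j in range(i + 1, len(descriptions)):
        distance = cosine_distance(embeddings[i], embeddings[j])
        if distance <= 0.2:  # the cut-off is data-dependent; tune it for your catalog
            print(f"Candidate match: {descriptions[i]!r} <-> {descriptions[j]!r} ({distance:.3f})")
```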
At Attainable Edge, we prefer specialized vector databases because they’re built with performance in mind. Once your data (including the massive vectors) is loaded into the vector database, searches can be performed in milliseconds. Those milliseconds become essential when you’re dealing with hundreds of thousands or millions of records.
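As a sketch of the vector database route, here’s what loading and querying could look like with Chroma (one open-source option), continuing from the embeddings and descriptions generated in the previous sketch; the product IDs and collection name are made up.

```python
import chromadb

client = chromadb.Client()  # in-memory for experimentation; use a persistent or hosted setup in production
collection = client.create_collection(name="products")

# Load the embeddings generated earlier, keyed by your own product IDs.
collection.add(
    ids=["sku-001", "sku-002", "sku-003"],
    embeddings=[vector.tolist() for vector in embeddings],
    documents=descriptions,
)

# Nearest-neighbor search for one record; results come back in milliseconds, even at scale.
results = collection.query(query_embeddings=[embeddings[0].tolist()], n_results=3)
print(results["ids"], results["distances"])
```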
With the comparison done, you now have a map of similar pieces of data. If the data is relatively uniform, you may pick a cut-off value (e.g., cosine distance <= 0.001) and determine that’s good enough. Feel free to let us know how much you enjoyed this article and then move on with your day.
If your data isn’t uniform, you’re going to want to read on…
AI-Assisted Matches
The final tool in our toolbox for data rationalization involves asking an AI to help us with the data comparisons. At any point previously, you could have simply dumped a large number of records into an LLM API and asked it to find duplicates. That is a completely viable option. The difficult part is working effectively within the context window limitations for both input and output to ensure you’re getting the best bang for the buck.
The approach we’ve found at Attainable Edge that optimizes our results is to take the semantic similarity matches created in the prior step, combine them with the original data record being compared, and ask the LLM to look for duplicates within that smaller data set. This approach limits the amount of data the LLM needs to sift through and keeps the comparison context focused on a single product instance.
The prompts to the LLM are key for this comparison. If you want the LLM to weight the comparison of product descriptions more heavily than the comparison of UPC codes, you need to explicitly tell it so. We’ve also found it useful to have the LLM include its rationale for each match decision in its response (you can implement this with JSON, CSV, plain text, etc.). By including the rationale, you can see how the LLM is interpreting the prompt you’ve provided and tweak it to achieve a different/better result. Lastly, we recommend including examples (good and bad), utilizing chain-of-thought (CoT) techniques, and explicitly stating the response format you’re expecting within the LLM prompt.
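As an illustration, here’s a rough sketch of such a prompt and call, assuming an OpenAI-style chat completions client; the model name, field weighting, examples, and JSON shape are placeholders you would tailor to your own data.

```python
import json
from openai import OpenAI  # any chat-completion-capable LLM client will do

client = OpenAI()  # assumes an API key is configured in the environment

SYSTEM_PROMPT = """You are a product data analyst deduplicating a catalog.
Weight the comparison of product descriptions more heavily than the comparison of UPC codes.
Think through each comparison step by step before deciding.
Example duplicate: "Bracket, 3.5 ft, galvanized" vs "42 inch galvanized bracket" (same item, same length).
Example non-duplicate: "Sprocket, 10 pack" vs "Sprocket, single" (different pack size).
Respond with JSON only: {"matches": [{"candidate_id": "...", "is_duplicate": true, "rationale": "..."}]}"""

def judge_candidates(anchor_record, candidate_records, model="gpt-4o-mini"):
    """Ask the LLM whether each semantic-match candidate duplicates the anchor record."""
    user_prompt = (
        f"Anchor product:\n{json.dumps(anchor_record)}\n\n"
        f"Candidate products:\n{json.dumps(candidate_records)}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)
```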
Once you’ve got your prompt worked out, set up your batch process (don’t forget your rate limit and retry steps!) and start cataloging your results.
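A bare-bones version of that batch loop might look like the following; the pause intervals are arbitrary, and judge_candidates is the illustrative helper from the previous sketch.

```python
import time

def run_batch(anchor_candidate_pairs, max_retries=3, pause_seconds=1.0):
    """Process (anchor, candidates) pairs with naive rate limiting and exponential-backoff retries."""
    results = []
    for anchor, candidates in anchor_candidate_pairs:
        for attempt in range(max_retries):
            try:
                results.append(judge_candidates(anchor, candidates))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # in real life, log the failure and keep going
                time.sleep(pause_seconds * 2 ** attempt)  # back off before retrying
        time.sleep(pause_seconds)  # crude rate limiting between calls
    return results  # persist these somewhere durable; you'll need them in the next section
```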
Putting it All Together
Now that you’ve got your exact matches, semantic matches, and AI-assisted matches run and stored (you remembered to store the results, right!?!), it’s time to break out your favorite programming language and get those results consolidated. The data rationalization will include collapsing duplicate records down into unique product instances, implementing data quality standards for descriptions and other key fields, and providing a mapping of which original product records roll up into the finalized product instances.
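One simple way to collapse the pairwise matches into unique product instances is a union-find pass over every matched pair; this sketch assumes your stored results can be boiled down to pairs of record IDs, which is an illustrative simplification.

```python
def consolidate(matched_pairs):
    """Collapse pairwise matches (exact, semantic, and AI-assisted) into clusters."""
    parent = {}

    def find(record_id):
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path compression
            record_id = parent[record_id]
        return record_id

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # union the two clusters

    # Map every original record ID to the canonical ID of its cluster.
    return {record_id: find(record_id) for record_id in parent}

# Example: pairs gathered from all three matching methods.
pairs = [("sku-001", "sku-002"), ("sku-002", "sku-105"), ("sku-300", "sku-301")]
mapping = consolidate(pairs)  # {'sku-001': 'sku-105', 'sku-002': 'sku-105', ...}
```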
The format you use to represent the final rationalization will depend on what you’re doing with the data. If you’re reloading the product data into an ERP or PIM, then text, CSV, or JSON files might be appropriate. If it’s feeding directly into an e-commerce website, then Excel files or SQL tables may be the best bet. If the data is going to be pored over by Supply Chain, Finance, and/or Operations personnel for true SKU rationalization, then a BI tool or web interface could be the appropriate presentation layer.
Going Forward
Congrats! You’re now the proud owner of a pristine product data set for use with your ERP, E-commerce, PIM, EDI, CXML, and any other acronym you choose. Time to archive all of that work you did to get here and move on to the next project, right?
Let’s not be too hasty...
That pristine data set will only remain pristine with some TLC. You could simply rerun the above steps every few months to re-clean the data, but that doesn’t sound like a lot of fun. Instead, let’s build off our product clean-up creation and put it to use in evergreen fashion. Maybe we create an API that accepts a new product record from a user, manufacturer, or data feed and runs it through the above process before committing it into your primary data source (ERP, PIM, data warehouse, etc.). You could also turn that API into a full-blown user interface complete with user feedback. Regardless of the path you choose, the goal is to make sure any/all new data is also pristine before it’s allowed entrance to your primary data repository.
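As a sketch of what that intake API could look like, here’s a minimal FastAPI endpoint; the route, record fields, and the check_for_duplicates helper are hypothetical stand-ins for the pipeline described above.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ProductRecord(BaseModel):
    name: str
    description: str
    manufacturer_number: str | None = None
    upc: str | None = None

def check_for_duplicates(record: ProductRecord) -> list[str]:
    """Hypothetical helper: run the exact -> semantic -> AI-assisted pipeline
    against the existing catalog and return the IDs of likely duplicates."""
    return []  # plug in the rationalization steps from earlier in this post

@app.post("/products")
def intake_product(record: ProductRecord):
    duplicates = check_for_duplicates(record)
    if duplicates:
        # Surface the likely duplicates for review instead of silently creating another one.
        return {"status": "needs_review", "possible_duplicates": duplicates}
    # Otherwise, commit the record to the ERP/PIM/data warehouse and confirm.
    return {"status": "accepted"}
```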
Wrap Up
There you have it, the data journey from messy to magical is complete. You’ve gathered, inventoried, matched, consolidated, and future-proofed the quality of your product data. Now go make use of that data to unlock magical insights, new markets, and new opportunities!
Of course, if you’d prefer a partner for the above journey, Attainable Edge is here to help! Contact us here.