Nets are for fish; once you get the fish, you can forget the net.
Words are for meaning; once you get the meaning, you can forget the words.
庄子(Zhuangzi), Chapter 26
At first glance you may ask what on earth that quote has to do with AI, but I suspect as you read this blog you'll go...ah-ha! Words are constructs for us to share meaning; once we understand the meaning, the words themselves are irrelevant. Under the hood, it's the meaning of words, represented as vectors, that Large Language Models (LLMs) use to work with language.
Imagine we're trying to understand if these are the same pieces of text:
- 6 inches
- 6"
- 1/2 foot
Clearly, the actual text is very different, but at least here in the U.S. we would agree that all three have the same meaning; we call this concept "semantic meaning". In the world of AI, when we talk about embeddings we're talking about semantic vectors: think of them like a GPS system for the meaning of language. And the embeddings for those three pieces of text show that they are nearly identical in meaning. The embedding model learns the meaning of words and concepts from the fact that, in its training data, they appear in similar contexts. For example, the embeddings for "dog" and "puppy" would be close together in the vector space because they share a similar meaning and often appear in similar texts, while "puppy" and "computer" would have embeddings further apart because their meanings and contexts are quite different.
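To make that concrete, here is a minimal sketch of fetching embeddings for those three strings with OpenAI's Python SDK. The model name and client setup are assumptions on my part (you'd need your own API key), but the shape of the result is the point:

```python
# pip install openai numpy
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

texts = ['6 inches', '6"', '1/2 foot']
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vectors = [np.array(item.embedding) for item in resp.data]

# Every vector comes back with the same 1,536 dimensions,
# regardless of how long or short the input text was.
print(len(vectors[0]))  # 1536
```

We'll compare vectors like these with cosine similarity a little further down.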
Looking at the picture below, you can imagine how we might group similar concepts together by meaning on a 2-dimensional chart:
Instead of 2 dimensions like the chart above, OpenAI's leading embedding model (ada-002) gives back vectors with 1,536 dimensions! Embeddings aren't a new invention, but with the massive training data that has gone into creating frontier models like GPT-4, we also get a revolutionarily powerful embedding engine with a much better view into the breadth and meaning of human language.
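If you're curious how a chart like the one above gets made from 1,536-dimension vectors, here is one rough sketch: project the vectors down to 2 dimensions and plot them. The choice of scikit-learn's PCA, the word list, and the placeholder vectors are all my assumptions; the blog doesn't say how its chart was produced.

```python
# pip install scikit-learn matplotlib numpy
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["dog", "puppy", "cat", "computer", "laptop"]
# Placeholder vectors; in practice these would be real 1,536-dimension
# embeddings fetched as in the snippet above.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((len(words), 1536))

# Project the high-dimensional vectors down to 2 dimensions for plotting.
points = PCA(n_components=2).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()
```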
So what?
- Vector embeddings create a massive amount of data, and to handle it we need specialized technology, generally referred to as vector databases. As an example, a product description a few sentences long may be only a few hundred bytes of data, but any content we embed comes back as a vector with 1,536 dimensions. The size of the vector is constant, roughly 34KB per embedding as it comes back from the API, so if you are generating embeddings for small pieces of text, the embedding will often be over 50 times bigger than the text itself!
- Certainly we could also use a frontier LLM like GPT-4 for this work: if you ask ChatGPT whether 6" and 1/2 foot mean the same thing, it will confirm that they do. But that's very expensive and very slow; GPT-4 currently costs up to 600x more than the leading embedding model and has rate limits roughly 100x lower.
- By having semantic vectors, we enable searching by similarity of meaning. Using algorithms like cosine similarity, vector databases let us rapidly search across huge datasets to find items whose meaning is similar to our search target (see the sketch after this list). And amazingly, this can even work across different forms of media: when the embedding model is trained to be multi-modal, a picture of a dog will have an embedding similar to that of the word "dog".
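Here is a rough illustration of that search step: cosine similarity in plain NumPy over a tiny catalog. The placeholder vectors are assumptions standing in for real embeddings, and a real vector database would use approximate-nearest-neighbor indexes rather than this brute-force loop, but the idea is the same.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Ranges from -1 to 1; higher means the two meanings point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors stand in for real 1,536-dimension embeddings.
rng = np.random.default_rng(0)
catalog = {text: rng.standard_normal(1536)
           for text in ['6 inches', '6"', '1/2 foot', 'computer']}
query = rng.standard_normal(1536)  # embedding of the search text

# Brute-force nearest-neighbor search: fine for a demo; a vector database
# does the same thing efficiently across millions of stored embeddings.
ranked = sorted(catalog, key=lambda k: cosine_similarity(query, catalog[k]),
                reverse=True)
print(ranked[0])  # the stored item whose meaning is closest to the query
```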
But why is this useful?
- If we can understand the similarity in meaning between any two pieces of content, we have a great tool for finding similar products, customers, or vendors: useful for cleaning up master data within large companies and for building recommendation engines. Imagine a recommendation engine where recommendations are based on understanding your needs and the meaning of product descriptions, so you get a suggestion for the stretchy pants you are actually looking for rather than a numerically optimized "customers like you purchased" suggestion. Suddenly we're able to provide a truly personalized experience where intention and meaning matter more than pure numerical optimization.
- Another powerful use of embeddings is breaking large pieces of content into small chunks and getting embeddings for those chunks, so that we can feed just the right information to an LLM at the right time, giving it up-to-date or proprietary knowledge to produce a better response to a given prompt. This is key for solutions like AI-powered knowledge bases and customer service engines. Because we know exactly which chunks of knowledge are being passed into the LLM, we massively increase our ability to get to the right answer while simultaneously being able to cite our sources (a rough sketch of this loop follows below).
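Here is a minimal sketch of that chunk-embed-retrieve loop. The fixed-size chunking rule, the helper names, the prompt format, and the placeholder `embed` function are all assumptions for illustration; a real system would split text more carefully, store the chunk embeddings in a vector database rather than recomputing them, and send the final prompt to the LLM.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding call (see the OpenAI snippet above)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(1536)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk(document: str, size: int = 500) -> list:
    """Naive fixed-size chunking; real systems usually split on paragraphs or sentences."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_prompt(question: str, document: str, top_k: int = 3) -> str:
    chunks = chunk(document)
    # In practice the chunk embeddings are computed once and stored in a
    # vector database, not recomputed for every question.
    chunk_vectors = [embed(c) for c in chunks]
    q_vector = embed(question)
    # Keep only the chunks whose meaning is closest to the question, so we
    # feed the LLM just the right information and can cite our sources.
    ranked = sorted(zip(chunks, chunk_vectors),
                    key=lambda pair: cosine_similarity(q_vector, pair[1]),
                    reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```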
More reading on this topic: