Optimizing vector search: Unleashing the power of metadata in vector databases
As AI and machine learning applications become more prevalent, efficiently storing and retrieving high-dimensional vector data has become crucial. While vector databases excel at similarity search, combining them with well-structured metadata can significantly enhance retrieval accuracy and performance. Here’s the approach I’ve found most effective:
(I'm using the LangChain framework for the examples, as it's widely used by developers.)
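All of the snippets below assume a common set of imports. The paths here match recent LangChain releases; older versions expose the same classes under langchain.embeddings, langchain.vectorstores, and langchain.schema, so adjust to your installed version.
# Shared imports for the snippets in this post
# (paths assume recent LangChain releases; adjust for older versions)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain.chains import ConversationalRetrievalChain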
# Vector store without any metadata
embeddings = OpenAIEmbeddings()
texts = [
    "Sample text 1",
    "Sample text 2",
]
vectorstore = Chroma.from_texts(texts, embeddings)
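For reference, a query against this bare store can only rank by vector similarity; there is nothing to filter or group on:
# Baseline query: similarity only, no way to narrow results down
results = vectorstore.similarity_search("Sample query", k=1)
print(results[0].page_content)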
1. Store metadata alongside vectors: Instead of keeping metadata in a separate store, save it in the database right next to your vectors. Subsequent queries become faster and much more efficient, and each result carries its provenance (see the read-back example after the snippet).
# Store metadata alongside vectors
embeddings = OpenAIEmbeddings()
texts = [
    "Sample text 1",
    "Sample text 2",
]
metadatas = [
    {"source": "book1", "page": 1},
    {"source": "book2", "page": 2},
]
vectorstore = Chroma.from_texts(texts, embeddings, metadatas=metadatas)
# Remember that the metadatas list must be the same length as the texts list
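The stored metadata comes back attached to every result, so each hit is traceable to its source:
# Each result is a Document carrying its metadata
results = vectorstore.similarity_search("Sample query", k=2)
for doc in results:
    print(doc.page_content, doc.metadata["source"], doc.metadata["page"])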
2. Optimize metadata schema: Design your schema deliberately. Where possible, prefer a flat structure over deeply nested objects; flat fields are cheaper to index and filter on (a flattening helper is sketched after the example).
# Optimize metadata schema
document = Document(
    page_content="We shall suppose that the rods 'expand' by an amount proportional to the increase of temperature",
    metadata={
        "source": "book1",
        "chapter": 1,
        "page": 21,
        "paragraph": 4,
        "line": 2,
        "author": "Albert E",
        "category": "science",
        "publish_date": "1986-07-23",
        # ... add as many relevant fields as you can
    },
)
# This is more about database design; it varies from case to case.
# The more detail you attach to a document/chunk, the further you can narrow a search down later.
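If your source data arrives nested, flattening it before ingestion keeps the schema filter-friendly. A minimal sketch (flatten_metadata is a hypothetical helper of my own, not a LangChain API):
# Flatten nested metadata into filterable top-level keys (hypothetical helper)
def flatten_metadata(nested, parent_key="", sep="_"):
    flat = {}
    for key, value in nested.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_metadata(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# {"book": {"source": "book1", "page": 21}} -> {"book_source": "book1", "book_page": 21}
print(flatten_metadata({"book": {"source": "book1", "page": 21}}))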
3. Leverage pre-filtering: Apply metadata filters before the vector similarity search. This can drastically reduce the search space and deliver a big speedup for queries (a standalone filtered search follows the chain example).
# Pre-filtering
retriever = vectorstore.as_retriever(
    search_kwargs={
        # newer Chroma versions need $and to combine conditions;
        # a single condition can be a plain dict like {"source": "book1"}
        "filter": {"$and": [{"source": "book1"}, {"chapter": 4}]},  # only chapter 4 of book1
        "k": 10,  # retrieve only the top 10 records
    }
)
llm = ChatOpenAI()  # any chat model will do here
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    # ...
)
chain.invoke({
    "question": "Briefly describe the contents of chapter 4 in 300 words",
    "chat_history": [],
})
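You can also pass the same filter to a one-off search without building a chain, which is a quick way to sanity-check what the retriever will actually see:
# One-off pre-filtered search (same filter syntax as above)
docs = vectorstore.similarity_search(
    "key results in chapter 4",
    k=10,
    filter={"$and": [{"source": "book1"}, {"chapter": 4}]},
)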
4. Encode efficiently: Encode categorical metadata with techniques like one-hot encoding or learned embeddings so it can be represented in vector form and searched alongside the text embedding (a note on weighting follows the example).
# Implement efficient encoding
from sklearn.preprocessing import OneHotEncoder
import numpy as np

categories = ['fiction', 'non-fiction', 'biography']
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
encoded = encoder.fit_transform(np.array(categories).reshape(-1, 1))

# Combine with the text embedding
vector = embeddings.embed_query("Sample text")
combined_vector = np.concatenate([vector, encoded[0]])  # encoded[0] is the row for 'fiction'
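One caveat, and this is my own rule of thumb rather than anything the libraries enforce: the one-hot segment can be drowned out by (or dominate) the text embedding during similarity comparison, so it is often worth weighting it explicitly:
# Weight the categorical segment relative to the text embedding
category_weight = 0.3  # assumed starting point; tune against your own retrieval metrics
combined_vector = np.concatenate([vector, category_weight * encoded[0]])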
5. Regular maintenance: Data patterns drift over time, especially in multi-user systems. Periodically review query patterns and adjust your indexing strategy to keep the database well optimized (a pattern-tracking sketch follows the example).
# Regular maintenance
def update_index(vectorstore):
    # Assuming new_documents is a list of new or updated documents
    new_documents = [
        {"text": "New text 1", "metadata": {"source": "book3", "page": 1}},
        {"text": "New text 2", "metadata": {"source": "book3", "page": 2}},
    ]
    vectorstore.add_texts(
        [doc["text"] for doc in new_documents],
        metadatas=[doc["metadata"] for doc in new_documents],
    )
# In practice this also involves analyzing query patterns and updating indexes
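Here is one way to capture those query patterns; the tracker below is a hypothetical sketch of my own, not part of LangChain:
# Hypothetical query-pattern tracker to inform indexing decisions
from collections import Counter

filter_usage = Counter()

def tracked_search(vectorstore, query, filter=None, k=5):
    if filter:
        filter_usage.update(filter.keys())  # count which metadata fields get filtered on
    return vectorstore.similarity_search(query, k=k, filter=filter)

# Fields that show up often here are the ones worth indexing first
print(filter_usage.most_common(3))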
6. Periodic cleanup: Periodically clean up deprecated or unused data to keep retrieval performance optimal (an edition-replacement sketch follows the example).
# Periodic cleanup
# You should periodically retire older data for best retrieval performance
collection = vectorstore.get()
all_ids = collection['ids']
all_metadatas = collection['metadatas']
# Find ids of documents from a deprecated source (here, everything from book1)
ids_to_remove = [
    doc_id for doc_id, metadata in zip(all_ids, all_metadatas)
    if metadata.get("source") == "book1"
]
# Remove the old documents
if ids_to_remove:
    vectorstore.delete(ids_to_remove)
# This also applies when a new edition of an existing book, or similar data, enters the system
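For the new-edition case specifically, delete the old chunks as above and re-add the replacements with stable ids. A minimal sketch, assuming an id convention derived from source and page:
# Add the replacement edition (the id scheme is an assumed convention)
vectorstore.add_texts(
    ["Updated text for page 1"],
    metadatas=[{"source": "book1-2nd-ed", "page": 1}],
    ids=["book1-2nd-ed-p1"],  # stable ids make future upserts straightforward
)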
7. Use approximate nearest neighbor (ANN) algorithms: These algorithms offer a good trade-off between search speed and accuracy for large-scale applications, and they combine well with metadata filtering (a tuning sketch follows the example).
# Chroma builds an HNSW index (an ANN algorithm) under the hood
results = vectorstore.similarity_search_with_score("query", k=5)
# Not all data stores support this feature, so choose your vector database accordingly
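Chroma exposes a few HNSW settings through collection metadata; the distance-function key below is standard Chroma configuration, though the available keys can vary by version:
# Tune the ANN index at collection-creation time
vectorstore = Chroma.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    collection_metadata={"hnsw:space": "cosine"},  # cosine distance instead of the default l2
)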
The goal is to design a robust, flexible system that combines the power of vector similarity search with the precision of traditional database filtering. These strategies have helped me consistently reach more accurate, contextually relevant results in AI applications ranging from recommendation systems to semantic search engines.
What are some of your favorite strategies for handling large vector databases? Share in the comments!
#VectorDatabases #MachineLearning #DataEngineering #AI #Langchain