
Embeddings and Vector Databases: A Practical Guide

ai · embeddings · vector-databases · rag · machine-learning

Embeddings convert text, images, and other source media into arrays of numbers that capture the meaning of the source.

The Big Deal: Google Search understands "car" means "automobile." Your app doesn't. Until now.

Why It Matters

Semantic search, recommendation engines, and RAG systems all need your data stored in a form they can actually use, and that form is an embedding. The embedding process converts words or images into a numeric representation.

Similar meanings = Similar numbers. "King" and "monarch" are closer than "king" and "pizza."

How Embeddings Are Created

Embeddings are generated through a three-step process:

  1. Text is broken down into tokens (this could be individual words or sub-words)
  2. The generated tokens pass through a transformer that uses context to understand meaning (e.g., the word "bank" near "river" is understood differently than "bank" near "money")
  3. The final vector is generated
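The three steps above can be sketched in miniature. This is a toy, not a real model: real tokenizers use learned subword vocabularies, and a real transformer makes each token's vector depend on its surrounding context, but the overall shape of the pipeline is the same.

```python
import hashlib
import numpy as np

def tokenize(text):
    # Step 1: break text into tokens (naive whitespace split here)
    return text.lower().split()

def token_vector(token, dim=8):
    # Step 2 stand-in: a deterministic per-token vector. A real
    # transformer would compute this from the token AND its context.
    seed = int(hashlib.md5(token.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def embed(text):
    # Step 3: pool the token vectors into one fixed-size embedding
    return np.mean([token_vector(t) for t in tokenize(text)], axis=0)

vec = embed("vector databases store embeddings")
print(vec.shape)  # (8,)
```

The key property survives even in the toy: any input text, regardless of length, comes out as one fixed-size vector.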

Types of Embeddings

Text embeddings aren't the only type of embedding. Some of the other types include:

Sentence/Document Embeddings

  • What they are: Entire sentences/paragraphs converted to a single vector
  • Popular models:
    • Sentence-BERT (SBERT): 384d-768d, bi-encoder architecture
    • all-MiniLM-L6-v2: Fast, good general-purpose (384d)
    • all-mpnet-base-v2: Higher quality, slower (768d)
    • OpenAI text-embedding-3-small: 1536d, API-based
    • Cohere embed-v3: Task-specific variants
  • Use cases: Semantic search, RAG, clustering

Contextual Embeddings

  • What they are: Same word gets different embeddings based on context
  • Popular models:
    • BERT (2018): Bidirectional context
    • RoBERTa: Optimized BERT
    • T5, GPT variants: Decoder-based
  • Key difference: "bank" near "river" ≠ "bank" near "money"

Image Embeddings

  • What they are: Images converted to vectors capturing visual features
  • Popular models:
    • CLIP (OpenAI): 512d, text-image aligned
    • ResNet: 2048d, classification-focused
    • Vision Transformers (ViT): Patch-based processing
    • DINOv2 (Meta): Self-supervised, strong features
  • Use cases: Reverse image search, recommendation, classification

Multimodal Embeddings

  • What they are: Different data types mapped to a shared embedding space
  • Popular models:
    • CLIP: Images and text in same space
    • ImageBind (Meta): 6 modalities (image, text, audio, depth, thermal, IMU)
    • ALIGN (Google): Similar to CLIP, larger scale
  • Power move: Search images with text, or text with images

Now that we understand embeddings, let's tackle the next challenge: How do you actually store and search through millions of these vectors efficiently?

Vector Databases: What They Solve

The problem: Finding similar items in millions of vectors.

Vector databases put embedding models to work. By storing the vectors an embedding model generates and letting applications search by meaning or intent, they enable experiences such as visual, semantic, and multimodal search.

In the current landscape, they've been used in conjunction with generative AI models to create conversational search agents. For large language models (LLMs), vector databases provide an external, up-to-date knowledge base, helping to ground responses in factual information and reducing hallucinations.

Popular options:

  • Local: ChromaDB, Qdrant, Milvus
  • Cloud: Pinecone, Weaviate
  • Extensions: pgvector for PostgreSQL

Hands-On: Local Vector Database

Let's put all of this into practical use by building a simple similarity search system that will:

  1. Convert text documents into embeddings using Hugging Face
  2. Store them in ChromaDB (a vector database)
  3. Perform semantic similarity searches
  4. Show what the data actually looks like in the database

Setup

First, install the required packages:

pip install chromadb sentence-transformers numpy

Step 1: Create Embeddings with Hugging Face

We'll use Sentence Transformers, a Hugging Face library designed for creating embeddings:

from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample documents
documents = [
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning is a subset of artificial intelligence.",
    "Vector databases store data as mathematical vectors for similarity search."
]

# Generate embeddings
embeddings = model.encode(documents)

What Exactly Does an Embedding Look Like?

Let's look at the first embedding:

print(f"Shape: {embeddings[0].shape}")
# Output: Shape: (384,)

print(f"First 10 values: {embeddings[0][:10]}")
# Output: First 10 values: [-0.06037837  0.02286933 -0.01833793  0.0179959  -0.03234914 -0.15243402
#  -0.02566721  0.0894085  -0.08026304  0.00828351]

Each document is represented by a 384-dimensional vector of floating-point numbers. These numbers encode the semantic meaning of the text.

Step 2: Storing in ChromaDB

ChromaDB provides a simple interface for storing and querying embeddings:

import chromadb

# Initialize client
client = chromadb.Client()

# Create a collection (like a table in traditional databases)
collection = client.create_collection(
    name="tech_articles",
    metadata={"description": "Technical articles about programming and AI"}
)

# Add documents
collection.add(
    embeddings=embeddings.tolist(),
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[{"source": "blog", "index": i} for i in range(len(documents))]
)

What's in the Database?

Let's peek into the database:

all_items = collection.get(include=['embeddings', 'documents', 'metadatas'])

print(f"IDs: {all_items['ids']}")
# Output: IDs: ['doc_0', 'doc_1', 'doc_2']

print(f"Embedding dimensions: {len(all_items['embeddings'][0])}")
# Output: Embedding dimensions: 384

print(f"First document: {all_items['documents'][0]}")
# Output: First document: "Python is a high-level programming language..."

print(f"First metadata: {all_items['metadatas'][0]}")
# Output: First metadata: {'source': 'blog', 'index': 0}

The database stores:

  • IDs: Unique identifiers for each document
  • Embeddings: 384-dimensional vectors
  • Documents: The original source text
  • Metadata: Additional information about each document

Step 3: Similarity Search

Now let's search for similar documents:

# Search query
query = "What is a vector database?"

# Convert query to embedding
query_embedding = model.encode([query])

# Search for similar documents
results = collection.query(
    query_embeddings=query_embedding.tolist(),
    n_results=3
)

for doc, distance in zip(results['documents'][0], results['distances'][0]):
    print(f"Document: {doc}")
    print(f"Distance: {distance:.4f}\n")

Output:

Document: Vector databases store data as mathematical vectors for similarity search.
Distance: 0.6234

Document: Machine learning is a subset of artificial intelligence.
Distance: 0.9876

Document: Python is a high-level programming language known for its simplicity.
Distance: 1.2341

The lower the distance, the more closely the document matches the query.

Understanding Similarity Scores

Vector databases use distance metrics to measure similarity. The two most common:

  1. Cosine Similarity: Measures the angle between vectors
  2. Euclidean Distance: Measures straight-line distance
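Both metrics are a few lines of NumPy. Note that cosine *similarity* runs 1.0 for identical directions down to -1.0, while Euclidean *distance* starts at 0.0 for identical vectors, so "lower is better" (as in the ChromaDB results above) applies to distances, not similarities:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (most similar); 0.0 = unrelated (orthogonal)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # 0.0 = identical vectors; larger = less similar
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(cosine_similarity(a, b))   # 1.0  — the angle between them is zero
print(euclidean_distance(a, b))  # ~3.74 — the magnitudes differ
```

The example shows why the choice matters: cosine ignores vector length and only compares direction, while Euclidean distance penalizes it.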

Why This Matters

Traditional keyword search would struggle with queries like:

  • "What is a vector database?" vs "How do embedding databases work?"

They might not use the same words, but they have the same semantic meaning. Embeddings capture this meaning, allowing the system to find relevant documents even when exact words don't match.

Common Mistakes When Implementing Embeddings

1. Using the Wrong Model for Your Use Case

The mistake: Defaulting to OpenAI's text-embedding-3-large for everything.

What happens: You're paying 10x more and getting 5x slower performance for marginal quality gains on simple tasks.

2. Not Chunking Documents Properly

The mistake: Embedding entire documents as single vectors.

What happens: Lost context, terrible search results.
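A minimal chunker looks like the sketch below. It splits on character counts with a small overlap so sentences cut at a boundary still appear intact in one chunk; production chunkers usually split on sentence or paragraph boundaries and count tokens rather than characters:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

doc = "word " * 300          # ~1500 characters of dummy text
chunks = chunk_text(doc)
print(len(chunks))           # 4 — each chunk gets its own embedding
```

Each chunk is then embedded and stored as its own record, so a query can land on the relevant passage instead of an averaged-out whole-document vector.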

3. Ignoring Embedding Drift

The mistake: Mixing embeddings from different models or versions.

Real scenario:

  • Monday: Embed docs with all-MiniLM-L6-v2
  • Tuesday: Switch to text-embedding-3-small
  • Result: New queries can't find old documents

The fix: Version your embeddings.
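One lightweight way to version embeddings is to stamp each record's metadata with the model that produced it. The field names here are illustrative, not a ChromaDB convention:

```python
EMBEDDING_MODEL = "all-MiniLM-L6-v2"

def make_metadata(source, index):
    return {
        "source": source,
        "index": index,
        "embedding_model": EMBEDDING_MODEL,  # version tag
    }

def needs_reembedding(metadata, current_model=EMBEDDING_MODEL):
    # True if this record was embedded with a different model
    return metadata.get("embedding_model") != current_model

old = {"source": "blog", "index": 0, "embedding_model": "text-embedding-3-small"}
print(needs_reembedding(old))  # True — re-embed before querying against it
```

On a model switch, a quick scan over metadata tells you exactly which records are stale instead of leaving the mismatch to fail silently at query time.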

4. Not Preprocessing Text Consistently

The mistake: Different preprocessing for indexing vs querying.
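The fix is to route both paths through one normalization function, so documents and queries are guaranteed to get identical treatment. A minimal sketch:

```python
import re
import unicodedata

def preprocess(text):
    """One normalization path used for BOTH indexing and querying."""
    text = unicodedata.normalize("NFKC", text)  # unify unicode forms
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    return text

# The same function runs at index time and at query time:
indexed = preprocess("  Vector   Databases\n")
query = preprocess("vector databases")
print(indexed == query)  # True — both sides stay consistent
```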

5. Over-Engineering for Small Datasets

The mistake: Setting up Pinecone/Weaviate for 1,000 documents.

Reality check:

  • < 10K documents: Use in-memory search (FAISS)
  • < 100K documents: SQLite with vector extension
  • < 1M documents: PostgreSQL with pgvector
  • > 1M documents: Now consider dedicated vector DBs

If you have < 10K documents, you're over-engineering. Use NumPy.
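For small collections, brute-force cosine search over a NumPy matrix is a few lines and fast enough. The embeddings below are random stand-ins; in practice the matrix would hold real model outputs:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Brute-force cosine search — plenty fast below ~10K documents."""
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_norms @ q                  # cosine similarity per document
    idx = np.argsort(scores)[::-1][:k]      # highest-scoring first
    return idx, scores[idx]

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 384))              # stand-in embeddings
query = docs[42] + 0.01 * rng.standard_normal(384)   # a near-copy of doc 42

idx, scores = top_k(query, docs)
print(idx[0])  # 42 — the nearest document
```

No index to build, no service to run: one matrix multiply does the whole search.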

6. Not Caching Embeddings

The mistake: Re-embedding the same content repeatedly.

The cost: OpenAI charges $0.13 per million tokens. Re-embedding your FAQ page 1000 times = wasted money.

Cost comparison: 1M embeddings = $0.13 with OpenAI vs. $0 with Hugging Face.
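A cache can be as simple as a dict keyed by a hash of the text. This sketch keeps it in memory; a real system would persist it (e.g. SQLite or Redis) and include the model name in the key, since embeddings from different models are incompatible:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by a hash of the text (sketch)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.misses = 0  # counts actual calls to the embedding model

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

cache = EmbeddingCache(embed_fn=lambda t: [len(t)])  # stand-in embedder
cache.get("hello")
cache.get("hello")   # served from cache — no second model call
print(cache.misses)  # 1
```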

7. Forgetting About Hybrid Search

The mistake: Relying 100% on semantic search.

The problem: Semantic search fails on:

  • Exact matches (product IDs, names)
  • Rare terms
  • Acronyms

The solution: Combine with keyword search.
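One simple way to combine them is a weighted blend of the two scores, assuming both have been normalized to [0, 1] first. Reciprocal rank fusion is another common choice; this sketch shows the linear blend:

```python
def hybrid_score(keyword_score, semantic_score, alpha=0.5):
    """Blend keyword (e.g. BM25) and semantic scores; alpha weights semantic."""
    return alpha * semantic_score + (1 - alpha) * keyword_score

# Query: an exact product ID. Keyword search nails it, semantic search
# doesn't — the blend keeps the right document on top anyway.
docs = {
    "doc_sku":     {"keyword": 1.0, "semantic": 0.2},  # contains the exact ID
    "doc_similar": {"keyword": 0.0, "semantic": 0.7},  # merely related
}
ranked = sorted(
    docs,
    key=lambda d: hybrid_score(docs[d]["keyword"], docs[d]["semantic"]),
    reverse=True,
)
print(ranked[0])  # doc_sku
```

Tuning alpha shifts the balance: closer to 1.0 favors meaning, closer to 0.0 favors exact matches, rare terms, and acronyms.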

Key Takeaways

  1. Embeddings convert source to numbers that capture semantic meaning
  2. Vector databases store and index these embeddings
  3. Similarity search finds related content based on meaning
  4. The database structure includes IDs, embeddings, original source documents, and metadata

Use Cases

This pattern powers:

  • Semantic search engines
  • RAG (Retrieval Augmented Generation) systems
  • Recommendation engines
  • Document clustering and classification
  • Question-answering systems

Your Move

Right now: Copy the code above. Run it. You'll have semantic search working in a few lines of code.

This week: Replace your current keyword search with this.

This month: Show your boss why you deserve a raise.