There is a specific kind of failure that anyone who has built a search system knows well. A user types something completely reasonable, the system returns results that are technically correct by every metric you defined, and the user still cannot find what they were looking for. You added BM25 ranking. You tuned the weights. You even threw in some synonym expansion. And still, a query like "how do I cancel my subscription" returns nothing useful because the docs call it "account termination" and your keyword index has no idea those two phrases mean the same thing.

That problem, multiplied across millions of queries and terabytes of documents, is what vector databases exist to solve.

What embeddings actually are

Before getting into the database mechanics, it helps to be precise about what we mean by a vector embedding, because the word "embedding" gets used loosely.

When you pass text through a model like text-embedding-ada-002 or a sentence transformer, the model outputs a list of floating point numbers, typically somewhere between 384 and 1536 of them depending on the model. That list is the embedding. What makes it useful is that the model was trained in a way that causes semantically similar inputs to produce numerically similar outputs.

"Cancel my subscription" and "terminate my account" will produce vectors that are close to each other in that high-dimensional space. "The weather in Paris" will produce a vector that is far from both. The geometry of the space encodes meaning.

This is not magic. It is a byproduct of how these models are trained, usually on massive amounts of text where co-occurrence and context teach the model which concepts tend to appear together. The vector is a compressed representation of meaning, not a lookup table.

Each dimension in that vector does not correspond to a human-interpretable concept. You cannot look at dimension 247 and know it represents "subscription-related content." The meaning is distributed across all dimensions simultaneously. This is what makes the math work, and also what makes debugging it annoying when things go wrong.

How similarity search works

Given a query vector and a corpus of stored vectors, the goal is to find the stored vectors that are most similar to the query. The most common measure of similarity is cosine similarity, which measures the angle between two vectors regardless of their magnitude. Dot product similarity is also used, particularly when vectors are normalized, because it becomes mathematically equivalent to cosine similarity and is cheaper to compute.
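To make the two measures concrete, here is a minimal pure-Python sketch (function names are illustrative) showing that once vectors are scaled to unit length, the plain dot product and cosine similarity agree:

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = [0.2, 0.8, 0.1]
b = [0.25, 0.7, 0.05]
# For normalized vectors the magnitudes are 1, so the dot product
# IS the cosine similarity -- one less division per comparison.
dot_of_normalized = sum(x * y for x, y in zip(normalize(a), normalize(b)))
```

This is why many systems normalize vectors once at write time and then use the cheaper dot product at query time.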

The naive approach is exact search: compute the distance between the query vector and every stored vector, rank them, return the top k. This works fine at small scale. At a million vectors, it starts to hurt. At a hundred million, it is not viable for any latency requirement a real product would have.
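The naive approach fits in a few lines. This sketch (toy corpus, illustrative names) is exactly the linear scan described above, one distance computation per stored vector:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, corpus, k=3):
    # Score EVERY stored vector against the query, then keep the best k.
    # This per-row work is what stops scaling at millions of vectors.
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = exact_top_k([1.0, 0.05], corpus, k=2)  # ids of the 2 nearest vectors
```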

This is where approximate nearest neighbor search comes in.

The core insight behind ANN is that you do not need to find the mathematically perfect nearest neighbor. You need to find vectors that are close enough, fast enough. In practice, a result that is 98% as good as the exact nearest neighbor, returned in 10 milliseconds instead of 2 seconds, is strictly better for every real use case.

Different ANN algorithms approach this tradeoff differently. HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each layer is a progressively sparser view of the data. At query time, you start at the top layer with few nodes, find the rough neighborhood, then descend through layers to increasingly fine-grained search. It is fast and maintains high recall, but it uses a lot of memory and the index build time is significant.

IVF (Inverted File Index) partitions the vector space into clusters using something like k-means, then at query time only searches the nearest clusters. This is more memory efficient but recall drops if your data distribution is uneven or your cluster count is poorly tuned.
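A toy version of the IVF idea makes the tradeoff visible (pure Python, illustrative names, centroids supplied directly rather than learned with k-means): probing fewer clusters is faster, but any true neighbor sitting in an unprobed cluster is simply never seen.

```python
def dist2(a, b):
    # Squared Euclidean distance; fine for ranking since sqrt is monotonic.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    # Inverted lists: each vector id goes into its nearest centroid's bucket.
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        lists[min(range(len(centroids)),
                  key=lambda c: dist2(centroids[c], v))].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=2):
    # Only scan the nprobe clusters whose centroids are closest to the query.
    probe = sorted(range(len(centroids)),
                   key=lambda c: dist2(centroids[c], query))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: dist2(vectors[i], query))[:k]

centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]
lists = build_ivf(vectors, centroids)
hits = ivf_search([0.2, 0.2], vectors, centroids, lists, nprobe=1, k=2)
```

Raising `nprobe` recovers recall at the cost of scanning more buckets, which is the knob production IVF deployments spend most of their tuning time on.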

PQ (Product Quantization) compresses vectors by splitting them into subvectors and quantizing each one separately. You lose some precision but you can fit dramatically more vectors in memory. Combining IVF with PQ (IVFPQ) is a common pattern in production when you are dealing with hundreds of millions of vectors and memory is the constraint.
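A stripped-down sketch of the PQ encode/decode step (pure Python, toy codebooks supplied directly instead of trained with k-means, names illustrative) shows where the compression comes from: each subvector is replaced by a single small integer index into a codebook.

```python
def nearest(sub, codebook):
    # Index of the codebook centroid closest to this subvector.
    return min(range(len(codebook)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, codebook[c])))

def pq_encode(vec, codebooks):
    # Split the vector into len(codebooks) subvectors; store one code each.
    d_sub = len(vec) // len(codebooks)
    return [nearest(vec[i * d_sub:(i + 1) * d_sub], cb)
            for i, cb in enumerate(codebooks)]

def pq_decode(codes, codebooks):
    # Approximate reconstruction: concatenate the chosen centroids.
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

# Two subspaces of 2 dims each, 2 centroids per subspace. Real systems use
# e.g. 256 centroids per subspace so each code fits in exactly one byte.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 2.0]],
]
codes = pq_encode([0.9, 1.1, 1.9, 2.1], codebooks)
approx = pq_decode(codes, codebooks)
```

Four float32 values (16 bytes) became two small codes, and the reconstruction is close to, but not exactly, the original vector. That gap is the "some precision" you lose.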

Every ANN algorithm involves the same fundamental tradeoff: recall versus speed versus memory. There is no configuration that wins on all three simultaneously, and anyone who tells you otherwise is selling something.

Why a traditional database cannot do this

A relational database is built around exact lookups and range scans. The B-tree index that makes WHERE id = 42 fast is completely useless for "find me the 10 rows whose embedding column is most similar to this query vector." There is no index structure that makes this efficient in a general-purpose RDBMS.

You could store vectors as arrays in Postgres and write a query that computes cosine similarity across all rows. It will work. It will also do a full table scan every time, and it will not scale past a few hundred thousand rows before your query latency becomes unacceptable.

The pgvector extension changes this picture somewhat. It adds native vector types and HNSW/IVF index support to Postgres, which means you can do approximate nearest neighbor search inside a database you already know. For many workloads, especially where you need to combine vector search with relational filtering ("find documents similar to this query, but only from this user's workspace, created after this date"), pgvector is genuinely good. The operational simplicity of staying on Postgres is real. The performance ceiling is also real. At tens of millions of vectors with high query throughput requirements, purpose-built systems have a meaningful edge.

Dedicated vector databases like Pinecone, Weaviate, Qdrant, and Milvus are built from the ground up around vector indexing and retrieval. They handle things like index sharding, replication, and incremental updates to ANN indices in ways that are operationally non-trivial to implement yourself. Milvus in particular has a sophisticated architecture with separate components for storage, indexing, and query, which gives it flexibility at scale but also makes it the most complex to operate.

FAISS is worth mentioning separately because it is not a database, it is a library. Facebook AI Similarity Search gives you raw access to highly optimized ANN implementations in C++ with Python bindings. If you need maximum control over the index type, compression strategy, and hardware utilization, FAISS is where you go. Many hosted vector databases use FAISS under the hood. The tradeoff is that you own everything, including persistence, serving, and updates.

How this fits into a RAG pipeline

Retrieval-Augmented Generation is the pattern where you give an LLM access to a relevant subset of your data at inference time, rather than trying to bake all your knowledge into the model weights. The vector database is the retrieval component.

Here is what a production document retrieval system actually looks like, stripped of the marketing diagrams.

At index time: you take your corpus, chunk it into segments (the chunking strategy matters more than most people initially realize), run each chunk through an embedding model, and store the resulting vectors alongside the original text and any metadata. The metadata matters because you will almost certainly want to filter on it at query time. If you are building a support knowledge base, that metadata might be product version, customer tier, date range, or document type.
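The index-time path can be sketched as follows (pure Python; `embed` is a stand-in for whatever embedding model you call, and the metadata fields are hypothetical examples):

```python
def chunk_tokens(tokens, size=512, overlap=64):
    # Fixed-size chunks with overlap so content spanning a boundary appears
    # intact in at least one chunk. Real systems often prefer sentence- or
    # section-aware splitting over fixed windows.
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

def index_document(doc_id, tokens, embed, store, meta):
    # One record per chunk: vector + original text + filterable metadata.
    for n, chunk in enumerate(chunk_tokens(tokens)):
        store.append({
            "id": f"{doc_id}:{n}",
            "vector": embed(chunk),
            "text": chunk,
            "meta": meta,   # e.g. product version, customer tier, doc type
        })

store = []
fake_embed = lambda chunk: [float(len(chunk))]   # stand-in for a real model
index_document("kb-42", list(range(1000)), fake_embed, store,
               meta={"product_version": "2.1", "doc_type": "guide"})
```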

At query time: a user sends a question. You embed the question using the same model you used to embed the corpus. You query the vector database with that embedding, optionally applying metadata filters, and retrieve the top k most similar chunks. Those chunks get injected into the prompt context, and the LLM generates a response grounded in the retrieved content.
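The query-time path is mostly glue code. A minimal sketch, where the `embed`, `search`, and `llm` callables are injected stand-ins rather than any particular vendor's API:

```python
def answer(question, embed, search, llm, k=5, filters=None):
    # 1. Embed the question with the SAME model used at index time.
    q_vec = embed(question)
    # 2. Retrieve the top-k chunks, optionally filtered on metadata.
    chunks = search(q_vec, k=k, filters=filters)
    # 3. Ground the LLM in the retrieved text.
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm(prompt)

# Stubs to show the shape of the contract between the three components.
reply = answer(
    "how do I cancel my subscription",
    embed=lambda q: [0.1, 0.2],
    search=lambda v, k, filters: [{"text": "See Account > Termination."}],
    llm=lambda prompt: "Go to Account > Termination and confirm.",
)
```

The important constraint is baked into step 1: query vectors and corpus vectors must come from the same model, or the geometry stops meaning anything.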

The failure modes are specific and worth knowing in advance. Chunking too aggressively (very small chunks) means individual chunks often lack enough context to be useful. Chunking too coarsely (very large chunks) means your chunks exceed the embedding model's context window or dilute the signal by mixing multiple topics into a single vector. A 512-token chunk with dense technical content will embed differently than a 512-token chunk that is mostly boilerplate, even if they are the same size.

Embedding model choice has a bigger impact than most tutorials suggest. Models trained on general web text perform worse on domain-specific retrieval tasks. If you are building a retrieval system for medical literature, legal documents, or code, the generic OpenAI embedding model will probably work fine for a prototype and disappoint you in production. Fine-tuned or domain-specific embedding models (BGE, E5, and others from the MTEB leaderboard) are worth evaluating early.

Retrieval recall is the metric that bites you at 2am. You can measure answer quality all day, but if the relevant chunk is not in your retrieved set, the LLM cannot help you no matter how good it is.

Hybrid search, combining vector similarity with BM25 keyword search via reciprocal rank fusion, meaningfully improves recall in most production systems. Pure vector search misses exact matches. Pure keyword search misses semantic matches. Running both and merging the ranked lists costs more but the recall improvement is usually worth it.
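Reciprocal rank fusion itself is a few lines. This sketch merges ranked lists of document ids; the constant 60 is the conventional default from the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, c=60):
    # Each list contributes 1 / (c + rank) per document. Documents ranked
    # highly by BOTH retrievers accumulate the largest fused scores.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]     # semantic matches
keyword_hits = ["c", "b", "d"]    # exact-term (BM25) matches
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Note that RRF only needs ranks, not raw scores, which is exactly why it works for merging BM25 and cosine results whose score scales are incomparable.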

The scaling and operational reality

At small scale, the main cost is the embedding step. Running a corpus of a few million documents through an embedding API is not free. At 0.0001 dollars per thousand tokens and a corpus of 10 million 200-token chunks, you are looking at around 200 dollars for a single full re-index. That is fine. Until you change your chunking strategy and have to do it again.
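The arithmetic behind that number, using only the figures quoted above:

```python
chunks = 10_000_000
tokens_per_chunk = 200
price_per_1k_tokens = 0.0001          # dollars, per the rate quoted above

total_tokens = chunks * tokens_per_chunk              # 2 billion tokens
reindex_cost = total_tokens / 1_000 * price_per_1k_tokens
# reindex_cost comes to 200.0 dollars for one full pass over the corpus,
# paid again in full every time the chunking strategy changes.
```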

At large scale, the index itself becomes the problem. HNSW indices are fast but memory-hungry. A billion 768-dimensional float32 vectors take roughly 3TB of raw storage before indexing overhead. Quantization (PQ or scalar quantization) can bring this down by 4-8x with modest recall degradation. Most teams end up on compressed indices in production without fully internalizing that decision until they need to debug why recall degraded on a specific query type.
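The storage math, with the quantization savings made explicit (the 4x and 8x factors below are illustrative points inside the 4-8x range mentioned above):

```python
vectors = 1_000_000_000
dims = 768
bytes_per_float32 = 4

raw_bytes = vectors * dims * bytes_per_float32  # before any index overhead
raw_tb = raw_bytes / 1e12                       # ~3.07 TB of raw vectors

sq8_tb = raw_tb / 4   # 8-bit scalar quantization: one byte per dimension
pq_tb = raw_tb / 8    # a typical product-quantization configuration
```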

Incremental index updates are where a lot of systems quietly break. Adding new vectors to an HNSW index is fine. Deleting or updating vectors is messier because most ANN implementations do soft deletes and the "deleted" vectors still participate in search until you rebuild. If you have a document store where content changes frequently, your retrieval quality will degrade over time in ways that are not immediately obvious.

Latency budgets in production RAG systems are tight. Embedding the query typically takes 10-50ms depending on whether you are calling a remote API or running a local model. Vector search on a properly indexed corpus should be under 10ms for most configurations. The LLM call is usually 500ms to several seconds depending on output length. The retrieval step is rarely the bottleneck, but it is also rarely optimized until everything else is.

What to actually take away from this

Vector databases are not complicated conceptually. They are a specialized index for a specific type of query, the same way a full-text search engine is a specialized index for keyword queries. The complexity is in the operational details: choosing the right index type for your scale and recall requirements, managing index updates, picking an embedding model that actually fits your domain, and building the evaluation infrastructure to know when your retrieval quality is degrading.

The tooling has matured quickly. Pinecone makes it easy to get started without managing infrastructure. pgvector makes it easy to stay on Postgres if your scale allows. Weaviate and Qdrant give you more control with reasonable operational complexity. Milvus is the choice when you are operating at serious scale and have the team to run it.

The thing that does not get said often enough is that the embedding model and the chunking strategy matter more than the choice of vector database for most applications. Get those right first. Worry about which ANN algorithm you are using later.

And when you are debugging why your retrieval pipeline is returning irrelevant results at 2am, the answer is almost never the vector database. It is usually the chunk boundaries.