Vector Search Glossary

This glossary defines key terms related to Vector Search in ScyllaDB. It covers core concepts essential to understanding how vectors are stored, indexed, and queried.

ANN (Approximate Nearest Neighbor) Search — A search technique that efficiently finds the data points in a large dataset that are most similar to a given query vector. Instead of guaranteeing the exact nearest neighbors, ANN trades a small amount of accuracy for speed by returning results that are close enough. This makes it well suited to large datasets and high-dimensional vector spaces in applications such as semantic search, recommendations, and generative AI.
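For contrast, the sketch below shows the exact, brute-force search that ANN is designed to avoid: it compares the query against every stored vector, so its cost grows linearly with the dataset. The function name `exact_nearest` and the sample data are hypothetical and are not part of ScyllaDB.

```python
import math

def exact_nearest(query, dataset, k):
    """Exact k-nearest-neighbor search: score every vector, then sort.

    An ANN index returns a similar answer without visiting every vector,
    at the price of occasionally missing a true neighbor.
    """
    ranked = sorted(dataset.items(), key=lambda item: math.dist(query, item[1]))
    return [key for key, _ in ranked[:k]]

# Hypothetical 2-dimensional dataset keyed by row id.
dataset = {
    "a": [0.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.9, 1.2],
    "d": [5.0, 5.0],
}

nearest = exact_nearest([1.0, 1.0], dataset, k=2)
print(nearest)
```

With only four vectors the full scan is trivial. The approximate approach matters once the dataset holds millions of high-dimensional embeddings.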

CDC Reader — A Change Data Capture (CDC) consumer that propagates base table mutations to the vector index. ScyllaDB uses two readers: a fine-grained reader with sub-second intervals for low-latency updates, and a wide-framed reader with a 30-second safety interval that ensures consistency.

Embedding — A vector generated by a machine learning model to represent raw data in a numerical form. ScyllaDB can store and query embeddings generated by an external tool and inserted into the database.

Filtering — The ability to combine an ANN similarity search with predicate constraints on primary key columns. ScyllaDB supports filtering via global vector indexes (with ALLOW FILTERING) and local vector indexes (with partition key restrictions). See Filtering.

Global Vector Index — A vector index that spans all partitions and enables cluster-wide similarity search. Global indexes support filtering on the base table’s primary key columns with ALLOW FILTERING. See Global Vector Indexes.

HNSW (Hierarchical Navigable Small World) — A graph-based algorithm for Approximate Nearest Neighbor search. It organizes vectors into a multi-layered graph, allowing efficient navigation from coarse to fine resolution. ScyllaDB’s vector index uses the HNSW algorithm implemented by the USearch library.
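The navigation idea can be illustrated with a greedy walk over a single-layer neighbor graph: start at an entry point and keep moving to whichever neighbor is closest to the query until no neighbor improves. This is a simplified sketch, not the USearch implementation. A real HNSW index repeats a walk like this on each layer, from the sparsest to the densest, and keeps a list of candidates rather than a single node. The graph and vectors below are hypothetical.

```python
import math

# Hypothetical vectors laid out on a line, keyed by node id.
vectors = {i: [float(i), 0.0] for i in range(6)}

# Hypothetical neighbor lists: each node links to nodes one and two steps away.
neighbors = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3, 4],
    3: [1, 2, 4, 5],
    4: [2, 3, 5],
    5: [3, 4],
}

def greedy_search(query, entry):
    """Walk the graph, always stepping to the neighbor closest to the query."""
    current = entry
    path = [current]
    while True:
        best = min(neighbors[current], key=lambda n: math.dist(query, vectors[n]))
        if math.dist(query, vectors[best]) >= math.dist(query, vectors[current]):
            return current, path  # no neighbor is closer: local minimum reached
        current = best
        path.append(current)

result, path = greedy_search([4.2, 0.0], entry=0)
print(result, path)
```

The walk visits only a few nodes instead of all six, which is where the speedup comes from. It can also stop at a local minimum, which is why the result is approximate.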

Local Vector Index — A per-partition vector index that is co-located with the data it indexes. Local indexes require the full partition key in the WHERE clause and support efficient filtered similarity search within a single partition. See Local Vector Indexes.
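The difference between the two index scopes can be pictured with a small in-memory model. This is a conceptual sketch only: the row layout, function names, and data are hypothetical, and it does not reflect how ScyllaDB stores or evaluates its indexes.

```python
import math

# Hypothetical rows: (partition_key, clustering_key, embedding).
rows = [
    ("user1", 1, [0.0, 0.0]),
    ("user1", 2, [1.0, 1.0]),
    ("user2", 1, [0.1, 0.1]),
    ("user2", 2, [4.0, 4.0]),
]

def global_search(query, k, predicate=None):
    """Search every partition; optionally keep only rows matching a key predicate."""
    candidates = [r for r in rows if predicate is None or predicate(r)]
    return sorted(candidates, key=lambda r: math.dist(query, r[2]))[:k]

def build_local_indexes(all_rows):
    """Group rows by partition key, one small 'index' per partition."""
    local = {}
    for row in all_rows:
        local.setdefault(row[0], []).append(row)
    return local

def local_search(local, partition_key, query, k):
    """Search a single partition; the partition key is mandatory."""
    return sorted(local[partition_key], key=lambda r: math.dist(query, r[2]))[:k]

local = build_local_indexes(rows)
query = [0.0, 0.0]

cluster_wide = global_search(query, k=2)
filtered = global_search(query, k=1, predicate=lambda r: r[1] == 2)
one_partition = local_search(local, "user2", query, k=1)
```

In this model the local search touches only the rows of `user2`, while the global search considers every row before the predicate narrows the result.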

Oversampling — An index option (range 1.0-100.0) that controls how many candidate vectors the index evaluates internally before returning the top-k results. Higher oversampling improves recall at the cost of higher latency. See Quantization and Rescoring.

Quantization — A technique that reduces index memory usage by storing vectors in lower-precision formats (e.g., f16, i8, b1) instead of the default f32. Lower precision reduces memory but may decrease accuracy; combine with rescoring to recover precision. See Quantization and Rescoring.
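The memory and accuracy trade-off can be seen in a minimal scalar quantization sketch. It assumes vector components lie in [-1.0, 1.0] and uses a simple linear mapping. The exact encodings used by ScyllaDB and USearch may differ, and the function names are hypothetical.

```python
import struct

def quantize_i8(vector):
    """Map floats in [-1.0, 1.0] to signed 8-bit integers in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in vector]

def dequantize_i8(codes):
    """Approximate the original floats from their 8-bit codes."""
    return [c / 127 for c in codes]

def quantize_b1(vector):
    """Keep only the sign of each component: one bit per dimension."""
    return [1 if x > 0 else 0 for x in vector]

vector = [0.12, -0.53, 0.98, -0.07]

codes = quantize_i8(vector)
restored = dequantize_i8(codes)
bits = quantize_b1(vector)

# Storage per vector: 4 bytes per dimension as f32, 1 byte per dimension as i8.
f32_bytes = len(struct.pack("<4f", *vector))
i8_bytes = len(struct.pack("<4b", *codes))

# Worst-case rounding error introduced by the 8-bit representation.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, bits, f32_bytes, i8_bytes, max_error)
```

The 8-bit form uses a quarter of the memory here and introduces a small rounding error per component. The 1-bit form is far smaller but keeps only the sign, which is why lower-precision formats benefit most from rescoring.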

Rescoring — A post-processing step where the index re-ranks candidate results using the original full-precision vectors from disk, improving accuracy after quantized search. Enabled by setting 'rescoring': 'true' in the index options. See Quantization and Rescoring.
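The interplay between oversampling and rescoring can be sketched as a two-stage search: a coarse pass over low-precision copies selects candidates, and an optional second pass re-ranks them with the full-precision vectors. This is an illustration under stated assumptions. The candidate count is assumed to be `ceil(k * oversampling)`, rounding to whole numbers is a deliberately crude stand-in for a low-precision format, and the function names and data are hypothetical.

```python
import math

def coarse(vector):
    """Crude stand-in for quantization: round each component to an integer."""
    return [float(round(x)) for x in vector]

def search(query, vectors, k, oversampling=1.0, rescoring=True):
    # Stage 1: rank by distance between the low-precision copies.
    n_candidates = min(len(vectors), math.ceil(k * oversampling))  # assumed formula
    q = coarse(query)
    candidates = sorted(vectors, key=lambda key: math.dist(q, coarse(vectors[key])))
    candidates = candidates[:n_candidates]
    # Stage 2 (optional): re-rank the candidates using full-precision vectors.
    if rescoring:
        candidates = sorted(candidates, key=lambda key: math.dist(query, vectors[key]))
    return candidates[:k]

# Hypothetical data: "a" is the true nearest neighbor of the query,
# but the coarse representation makes "b" look closer.
vectors = {
    "a": [0.55, 0.55],
    "b": [0.1, 0.1],
    "c": [2.0, 2.0],
    "d": [3.0, 3.0],
}
query = [0.4, 0.4]

no_oversampling = search(query, vectors, k=1, oversampling=1.0)
with_oversampling = search(query, vectors, k=1, oversampling=2.0)
without_rescoring = search(query, vectors, k=1, oversampling=2.0, rescoring=False)
```

With a single candidate the coarse pass has already discarded the true neighbor, so rescoring cannot recover it. With two candidates the full-precision re-ranking restores the correct order, at the cost of evaluating more vectors.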

Semantic Search — A type of similarity search that compares the meaning of a query and data items using vector embeddings. It enables context-aware retrieval by focusing on semantic relevance rather than exact terms.

Similarity Function (Distance Metric) — A mathematical function that measures how close two vectors are. In ScyllaDB, three similarity functions are supported: COSINE (default), DOT_PRODUCT, and EUCLIDEAN. See Choosing a Similarity Function.
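The three functions can be written out using their textbook definitions, as in the sketch below. It shows only the underlying math. How ScyllaDB scales or normalizes the scores it returns is not covered here, and the function names are hypothetical.

```python
import math

def dot_product(a, b):
    """Sum of pairwise products; sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the normalized vectors; depends on direction only."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b), cosine_similarity(a, c))
print(dot_product(a, b), dot_product(a, c))
print(euclidean_distance(a, b), euclidean_distance(a, c))
```

Cosine similarity treats `a` and `b` as identical because they point the same way, while the dot product and Euclidean distance also react to the difference in length. That difference is the main consideration when matching a similarity function to how the embeddings were produced.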

Similarity Search — A technique for finding items in a dataset that are most similar to a query vector, using a distance or similarity measure. It is commonly used in high-dimensional vector spaces to retrieve approximate matches efficiently.

USearch Index — A high-performance, in-memory vector index library developed by Unum, designed for fast approximate nearest-neighbor (ANN) search. ScyllaDB uses USearch as the underlying engine for its Vector Search Index to deliver low-latency similarity queries and efficient memory utilization.

Vector — An ordered list of numbers (floats) representing data, such as text, images, or audio, in a way that captures its meaning or features.

Vector Search Index — In ScyllaDB, a USearch index built on a vector column that accelerates similarity queries. Unlike traditional indexes (for exact matches or ranges), a Vector Search index is optimized for approximate nearest neighbor (ANN) lookups over high-dimensional data.

Vector Type — A native ScyllaDB column type used to store fixed-length numeric vectors directly in a table for similarity search. See Data Types - Vectors in the ScyllaDB documentation.
