Quantization and Rescoring

Quantization and rescoring help you balance memory efficiency and search accuracy for Vector Search indexes in ScyllaDB. This page explains how to configure these features when creating a vector index.

Overview

By default, ScyllaDB stores vectors in the in-memory index using full 32-bit floating-point precision (f32). Quantization reduces the memory footprint of the index by storing vectors at lower precision. This compression trades some search accuracy for significant memory savings.

To mitigate the accuracy loss from quantization, ScyllaDB provides two complementary mechanisms:

Oversampling — retrieves a larger candidate set during the initial index search, increasing the chance that the true nearest neighbors are included.
Rescoring — re-calculates exact distances for candidates using the original full-precision vectors stored in ScyllaDB, then re-ranks results before returning them to the client.

Note

Quantization applies only to the in-memory vector index. The source vectors stored in your ScyllaDB table always remain in their original float format. Your data is never degraded.

Quantization Levels

The quantization index option controls the numeric precision used in the vector index. The following levels are supported:

Value	Description	Memory per dimension
`f32` (default)	32-bit single-precision IEEE 754 floating-point	4 bytes
`f16`	16-bit standard half-precision floating-point (IEEE 754)	2 bytes
`bf16`	16-bit “Brain” floating-point (optimized for ML workloads)	2 bytes
`i8`	8-bit signed integer	1 byte
`b1`	1-bit binary value (packed 8 per byte)	0.125 bytes

Lower-precision quantization levels use less memory but produce less accurate distance calculations in the index. Use oversampling and rescoring to recover accuracy.
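The per-dimension figures in the table translate directly into raw vector storage. The following Python sketch estimates that storage for a collection of vectors. The function name `raw_vector_bytes` and the example collection (1,000,000 vectors of 768 dimensions) are illustrative assumptions, and rounding each vector up to a whole byte is also an assumption. The result covers vector data only, not the HNSW graph.

```python
import math

# Bytes per dimension for each quantization level, taken from the table above.
BYTES_PER_DIMENSION = {"f32": 4.0, "f16": 2.0, "bf16": 2.0, "i8": 1.0, "b1": 0.125}


def raw_vector_bytes(num_vectors: int, dimensions: int, quantization: str = "f32") -> int:
    """Estimate raw vector storage in the index, excluding graph overhead."""
    if quantization not in BYTES_PER_DIMENSION:
        raise ValueError(f"unknown quantization level: {quantization!r}")
    # Assumption: each vector is rounded up to a whole number of bytes.
    per_vector = math.ceil(dimensions * BYTES_PER_DIMENSION[quantization])
    return num_vectors * per_vector


if __name__ == "__main__":
    # Hypothetical collection: 1,000,000 vectors with 768 dimensions each.
    for level in ("f32", "f16", "bf16", "i8", "b1"):
        size_mb = raw_vector_bytes(1_000_000, 768, level) / 1_000_000
        print(f"{level:>4}: {size_mb:,.0f} MB of raw vector data")
```

Under these assumptions, the same collection needs 3,072 MB of vector data at `f32`, 768 MB at `i8`, and 96 MB at `b1`.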

Important

Quantization compresses only the vector data in the index. The HNSW graph structure (neighbor lists and edge metadata) is not compressed, and its size stays constant regardless of quantization level. Because the graph overhead is a significant portion of total index memory, the total memory savings from quantization are smaller than the raw compression ratio suggests. For example, going from f32 to i8 gives a 4x reduction in vector storage, but total index memory typically drops by only about 3x. See Sizing and Capacity Planning for worked examples.
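The gap between the raw compression ratio and the total saving can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name `total_index_ratio` is hypothetical, and the graph overhead of 384 bytes per vector is a made-up figure chosen to land on the roughly 3x saving mentioned above. It is not a ScyllaDB measurement.

```python
def total_index_ratio(
    dimensions: int,
    graph_bytes_per_vector: float,
    from_bytes_per_dim: float = 4.0,  # f32
    to_bytes_per_dim: float = 1.0,    # i8
) -> float:
    """Return how many times smaller the total index becomes after quantization.

    The graph overhead is added unchanged to both sides because quantization
    does not compress it.
    """
    before = dimensions * from_bytes_per_dim + graph_bytes_per_vector
    after = dimensions * to_bytes_per_dim + graph_bytes_per_vector
    return before / after


if __name__ == "__main__":
    # Hypothetical inputs: 768-dimensional vectors and an assumed graph
    # overhead of 384 bytes per vector.
    print(total_index_ratio(768, graph_bytes_per_vector=384))  # 3.0
    # With no graph overhead, the ratio equals the raw 4x compression.
    print(total_index_ratio(768, graph_bytes_per_vector=0))    # 4.0
```

The larger the graph overhead is relative to the vector data, the further the total saving falls below the raw ratio, so low-dimensional vectors benefit less from quantization than high-dimensional ones.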

Oversampling

When a client requests the top K vectors, the search algorithm normally retrieves exactly K candidates from the index. With oversampling, the algorithm retrieves a larger candidate set:

Candidate pool size = ceil(K × oversampling)

The candidates are then sorted by distance and only the top K results are returned. A larger candidate pool increases the probability that the true nearest neighbors survive this final selection.

Range: 1.0 to 100.0
Default: 1.0 (no oversampling)
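As a quick check of the formula, the following Python sketch computes the candidate pool size for a given K and oversampling value. The function name `candidate_pool_size` and the client-side range check are assumptions made for illustration; ScyllaDB performs this calculation internally.

```python
import math


def candidate_pool_size(k: int, oversampling: float = 1.0) -> int:
    """Return ceil(K x oversampling), the number of candidates retrieved."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 1.0 <= oversampling <= 100.0:
        raise ValueError("oversampling must be between 1.0 and 100.0")
    return math.ceil(k * oversampling)


if __name__ == "__main__":
    print(candidate_pool_size(10))       # 10 (default, no oversampling)
    print(candidate_pool_size(10, 5.0))  # 50
    print(candidate_pool_size(7, 1.5))   # 11 (10.5 rounded up)
```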

Oversampling offers two advantages over simply increasing the query LIMIT:

Performance — candidate filtering happens internally in ScyllaDB, avoiding the overhead of fetching and transporting extra rows to the application.
Scale — with the maximum oversampling of 100.0 and a query LIMIT of 1000, the index considers up to 100.0 × 1000 = 100,000 candidates internally.

Note

Even without quantization, the ANN algorithm is approximate. Setting oversampling > 1.0 can improve recall on high-dimensionality datasets even when using the default f32 precision.

Rescoring

Rescoring is a second-pass operation that re-calculates distances using the full-precision vectors stored in the ScyllaDB table, then re-ranks candidates before returning results.

`true` — ScyllaDB fetches original vectors and re-ranks candidates by exact distance.
`false` (default) — results are returned directly based on the approximate distances in the quantized index.
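To see how oversampling and rescoring interact, consider the toy Python simulation below. It is a sketch under stated assumptions, not ScyllaDB's implementation: a brute-force scan stands in for the HNSW index, sign-bit quantization with Hamming distance stands in for a 1-bit index, and all names and sample vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Exact cosine distance between two full-precision vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sign_bits(vector):
    """Assumed 1-bit quantization: keep only the sign of each component."""
    return [1 if x >= 0 else 0 for x in vector]


def hamming_distance(a, b):
    """Approximate distance between two quantized vectors."""
    return sum(x != y for x, y in zip(a, b))


def search(query, vectors, k, oversampling=1.0, rescoring=False):
    pool_size = math.ceil(k * oversampling)
    query_bits = sign_bits(query)
    # First pass: rank by approximate distance and keep the candidate pool.
    candidates = sorted(
        vectors,
        key=lambda key: hamming_distance(query_bits, sign_bits(vectors[key])),
    )[:pool_size]
    if rescoring:
        # Second pass: re-rank the pool by exact distance on the original vectors.
        candidates.sort(key=lambda key: cosine_distance(query, vectors[key]))
    return candidates[:k]


# Hypothetical data: "c" is the true nearest neighbor of QUERY.
VECTORS = {"a": [0.9, 0.1], "b": [0.6, 0.6], "c": [0.1, 0.9], "d": [-0.5, 0.5]}
QUERY = [0.2, 0.8]

if __name__ == "__main__":
    print(search(QUERY, VECTORS, k=1))                                    # ['a']
    print(search(QUERY, VECTORS, k=1, oversampling=3.0))                  # ['a']
    print(search(QUERY, VECTORS, k=1, oversampling=3.0, rescoring=True))  # ['c']
```

In this example the quantized index cannot tell "a", "b", and "c" apart, so oversampling alone does not change the answer. The true nearest neighbor is returned only when the larger candidate pool is combined with rescoring.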

Caution

Rescoring can reduce search throughput by roughly 4 times because ScyllaDB must fetch the original full-precision vectors and recalculate exact distances for every candidate. Enable rescoring only when high recall is critical, and benchmark your workload to confirm acceptable performance.

Note

Rescoring is only beneficial when quantization is enabled. For unquantized indexes (default f32), the index already contains full-precision data, making the rescoring pass redundant.

CQL Syntax

Quantization, oversampling, and rescoring are configured as options when creating a vector index with CREATE CUSTOM INDEX:

CREATE CUSTOM INDEX ON myapp.comments(comment_vector)
USING 'vector_index'
WITH OPTIONS = {
  'similarity_function': 'COSINE',
  'quantization': 'i8',
  'oversampling': '5.0',
  'rescoring': 'true'
};

Options reference:

Option	Type	Default	Description
`quantization`	string	`'f32'`	Numeric precision for the index. Values: `f32`, `f16`, `bf16`, `i8`, `b1`.
`oversampling`	string (float)	`'1.0'`	Multiplier for the candidate set size. Range: 1.0-100.0.
`rescoring`	string (bool)	`'false'`	Whether to perform a second-pass exact distance calculation using full-precision vectors from storage.
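Because every option value is passed as a string, it can help to validate the values before sending the statement. The Python sketch below builds the `CREATE CUSTOM INDEX` statement shown above from typed arguments. The helper `build_vector_index_cql` is hypothetical and not part of any ScyllaDB driver; it only formats text, and it interpolates identifiers without escaping, so it is not suitable for untrusted input.

```python
ALLOWED_QUANTIZATION = {"f32", "f16", "bf16", "i8", "b1"}


def build_vector_index_cql(
    table: str,
    column: str,
    similarity_function: str,
    quantization: str = "f32",
    oversampling: float = 1.0,
    rescoring: bool = False,
) -> str:
    """Render a CREATE CUSTOM INDEX statement with validated options."""
    if quantization not in ALLOWED_QUANTIZATION:
        raise ValueError(f"unsupported quantization: {quantization!r}")
    if not 1.0 <= oversampling <= 100.0:
        raise ValueError("oversampling must be between 1.0 and 100.0")
    # All option values are rendered as quoted strings, as in the options reference.
    options = {
        "similarity_function": similarity_function,
        "quantization": quantization,
        "oversampling": str(float(oversampling)),
        "rescoring": "true" if rescoring else "false",
    }
    body = ",\n".join(f"  '{name}': '{value}'" for name, value in options.items())
    return (
        f"CREATE CUSTOM INDEX ON {table}({column})\n"
        "USING 'vector_index'\n"
        f"WITH OPTIONS = {{\n{body}\n}};"
    )


if __name__ == "__main__":
    print(build_vector_index_cql(
        "myapp.comments", "comment_vector", "COSINE",
        quantization="i8", oversampling=5.0, rescoring=True,
    ))
```

Called with these arguments, the helper reproduces the example statement from this section.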

Warning

The ALTER INDEX statement is not supported for vector indexes. To change quantization settings, you must drop the existing index and recreate it.

When to Use Quantization

Scenario	Recommendation
Small dataset, high recall required	Use default `f32` — no quantization needed.
Large dataset, memory-constrained	Use `i8` or `f16` with `oversampling` of 3.0-10.0. Add `rescoring: true` only if very high recall is required.
Very large dataset, approximate results acceptable	Use `b1` for maximum memory savings. Enable oversampling to compensate for accuracy loss.
High-dimensionality vectors (>= 768)	Consider `oversampling` > 1.0 even with `f32` to improve recall.

What’s Next

Working with Vector Search — vector data type, index creation, and ANN queries.
Filtering Vector Search Results — combine similarity search with metadata filtering.
Vector Search Concepts — architecture and data flow.
