
Sizing and Capacity Planning¶

This page helps you estimate the memory requirements for your vector search workload and choose appropriate instance types. Because the HNSW index resides entirely in memory on the vector search nodes, memory is typically the constraining resource.

Memory Estimation Formula¶

The total memory required for a vector index depends on three factors:

  1. Raw vector data — the number of vectors × dimensions × bytes per dimension (determined by the quantization level).

  2. HNSW graph overhead — each node in the graph stores edges to its neighbors. The overhead scales with the m parameter (maximum_node_connections).

  3. Operational headroom — memory for query processing, CDC readers, and system overhead.

The simplified formula:

\[\text{Memory} \approx N \times \left( D \times B + m \times 16 \right) \times 1.2\]

Where:

  • \(N\) = number of vectors

  • \(D\) = number of dimensions

  • \(B\) = bytes per dimension (see table below)

  • \(m\) = maximum_node_connections (default: 16)

  • The \(m \times 16\) term accounts for HNSW graph edges (each edge stores a neighbor ID and metadata)

  • The 1.2 multiplier provides ~20% operational headroom

Bytes per Dimension by Quantization Level¶

| Quantization | Bytes / dim | Notes |
|---|---|---|
| f32 | 4 | Full precision (default). Highest recall, highest memory. |
| f16 | 2 | Half precision. Good balance for most workloads. |
| bf16 | 2 | Brain float. Statistically equivalent to f16 for most models. |
| i8 | 1 | 8-bit integer. ~4× memory savings vs. f32. |
| b1 | 0.125 | Binary (1 bit per dimension). ~32× memory savings vs. f32. Use with rescoring. |
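The formula and the bytes-per-dimension table can be combined into a small estimator. The following Python sketch is illustrative only; the function name, the `BYTES_PER_DIM` mapping, and the constants simply encode the formula above and are not a ScyllaDB API.

```python
# Illustrative sketch of the sizing formula above; names and constants
# are this example's own, not a ScyllaDB API.

BYTES_PER_DIM = {
    "f32": 4.0,
    "f16": 2.0,
    "bf16": 2.0,
    "i8": 1.0,
    "b1": 0.125,  # 1 bit per dimension
}

EDGE_BYTES = 16   # per-edge cost in the m x 16 term of the formula
HEADROOM = 1.2    # ~20% operational headroom


def estimate_index_memory_gb(n_vectors: int, dims: int,
                             quantization: str = "f32",
                             m: int = 16) -> float:
    """Estimated vector index memory in decimal GB (10**9 bytes)."""
    bytes_per_dim = BYTES_PER_DIM[quantization]
    total_bytes = n_vectors * (dims * bytes_per_dim + m * EDGE_BYTES) * HEADROOM
    return total_bytes / 1e9


# 10M vectors, 768 dimensions, full precision:
print(f"{estimate_index_memory_gb(10_000_000, 768, 'f32'):.1f} GB")  # 39.9 GB
```

The worked examples below follow directly from this calculation.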

Worked Examples¶

Example 1: 10M vectors, 768 dimensions, f32¶

A typical OpenAI or sentence-transformer embedding workload:

\[\begin{split}\text{Memory} &\approx 10{,}000{,}000 \times (768 \times 4 + 16 \times 16) \times 1.2 \\ &\approx 10{,}000{,}000 \times (3{,}072 + 256) \times 1.2 \\ &\approx 10{,}000{,}000 \times 3{,}328 \times 1.2 \\ &\approx 39.9 \text{ GB}\end{split}\]

Recommendation: An r7g.8xlarge (or larger) instance on AWS, or n4-highmem-16 (or larger) on GCP.

Example 2: 10M vectors, 768 dimensions, i8 (quantized)¶

The same workload with i8 quantization:

\[\begin{split}\text{Memory} &\approx 10{,}000{,}000 \times (768 \times 1 + 16 \times 16) \times 1.2 \\ &\approx 10{,}000{,}000 \times (768 + 256) \times 1.2 \\ &\approx 10{,}000{,}000 \times 1{,}024 \times 1.2 \\ &\approx 12.3 \text{ GB}\end{split}\]

Savings: ~3.2x less memory than f32 for the same dataset. The savings fall short of the full 4x because quantization compresses only the vector data; the HNSW graph structure (the \(m \times 16\) term) stays the same size regardless of quantization level. Combine with oversampling and rescoring to maintain recall. See Quantization and Rescoring.

Example 3: 100M vectors, 1536 dimensions, f16¶

A large-scale workload with OpenAI text-embedding-3-large vectors:

\[\begin{split}\text{Memory} &\approx 100{,}000{,}000 \times (1{,}536 \times 2 + 16 \times 16) \times 1.2 \\ &\approx 100{,}000{,}000 \times (3{,}072 + 256) \times 1.2 \\ &\approx 100{,}000{,}000 \times 3{,}328 \times 1.2 \\ &\approx 399 \text{ GB}\end{split}\]

Recommendation: This workload requires multiple vector search nodes. ScyllaDB Cloud distributes the index across nodes within each Availability Zone. Contact ScyllaDB support for guidance on large deployments.
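As a rough first cut for such multi-node deployments, you can divide the total estimated index memory by the usable RAM per node. The sketch below is illustrative: the 0.8 usable-RAM fraction and the 256 GB node size are assumptions for this example, not ScyllaDB-published figures, and actual placement also depends on replication across Availability Zones.

```python
import math


def nodes_needed(total_index_gb: float, node_ram_gb: float,
                 usable_fraction: float = 0.8) -> int:
    """First-cut node count: total index memory over usable RAM per node.

    usable_fraction reserves RAM for the OS and other services; 0.8 is
    an assumption for this sketch, not a published ScyllaDB figure.
    """
    return math.ceil(total_index_gb / (node_ram_gb * usable_fraction))


# The ~399 GB index from Example 3 on hypothetical 256 GB RAM nodes:
print(nodes_needed(399, 256))
```

Treat the result as a lower bound for a capacity conversation, not a deployment plan.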

Impact of Higher m Values¶

Increasing m (maximum_node_connections) improves recall but adds graph overhead. The following table shows the impact for 10M vectors of 768 dimensions with f32 quantization. Graph overhead is the raw \(N \times m \times 16\) bytes before headroom; totals include the 1.2 headroom. All values are decimal GB.

| m | Graph overhead | Total memory | Trade-off |
|---|---|---|---|
| 16 (default) | ~2.6 GB | ~39.9 GB | Good default for most workloads. |
| 32 | ~5.1 GB | ~43.0 GB | Higher recall for high-dimensional vectors. |
| 64 | ~10.2 GB | ~49.2 GB | Maximum recall; recommended for D > 1024. |
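This trade-off can be recomputed directly from the sizing formula. The sketch below sweeps m for the same 10M-vector, 768-dimension f32 workload; it simply re-evaluates the formula, so hand-rounded figures elsewhere may differ slightly.

```python
# Sweep m through the sizing formula for a fixed workload.
N, D, B = 10_000_000, 768, 4        # 10M vectors, 768 dims, f32 (4 bytes/dim)

for m in (16, 32, 64):
    graph_gb = N * m * 16 / 1e9                  # raw edge storage, no headroom
    total_gb = N * (D * B + m * 16) * 1.2 / 1e9  # full formula incl. headroom
    print(f"m={m}: graph ~{graph_gb:.1f} GB, total ~{total_gb:.1f} GB")
```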

Quantization Impact Summary¶

For 10M vectors of 768 dimensions with m=16. The vector data and graph overhead columns are raw sizes; totals include the 1.2 headroom. All values are decimal GB.

| Quantization | Vector data | Graph overhead | Total | Recall impact |
|---|---|---|---|---|
| f32 | 30.7 GB | 2.6 GB | ~39.9 GB | Baseline (highest recall) |
| f16 | 15.4 GB | 2.6 GB | ~21.5 GB | Negligible recall loss |
| i8 | 7.7 GB | 2.6 GB | ~12.3 GB | Minor recall loss; use oversampling |
| b1 | 1.0 GB | 2.6 GB | ~4.2 GB | Significant recall loss; use rescoring |

Notice that the graph overhead (~2.6 GB) is constant across all quantization levels; only the vector data column shrinks. This is why the actual memory savings from quantization are always less than the raw compression ratio: for example, i8 is 4x smaller per dimension than f32, but total memory drops only ~3.2x.
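The effective savings ratios can be checked against the formula. The snippet below compares each quantization level's total against the f32 baseline for the same 10M-vector, 768-dimension, m=16 workload; the names are this example's own.

```python
# Effective (not raw) memory savings per quantization level.
BYTES_PER_DIM = {"f32": 4.0, "f16": 2.0, "i8": 1.0, "b1": 0.125}
N, D, M = 10_000_000, 768, 16


def total_gb(bytes_per_dim: float) -> float:
    # Full formula: vector data + graph edges, times 1.2 headroom.
    return N * (D * bytes_per_dim + M * 16) * 1.2 / 1e9


baseline = total_gb(BYTES_PER_DIM["f32"])
for name, b in BYTES_PER_DIM.items():
    ratio = baseline / total_gb(b)
    print(f"{name}: ~{total_gb(b):.1f} GB total, {ratio:.2f}x vs. f32")
```

Note how b1's raw 32x per-dimension compression yields only a ~9.5x total reduction here, because the fixed graph overhead dominates once the vector data is that small.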

Sizing Guidelines¶

  • Start with the memory formula to estimate your baseline requirement. Add 20-30% headroom for operational overhead.

  • Choose quantization early. Quantization has the largest impact on memory. For most workloads, f16 or i8 with oversampling provides a good balance of memory savings and recall.

  • Match instance type to workload. Choose an instance with enough RAM for your estimated memory requirement. See Supported Instance Types for available options.

  • Plan for growth. If your dataset is growing, size for expected data volume 6-12 months out.

  • Test with your data. Memory formulas are estimates. Load a representative sample and measure actual memory usage before committing to production instance types.

For detailed specifications, see the sizing algorithm documentation and the 1B vector benchmark.

What’s Next¶

  • Vector Search Deployments — create, enable and manage vector search clusters.

  • Quantization and Rescoring — reduce memory usage while maintaining recall.

  • Reference — instance types, CQL syntax, and API endpoints.

Last updated on 24 Mar 2026.