AI Infrastructure
How We Cut Our Embedding Costs in Half and Shrank Storage by 50%
Sathyanarayanan Saravanamuthu
Date: 2026-03-13
Read Time: 7 min
Overview
ConductorOne uses vector embeddings extensively—matching users to applications, powering semantic search in our copilot, and driving identity governance automations. For example, when an automation needs to route a request to the CEO, embeddings let it resolve “CEO” to “Alex Bovee”—matching a role description to the right person even though the strings share no characters.
As our customer base grew, so did the cost and operational pain of generating these embeddings in real time. We recently refactored our entire embedding pipeline, and the results were dramatic: ~50% reduction in Bedrock inference costs, 50% reduction in vector storage, and embeddings that now refresh in hours instead of days—keeping search and automations current as your organization changes.
This post covers the three key changes we made and why each one mattered.
The Problem: Real-Time Inference at Scale Doesn’t Scale
ConductorOne’s platform connects to hundreds of enterprise applications through connectors that continuously sync identity data—users, accounts, roles, and entitlements—into our datastore. Once ingested, this data needs vector embeddings to power search and AI features. Our original architecture used online (real-time) inference for every embedding. Each time a user, application, or entitlement needed a vector, we called Amazon Bedrock’s InvokeModel API synchronously. This worked fine at small scale, but created compounding problems as we grew:
Cost: Real-time Bedrock inference is priced per input token with no volume discount for batch workloads.
Rate limits: Every embedding request consumed the same API quota as interactive features like our Automation Architect. During large sync operations, we’d exhaust our rate limits and throttle user-facing AI features.
Fragility: We had overloaded a single background job to handle multiple, unrelated embedding workloads—semantic text embeddings (used for search and AI features) and username similarity embeddings (used for matching accounts across applications). A failure in one would block the other.
We needed to rethink the pipeline from the ground up.
Change 1: Batch Inference Instead of Online Calls
The highest-impact change was switching from synchronous InvokeModel calls to Bedrock’s Batch Inference API for workloads above 1,000 items.
The flow now works like this:
Collect all items needing embeddings from the database (up to 50,000 per batch)
Generate a JSONL manifest and upload it to S3
Submit an asynchronous batch inference job to Bedrock
Poll for completion with exponential backoff (30s → 60s → 120s → 300s)
Download results from S3 and persist the embeddings
We orchestrate this pipeline using Temporal Workflows and Activities, which give us durable execution, automatic retries, and visibility into each step of the process.
For workloads under 1,000 items, we fall back to direct InvokeModel calls to avoid the overhead of S3 round-trips and batch scheduling. This hybrid approach gives us the best of both worlds.
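The hybrid dispatch and manifest-generation steps can be sketched as follows. The thresholds come from the text; the `recordId`/`modelInput` record shape is the JSONL format Bedrock batch jobs consume from S3, and the function names are hypothetical:

```python
import json

BATCH_THRESHOLD = 1_000   # below this, direct InvokeModel avoids S3 round-trips
MAX_BATCH_SIZE = 50_000   # items per Bedrock batch inference job

def build_jsonl_manifest(items):
    """Serialize (id, text) pairs into the JSONL manifest uploaded to S3.

    Each line pairs a recordId (so results can be joined back to our rows)
    with the model input payload for amazon.titan-embed-text-v2:0.
    """
    lines = []
    for item_id, text in items:
        record = {
            "recordId": item_id,
            "modelInput": {"inputText": text, "dimensions": 1024},
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

def choose_path(item_count):
    """Route small workloads to online inference, large ones to batch."""
    return "online" if item_count < BATCH_THRESHOLD else "batch"
```

The manifest would then be uploaded to S3 and submitted as an asynchronous job; the poll-and-download steps are omitted here.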
Why this matters: Bedrock Batch Inference is priced at half the cost of real-time inference. For the same model (amazon.titan-embed-text-v2:0), we immediately cut our per-token embedding cost by 50% on the vast majority of our workload—without changing the model or sacrificing quality.
To put the scale in perspective: at scale, a single tenant can have millions of items that need embeddings. With online inference, each item requires a separate InvokeModel call. At ~100 calls per minute (a realistic sustained rate accounting for API latency and rate limits), millions of items would take weeks of continuous inference for a single tenant.
With batch inference processing 50,000 items per batch and each batch completing in roughly 16 minutes, the same workload completes in hours instead of weeks—a 30x improvement in wall-clock time, before even considering the cost savings.
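The wall-clock arithmetic, using a hypothetical 2-million-item tenant and the sustained rates stated above:

```python
# Hypothetical workload size used for illustration.
items = 2_000_000

# Online path: one InvokeModel call per item at ~100 calls/minute.
online_minutes = items / 100                   # 20,000 minutes
online_days = online_minutes / (60 * 24)       # roughly two weeks

# Batch path: 50,000 items per job, ~16 minutes per job, run sequentially.
batch_jobs = items / 50_000                    # 40 jobs
batch_hours = batch_jobs * 16 / 60             # under half a day

speedup = (online_minutes / 60) / batch_hours  # ~31x
```

Concurrent batch jobs would shrink the batch-path time further; the ~30x figure assumes sequential batches.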
Equally important, batch jobs run asynchronously inside AWS infrastructure. They don’t consume our real-time API quota, which means interactive features like copilot and automation architect are no longer competing with bulk sync operations for rate-limit headroom.
Change 2: Half-Precision Vector Storage
Our original embeddings were stored as 1024-dimensional float32 vectors. Each vector consumed 4,096 bytes. We switched to float16 (half-precision) storage using PostgreSQL’s halfvec type.
Before committing to half-precision storage, we evaluated whether the precision loss affected retrieval quality. For our use cases—identity matching, semantic search over user profiles, and entitlement similarity—float16 quantization at 1024 dimensions had negligible impact on recall and ranking. The embedding model (amazon.titan-embed-text-v2:0) produces values well within float16’s representable range, so quantization noise is minimal.
We also tested lower-dimensional embeddings (512 dimensions), but saw measurable degradation in retrieval quality—particularly for nuanced queries involving job titles and organizational relationships. The 1024-dimension, half-precision configuration gave us the best balance of storage savings and search quality.
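A minimal sketch of the schema shape, assuming pgvector 0.7+ for the `halfvec` type (table and column names are illustrative, not our actual schema):

```sql
-- Hypothetical table; the column type is pgvector's half-precision halfvec.
CREATE TABLE user_embedding_1024_halfvec (
    user_id    uuid PRIMARY KEY,
    embedding  halfvec(1024) NOT NULL
);

-- HNSW index over the half-precision column (cosine distance).
CREATE INDEX ON user_embedding_1024_halfvec
    USING hnsw (embedding halfvec_cosine_ops);
```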
The math is straightforward:
| Format  | Dimensions | Bytes per vector | Relative size |
|---------|------------|------------------|---------------|
| float32 | 1024       | 4,096            | 1.0x          |
| float16 | 1024       | 2,048            | 0.5x          |
50% storage reduction per vector, across every embedding table.
One important caveat: the savings apply to row/heap storage, not to index overhead. PostgreSQL’s pgvector HNSW indexes store graph connectivity and distance metadata internally at full precision regardless of the underlying column type. So while each row’s vector payload is halved, the HNSW index size remains roughly the same. For our workload, row storage dominates total disk usage, so the net savings are still substantial—but if your bottleneck is index size rather than heap size, halfvec alone won’t help.
To manage the rollout safely, we created dedicated *Embedding1024Halfvec tables rather than adding halfvec columns to existing tables. Our embedding tables are partitioned per tenant—each tenant gets its own partition, automatically provisioned during tenant creation—so HNSW indexes are always tenant-local. This means index builds and searches never cross tenant boundaries, keeping both performance and isolation guarantees intact. We populated the new halfvec format in parallel behind a feature flag, validated results, and then switched reads over with a second flag—zero downtime and full rollback capability.
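The flag-gated rollout can be sketched with an in-memory stand-in for the two table formats (flag names, store layout, and values are illustrative):

```python
def write_embedding(store, flags, key, vec_f32, vec_f16):
    """Dual-write: always write the legacy float32 row; also populate the
    new halfvec table while the backfill flag is on."""
    store["float32"][key] = vec_f32
    if flags.get("write_halfvec"):
        store["halfvec"][key] = vec_f16

def read_embedding(store, flags, key):
    """Switch reads to halfvec with a second flag; fall back to float32 for
    rows not yet backfilled, allowing zero downtime and instant rollback."""
    if flags.get("read_halfvec") and key in store["halfvec"]:
        return store["halfvec"][key]
    return store["float32"][key]
```

Flipping `read_halfvec` off restores the old behavior immediately, which is what makes the migration safely reversible.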
Change 3: Split Architecture with Independent State Machines
The original system used a single monolithic state machine to orchestrate all embedding work. We split it into two independent, self-scheduling state machines:
Semantic Embedding Sync FSM — handles text embeddings for search and AI features via Bedrock
Siamese Embedding Sync FSM — handles username-based similarity embeddings via our internal inference service
Each FSM manages its own lifecycle:
Success: schedules next run with a jittered 2-hour delay
Partial progress (batch was full): retries in ~5 minutes for faster catch-up
Error: stepped backoff (5 min → 5 min → 5 min → 2 hours)
Rate limit: 60–90 second jittered delay
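The scheduling rules above can be sketched as a delay-selection function; the exact jitter amount on the success path is an assumption, the other values come from the text:

```python
import random

def next_delay_seconds(outcome, consecutive_errors=0):
    """Pick the next-run delay for an embedding-sync FSM based on the
    outcome of its last run."""
    if outcome == "success":
        # Jittered ~2-hour delay spreads tenants' runs apart.
        return 2 * 3600 + random.uniform(0, 600)
    if outcome == "partial":
        # Batch was full: re-run in ~5 minutes for faster catch-up.
        return 5 * 60
    if outcome == "error":
        # Three quick 5-minute retries, then back off to 2 hours.
        return 5 * 60 if consecutive_errors < 3 else 2 * 3600
    if outcome == "rate_limit":
        # Short jittered delay before retrying against the quota.
        return random.uniform(60, 90)
    raise ValueError(f"unknown outcome: {outcome}")
```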
This separation means a failure in username similarity processing can’t block semantic embeddings from completing, and vice versa. Each FSM tracks its own state, progress, retries, and errors independently—making debugging and monitoring significantly clearer.
We also added a distributed concurrency limiter that caps concurrent Bedrock batch jobs at 20 across all tenants. When at capacity, the FSM polls every 60 seconds until a slot opens. This prevents any single large tenant from monopolizing Bedrock resources during a sync storm.
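A minimal in-process sketch of the limiter's semantics; the production version is distributed (shared state across services), which a local semaphore is not:

```python
import threading

class BatchJobLimiter:
    """Caps concurrent Bedrock batch jobs fleet-wide at a fixed maximum."""

    def __init__(self, max_jobs=20):
        self._sem = threading.BoundedSemaphore(max_jobs)

    def try_acquire(self):
        # Non-blocking: when at capacity the FSM re-polls every 60 seconds
        # rather than blocking here.
        return self._sem.acquire(blocking=False)

    def release(self):
        # Called when a batch job completes, freeing a slot.
        self._sem.release()
```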
Rethinking the Text Representation
As part of this refactor, we also revisited how we generate the input text for user profile embeddings. The original format stored each user as multi-line key-value pairs in their original order from the data source. But which format actually produces better search results? Rather than guessing, we ran a systematic evaluation.
AI-Judge Blind Evaluation
We tested 28 embedding configurations—7 text formats x 2 embedding providers (Amazon Titan, Cohere) x 2 dimensions (512, 1024)—against 25 natural language queries on production data. To remove bias, all configurations were randomized into anonymous labels (“Embedding 1–28”) and evaluated blindly on text relevance alone.
The 25 queries covered the full range of real-world search patterns: name lookups (“find sathya”), relationship queries (“who reports to mallory”), status filters (“suspended accounts”), role searches (“product managers”), domain-scoped queries (“sathya in microsoft domain”), and ID lookups (“employee id 1234”).
The winning format reorders fields to place categorical and shared attributes first, while preserving the newline-delimited structure that Titan prefers:
Directory Status
Employment Status
Department
Job Title
Status
Manager
Domain
ID, Name, Email, custom profile fields
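A sketch of how a profile might be rendered in the winning order; the field names and helper are illustrative, not our actual schema:

```python
# Categorical and shared attributes first, identifiers last, per the
# winning format from the blind evaluation.
WINNING_FIELD_ORDER = [
    "directory_status", "employment_status", "department", "job_title",
    "status", "manager", "domain", "id", "name", "email",
]

def profile_to_embedding_text(profile):
    """Render a user profile as the newline-delimited key: value structure
    that Titan prefers, in the winning field order, with any custom
    profile fields appended at the end."""
    lines = []
    for field in WINNING_FIELD_ORDER:
        if field in profile:
            lines.append(f"{field}: {profile[field]}")
    # Custom profile fields follow the known fields, in their source order.
    for field, value in profile.items():
        if field not in WINNING_FIELD_ORDER:
            lines.append(f"{field}: {value}")
    return "\n".join(lines)
```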
Why AI-as-Judge Works Here
Using an AI judge for blind evaluation let us test 28 configurations systematically in a way that would have been impractical with human evaluation. The key design choices that made it work:
Blind labels prevented any bias toward a known-good configuration
Realistic data (406 user profiles representative of real-world organizational structures) ensured results reflected actual search patterns
Diverse query set covering 6 categories ensured no single query type dominated the ranking
Per-query vote tracking let us understand why a format won, not just that it won—which informed the final field ordering
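The blinding and vote-tracking steps can be sketched as follows (function names and label scheme details are hypothetical):

```python
import random
from collections import Counter

def assign_blind_labels(config_names, seed=0):
    """Map real configuration names to anonymous 'Embedding N' labels so
    the judge cannot favor a known-good configuration."""
    rng = random.Random(seed)
    shuffled = list(config_names)
    rng.shuffle(shuffled)
    return {name: f"Embedding {i + 1}" for i, name in enumerate(shuffled)}

def tally_votes(per_query_votes):
    """Count per-query wins per label; keeping votes per query is what
    reveals *why* a format won, not just that it won."""
    return Counter(per_query_votes.values())
```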
Results
After rolling this out across our fleet:
| Metric | Before (online/sync) | After (batch/async) | Improvement |
|--------|----------------------|---------------------|-------------|
| Time to embed millions of items | Weeks | Hours | 30x faster |
| Bedrock inference cost | Baseline | ~50% of baseline | ~50% reduction |
| Vector storage per embedding | 4,096 bytes | 2,048 bytes | 50% reduction |
| Rate-limit conflicts with live features | Frequent | Eliminated | No more throttling |
What We’d Do Differently
If I were starting this from scratch, I’d go directly to batch inference with half-precision vectors from day one. The online inference path made sense when we had a handful of tenants, but the migration cost—running parallel pipelines, dual-writing to new tables—was significant. Starting with the cheaper, more scalable option would have saved weeks of migration work.
I’d also look at even lower precision formats sooner. Binary quantization and Matryoshka embeddings are promising for further storage reduction with minimal quality loss—they’re next on my list.
Takeaways
Batch inference APIs exist for a reason. If your workload isn’t latency-sensitive, you’re likely overpaying for real-time inference. Check whether your cloud provider offers a batch pricing tier.
Half-precision vectors are probably fine. For most retrieval and similarity tasks, float16 quantization has negligible impact on quality. Test it on your data, but don’t assume you need float32.
Split your state machines by failure domain. A monolithic orchestrator that handles multiple independent workloads will eventually have a failure in one block the others. Separate them early.
Feature-flag your data migrations. Parallel-populate, validate, switch reads, then deprecate. It’s more tables and more flags, but it’s the difference between a zero-downtime migration and a maintenance window.