Mixed Index Routing

Mixed index routing in Apache JanusGraph decides which backend answers a given vertex or edge property predicate — the storage-backed composite index, the external search cluster, or, when nothing matches, a full graph scan. When a traversal predicate is routed to the wrong backend, the symptom is never a clear error: it is a query that was sub-millisecond in staging silently turning into a multi-second full scan in production, or a full-text query returning stale documents because it was dispatched to an index that trails the storage commit. This page scopes that routing surface within the parent External Index Synchronization & Consistency Tuning subsystem, and covers deterministic backend selection, shard alignment against storage topology, the async visibility window that routing must account for, and the gremlin-python fallback logic that keeps a degraded search cluster from taking down the whole traversal path.

The decision tree below shows how JanusGraph selects a backend for an incoming predicate.

The optimizer branches on predicate type at query-execution time — two branches reach an index, and the unindexed branch silently degrades to a full scan.

Routing Mechanics & Backend Selection

JanusGraph resolves the routing decision at query-execution time, not at schema-definition time. The query optimizer inspects each predicate in the traversal, matches it against the registered index definitions held in the schema, and picks the most selective index that can answer it. A predicate on a property key bound to a mixed index — anything requiring textContains, geoWithin, or an inequality range — is dispatched to the external search backend through the IndexProvider interface. An equality predicate on a key covered by a composite index stays inside the storage layer. A predicate that matches no index falls through to a full scan of the vertex or edge space.
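The three-way branch above can be sketched as a small Python model. It illustrates the decision order only and is not JanusGraph's optimizer: the Backend enum, select_backend, and the predicate labels are hypothetical names introduced here, and real selection also weighs selectivity across several predicates.

```python
from enum import Enum


class Backend(Enum):
    COMPOSITE = "composite index (storage)"
    MIXED = "mixed index (search backend)"
    FULL_SCAN = "full scan"


def select_backend(key, predicate_kind, composite_keys, mixed_keys):
    """Model the per-predicate routing branch.

    predicate_kind is an illustrative label such as "eq", "textContains",
    "geoWithin", or "range". composite_keys and mixed_keys are the sets of
    property keys covered by an ENABLED index of each type.
    """
    # Equality on a composite-indexed key stays inside the storage layer.
    if predicate_kind == "eq" and key in composite_keys:
        return Backend.COMPOSITE
    # Any predicate on a mixed-indexed key goes to the search backend.
    if key in mixed_keys:
        return Backend.MIXED
    # Nothing matched: the query degrades to a scan.
    return Backend.FULL_SCAN


if __name__ == "__main__":
    composite = {"email_exact"}
    mixed = {"email", "bio"}
    for key, kind in [("email_exact", "eq"), ("bio", "textContains"), ("nickname", "eq")]:
        print(key, kind, "->", select_backend(key, kind, composite, mixed).value)
```

The last case is the one that hurts in production: a key that belongs to no enabled index still returns results, only through a scan.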

Three operational gaps produce almost every routing failure:

Unbound property keys — the key was never added to a mixed index, or the index build never reached ENABLED, so the optimizer cannot route to it and silently falls back to a scan.
Degraded backend health — the search cluster is up but rejecting writes or reads (shard relocation, circuit breaker tripped), so routed queries time out or return partial results.
Mapping drift — the field mapping or analyzer in the search cluster no longer matches what JanusGraph expects, so a routed full-text predicate matches nothing even though the documents exist.

To enforce deterministic routing, define each named index backend with index.<name>.backend and index.<name>.hostname, bind property keys to it explicitly in the schema (buildMixedIndex("<name>") in the Management API), and confirm the index reached ENABLED before serving traffic. When you operate two backends in parallel (during a live migration between Elasticsearch Integration and OpenSearch Sync Patterns, for example), routing logic has to account for analyzer differences, tokenization rules, and shard allocation, because default field mappings and text analyzers diverge across the two engines and across major versions of each. A predicate that routes cleanly on one backend can match nothing on the other. When routing requires a hard fallback because the primary backend is unreachable, the topology and circuit-breaker design lives in Configuring Mixed Index Fallback Chains.
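As a rough sketch of the explicit binding step, the helper below builds a Groovy management script that could be submitted to Gremlin Server. The function name mixed_index_script and the index name in the usage line are hypothetical. The script assumes that a graph binding exists on the server, that the property key already exists, and that the JanusGraph classes it references (Mapping, ManagementSystem, SchemaStatus) are imported in the server's script engine.

```python
def mixed_index_script(index_name: str, key: str, backend_name: str = "search") -> str:
    """Return a Groovy script that binds `key` to a mixed index on `backend_name`.

    All three arguments are interpolated into the script, so they must be
    trusted identifiers; anything else is rejected.
    """
    for value in (index_name, key, backend_name):
        if not value.isidentifier():
            raise ValueError(f"not a plain identifier: {value!r}")
    return (
        "mgmt = graph.openManagement()\n"
        f"key = mgmt.getPropertyKey('{key}')\n"
        f"mgmt.buildIndex('{index_name}', Vertex.class)"
        ".addKey(key, Mapping.TEXT.asParameter())"
        f".buildMixedIndex('{backend_name}')\n"
        "mgmt.commit()\n"
        # Block until the index reports ENABLED (or the watcher times out).
        f"ManagementSystem.awaitGraphIndexStatus(graph, '{index_name}')"
        ".status(SchemaStatus.ENABLED).call()\n"
    )


if __name__ == "__main__":
    print(mixed_index_script("userEmailSearch", "email"))
```

The backend_name argument must match the <name> segment used in index.<name>.backend. An index added over a key that already holds data may stop at REGISTERED until a reindex runs, so inspect the status the watcher reports rather than assuming ENABLED.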

Which properties are eligible to route to a mixed index at all is governed upstream by your schema — see Property Indexing Rules for the cardinality and mapping decisions that determine whether a key can back a routed predicate.

Core Configuration & Consistency Tuning

Routing behavior is governed by janusgraph.properties. The block below is a production baseline tuned for high-throughput ingestion, a bounded visibility window, and stable backend routing. Every non-default value carries a routing or consistency consequence — the numbered constraints after it are the ones that change routing outcomes.

properties

# Storage backend
storage.backend=cql
storage.hostname=cassandra-01,cassandra-02,cassandra-03
storage.cql.keyspace=janusgraph_prod
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.cql.write-consistency-level=LOCAL_QUORUM

# Mixed index backend binding
index.search.backend=elasticsearch
index.search.hostname=es-01,es-02,es-03
index.search.port=9200
index.search.elasticsearch.client-only=true
index.search.elasticsearch.http.auth.type=basic
index.search.elasticsearch.http.auth.basic.username=graph_user
index.search.elasticsearch.http.auth.basic.password=${ES_PASSWORD}

# Index creation settings (applied once, at index creation time)
index.search.elasticsearch.create.ext.number_of_shards=3
index.search.elasticsearch.create.ext.number_of_replicas=1
index.search.elasticsearch.create.ext.refresh_interval=5s

# Bulk routing & consistency
index.search.elasticsearch.bulk-refresh=false
index.search.elasticsearch.bulk-size=1000
index.search.elasticsearch.max-retry-time=300000

Operational constraints that govern routing:

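client-only=true makes JanusGraph connect to your existing cluster nodes as a client rather than starting an embedded node. This is a routing prerequisite in any shared cluster, since an embedded node would join the search cluster and disturb shard allocation. The flag belongs to the older native-transport integration; releases that connect only through the REST client (index.search.elasticsearch.interface=REST_CLIENT) have no embedded mode, so confirm against the configuration reference for your version that the option is still recognized.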
create.ext.number_of_shards=3 fixes the shard count at index-creation time and is immutable afterward. Set it to mirror your storage topology (see the shard-alignment note below); changing it later requires a full reindex.
create.ext.refresh_interval=5s caps the visibility lag on the routed index. Lower values increase segment churn and merge I/O on the search cluster without giving you transactional read-after-write — that guarantee comes from bulk-refresh, not the refresh interval.
bulk-refresh=false disables a synchronous refresh per write batch. Leaving it on false is what keeps ingestion from thrashing the segment-merge pipeline; flip it to wait_for only on the narrow set of writes that must be immediately visible to a subsequent routed read.
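bulk-size=1000 caps documents per bulk request. Keep the resulting payload under a few MB: as a rule of thumb, bulk requests above roughly 5 MB add enough memory pressure to trip the Elasticsearch/OpenSearch parent circuit breaker on a loaded node. A tripped breaker returns HTTP 429 with a circuit_breaking_exception, while a saturated write thread pool rejects with EsRejectedExecutionException; from the traversal side both look identical to index lag.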
max-retry-time bounds JanusGraph’s internal retry window for a failed index commit. Tune it below your circuit-breaker recovery time so a single degraded flush does not queue behind an unbounded retry loop and stall the async dispatch queue.
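The bulk-size constraint is easier to apply as arithmetic. The sketch below estimates the largest bulk-size that keeps a request within a payload budget. The function name, the 5 MB default budget (the rule of thumb from the list above), and the per-document overhead figure are assumptions for illustration; measure your own average document size before trusting the result.

```python
def max_bulk_size(avg_doc_bytes: int,
                  payload_budget_bytes: int = 5 * 1024 * 1024,
                  overhead_per_doc: int = 128) -> int:
    """Largest document count whose estimated bulk payload fits the budget.

    overhead_per_doc approximates the action/metadata line that accompanies
    each document in a bulk request (an assumed figure, not a measured one).
    """
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    per_doc = avg_doc_bytes + overhead_per_doc
    return max(1, payload_budget_bytes // per_doc)


if __name__ == "__main__":
    for size in (512, 2048, 10240):
        print(f"{size} B documents -> bulk-size <= {max_bulk_size(size)}")
```

Under these assumptions, documents averaging 10 KB leave room for only about 500 per request, so bulk-size=1000 would roughly double the budget.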

Shard alignment. By default the search cluster auto-routes documents to shards by hash, which produces hot shards when graph partitions have skewed degree distributions — a handful of super-nodes concentrating writes onto one shard. Fix shard and replica counts at index-creation time and mirror the storage topology so routed writes spread across the same partitions the storage layer already balances. Over-sharding inflates cluster-state heap and slows recovery; under-sharding bottlenecks concurrent mutation bursts. This alignment depends on your storage partitioning, so settle your Replication Strategies before you fix shard counts.

Index Synchronization Protocol

Routing has to be reasoned about together with synchronization because JanusGraph deliberately decouples storage consistency from index visibility. A mutation commits to storage at LOCAL_QUORUM and returns to the caller immediately; the corresponding index document is dispatched asynchronously and becomes searchable only after a propagation window elapses. A predicate routed to the mixed index in that window reads a stale view.

The visibility window is the sum of three intervals:

t_visible = t_queue + t_bulk + t_refresh

where t_queue is time spent in the async dispatch queue, t_bulk is bulk-transport plus indexing latency on the search cluster, and t_refresh is bounded by create.ext.refresh_interval. Routing that assumes read-after-write will intermittently fail whenever t_visible exceeds the gap between a write and the read that depends on it.
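A short worked example makes the window concrete. The numbers below are illustrative rather than measured, and the two function names are introduced here only for the sketch.

```python
def visibility_window(t_queue: float, t_bulk: float, t_refresh: float) -> float:
    """Worst-case seconds between storage commit and index visibility."""
    return t_queue + t_bulk + t_refresh


def read_is_safe(write_to_read_gap: float, window: float) -> bool:
    """True when the dependent read is issued after the window has elapsed."""
    return write_to_read_gap >= window


if __name__ == "__main__":
    # Assumed figures: 0.2 s queued, 0.3 s bulk transport plus indexing,
    # and the 5 s refresh interval from the baseline configuration.
    window = visibility_window(0.2, 0.3, 5.0)
    print(f"t_visible = {window:.1f} s")
    for gap in (2.0, 6.0):
        print(f"read {gap:.0f} s after the write is safe: {read_is_safe(gap, window)}")
```

With these inputs the refresh interval dominates: the window is 5.5 s, so a read issued 2 s after its write can still see the old view.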

Two patterns keep routed reads correct without serializing all ingestion:

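Selective wait_for. For the specific writes whose result must be immediately visible to a subsequent routed query, issue the mutation with bulk-refresh=wait_for semantics so the commit blocks until the routed document is searchable. Because bulk-refresh is set on the graph configuration rather than on an individual transaction, scoping it typically means sending those writes through a separate graph instance configured with wait_for. Applying wait_for globally serializes throughput behind refresh and multiplies thread contention, so scope it to the read-your-writes path only.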
Lag-gated polling. For pipelines that reconcile against the index, poll a lag metric rather than sleeping a fixed interval. Track the index write-queue depth and the indexing-latency trend, and only route the reconciliation read once both are within threshold.
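Lag-gated polling can be sketched as a loop over two metric callables. Everything named here is hypothetical: wait_for_index_lag, the threshold defaults, and the two getter functions, which stand in for whatever your monitoring stack exposes for queue depth and indexing latency.

```python
import time


def wait_for_index_lag(get_queue_depth, get_indexing_latency_ms,
                       max_queue_depth=100, max_latency_ms=50.0,
                       timeout_s=30.0, poll_interval_s=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll both lag signals until they are within threshold.

    Returns True once it is reasonable to issue the reconciliation read,
    or False if the timeout expires first. sleep and clock are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        within_depth = get_queue_depth() <= max_queue_depth
        within_latency = get_indexing_latency_ms() <= max_latency_ms
        if within_depth and within_latency:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated queue that drains over three polls.
    depths = iter([500, 300, 50])
    ok = wait_for_index_lag(lambda: next(depths), lambda: 10.0, sleep=lambda _: None)
    print("safe to reconcile:", ok)
```

Returning False instead of raising leaves the decision to the caller, which can skip the reconciliation pass or fall back to a storage-level read.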

The lag signals to watch, and what a rising value means for routing:

IndexProvider queue size — a monotonically rising queue is producer backpressure; routed reads are trailing further behind with every batch.
Search cluster /_cat/thread_pool/write?v — non-zero rejections mean the producer is outrunning the search cluster, and routed writes are being dropped into the retry loop.
/_nodes/stats/indices/indexing latency — the leading indicator that t_bulk is growing and the visibility window is widening.
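The thread-pool signal is straightforward to check in code once the cat output is requested as JSON (for example with format=json appended to the request). The sketch below only parses rows; fetching them is left to your HTTP client. The function names and the sample rows are hypothetical, and the row keys assume the default node_name, name, and rejected columns.

```python
def write_rejections_by_node(rows):
    """Map node name to rejected-write count from cat thread_pool JSON rows.

    The cat API reports numbers as strings, so values are converted here.
    """
    totals = {}
    for row in rows:
        if row.get("name") != "write":
            continue  # ignore other thread pools if the caller fetched them all
        node = row.get("node_name", "unknown")
        totals[node] = totals.get(node, 0) + int(row.get("rejected") or 0)
    return totals


def producer_is_outrunning(rows) -> bool:
    """True when any node has rejected at least one write."""
    return any(count > 0 for count in write_rejections_by_node(rows).values())


if __name__ == "__main__":
    sample = [
        {"node_name": "es-01", "name": "write", "active": "2", "queue": "0", "rejected": "0"},
        {"node_name": "es-02", "name": "write", "active": "8", "queue": "200", "rejected": "17"},
        {"node_name": "es-02", "name": "search", "active": "1", "queue": "0", "rejected": "3"},
    ]
    print(write_rejections_by_node(sample))
    print("throttle producer:", producer_is_outrunning(sample))
```

Rejection counters are cumulative since node start, so an alert should compare successive samples rather than fire on any non-zero value.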

The deeper trade-off analysis of where to place the acknowledgment boundary lives in Eventual vs Strong Consistency; this page assumes you have chosen a boundary and are tuning routing to respect it.

Storage acknowledges at t₀, but the routed document is not searchable until t_visible — routing that assumes read-after-write fails whenever this window outlasts the gap between a write and its dependent read.

Python Integration Pattern

A production pipeline has to treat a routed mixed-index query as fallible: the backend can be mid-relocation, a routed predicate can time out, and the visibility window can return an empty set that is not actually empty. The gremlin-python pattern below wraps a routed traversal with bounded exponential backoff and jitter, distinguishes a transport failure from an empty result, and degrades to a storage-level path rather than hard-failing when the search backend is unreachable.

python

import time
import random
import logging
from gremlin_python.driver import client, serializer

logger = logging.getLogger(__name__)

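# Routing-aware client: bounded pool and explicit I/O timeouts so a stalled
# search backend surfaces as a timeout, not a hung request.
# Assumes gremlin-python 3.5+ with the default aiohttp transport. The driver
# has no max_in_process_per_connection or connection_timeout argument; extra
# keyword arguments are forwarded to the transport, where read_timeout and
# write_timeout (in seconds) bound each request.
gremlin_client = client.Client(
    "ws://graph-cluster:8182/gremlin",
    "g",
    message_serializer=serializer.GraphSONSerializersV3d0(),
    pool_size=4,
    read_timeout=5.0,
    write_timeout=5.0,
)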


def execute_with_backoff(query: str, bindings=None, max_retries: int = 4, base_delay: float = 2.0):
    """Run a mixed-index-routed traversal with exponential backoff and jitter.

    Raises ConnectionError once retries are exhausted so the caller can
    trigger an explicit fallback rather than swallowing the failure.
    """
    last_exc = None
    for attempt in range(max_retries):
        try:
            # submit() returns a ResultSet; all() resolves to the full result list.
            return gremlin_client.submit(query, bindings).all().result()
        except Exception as exc:  # broad on purpose: transport and server errors both retry
            last_exc = exc
            if attempt == max_retries - 1:
                break  # do not sleep after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            logger.warning(
                "Routing attempt %d/%d failed: %s. Retrying in %.2fs",
                attempt + 1, max_retries, exc, delay,
            )
            time.sleep(delay)
    raise ConnectionError("Mixed index backend unreachable after max retries") from last_exc


def run_mixed_index_pipeline(predicate: str, exact_value: str, limit: int = 100):
    """Route a full-text predicate through the mixed index, with a storage fallback.

    predicate is a trusted Gremlin predicate expression such as
    "textContains('acme')"; it is interpolated into the script, so never
    build it from user input. exact_value is the literal the equality
    fallback matches, passed as a binding.
    """
    routed = (
        "g.V().hasLabel('user')"
        f".has('email', {predicate})"
        ".limit(lim).elementMap()"
    )
    try:
        vertices = execute_with_backoff(routed, {"lim": limit})
        if not vertices:
            # Empty is ambiguous: genuinely no match, or index lag.
            logger.info("Routed query returned empty set; verify index sync lag before trusting it.")
        return vertices
    except ConnectionError as exc:
        logger.critical("Search backend down (%s); degrading to composite-index path.", exc)
        # Fallback: an equality predicate on a composite-indexed key stays in
        # storage and never touches the search cluster. Reusing the full-text
        # predicate here would miss the composite index, hence exact_value.
        fallback = "g.V().hasLabel('user').has('email_exact', val).limit(lim).elementMap()"
        return execute_with_backoff(fallback, {"val": exact_value, "lim": limit})

Two design points make this routing-safe rather than merely retry-wrapped. First, an empty result is logged as ambiguous, not returned as authoritative — during the visibility window an empty set can mean “not yet indexed,” and a pipeline that treats it as “does not exist” will make wrong decisions. Second, the fallback routes to a composite-indexed equality key that lives entirely in storage, so a search-cluster outage degrades to a narrower-but-correct storage query instead of an unbounded full scan.

Connection Lifecycle & Pool Management

Routing failures and pool exhaustion produce the same symptom — TimeoutException on the traversal — so the pool must be sized and bounded deliberately, or you will spend incidents chasing index lag that is actually a starved client.

Pool sizing. Set pool_size to match your write/read concurrency, not higher. A pool larger than the Gremlin Server’s threadPoolWorker count just queues on the server; a pool smaller than your concurrency starves callers and manifests as routed-query timeouts. Start at pool_size × in-flight requests per connection ≈ peak concurrent traversals and adjust from utilization metrics.
Connection timeout. Keep the client-side timeout short (5 s here) so a stalled search backend fails fast into the backoff loop rather than pinning a connection. A long timeout lets one degraded routed query hold a slot until every slot is held, converting a partial backend degradation into a full client stall.
Idle timeout and keepalive. Let idle connections recycle rather than pinning the full pool open against the server, and rely on the driver keepalive to detect a half-open socket after a network blip — a common aftermath of a search-cluster relocation.
Retry policy. Bound retries (4 attempts here) and always terminate in an explicit fallback. An unbounded retry loop against a down backend holds pool slots and turns a recoverable degradation into an outage.
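The sizing rule above reduces to a few lines of arithmetic. This is a starting-point sketch, not the full model from Connection Pooling: the function name and the three inputs are hypothetical, and the per-connection in-flight limit is 1 for any driver that runs a single request per pooled connection at a time.

```python
import math


def suggested_pool_size(peak_concurrent: int,
                        in_flight_per_connection: int,
                        server_workers: int) -> int:
    """Starting pool size from measured concurrency.

    Sized so pool_size * in_flight_per_connection covers the peak, then
    capped at the server worker count, since anything beyond that only
    queues on the server.
    """
    if min(peak_concurrent, in_flight_per_connection, server_workers) < 1:
        raise ValueError("all inputs must be >= 1")
    needed = math.ceil(peak_concurrent / in_flight_per_connection)
    return min(needed, server_workers)


if __name__ == "__main__":
    print(suggested_pool_size(peak_concurrent=40, in_flight_per_connection=10, server_workers=8))
    print(suggested_pool_size(peak_concurrent=200, in_flight_per_connection=10, server_workers=8))
```

When the cap applies, as in the second call, the remaining concurrency has to be absorbed by batching or backpressure on the caller side, because a larger pool would not add throughput.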

The full sizing model — worker threads, in-flight request caps, and how to derive pool size from measured concurrency — is documented in Connection Pooling, and the underlying transport is covered in Cassandra Backend Setup.

Diagnostics & Operational Fallbacks

Validate routing topology before deploying: run mgmt.printIndexes() in the Gremlin console to confirm every property key is bound to the intended backend and reports ENABLED, and check the search cluster /_cat/indices?v for shard health and document counts. When routing fails, JanusGraph falls back to a full graph scan unless query.force-index=true is set — with that flag, an unindexed traversal throws instead of quietly scanning, which is the safer production default. Pre-define composite indexes for every critical traversal path so a mixed-index outage degrades to a bounded storage query rather than a scan.
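The search-cluster half of that pre-deploy check can be automated once the cat indices output is requested as JSON. The sketch below parses rows only. The function name and sample data are hypothetical, and the janusgraph prefix assumes the default index name; change it if you override index.search.index-name.

```python
def unhealthy_indices(rows, prefix="janusgraph"):
    """Return (index, reason) pairs for indices that should block a deploy.

    rows is the parsed JSON from the cat indices endpoint, where numeric
    fields arrive as strings.
    """
    problems = []
    for row in rows:
        name = row.get("index", "")
        if not name.startswith(prefix):
            continue  # not an index this graph routes to
        if row.get("health") != "green":
            problems.append((name, f"health={row.get('health')}"))
        elif int(row.get("docs.count") or 0) == 0:
            problems.append((name, "no documents"))
    return problems


if __name__ == "__main__":
    sample = [
        {"index": "janusgraph_useremailsearch", "health": "green", "docs.count": "120934"},
        {"index": "janusgraph_geosearch", "health": "yellow", "docs.count": "8841"},
        {"index": "janusgraph_biosearch", "health": "green", "docs.count": "0"},
        {"index": "logs-2024", "health": "red", "docs.count": "5"},
    ]
    for name, reason in unhealthy_indices(sample):
        print(f"{name}: {reason}")
```

An empty index is flagged separately from a degraded one because it usually points to a build that never populated, which is the same condition that later shows up as a silent full scan.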

The triplets below cover the routing failure modes an on-call engineer will actually see.

Symptom	Diagnose	Resolve
Routed full-text query suddenly slow (staging was fast)	`.profile()` on the traversal shows a full scan step, not an index step; `mgmt.printIndexes()` shows the key not `ENABLED`	Rebuild/enable the mixed index for the key; set `query.force-index=true` so future routing gaps throw instead of scanning
Routed query returns fewer results than expected	Documents exist in `/_cat/indices?v` but the analyzer/mapping differs from what JanusGraph expects	Mapping drift — realign the field mapping/analyzer, then `REINDEX` via the Management API; verify on both backends if mid-migration
Recent writes missing from routed results	`/_nodes/stats/indices/indexing` latency rising; `IndexProvider` queue climbing	Visibility window stretching under load — throttle the producer, or scope `bulk-refresh=wait_for` to the read-your-writes path only
`EsRejectedExecutionException` on routed writes	`/_cat/thread_pool/write?v` shows non-zero rejections; bulk payload near/over 5 MB	Lower `bulk-size`; add producer-side backpressure; scale index write threads before retrying
Routed traversal times out during ingestion bursts	`nodetool tpstats` clean but driver throws `TimeoutException`; pool utilization at 100%	Starved client pool, not index lag — resize per the Connection Pooling model and cap batch concurrency

For ScyllaDB-backed clusters, run nodetool repair before any index rebuild so the routed index is not populated from an under-replicated storage view — the read/write consistency benchmarks that bound how tight the visibility window can safely go are in ScyllaDB Migration.

Frequently Asked Questions

Why did a fast query become a full scan in production but not staging? The property key is not routed to a mixed index in production — either it was never added, or its index build never reached ENABLED. The optimizer cannot route to an index that is not enabled, so it falls back to a full scan. Confirm with .profile() and mgmt.printIndexes(), and set query.force-index=true so a routing gap throws instead of silently scanning.

Does raising storage consistency to QUORUM fix stale routed reads? No. Storage consistency governs durability and read repair inside the storage cluster only. Index visibility is a separate downstream concern bounded by t_queue + t_bulk + t_refresh. Raising the storage level increases write latency without shrinking the routed-index visibility window.

Should I lower refresh_interval to get read-after-write on routed queries? No. A lower refresh interval reduces t_refresh but adds segment-merge I/O and never gives transactional read-after-write. Immediate visibility comes from bulk-refresh=wait_for scoped to the specific writes that need it, not from the global refresh interval.

Why does a routed predicate match nothing after switching backends? Elasticsearch and OpenSearch ship different default analyzers and field mappings, and they diverge across major versions. A predicate that tokenizes one way on the old backend can match nothing on the new one. Realign the analyzer/mapping and REINDEX, and during a migration verify routing on both backends before cutting over.

Up a level: External Index Synchronization & Consistency Tuning — the subsystem this routing surface sits inside.
Configuring Mixed Index Fallback Chains — circuit-breaker topology for when a routed backend is unreachable.
Elasticsearch Integration — transport, auth, and dispatch wiring for Elasticsearch backends.
OpenSearch Sync Patterns — version-aware mapping and drift reconciliation for OpenSearch.
Eventual vs Strong Consistency — choosing the acknowledgment boundary the routing window must respect.
Connection Pooling — the client sizing model that separates pool starvation from index lag.