Configuring Mixed Index Fallback Chains
Production Apache JanusGraph deployments routinely encounter transient mixed index degradation caused by network partitions, Elasticsearch/OpenSearch garbage collection pauses, or shard rebalancing. Hard-failing queries during these windows violates SLOs and triggers downstream pipeline backpressure. Configuring mixed index fallback chains requires explicit topology design, strict timeout boundaries, and application-level routing logic, so that query execution degrades gracefully without violating consistency guarantees or triggering unbounded storage scans.
The state machine below captures the circuit-breaker lifecycle the fallback router moves through.
```mermaid
stateDiagram-v2
    [*] --> Primary
    Primary --> Fallback: health check fails
    Fallback --> Recovery: replicas UN and queue drained
    Recovery --> Primary: index status ENABLED
    Fallback --> Fallback: serve via composite index
```
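The transitions in the diagram can be sketched as a small pure function. This is a minimal illustration, not part of JanusGraph: `RouterState`, `next_state`, and the three boolean signals are hypothetical names, and the caller is assumed to derive each signal from its own health checks.

```python
from enum import Enum


class RouterState(Enum):
    PRIMARY = "primary"
    FALLBACK = "fallback"
    RECOVERY = "recovery"


def next_state(
    state: RouterState,
    *,
    health_ok: bool,
    replicas_ready: bool,
    index_enabled: bool,
) -> RouterState:
    """Return the next breaker state for one evaluation tick.

    Hypothetical input signals supplied by the caller:
      health_ok      -> the primary index health check passed
      replicas_ready -> replicas are up and the sync queue has drained
      index_enabled  -> the mixed index is reported as ENABLED
    """
    if state is RouterState.PRIMARY:
        return RouterState.PRIMARY if health_ok else RouterState.FALLBACK
    if state is RouterState.FALLBACK:
        # Keep serving via composite indexes until recovery preconditions hold.
        return RouterState.RECOVERY if replicas_ready else RouterState.FALLBACK
    # RECOVERY: promote only once the index is confirmed ENABLED.
    return RouterState.PRIMARY if index_enabled else RouterState.RECOVERY
```

Keeping the transition logic free of I/O makes it easy to unit-test each edge of the diagram separately from the health probes.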
Index Topology & Routing Architecture
JanusGraph does not natively auto-chain mixed indexes. You must provision multiple index backends and implement deterministic routing logic at the query proxy or pipeline layer. A production-grade fallback hierarchy operates as follows:
- Primary Mixed Index: High-throughput Elasticsearch/OpenSearch cluster handling standard predicate queries.
- Secondary Mixed Index: Geographically isolated or lower-tier cluster absorbing overflow during primary degradation.
- Composite Index + Bounded Storage Scan: Fallback path restricted to exact-match predicates covered by composite indexes, which answer equality lookups only; range predicates fall through to a bounded storage scan, permitted by query.force-index=false.
The routing layer must evaluate index health, replication lag, and query complexity before dispatch. When primary latency exceeds the defined SLO threshold, traffic shifts to the secondary. If both external indexes become unreachable, the pipeline routes to composite indexes with explicit vertex/edge constraints. This architecture aligns with established Mixed Index Routing patterns, where query planners integrate circuit breakers instead of relying on backend auto-discovery.
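The dispatch decision described above can be sketched as follows. This is an illustrative simplification: `IndexHealth`, `choose_tier`, the default thresholds, and the `"reject"` outcome are all assumptions, and query complexity is reduced to a single `composite_covered` flag that the caller is assumed to compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexHealth:
    """Hypothetical health snapshot for one external index cluster."""
    reachable: bool
    p99_latency_ms: float
    replication_lag_s: float


def choose_tier(
    primary: IndexHealth,
    secondary: IndexHealth,
    composite_covered: bool,
    latency_slo_ms: float = 500.0,  # assumed SLO threshold
    max_lag_s: float = 30.0,        # assumed replication lag budget
) -> str:
    """Pick the first tier that is reachable and within its budgets."""

    def usable(health: IndexHealth) -> bool:
        return (
            health.reachable
            and health.p99_latency_ms <= latency_slo_ms
            and health.replication_lag_s <= max_lag_s
        )

    if usable(primary):
        return "primary"
    if usable(secondary):
        return "secondary"
    # Neither external index is usable: fall back to composite indexes only
    # when they can answer the query, otherwise refuse instead of scanning.
    return "composite" if composite_covered else "reject"
```

Returning `"reject"` rather than silently scanning keeps the unbounded-scan risk an explicit decision for the caller.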
JanusGraph Configuration for Degraded Index States
Define both primary and fallback mixed indexes explicitly in janusgraph.properties. Enforce strict timeout boundaries and index-force toggles to prevent thread pool exhaustion during state transitions.
```properties
# Primary Mixed Index Configuration
index.primary.backend=elasticsearch
index.primary.hostname=es-primary-01,es-primary-02,es-primary-03
index.primary.elasticsearch.client-only=true
# Timeouts are in milliseconds
index.primary.elasticsearch.connect-timeout=2000
index.primary.elasticsearch.socket-timeout=5000
index.primary.elasticsearch.max-retry-timeout=10000

# Fallback Mixed Index Configuration
index.fallback.backend=elasticsearch
index.fallback.hostname=os-fallback-01,os-fallback-02
index.fallback.elasticsearch.client-only=true
index.fallback.elasticsearch.connect-timeout=1500
index.fallback.elasticsearch.socket-timeout=3000

# Query Safety Boundaries
query.force-index=true
query.fast-property=true
storage.read-time=10000
storage.write-time=15000
```
During degradation, dynamically toggle query.force-index=false only when fallback composite indexes fully cover the query predicates. Leaving query.force-index=true while routing to composite indexes makes any traversal that no index can answer fail immediately with a JanusGraphException ("Could not find a suitable index to answer graph query and graph scans are disabled"). Apply configuration changes at runtime via the ManagementSystem API or a hot-reload endpoint. For detailed synchronization mechanics, consult External Index Synchronization & Consistency Tuning to align backend commit intervals with fallback thresholds.
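A coverage check of this kind might look like the sketch below. It relies on the fact that a composite index can answer a query only when every key of the index is bound by an equality predicate. The function name and the plain-set representation of indexes are assumptions for illustration; in practice the key sets would be read from the schema.

```python
from typing import FrozenSet, Iterable, Set


def composite_covers(
    equality_keys: Set[str],
    composite_indexes: Iterable[FrozenSet[str]],
) -> bool:
    """Return True if at least one composite index can answer the query.

    equality_keys     -> property keys the query constrains with equality
    composite_indexes -> key set of each composite index (hypothetical input,
                         assumed to be loaded from the graph schema)
    """
    return any(
        len(index_keys) > 0 and index_keys <= equality_keys
        for index_keys in composite_indexes
    )


# Example: an index on (name) covers a query on name and age,
# but an index on (name, age) cannot serve a query on name alone.
indexes = [frozenset({"name", "age"})]
print(composite_covers({"name", "age"}, indexes))  # True
print(composite_covers({"name"}, indexes))         # False
```

A router could call this before toggling query.force-index and refuse the query when it returns False.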
Python Pipeline Integration for Dynamic Fallback
Implement the fallback chain in your ingestion or query pipeline using a stateful circuit breaker. The following Python implementation demonstrates health polling, dynamic configuration toggling, and bounded fallback execution.
```python
import time
from typing import Any, Callable, List

import requests
from gremlin_python.process.graph_traversal import GraphTraversalSource
from gremlin_python.process.traversal import Traversal


class JanusFallbackRouter:
    def __init__(self, primary_url: str, graph: Any, g: GraphTraversalSource):
        self.primary_url = primary_url
        # `graph` must expose JanusGraph's management API (openManagement).
        # gremlinpython does not provide it, so pass a bridge or wrapper object.
        self.graph = graph
        self.g = g  # traversal source used to build every attempt
        self.state = "primary"
        self.healthy = True
        self.last_health_check = 0.0
        self.cooldown = 5.0  # seconds between health polls

    def check_index_health(self) -> bool:
        # Reuse the cached result inside the cooldown window.
        if time.time() - self.last_health_check < self.cooldown:
            return self.healthy
        try:
            # Validate Elasticsearch/OpenSearch cluster health
            resp = requests.get(f"{self.primary_url}/_cluster/health", timeout=2)
            resp.raise_for_status()
            self.healthy = resp.json().get("status") in ("green", "yellow")
        except (requests.RequestException, ValueError):
            self.healthy = False
        self.last_health_check = time.time()
        return self.healthy

    def toggle_force_index(self, force: bool) -> None:
        # Runtime configuration update via JanusGraph ManagementSystem
        mgmt = self.graph.openManagement()
        mgmt.set("query.force-index", force)
        mgmt.commit()

    def _enter_fallback(self) -> None:
        self.state = "fallback"
        self.toggle_force_index(False)
        # Hold the fallback state for at least one cooldown period.
        self.healthy = False
        self.last_health_check = time.time()

    def execute_query(
        self,
        build_traversal: Callable[[GraphTraversalSource], Traversal],
        timeout_ms: int = 5000,
    ) -> List[Any]:
        # Accept a factory: a traversal cannot be re-iterated once executed, so
        # each attempt must build a fresh one. The per-query timeout is set on
        # the traversal source with the 'evaluationTimeout' option
        # (gremlinpython has no withTimeout step).
        g = self.g.with_("evaluationTimeout", timeout_ms)
        healthy = self.check_index_health()
        if self.state == "primary" and not healthy:
            self._enter_fallback()
        elif self.state == "fallback" and healthy:
            # Primary recovered: restore strict index enforcement.
            self.state = "primary"
            self.toggle_force_index(True)

        if self.state == "fallback":
            try:
                # Execute against composite/bounded path
                return build_traversal(g).toList()
            except Exception as e:
                raise RuntimeError(
                    "Fallback traversal failed. Verify composite index coverage."
                ) from e

        try:
            return build_traversal(g).toList()
        except Exception:
            # Primary attempt failed: trip the breaker and retry once.
            self._enter_fallback()
            return build_traversal(g).toList()
```
Diagnostic Validation & Recovery Procedures
Verify the fallback chain before promoting changes to production. Follow these reproducible steps to validate routing behavior and Apache JanusGraph Storage Backend & Index Synchronization alignment.
- Baseline Health Check: Query the Elasticsearch cluster health API to confirm status: green. Verify JanusGraph index registration via mgmt.getGraphIndex('index.primary').getIndexStatus().
- Simulate Primary Degradation: Throttle network traffic to the primary index hosts using tc qdisc (Linux Traffic Control) or inject a 503 response at the API gateway layer.
- Observe Routing Shift: Monitor pipeline logs for state transitions. Confirm query.force-index toggles to false within 2 seconds of primary failure.
- Validate Composite Coverage: Execute a bounded traversal using g.V().has('property_key', 'value'). If the query returns results without a full-scan warning, composite coverage is valid.
- Recovery Verification: Restore primary index connectivity. The pipeline should automatically revert to state = "primary" and toggle query.force-index=true after the cooldown period expires.
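For local testing, the 503 injection step can be approximated without touching a real cluster. The sketch below starts a stub HTTP server that answers every GET with a fixed status code, then probes it the way a health check would. `make_stub` and `cluster_is_healthy` are hypothetical helpers written with the standard library only; they do not replace a test against real Elasticsearch/OpenSearch hosts.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_stub(status_code: int, body: dict) -> HTTPServer:
    """Start a local server that answers every GET with a fixed response."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            payload = json.dumps(body).encode()
            self.send_response(status_code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep test output quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def cluster_is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True only for a 2xx response reporting green or yellow."""
    # Bypass any configured HTTP proxy so the local stub is reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/_cluster/health", timeout=timeout_s) as resp:
            return json.load(resp).get("status") in ("green", "yellow")
    except (urllib.error.URLError, ValueError, OSError):
        return False  # 5xx, refused connection, timeout, or invalid JSON


if __name__ == "__main__":
    degraded = make_stub(503, {"error": "service unavailable"})
    url = f"http://127.0.0.1:{degraded.server_address[1]}"
    print("healthy:", cluster_is_healthy(url))
    degraded.shutdown()
```

Pointing the router's primary URL at the stub lets you watch the state transition in pipeline logs before repeating the exercise with tc qdisc on real hosts.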
If fallback traversals return empty result sets despite valid data, verify index lifecycle alignment per JanusGraph Index Management. Ensure commit intervals match the fallback timeout window, and confirm that composite indexes include all required predicate properties. For cluster health API specifics, reference the Elasticsearch Cluster Health API documentation to tune polling intervals against GC pause durations.
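One way to tune polling against GC pauses is to require several consecutive failed polls before tripping the breaker, so that a single pause cannot cause a failover. The sketch below is an assumption-laden illustration: `failures_to_trip` and `FailureGate` are hypothetical names, and the model assumes evenly spaced polls that fail only while the pause is in progress.

```python
import math


def failures_to_trip(max_gc_pause_s: float, poll_interval_s: float) -> int:
    """Consecutive failures needed so one GC pause cannot trip the breaker.

    At most floor(pause / interval) + 1 evenly spaced polls can land inside
    a single pause, so the threshold is one more than that.
    """
    if poll_interval_s <= 0 or max_gc_pause_s < 0:
        raise ValueError("interval must be positive and pause non-negative")
    polls_inside_pause = math.floor(max_gc_pause_s / poll_interval_s) + 1
    return polls_inside_pause + 1


class FailureGate:
    """Counts consecutive failed polls and signals when to trip."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, healthy: bool) -> bool:
        """Record one poll result; return True when the breaker should trip."""
        self.consecutive = 0 if healthy else self.consecutive + 1
        return self.consecutive >= self.threshold
```

For example, with a 5-second poll interval and an observed worst-case pause of 12 seconds, `failures_to_trip(12.0, 5.0)` yields a threshold of 4 consecutive failures. The trade-off is slower detection of real outages, so the pause figure should come from measured GC logs, not a guess.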