Eventual vs Strong Consistency Tradeoffs in JanusGraph

Decoupling the graph traversal engine from its persistence and indexing layers is JanusGraph’s primary architectural advantage, but it establishes a hard operational boundary around transactional guarantees. When evaluating Eventual vs Strong Consistency, you are explicitly trading write throughput for read accuracy across the Apache JanusGraph Storage Backend & Index Synchronization boundary. The storage layer (Cassandra, ScyllaDB, HBase) commits vertex/edge mutations using the backend's own consistency semantics (tunable quorums on Cassandra and ScyllaDB), while the external index (Elasticsearch, OpenSearch, Solr) makes mixed-index updates searchable only after its next refresh or commit cycle. No native distributed two-phase commit spans both systems. Platform teams must enforce consistency boundaries at the configuration layer, implement application-level verification, and define explicit fallback procedures for SLA breaches.

Configuration Matrix: Storage vs Index Alignment

JanusGraph exposes consistency controls through janusgraph.properties. The following configurations isolate the exact properties required to shift between tightened consistency (near-strong) and high-throughput eventual consistency postures.

Tightened Consistency Posture

This configuration minimizes index lag by forcing rapid segment refreshes and enforcing strict storage quorums. It increases backend I/O pressure and reduces ingestion throughput. Use only for low-volume, read-critical workloads requiring immediate query accuracy.

properties
# Storage Backend (Cassandra/ScyllaDB)
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.write-consistency-level=QUORUM
storage.cql.read-consistency-level=QUORUM

# External Index (Elasticsearch/OpenSearch)
index.search.backend=elasticsearch
index.search.hostname=10.0.2.10,10.0.2.11,10.0.2.12
index.search.elasticsearch.http-connection-timeout=10000
# create.ext.* settings are applied only when JanusGraph creates the index
index.search.elasticsearch.create.ext.refresh_interval=1s
index.search.elasticsearch.create.ext.number_of_replicas=1
index.search.elasticsearch.create.ext.number_of_shards=5

Eventual Consistency Posture

Default production posture for high-throughput ingestion pipelines. Storage commits locally for minimal latency, and index updates become searchable only at the next refresh. Read-after-write queries against the mixed index may return stale results until that refresh cycle completes, and with LOCAL_ONE a storage read can also miss a write that has not yet reached the replica being read. Refer to External Index Synchronization & Consistency Tuning for backend-specific queue tuning.

properties
# Storage Backend
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.write-consistency-level=LOCAL_ONE
storage.cql.read-consistency-level=LOCAL_ONE

# External Index
index.search.backend=elasticsearch
index.search.hostname=10.0.2.10,10.0.2.11,10.0.2.12
index.search.elasticsearch.http-connection-timeout=10000
# create.ext.* settings are applied only when JanusGraph creates the index
index.search.elasticsearch.create.ext.refresh_interval=30s
index.search.elasticsearch.create.ext.number_of_replicas=2
index.search.elasticsearch.create.ext.number_of_shards=10

Diagnostic Procedures & Lag Measurement

Consistency boundaries must be validated against actual ingestion rates. Use the following reproducible steps to measure index synchronization lag and verify storage-to-index alignment.

Step 1: Baseline Index Lag Measurement

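Run this Gremlin snippet from the console against a live cluster. It writes a probe vertex, commits it, and polls the mixed index until the probe becomes visible; the elapsed time between commit and first visibility is the index lag.

gremlin
// 1. Write a probe vertex with a timestamp property and commit it
written = System.currentTimeMillis()
g.addV('test_node').property('id', 'diag-001').property('ts', written).next()
g.tx().commit()

// 2. Poll the mixed index (fresh transaction per attempt) until the probe is visible or 10s pass
deadline = written + 10000
while (!g.V().has('test_node', 'id', 'diag-001').hasNext() && System.currentTimeMillis() < deadline) { g.tx().rollback(); Thread.sleep(50) }
lagMs = System.currentTimeMillis() - written

lagMs is the observed index lag. A value exceeding refresh_interval * 2 indicates index queue backlog or segment merge contention; a value at or above the 10s deadline means the probe never became visible.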
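The delta calculation above can also be sketched in Python as a polling loop. This is a minimal sketch under stated assumptions, not JanusGraph API: `measure_index_lag`, `is_backlogged`, and the `is_visible` callable are hypothetical names, and `is_visible` stands in for whatever mixed-index lookup your pipeline runs.

```python
import time
from typing import Callable


def measure_index_lag(is_visible: Callable[[], bool],
                      written_at: float,
                      timeout_sec: float = 10.0,
                      poll_sec: float = 0.1) -> float:
    """Poll until the probe is visible; return seconds elapsed since written_at.

    is_visible is a hypothetical callable wrapping a mixed-index lookup.
    written_at must be a time.monotonic() reading taken right after the commit.
    """
    deadline = written_at + timeout_sec
    while True:
        if is_visible():
            return time.monotonic() - written_at
        if time.monotonic() >= deadline:
            raise TimeoutError(f"probe not visible after {timeout_sec}s")
        time.sleep(poll_sec)


def is_backlogged(lag_sec: float, refresh_interval_sec: float) -> bool:
    """Rule of thumb from the text: lag beyond 2x refresh_interval is suspect."""
    return lag_sec > 2 * refresh_interval_sec
```

Passing the lookup in as a callable keeps the timing logic independent of the driver, so the same loop can be exercised in unit tests with a stub and in production with a real traversal.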

Step 2: Python Pipeline Verification

Embed this validation routine into ingestion pipelines to detect stale reads before downstream consumers process stale graph state.

python
import time

def verify_consistency(g, vertex_label, prop_key, prop_value, timeout_sec=10):
    """Poll until the vertex is visible to traversals or the timeout expires.

    g is a gremlin_python GraphTraversalSource bound to a remote connection.
    """
    deadline = time.monotonic() + timeout_sec
    last_error = None
    while time.monotonic() < deadline:
        try:
            # has_next() is the gremlinpython 3.5+ spelling; older releases use hasNext().
            if g.V().has(vertex_label, prop_key, prop_value).has_next():
                return True
        except Exception as exc:
            # Keep the most recent failure so the timeout reports its cause.
            last_error = exc
        time.sleep(0.5)
    raise TimeoutError(
        f"Index synchronization exceeded {timeout_sec}s SLA for {prop_key}={prop_value}"
    ) from last_error

Step 3: Backend Metrics Correlation

Monitor Elasticsearch refresh latency directly. In the refresh stats, a high total_time_in_millis relative to total means each refresh is slow, which typically points to disk I/O saturation. Query the _stats API:

bash
curl -s "http://10.0.2.10:9200/janusgraph_mixed_index/_stats/refresh?pretty" | jq '.indices[].total.refresh'

Cross-reference with Cassandra/ScyllaDB WriteLatency and ReadLatency metrics. If storage latency is stable but index lag grows, the bottleneck is in the external index (refresh or segment merge work), not JanusGraph’s transaction manager.
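To turn the refresh counters into a single comparable number, the response can be reduced to an average refresh duration per index. The following is a minimal sketch: `avg_refresh_ms` is a hypothetical helper, and the sample payload is a hand-trimmed illustration of the response shape rather than real cluster output.

```python
import json
from typing import Dict


def avg_refresh_ms(stats_payload: str) -> Dict[str, float]:
    """Return average milliseconds per refresh for each index in a _stats/refresh response."""
    stats = json.loads(stats_payload)
    averages = {}
    for name, body in stats.get("indices", {}).items():
        refresh = body["total"]["refresh"]
        count = refresh["total"]
        # Guard against an index that has not refreshed yet.
        averages[name] = refresh["total_time_in_millis"] / count if count else 0.0
    return averages


# Illustrative payload only; values are made up.
sample = (
    '{"indices": {"janusgraph_mixed_index": {"total": {"refresh": '
    '{"total": 200, "total_time_in_millis": 5000}}}}}'
)
print(avg_refresh_ms(sample))  # {'janusgraph_mixed_index': 25.0}
```

Both counters are cumulative, so for trend analysis it is more informative to compute the same ratio over the difference between two samples taken a fixed interval apart.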

Fallback & Incident Response Protocols

When consistency boundaries degrade or SLAs breach, execute the following fallback procedures in order. Do not attempt to bypass the index without explicit validation of storage state.

Fallback 1: Forced Index Refresh

If read-after-write queries consistently time out or return stale data, trigger an immediate segment refresh on the mixed index. This forces pending mutations into searchable segments.

bash
curl -X POST "http://10.0.2.10:9200/janusgraph_mixed_index/_refresh"

Warning: Frequent forced refreshes degrade indexing throughput. Use only during incident response or scheduled maintenance windows.
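One way to honor that warning in automation is to put a cool-down in front of the refresh call. The sketch below is an assumption about how a pipeline might do this, not part of JanusGraph or Elasticsearch: `RefreshThrottle` and its parameters are hypothetical, and `do_refresh` stands in for whatever function issues the POST shown above.

```python
import time
from typing import Callable, Optional


class RefreshThrottle:
    """Allow a forced refresh at most once per min_interval_sec."""

    def __init__(self,
                 do_refresh: Callable[[], None],
                 min_interval_sec: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._do_refresh = do_refresh          # hypothetical callable that issues the refresh
        self._min_interval = min_interval_sec
        self._clock = clock                    # injectable for testing
        self._last: Optional[float] = None

    def request(self) -> bool:
        """Run the refresh if the cool-down has elapsed; return whether it ran."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval:
            return False
        self._do_refresh()
        # Only record the attempt after it succeeds, so a failed call can be retried.
        self._last = now
        return True
```

Callers that receive `False` should fall back to waiting for the normal refresh cycle rather than retrying in a tight loop.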

Fallback 2: Storage-Only Read Bypass

When the external index is unavailable or severely desynchronized, route critical read operations directly to the storage backend using graph traversals that avoid has() predicates on indexed properties.

gremlin
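// Bypass mixed index: scan by label and filter in-memory.
// limit() caps how many vertices are scanned, so matches beyond the cap are missed.
g.V().hasLabel('critical_entity').limit(1000).filter { it.get().property('status').orElse(null) == 'active' }

Constraint: This bypasses index-backed range and text queries. Because limit() precedes the filter, it bounds the scan rather than the number of matches, so results may be incomplete. Keep a strict limit() in place to prevent full-table scans in production.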

Fallback 3: Index Rebuild Procedure

If index corruption or persistent desynchronization occurs, rebuild the mixed index from storage. This operation is blocking and requires a maintenance window.

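  1. Pause ingestion so that no mutations arrive during the rebuild. The setting below only stops Elasticsearch from auto-expanding replicas, and only when JanusGraph creates the index; it does not disable JanusGraph index updates:
properties
index.search.elasticsearch.create.ext.auto_expand_replicas=false
  2. Execute JanusGraph management API reindex:
java
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey key = mgmt.getPropertyKey("indexed_property");
// get() blocks until the reindex job has finished
mgmt.updateIndex(mgmt.getGraphIndex("mixed_index"), SchemaAction.REINDEX).get();
mgmt.commit();
  3. Confirm the resulting index state in a new management transaction via mgmt.getGraphIndex("mixed_index").getIndexStatus(key).
  4. Resume ingestion and verify consistency using Step 1 diagnostics.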

Operational Guardrails

  • Never set refresh_interval below 1s in production. Sub-second refreshes create many small segments, which increases merge load and slows both indexing and search.
  • Align storage.cql.write-consistency-level with your replication factor, not just your node count. QUORUM with a replication factor of 3 tolerates one unavailable replica; ALL tolerates none.
  • Implement idempotent write patterns in Python pipelines. JanusGraph does not guarantee exactly-once semantics across the storage/index boundary during network partitions.
  • Log index lag metrics alongside application traces. Correlate the refresh total_time_in_millis counter with pipeline ingestion rates to predict SLA breaches before they impact consumers.
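The fault-tolerance figures in the consistency-level guardrail follow from simple quorum arithmetic. The sketch below assumes a single datacenter and uses hypothetical helper names (`quorum_size`, `tolerated_failures`); it is an illustration of the arithmetic, not a driver API.

```python
def quorum_size(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM operation: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, level: str) -> int:
    """Replicas that may be down while the given consistency level still succeeds."""
    required = {
        "ONE": 1,
        "QUORUM": quorum_size(replication_factor),
        "ALL": replication_factor,
    }[level]
    return replication_factor - required


for rf in (3, 5):
    print(rf, {lvl: tolerated_failures(rf, lvl) for lvl in ("ONE", "QUORUM", "ALL")})
# 3 {'ONE': 2, 'QUORUM': 1, 'ALL': 0}
# 5 {'ONE': 4, 'QUORUM': 2, 'ALL': 0}
```

Note that the input is the keyspace replication factor: a 6-node cluster with a replication factor of 3 still tolerates only one unavailable replica per partition at QUORUM.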