Eventual vs Strong Consistency Tradeoffs in JanusGraph

This guide walks you through selecting and enforcing the correct consistency posture for a single JanusGraph workload — mapping its read-your-writes SLA to a concrete janusgraph.properties configuration — so you never discover, in production, that a fraud-detection traversal is reading index state that trails the committed graph by seconds. It sits under Eventual vs Strong Consistency, which explains where the acknowledgment boundary lives; this page is the decision-and-enforcement procedure that turns that boundary into a tuned deployment. The failure it prevents is the silent one: a global posture chosen by default rather than by SLA, where either every write serializes behind an index refresh or every search query risks returning stale results.

The core constraint is architectural. JanusGraph decouples the traversal engine from its persistence and index layers, so the storage backend (Cassandra, ScyllaDB, HBase) commits vertex and edge mutations under tunable quorum semantics while the external index (Elasticsearch, OpenSearch, Solr) processes mixed-index queries asynchronously. No distributed two-phase commit spans both systems. You enforce the posture you want at the configuration layer, verify it with lag measurement, and define explicit fallbacks for when the boundary breaches.

Prerequisites

Confirm all of the following before you change any property. Getting the posture wrong is cheap to fix; discovering a version or topology mismatch mid-incident is not.

JanusGraph 0.6.x or 1.0.x, with the CQL storage adapter and the elasticsearch index backend (the same backend value drives OpenSearch).
Storage cluster (Cassandra 4.x / ScyllaDB 5.x) healthy: nodetool status shows every node UN, and the replication factor of the JanusGraph keyspace is known. Align it first via Replication Strategies — a QUORUM write that must cross a datacenter changes every latency number below.
Index cluster green: GET /_cluster/health returns "status":"green" with no unassigned shards.
A defined SLA for the workload you are tuning, stated as a number: the maximum tolerable delay between a committed write and that write being visible to a mixed-index query. “Immediate”, “a few seconds”, and “don’t care” each map to a different posture.
Management access to edit janusgraph.properties and restart the server pool, plus HTTP access to the index _stats, _refresh, and _settings endpoints.
A non-production graph to validate the change first. Never move a consistency boundary for the first time under live traffic.

Step-by-Step Procedure

Step 1 — Classify the workload against its SLA

Write down which of three postures the workload actually needs. This single decision drives every property that follows.

Eventual, fire-and-forget — bulk ingestion, ETL, analytical scans. No traversal reads its own write inside the replication window. Maximize throughput.
Eventual with selective read-your-writes — mostly asynchronous, but a narrow set of records must be visible immediately. Keep the fast path and poll the index only for those records (covered in the parent Eventual vs Strong Consistency integration pattern).
Strong, near-synchronous — low-volume, read-critical workloads where the write cannot be considered done until it is searchable. Accept the throughput tax.
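
The three-way classification above can be written down as a small decision function, which makes the choice reviewable alongside the SLA. This is a minimal sketch, not part of JanusGraph: the function name, its parameters, and the posture labels are hypothetical, and the rule of comparing the SLA against refresh_interval is taken from this guide rather than from any library.

```python
from typing import Optional


def classify_posture(
    sla_s: Optional[float],
    refresh_interval_s: float,
    only_some_records_urgent: bool = False,
) -> str:
    """Map a visibility SLA to one of the three postures (hypothetical labels).

    sla_s is the maximum tolerated delay, in seconds, between a committed
    write and its visibility to a mixed-index query. None means "don't care".
    """
    # The SLA is looser than the refresh cycle: the asynchronous path already meets it.
    if sla_s is None or sla_s >= refresh_interval_s:
        return "eventual"
    # Only a narrow set of records is urgent: keep the fast path, poll for those.
    if only_some_records_urgent:
        return "eventual-selective"
    # The write is not done until it is searchable.
    return "strong"
```

For example, a bulk loader with no visibility requirement and a 30s refresh interval classifies as "eventual", while a workload with a 0.5s SLA against a 1s refresh interval classifies as "strong".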

Step 2 — Apply the storage + index properties for that posture

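Both postures below are janusgraph.properties fragments covering the storage and index sections. The lines that define the posture are the two consistency levels, bulk-refresh, and refresh_interval; everything else is topology. Of these, only bulk-refresh and refresh_interval change when a write becomes searchable; the consistency levels govern the storage side alone.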

Tightened (near-strong) posture — forces rapid segment refresh and strict storage quorums. Higher backend I/O, lower ingestion throughput. Use only for low-volume, read-critical graphs.

properties

# Storage backend (Cassandra / ScyllaDB)
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.write-consistency-level=QUORUM
storage.cql.read-consistency-level=QUORUM

# External index (Elasticsearch / OpenSearch)
index.search.backend=elasticsearch
index.search.hostname=10.0.2.10,10.0.2.11,10.0.2.12
index.search.elasticsearch.client-only=true
index.search.elasticsearch.bulk-refresh=wait_for
index.search.elasticsearch.create.ext.refresh_interval=1s
index.search.elasticsearch.create.ext.number_of_replicas=1
index.search.elasticsearch.create.ext.number_of_shards=5

Eventual (high-throughput) posture — the default for ingestion pipelines. Storage commits locally for minimal latency; index updates batch asynchronously. Read-after-write queries return stale results until the next refresh cycle.

properties

# Storage backend
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.write-consistency-level=LOCAL_ONE
storage.cql.read-consistency-level=LOCAL_ONE

# External index
index.search.backend=elasticsearch
index.search.hostname=10.0.2.10,10.0.2.11,10.0.2.12
index.search.elasticsearch.client-only=true
index.search.elasticsearch.bulk-refresh=false
index.search.elasticsearch.create.ext.refresh_interval=30s
index.search.elasticsearch.create.ext.number_of_replicas=2
index.search.elasticsearch.create.ext.number_of_shards=10

Note that create.ext.* values are applied only at index creation. Changing refresh_interval on an existing index requires the _settings API in Step 5, not a property edit. For the transport, auth, and dispatch wiring both backends share, see Elasticsearch Integration; for OpenSearch-specific node coordination, see OpenSearch Sync Patterns.
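
Before restarting the server pool, it can help to confirm that the file on disk actually expresses the posture you intended. The following is a minimal sketch under stated assumptions: the helper names and the "tightened" / "eventual" / "mixed" labels are hypothetical, the parser handles only simple key=value lines, and an absent bulk-refresh key is treated as "false".

```python
BULK_KEY = "index.search.elasticsearch.bulk-refresh"
WRITE_CL_KEY = "storage.cql.write-consistency-level"


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def describe_posture(props: dict) -> str:
    """Label a configuration using the two fragments in this guide as reference."""
    bulk = props.get(BULK_KEY, "false")  # assumption: unset behaves like "false"
    write_cl = props.get(WRITE_CL_KEY, "")
    if bulk == "wait_for" and write_cl == "QUORUM":
        return "tightened"
    if bulk == "false" and write_cl == "LOCAL_ONE":
        return "eventual"
    # Anything else combines settings from both postures and deserves a second look.
    return "mixed"
```

A "mixed" result is not necessarily wrong, but it means the file matches neither reference fragment and the combination should be a deliberate choice.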

Step 3 — Understand the window you just configured

Under eventual consistency the replication window (the gap between the storage ack and the moment the shard becomes searchable) is the sum of three terms: time spent queued, the bulk request itself, and the wait for the next refresh:

\[ W_{\text{drift}} = t_{\text{queue}} + t_{\text{bulk}} + t_{\text{refresh}} \]

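With refresh_interval=30s, the t_refresh term alone can add up to thirty seconds (about fifteen on average if writes arrive evenly across the refresh cycle), no matter how idle the pipeline is. Setting bulk-refresh=wait_for does not force a refresh: each bulk request blocks until the next scheduled refresh makes its documents searchable. For readers, t_refresh collapses toward zero for the affected writes, but the refresh wait, t_queue, and t_bulk all move from background cost into synchronous commit latency. Expect roughly a 15–40 ms increase per batch under near-synchronous mode versus fire-and-forget, and treat that figure as an estimate to confirm in Step 4, because wait_for can block for up to refresh_interval when no refresh is imminent. This is the number you are trading against the SLA from Step 1.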
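
A worked instance of the window formula makes the trade concrete. This sketch assumes writes arrive evenly across the refresh cycle, so t_refresh ranges from zero (a write landing just before a refresh) to the full refresh_interval; the function name and the sample timings are illustrative, not measurements.

```python
def drift_window_ms(t_queue_ms: float, t_bulk_ms: float, refresh_interval_ms: float):
    """Return (best, expected, worst) replication window in milliseconds.

    best:     the write lands just before a refresh, so t_refresh is about 0
    expected: uniform arrival assumption, so t_refresh is half the interval
    worst:    the write lands just after a refresh, so t_refresh is the full interval
    """
    base = t_queue_ms + t_bulk_ms
    return (base, base + refresh_interval_ms / 2, base + refresh_interval_ms)


# Illustrative numbers: 10 ms queued, 20 ms bulk, refresh_interval=30s
print(drift_window_ms(10, 20, 30_000))  # (30, 15030.0, 30030)
```

With these inputs the queue and bulk terms are noise; the refresh interval dominates the window, which is why it is the first setting to revisit when the SLA is tighter than the eventual posture allows.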

Step 4 — Measure the actual index lag

Do not trust the theoretical floor; measure it against real ingestion rates. Write a timestamped vertex, then immediately read it back through the mixed index and compute the delta.

gremlin

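// 1. Write a test vertex carrying its own commit timestamp, then commit it
g.addV('test_node').property('id', 'diag-001').property('ts', System.currentTimeMillis()).next()
g.tx().commit()

// 2. Poll through the mixed index until the vertex is visible (next() on an empty result throws)
while (!g.V().has('test_node', 'id', 'diag-001').hasNext()) { g.tx().rollback(); Thread.sleep(50) }

// 3. Lag = now minus the timestamp stored on the vertex
System.currentTimeMillis() - g.V().has('test_node', 'id', 'diag-001').values('ts').next()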

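Compute System.currentTimeMillis() - returned_ts once the read succeeds. The read exercises the mixed index only if that index covers the id key; if the lookup is answered from storage instead, the delta says nothing about index lag. A delta exceeding refresh_interval * 2 indicates index-queue backlog or segment-merge contention, not normal latency.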
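
The same write-then-poll measurement can be scripted so that it runs on a schedule. The sketch below assumes nothing about the driver: write and is_visible are hypothetical callables you supply (for example, thin wrappers around your own Gremlin client), and the 2x threshold is the rule of thumb stated above.

```python
import time


def measure_index_lag_ms(write, is_visible, timeout_s=60.0, poll_s=0.05,
                         clock=time.monotonic, sleep=time.sleep):
    """Run write(), then poll is_visible() until it returns True.

    Returns the elapsed time in milliseconds, measured from just after the
    write returns. Raises TimeoutError if the record never becomes visible.
    """
    write()
    start = clock()
    while clock() - start < timeout_s:
        if is_visible():
            return (clock() - start) * 1000.0
        sleep(poll_s)
    raise TimeoutError(f"record not visible after {timeout_s}s")


def classify_lag(delta_ms: float, refresh_interval_ms: float) -> str:
    """Apply the guide's rule of thumb: more than 2x the interval is a backlog."""
    return "backlog" if delta_ms > 2 * refresh_interval_ms else "normal"
```

Resolution is limited by poll_s, so keep it well below the refresh interval you are trying to observe.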

For pipeline code, embed a bounded visibility check so downstream consumers never process a stale graph unknowingly:

python

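from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
import time

def verify_consistency(g, vertex_label, prop_key, prop_value, timeout_sec=10):
    """Poll the index until the vertex is visible or the SLA window expires."""
    # monotonic() is immune to wall-clock adjustments during the wait
    deadline = time.monotonic() + timeout_sec
    last_error = None
    while time.monotonic() < deadline:
        try:
            if g.V().has(vertex_label, prop_key, prop_value).hasNext():
                return True
        except Exception as exc:
            # Treat a failed query as "not yet visible", but keep the cause for the report
            last_error = exc
        time.sleep(0.5)
    raise TimeoutError(
        f"Index sync exceeded {timeout_sec}s SLA for {prop_key}={prop_value}"
    ) from last_error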

Idempotency is mandatory here: a retried non-idempotent write double-commits and surfaces as a phantom index document. Upsert on a stable key, using mergeV() where it is available (TinkerPop 3.6 and later, so JanusGraph 1.0.x) or the fold().coalesce(unfold(), addV(...)) pattern on 0.6.x. Size the driver pool for whichever posture you chose: wait_for holds each connection for the full refresh, so a pool tuned for fire-and-forget will starve. The sizing model lives in Connection Pooling.
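
To see why a fire-and-forget pool starves under wait_for, a back-of-envelope estimate is enough. The sketch below applies Little's law (connections in use = commit rate x hold time); it is not the sizing model from Connection Pooling, and the function name, the headroom factor, and the sample numbers are assumptions for illustration.

```python
import math


def min_pool_size(commits_per_s: float, hold_time_s: float, headroom: float = 1.5) -> int:
    """Estimate the connections needed to sustain a commit rate.

    hold_time_s is how long one commit occupies a connection. Under wait_for
    that includes the wait for the next index refresh.
    """
    in_flight = commits_per_s * hold_time_s  # Little's law: L = lambda * W
    return max(1, math.ceil(in_flight * headroom))


# Illustrative: 40 commits/s held 20 ms each versus held for a 1 s refresh wait
print(min_pool_size(40, 0.02, headroom=1.0))
print(min_pool_size(40, 1.0, headroom=1.5))
```

The commit rate is identical in both calls; only the hold time changes, and the required pool grows in proportion to it.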

Verification

Confirm the posture is behaving as configured before you hand the workload back to traffic.

Confirm the storage side is stable. If storage latency is flat but index lag grows, the bottleneck is the index segment-merge process, not JanusGraph’s transaction manager:

bash

nodetool tablestats janusgraph_prod | grep -E "Write Latency|Read Latency"

Confirm index refresh health. A high refresh.total_time_in_millis relative to refresh.total means disk I/O is saturating the refresh cycle:

bash

curl -s "http://10.0.2.10:9200/janusgraph_mixed_index/_stats/refresh?pretty" \
  | jq '.indices[].total.refresh'
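
The refresh counters are cumulative, so the ratio is most informative when computed between two samples rather than over the lifetime of the shard. A minimal sketch, assuming each sample is the refresh object printed by the jq command above with its total and total_time_in_millis fields; the function name is hypothetical and no alert threshold is implied.

```python
def mean_refresh_ms(before: dict, after: dict) -> float:
    """Average cost of one refresh, in ms, between two stats samples."""
    count = after["total"] - before["total"]
    if count <= 0:
        # No refresh completed in the interval (or the counters were reset).
        return 0.0
    spent = after["total_time_in_millis"] - before["total_time_in_millis"]
    return spent / count


# Illustrative samples taken a minute apart
earlier = {"total": 100, "total_time_in_millis": 2000}
later = {"total": 160, "total_time_in_millis": 5000}
print(mean_refresh_ms(earlier, later))  # 50.0
```

Track this value over time against ingestion rate: a mean that climbs while the refresh count stays flat points at I/O pressure rather than write volume.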

Confirm the queue is not backing up. The rejected counter on the write pool is cumulative since node start, so sample it twice: a count that rises between samples means bulk requests are being rejected right now and divergence is accumulating:

bash

curl -s "http://10.0.2.10:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"
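
Because the rejected column is a running total, comparing two samples is more telling than reading one. The sketch below parses the whitespace-separated output of the _cat call above (with the v header row and the same h columns) and reports per-node increases. The helper names and the node names in the example are hypothetical.

```python
def parse_rejected(cat_output: str) -> dict:
    """Map node_name -> rejected count from _cat/thread_pool output with a header row."""
    rows = [line.split() for line in cat_output.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    name_col = header.index("node_name")
    rejected_col = header.index("rejected")
    return {row[name_col]: int(row[rejected_col]) for row in body}


def newly_rejected(before: dict, after: dict) -> dict:
    """Return the nodes whose rejected count rose between the two samples."""
    return {
        node: count - before.get(node, 0)
        for node, count in after.items()
        if count > before.get(node, 0)
    }


sample_1 = "node_name active queue rejected\nes-1 2 0 10\nes-2 1 0 0\n"
sample_2 = "node_name active queue rejected\nes-1 3 5 14\nes-2 1 0 0\n"
print(newly_rejected(parse_rejected(sample_1), parse_rejected(sample_2)))  # {'es-1': 4}
```

An empty result means no new rejections occurred in the interval, even if the absolute counters are non-zero from an earlier incident.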

Confirm the boundary end-to-end. Re-run the Step 4 Gremlin write/read: under the tightened posture the vertex should be searchable as soon as the commit returns; under the eventual posture the delta should stay at or below roughly refresh_interval and never grow monotonically across repeated runs. A rising delta is backpressure. Align the routing partition via Mixed Index Routing, because a hot shard makes wait_for disproportionately expensive by pinning each blocking commit to the slowest shard.
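
Whether the delta grows monotonically can be checked mechanically once you have a few repeated measurements. A minimal sketch: the function name and the minimum of three runs are assumptions, and strictly increasing values are used as the simplest possible definition of "growing".

```python
def is_growing(deltas_ms, min_runs: int = 3) -> bool:
    """True when every measurement is larger than the one before it."""
    if len(deltas_ms) < min_runs:
        return False  # too few runs to call it a trend
    return all(later > earlier for earlier, later in zip(deltas_ms, deltas_ms[1:]))


print(is_growing([900, 1400, 2300, 4100]))  # True
print(is_growing([900, 1400, 1100, 1300]))  # False
```

Noisy measurements will rarely be strictly increasing, so a True result over several runs is a strong signal; a False result does not by itself rule out slow drift.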

Fallback Procedures

Execute these in order when the boundary degrades or the SLA breaches. Never bypass the index without first confirming storage state.

Fallback 1 — Force an index refresh

If read-after-write queries consistently time out or return stale data, push pending mutations into searchable segments immediately:

bash

curl -X POST "http://10.0.2.10:9200/janusgraph_mixed_index/_refresh"

Warning: frequent forced refreshes degrade indexing throughput. Use only during incident response or a maintenance window. Rollback: none required — a single refresh is non-destructive.

Fallback 2 — Storage-only read bypass

When the index is unavailable or badly desynchronized, route critical reads directly to storage by avoiding has() predicates on indexed properties. Apply a strict limit() to prevent a full scan:

gremlin

// Bypass the mixed index: scan by label, bound the scan, then filter without an index lookup.
// limit() runs before the filter, so this inspects at most 1000 vertices and may miss matches.
g.V().hasLabel('critical_entity').limit(1000)
  .filter(values('status').is('active'))

Constraint: this bypasses index-backed range and text queries. Rollback: remove the bypass and restore has()-based traversals once the index verifies clean in the Verification step.

Fallback 3 — Rebuild the mixed index from storage

If corruption or persistent desynchronization occurs, rebuild the index from the authoritative storage view. This is blocking and needs a maintenance window.

bash

# 1. Suspend automatic refresh during the rebuild
curl -X PUT "http://10.0.2.10:9200/janusgraph_mixed_index/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.refresh_interval": "-1"}'

java

// 2. Reindex through the JanusGraph Management API
JanusGraphManagement mgmt = graph.openManagement();
mgmt.updateIndex(mgmt.getGraphIndex("mixed_index"), SchemaAction.REINDEX).get();
mgmt.commit();

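The .get() call blocks until the reindex job finishes. Before resuming traffic, check mgmt.getGraphIndex("mixed_index").getIndexStatus(key), where key is one of the property keys the index covers, and confirm that it reports ENABLED. Rollback: restore the original refresh_interval (1s or 30s) via the _settings API and re-run Step 4 diagnostics before resuming writes.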
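
The rollback is easy to forget when the reindex itself fails. One way to make the restore unconditional is to wrap the rebuild in a context manager. This is a sketch under stated assumptions: put_settings is a hypothetical callable that sends the given JSON body to the index _settings endpoint, and the payload uses the same flat key as the curl command above.

```python
from contextlib import contextmanager


@contextmanager
def suspended_refresh(put_settings, original_interval: str):
    """Disable automatic refresh for the duration of the block, then restore it.

    put_settings: hypothetical callable taking a dict and PUTting it to
                  /<index>/_settings.
    original_interval: the value to restore afterwards, e.g. "1s" or "30s".
    """
    put_settings({"index.refresh_interval": "-1"})
    try:
        yield
    finally:
        # Runs even if the reindex raises, so refresh is never left disabled.
        put_settings({"index.refresh_interval": original_interval})
```

Usage would look like `with suspended_refresh(my_put, "30s"): run_reindex()`, where my_put and run_reindex are your own functions.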

Operational Guardrails

Never set refresh_interval below 1s in production — sub-second refreshes thrash segment creation and trip index circuit breakers long before they meaningfully shrink the window.
Match storage.cql.write-consistency-level to the keyspace's replication factor, not the node count: with a replication factor of 3, QUORUM needs two replicas and tolerates the loss of one; ALL tolerates none.
Raising the storage consistency level does not shrink the replication window — it only adds write latency. Storage consistency and index visibility are independent concerns.
Log the index-lag delta alongside application traces so you can correlate refresh.total_time_in_millis with ingestion rate and predict a breach before consumers feel it.
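
The fault-tolerance guardrail follows from simple quorum arithmetic on the replication factor. The sketch below covers the common consistency levels; the function names are hypothetical, and for the LOCAL_ levels the replication factor is assumed to be that of the local datacenter.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int, level: str) -> int:
    """How many replicas can be down while the level is still satisfiable."""
    required = {
        "ONE": 1,
        "LOCAL_ONE": 1,
        "QUORUM": quorum(replication_factor),
        "LOCAL_QUORUM": quorum(replication_factor),  # RF of the local DC
        "ALL": replication_factor,
    }[level]
    return replication_factor - required


print(tolerated_replica_failures(3, "QUORUM"))  # 1
print(tolerated_replica_failures(3, "ALL"))     # 0
```

Raising the replication factor to 5 lets QUORUM tolerate two replica failures, at the cost of three acknowledgments per write.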

Frequently Asked Questions

How do I pick between eventual and strong for a specific workload? State the workload’s maximum tolerable delay between commit and index visibility as a number. If it is “don’t care” or larger than your refresh_interval, stay on eventual fire-and-forget. If only a few records need immediacy, keep eventual and poll the index for those. Only if the write is not “done” until it is searchable should you set bulk-refresh=wait_for.

Does setting bulk-refresh=wait_for globally make my graph strongly consistent? No, and it is the most common tuning mistake. Applied globally it serializes throughput behind index refresh and multiplies contention on the search cluster’s write pool. It also does not remove the queue and transport terms of the window — it converts them into synchronous commit latency. Scope it to the specific writes that need it.

Why does index lag keep growing even though storage latency is flat? Flat storage latency with rising lag means the bottleneck is the index, not JanusGraph’s transaction manager: segment-merge contention or a saturated write thread pool. Check _cat/thread_pool/write for a rising rejected count, and use Mixed Index Routing to confirm that a hot shard is not serializing refresh.

What is the fastest safe recovery when reads miss recent writes during an incident? Force a single index refresh (Fallback 1); it is non-destructive and pushes pending mutations into searchable segments. If the index is unavailable, use a bounded storage-only read bypass (Fallback 2). Reserve a full Management API reindex (Fallback 3) for confirmed corruption during a maintenance window.
