Does JanusGraph replicate graph data across datacenters on its own?

No. JanusGraph implements no cross-datacenter replication of its own. It delegates partition tolerance, write propagation, and quorum enforcement entirely to the storage engine (Cassandra or ScyllaDB). Every multi-DC decision is a keyspace replication and consistency-level decision, plus a separate, looser decision about how far the mixed index is allowed to drift behind storage.

Why do multi-DC traversals return stale or missing data after failover even though writes succeeded?

The mixed index in Elasticsearch or OpenSearch replicates asynchronously via a separate mutation log and does not participate in Cassandra's quorum. During a partition, one region's search cluster lags the storage backend, so a vertex that is durable at quorum stays invisible to any has() predicate answered by the mixed index until the index applies the mutation. Reconcile by running a serialized reindex job that treats storage as authoritative, and never trigger concurrent reindex operations across regions.

What is the correct failover sequence when a JanusGraph datacenter goes down?

Drain connections from the failing DC without terminating its storage nodes, override consistency to LOCAL_QUORUM in the surviving region and restart JanusGraph, pause mixed-index writes with batch-loading, run nodetool repair on the surviving seed nodes, and re-enable index sync only once nodetool status shows all nodes UN. Match fallback routing weights to physical rack distribution to avoid write-timeout cascades.

Configuring Multi-Datacenter Replication for Graph Data

This guide walks through configuring topology-aware, multi-datacenter replication for an Apache JanusGraph deployment so that a committed write survives the loss of an entire region without returning stale or missing traversals during failover. It sits under the Replication Strategies reference and extends it to the multi-region case: JanusGraph implements no cross-datacenter replication of its own, so partition tolerance, write propagation, and quorum enforcement are all delegated to the storage engine. The specific failure this procedure prevents is split-brain — writes succeed at quorum in a surviving region while the storage topology, the consistency levels, and the mixed-index routing path drift out of alignment and silently serve inconsistent data.

Prerequisites

Confirm every item below before touching a keyspace. Skipping the version and cluster-state checks is the most common cause of a failed cross-DC cutover.

JanusGraph 0.6+ with the cql storage adapter (the DataStax Java Driver 4.x load-balancing policy requires an explicit local datacenter and will refuse to start without one).
Apache Cassandra 4.x or ScyllaDB 5.x, snitch configured as GossipingPropertyFileSnitch, with cassandra-rackdc.properties populated on every node so each node advertises a correct dc and rack. If you plan to move to a CQL-compatible backend later, align the driver overrides in the ScyllaDB migration guide first.
All datacenters joined and healthy: nodetool status reports every node UN (Up/Normal) in every DC, and cross-DC network ACLs allow the storage port (7000/7001) and CQL port (9042).
Elasticsearch or OpenSearch reachable from each region as a client-only mixed index, so index-node lifecycle stays out of the graph failure domain.
Permissions: a CQL role with CREATE/ALTER on keyspaces, plus shell access to run nodetool on the seed nodes of each DC.
A verified snapshot (nodetool snapshot) of the existing keyspace, so the fallback steps have a known-consistent restore point.

Step-by-Step Procedure

Step 1 — Provision the keyspace with NetworkTopologyStrategy

Multi-DC replication begins at the storage layer. NetworkTopologyStrategy is mandatory: SimpleStrategy walks the ring without regard to rack or datacenter, causing cross-DC read amplification and breaking local-quorum availability the moment a region partitions. Provision the keyspace before initializing JanusGraph. If JanusGraph auto-creates the keyspace from a configuration that lacks the replication settings, it falls back to SimpleStrategy, and adding topology options to the properties afterwards has no effect on the keyspace that already exists.

cql

CREATE KEYSPACE IF NOT EXISTS janusgraph_prod
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-east-1': 3,
  'eu-west-1': 3,
  'ap-southeast-1': 2
} AND DURABLE_WRITES = true;

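The asymmetric ap-southeast-1: 2 is deliberate: a lighter disaster-recovery region receives replicas asynchronously without dragging every LOCAL_QUORUM write in the primaries into a cross-DC negotiation. Keep an odd local replication factor (3) in every write-serving region so LOCAL_QUORUM tolerates a single node loss. Never provision RF=2 in a region that serves LOCAL_QUORUM traffic: its quorum is 2 of 2, so it pays quorum cost while tolerating zero failures. For the same reason, an EACH_QUORUM write (Step 3) succeeds only while both ap-southeast-1 replicas of the affected row are up.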
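The quorum arithmetic behind that rule is small enough to check directly. The sketch below assumes Cassandra's standard quorum size, floor(RF / 2) + 1; the helper names quorum_size and tolerated_failures are illustrative and not part of any driver API.

```python
def quorum_size(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM or LOCAL_QUORUM operation."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum_size(replication_factor)


if __name__ == "__main__":
    # Per-DC replica counts taken from the CREATE KEYSPACE statement above.
    keyspace = {"us-east-1": 3, "eu-west-1": 3, "ap-southeast-1": 2}
    for dc, rf in keyspace.items():
        print(f"{dc}: RF={rf} quorum={quorum_size(rf)} "
              f"tolerates {tolerated_failures(rf)} replica(s) down")
```

With RF=3 the quorum is 2 and one replica can be lost; with RF=2 the quorum is also 2 and none can.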

Step 2 — Map the topology into janusgraph.properties

The replication-strategy-options value must list the same datacenter names and replica counts as the keyspace DDL. The two use different syntax, so compare them pair by pair rather than textually. A mismatch (ap-southeast-1: 2 in the DDL, ap-southeast-1,3 in the properties) is silent: Cassandra keeps honoring the DDL, JanusGraph’s configuration describes a topology that does not exist, and nothing flags the divergence until a node dies.

properties

storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.local-datacenter=us-east-1
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.cql.write-consistency-level=LOCAL_QUORUM
storage.cql.replication-strategy-class=NetworkTopologyStrategy
storage.cql.replication-strategy-options=us-east-1,3,eu-west-1,3,ap-southeast-1,2
storage.cql.only-use-local-consistency-for-system-operations=true

Pin local-datacenter to the region each JanusGraph instance actually runs in — this value differs per deployment and is what keeps reads and writes local. Setting only-use-local-consistency-for-system-operations=true keeps ID-block allocation and schema locks on LOCAL_QUORUM instead of escalating to a global QUORUM that stalls every schema mutation on cross-DC round trips. For the full keyspace provisioning walkthrough, see Cassandra Backend Setup. Keep LOCAL_QUORUM as the steady-state read and write default and escalate to EACH_QUORUM only for the bulk-load window in Step 3.
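One way to catch a DDL/properties mismatch before a deploy is to parse the options string and compare it with the replica counts from the DDL. This is a minimal sketch, assuming the comma-separated name,count format shown above; parse_strategy_options and diff_replication are hypothetical helper names, not JanusGraph APIs.

```python
def parse_strategy_options(raw: str) -> dict:
    """Turn 'dc1,3,dc2,2' into {'dc1': 3, 'dc2': 2}."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if len(parts) % 2 != 0:
        raise ValueError(f"expected name,count pairs, got {len(parts)} fields")
    return {parts[i]: int(parts[i + 1]) for i in range(0, len(parts), 2)}


def diff_replication(expected: dict, actual: dict) -> dict:
    """Return {dc: (expected, actual)} for every datacenter that disagrees."""
    return {
        dc: (expected.get(dc), actual.get(dc))
        for dc in sorted(set(expected) | set(actual))
        if expected.get(dc) != actual.get(dc)
    }


if __name__ == "__main__":
    # Replica counts copied by hand from the CREATE KEYSPACE statement.
    ddl = {"us-east-1": 3, "eu-west-1": 3, "ap-southeast-1": 2}
    # A deliberately wrong properties value, to show what a mismatch looks like.
    props = parse_strategy_options("us-east-1,3,eu-west-1,3,ap-southeast-1,3")
    print(diff_replication(ddl, props))  # {'ap-southeast-1': (2, 3)}
```

An empty result means the two sources agree; any entry names the datacenter and the two conflicting counts.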

Step 3 — Use EACH_QUORUM only for bulk load and schema migration

For an initial bulk import or a schema migration, you want synchronous cross-DC durability so a region loss mid-load cannot leave replicas unacknowledged. Override the write path for the duration of the load only, then revert.

properties

# TEMPORARY — bulk-load / migration window only
storage.cql.write-consistency-level=EACH_QUORUM
storage.batch-loading=true

EACH_QUORUM demands a quorum in every datacenter before a write returns, which is correct for a one-time load but a permanent latency tax in steady state. Set write-consistency-level back to LOCAL_QUORUM and remove the storage.batch-loading line before returning the Cassandra cluster to production traffic. The trade-off between synchronous cross-DC acknowledgment and local-quorum latency is examined in depth under eventual vs strong consistency.
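Because the revert is easy to forget, a pre-deploy check can scan janusgraph.properties for leftovers from the load window. The sketch below is one assumed way to wire such a check: find_bulk_load_leftovers is a hypothetical helper, and the parser handles only simple key=value lines, not the full Java properties syntax.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_bulk_load_leftovers(text: str) -> list:
    """Return a message for each bulk-load override still present."""
    props = parse_properties(text)
    problems = []
    if props.get("storage.cql.write-consistency-level", "").upper() == "EACH_QUORUM":
        problems.append("storage.cql.write-consistency-level is still EACH_QUORUM")
    if props.get("storage.batch-loading", "false").lower() == "true":
        problems.append("storage.batch-loading is still true")
    return problems


if __name__ == "__main__":
    load_window = (
        "# TEMPORARY\n"
        "storage.cql.write-consistency-level=EACH_QUORUM\n"
        "storage.batch-loading=true\n"
    )
    for problem in find_bulk_load_leftovers(load_window):
        print("BLOCK DEPLOY:", problem)
```

Failing the pipeline when the returned list is non-empty keeps the temporary settings from reaching steady-state traffic.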

Step 4 — Reconcile the mixed index against storage

Composite indexes replicate synchronously with vertex and edge mutations inside the storage layer. Mixed indexes (Elasticsearch/OpenSearch) replicate asynchronously via a separate mutation log, so in a multi-DC deployment a network partition or a regional search-node failure opens a window where one region’s search cluster lags the storage backend. Run a scheduled job that treats the CQL backend as authoritative, queries the index state, and forces a reindex only when drift is confirmed — never speculatively.

python

import os
import logging
from gremlin_python.driver.client import Client

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


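def check_index_status(gremlin_endpoint: str, index_name: str) -> str:
    # JanusGraph management scripts run server-side; submit them as Groovy
    # scripts via a Client rather than through a traversal source.
    client = Client(gremlin_endpoint, "g")
    try:
        # A graph index reports its status per property key, so collect the
        # status of every indexed key and return the first one that is not
        # ENABLED. indexName is passed as a binding instead of being
        # interpolated into the script text.
        status_script = """
            mgmt = graph.openManagement()
            idx = mgmt.getGraphIndex(indexName)
            statuses = idx.getFieldKeys().collect { idx.getIndexStatus(it).toString() }
            mgmt.rollback()
            statuses.find { it != 'ENABLED' } ?: 'ENABLED'
        """
        result = client.submit(status_script, bindings={"indexName": index_name}).all().result()
        return result[0] if result else "UNKNOWN"
    except Exception as e:
        logger.error(f"Index status query failed: {e}")
        return "UNKNOWN"
    finally:
        client.close()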


def trigger_reindex(gremlin_endpoint: str, index_name: str):
    client = Client(gremlin_endpoint, "g")
    try:
        # indexName is supplied as a binding rather than interpolated into the
        # script. The .get() call blocks until the reindex job finishes, so the
        # server-side evaluation timeout must be long enough for the job.
        reindex_script = """
            mgmt = graph.openManagement()
            idx = mgmt.getGraphIndex(indexName)
            mgmt.updateIndex(idx, SchemaAction.REINDEX).get()
            mgmt.commit()
            'REINDEX_TRIGGERED'
        """
        result = client.submit(reindex_script, bindings={"indexName": index_name}).all().result()
        return result[0] if result else None
    except Exception as e:
        logger.error(f"Reindex trigger failed: {e}")
        return None
    finally:
        client.close()


if __name__ == "__main__":
    endpoint = os.getenv("GREMLIN_SERVER_URL", "ws://localhost:8182/gremlin")
    target_index = os.getenv("JANUSGRAPH_INDEX_NAME", "searchIndex")

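    status = check_index_status(endpoint, target_index)
    if status == "REGISTERED":
        logger.warning(f"Index '{target_index}' is in {status} state. Forcing reindex.")
        trigger_reindex(endpoint, target_index)
    elif status == "INSTALLED":
        # REINDEX does not apply to an INSTALLED index; it has to reach
        # REGISTERED (SchemaAction.REGISTER_INDEX) before it can be reindexed.
        logger.warning(f"Index '{target_index}' is INSTALLED. Register it before reindexing.")
    else:
        logger.info(f"Index '{target_index}' status: {status}. No action required.")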

Schedule this as a Kubernetes CronJob at a 60-second interval and serialize reindex triggers across regions — concurrent REINDEX operations fanning out to the same index compound the storage write contention that widened the drift window in the first place. The polling cadence and OpenSearch-specific refresh tuning are covered in OpenSearch Sync Patterns.
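Index status alone does not prove that the index has fallen behind storage, so a reconciliation job needs some rule for when a mismatch counts as confirmed. The following sketch is one hypothetical rule, not JanusGraph behavior: it assumes you can obtain an element count from storage and a matching document count from the index (how those are collected is left out), and it reports drift only after several consecutive polls disagree, so writes still in flight do not trigger a reindex.

```python
from collections import deque


class DriftDetector:
    """Confirm drift only after `required` consecutive mismatching polls.

    Hypothetical helper for illustration; the counts are supplied by the caller.
    """

    def __init__(self, required: int = 3, tolerance: int = 0):
        if required < 1:
            raise ValueError("required must be at least 1")
        self.required = required
        self.tolerance = tolerance
        # Sliding window holding the outcome of the most recent polls.
        self._recent = deque(maxlen=required)

    def observe(self, storage_count: int, index_count: int) -> bool:
        """Record one poll; return True once drift is confirmed."""
        mismatch = abs(storage_count - index_count) > self.tolerance
        self._recent.append(mismatch)
        return len(self._recent) == self.required and all(self._recent)


if __name__ == "__main__":
    detector = DriftDetector(required=3)
    polls = [(100, 90), (100, 90), (100, 90)]
    for storage_count, index_count in polls:
        confirmed = detector.observe(storage_count, index_count)
        print(storage_count, index_count, "confirmed" if confirmed else "waiting")
```

At a 60-second polling interval, required=3 means a region must disagree with storage for roughly three minutes before a reindex is considered; a single matching poll resets the decision.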

Step 5 — Define the failover routing sequence

Regional outages require deterministic traffic shifting and a temporary consistency override. When a primary datacenter becomes unreachable, the application layer routes traffic to a surviving region while the storage layer repairs in the background. Execute in this exact order:

Isolate the failing DC. Update the load balancer or service mesh to drain connections from the affected region. Do not terminate storage nodes; allow them to recover asynchronously.
Override consistency levels. Point storage.cql.local-datacenter (and storage.hostname) at the surviving DC, confirm that both read-consistency-level and write-consistency-level are LOCAL_QUORUM with no EACH_QUORUM override left over from Step 3, and restart the JanusGraph service to apply. This keeps every quorum inside the surviving region instead of waiting on replicas in the failed one.
Pause mixed-index writes. Set storage.batch-loading=true in the surviving region until storage replication catches up, preventing orphaned index mutations.
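Initiate storage repair. Run nodetool repair --in-local-dc janusgraph_prod (or the scylla-nodetool equivalent) on the surviving DC’s seed nodes; a repair that includes replicas in the unreachable DC fails while those nodes are down. Track progress with nodetool compactionstats, and run a full cross-DC repair once the failed region rejoins.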
Re-enable index sync. Once repair completes and nodetool status shows all nodes UN, revert batch-loading and re-enable the mixed-index client.

Match your fallback weights to physical rack distribution — mismatched routing triggers write timeouts that cascade into application-level 5xx errors. Pool sizing for the quorum fan-out these overrides demand is covered in Connection Pooling.
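The gate in the last step (every node UN) can be checked mechanically rather than by eye. The sketch below parses text in the usual nodetool status layout, where each node row starts with a two-letter status/state code; nodes_not_up_normal is a hypothetical helper, and the sample rows in the usage example are made up for illustration.

```python
import re

# A node row starts with a status letter (U/D) and a state letter (N/L/J/M),
# followed by the node address.
_NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def nodes_not_up_normal(status_output: str) -> dict:
    """Map datacenter name -> [(address, code)] for every node that is not UN."""
    unhealthy = {}
    current_dc = "unknown"
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("Datacenter:"):
            current_dc = line.split(":", 1)[1].strip()
            continue
        match = _NODE_ROW.match(line)
        if match and match.group(1) != "UN":
            unhealthy.setdefault(current_dc, []).append((match.group(2), match.group(1)))
    return unhealthy


if __name__ == "__main__":
    # Illustrative output only; real rows carry more columns.
    sample = """Datacenter: us-east-1
=====================
--  Address    Load     Tokens  Rack
UN  10.0.1.10  1.2 GiB  16      rack1
DN  10.0.1.11  1.1 GiB  16      rack2
"""
    problems = nodes_not_up_normal(sample)
    print(problems or "all nodes UN, safe to re-enable index sync")
```

An empty dictionary is the condition for re-enabling index sync; anything else lists the nodes still down, joining, leaving, or moving, grouped by datacenter.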

Verification Commands

Run these after any topology change, failover, or index event. All commands assume cqlsh and nodetool are available on the storage nodes.

Verify keyspace replication factor. Output must show NetworkTopologyStrategy with per-DC replica counts matching replication-strategy-options exactly.

bash

cqlsh -e "DESCRIBE KEYSPACE janusgraph_prod;"
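To compare that output against the expected topology without reading it by eye, the replication map can be pulled out of the CREATE KEYSPACE statement that DESCRIBE prints. This is a sketch under the assumption that the output contains a replication = {...} map with single-quoted keys; the exact formatting varies between cqlsh versions, and per_dc_replicas is a hypothetical helper name.

```python
import re


def per_dc_replicas(describe_output: str) -> dict:
    """Return {datacenter: replica_count} from a CREATE KEYSPACE statement."""
    match = re.search(r"replication\s*=\s*\{([^}]*)\}", describe_output, re.IGNORECASE)
    if not match:
        raise ValueError("no replication map found in DESCRIBE output")
    # Values may be quoted ('3') or bare (3) depending on who wrote the DDL.
    options = dict(re.findall(r"'([^']+)'\s*:\s*'?([^',}\s]+)'?", match.group(1)))
    strategy = options.pop("class", "")
    if not strategy.endswith("NetworkTopologyStrategy"):
        raise ValueError(f"keyspace uses {strategy or 'an unknown strategy'}")
    return {dc: int(count) for dc, count in options.items()}


if __name__ == "__main__":
    # Example text in the shape cqlsh typically prints; not captured from a live cluster.
    output = (
        "CREATE KEYSPACE janusgraph_prod WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'ap-southeast-1': '2', "
        "'eu-west-1': '3', 'us-east-1': '3'}  AND durable_writes = true;"
    )
    expected = {"us-east-1": 3, "eu-west-1": 3, "ap-southeast-1": 2}
    print("match" if per_dc_replicas(output) == expected else "MISMATCH")
```

The function raises if the keyspace is not on NetworkTopologyStrategy, which turns an accidental SimpleStrategy keyspace into a hard failure instead of a silent pass.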

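Validate cross-DC write propagation. Insert a probe row at EACH_QUORUM on one DC’s seed node, then read it back at LOCAL_QUORUM from another DC’s seed node so the read is answered by that region’s own replicas; the row must return in under 200 ms. Delete the probe afterwards so it does not linger in edgestore.

bash

# On the us-east-1 seed node
cqlsh -e "CONSISTENCY EACH_QUORUM; INSERT INTO janusgraph_prod.edgestore (key, column1, value) VALUES (0x00000000000000000000000000000001, 0x01, 0x01);"
# On the eu-west-1 seed node: LOCAL_QUORUM shows the local replicas hold the row
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM janusgraph_prod.edgestore WHERE key = 0x00000000000000000000000000000001;"
# Clean up the probe row
cqlsh -e "CONSISTENCY EACH_QUORUM; DELETE FROM janusgraph_prod.edgestore WHERE key = 0x00000000000000000000000000000001;"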

Confirm index synchronization state. Run the Step 4 script; a healthy index logs status: ENABLED. No action required.

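Check SSTable integrity after any partition. nodetool verify checks on-disk checksums on the local node only, so run it on every node that stayed up; divergence between replicas is reconciled by nodetool repair, not detected here.

bash

nodetool verify janusgraph_prod

A clean run exits 0 and reports no corrupted SSTables.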

Fallback Procedures

Each step has a defined rollback. Do not improvise a recovery on a live multi-DC cluster.

Step 1/2 — replication mismatch detected. If DESCRIBE KEYSPACE disagrees with replication-strategy-options, do not edit the live keyspace under traffic. Drain the affected region, ALTER KEYSPACE to the correct per-DC counts, run nodetool repair janusgraph_prod on every DC before re-enabling traffic, then re-run the verification.
Step 3 — bulk load aborts mid-window. Revert to LOCAL_QUORUM immediately, run nodetool repair to reconcile any partially acknowledged EACH_QUORUM writes, and resume the load from the last committed batch checkpoint rather than replaying from zero.
Step 4 — reindex stalls or errors. A REINDEX left in INSTALLED/REGISTERED blocks query resolution. Cancel it via the Management API, restore index mappings from the last snapshot, and re-run a single serialized reindex during a maintenance window. Changing a property key’s index binding mid-flight is a schema evolution concern that belongs in a CI gate, not a live reconciliation job.
Step 5 — failover makes consistency worse. If overriding consistency surfaces WriteTimeoutException cluster-wide, the surviving region cannot meet quorum alone. Restore the drained region if reachable; otherwise fail back to the last nodetool snapshot and reconcile manually. If nodetool verify reports errors, run nodetool scrub janusgraph_prod on the affected nodes before re-enabling application writes.

Any deviation from a verification pass criterion requires rollback to the last known-consistent snapshot and manual reconciliation of the storage layer — never leave a partially converged topology serving production traffic.

Up a level: Replication Strategies — the consistency and topology model this multi-region procedure specializes.
Cassandra Backend Setup — keyspace provisioning and the CQL DDL your replica counts must match.
ScyllaDB Migration — driver overrides that preserve these write semantics on a CQL-compatible backend.
Connection Pooling — sizing the CQL pool for the quorum fan-out multi-DC writes demand.
OpenSearch Sync Patterns — reconciling index drift after a regional partition.