Why does my vertex commit but never appear in Elasticsearch search results?

The graph transaction commits to CQL storage first and the index mutation is dispatched asynchronously. If the bulk queue overflows, the write thread pool rejects the request, or the JVM restarts before dispatch, the document is silently dropped. Check /_cat/thread_pool/write for non-zero rejected counts and REINDEX the affected index.

Can I change the shard count after the mixed index is created?

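Not through JanusGraph. The create.ext.number_of_shards and create.ext.number_of_replicas settings are read only when the index is created and are inert afterward, so set them correctly in janusgraph.properties before the first mutation. Elasticsearch itself allows number_of_replicas to be changed on a live index through the index settings API, but the primary shard count is fixed at creation, so changing it means dropping and rebuilding the mixed index during a maintenance window.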

What if the backfill script crashes halfway through?

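The pipeline pages by a monotonically increasing sync_cursor and advances last_cursor only after a batch completes, so a failed batch is retried from the same position with no full rescan. As written, last_cursor lives in memory and starts at 0, so resuming after a process crash requires persisting the cursor after each batch and seeding it on startup.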

Should I keep bulk-refresh=false during normal operation?

Yes for throughput-critical pipelines. Leave bulk-refresh=false and rely on the refresh_interval, applying an explicit wait_for only to the specific writes that require read-after-write visibility. Forcing a refresh on every write serializes throughput behind index refresh.

Syncing JanusGraph with Elasticsearch Step by Step

This guide walks through wiring Apache JanusGraph to an Elasticsearch mixed index, backfilling existing vertices, and proving parity — the exact sequence that prevents the silent index drift where a graph commit succeeds but the corresponding full-text document never becomes searchable. It is the operational how-to under the Elasticsearch Integration reference; if you have not yet decided on refresh semantics or acknowledgment boundaries, settle those against your workload in Eventual vs Strong Consistency first, because they change several values below. JanusGraph decouples graph persistence from full-text indexing, so syncing the two systems is not a single toggle but a controlled workflow built on explicit transaction boundaries, deterministic flush intervals, and continuous health validation.

Prerequisites

Confirm every item before touching janusgraph.properties. Skipping the version and health checks is the most common cause of a backfill that appears to succeed but leaves the index short.

JanusGraph 0.6.x or 1.0.x running against a CQL storage backend (Cassandra 3.11+/4.x or ScyllaDB). If you are still standing storage up, follow Cassandra backend setup first.
Elasticsearch 7.17+ or 8.x, cluster status green, with the REST client reachable from every JanusGraph node on port 9200.
gremlinpython matching your server’s TinkerPop line (3.5.x for JG 0.6, 3.7.x for JG 1.0). A version mismatch can surface as serialization errors during backfill that are easy to overlook.
Write access to janusgraph.properties on every node and the ability to restart the Gremlin Server pool during a maintenance window.
Aligned storage topology. Align your replication strategies before bulk ingestion — a backfill against an under-replicated keyspace amplifies coordinator load exactly when the index dispatch queue is hottest.
A known-good driver pool. Size it per the connection pooling model so thread starvation during the backfill does not get misdiagnosed as index lag.

Step 1 — Wire the storage and index backends

Isolate storage and index parameters in janusgraph.properties. JanusGraph routes mutations asynchronously to the external index unless explicitly overridden. This block is a stable production baseline:

properties

# Graph Storage Layer (ScyllaDB/Cassandra)
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.cql.write-consistency-level=LOCAL_QUORUM

# Elasticsearch Index Backend
index.search.backend=elasticsearch
index.search.hostname=10.0.2.20
index.search.elasticsearch.client-only=true
index.search.elasticsearch.bulk-size=500
index.search.elasticsearch.bulk-refresh=false
index.search.elasticsearch.create.ext.number_of_shards=3
index.search.elasticsearch.create.ext.number_of_replicas=1
index.search.elasticsearch.create.ext.refresh_interval=3s

bulk-size controls how many index mutations are batched before flushing to Elasticsearch. Values below 200 increase network overhead; values above 2000 risk bulk request rejections from payload-size limits. Keep bulk-refresh=false for throughput-critical pipelines and trigger explicit refreshes only when read-after-write is required. The create.ext.* settings apply only at index creation and JanusGraph never reapplies them, so set shards and replicas correctly now: changing the shard count later means a rebuild. Shard alignment against your query predicates is covered in mixed-index routing.
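
The bulk-size bounds above are easy to lint before a restart. A minimal sketch, assuming the properties file uses the index.search.elasticsearch. prefix shown in the baseline; the check_index_settings helper is hypothetical and is not part of JanusGraph:

```python
def check_index_settings(properties_text, prefix="index.search.elasticsearch."):
    """Return warnings for the bulk settings discussed in this step."""
    settings = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    warnings = []
    bulk_size = settings.get(prefix + "bulk-size")
    if bulk_size is not None:
        size = int(bulk_size)
        if size < 200:
            warnings.append(f"bulk-size={size} is below 200: expect extra network overhead")
        elif size > 2000:
            warnings.append(f"bulk-size={size} is above 2000: expect payload-size rejections")
    if settings.get(prefix + "bulk-refresh", "false") != "false":
        warnings.append("bulk-refresh is not false: writes force or wait on a refresh")
    return warnings


# Hypothetical fragment with an oversized batch.
sample = """
index.search.elasticsearch.bulk-size=5000
index.search.elasticsearch.bulk-refresh=false
"""
print(check_index_settings(sample))
```

The thresholds are the ones quoted in this guide, not limits enforced by JanusGraph or Elasticsearch, so adjust them to your own payload sizes.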

Verify: restart one node and confirm it registers the index backend without falling back to storage scans.

bash

grep -E "Configured index .* backend|elasticsearch" /var/log/janusgraph/server.log | tail -5

Step 2 — Define the mixed index

Declare the mixed index explicitly in the Gremlin console, then block until the schema propagates across the search cluster before ingesting any data:

groovy

mgmt = graph.openManagement()
nameKey = mgmt.makePropertyKey("entity_name").dataType(String.class).cardinality(Cardinality.SINGLE).make()
mgmt.buildIndex("searchByEntity", Vertex.class).addKey(nameKey).buildMixedIndex("search")
mgmt.commit()

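// Block until the index is usable. awaitGraphIndexStatus is a static method on
// ManagementSystem. A key created in the same management transaction as its index
// goes straight to ENABLED, so accept either status.
ManagementSystem.awaitGraphIndexStatus(graph, 'searchByEntity').status(SchemaStatus.REGISTERED, SchemaStatus.ENABLED).call()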

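Verify: the index must report REGISTERED, or ENABLED when the key was created in the same management transaction as the index, before you proceed. Print the index state and check it directly:

groovy

mgmt = graph.openManagement()
mgmt.printIndexes()
// Look the key up in this transaction instead of reusing the handle from the committed one
mgmt.getGraphIndex("searchByEntity").getIndexStatus(mgmt.getPropertyKey("entity_name"))
mgmt.rollback()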

If the status remains INSTALLED beyond 60 seconds, cluster communication or schema propagation is stalled — check the JanusGraph system logs and inter-node connectivity before continuing.

Step 3 — Run the deterministic backfill pipeline

Relying on JanusGraph’s internal async flush for a large migration introduces unpredictable lag and memory pressure. Run a controlled pipeline that pages by a monotonic cursor, throttles between batches, and applies exponential backoff on failure. This script uses gremlinpython over a sessionless connection, so each submitted traversal commits or rolls back on its own:

python

import time
import logging
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import T, P, Order

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def bulk_sync_vertices(ws_endpoint, batch_size=1000, throttle_ms=200):
    conn = DriverRemoteConnection(ws_endpoint, 'g')
    g = traversal().withRemote(conn)

    last_cursor = 0
    processed = 0
    max_retries = 3
    failures = 0

    while True:
        try:
            # Page the next batch using a deterministic, monotonic cursor.
            # P.gt is the comparison predicate; Order.asc the sort direction.
            # project() materializes the id + cursor so we don't rely on
            # reference-vertex property access.
            batch = (
                g.V().has('sync_cursor', P.gt(last_cursor))
                .order().by('sync_cursor', Order.asc)
                .limit(batch_size)
                .project('id', 'cursor').by(T.id).by('sync_cursor')
                .toList()
            )

            if not batch:
                logging.info("Sync complete. Total vertices processed: %d", processed)
                break

            # Advance the cursor to the highest value processed in this batch
            last_cursor = max(row['cursor'] for row in batch)
            processed += len(batch)
            failures = 0  # reset backoff after a successful batch

            logging.info("Batch processed. Cursor: %d | Total: %d", last_cursor, processed)
            time.sleep(throttle_ms / 1000.0)

        except Exception as e:
            failures += 1
            logging.error("Batch failed (%d/%d): %s", failures, max_retries, str(e))
            if failures >= max_retries:
                logging.critical("Max retries exceeded. Halting pipeline.")
                conn.close()
                raise
            time.sleep(2 ** failures)  # exponential backoff before retrying
    conn.close()

# Usage: bulk_sync_vertices("ws://gremlin-server:8182/gremlin")

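The pipeline avoids full graph scans by paging on a monotonically increasing sync_cursor property (which requires sync_cursor to be covered by an index that supports range predicates), so a failed batch is retried from the last cursor instead of from the start. Note that last_cursor is held in memory and initialized to 0: to resume after a process crash, persist it after each batch and seed it on startup. Ensure your ingestion layer populates sync_cursor on every vertex create and update.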
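
Resuming across a process restart depends on the cursor surviving the crash. A minimal sketch of one way to persist it, assuming a local JSON checkpoint file; the file path and the load_cursor and save_cursor helpers are hypothetical and are not part of the script above:

```python
import json
import os
import tempfile


def load_cursor(path, default=0):
    """Return the last saved cursor, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)["last_cursor"]
    except FileNotFoundError:
        return default


def save_cursor(path, cursor):
    """Atomically replace the checkpoint so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"last_cursor": cursor}, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the backfill loop this would mean seeding last_cursor = load_cursor(path) before the while loop and calling save_cursor(path, last_cursor) right after the cursor advances. Writing to a temporary file and renaming it keeps the previous checkpoint intact if the process dies mid-write.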

Verify: the final log line reports the total processed count. Sanity-check it against expected cardinality before moving on.
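
The retry behaviour of the script is easier to reason about once the sleep schedule is written out. A small sketch that mirrors the loop's arithmetic (sleep 2 ** failures seconds, halt once failures reaches max_retries); the backoff_delays helper is hypothetical and exists only to illustrate the schedule:

```python
def backoff_delays(max_retries=3, base=2):
    """Sleeps, in seconds, taken before the loop gives up on a failing batch.

    The loop increments `failures`, halts if it has reached `max_retries`,
    and otherwise sleeps base ** failures, so the final failure never sleeps.
    """
    return [base ** failures for failures in range(1, max_retries)]


print(backoff_delays())   # [2, 4]
print(backoff_delays(5))  # [2, 4, 8, 16]
```

With the default max_retries=3, a batch that keeps failing waits 6 seconds in total before the pipeline halts. Because the script resets failures to 0 after every successful batch, the schedule starts over for each new failing batch.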

Step 4 — Verify parity between storage and index

Validate synchronization by comparing the storage backend against the index directly, then inspecting the write queue.

Compare storage vs. index counts.

groovy

// Graph storage count
g.V().has('entity_name', P.neq('')).count().next()

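// Elasticsearch index count, queried directly against the mixed index.
// An empty textContains('') term is not a usable predicate, so count with a
// field wildcard through the index query API instead.
graph.indexQuery("searchByEntity", "v.entity_name:*").vertexTotals()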

A delta greater than 0.1% indicates pending queue items or failed flushes.
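
The 0.1% threshold is a relative gap, so it helps to compute it the same way every time. A minimal sketch, assuming the delta is measured against the storage count; the parity_delta and has_drift helpers are hypothetical names:

```python
def parity_delta(storage_count, index_count):
    """Relative gap between storage and index counts, as a fraction of storage."""
    if storage_count == 0:
        return 0.0 if index_count == 0 else float("inf")
    return abs(storage_count - index_count) / storage_count


def has_drift(storage_count, index_count, threshold=0.001):
    """True when the gap exceeds the threshold (0.001 is 0.1%)."""
    return parity_delta(storage_count, index_count) > threshold


print(has_drift(1_000_000, 999_500))  # 500 missing of 1,000,000 is 0.05%
print(has_drift(1_000_000, 998_000))  # 2,000 missing of 1,000,000 is 0.2%
```

The first call prints False and the second prints True. Taking the absolute difference also flags an index that holds more documents than storage, which usually points at stale documents left behind by deletes.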

Inspect index queue depth.

bash

# Check Elasticsearch write thread pool for queued/rejected bulk requests
curl -s "http://janusgraph-es:9200/_cat/thread_pool/write?v&h=node_name,queue,rejected"

Non-zero rejected values mean Elasticsearch is dropping bulk requests, which causes permanent drift unless the pipeline retries. If you see rejections, the pattern is the same one dissected in Resolving OpenSearch Index Drift in Production — the diagnosis transfers directly to Elasticsearch.
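
For alerting, the cat output can be reduced to the nodes that actually rejected work. A minimal sketch that parses the text returned by the curl command above, assuming the header row produced by the v flag is present; the nodes_with_rejections helper and the sample node names are hypothetical:

```python
def nodes_with_rejections(cat_output):
    """Map node_name -> rejected count for every node with rejected > 0."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()  # e.g. ['node_name', 'queue', 'rejected']
    rejected = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        count = int(row["rejected"])
        if count > 0:
            rejected[row["node_name"]] = count
    return rejected


# Hypothetical output of _cat/thread_pool/write?v&h=node_name,queue,rejected
sample = """\
node_name queue rejected
es-node-1     0        0
es-node-2    12       37
"""
print(nodes_with_rejections(sample))
```

The rejected column is cumulative since the node started, so an alert is more useful when it compares two samples taken some time apart rather than firing on any non-zero value.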

Force a targeted reindex on stale segments. If specific vertices fail to appear despite successful commits:

groovy

mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("searchByEntity"), SchemaAction.REINDEX).get()
mgmt.commit()

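The management transaction is closed once it commits, so check the result from a fresh one: open mgmt = graph.openManagement(), read mgmt.getGraphIndex("searchByEntity").getIndexStatus(mgmt.getPropertyKey("entity_name")), then call mgmt.rollback(). Never run concurrent REINDEX operations on the same index.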

Fallback and rollback procedures

Each step has a defined recovery path. Do not skip validation between recovery actions.

If Step 1 fails (node falls back to storage scans). The index backend never attached — usually a wrong hostname or an Elasticsearch node that is not green. Confirm health before anything else:

bash

curl -s "http://10.0.2.20:9200/_cluster/health?pretty"

Ensure status is green and number_of_pending_tasks is 0. If red, resolve disk watermarks or shard-allocation failures first, then restart the JanusGraph node.
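
The two conditions above can be turned into a gate that runs before a node restart. A minimal sketch that evaluates the JSON body returned by _cluster/health, assuming only its status and number_of_pending_tasks fields; the health_problems helper and the sample payloads are hypothetical:

```python
import json


def health_problems(health_json):
    """Return a list of reasons the cluster is not ready; empty means proceed."""
    health = json.loads(health_json)
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"status is {status}, expected green")
    pending = health.get("number_of_pending_tasks", 0)
    if pending != 0:
        problems.append(f"{pending} pending cluster tasks")
    return problems


# Hypothetical, trimmed response bodies.
ready = '{"status": "green", "number_of_pending_tasks": 0}'
degraded = '{"status": "yellow", "number_of_pending_tasks": 4}'
print(health_problems(ready))
print(health_problems(degraded))
```

Returning the reasons rather than a bare boolean keeps the output usable in a runbook entry or an alert message.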

If Step 2 fails (index stuck at INSTALLED). Roll the management transaction back rather than leaving it open, then retry the build after confirming connectivity:

groovy

mgmt.rollback()

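If Step 3 fails mid-backfill. The cursor design makes recovery cheap: last_cursor only advances after a batch completes, so the in-process retry picks up at the same position. If the process itself died, restart the script with last_cursor seeded from the last value it logged, or from a persisted checkpoint, rather than from 0, which avoids a full rescan.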

If Step 4 shows persistent drift (full recovery sequence). When synchronization fails catastrophically — index corruption, persistent queue backlog, or schema mismatch — run this sequence in order:

Halt ingestion. Stop all application writers and the backfill pipeline.

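Disable refresh to remove overhead during reconstruction. Substitute the Elasticsearch index name your deployment actually uses, which _cat/indices lists, for janusgraph_mixed: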

bash

curl -X PUT "http://10.0.2.20:9200/janusgraph_mixed/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.refresh_interval": "-1"}'

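Disable the index, drop it with the removal action your JanusGraph version supports, then re-run the Step 2 definition. The block below covers the disable step:

groovy

mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("searchByEntity"), SchemaAction.DISABLE_INDEX).get()
mgmt.commit()
// awaitGraphIndexStatus is a static method on ManagementSystem
ManagementSystem.awaitGraphIndexStatus(graph, 'searchByEntity').status(SchemaStatus.DISABLED).call()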

Re-run the backfill from Step 3. Watch document growth with curl -s "http://10.0.2.20:9200/_cat/indices?v".

Re-enable refresh and confirm live mutations propagate within one window:

bash

curl -X PUT "http://10.0.2.20:9200/janusgraph_mixed/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.refresh_interval": "3s"}'

Maintain a runbook recording the last successful sync timestamp and queue depth, and automate drift alerts from the metrics JanusGraph emits once metrics.enabled=true is set and a reporter that Prometheus can scrape (for example JMX through an exporter) is configured.

Up a level: Elasticsearch Integration — the parent reference for the JanusGraph-to-Elasticsearch boundary this procedure implements.
Resolving OpenSearch Index Drift in Production — the reconciliation workflow when parity checks fail; the diagnosis transfers to Elasticsearch.
Configuring Mixed-Index Fallback Chains — shard alignment and predicate routing so backfilled documents land on balanced shards.
Eventual vs Strong Consistency Tradeoffs in JanusGraph — choosing the acknowledgment and refresh boundary that sets the values used above.
JanusGraph Connection Pool Tuning Guide — sizing the driver pool so backfill throughput is not mistaken for index lag.