Why do I get NoHostAvailableException even though every Cassandra node is up?

The exception is frequently a client-side pool signal, not a node-down signal. When every socket in the per-host pool is saturated and a request cannot be acquired within connection-timeout, the driver reports NoHostAvailableException. Confirm with JMX in-flight metrics against max-connections-per-host before assuming a network partition, and rule out long stop-the-world GC pauses that consume the acquisition budget.

Should I set the same pool values for ScyllaDB as for Cassandra?

No. ScyllaDB uses a shard-per-core reactor I/O model that multiplexes far more requests per socket, so raise storage.cql.max-requests-per-connection to 4096. Leaving it at the Cassandra-oriented 1024 under-utilizes each connection and forces the pool to open more physical sockets than necessary.

How large should max-connections-per-host be?

Size it to at least the Gremlin Server worker thread pool, then raise it in increments of 8 only while the storage backend still shows headroom in nodetool tpstats. Each socket costs a file descriptor and backend memory, so an oversized pool trades pool exhaustion for coordinator overload rather than fixing the bottleneck.

Can I change connection pool settings without a restart?

The storage.cql.* pool properties are read at graph initialization, so a Gremlin Server pool restart is required for them to take effect. Apply changes on a canary node, restart it, load-test, and only then roll the values across the fleet with graph.close() between nodes to drain connections cleanly.

JanusGraph Connection Pool Tuning Guide

This guide is the step-by-step procedure for sizing the JanusGraph CQL connection pool so a production cluster survives sustained ingestion without the NoHostAvailableException and P99 latency spikes that a driver-default pool tends to produce under load. It sits under the Connection Pooling reference and narrows that subsystem to one task: computing, applying, and validating exact per-host socket and multiplexing values against your topology. The specific failure this prevents is thread starvation: producers block on connection acquisition, commits queue behind an undersized pool, and the external index receives writes out of order. Everything below is a bounded, observable change you can canary and roll back, not a value to copy blindly.

Threads acquire a slot through the connection-timeout gate; the pool opens core sockets and grows to the max, each multiplexing async frames to the coordinator. When every socket is busy at timeout, the driver fails fast with NoHostAvailableException rather than blocking.
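
The fail-fast behaviour described above can be modelled in a few lines. This is a toy sketch, not driver code: `SlotPool` and `PoolExhausted` are hypothetical names, a `threading.BoundedSemaphore` stands in for the pool's slot accounting, and the timeout plays the role of connection-timeout.

```python
import threading


class PoolExhausted(Exception):
    """Stand-in for the driver's exhaustion error (hypothetical name)."""


class SlotPool:
    """Toy model of a bounded pool: slots = sockets * requests per socket."""

    def __init__(self, sockets, requests_per_socket, acquire_timeout_s):
        self._slots = threading.BoundedSemaphore(sockets * requests_per_socket)
        self._timeout = acquire_timeout_s

    def acquire(self):
        # Wait at most the acquisition budget, then fail fast instead of blocking.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolExhausted("no free slot within acquisition timeout")

    def release(self):
        self._slots.release()


pool = SlotPool(sockets=2, requests_per_socket=2, acquire_timeout_s=0.05)
for _ in range(4):          # fill every slot
    pool.acquire()
try:
    pool.acquire()          # fifth request finds the pool saturated
    saturated = False
except PoolExhausted:
    saturated = True
pool.release()              # one in-flight request completes
pool.acquire()              # the next caller now gets a slot
```

The point of the model is the bounded wait: once every slot is taken, a caller either gets a slot inside the budget or receives an error, so blocked threads cannot accumulate without limit.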

Prerequisites

Confirm every item before editing janusgraph.properties. Skipping the backend-capacity check is the most common cause of a “pool tuning” change that simply moves the bottleneck one layer down.

JanusGraph 0.6.x or 1.0.x running against a CQL storage backend — Cassandra 3.11+/4.x or ScyllaDB. If storage is not yet stood up, follow Cassandra Backend Setup first.
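DataStax Java Driver 4.x (bundled with the JanusGraph CQL adapter). The property names below are the JanusGraph-namespaced storage.cql.* keys, not raw driver keys. Driver 4.x names its exhaustion errors AllNodesFailedException and NoNodeAvailableException (NoHostAvailableException is the 3.x class name), so match the log searches in this guide to whichever name your logs actually emit.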
gremlinpython matching your server’s TinkerPop line (3.5.x for JG 0.6, 3.7.x for JG 1.0) for the load-test step.
Write access to janusgraph.properties on every node and a maintenance window to restart the Gremlin Server pool.
A JMX endpoint or Prometheus scrape enabled (metrics.enabled=true) so pool saturation is measurable, not guessed.
Known cluster state. nodetool status must show UN for all storage nodes, and the storage backend must not already be saturated — a pool change cannot fix a backend that is CPU- or I/O-bound. Align your Replication Strategies before tuning, because replica count and consistency level set the coordinator fan-out each pooled request pays for.

Step 1 — Establish the pool baseline

The DataStax driver’s default allocations target development workloads and will throttle production traffic. Apply this annotated baseline to janusgraph.properties on one node first. It targets a three-node local datacenter and is the profile you tune from, not a universal constant — the correct maximum is a function of node count and per-node capacity.

properties

storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=janusgraph_prod
storage.cql.local-datacenter=us-east-1

# TCP socket allocation per host
storage.cql.core-connections-per-host=4
storage.cql.max-connections-per-host=12

# Request multiplexing & timeout budget
storage.cql.max-requests-per-connection=1024
storage.cql.connection-timeout=5000
storage.cql.request-timeout=12000

# Consistency & routing
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.cql.write-consistency-level=LOCAL_QUORUM

Operational constraints for each value:

max-connections-per-host caps physical TCP sockets per storage node. It must equal or exceed the Gremlin Server worker pool (threadPoolWorker in gremlin-server.yaml; if your traversals execute on the separate gremlinPool, size against the larger of the two), because a smaller pool means threads block on socket acquisition under concurrency.
max-requests-per-connection governs async frame multiplexing over each socket. For ScyllaDB, raise this to 4096 to match its shard-per-core reactor I/O model; leaving it at the Cassandra-oriented 1024 under-utilizes each socket.
connection-timeout is a hard acquisition limit. Exceeding it raises NoHostAvailableException immediately rather than letting threads pile up unbounded.
request-timeout must sit above your slowest legitimate traversal but below any upstream client deadline, or slow deep traversals will masquerade as pool exhaustion.
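
The constraints above can be checked mechanically before a restart. The sketch below is one possible pre-flight check, not part of JanusGraph: `parse_properties` and `check_pool` are hypothetical helpers, `worker_threads` and `upstream_deadline_ms` are values you would supply yourself, and the rule that core must not exceed max is inferred from the pool growing from core to max.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_pool(props, worker_threads, upstream_deadline_ms):
    """Return a list of violated constraints (empty list means all pass)."""
    p = "storage.cql."
    core = int(props[p + "core-connections-per-host"])
    maximum = int(props[p + "max-connections-per-host"])
    request_timeout = int(props[p + "request-timeout"])
    problems = []
    if core > maximum:
        problems.append("core-connections-per-host exceeds max-connections-per-host")
    if maximum < worker_threads:
        problems.append("max-connections-per-host is below the worker pool size")
    if request_timeout >= upstream_deadline_ms:
        problems.append("request-timeout is not below the upstream client deadline")
    return problems


baseline = """
storage.cql.core-connections-per-host=4
storage.cql.max-connections-per-host=12
storage.cql.request-timeout=12000
"""
props = parse_properties(baseline)
ok = check_pool(props, worker_threads=8, upstream_deadline_ms=15000)
too_small = check_pool(props, worker_threads=16, upstream_deadline_ms=15000)
```

With the baseline values, a worker pool of 8 passes while a worker pool of 16 is flagged, which is the case where threads would queue on socket acquisition.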

Verify: restart the node and confirm the driver attaches with the configured pool, not a fallback default.

bash

grep -E "Using .* pool|core-connections|max-connections|DefaultDriverContext" \
  /var/log/janusgraph/server.log | tail -5

Step 2 — Confirm the backend is not the real bottleneck

Pool exhaustion and backend saturation produce the same symptom — elevated P99 and driver timeouts — so prove which one you have before enlarging the pool. A larger pool aimed at a saturated backend just accelerates the overload.

bash

# Storage thread-pool saturation: watch MutationStage / ReadStage
nodetool tpstats | grep -E "Pool Name|MutationStage|ReadStage"

If Active consistently matches the pool maximum while Pending climbs and Blocked/All time blocked is non-zero, the storage backend is the constraint — add storage capacity or lower ingestion rate, and do not touch pool sizing yet.
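
If you collect this output from several nodes, the check can be scripted. The following is a minimal sketch that assumes the usual six-column tpstats layout (pool name, Active, Pending, Completed, Blocked, All time blocked). `saturated_stages` is a hypothetical helper and the sample text is illustrative, not captured from a real cluster. A single snapshot cannot show that Pending is climbing, so in practice you would run it repeatedly and compare.

```python
def saturated_stages(tpstats_text, stages=("MutationStage", "ReadStage")):
    """Return stage names that have queued work and any blocked tasks."""
    hot = []
    for line in tpstats_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] not in stages:
            continue  # header rows and other pools are ignored
        # Columns: name, Active, Pending, Completed, Blocked, All time blocked
        pending = int(fields[2])
        blocked = int(fields[4])
        all_time_blocked = int(fields[5])
        if pending > 0 and (blocked > 0 or all_time_blocked > 0):
            hot.append(fields[0])
    return hot


# Illustrative sample only.
sample = """\
Pool Name                Active   Pending   Completed   Blocked  All time blocked
ReadStage                     0         0      123456         0                 0
MutationStage                32       210     9876543         0                12
"""
result = saturated_stages(sample)
```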

Verify: capture driver-side pool metrics over JMX and compare in-use sockets against the ceiling.

bash

# In-flight requests per host (JMX → Prometheus), plus a count of logged pool errors.
# The metric name depends on how your exporter maps the driver's JMX beans.
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=cql_pool_in_flight' | jq '.data.result'
grep -c "NoHostAvailableException" /var/log/janusgraph/janusgraph.log

Only when Active at the JanusGraph pool matches max-connections-per-host while the backend tpstats shows headroom is the pool itself the bottleneck — that is the signal to raise it in Step 3.
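
The decision rule for this step is small enough to write down explicitly. A sketch, assuming you already have the in-use socket count from your metrics and a yes/no answer from the tpstats check; `diagnose` and its return labels are hypothetical.

```python
def diagnose(in_use_sockets, max_connections_per_host, backend_saturated):
    """Classify the bottleneck; backend saturation always takes precedence."""
    if backend_saturated:
        return "backend"   # add capacity or throttle ingestion; leave the pool alone
    if in_use_sockets >= max_connections_per_host:
        return "pool"      # pool is at its ceiling while the backend has headroom
    return "neither"       # look elsewhere, for example slow traversals


at_ceiling = diagnose(12, 12, backend_saturated=False)
overloaded = diagnose(12, 12, backend_saturated=True)
idle = diagnose(5, 12, backend_saturated=False)
```

Checking the backend first encodes the ordering this step argues for: a pool at its ceiling is only actionable once the backend has been ruled out.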

Step 3 — Apply and load-test the tuned pool

Raise the ceiling in bounded increments, then replay realistic traffic. Do not jump max-connections-per-host straight to a large number: each socket consumes a file descriptor and backend memory, and an oversized pool trades pool exhaustion for coordinator overload. Increase by 8, restart the canary, and drive load with gremlinpython:

python

import time
import logging
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def load_test(ws_endpoint, batch_size=500, batches=200, throttle_ms=50):
    """Write `batches` transactions of `batch_size` vertices each and log the totals."""
    conn = DriverRemoteConnection(ws_endpoint, "g")
    g = traversal().withRemote(conn)
    committed = 0
    errors = 0
    try:
        for i in range(batches):
            tx = g.tx()
            try:
                gtx = tx.begin()
                for n in range(batch_size):
                    gtx.addV("entity").property("batch", i).property("n", n).iterate()
                tx.commit()                  # explicit commit holds a pool slot until acked
                committed += batch_size
            except Exception as e:
                errors += 1
                logging.error("batch %d failed: %s", i, e)
                try:
                    tx.rollback()            # release the slot; log for the retry queue
                except Exception as rollback_error:
                    # Rollback can itself fail if the transaction never opened.
                    logging.warning("rollback for batch %d failed: %s", i, rollback_error)
            time.sleep(throttle_ms / 1000.0)
        logging.info("committed=%d errors=%d", committed, errors)
    finally:
        conn.close()                         # always release the client connection

# Usage: replay at ~1.5x production write rate against the canary node only
# load_test("ws://canary-gremlin:8182/gremlin")

Batch mutations in chunks of 500–1000 per transaction to keep commit frequency low: each open transaction holds a pool slot until the coordinator acknowledges, so smaller batches mean more commits for the same volume of data and multiply slot pressure. Keep the index dispatch decoupled with index.search.elasticsearch.bulk-refresh=false so index refresh backpressure does not propagate into the storage pool; the sync mechanics are covered in OpenSearch Sync Patterns.
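
When the input is a stream of records rather than a fixed loop, the batching can be done with a small generator. This is a generic sketch, not a JanusGraph or gremlinpython API: `chunked` is a hypothetical helper, and each yielded list would map to one transaction.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# 1750 records in chunks of 500: three full batches and one partial batch.
sizes = [len(b) for b in chunked(range(1750), 500)]
```

Because the generator pulls lazily from the source, only one batch is held in memory at a time, which matters for long ingestion runs.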

Verify: during the replay, NoHostAvailableException count must stay flat and P99 must stay within your SLO.

bash

watch -n 5 'grep -c "NoHostAvailableException" /var/log/janusgraph/janusgraph.log'

Step 4 — Promote or roll back

Treat promotion as a gated decision, not an assumption.

Canary window: hold the tuned config on the single node under 1.5x load for 30 minutes.
Promote: if NoHostAvailableException stays flat and P99 remains within SLO, propagate janusgraph.properties to the full cluster via configuration management and restart pools in a rolling fashion.
Rollback trigger: if the NoHostAvailableException count rises more than 15% above its pre-change baseline or P99 exceeds 2x baseline, revert the canary to the previous file and investigate backend I/O wait before retrying.
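
The gate can be expressed as a function so the decision is reproducible rather than judged by eye. A sketch under stated assumptions: the 15% and 2x thresholds come from the list above, while `canary_decision`, its parameters, and the intermediate "hold" outcome (neither clean nor bad enough to revert) are hypothetical additions.

```python
def canary_decision(baseline_errors, canary_errors,
                    baseline_p99_ms, canary_p99_ms, slo_p99_ms):
    """Return 'rollback', 'promote', or 'hold' for one canary window."""
    # Integer arithmetic avoids float rounding on the 15% threshold.
    errors_regressed = canary_errors * 100 > baseline_errors * 115
    latency_regressed = canary_p99_ms > 2 * baseline_p99_ms
    if errors_regressed or latency_regressed:
        return "rollback"
    if canary_errors <= baseline_errors and canary_p99_ms <= slo_p99_ms:
        return "promote"
    return "hold"  # assumption: extend the window instead of deciding


clean = canary_decision(100, 100, 40, 45, 50)
worse_errors = canary_decision(100, 120, 40, 45, 50)
worse_latency = canary_decision(100, 100, 40, 90, 50)
borderline = canary_decision(100, 110, 40, 45, 50)
```

Note that with a baseline of zero errors, any error during the canary window triggers a rollback under this rule.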

Verify parity across all cluster nodes after propagation:

bash

for host in node1 node2 node3; do
  echo "== $host =="
  ssh "$host" "grep -E 'max-connections-per-host|max-requests-per-connection' \
    /etc/janusgraph/janusgraph.properties"
done

Every node must report identical pool values — a mixed fleet routes disproportionate load to the nodes with the larger pool and reproduces the exhaustion you just fixed.
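
For larger fleets, comparing grep output by eye does not scale. A minimal sketch of an automated parity check, assuming you have already collected each node's properties into a dictionary; `find_drift` and the sample `fleet` values are hypothetical.

```python
def find_drift(per_host, keys):
    """Map each key whose value differs across hosts to its per-host values."""
    drift = {}
    for key in keys:
        values = {host: props.get(key) for host, props in per_host.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


MAX_CONN = "storage.cql.max-connections-per-host"
MAX_REQ = "storage.cql.max-requests-per-connection"

# Illustrative data: node3 was missed during propagation.
fleet = {
    "node1": {MAX_CONN: "20", MAX_REQ: "1024"},
    "node2": {MAX_CONN: "20", MAX_REQ: "1024"},
    "node3": {MAX_CONN: "12", MAX_REQ: "1024"},
}
drift = find_drift(fleet, [MAX_CONN, MAX_REQ])
```

A key that is missing on one host shows up as `None` for that host, so an absent property is reported as drift rather than silently ignored.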

Fallback and rollback procedures

Each step has a defined recovery path. Validate between actions rather than stacking changes.

If Step 1 fails (driver falls back to defaults). A malformed property or wrong local-datacenter makes the driver ignore the block. Confirm the datacenter name matches nodetool status output exactly, fix the typo, and restart before continuing.

If Step 2 shows backend saturation. Stop. Do not enlarge the pool. Scale storage horizontally or throttle ingestion at the pipeline. If you must absorb a transient spike while adding nodes, raise connection-timeout to 8000 ms temporarily, but never above 15000 ms, or acquisition backpressure cascades into application-server thread exhaustion.

If Step 3 regresses under load. The canary is isolating exactly this. Revert its janusgraph.properties, restart, and re-baseline. Common causes: max-connections-per-host raised above what the backend coordinators can absorb, or max-requests-per-connection set for ScyllaDB while pointing at Cassandra. Add a client-side retry policy with exponential backoff and jitter for genuinely transient partitions rather than widening the pool further.
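
The retry policy mentioned above could look like the following. It is a generic sketch of exponential backoff with full jitter, not a driver or gremlinpython feature: `retry_with_backoff` and the `is_transient` predicate are hypothetical, and which exceptions count as transient is for your application to decide. The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real delays.

```python
import random
import time


def retry_with_backoff(op, is_transient, attempts=5, base_s=0.1, cap_s=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call op(); on a transient failure wait a jittered, growing delay and retry."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            # Give up on non-transient errors or when the attempts are used up.
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)).
            delay = rng() * min(cap_s, base_s * (2 ** attempt))
            sleep(delay)


calls = {"n": 0}
delays = []


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


outcome = retry_with_backoff(
    flaky,
    lambda e: isinstance(e, ConnectionError),
    sleep=delays.append,   # record delays instead of sleeping
    rng=lambda: 1.0,       # pin jitter to its upper bound for the demo
)
```

The jitter spreads retries from many clients over time, so a brief partition does not end with every producer hitting the pool again at the same instant.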

If a rolling restart leaves orphaned connections. Forceful termination strands sockets that consume backend memory until the TCP keepalive timeout expires. Always drain first:

groovy

// In the Gremlin console on the node being cycled
graph.close()   // lets in-flight transactions finish and sockets drain cleanly

If the whole change must be reverted cluster-wide. Restore the previous janusgraph.properties from configuration management, roll pools node by node with graph.close() between each, and confirm the NoHostAvailableException rate returns to baseline before declaring the rollback complete.

Up a level: Connection Pooling — the parent reference for pool lifecycle, idle-socket recovery, and eviction policy this procedure tunes.
How to Configure Cassandra for JanusGraph Storage — keyspace, consistency, and the storage baseline the pool sits on top of.
Configuring Multi-Datacenter Replication for Graph Data — replica topology and consistency levels that set the coordinator fan-out each pooled request pays for.
Optimizing ScyllaDB Read/Write Consistency for Graphs — the shard-per-core model behind the ScyllaDB max-requests-per-connection value used above.
Syncing JanusGraph with Elasticsearch Step by Step — where index backpressure meets the pool during bulk ingestion.