Vertex and Edge Validation

Vertex and edge validation is the synchronous, pre-commit gate that decides whether a mutation is allowed to enter the graph at all — and it is the one control that no downstream repair can reconstruct. This page scopes that gate within the parent Graph Schema Validation & Modeling Strategies subsystem: the specific failure it prevents is malformed structure reaching storage. An unregistered property key, a numeric ID written as a string on some vertices and an integer on others, an edge that violates its declared multiplicity, or a supernode that grows without bound will not raise a clean error — it drifts. The graph keeps accepting writes while query plans degrade into full scans, the mixed-index cardinality estimator loses calibration, and rollback storms appear under concurrency. Because JanusGraph delegates structural checking to the management layer but only enforces it at the storage boundary, validation has to be strict, type-exact, and applied before the transaction commits. Everything below is the operational surface an on-call engineer needs: the janusgraph.properties that make drift impossible or observable, the reindex protocol for the async index seam, runnable gremlin-python with explicit transaction handling, pool sizing, and a symptom-to-resolution table for the pages you will actually take.

Validation is a synchronous, pre-commit gate: a non-conforming mutation raises a JanusGraphException before it reaches storage. The composite index commits in the same mutation; only the mixed-index dispatch is asynchronous — the seam where drift originates.

Before a newly created index over a validated property serves traffic, it has to advance through the JanusGraph index status lifecycle: INSTALLED, then REGISTERED, then ENABLED. That is the sequence you must wait on.

An index serves traffic only in ENABLED. A read routed against an index still at INSTALLED or REGISTERED silently falls back to a full scan.
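The lifecycle rule can be written down as a tiny state machine. This is a minimal sketch for reasoning about the gate, not JanusGraph code: the `IndexStatus` enum, `NEXT_STATUS` map, and both helper functions are hypothetical names, and only the three statuses named in the text are modeled.

```python
from enum import Enum


class IndexStatus(Enum):
    INSTALLED = 1
    REGISTERED = 2
    ENABLED = 3


# Forward path of a newly created index; names follow the statuses in the text.
NEXT_STATUS = {
    IndexStatus.INSTALLED: IndexStatus.REGISTERED,
    IndexStatus.REGISTERED: IndexStatus.ENABLED,
}


def serves_traffic(status):
    """Only an ENABLED index answers queries; anything earlier means a full scan."""
    return status is IndexStatus.ENABLED


def remaining_transitions(status):
    """Count the transitions still needed before the index reaches ENABLED."""
    steps = 0
    while status is not IndexStatus.ENABLED:
        status = NEXT_STATUS[status]
        steps += 1
    return steps
```

A deployment script could refuse to route reads while `serves_traffic` is false, which is the same decision the await call later on this page makes against the real index.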

Core Configuration & Consistency Tuning

JanusGraph delegates structural validation to the ManagementSystem, but the storage backend is what rejects a non-conforming mutation at the transaction boundary. The block below is a hardened janusgraph.properties baseline for a CQL-backed cluster with a mixed-index backend; every non-default value exists to turn silent structural drift into an immediate, catchable exception.

properties

# --- Strict schema enforcement ---
schema.default=none
schema.constraints=true

# --- Storage backend: Cassandra / ScyllaDB via the CQL driver ---
storage.backend=cql
storage.hostname=10.0.1.10,10.0.1.11,10.0.1.12
storage.cql.keyspace=graph_prod
storage.cql.write-consistency-level=QUORUM
storage.cql.read-consistency-level=LOCAL_QUORUM
storage.transactions=true
graph.set-vertex-id=true
storage.batch-loading=false

# --- Mixed index backend: Elasticsearch / OpenSearch ---
index.search.backend=elasticsearch
index.search.hostname=10.0.2.20,10.0.2.21,10.0.2.22
index.search.elasticsearch.client-only=true
index.search.elasticsearch.create.ext.refresh_interval=5s
index.search.elasticsearch.bulk-size=500

# --- Cache & eviction ---
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000

Operational constraints that govern validation, in the order a mutation encounters them:

schema.default=none blocks implicit property-key and label creation. Every key, vertex label, and edge label must be registered through the ManagementSystem first; the permissive default (default) auto-creates schema elements on first use, which is exactly how a typo becomes a permanent, unindexed key. This is the single most important line for validation.
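schema.constraints=true enforces the bindings you declare with addProperties and addConnection: a property key that is not bound to the element's label, or an edge between labels with no declared connection, is rejected. Data type and cardinality come from the property-key definition itself and are checked whether or not this flag is set, so the flag's job is to stop a valid key from landing on the wrong label. JanusGraph has no required-property constraint; if a key must be present on every vertex, check that in the pipeline before the write.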
storage.cql.write-consistency-level=QUORUM requires acknowledgment from a majority of replicas, floor(RF/2) + 1 for replication factor RF, before commit, so a validation failure is caught at the storage layer before any index dispatch. Downgrading to ONE risks a validating writer reading a stale schema during node failure and believing a registered key is undeclared. Align this with your replication strategies before bulk ingestion.
graph.set-vertex-id=true lets the pipeline supply deterministic vertex IDs so retries are idempotent; without it a retried batch double-inserts under fresh auto-generated IDs.
storage.batch-loading=false keeps consistency checks and locking active. Setting it true disables the very validation this page enforces — reserve it for a one-shot initial migration, never a running production writer.
create.ext.refresh_interval=5s bounds the visibility window for newly indexed vertices and edges; it does not affect structural validation, which is synchronous, but it governs when a validated write becomes searchable.
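The majority rule behind QUORUM is plain arithmetic. A minimal sketch, assuming the usual Cassandra definition of quorum as a majority of replicas; the function names are illustrative only.

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a QUORUM write: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be down while QUORUM writes still succeed."""
    return replication_factor - quorum(replication_factor)
```

With a replication factor of 3, two replicas must acknowledge and one may be down; raising the factor to 5 tolerates two failures.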

A value that does not match the declared property type, or a property written to a label it is not bound to, is rejected with an exception inside the transaction, before the mutation reaches the index pipeline. Register keys explicitly at schema-deploy time, never implicitly at write time, with the declared type and cardinality the pipeline will honor. The mapping-type parity that keeps an indexed property aligned with its CQL representation is settled by Property Indexing Rules; keyspace provisioning and token-range alignment are in Cassandra Backend Setup, and teams cutting over from Cassandra should read ScyllaDB Migration before enabling schema.default=none.
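One way to keep the hardened baseline from regressing is to check the properties file before a deploy. The sketch below is an assumption about how such a check could look, not an existing tool: `REQUIRED`, `parse_properties`, and `find_drift` are hypothetical names, and the parser handles only simple `key=value` lines with `#` comments.

```python
# Settings this page treats as non-negotiable for a validating writer.
REQUIRED = {
    "schema.default": "none",
    "schema.constraints": "true",
    "storage.batch-loading": "false",
    "graph.set-vertex-id": "true",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def find_drift(text, required=REQUIRED):
    """Return one message per required setting that is missing or different."""
    settings = parse_properties(text)
    problems = []
    for key, expected in required.items():
        actual = settings.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    return problems
```

An empty list means the file still carries the strict settings; any entry is a reason to block the rollout.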

Structural Rules the Constraints Enforce

schema.constraints=true is only as strong as the schema you register against it. The management-time definitions below are what convert the runtime flags above into actual structural guarantees:

java

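import org.janusgraph.core.Cardinality;
import org.janusgraph.core.EdgeLabel;
import org.janusgraph.core.Multiplicity;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.VertexLabel;
import org.janusgraph.core.schema.JanusGraphManagement;

// `graph` is an already-open JanusGraph instance.
JanusGraphManagement mgmt = graph.openManagement();

// Property keys: explicit Java type + cardinality close the mixed-type gap.
PropertyKey userId = mgmt.makePropertyKey("userId")
        .dataType(Long.class).cardinality(Cardinality.SINGLE).make();
PropertyKey email = mgmt.makePropertyKey("email")
        .dataType(String.class).cardinality(Cardinality.SINGLE).make();

// Vertex label + property bindings: with schema.constraints=true, 'account'
// may carry only userId and email. This limits what is allowed; it does not
// make either key mandatory.
VertexLabel account = mgmt.makeVertexLabel("account").make();
mgmt.addProperties(account, userId, email);

// Edge label with declared multiplicity: ONE2MANY allows at most one incoming
// 'owns' per vertex, so a second owner for the same target is rejected.
EdgeLabel owns = mgmt.makeEdgeLabel("owns").multiplicity(Multiplicity.ONE2MANY).make();
mgmt.addConnection(owns, account, account);

mgmt.commit();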

The rules worth internalizing before ingestion:

Cardinality (SINGLE / LIST / SET) decides whether a second write to the same key on a vertex replaces, appends, or de-duplicates. SINGLE on userId makes a duplicate write an overwrite, not silent multi-valued corruption.
Edge multiplicity (MULTI / SIMPLE / MANY2ONE / ONE2MANY / ONE2ONE) is enforced at commit. Declaring ONE2MANY on owns makes a second owner for the same object a SchemaViolationException rather than a data-model violation you discover during analytics.
Property-key constraints (addProperties) bound which keys a label may carry, so a property written to a label it was never bound to is rejected. They do not make a key mandatory; a field that must always be present still has to be checked in the pipeline before the write.
Connection constraints (addConnection) bound which labels an edge may join, so a mislabeled endpoint fails at the boundary.
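The cardinality and multiplicity rules above are easy to misread, so here is a small pure-Python model of them. It is a sketch for reasoning about payloads before they reach the graph, not JanusGraph's implementation; `apply_cardinality` and `violates_multiplicity` are hypothetical helpers, and edges are modeled as `(out_vertex, in_vertex)` tuples for a single label.

```python
def apply_cardinality(existing, value, cardinality):
    """Return the values a key holds after writing `value` to it."""
    if cardinality == "SINGLE":
        return [value]                      # second write replaces
    if cardinality == "LIST":
        return existing + [value]           # second write appends
    if cardinality == "SET":
        return existing if value in existing else existing + [value]
    raise ValueError(f"unknown cardinality: {cardinality}")


def violates_multiplicity(edges, new_edge, multiplicity):
    """True if adding `new_edge` would break the label's multiplicity."""
    out_v, in_v = new_edge
    same_pair = new_edge in edges
    out_used = any(o == out_v for o, _ in edges)   # vertex already has an outgoing edge
    in_used = any(i == in_v for _, i in edges)     # vertex already has an incoming edge
    if multiplicity == "MULTI":
        return False
    if multiplicity == "SIMPLE":
        return same_pair
    if multiplicity == "MANY2ONE":
        return out_used      # at most one outgoing edge per vertex
    if multiplicity == "ONE2MANY":
        return in_used       # at most one incoming edge per vertex
    if multiplicity == "ONE2ONE":
        return out_used or in_used
    raise ValueError(f"unknown multiplicity: {multiplicity}")
```

With `owns` declared ONE2MANY and an existing edge `("alice", "car")`, adding `("bob", "car")` is a violation because the car would gain a second owner, while `("alice", "bike")` is fine.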

Index Synchronization Protocol

Validation is synchronous, but the index that makes a validated vertex queryable is not — and reasoning about the two together is what keeps read-after-write correct. When a mutation commits, the CQL backend writes first (the composite index travels in the same mutation and is queryable at once), then JanusGraph dispatches the mixed-index update to Elasticsearch or OpenSearch through a background worker. That decoupling opens a window in which the graph holds a validated vertex the search index has not yet seen. The acknowledgment-boundary trade-off behind this is analyzed under Eventual vs Strong Consistency; the practical rule for validation is that structural correctness is guaranteed at commit, but searchability is bounded by the visibility window.

Model the window as the sum of three intervals:

$$t_{\text{visible}} = t_{\text{queue}} + t_{\text{bulk}} + t_{\text{refresh}}$$

where t_queue is time in the async dispatch queue, t_bulk is bulk transport plus indexing latency on the search cluster, and t_refresh is capped by create.ext.refresh_interval. A read that assumes read-after-write on the mixed index fails intermittently whenever t_visible exceeds the gap between the write and the dependent read.
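A worked instance makes the window concrete. The sketch below just adds the three intervals and compares the total with the gap between a write and the read that depends on it; the function names and the sample timings are illustrative assumptions, not measurements.

```python
def visible_after(t_queue, t_bulk, t_refresh):
    """Seconds until a committed write is searchable in the mixed index."""
    return t_queue + t_bulk + t_refresh


def read_is_safe(write_to_read_gap, t_queue, t_bulk, t_refresh):
    """True if the dependent read happens after the visibility window closes."""
    return write_to_read_gap >= visible_after(t_queue, t_bulk, t_refresh)
```

Assuming 0.5 s in the queue, 1.0 s of bulk latency, and the 5 s refresh interval from the configuration, the window is 6.5 s: a read issued 2 s after the write can miss the vertex, while one issued 7 s later cannot.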

Two patterns keep validated reads correct without serializing all ingestion:

Lag-gated polling. For reconciliation after a bulk validation run, poll a lag metric instead of sleeping a fixed interval. Track the IndexProvider write-queue depth and the indexing-latency trend, and only route the confirming read once both are within threshold.
Index-status gating. After registering a new index over a validated key, block on ManagementSystem.awaitGraphIndexStatus(graph, "byUserId").status(SchemaStatus.ENABLED).call() before routing production reads — an index stuck at INSTALLED or REGISTERED silently falls back to a full scan.
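Lag-gated polling can be sketched without committing to a particular metrics source. In the helper below, `read_queue_depth` and `read_latency_ms` are caller-supplied callables (how you obtain those numbers is deployment-specific and assumed here), and `wait_until_caught_up` is a hypothetical name. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time


def wait_until_caught_up(read_queue_depth, read_latency_ms,
                         max_depth, max_latency_ms,
                         timeout_s=30.0, interval_s=0.5,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll both lag signals; return True once both are within threshold.

    Returns False if the timeout elapses first, so the caller can decide
    whether to alert or reindex instead of issuing the confirming read.
    """
    deadline = clock() + timeout_s
    while True:
        if read_queue_depth() <= max_depth and read_latency_ms() <= max_latency_ms:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The point of the design is that the confirming read is gated on observed lag rather than on a fixed sleep that is either too short under load or wasteful when the index is idle.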

The lag signals to watch, and what a rising value means:

org.janusgraph.diskstorage.indexing.IndexProvider queue depth (JMX) — a monotonically rising queue is producer backpressure; validated writes are trailing further behind with every batch.
Search cluster /_cat/thread_pool/write?v — non-zero rejections mean the producer is outrunning the search cluster and index writes are dropping into the retry loop.
/_nodes/stats/indices/indexing latency — the leading indicator that t_bulk is widening the window.
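The thread-pool check is easy to automate once the `_cat` output is in hand. This sketch parses the text form of `/_cat/thread_pool/write?v`, assuming the header row contains `node_name` and `rejected` columns and that node names contain no spaces; `SAMPLE` is made-up output and `write_rejections` is a hypothetical helper.

```python
# Illustrative output only; real node names and counts will differ.
SAMPLE = """\
node_name name  active queue rejected
es-1      write      2     0        0
es-2      write      8   200       17
"""


def write_rejections(cat_output):
    """Map node name to rejected write count from _cat/thread_pool text."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    node_col = header.index("node_name")
    rejected_col = header.index("rejected")
    result = {}
    for ln in lines[1:]:
        cols = ln.split()
        result[cols[node_col]] = int(cols[rejected_col])
    return result


def nodes_rejecting(cat_output):
    """Names of nodes with any rejected writes, sorted for stable alerts."""
    return sorted(n for n, r in write_rejections(cat_output).items() if r > 0)
```

Any non-empty result from `nodes_rejecting` is the signal described above: the producer is outrunning the search cluster.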

When the mixed index drifts from the validated graph after a partition or a crash between storage commit and index flush, reconcile with the Management API rather than re-ingesting. The transport specifics behind these commands are in Mixed-Index Routing and OpenSearch Sync Patterns; the canonical repair is a targeted SchemaAction.REINDEX followed by an awaitGraphIndexStatus gate.

Python Integration Pattern

A production pipeline must treat validation as a pre-flight requirement and the write itself as fallible: the payload can be malformed, the backend can be mid-relocation, and a mutation can partially commit. The gremlin-python pattern below validates the whole batch against the registered contract before opening a transaction, commits inside an explicit transaction boundary, keys idempotency on the deterministic vertex ID, and separates non-retryable structural violations from transient failures so backoff never masks a schema bug.

python

import logging
import time
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import T
from gremlin_python.driver.protocol import GremlinServerError
from pydantic import BaseModel, StrictInt, StrictStr, ValidationError

logger = logging.getLogger(__name__)


# The contract mirrors the registered JanusGraph property keys exactly.
# StrictInt/StrictStr reject the mixed-type coercion that corrupts cardinality.
class AccountVertex(BaseModel):
    userId: StrictInt        # matches PropertyKey("userId", Long, SINGLE)
    email: StrictStr         # matches PropertyKey("email", String, SINGLE)


def batch_upsert(conn_url, records, batch_size=500, max_retries=3):
    conn = DriverRemoteConnection(conn_url, "g")
    g = traversal().with_remote(conn)
    try:
        for start in range(0, len(records), batch_size):
            chunk = records[start:start + batch_size]

            # 1. Pre-flight validation — reject the whole batch before any write,
            #    so a single bad record never produces a partial commit.
            try:
                valid = [AccountVertex(**r) for r in chunk]
            except ValidationError as exc:
                raise RuntimeError(f"Schema violation at offset {start}: {exc}") from exc

            # 2. Idempotent, transaction-bounded commit with bounded retry.
            for attempt in range(1, max_retries + 1):
                tx = g.tx()
                gtx = tx.begin()
                try:
                    for v in valid:
                        # coalesce() makes re-ingest idempotent on deterministic userId.
                        gtx.V(v.userId).fold().coalesce(
                            __.unfold(),
                            __.addV("account").property(T.id, v.userId),
                        ).property("email", v.email).iterate()
                    tx.commit()
                    break
                except GremlinServerError as exc:
                    tx.rollback()
                    # A schema/multiplicity violation is not transient — fail fast.
                    if "SchemaViolationException" in str(exc):
                        logger.error("structural violation at offset %s: %s", start, exc)
                        raise
                    if attempt == max_retries:
                        raise RuntimeError(
                            f"batch at offset {start} failed after {attempt} attempts: {exc}"
                        ) from exc
                    time.sleep(min(2 ** attempt, 8))  # exponential backoff, capped
    finally:
        conn.close()

Pipeline rules that keep this deterministic under load:

Validate the entire batch before opening a transaction. Structural checks belong in front of the write; a ValidationError should never reach the transaction boundary as a partial commit.
Classify errors, do not blanket-retry. A SchemaViolationException is permanent — route it to a dead-letter queue. Only connection drops and transient GremlinServerError states earn a backoff retry.
Key idempotency on the deterministic vertex ID. coalesce(unfold(), addV(...)) never double-adds, so a retried batch updates rather than duplicates — the reason graph.set-vertex-id=true matters.
Use .iterate() for mutations, never .toList(), and end each mutation on an explicit terminal step so the transaction boundary and index dispatch are triggered predictably.
Push the same contract into CI. Run an automated schema diff against a staging instance and block any merge that adds an undeclared key, a mismatched type, or a weakened multiplicity — the gating workflow is in Schema Evolution and CI Gating.
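The CI rule in the last item can be sketched as a diff between the registered schema and a proposed one. Everything here is an assumption about representation: schemas are plain dicts with `keys` (name to `(type, cardinality)`) and `edges` (label to multiplicity), and `schema_diff` is a hypothetical function. A multiplicity counts as weakened when the proposed value drops any guarantee the registered one gave.

```python
# Guarantees each multiplicity provides; a superset is at least as strict.
CONSTRAINTS = {
    "MULTI": frozenset(),
    "SIMPLE": frozenset({"unique_pair"}),
    "MANY2ONE": frozenset({"unique_pair", "one_out"}),
    "ONE2MANY": frozenset({"unique_pair", "one_in"}),
    "ONE2ONE": frozenset({"unique_pair", "one_out", "one_in"}),
}


def schema_diff(registered, proposed):
    """Return findings that should block a merge; empty list means safe."""
    findings = []
    for key, spec in proposed["keys"].items():
        current = registered["keys"].get(key)
        if current is None:
            findings.append(f"undeclared key: {key}")
        elif current != spec:
            findings.append(f"changed key: {key}")
    for label, mult in proposed["edges"].items():
        current = registered["edges"].get(label)
        if current is None:
            findings.append(f"undeclared edge label: {label}")
        elif not CONSTRAINTS[mult] >= CONSTRAINTS[current]:
            findings.append(f"weakened multiplicity: {label} {current} -> {mult}")
    return findings
```

Note that moving from MANY2ONE to ONE2MANY is also flagged, because it trades one guarantee for another rather than keeping both.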

Connection Lifecycle & Pool Management

Validation-latency symptoms and connection-starvation symptoms are indistinguishable from the traversal side — both surface as a TimeoutException on a commit that used to be fast — so pool sizing has to be correct before any validation diagnosis can be trusted. gremlin-python opens a connection pool per DriverRemoteConnection; each connection multiplexes requests up to a per-connection concurrency limit, and an undersized pool queues validating traversals client-side while the storage backend sits idle.

Sizing and lifecycle rules for a validating ingestion pipeline:

Size the pool to backend concurrency at QUORUM, not to worker count. Set pool_size near the number of concurrent in-flight commits the storage cluster can absorb; oversizing moves the bottleneck into the storage coordinator, undersizing starves ingestion while the backend is healthy.
Reuse the pool across the batch. Opening a DriverRemoteConnection per vertex thrashes TCP and TLS setup and inflates t_queue. Construct the pool once per worker and close it in a finally block, exactly as the pattern above does.
Bound idle connections. Set an idle timeout so the pool reclaims sockets between batches rather than holding them open — long-lived idle connections accumulate against both the Gremlin Server and the CQL driver behind it.
Match the retry window to the pool. Keep the backoff ceiling below the pool’s idle timeout so a failed commit does not queue behind an unbounded retry loop and stall the async index dispatch.
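The last rule can be checked numerically. The sketch below reproduces the capped exponential backoff used by the ingestion pattern on this page, `min(2 ** attempt, 8)`, and compares its ceiling with an idle timeout; `backoff_schedule` and `retry_window_fits` are hypothetical helpers, and the idle timeout value is whatever your pool is configured with.

```python
def backoff_schedule(max_retries, base=2, cap=8):
    """Sleeps taken between attempts; no sleep follows the final attempt."""
    return [min(base ** attempt, cap) for attempt in range(1, max_retries)]


def retry_window_fits(max_retries, idle_timeout_s, base=2, cap=8):
    """True if the longest single backoff stays below the pool idle timeout."""
    schedule = backoff_schedule(max_retries, base, cap)
    return max(schedule, default=0) < idle_timeout_s
```

With three attempts the sleeps are 2 s and 4 s; with five they are 2, 4, 8, and 8 s, so an idle timeout of 8 s or less would be too tight for that configuration.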

The full sizing model — how pool depth, per-connection concurrency, and storage coordinator threads interact — is in Connection Pooling; settle it before tuning validation, because a starved pool masquerades as commit-time validation lag through every diagnostic below.

Diagnostics & Operational Fallbacks

When validation fails or an index disagrees with the validated graph, work from symptom to diagnosis command to resolution rather than reindexing or disabling constraints blindly. The table below covers the failure modes you will actually page on.

Symptom	Diagnose	Resolve
Writes fail with `SchemaViolationException` after enabling `schema.default=none`	`mgmt.printPropertyKeys()` — confirm the key is unregistered	Register the key with explicit `dataType` + `Cardinality`, deploy schema, replay the batch — do not weaken the default
Second edge between the same pair rejected at commit	`mgmt.getEdgeLabel(name).multiplicity()` — check declared multiplicity	Confirm the model intends `MULTI`; if so, alter multiplicity via a gated migration, otherwise fix the ingestion payload
Validated property missing from a mixed-index query	`awaitGraphIndexStatus(...).status(ENABLED)` + `IndexProvider` queue depth	Structural write succeeded but index trails — run `SchemaAction.REINDEX`; if lag is structural, scale index threads or raise `bulk-size`
One storage node runs hot; p99 write latency spikes	`nodetool tablehistograms graph_prod edgestore` to inspect the partition size distribution of the `edgestore` table	Re-key vertices with consistent hashing and split supernodes; cap high-fan-out edge cardinality
Retried ingestion batch double-inserts vertices	`g.V().hasLabel('account').groupCount().by('userId')` — count duplicates	Enable `graph.set-vertex-id=true` and key writes on deterministic IDs via `coalesce`
Rising `ConcurrentModificationException` / abort rate under load	`StandardJanusGraph` transaction abort-rate JMX bean	Optimistic-lock contention on hot vertices — reduce batch overlap, key idempotency, back off and retry the losing writer
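For the double-insert row, the `groupCount` result still has to be filtered down to the IDs that actually repeat. A minimal sketch, assuming the traversal result has been fetched into a plain dict of `userId` to count; `duplicated_ids` is a hypothetical helper.

```python
def duplicated_ids(group_counts):
    """Keep only the userIds that appear on more than one vertex."""
    return {user_id: n for user_id, n in group_counts.items() if n > 1}
```

An empty result means retries have been idempotent; any entry names a `userId` that needs de-duplication before the fix is considered complete.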

Distinguish transient index lag (a warning that self-heals as the queue drains) from a hard structural violation (critical — it represents data the graph will never accept correctly). Threshold on the JMX beans and the index queue depth, escalate structural violations to on-call, and log transient warnings to your observability stack. For ScyllaDB-backed clusters, run nodetool repair before any index rebuild so the index is not populated from an under-replicated view — the read/write consistency benchmarks are in ScyllaDB Migration. The full severity-classification and escalation policy is in Alert Routing for Violations. For the exact step-by-step hardening runbook — every management command and rollback in order — see Enforcing Strict Vertex and Edge Validation.
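The transient-versus-structural split can be expressed as a small classifier in front of the retry loop. This is a sketch under the same assumption the ingestion pattern makes, namely that a structural failure can be recognized by `SchemaViolationException` appearing in the error text; `PERMANENT_MARKERS`, `classify_failure`, and `route_failures` are hypothetical names.

```python
# Substrings that mark a failure as deterministic; extend per deployment.
PERMANENT_MARKERS = ("SchemaViolationException",)


def classify_failure(message):
    """Return 'dead_letter' for structural violations, 'retry' otherwise."""
    if any(marker in message for marker in PERMANENT_MARKERS):
        return "dead_letter"
    return "retry"


def route_failures(messages):
    """Split error messages into the two queues the policy above describes."""
    routed = {"dead_letter": [], "retry": []}
    for message in messages:
        routed[classify_failure(message)].append(message)
    return routed
```

Keeping the classification in one place means backoff never gets a chance to mask a schema bug: a structural violation goes straight to the dead-letter path and to on-call.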

Frequently Asked Questions

What is the difference between schema.default=none and schema.constraints=true? They cover different gaps and you want both. schema.default=none rejects any write that references an unregistered property key or label, which stops implicit schema creation. schema.constraints=true then enforces the bindings declared on registered labels: which property keys a label may carry (addProperties) and which vertex labels an edge label may join (addConnection). With only the first flag, a registered key can still be written to a label it was never bound to; with only the second, missing bindings are created automatically instead of raising an error. Data type, cardinality, and edge multiplicity come from the key and label definitions and are enforced either way. Together the two flags make both undeclared and misplaced structure an exception at write time.

Why does my second edge between the same two vertices get rejected? Because the edge label was declared with a restrictive multiplicity such as SIMPLE, ONE2MANY, or ONE2ONE, and the second edge violates it. JanusGraph enforces the multiplicity declared on the edge label, independent of schema.constraints. Confirm the intended model with mgmt.getEdgeLabel(name).multiplicity(); if the relationship genuinely allows parallel edges, move to a MULTI label through a gated schema change rather than working around the check.

Should I use storage.batch-loading=true to speed up validated ingestion? No, not against a running production writer. storage.batch-loading=true disables consistency checks and locking to gain throughput, which also disables the validation this page depends on. Reserve it for a one-shot initial migration into an empty keyspace, then turn it off and run a REINDEX before serving reads. For steady-state ingestion, get throughput from batch sizing and connection-pool tuning instead.

Does a structural violation retry safely under exponential backoff? No, and treating it as retryable hides the bug. A SchemaViolationException — unregistered key, wrong type, multiplicity breach — is deterministic: it will fail identically on every attempt. Classify it and route it to a dead-letter queue. Reserve backoff retry for transient failures such as connection drops and coordinator timeouts, where a later attempt can genuinely succeed.

Why is a vertex I just committed missing from a mixed-index query? The structural write succeeded synchronously, but the mixed index updates asynchronously, so the vertex is not searchable until the visibility window t_queue + t_bulk + t_refresh elapses. Confirm the index reached ENABLED with awaitGraphIndexStatus, check the IndexProvider queue depth for backpressure, and if the index trails structurally after a partition or crash, run SchemaAction.REINDEX rather than re-ingesting the data.

Enforcing Strict Vertex and Edge Validation — the step-by-step hardening runbook with every management command and rollback.
Property Indexing Rules — the mapping and cardinality decisions that determine which validated keys can back an index.
Schema Evolution and CI Gating — gating structural changes so a weakened type or multiplicity cannot ship unversioned.
Alert Routing for Violations — severity classification and on-call escalation for validation and index events.
Connection Pooling — the client sizing model that separates pool starvation from validation lag.