Production-Grade Neo4j on AWS
Executive Summary
A production-grade Neo4j deployment on AWS typically uses Neo4j Enterprise Causal Clustering across multiple availability zones (AZs) or regions to achieve high availability and scalability. A minimum of three core (primary) instances is required for fault tolerance. Additional read replicas can be added for read scaling and analytics offloading. AWS infrastructure should include a dedicated VPC with subnets in each AZ, security groups to allow Bolt (7687) and HTTP(S) traffic (if used), and optional placement groups for low-latency network performance. Use memory-optimized EC2 instances (e.g. R6i/R6a/R7a families) sized by workload type (OLTP, OLAP, mixed) and ample EBS storage (gp3 or io2) to meet IOPS/throughput needs. Automate deployment with Terraform/CloudFormation and an Auto Scaling Group with persistent EBS volumes to maintain node identity. Configure backups using neo4j-admin backup (to local storage or S3) and regular EBS snapshots with lifecycle policies. Prepare a Disaster Recovery (DR) plan: e.g. multi-region causal cluster or cross-region backups and a runbook for failover. Monitor cluster health with Prometheus/Grafana (using Neo4j’s exported metrics) and AWS CloudWatch (CPU, network, disk). Secure data at rest with AWS KMS encryption for EBS/S3, enable TLS for Bolt/HTTP and intra-cluster communication, and use IAM roles and secrets managers for credentials. Performance tune Java heap/page cache for data size, disable OS swap, and apply recommended Linux filesystem settings. In the runbook, include procedures for node replacement, rolling upgrades, maintenance, and chaos testing.
Recommended Cluster Architectures
- Causal Cluster (3C+R) – At minimum, use 3 Core servers to form a quorum-based cluster (per Neo4j’s fault-tolerance design). Deploy one core in each of three distinct AZs for high availability. For high read throughput or analytics, add one or more Read Replicas in any AZ. This yields e.g. a 3C+2R cluster (3 primaries, 2 secondaries). Writes are always routed to the current leader core; reads can go to followers or replicas, isolating heavy reads from the write path.
- Cluster with 5 Cores – For 2-fault tolerance, use 5 cores (2 cores in AZ1, 2 in AZ2, 1 in AZ3). This “2+2+1” layout lets the cluster survive any two node failures or the loss of any single AZ. Extra cores do not speed up writes: every commit still needs acknowledgement from a majority (3 of 5), so expect somewhat more replication traffic in exchange for the added resilience.
- Multi-AZ vs. Multi-Region – For latency-critical writes, keep all cores in one region (multi-AZ). For DR across regions, use geo-distributed clustering: e.g. 2 regions each with 3 cores, using Neo4j’s multi-datacenter support (requires Enterprise Edition). Note that an even 3+3 split needs 4 of 6 cores for quorum, so losing either region leaves the survivor read-only until quorum is restored. Alternatively, run the cluster in the primary region only and use scheduled backups to recover in a secondary region. A hybrid approach: one region hosts primaries and secondaries for writes (fast intra-DC writes), and a remote region has only secondaries for reads and DR (see Neo4j’s guidance on designing a resilient cluster).
Cluster Diagram Examples (Mermaid syntax):

```mermaid
graph LR
  subgraph "Region: us-east-1 (Prod)"
    direction LR
    subgraph AZ1
      C1(Core1)
      R1(Rep1)
    end
    subgraph AZ2
      C2(Core2)
      R2(Rep2)
    end
    subgraph AZ3
      C3(Core3)
    end
    C1 --- C2 --- C3
    C1 --- R1
    C2 --- R2
    C3 --- R2
    LoadBal(NLB)
    LoadBal --- C1
    LoadBal --- C2
    LoadBal --- C3
  end
```

```mermaid
graph TB
  subgraph "Region: us-east-1"
    direction TB
    C1(Core1) --- C2(Core2) --- C3(Core3)
    C1 <-- SSL/TLS --> R1(Replica1)
    C2 --- R2(Replica2)
  end
  subgraph "Region: eu-west-1 (DR/Read-Replica Site)"
    C4(Core4) --- C5(Core5) --- C6(Core6)
    C4 --- R3(Replica3)
    C5 --- R4(Replica4)
  end
  %% Inter-region replication (via causal clustering or separate backup processes)
  C3 -.- C4
```

Figure: Examples of Neo4j causal cluster topologies. The first graph shows a 3C+2R cluster across 3 AZs (with an internal NLB). The second graph shows two regions with 3 cores each (multi-region setup).
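The AZ layouts described above (1+1+1 for three cores, 2+2+1 for five) amount to a round-robin spread. A minimal sketch, assuming a hypothetical helper `place_cores` and illustrative AZ names that are not part of any Neo4j or AWS tooling:

```python
from collections import Counter
from itertools import cycle


def place_cores(n_cores: int, azs: list[str]) -> dict[str, int]:
    """Spread n_cores across the given AZs round-robin and return cores per AZ."""
    if n_cores < 1 or not azs:
        raise ValueError("need at least one core and one AZ")
    counts = Counter({az: 0 for az in azs})
    # Walk the AZ list cyclically, assigning one core per step.
    for _, az in zip(range(n_cores), cycle(azs)):
        counts[az] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical AZ names, for illustration only.
    print(place_cores(5, ["us-east-1a", "us-east-1b", "us-east-1c"]))
```

Round-robin keeps the largest AZ as small as possible, which is what limits the damage of a single-AZ outage.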
EC2 Instances & Sizing
Choose instance families optimized for memory and networking. Neo4j is JVM-based, using in-memory caching: favor large RAM and good CPU. Recommended types (per Neo4j’s AWS docs) include:
- R6i, R6a, R7a series (memory-optimized Nitro instances). For example, r6i.8xlarge (32 vCPU, 256 GiB RAM) or r7a.12xlarge.
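- For compute-bound analytics (e.g. CPU-heavy GDS jobs), consider compute-optimized Graviton instances such as C6gn, C7g, or C7gn. These are Arm CPU instances, not GPU instances, so confirm that your Neo4j version and plugins support Arm64.
- For very large DBs, X1e/X2idn (extreme memory) can be used, but most deployments use the R families.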
- Instance sizes depend on workload:
  - OLTP-heavy (many small transactions): more vCPU and moderate RAM (e.g. r6i.8xlarge or r7a.12xlarge).
  - OLAP/analytics (bulk reads, Graph Data Science): maximize RAM (to cache data), e.g. r6i.16xlarge or r7a.16xlarge.
  - Mixed workload: balance CPU and RAM, e.g. r6i.12xlarge or r7a.16xlarge.
  - Minimum: for dev/test, r5.large or r6g.large. Production needs more; 8–16 GB of RAM is the bare minimum for any real use.
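- Network: Enable enhanced networking (ENA). A cluster placement group reduces latency and increases bandwidth, but it is confined to a single AZ, so it only helps nodes that share an AZ (for example a core and its co-located replica), not cores spread across AZs. The largest sizes (e.g. r6i.32xlarge) reach 50 Gbps.
- Pricing: Large instances are costly (e.g. r6i.8xlarge is roughly $2/hr on demand in us-east-1; check current pricing). Consider Savings Plans and Reserved Instances. Right-size by load testing and monitoring usage.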
| Instance Type | vCPU | RAM (GiB) | Best Use | Pros | Cons |
|---|---|---|---|---|---|
| r6i.large | 2 | 16 | Small dev/test | Low cost; Nitro-based | Too small for production |
| r6i.xlarge | 4 | 32 | Light workloads (pilot) | Balanced; Nitro networking | May run out of memory |
| r6i.4xlarge | 16 | 128 | Medium OLTP | Plenty of RAM; up to 12.5 Gbps network | Higher cost |
| r6i.8xlarge | 32 | 256 | High-load OLTP/OLAP | High memory and vCPU count; 12.5 Gbps network | High cost; overkill for small graphs |
| r7a.12xlarge | 48 | 384 | Very large graphs/OLAP | Very large memory; 18.75 Gbps network | Very expensive |
| c7g.8xlarge | 32 | 64 | Compute-bound read analytics | Graviton3 price/performance; 15 Gbps network | Less RAM for caching |
Table: EC2 instance recommendations. Select by balancing CPU, RAM, and price to fit OLTP (I/O, many writes) vs OLAP (CPU, reads).
Storage Options
Use persistent SSD volumes; avoid instance-store for data (since node identity matters).
- EBS gp3 – General Purpose SSD. Baseline 3,000 IOPS and 125 MiB/s throughput per vol by default, scalable to 16,000 IOPS and 1,000 MiB/s. Cost-effective: ~$0.08/GB-month. Use for data and transaction logs. Tune IOPS and throughput to workload (e.g. high-write needs ~5–10k IOPS).
- EBS io2 – Provisioned IOPS SSD. Extremely durable (99.999% reliability vs 99.8% for gp3) and high IOPS (up to 256k/io2 Block Express). Use if required IOPS >16k or strict durability is needed (large clusters). More costly ($0.125/GB and $0.065/IOPS-month in example).
- Instance Store (NVMe) – Some Nitro instance types include local NVMe SSDs (e.g. i3, i4i, r5d). These are very fast but ephemeral: data is lost on stop/terminate. They can hold temp or scratch files but are generally not recommended for the primary store.
- Throughput and IOPS: Ensure the chosen volume supports your peak. gp3 performance is provisioned independently of size and does not burst: a 2 TiB gp3 volume delivers the 3,000 IOPS, 125 MiB/s baseline unless you provision more, e.g. 500 MiB/s with 15k IOPS. For write-heavy workloads (each write is also replicated to the other cores), leave generous headroom, e.g. 5-10x the average write throughput.
- S3 for backups: Use `neo4j-admin database backup --to-path=s3://bucket/...` to send backups directly to S3 (check that your Neo4j version supports cloud storage URIs). Use lifecycle rules to archive (Glacier) old backups.
| Storage Type | IOPS (max) | Throughput (max) | Durability | Cost ($/GB) | Pros | Cons |
|---|---|---|---|---|---|---|
| EBS gp3 | 16,000 | 1,000 MiB/s | 99.8–99.9% | $0.08 + $0.005 per provisioned IOPS above 3,000 | Low cost; IOPS/throughput scale independently of size | Lower durability than io2; slow if under-provisioned |
| EBS io2 | 256,000 (Block Express) | 4,000 MiB/s (Block Express) | 99.999% | $0.125 + $0.065/IOPS | Highest durability, consistent performance | Very high cost for many IOPS |
| Instance NVMe (e.g. i4i) | 100k+ (varies by instance size) | 3,500 MiB/s | None (ephemeral) | Included in instance | Very fast; low latency | Data lost on stop/terminate; must be re-initialized and remounted afterwards |
| Recommended: EBS (gp3/io2) | – | – | – | – | Persistent; attach on boot | Requires snapshot backups for DR |
Table: Storage options. gp3 is usually sufficient and cost-effective; use io2 for extreme cases. Always back up volumes.
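To compare provisioning choices, the gp3 pricing rule in the table can be turned into a small calculator. This is a sketch under stated assumptions: the $0.08/GB and $0.005/IOPS figures come from the text above, while the $0.04 per MiB/s throughput price is an assumed list price that should be checked against current AWS pricing for your region. The function name is hypothetical.

```python
GP3_PRICE_PER_GIB = 0.08      # $/GiB-month, from the table above
GP3_PRICE_PER_IOPS = 0.005    # $/IOPS-month above the free baseline
GP3_PRICE_PER_MIBPS = 0.04    # $/MiB/s-month above baseline (assumed list price)
GP3_BASE_IOPS, GP3_MAX_IOPS = 3_000, 16_000
GP3_BASE_MIBPS, GP3_MAX_MIBPS = 125, 1_000


def gp3_monthly_cost(size_gib: int, iops: int = GP3_BASE_IOPS,
                     throughput_mibps: int = GP3_BASE_MIBPS) -> float:
    """Estimate the monthly cost in USD of one gp3 volume."""
    if not GP3_BASE_IOPS <= iops <= GP3_MAX_IOPS:
        raise ValueError("gp3 IOPS must be between 3,000 and 16,000")
    if not GP3_BASE_MIBPS <= throughput_mibps <= GP3_MAX_MIBPS:
        raise ValueError("gp3 throughput must be between 125 and 1,000 MiB/s")
    storage = size_gib * GP3_PRICE_PER_GIB
    extra_iops = (iops - GP3_BASE_IOPS) * GP3_PRICE_PER_IOPS
    extra_throughput = (throughput_mibps - GP3_BASE_MIBPS) * GP3_PRICE_PER_MIBPS
    return round(storage + extra_iops + extra_throughput, 2)


if __name__ == "__main__":
    # 2 TiB volume provisioned to 15k IOPS and 500 MiB/s, as in the example above.
    print(gp3_monthly_cost(2048, iops=15_000, throughput_mibps=500))
```

Running the same function with default arguments gives the baseline cost, which makes the price of extra IOPS and throughput easy to see per node.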
Networking
Design a private VPC with subnets in each AZ. Key points:
- VPC & Subnets: Create a VPC (e.g. 10.0.0.0/16) with 3 subnets (one per AZ) for the cluster nodes. Optionally, a separate public subnet for bastion or NLB.
- Security Groups: Allow Bolt (7687) and HTTP/HTTPS (7474/7473) inbound to Neo4j instances. Also allow the cluster ports between SG members: 5000 (discovery), 6000 (transaction shipping), 7000 (Raft), and 6362 (online backup). Restrict access to strict CIDRs (e.g. corporate networks or the internal VPC).
- NACLs: By default, allow all within subnet. Use NACLs for extra filtering if needed, but SGs are usually sufficient.
- ENIs: Each EC2 has a primary ENI. Optionally attach an extra ENI for separate data traffic. Not usually necessary unless in multi-homed networks.
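- Placement Group: For very low latency between nodes in the same AZ, launch them in a cluster placement group, which packs instances close together on high-bandwidth links (tens of Gbps). A cluster placement group cannot span AZs, so it conflicts with the multi-AZ layout recommended above; use it only for nodes that share an AZ.
- Load Balancer: Use a Network Load Balancer (NLB) in front of core nodes for client connections (Bolt/HTTP). NLBs (not ALBs) are recommended because Bolt is a binary TCP protocol. Note: drivers using the `neo4j://` routing scheme fetch a routing table and then connect directly to each server’s advertised address, bypassing the NLB, so use the NLB only for initial contact and make sure the advertised addresses are reachable by clients.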
- Private Endpoints: For backup, configure a VPC endpoint for S3 so instances can backup to S3 without traversing the Internet. Also consider AWS Systems Manager (SSM) agent with endpoint for runbooks, and Transit Gateway if connecting multiple VPCs.
From AWS: use private subnets and SGs to isolate the database. “Deploying Neo4j in a VPC with a private subnet and configuring your security group to permit ingress on Bolt/HTTP builds another layer of network security”. Network bandwidth scales with instance size, so ensure the selected instance type has sufficient network capacity.
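The security-group guidance above can be expressed as data before it is handed to any provisioning tool. The sketch below builds rule dictionaries in the shape boto3 accepts for `IpPermissions`, without calling AWS. The function name, the port roles, and the example CIDR and group ID are illustrative assumptions.

```python
import ipaddress

# Client-facing and intra-cluster ports as listed in the Security Groups bullet.
CLIENT_PORTS = {7687: "Bolt", 7474: "HTTP", 7473: "HTTPS"}
CLUSTER_PORTS = {5000: "discovery", 6000: "transaction", 7000: "Raft", 6362: "backup"}


def build_ingress_rules(client_cidr: str, cluster_sg_id: str) -> list[dict]:
    """Return ingress rules: client ports from a CIDR, cluster ports from the SG itself."""
    network = ipaddress.ip_network(client_cidr)
    if network.prefixlen == 0:
        raise ValueError("refusing to open Neo4j ports to the whole internet")
    rules = []
    for port, name in CLIENT_PORTS.items():
        rules.append({
            "IpProtocol": "tcp", "FromPort": port, "ToPort": port,
            "IpRanges": [{"CidrIp": str(network), "Description": f"Neo4j {name}"}],
        })
    for port, name in CLUSTER_PORTS.items():
        # Self-referencing rule: only members of the cluster SG may connect.
        rules.append({
            "IpProtocol": "tcp", "FromPort": port, "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": cluster_sg_id, "Description": f"Neo4j {name}"}],
        })
    return rules


if __name__ == "__main__":
    # Hypothetical CIDR and security group ID.
    for rule in build_ingress_rules("10.0.0.0/16", "sg-0123456789abcdef0"):
        print(rule["FromPort"], rule.get("IpRanges") or rule.get("UserIdGroupPairs"))
```

Keeping cluster ports restricted to the group itself means a new node only needs the security group attached, with no per-IP rules to maintain.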
High Availability & Failover
Neo4j Enterprise ensures HA via Raft consensus. Key behaviors:
- Leader Election: One core per database is the leader; if it fails, the remaining cores hold an election. With 3 cores, the cluster tolerates 1 failure (`f = (n-1)/2 = 1`). With 5 cores, `f = 2`.
- Failover: On node failure, the cluster automatically reconfigures (followers elect a new leader). Avoid multiple simultaneous failures: do rolling restarts/updates (only 1 node down at a time) to keep quorum.
- Replica Promotion: Read replicas (async secondaries) are not part of the quorum. If all primaries fail and only secondaries remain, manual intervention is needed (e.g. via `dbms.cluster.demote`/`promote` where your version provides them, or by recreating the database from a secondary or a backup). For full HA, ensure a majority of primaries always survives.
- Multi-AZ: Spread cores across AZs to handle a whole-AZ outage. If an AZ holding 1 core fails, a 3-core cluster still has 2 of 3 cores, so quorum is maintained, but no further failure can be tolerated. A 5-core cluster in 3 AZs (2-2-1) can lose any one AZ and still have at least 3 cores (quorum maintained).
- Multi-Region: Latency affects write performance, because remote primaries add a network round trip. Writes require acknowledgement from a majority, so choose carefully where primaries live and plan for higher commit latency when the majority spans regions.
Neo4j Cluster Note: A cluster with n primaries tolerates up to ⌊(n–1)/2⌋ failures. Always maintain quorum by not losing more than that many primaries.
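The quorum rule in the note above is easy to check mechanically for any layout. A minimal sketch with hypothetical helper names; it only encodes the majority arithmetic and is not part of Neo4j itself:

```python
def fault_tolerance(n_primaries: int) -> int:
    """Number of primaries that can fail while a majority remains: floor((n-1)/2)."""
    return (n_primaries - 1) // 2


def quorum_size(n_primaries: int) -> int:
    """Smallest majority of n primaries."""
    return n_primaries // 2 + 1


def survives_site_loss(primaries_per_site: dict[str, int]) -> bool:
    """True if quorum remains after losing the site (AZ or region) with the most primaries."""
    total = sum(primaries_per_site.values())
    worst_case_loss = max(primaries_per_site.values())
    return total - worst_case_loss >= quorum_size(total)


if __name__ == "__main__":
    print(fault_tolerance(3), fault_tolerance(5))
    print(survives_site_loss({"az1": 1, "az2": 1, "az3": 1}))
    print(survives_site_loss({"az1": 2, "az2": 2, "az3": 1}))
    print(survives_site_loss({"us-east-1": 3, "eu-west-1": 3}))
```

The last call models the two-region 3+3 layout: losing either region leaves 3 of 6 primaries, which is below the majority of 4.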
Backup and Restore Strategies
- Online Backup: Use `neo4j-admin database backup` for hot backups. This tool can write directly to AWS S3 (via `--to-path=s3://bucket/dir`). Example:

```bash
neo4j-admin database backup \
  --to-path=s3://mybucket/neo4j-backups/DB1 \
  --type=FULL --pagecache=64G mydb
```

  The above sends a full backup to S3. Use `--type=DIFF` thereafter for differential backups. The syntax shown follows Neo4j 5, where the database name is a positional argument; confirm the available options with `neo4j-admin database backup --help` for your version.
- Offline Snapshot: Alternatively, pause writes (stop the DB) and create an EBS snapshot of the volume with `aws ec2 create-snapshot`. The snapshot captures the volume at the moment the call is issued and completes in the background; stopping Neo4j (or freezing the filesystem with `fsfreeze`) first makes it consistent. Snapshots can be copied cross-region or shared with other accounts for DR.
- Backup Schedule: Automate daily full backups (online or snapshot) with more frequent differential backups in between. Use AWS Backup or Data Lifecycle Manager to schedule EBS snapshots and move them to a colder tier after N days.
- AWS S3 Lifecycle: Store backup files in an S3 bucket with lifecycle rules (e.g. delete or move to Glacier after 90 days).
- Restore: For online backups, use `neo4j-admin database restore` on a new server:

```bash
neo4j-admin database restore --from-path=/mnt/backup/mydb --overwrite-destination=true mydb
```

  For snapshots, create a volume from the snapshot and attach it to a fresh instance. Mount it at Neo4j’s data directory.
- Sample Commands:

```bash
# Full backup to S3
bin/neo4j-admin database backup --to-path=s3://myBucket/backups/ neo4j

# Create EBS snapshot (offline): stop the instance, wait, snapshot, restart
aws ec2 stop-instances --instance-ids i-0123456789
aws ec2 wait instance-stopped --instance-ids i-0123456789
aws ec2 create-snapshot --volume-id vol-0abcdef123456 --description "Neo4j data"
aws ec2 start-instances --instance-ids i-0123456789
```
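The schedule described above (one full backup per day, differentials in between) can be sketched as a small command builder for a cron-style job. The helper names and the 02:00 full-backup hour are assumptions, and the command uses the Neo4j 5 form with a positional database name; the sketch only builds the argument list and does not run it.

```python
from datetime import datetime


def backup_type_for(now: datetime, full_hour: int = 2) -> str:
    """FULL during the designated hour of each day, DIFF for every other run."""
    return "FULL" if now.hour == full_hour else "DIFF"


def backup_command(database: str, to_path: str, now: datetime) -> list[str]:
    """Build the neo4j-admin argument list for the run happening at `now`."""
    return [
        "neo4j-admin", "database", "backup",
        f"--to-path={to_path}",
        f"--type={backup_type_for(now)}",
        database,
    ]


if __name__ == "__main__":
    # Hypothetical bucket and database names.
    cmd = backup_command("mydb", "s3://mybucket/neo4j-backups/DB1", datetime.now())
    print(" ".join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

A differential backup needs an existing full backup in the target path, so the first run against an empty location should be forced to FULL regardless of the hour.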
Disaster Recovery (DR)
- Cross-Region Cluster: For continuous DR, deploy a geo-distributed Neo4j cluster (requires Enterprise). This keeps a cluster stretched across regions, e.g. two regions each with 3 cores (total 6). Writes are slower because commits may wait on cross-region acknowledgements, and if one region fails the surviving 3 of 6 cores lack a majority, so the survivor serves reads only until quorum is restored.
- Backup-based DR: More common: regular backups/snapshots copied to another region. In a DR event, bring up EC2 instances (using IaC), attach the latest snapshot, and join or seed them into a new cluster.
- DR Playbook (example):
  - Failover Trigger: Detect primary region failure (via monitoring).
  - Promote Standby: If using a multi-region cluster, force role promotions in the standby region (e.g. via a procedure such as `CALL dbms.cluster.forceAddressToBePrimary(...)`; verify the exact procedure name for your Neo4j version).
  - Restore from Backup: If there is no cluster standby, spin up new instances in the DR region via Terraform.
  - Attach Data: For each core, create volumes from the latest snapshots or restore the latest S3 backup via `neo4j-admin database restore`.
  - Rebind Clients: Update DNS / config to point applications to the new cluster endpoints.
  - Failback: Once the original region is healthy, you may reverse the process.
- Cross-Region Replication (not automatic): Use AWS DataSync or backup/restore to move backups between regions.

Always document and test the DR procedures. The runbook should also cover any dependent AWS services (e.g. restoring RDS/Aurora, if used), although this section focuses on Neo4j.
Monitoring and Alerting
- Metrics: Enable Neo4j’s built-in metrics (JMX/Prometheus). Set `server.metrics.enabled=true` and `server.metrics.prometheus.enabled=true` (Neo4j 5 names; 4.x uses the `metrics.*` prefix) and expose the Prometheus endpoint on port 2004 (or use a JMX exporter). Collect metrics like transaction throughput, page cache hit ratio, heap usage, GC pauses, store sizes.
- Prometheus/Grafana: Use the official Neo4j Prometheus integration or exporters to scrape metrics. Build dashboards showing: average query latencies, TPS, page cache usage, heap used, file descriptors, open transactions, replication lag.
- AWS CloudWatch: Also monitor at the AWS level: CPUUtilization, NetworkIn/Out, and StatusCheckFailed for EC2, plus VolumeQueueLength for EBS. Set alerts (e.g. CPU >80% for 5 min).
- Alerts (example thresholds):
  - Cluster down: No heartbeat from >1 core (use a custom CloudWatch or Grafana alert).
  - Long GC: Alarm if a JVM GC pause exceeds 5s.
  - Heap >= 80%: Alarm if usedHeap/maxHeap >80%.
  - Page cache hit ratio: Alarm if it stays below 90% for an extended period (indicates data outgrowing the cache).
  - Store file size growth: Unexpectedly large growth may mean old data is not being pruned.
- Example Dashboard Widgets: (Given as a concept; not code)
  - Live transaction rate per second.
  - Core/replica up/down status panel.
  - JVM memory (heap and page cache) bars.
  - OS metrics: CPU, disk I/O, network throughput (with thresholds).
  - NLB healthy hosts count.
- Logging: Forward Neo4j logs (debug.log, queries.log, neo4j.log) to CloudWatch Logs or a central ELK stack. Alert on ERROR/Exceptions.
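The alert thresholds above are illustrative starting points rather than cited values; Prometheus and Grafana are a common choice for this role. AWS documentation covers the standard EC2 metrics, and Neo4j’s metrics reference lists the exact metric names to scrape.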
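The example thresholds can be written down as a single evaluation function, which is useful for unit-testing alert logic before wiring it into Grafana or CloudWatch. This is a sketch: the metric keys in the sample dictionary are hypothetical names, not Neo4j or CloudWatch metric identifiers.

```python
def evaluate_alerts(sample: dict[str, float]) -> list[str]:
    """Return the names of alerts triggered by one metrics sample."""
    alerts = []
    if sample["cores_unreachable"] > 1:
        alerts.append("cluster_down")
    if sample["gc_pause_seconds"] > 5:
        alerts.append("long_gc")
    if sample["heap_used_bytes"] / sample["heap_max_bytes"] > 0.80:
        alerts.append("heap_high")
    if sample["page_cache_hit_ratio"] < 0.90:
        alerts.append("page_cache_low_hit_ratio")
    if sample["cpu_percent"] > 80:
        alerts.append("cpu_high")
    return alerts


if __name__ == "__main__":
    healthy = {
        "cores_unreachable": 0, "gc_pause_seconds": 0.2,
        "heap_used_bytes": 20e9, "heap_max_bytes": 64e9,
        "page_cache_hit_ratio": 0.98, "cpu_percent": 35,
    }
    print(evaluate_alerts(healthy))
    print(evaluate_alerts({**healthy, "gc_pause_seconds": 7, "page_cache_hit_ratio": 0.8}))
```

In practice each condition would also carry a duration (for example "for 5 minutes") so that short spikes do not page anyone.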
Security
- Encryption at Rest: Enable AWS KMS encryption for EBS volumes and S3 buckets. Use separate CMK for neo4j (managed or custom key). All data and backups thus encrypted.
- Encryption in Transit:
  - Client -> DB: Enable TLS for Bolt and HTTP. Use valid certificates (can use AWS ACM certificates on NLB for HTTPS). In neo4j.conf, set e.g. `dbms.ssl.policy.https.enabled=true`, and use strong ciphers.
  - Inter-Cluster: Enable intra-cluster SSL (`dbms.ssl.policy.cluster.enabled=true`) to encrypt leader-to-follower traffic. Use your own CA or AWS Private CA.
- Network Security: Use private subnets. Restrict SG ingress to known CIDRs (app servers, on-premises via VPN). Do not open 7687/7474 to 0.0.0.0/0.
- IAM Roles/Policies: Attach an IAM role to EC2 instances that allows only the needed actions (e.g. `s3:PutObject` on the backup bucket, `ec2:CreateSnapshot`, and `elasticloadbalancing:Describe*` if the provisioning tooling needs it). Do NOT put AWS credentials in files.
- Secrets Management: Store the Neo4j initial password or TLS certificate keys in AWS Secrets Manager or SSM Parameter Store. Retrieve them at startup via the instance role.
- Neo4j Security: Require TLS on Bolt (e.g. `dbms.connector.bolt.tls_level=REQUIRED`; the setting is `server.bolt.tls_level` in Neo4j 5). Limit Bolt/HTTP to TLS protocols only. Require password auth (change the default “neo4j/neo4j”). Use RBAC for DB roles.
- Audit: Enable Neo4j query logging (`dbms.logs.query.enabled`; `db.logs.query.enabled` in Neo4j 5) and send logs to a SIEM. Use CloudTrail to audit AWS API calls.
Performance Tuning
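- Java Heap / Page Cache: For each server, size the heap (`-Xmx`, or `server.memory.heap.max_size` in Neo4j 5) for the query workload rather than by a fixed rule, and leave room for the OS. Set `dbms.memory.pagecache.size` (`server.memory.pagecache.size` in Neo4j 5) to fit the working set of data, ideally the entire DB or its hot portion. E.g. on a 256 GiB instance: heap ~64G, page cache ~128G, and the remaining ~64G for the OS and other processes. `neo4j-admin server memory-recommendation` prints a starting point.
- OS Settings: Keep Neo4j out of swap: `vm.swappiness=0` strongly discourages swapping, and `swapoff -a` disables it entirely. Use ext4 or XFS with `noatime,nodiratime`.
- File Handles: Increase ulimit (e.g. 65536) to avoid FD limits on large graphs.
- Garbage Collector: By default, G1GC is good for large heaps. Tune G1 reserve (see Neo4j Tuning of GC docs).
- I/O Scheduler: Use the `noop` or `deadline` scheduler on disks (`none` or `mq-deadline` on newer multi-queue kernels), especially on SSDs, where the device handles request ordering well on its own.
- Query Tuning: On the Neo4j side, create indexes on frequently searched properties. Use EXPLAIN/PROFILE to optimize Cypher queries.
- Drivers: Use the latest Bolt drivers (Neo4j Java/JS drivers) and connection pools for concurrency.
- Batching: For bulk loads, create indexes and constraints only where the load needs them (e.g. for MERGE lookups) and add the rest afterwards, since indexes cannot simply be switched off. Load in batches (apoc.load.csv, etc.), taking care that parallel writers do not contend on the same nodes.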
These are general guidelines (Neo4j docs emphasize memory config).
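The 256 GiB example above (heap ~64G, page cache ~128G) corresponds to a 25% / 50% split with the remainder left to the OS. A minimal sketch of that arithmetic, assuming hypothetical helper names and the Neo4j 5 setting names; the fractions are a starting point for tuning, not a recommendation from Neo4j:

```python
def memory_plan(total_gib: int, heap_fraction: float = 0.25,
                pagecache_fraction: float = 0.50) -> dict[str, int]:
    """Split instance RAM into heap, page cache, and an OS reserve (all in GiB)."""
    if not 0 < heap_fraction + pagecache_fraction < 1:
        raise ValueError("heap + page cache must leave some memory for the OS")
    heap = int(total_gib * heap_fraction)
    pagecache = int(total_gib * pagecache_fraction)
    return {
        "heap_gib": heap,
        "pagecache_gib": pagecache,
        "os_reserve_gib": total_gib - heap - pagecache,
    }


def to_conf_lines(plan: dict[str, int]) -> list[str]:
    """Render the plan as neo4j.conf lines (Neo4j 5 setting names assumed)."""
    return [
        f"server.memory.heap.initial_size={plan['heap_gib']}g",
        f"server.memory.heap.max_size={plan['heap_gib']}g",
        f"server.memory.pagecache.size={plan['pagecache_gib']}g",
    ]


if __name__ == "__main__":
    plan = memory_plan(256)
    print(plan)
    print("\n".join(to_conf_lines(plan)))
```

Setting the initial and maximum heap to the same value avoids heap resizing pauses at runtime.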
Automation and Infrastructure as Code
- Terraform / CloudFormation: Use IaC to provision VPC, subnets, EC2, and other resources. Neo4j provides a CloudFormation template and a Terraform module (see [Terraform module example]). Example Terraform snippet:

```hcl
module "neo4j_cluster" {
  source           = "github.com/neo4j-partners/neo4j-aws-terraform"
  node_count       = 3
  instance_type    = "r6i.4xlarge"
  region           = "us-east-1"
  s3_backup_bucket = "my-neo4j-backups"
  enable_gds       = true
}
```

  This sets up a 3-node cluster (with NLB, ASG, etc.) as per Neo4j’s AWS reference. The input names shown are illustrative; check the module’s declared variables before use.
- Ansible / Chef: Use a config management tool to install Neo4j (Enterprise Edition), configure `neo4j.conf` (cluster settings, JVM options), and attach EBS volumes (using the volume’s tag).
- User Data / Cloud-Init: On instance launch, a startup script can identify its role (by tag or hostname) and attach the correct EBS volume for data. Ensure persistent mounting so that node identity (ID) remains constant.
- Versioning: Use AMI baking (Packer) or SSM to manage Neo4j version updates. The Terraform example tags the NLB with the Neo4j version for consistency.
- Testing IaC: Include automated tests (terratest) to validate infrastructure (e.g. all 3 nodes join cluster).
Cost Estimation & Optimization
Key cost drivers: EC2 instance-hours, EBS storage/IOPS, data transfer, NLB, backup storage.
- Instances: Example (us-east-1) – 3 × r6i.8xlarge (32 vCPU, 256 GiB) at roughly $2.02/hr each on demand ≈ $4,400/month (3 × $2.02 × 730 h). Reserved Instances or Savings Plans can cut this by roughly 30–50% depending on term and payment option.
- EBS: 3 × 2 TiB gp3 volumes ≈ $490/month at $0.08/GB-month; add cost for IOPS if >3k each. EBS snapshots add roughly $0.05/GB-month (about $50/TiB-month) for stored blocks in the standard tier.
- Network: Intra-AZ is free; inter-AZ has a small charge (~$0.01/GB in each direction). Data out (e.g. backups, client queries) costs standard rates.
- NLB: ~$0.0225 per hour (about $16/month) plus ~$0.006 per NLCU-hour; the usage-based part depends on connection count and throughput.
- Cost Optimization: Use smaller instances for dev/test. Turn off clusters when idle (set schedules). Use spot instances for read replicas if acceptable, keeping on-demand replicas as a fallback for when spot capacity is reclaimed. Use gp3 over gp2 to save costs.
- Examples: For moderate OLTP: 3 × r6i.4xlarge (128 GiB RAM, ~$1.01/hr each) + 1 TiB gp3 per node. Approx cost: ~$2.5K/month on demand (about $2.2K compute plus $250 storage); commitment discounts reduce the compute part by roughly 30–50%.

(No official benchmark cited; figures use AWS list prices, which vary by region and change over time.)
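The monthly figures above follow from simple arithmetic: nodes × hourly rate × 730 hours, plus per-node storage. A sketch of that calculation, where the function name is hypothetical and the hourly rates are assumed us-east-1 on-demand list prices that should be replaced with current quotes:

```python
HOURS_PER_MONTH = 730  # average hours per month used for AWS estimates


def cluster_monthly_cost(nodes: int, hourly_rate: float, gib_per_node: int,
                         gp3_price_per_gib: float = 0.08,
                         compute_discount: float = 0.0) -> float:
    """Estimate monthly USD cost of compute plus gp3 storage for a cluster."""
    if not 0 <= compute_discount < 1:
        raise ValueError("compute_discount must be in [0, 1)")
    compute = nodes * hourly_rate * HOURS_PER_MONTH * (1 - compute_discount)
    storage = nodes * gib_per_node * gp3_price_per_gib
    return round(compute + storage, 2)


if __name__ == "__main__":
    # 3 x r6i.8xlarge with 2 TiB gp3 each (rate is an assumed list price).
    print(cluster_monthly_cost(3, 2.016, 2048))
    # 3 x r6i.4xlarge with 1 TiB gp3 each, with a 30% commitment discount.
    print(cluster_monthly_cost(3, 1.008, 1024, compute_discount=0.30))
```

Data transfer, NLB usage, snapshots, and provisioned IOPS are left out, so treat the result as a lower bound.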
Testing & Validation
- Load Testing: Before production, simulate expected workload. Use Neo4j’s Benchmark tools (Neo4j Benchmark, or 3rd-party like Gatling). Measure TPS, latency. Scale cores/replicas until performance stabilizes.
- Chaos/Failure Testing: Simulate node failures (EC2 stop/terminate) to ensure cluster auto-recovery works as intended (one node down, cluster still writes; then node rejoined). Test AZ failure by isolating an AZ.
- Failover Drills: Periodically perform DR test (restore from backups in a different region) to ensure runbook steps are valid.
- CI/CD: Include cluster health checks in deployment pipelines. On upgrade, do rolling upgrades: cordon and stop one node at a time (using `CALL dbms.cluster.cordonServer(...)` so that no new databases are allocated to the server while it is being serviced).
- Metrics Validation: Confirm monitoring alerts fire correctly (e.g. induce high CPU to trigger a CloudWatch alarm).
Operational Runbook
- Startup/Shutdown: To start the cluster, launch all core nodes first (with EBS volumes attached) and then replicas. Order doesn’t matter among cores, as they auto-discover each other. To stop: either stop all nodes simultaneously (for complete shutdown) or sequentially. Stop replicas before primaries where possible to avoid quorum-loss surprises.
- Scaling: Scale up by adding new instances (same Neo4j version), attaching newly tagged volumes, and letting them join the cluster. Scale down by first deallocating databases from the core (in Neo4j 5, `DEALLOCATE DATABASES FROM SERVER ...` followed by `DROP SERVER ...`), then removing the instance. Never delete a live core without deallocation. For ASG-managed clusters, ensure persistent volume reattachment logic is in place.
- Upgrades: For minor upgrades (same major version), use rolling upgrades: drain one core (cordon), shut it down, upgrade OS/Neo4j (via updated AMI or package), restart and let it rejoin. Repeat. For major upgrades (e.g. 4.4 to 5.x), follow Neo4j’s official migration guides (typically via backup and restore, or `neo4j-admin database dump` / `load`).
- Maintenance Window: Schedule maintenance at low-traffic times (weekends/overnight). Inform clients if possible; the cluster usually stays online if at most one core is down.
- Health Checks: Use `CALL dbms.cluster.overview()` (Neo4j 4.x) or `SHOW SERVERS` and `SHOW DATABASES` (Neo4j 5) to check cluster status, plus `CALL dbms.cluster.replicationInfo()` where your version provides it. Expose custom health endpoints or use Grafana alerts.
- Troubleshooting: Common issues include: JVM crashes (OOME; check heap sizing), storage full (monitor disk usage and transaction log retention), network timeouts (monitor instance reachability). Keep documented steps for each symptom.
Checklist (Deployment Steps):
- Infrastructure Setup: Provision VPC, subnets, SGs, NACLs, and NLB via Terraform/CloudFormation. Ensure cross-AZ connectivity.
- Instance Launch: Create EC2 launch template/Auto Scaling Group. Use custom AMI with Neo4j pre-installed or User Data to install. Attach IAM role for S3/EBS access.
- Volume Attach: If reusing volumes, attach each EBS volume to the correct instance (by tag). Format (new volumes only) and mount the volume at `/var/lib/neo4j` (or the data path).
- Configure Neo4j: Set `neo4j.conf`: cluster discovery members (`dbms.cluster.discovery.endpoints` in Neo4j 5; `causal_clustering.initial_discovery_members` in 4.x), advertised addresses, memory settings, TLS paths, metrics enabling. Include any AWS endpoints or Region config if needed.
- Start Neo4j Service: Start the service and verify membership (`neo4j status`, then `cypher-shell "SHOW SERVERS"` on Neo4j 5 or `cypher-shell "CALL dbms.cluster.overview()"` on 4.x).
- Set Firewall Rules: Update SG to allow client IPs.
- Seed Data: (If needed) load initial data with `neo4j-admin database load` or the import tools.
- Backup Init: Create an initial full backup (neo4j-admin or snapshot) and verify restore on a test node.
- Monitoring Setup: Deploy Prometheus node exporter and Neo4j JMX exporter on each instance. Configure Grafana dashboards.
- Alert Setup: Define CloudWatch alarms (CPU, Disk) and Grafana alerts (clusters down, long GC).
- DR Validation: Copy snapshots to DR region and test restore.
- Go Live: Point clients (update DNS/NLB) to cluster endpoints.