Introduction
Streaming replication is the cornerstone of modern PostgreSQL high‑availability solutions. Unlike file‑based WAL shipping, streaming sends WAL records over a TCP connection in real time, allowing a standby to stay close to the primary's state. This post walks through every step you need to set up streaming replication, from basic configuration to advanced fail‑over strategies, and ties the discussion into broader DBA practices such as performance tuning, backup automation, and cross‑platform insights for teams that also run Oracle or SQL Server (RMAN, Data Guard, ASM).
Architecture Overview
In a typical streaming replication topology, one primary node writes WAL records and streams them to one or more standby nodes. Standbys apply WAL to stay in sync, and can be promoted to primary on failure. The key components are:
- Primary – postgresql.conf (streaming settings)
- Standby – recovery.conf (now a set of parameters in postgresql.conf plus standby.signal)
- Connection authentication via pg_hba.conf
- Optional pg_basebackup or third‑party tools for the initial data copy
- Monitoring tools (e.g., pg_stat_replication, pg_rewind)
Prerequisites
- Hardware & OS – Ensure each node meets the RAM, CPU, and disk I/O requirements for your workload. SSDs reduce WAL flush latency.
- Networking – A dedicated, low‑latency network link between primary and standbys. Port 5432 (or custom) must be open.
- PostgreSQL version – 9.6+ supports streaming replication out of the box. For advanced features (logical replication, automated fail‑over with pg_auto_failover), use 10+.
- Credentials – Create a replication role:
CREATE ROLE replicator WITH REPLICATION LOGIN ENCRYPTED PASSWORD 'strongpass';
- Backup strategy – Have a recent base backup before enabling replication. Use pg_basebackup or a tool like pgBackRest for database backup automation.
Setting Up the Primary
On the primary server, adjust postgresql.conf to enable WAL streaming and replication:
listen_addresses = '*'
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
synchronous_commit = on
Explanation:
- wal_level must be replica (or logical for logical replication).
- max_wal_senders defines the number of concurrent replication connections.
- wal_keep_size keeps enough WAL on disk for standbys that lag (PostgreSQL 13+; older versions use wal_keep_segments instead).
- synchronous_commit = on makes transactions wait for the standby's acknowledgement only when synchronous_standby_names is also set; otherwise commits wait only for the local WAL flush.
Authentication
Update pg_hba.conf to allow the replication role from standby IPs:
# TYPE DATABASE USER ADDRESS METHOD
host replication replicator 192.168.1.20/32 md5
Reload the configuration:
SELECT pg_reload_conf();
Preparing the Standby
Standby nodes need a fresh copy of the primary’s data directory. The simplest method is pg_basebackup:
pg_basebackup -h primary_host -D /var/lib/postgresql/12/main -U replicator -Fp -Xs -P
After the copy completes, create an empty standby.signal file in the data directory and set primary_conninfo in postgresql.conf:
primary_conninfo = 'host=primary_host port=5432 user=replicator password=strongpass sslmode=prefer'
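Alternatively, pg_basebackup's -R flag writes standby.signal and a matching primary_conninfo for you during the copy; a minimal sketch, reusing the paths above:
pg_basebackup -h primary_host -D /var/lib/postgresql/12/main -U replicator -Fp -Xs -P -R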
Optionally, enable synchronous_standby_names on the primary to control which standbys are considered synchronous.
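For example, to wait on one specific standby (standby1 here is a hypothetical application_name set in that standby's primary_conninfo):
synchronous_standby_names = 'FIRST 1 (standby1)'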
Starting the Standby
Restart PostgreSQL on the standby:
sudo systemctl restart postgresql
Verify the replication status from the primary:
SELECT pid, usename, application_name, state, sync_state
FROM pg_stat_replication;
You should see a row for each standby with state = streaming and sync_state = async (or sync).
Monitoring & Metrics
- WAL lag – pg_stat_replication shows write_lag, flush_lag, and replay_lag (see the sample query after this list).
- Disk usage – Monitor wal_keep_size and pg_wal directory growth.
- Health checks – Use pg_isready or external tools (Prometheus exporter, pgWatch).
- Alerting – Configure alerts when lag exceeds your threshold or pg_stat_activity shows long‑running transactions.
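A quick lag check from the primary; replay_bytes is an illustrative alias computed from the current WAL position:
SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes
FROM pg_stat_replication;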
Fail‑over Strategies
Manual Promotion
On a standby, run:
SELECT pg_promote();
This ends recovery and makes the standby the new primary (pg_ctl promote does the same from the shell). You must also update clients to point to the new primary and ensure the former primary is shut down or reconfigured as a standby.
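A minimal promotion check, run on the standby:
SELECT pg_is_in_recovery();  -- true while the node is still a standby
SELECT pg_promote();         -- returns true once promotion completes
SELECT pg_is_in_recovery();  -- false after promotion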
Automatic Fail‑over with pg_auto_failover
For automated fail‑over, install pg_auto_failover. Note that pg_autoctl is a shell command, not a SQL function; a sketch of the documented setup, with paths and hostnames adjusted for your environment:
pg_autoctl create monitor --pgdata /var/lib/postgresql/monitor
pg_autoctl create postgres --pgdata /var/lib/postgresql/12/main --monitor postgres://autoctl_node@monitor_host/pg_auto_failover
Follow the pg_autoctl documentation to set up node registration, election, and monitoring.
High Availability Patterns
- Primary‑Standby Pair – Single primary + one or more asynchronous standbys.
- Multi‑Master (Logical Replication) – Use wal_level = logical and pglogical to sync across nodes (a built‑in publication/subscription sketch follows this list).
- Distributed Clusters (Citus) – Scale writes across shards, while replication ensures data locality.
- Cloud‑Native (RDS, Cloud SQL) – Leverage managed fail‑over but still query pg_stat_replication for custom monitoring.
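As an illustration of the logical pattern using only built‑in features (pglogical has its own API), a hypothetical two‑node setup with illustrative names and connection string:
-- On the publishing node (requires wal_level = logical):
CREATE PUBLICATION mypub FOR ALL TABLES;
-- On the subscribing node:
CREATE SUBSCRIPTION mysub
  CONNECTION 'host=primary_host dbname=appdb user=replicator password=strongpass'
  PUBLICATION mypub;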
Performance Tuning for Replication
- WAL Buffers – Increase wal_buffers on the primary to reduce WAL flush frequency.
- Checkpoint Settings – Tweak checkpoint_timeout, max_wal_size, and min_wal_size to balance durability vs. performance.
- I/O Scheduling – Use deadline or noop on SSDs; cfq on spinning disks.
- WAL Compression – Set wal_compression = on to compress full‑page writes, reducing WAL volume and replication bandwidth at the cost of CPU.
- Connection Pooling – Deploy PgBouncer or Pgpool-II on standbys to reduce connection overhead for read‑only queries. A starting‑point config sketch follows this list.
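Illustrative postgresql.conf starting points; the right values depend on your workload and hardware:
wal_buffers = 16MB
checkpoint_timeout = 15min
max_wal_size = 4GB
min_wal_size = 1GB
wal_compression = on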
Backup and Point‑in‑Time Recovery (PITR)
Replication is often the first step toward a robust backup strategy. Combine streaming replication with:
- Base Backups – Periodic pg_basebackup or pgBackRest snapshots.
- WAL Archiving – archive_mode = on and archive_command to store WAL segments externally.
- Point‑in‑Time Recovery – Restore a base backup and replay archived WAL up to a recovery_target_time; pg_rewind resynchronizes a former primary, and pg_waldump inspects WAL contents. A configuration sketch follows this list.
- Automation – Integrate with CI/CD pipelines or configuration management tools.
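A minimal archiving and PITR sketch; the archive path and target time are illustrative:
# postgresql.conf on the primary
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
# Recovery settings on a restored base backup (plus a recovery.signal file)
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-05-01 12:00:00'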
Security Considerations
- Encryption – Enable TLS for replication: set ssl = on with certificate files on the primary, and an appropriate sslmode in the standby's primary_conninfo (a sketch follows this list).
- Least Privilege – Use a dedicated replication role with the REPLICATION privilege only.
- Firewall Rules – Restrict the replication port to known IPs.
- Monitoring Credentials – Avoid using superuser credentials for monitoring tools.
- Audit Logging – Enable log_connections and log_disconnections to track replication sessions.
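A TLS sketch with illustrative file names; generate and distribute certificates to match your own PKI:
# postgresql.conf on the primary
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
ssl_ca_file = 'root.crt'
# pg_hba.conf: require TLS for replication connections
hostssl replication replicator 192.168.1.20/32 md5
# Standby's primary_conninfo: verify the primary's certificate
primary_conninfo = 'host=primary_host user=replicator password=strongpass sslmode=verify-full'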
Common Pitfalls & Troubleshooting
- WAL retention problems – If standbys fall behind and needed WAL has been recycled, increase wal_keep_size or use a physical replication slot so the primary retains WAL until it is consumed; watch pg_replication_slots so inactive slots don't fill the disk (a slot sketch follows this list).
- Lag spikes – Check network latency, disk I/O bottlenecks, or long transactions on the primary.
- Authentication failures – Verify pg_hba.conf and the replication role password.
- Promotion errors – If promotion fails, the node may still be applying WAL; run pg_is_in_recovery() to confirm its state.
- Missing standby.signal – PostgreSQL 12+ uses this file; older versions use recovery.conf.
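Creating and wiring up a dedicated physical slot; the slot name is illustrative:
-- On the primary:
SELECT pg_create_physical_replication_slot('standby1_slot');
-- On the standby, in postgresql.conf:
primary_slot_name = 'standby1_slot'
-- Check for inactive slots that retain WAL indefinitely:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;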
Best Practices Checklist
- Use dedicated replication slots to avoid WAL backlog.
- Keep primary and standby on the same PostgreSQL major version (and comparable OS/architecture); physical replication requires binary compatibility.
- Implement regular fail‑over drills to validate promotion scripts.
- Maintain consistent timezone settings across nodes.
- Document the replication topology and recovery procedures.
- Integrate replication metrics into central monitoring dashboards.
Conclusion
Streaming replication in PostgreSQL is a mature, flexible mechanism that can be combined with backup automation, performance tuning, and robust fail‑over solutions to build highly available, resilient databases. By following the steps outlined above and adhering to best‑practice guidelines, you can ensure that your PostgreSQL workloads remain online, consistent, and ready to scale. The same principles—configuration hygiene, monitoring, and automation—apply whether you manage on‑prem databases, cloud‑hosted instances, or hybrid environments alongside Oracle, SQL Server, and other enterprise systems.
If you found this guide helpful, subscribe to our newsletter and follow us on LinkedIn for more DBA insights.