feat(tuned): migrate sysctl tunings to tuned #2082
Conversation
… GIDs
- Move various sysctl parameters from setup-system.yml into the postgresql tuned profile.
- Explicitly define GIDs for ssl-cert (1001) and postgres (1002) to ensure stable HugePages access.
- Add HugePages calculation and hugetlb_shm_group configuration to the tuned profile.
- Ensure gotrue.service waits for tuned.service before starting.
Pull request overview
This PR migrates host kernel/sysctl tunings from direct Ansible sysctl tasks to a dedicated tuned profile, aiming to centralize and avoid conflicts while improving PostgreSQL-oriented performance and stability.
Changes:
- Updated PostgreSQL package release strings to `*-tuned-1` variants.
- Added/expanded `tuned` profile configuration for PostgreSQL, including a base profile include, Supabase-specific sysctls, and HugePages settings.
- Removed direct sysctl tuning from `setup-system.yml`, added deterministic GIDs for postgres-related groups in `nixpkg_mode`, and ordered gotrue after tuned.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ansible/vars.yml | Updates Postgres release identifiers to tuned-specific builds. |
| ansible/tasks/setup-tuned.yml | Creates/activates a tuned profile and writes sysctl/HugePages tuning into tuned.conf. |
| ansible/tasks/setup-system.yml | Removes direct sysctl configuration now intended to be handled by tuned. |
| ansible/tasks/setup-postgres.yml | Sets deterministic GIDs for ssl-cert and postgres groups in nixpkg_mode. |
| ansible/files/gotrue.service.j2 | Ensures gotrue starts after tuned to apply kernel/network tunings first. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…-378

* 'INDATA-378' of github.com:supabase/postgres:
  Update vars.yml
  Update vars.yml
  revert: stop using conf.d directory for generated-optimizations (#2101)
If the goal is to encourage more active memory management, we may want to have 3 different profiles:
For all systems: Swap is an expensive operation, so we avoid swap as much as possible, don't let write backlog build up, and flush continuously instead of in bursts. In practice, this often means a higher baseline IO, but fewer spikes.

For systems with 64GB of memory or less:

For systems with more than 64GB of memory: Get the kernel to start background writeback early, with a hard cap on dirty memory in RAM, and avoid processes blocking until writes catch up. As we increase our available disk cache, we need to be more proactive about maintaining contiguous bytes for atomic operations, network, and IO buffers, reducing long-term fragmentation buildup. The compaction adds some CPU, but will help ensure we do not encounter page allocation failures.

vm.overcommit_memory = 2: Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM. Without memory overcommit, Linux will return the error "out of memory" (ENOMEM). If a PostgreSQL process receives this error, the running statement fails with error code 53200, but the database as a whole remains operational. I'm unclear whether the other user processes (Envoy, REST, Auth) handle this error, but this is certainly better for the database.

committable memory = swap + (RAM - vm.nr_hugepages * huge page size) * vm.overcommit_ratio / 100

Most of the instances have much more RAM than swap space, and we adjust the ratio up into the 90% range so the majority is accessible.
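The committable-memory formula above can be sanity-checked with a quick sketch. The instance size, swap, HugePages count, and ratio below are hypothetical values for illustration, not figures from this PR:

```python
GiB = 1024 ** 3

def committable_bytes(ram_bytes, swap_bytes, nr_hugepages, hugepage_bytes, overcommit_ratio):
    """CommitLimit as the kernel computes it when vm.overcommit_memory = 2:
    swap + (RAM - HugePages reservation) * vm.overcommit_ratio / 100."""
    return swap_bytes + (ram_bytes - nr_hugepages * hugepage_bytes) * overcommit_ratio // 100

# Hypothetical 16 GiB instance: 1 GiB swap, 2048 x 2 MiB HugePages, ratio 90.
limit = committable_bytes(16 * GiB, 1 * GiB, 2048, 2 * 1024 ** 2, 90)
print(round(limit / GiB, 2))  # ≈ 11.8 GiB committable
```

Note how the HugePages reservation is subtracted first: memory pinned for `shared_buffers` is not available to the rest of the commit budget.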
Definitely a +1 on this. OOMs are among the most common reasons, if not the most common reason, for a project to fail. When an overcommit limit is enforced, offending connections get killed with the ENOMEM error instead. I'd argue that the setting should be configured automatically once someone gets to a medium.
It looks like we previously based our decision not to disable overcommit on the micro instances? https://github.com/supabase/platform/issues/1404

For the low-memory instances, back of the napkin:
Summary of Changes
Ansible Task Refactoring (ansible/tasks/):
- Audited the `postgres`, `platform`, and `salt` repos for existing `sysctl` calls.
- Dynamically calculate and set `vm.nr_hugepages` based on our default `shared_buffers`, and configure `vm.hugetlb_shm_group`.

Service Ordering (ansible/files/gotrue.service.j2):
Detailed Analysis of Sysctl Parameters
The following sysctl parameters are now being applied via tuned. These changes generally aim to optimize for a high-throughput database workload, improve network resilience, and prevent memory exhaustion issues.
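As an illustration of the calculated HugePages value, here is a rough sketch of how `vm.nr_hugepages` could be derived from `shared_buffers`. The 5% headroom factor and the example sizes are assumptions for illustration; the actual formula in setup-tuned.yml may differ:

```python
import math

def nr_hugepages(shared_buffers_bytes, hugepage_bytes=2 * 1024 ** 2, headroom=1.05):
    """Rough count of 2 MiB huge pages needed to back PostgreSQL's
    shared_buffers, plus a small cushion for other shared segments."""
    return math.ceil(shared_buffers_bytes * headroom / hugepage_bytes)

# Hypothetical: shared_buffers = 4 GiB, default 2 MiB huge page size.
print(nr_hugepages(4 * 1024 ** 3))  # → 2151 pages, i.e. ~4.2 GiB reserved
```

Because the reserved pages are only usable by processes in `vm.hugetlb_shm_group`, the deterministic postgres GID added in this PR is what makes the reservation reliably reachable by PostgreSQL.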
- `vm.overcommit_memory = 2`: Tells the kernel to never overcommit memory. This is a safer mode for dedicated database servers, ensuring the OOM killer is less likely to trigger unpredictably, though it requires careful sizing of swap + RAM.
- `vm.nr_hugepages` (calculated): Allocates explicit HugePages for PostgreSQL `shared_buffers`.

Conclusion
These changes represent a move towards a "production-ready," high-performance configuration. The system is explicitly tuned for high throughput (via buffer/window increases), stability (via OOM panic policies), and reduced CPU overhead (via HugePages and NUMA settings). These settings were based on existing Supabase settings throughout the code, and the recommended tuning practices from Red Hat: PostgreSQL Load Tuning on RHEL (https://www.redhat.com/en/blog/postgresql-load-tuning-red-hat-enterprise-linux). This ensures that the OS is not just a general-purpose host, but is specifically optimized for the high-concurrency, high-I/O profile of a production PostgreSQL instance.