---
title: "Linux Network Testing and Optimisation Techniques for Elasticsearch Clusters"
date: "2026-05-17"
published: true
tags: ["linux", "networking", "elasticsearch", "logstash", "rhel", "performance", "observability", "nmon", "capacity-planning"]
author: "Gavin Jackson"
excerpt: "A practical guide to testing and improving Linux network performance for a three-node Elasticsearch cluster fed by two Logstash servers, including NIC bonding, iperf3, ethtool, TuneD, PCP, nmon, and RHEL-specific tooling."
---

# Linux Network Testing and Optimisation Techniques for Elasticsearch Clusters

![Elastic logo](/assets/elastic-logo-mark.svg)

A busy logging platform rarely fails in a glamorous way.

It normally starts with a small delay in dashboards, a few Logstash retries, a queue that grows during the business day but drains overnight, or Elasticsearch bulk requests that occasionally come back slower than expected. Then someone asks the obvious question: is the network fast enough?

Imagine a fairly common setup:

- three Elasticsearch nodes
- two Logstash servers
- more than 100 servers sending logs into the platform
- sustained ingest, with bursts during deployments, outages, batch jobs, and security events

That is exactly the kind of environment where network tuning matters, but it is also the kind of environment where tuning the wrong thing can waste a lot of time. A Linux server with a 10 GbE or 25 GbE NIC can still perform poorly if interrupts are pinned badly, NIC queues are undersized, switch links are dropping frames, Logstash is sending too many small bulk requests, or all traffic is accidentally landing on one Elasticsearch node.

The goal is not to collect sysctl settings like trading cards. The goal is to prove where the bottleneck is, remove avoidable packet loss, and make the network boring.

## Understand the traffic first

In this sort of logging design, there are two important traffic paths.

The first path is from the estate into Logstash:

```text
100+ servers -> Logstash nodes
```

This might be Beats traffic, syslog, HTTP, TCP, UDP, or a mixture. If the sending agents have disk-backed queues, this path is usually forgiving. If the traffic is UDP syslog, it is much less forgiving because drops may simply vanish.

The second path is from Logstash into Elasticsearch:

```text
Logstash nodes -> Elasticsearch HTTP bulk API -> Elasticsearch transport traffic
```

Logstash writes to Elasticsearch over HTTP, normally to port `9200`, using the Elasticsearch output plugin. Elasticsearch then uses its transport layer, normally port `9300`, for node-to-node traffic, shard replication, cluster state, recoveries, and internal coordination.

That means a single incoming log event can create more than one network movement:

- from the original server to Logstash
- from Logstash to an Elasticsearch node
- from the primary shard to replica shards
- during recovery or rebalancing, between Elasticsearch nodes again
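
To put rough numbers on that amplification, a back-of-the-envelope sketch can help. The function name `estimate_paths` and the example figures are assumptions for illustration only. The estimate deliberately ignores protocol overhead, TLS, compression, recovery traffic, and the extra hop that occurs when the node receiving a bulk request does not hold the primary shard.

```bash
# Rough per-path bandwidth estimate for a logging pipeline.
# Assumptions (illustrative, not taken from any Elastic tool): every event
# has the same average size, and overhead/compression/recovery are ignored.
estimate_paths() {
    events_per_sec=$1   # sustained events per second across the estate
    avg_bytes=$2        # average event size in bytes
    replicas=$3         # number of replica copies per primary shard

    awk -v e="$events_per_sec" -v b="$avg_bytes" -v r="$replicas" 'BEGIN {
        mbit = e * b * 8 / 1000000   # one copy of the stream, in Mbit/s
        printf "servers -> logstash:        %.1f Mbit/s\n", mbit
        printf "logstash -> elasticsearch:  %.1f Mbit/s\n", mbit
        printf "primary -> replica shards:  %.1f Mbit/s\n", mbit * r
    }'
}
```

For example, `estimate_paths 20000 1000 1` models 20,000 events per second at roughly 1 KB each with one replica, which works out to 160 Mbit/s on each of the three paths. Real traffic will differ, so treat this as a sanity check for sizing rather than a measurement.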

Before changing anything, draw the actual traffic flow. Which VLAN carries Logstash output? Which VLAN carries Elasticsearch transport traffic? Are the Logstash hosts writing to all three Elasticsearch nodes, or only one? Are the Elasticsearch nodes in the same rack, across racks, or across a routed fabric? Are you using TLS, compression, jumbo frames, or a load balancer?

The network shape matters as much as the network speed.

## Start with Elasticsearch and Logstash basics

It is easy to blame Linux networking when the real issue is ingest behaviour.

Elastic's own [indexing speed guidance](https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed) starts with bulk requests, worker concurrency, refresh intervals, replicas during initial loads, filesystem cache, shard layout, and hot spotting. That is the right order. A perfect NIC cannot fix undersized shards, saturated disks, too many concurrent writers to one shard, or Logstash sending badly shaped batches.

For Logstash, the [Elasticsearch output plugin](https://www.elastic.co/docs/reference/logstash/plugins/plugins-outputs-elasticsearch) is also worth reading closely. A few details matter for network planning:

- The plugin sends batches to the Elasticsearch Bulk API.
- Very large batches are split when they exceed 20 MB.
- Using one Elasticsearch output is usually more efficient than creating many outputs, because each output creates its own client and connection pool.
- Modern plugin documentation describes request compression through `compression_level`, which can reduce network IO at the cost of CPU.
- DNS caching and long-lived keepalive connections can affect failover and traffic distribution.

That leads to some practical checks:

- Configure Logstash outputs with all suitable Elasticsearch endpoints, not just `es01`.
- Watch for Elasticsearch `429` responses, rejected writes, indexing pressure, and long bulk latencies.
- Avoid one hot Elasticsearch node receiving most of the Logstash traffic.
- Keep Logstash persistent queues enabled if the platform must absorb short Elasticsearch or network interruptions.
- Tune Logstash batch and worker settings alongside Elasticsearch bulk sizing, not separately from it.

Elasticsearch networking settings should normally stay simple. The [Elasticsearch network settings](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/networking-settings) documentation recommends `network.host` for the common case where a node binds and publishes one address. Advanced bind and publish settings are useful for multi-homed hosts, but they are also a good way to create confusing cluster behaviour if DNS, routing, or interface selection is wrong.

For a production cluster, be deliberate:

```yaml
network.host: 10.20.30.11
http.port: 9200
transport.port: 9300
```

Do not use `0.0.0.0` casually on a multi-homed Elasticsearch node. It can be fine for binding in controlled cases, but the publish address must be reachable by the nodes and clients that need it. If the node publishes the wrong interface, the network can look broken even when the OS is doing exactly what it was told.

## Measure before tuning

I would split testing into three layers.

First, check the physical and OS layer:

```bash
ip -br link
ip -s link show dev ens1f0
ethtool ens1f0
ethtool -S ens1f0 | egrep -i 'drop|discard|error|timeout|miss|fifo|collision'
ethtool -g ens1f0
ethtool -k ens1f0
ethtool --show-channels ens1f0
```

You are looking for boring fundamentals: expected link speed, full duplex, no physical errors, no growing drop counters, sensible ring sizes, enabled offloads, and enough NIC queues for the hardware.

Second, check the kernel and socket layer:

```bash
ss -s
ss -nti
nstat -az Tcp\* Ip\*
awk '{for (i=1; i<=NF; i++) printf strtonum("0x" $i) (i==NF ? "\n" : " ")}' /proc/net/softnet_stat | column -t
mpstat -P ALL 1
sar -n DEV,TCP,ETCP 1
```

The most useful symptoms here are retransmits, receive queues that do not drain, softirq pressure on a small number of CPUs, and kernel backlog drops. If `ss -nti` shows retransmits on Logstash-to-Elasticsearch connections, that is more useful than a vague feeling that "the network is slow".

Third, test the application layer. Quote the URLs so the shell does not try to expand the `?` as a glob character:

```bash
curl -s 'https://es01.example.net:9200/_nodes/stats/http,transport,indices,thread_pool?pretty'
curl -s 'https://es01.example.net:9200/_cat/nodes?v'
curl -s 'https://es01.example.net:9200/_cat/thread_pool/write?v'
curl -s 'http://logstash01.example.net:9600/_node/stats/pipelines?pretty'
```

The exact Elasticsearch thread pool names can vary across major versions, so check your version rather than copying old examples blindly. What matters is the pattern: correlate network counters with bulk latency, write rejections, indexing rate, queue growth, and node-level hot spots.
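
As a small aid for spotting write rejections per node, the sketch below filters `_cat/thread_pool` output. The function name `flag_write_rejections` is hypothetical, and it assumes the request selects exactly the four columns `node_name,active,queue,rejected` without a header row. The `rejected` counter is cumulative since node start, so compare two samples rather than reading one value in isolation.

```bash
# Flag nodes whose write thread pool reports rejections.
# Expects on stdin the output of something like:
#   curl -s 'https://es01.example.net:9200/_cat/thread_pool/write?h=node_name,active,queue,rejected'
# that is, four whitespace-separated columns and no header line.
flag_write_rejections() {
    awk '$4 + 0 > 0 { printf "%s: %d rejected, queue %d\n", $1, $4, $3; found = 1 }
         END        { if (!found) print "no write rejections reported" }'
}
```

Piping the `curl` command shown in the comment into `flag_write_rejections` prints one line per node that has rejected writes, which makes an uneven distribution across `es01`, `es02`, and `es03` easy to see.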

## Use iperf3, but use it properly

The first active test tool to reach for is `iperf3`.

The [iperf3 project](https://software.es.net/iperf/) describes it as a tool for measuring maximum achievable bandwidth on IP networks, with support for TCP, UDP, SCTP, buffer tuning, zero-copy, and JSON output. Red Hat's [RHEL network performance documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance) also uses `iperf3` for TCP throughput testing, while warning that synthetic test results can differ from real application throughput.

That warning matters. `iperf3` does not tell you how fast Elasticsearch can index. It tells you whether two hosts can move traffic cleanly across the network path.

Install the usual toolbox on the test hosts:

```bash
dnf install iperf3 ethtool sysstat pcp tuned bcc-tools tcpdump
```

Temporarily open the iperf3 port on the receiver:

```bash
firewall-cmd --add-port=5201/tcp --timeout=1h
iperf3 --server
```

Then test from each Logstash node to each Elasticsearch node:

```bash
iperf3 --client es01.example.net --time 60 --omit 5
iperf3 --client es01.example.net --parallel 4 --time 60 --omit 5
iperf3 --client es01.example.net --parallel 8 --time 60 --omit 5 --reverse
iperf3 --client es01.example.net --parallel 8 --time 60 --omit 5 --bidir
iperf3 --client es01.example.net --parallel 8 --time 60 --omit 5 --json
```

I like to run tests in a matrix:

| Source | Destination | Test |
|---|---|---|
| `logstash01` | `es01`, `es02`, `es03` | single stream, parallel streams, reverse |
| `logstash02` | `es01`, `es02`, `es03` | single stream, parallel streams, reverse |
| `es01` | `es02`, `es03` | transport network tests |
| `es02` | `es01`, `es03` | transport network tests |
| `es03` | `es01`, `es02` | transport network tests |

Single-stream testing is important because one TCP flow will often be limited by one CPU path, one NIC queue, one bond member, or one switch hash decision. Parallel-stream testing is important because Logstash and Elasticsearch normally use multiple connections and multiple flows.

For higher speed networks, Red Hat documents `--zerocopy` as useful when simulating zero-copy-capable applications or trying to reach very high single-stream throughput:

```bash
iperf3 --client es01.example.net --time 60 --omit 5 --zerocopy
```

Do not run one test and declare victory. Run a baseline, make one change, run the same test again, and keep the JSON results.
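
One way to keep runs repeatable is to generate the test matrix rather than typing it. The sketch below only prints commands (a dry run). The helper name `iperf_matrix`, the `results/` directory, and the file naming scheme are assumptions for illustration, and an `iperf3 --server` must already be listening on each destination.

```bash
# Print (dry run) the iperf3 commands for one source host against every
# destination passed as an argument. Nothing is executed here.
# The results/ path and file names are placeholders.
iperf_matrix() {
    src=$1; shift                 # label used only in the result file name
    for dst in "$@"; do
        for streams in 1 4 8; do
            printf 'iperf3 --client %s --parallel %s --time 60 --omit 5 --json > results/%s_to_%s_p%s.json\n' \
                "$dst" "$streams" "$src" "$dst" "$streams"
        done
    done
}
```

On `logstash01`, after `mkdir -p results`, running `iperf_matrix logstash01 es01.example.net es02.example.net es03.example.net | sh` would execute nine tests and leave one JSON file per destination and stream count, ready to compare against the next run.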

## When NIC bonding helps

Bonding NICs can help in two different ways: availability and aggregate throughput.

Those are not the same thing.

On RHEL, Red Hat's [network bonding documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/configuring-network-bonding_configuring-and-managing-networking) describes bonding as a way to aggregate interfaces into one logical interface for higher throughput or redundancy. It can be configured with `nmcli`, the RHEL web console, `nmtui`, `nmstatectl`, or RHEL system roles.

The practical modes to think about are:

| Bond mode | What it is good for | Switch requirement |
|---|---|---|
| `active-backup` | Redundancy. One link active, another takes over if it fails. | No special switch configuration. |
| `802.3ad` / LACP | Aggregate capacity across multiple flows and link redundancy. | LACP port channel on the switch. |
| `balance-xor` | Static aggregation with hash-based distribution. | Static EtherChannel, not LACP. |
| `balance-rr` | Round-robin packet distribution. | Static EtherChannel, but not a good default for ordered TCP workloads. |

For Elasticsearch and Logstash, `active-backup` is excellent when resilience matters more than throughput. It will not make one Logstash-to-Elasticsearch TCP connection faster, because only one physical link is active at a time.

LACP is the more interesting option for throughput. With `802.3ad`, the bond can distribute different flows across different physical links. This can help a logging stack because you have multiple Logstash hosts, multiple Elasticsearch nodes, HTTP bulk connections, and Elasticsearch transport connections. The aggregate traffic can spread.

But there is a catch: a single TCP flow usually lands on one member link. If you bond two 10 GbE ports with LACP, do not expect one TCP stream to become 20 Gbps. You should expect many flows to have up to 20 Gbps of aggregate headroom, if the switch and host hashing distribute those flows well.

That is why the `xmit_hash_policy` matters. Red Hat documents `layer3+4` as a transmit hash policy that considers IP addresses and ports for port selection. For Logstash and Elasticsearch, where the same hosts may maintain multiple TCP connections, that can distribute traffic better than a policy that only considers MAC addresses or IP addresses. The switch side has to be compatible with the design.
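
The idea behind per-flow hashing can be shown with a toy model. The function below is a simplified stand-in and is not the kernel's actual `layer3+4` algorithm. The name `pick_link`, the use of only the last IPv4 octet, and the sample addresses and ports are all assumptions made to keep the illustration short.

```bash
# Simplified stand-in for a layer3+4 transmit hash. This is NOT the kernel's
# exact algorithm. It only illustrates that the member link is a pure
# function of addresses and ports, so one flow always uses one link.
pick_link() {
    src_ip_last=$1   # last octet of the source IPv4 address
    dst_ip_last=$2   # last octet of the destination IPv4 address
    src_port=$3
    dst_port=$4
    links=$5         # number of member links in the bond
    echo $(( (src_ip_last ^ dst_ip_last ^ src_port ^ dst_port) % links ))
}

# Four connections from one Logstash host to the same Elasticsearch node
# differ only in their ephemeral source port, yet can use different links.
for port in 40000 40001 40002 40003; do
    echo "src port $port -> link $(pick_link 21 11 "$port" 9200 2)"
done
```

Because the result depends only on the inputs, repeating a call with the same values always returns the same link, which is why a single stream cannot exceed one member's speed. A policy that ignored ports would send all four connections down the same link.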

A RHEL 9.4 or later `nmcli` example for an LACP bond looks like this:

```bash
nmcli connection add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4"
nmcli connection add type ethernet port-type bond con-name bond0-eno1 ifname eno1 controller bond0
nmcli connection add type ethernet port-type bond con-name bond0-eno2 ifname eno2 controller bond0
nmcli connection modify bond0 ipv4.addresses 10.20.30.11/24 ipv4.gateway 10.20.30.1 ipv4.method manual
nmcli connection modify bond0 connection.autoconnect-ports 1
nmcli connection up bond0
cat /proc/net/bonding/bond0
```

On older RHEL versions, the `nmcli` terminology may use `master`, `slave-type`, and `connection.autoconnect-slaves` instead of `controller`, `port-type`, and `connection.autoconnect-ports`.

Two cautions are worth spelling out.

First, do not use NIC teaming for a new RHEL 9 design. Red Hat marks [NIC teaming as deprecated in RHEL 9](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/configuring-network-teaming_configuring-and-managing-networking) and recommends the bonding driver instead.

Second, bonding is not a substitute for buying the right NIC speed. If the cluster needs deterministic throughput above 10 Gbps between any two nodes, 25 GbE, 40 GbE, or 100 GbE is usually cleaner than hoping a bond will make every flow faster.

## Jumbo frames: useful, but only when boring

Jumbo frames can reduce packet overhead and CPU work for large contiguous data streams. Red Hat's [RHEL performance documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance) notes that a 9000 byte MTU reduces Ethernet frame overhead compared with the standard 1500 byte payload.

That sounds attractive for Elasticsearch bulk traffic and shard recovery traffic. It can be, especially on a dedicated backend network.

The problem is that every device in the path must agree: source NIC, switch ports, port channels, VLANs, routed interfaces, firewalls, destination NIC, and sometimes virtual switches. A partial jumbo-frame configuration is worse than no jumbo frames because it creates fragmentation, drops, and strange latency.

If you enable jumbo frames, do it on a dedicated ingest or cluster transport network and test it explicitly:

```bash
nmcli connection modify bond0 mtu 9000
nmcli connection up bond0
ip link show dev bond0
nstat -az IpReasm\*
ping -c1 -Mdo -s 8972 es02.example.net
```

The ping payload calculation for IPv4 is `MTU - 8 bytes ICMP header - 20 bytes IPv4 header`, so `8972` is the common test size for a 9000 byte MTU.
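
The same arithmetic can be wrapped in two small helpers so the test size is never typed from memory. The names `ping_payload` and `jumbo_ping_cmd` are hypothetical, and the calculation assumes an IPv4 header without options.

```bash
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# 20 bytes IPv4 header (no options) + 8 bytes ICMP header = 28 bytes.
ping_payload() {
    mtu=$1
    echo $(( mtu - 28 ))
}

# Print the don't-fragment ping command for a given MTU and target host.
# The command is only printed, not executed.
jumbo_ping_cmd() {
    echo "ping -c1 -Mdo -s $(ping_payload "$1") $2"
}
```

`ping_payload 9000` gives `8972` and `ping_payload 1500` gives `1472`. For IPv6 the fixed header is 40 bytes, so the subtraction would be 48 rather than 28.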

If there is any doubt, stay at 1500 MTU until you can test the whole path properly.

## RHEL tools that are worth using

RHEL has a very good network performance toolbox, and the best part is that much of it integrates with NetworkManager rather than disappearing after reboot.

### TuneD

`tuned` is the first RHEL-specific tool I would check.

Red Hat's [TuneD documentation for RHEL 9](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/monitoring_and_managing_system_status_and_performance/monitoring_and_managing_system_status_and_performance) lists `network-throughput` as a profile for streaming network throughput and `network-latency` as a profile focused on low-latency network performance.

For Elasticsearch ingest nodes, `network-throughput` is usually the more natural starting point:

```bash
dnf install tuned
systemctl enable --now tuned
tuned-adm list
tuned-adm active
tuned-adm profile network-throughput
tuned-adm verify
```

Do this in a maintenance window and measure before and after. TuneD profiles are useful, but they are still system-wide behaviour changes.

### NetworkManager and ethtool settings

The old habit was to run `ethtool -G` or `ethtool -K`, then forget that the setting would vanish on reboot. RHEL's [NetworkManager ethtool settings](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/configuring-ethtool-settings-in-networkmanager-connection-profiles_configuring-and-managing-networking) avoid that by storing offload, coalescing, ring buffer, and channel settings in the connection profile.

Useful checks:

```bash
ethtool -S ens1f0 | egrep -i 'drop|discard|error|miss|timeout'
ethtool -g ens1f0
ethtool -k ens1f0
ethtool --show-coalesce ens1f0
ethtool --show-channels ens1f0
```

Ring, coalescing, and channel settings belong to the physical NIC rather than the bond device, so on a bonded host apply them to each port profile (`bond0-eno1` and `bond0-eno2` in the bonding example) instead of to `bond0`.

If RX drops are rising and the NIC supports larger rings:

```bash
nmcli connection modify bond0-eno1 ethtool.ring-rx 4096
nmcli connection modify bond0-eno1 ethtool.ring-tx 4096
nmcli connection up bond0-eno1
```

If interrupt rate is too high and CPU is spending too much time handling packets, interrupt coalescing can improve throughput:

```bash
nmcli connection modify bond0-eno1 ethtool.coalesce-rx-frames 128
nmcli connection up bond0-eno1
```

That can add latency, so test it with real ingest traffic. For logging pipelines, a tiny increase in latency may be acceptable if it buys lower CPU usage and fewer drops. For latency-sensitive request paths, it might not be.

If the NIC has more channels available than it is using:

```bash
ethtool --show-channels eno1
nmcli connection modify bond0-eno1 ethtool.channels-combined 8
nmcli connection up bond0-eno1
```

Do not blindly set channels to the maximum. Match the NIC, CPU topology, interrupt distribution, and workload.

### irqbalance

On RHEL, `irqbalance` is enabled by default and should normally stay enabled. Red Hat warns that disabling it can hurt network throughput. If one CPU is doing all interrupt work, network performance can suffer even when the link itself is fine.

Check it:

```bash
systemctl status irqbalance
cat /proc/interrupts
mpstat -P ALL 1
```

If CPU 0 is much busier than everything else during network tests, interrupts and queues are worth investigating.
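
Reading `/proc/interrupts` by eye gets tedious on hosts with many cores. The sketch below totals interrupt counts per CPU for one interface. The helper name `nic_irqs_per_cpu` is hypothetical, and it assumes the driver includes the interface name in its IRQ names (for example `ens1f0-TxRx-0`), which is common but not universal.

```bash
# Sum interrupt counts per CPU for one NIC from /proc/interrupts-style input.
# Usage: nic_irqs_per_cpu ens1f0 < /proc/interrupts
# Assumes the NIC's IRQ names contain the interface name (driver-dependent).
nic_irqs_per_cpu() {
    awk -v nic="$1" '
        NR == 1 { ncpu = NF; next }              # header row: CPU0 CPU1 ...
        index($0, nic) {                         # rows mentioning the NIC
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++) printf "CPU%d %.0f\n", i - 1, total[i]
        }'
}
```

Run it before and after an `iperf3` test and compare the two outputs. If almost all of the growth lands on one CPU, the queues or IRQ affinity deserve a closer look.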

### PCP and sysstat

For historical evidence, use `sysstat` and Performance Co-Pilot.

`sar` is still excellent for quick before-and-after work:

```bash
sar -n DEV,TCP,ETCP 1
sar -n SOCK 1
```

Performance Co-Pilot is more powerful for longer investigations. Red Hat's [PCP data sheet](https://access.redhat.com/articles/3119481) describes it as a supported system-level performance monitoring suite with live and archived metrics, broad Linux coverage, and tools that overlap with familiar utilities such as `iostat`, `pidstat`, `vmstat`, and `mpstat`.

For an Elasticsearch cluster, PCP is useful because the performance problem may have happened at 2:00 AM during an index rollover, shard recovery, or log storm. Live commands are nice. Archived metrics are better.

> ## nmon and old-school capacity planning
>
> Before every platform had a time-series database attached to it, `nmon` was one of the nicer ways to get long-running performance evidence from Unix and Linux systems.
>
> The appeal was simple: start a collector, let it run through real business cycles, then analyse the generated files later. That made it useful for capacity planning because you could capture quiet periods, daily peaks, backup windows, month-end processing, and the awkward spikes that never happen while someone is watching a terminal.
>
> A typical collection pattern looked like this:
>
> ```bash
> nmon -f -s 60 -c 1440
> ```
>
> That records one sample every 60 seconds for 24 hours. For longer studies, you would schedule it from cron, keep the `.nmon` files, and compare days or weeks rather than arguing from one busy five-minute sample.
>
> The reporting workflow was also very practical. You could feed the generated files into a spreadsheet-based analyser, often a custom Excel workbook, and turn raw counters into charts for CPU, disk, memory, paging, network, and process behaviour. In more automated environments, the same kind of data could be pushed into an RRDTool-backed web view so teams could browse historical graphs without passing spreadsheets around.
>
> The lesson still applies even if the modern tooling is PCP, Prometheus, Grafana, Elastic monitoring, or a vendor platform: capacity planning needs history. One `iperf3` run can prove a network path. It cannot tell you whether the cluster runs out of IO every weekday at 9:15 AM.
>
> For this kind of troubleshooting, I think of the core tuning loop as an I/O triangle:
>
> - **Disk**: indexing, merging, translog writes, shard recovery, and queue spillover all become disk problems eventually.
> - **Network**: Logstash bulk requests, Elasticsearch transport traffic, replication, and recovery all need clean paths with low retransmits and no drops.
> - **Memory**: filesystem cache, JVM heap, Logstash queues, socket buffers, and paging decide whether the system absorbs bursts or thrashes.
>
> CPU still matters, but often as the tax paid to compress, encrypt, copy, interrupt, parse, merge, and garbage collect. If disk, network, and memory are all healthy, CPU tuning becomes much easier to reason about.

### BCC/eBPF tools

RHEL also ships BCC tooling for low-overhead network tracing. Red Hat's [BCC network tracing documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/network-tracing-using-the-bpf-compiler-collection_configuring-and-managing-networking) includes tools for TCP drops, retransmits, connection latency, TCP session summaries, softirq time, and per-connection throughput.

Useful examples:

```bash
/usr/share/bcc/tools/tcpretrans
/usr/share/bcc/tools/tcpdrop
/usr/share/bcc/tools/tcptop
/usr/share/bcc/tools/tcplife
/usr/share/bcc/tools/softirqs
```

These are not first-line tuning tools. They are excellent when normal counters say "there are retransmits" but not "which connections are doing it and when".

## TCP buffers, backlog, and drops

RHEL defaults are good for most systems, but high-throughput ingest can still hit kernel limits.

Red Hat's RHEL 9 [network performance tuning](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance) documentation highlights several areas that matter for fast NICs:

- NIC ring buffers
- network device backlog queues
- SoftIRQ budget
- TCP socket buffers
- TCP window scaling
- TCP SACK
- TCP timestamps
- Ethernet flow control

The important part is to tune based on counters.

If `/proc/net/softnet_stat` shows the second column incrementing over time, the kernel backlog queue is dropping frames. Red Hat suggests increasing `net.core.netdev_max_backlog` progressively and verifying that the counter stops increasing.

If the third column increments, SoftIRQ processing may not be getting enough budget. Red Hat documents `net.core.netdev_budget` and `net.core.netdev_budget_usecs` for that case.
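
To watch just those two columns, a small reader like the following can be run before and after a busy period. The function name `softnet_summary` is hypothetical, and labelling each row as a CPU assumes that rows follow the order of online CPUs.

```bash
# Print the dropped (column 2) and time_squeeze (column 3) counters from
# softnet_stat-style input, one output line per input row.
# Usage: softnet_summary < /proc/net/softnet_stat
# Assumption: row order corresponds to online CPUs.
softnet_summary() {
    cpu=0
    while read -r processed dropped squeezed _rest; do
        # Values are hexadecimal without a 0x prefix.
        printf 'cpu%d dropped=%d time_squeeze=%d\n' \
            "$cpu" "$((0x$dropped))" "$((0x$squeezed))"
        cpu=$((cpu + 1))
    done
}
```

A growing `dropped` value points at `net.core.netdev_max_backlog`, while a growing `time_squeeze` value points at the SoftIRQ budget settings. Values that stay flat between two runs mean neither setting is the problem.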

If `ss -nti` shows receive queues growing while the application is not reading fast enough, the problem may be application CPU, JVM pauses, Logstash backpressure, or socket buffers. Do not assume it is the switch.

I would avoid throwing a generic "high performance sysctl.conf" at Elasticsearch. It is much better to record the counter that proves the problem, change one setting, and then record the counter again.

## A practical optimisation sequence

If I were working through this cluster, I would go in this order.

### 1. Prove the baseline

Record:

- NIC model, driver, firmware, speed, duplex
- switch port configuration
- MTU
- bond mode and hash policy
- Elasticsearch `network.host`, `http.port`, and `transport.port`
- Logstash Elasticsearch output hosts
- current TuneD profile
- `irqbalance` status
- NIC counters before and after a busy period

This is dull work, which is exactly why it pays off.

### 2. Remove obvious design bottlenecks

Make sure both Logstash servers can write to all appropriate Elasticsearch endpoints. Avoid routing all bulk traffic through one node unless it is a deliberate coordinating or load-balanced design.

If Elasticsearch nodes have multiple interfaces, make sure HTTP and transport publish addresses are correct and stable.

If the switch uplinks are oversubscribed, no Linux tuning will make that disappear.

### 3. Test every important path with iperf3

Run single-stream and parallel-stream tests between:

- each Logstash host and each Elasticsearch host
- each Elasticsearch host and every other Elasticsearch host

Save JSON output where possible. Test both directions. Run tests while watching `sar`, `mpstat`, `ethtool -S`, and switch counters.

### 4. Fix physical and link-layer errors first

Any growing CRC, frame, carrier, FIFO, missed, or dropped counters deserve attention before sysctl tuning. Replace cables, check optics, verify switch port settings, check LACP state, and update NIC firmware or drivers if necessary.
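
"Growing" is the key word, so it helps to compare two snapshots instead of reading absolute values. The sketch below diffs two saved `ethtool -S` outputs. The function name `counter_growth` and the snapshot file names are assumptions for illustration, and counter names vary by driver.

```bash
# Show NIC counters that grew between two "ethtool -S" snapshots.
# Usage:
#   ethtool -S ens1f0 > before.txt
#   (run the test or wait for a busy period)
#   ethtool -S ens1f0 > after.txt
#   counter_growth before.txt after.txt | grep -Ei 'drop|discard|error|crc|miss|fifo'
counter_growth() {
    awk -F': *' '
        NR == FNR { before[$1] = $2; next }      # first file: remember values
        ($1 in before) && ($2 + 0 > before[$1] + 0) {
            name = $1
            sub(/^[ \t]+/, "", name)             # drop the leading indent
            printf "%s +%.0f\n", name, $2 - before[$1]
        }' "$1" "$2"
}
```

Traffic counters such as `rx_packets` will always grow, which is why the usage comment filters for error-like names. Anything left after that filter is worth chasing before touching a sysctl.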

### 5. Choose the right bonding mode

Use `active-backup` if the requirement is failover.

Use `802.3ad` if the requirement is aggregate throughput across many flows and the switch is configured for LACP.

Use `layer3+4` hashing only after confirming it fits the switch and network design.

Do not expect bonding to make one TCP flow faster than one member link.

### 6. Apply RHEL tuning carefully

For throughput-oriented ingest nodes:

```bash
tuned-adm profile network-throughput
```

For rising NIC drops:

```bash
ethtool -S ens1f0 | egrep -i 'drop|discard|error'
ethtool -g ens1f0
# Ring sizes are set on the physical port profile, not on bond0 itself.
nmcli connection modify bond0-eno1 ethtool.ring-rx 4096 ethtool.ring-tx 4096
nmcli connection up bond0-eno1
```

For interrupt and queue issues:

```bash
systemctl enable --now irqbalance
ethtool --show-channels ens1f0
mpstat -P ALL 1
cat /proc/interrupts
```

For jumbo frames:

```bash
nmcli connection modify bond0 mtu 9000
nmcli connection up bond0
ping -c1 -Mdo -s 8972 es02.example.net
nstat -az IpReasm\*
```

Make one change at a time. Keep the before-and-after output.

### 7. Validate with real ingest

Once the network path looks clean, validate with Elasticsearch itself.

[Rally](https://esrally.readthedocs.io/en/stable/) is the proper Elasticsearch benchmarking tool when you want repeatable indexing and search tests. It can run benchmarks, record results, compare runs, and attach telemetry. Use `iperf3` to prove the network path; use Rally or a controlled Logstash replay to prove ingest performance.

During the test, watch:

- Logstash pipeline throughput and queue growth
- Elasticsearch indexing throughput
- bulk request latency
- write rejections or `429` responses
- node CPU and disk IO
- network retransmits
- NIC drops
- shard recovery or relocation traffic

If improving the network does not improve ingest, the bottleneck is probably somewhere else. That is still a useful result.

## What I would recommend for the three-node cluster

For a three-node Elasticsearch cluster fed by two Logstash servers, my default recommendation would be:

- Use at least 10 GbE for the Logstash-to-Elasticsearch and Elasticsearch transport paths; prefer 25 GbE or faster if daily ingest is high or recoveries must complete quickly.
- Put Elasticsearch transport traffic on a reliable, low-latency backend VLAN or subnet where possible.
- Configure Logstash to use all suitable Elasticsearch HTTP endpoints.
- Use `active-backup` bonding where resilience is the main goal.
- Use LACP bonding where aggregate throughput across many flows is needed, with switch configuration and hash policy tested.
- Avoid NIC teaming on new RHEL 9 builds; use bonding.
- Consider jumbo frames only on a fully controlled path.
- Use `network-throughput` TuneD as a tested profile, not as folklore.
- Keep `irqbalance` enabled unless there is a very specific reason not to.
- Use NetworkManager to persist `ethtool` ring, coalescing, offload, and channel settings.
- Use PCP or sysstat to keep historical evidence.
- Use BCC tools when you need to trace TCP retransmits, drops, and connection behaviour.

Most importantly, keep the work evidence-driven.

The best network optimisation is not a magic set of kernel parameters. It is a repeatable loop:

```text
measure -> change one thing -> test again -> keep or revert -> document
```

In an Elasticsearch logging platform, that discipline matters because the network is only one part of the ingest path. Logstash batching, Elasticsearch shard layout, disk IO, JVM heap, filesystem cache, replicas, refresh intervals, and cluster hot spots can all look like "network slowness" from a distance.

Get close enough to the problem, and the tuning usually becomes much less mysterious.

The network shape matters as much as the network speed.

## Start with Elasticsearch and Logstash basics

It is easy to blame Linux networking when the real issue is ingest behaviour.

Elastic's own [indexing speed guidance](https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed) starts with bulk requests, worker concurrency, refresh intervals, replicas during initial loads, filesystem cache, shard layout, and hot spotting. That is the right order. A perfect NIC cannot fix undersized shards, saturated disks, too many concurrent writers to one shard, or Logstash sending badly shaped batches.

For Logstash, the [Elasticsearch output plugin](https://www.elastic.co/docs/reference/logstash/plugins/plugins-outputs-elasticsearch) is also worth reading closely. A few details matter for network planning:

- The plugin sends batches to the Elasticsearch Bulk API.
- Very large batches are split when they exceed 20 MB.
- Using one Elasticsearch output is usually more efficient than creating many outputs, because each output creates its own client and connection pool.
- Modern plugin documentation describes request compression through `compression_level`, which can reduce network IO at the cost of CPU.
- DNS caching and long-lived keepalive connections can affect failover and traffic distribution.

That leads to some practical checks:

- Configure Logstash outputs with all suitable Elasticsearch endpoints, not just `es01`.
- Watch for Elasticsearch `429` responses, rejected writes, indexing pressure, and long bulk latencies.
- Avoid one hot Elasticsearch node receiving most of the Logstash traffic.
- Keep Logstash persistent queues enabled if the platform must absorb short Elasticsearch or network interruptions.
- Tune Logstash batch and worker settings alongside Elasticsearch bulk sizing, not separately from it.

Elasticsearch networking settings should normally stay simple. The [Elasticsearch network settings](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/networking-settings) documentation recommends `network.host` for the common case where a node binds and publishes one address. Advanced bind and publish settings are useful for multi-homed hosts, but they are also a good way to create confusing cluster behaviour if DNS, routing, or interface selection is wrong.

For a production cluster, be deliberate:

```yaml
network.host: 10.20.30.11
http.port: 9200
transport.port: 9300
```

Do not use `0.0.0.0` casually on a multi-homed Elasticsearch node. It can be fine for binding in controlled cases, but the publish address must be reachable by the nodes and clients that need it. If the node publishes the wrong interface, the network can look broken even when the OS is doing exactly what it was told.

## Measure before tuning

I would split testing into three layers.

First, check the physical and OS layer:

```bash
ip -br link
ip -s link show dev ens1f0
ethtool ens1f0
ethtool -S ens1f0 | egrep -i 'drop|discard|error|timeout|miss|fifo|collision'
ethtool -g ens1f0
ethtool -k ens1f0
ethtool --show-channels ens1f0
```

You are looking for boring fundamentals: expected link speed, full duplex, no physical errors, no growing drop counters, sensible ring sizes, enabled offloads, and enough NIC queues for the hardware.

Second, check the kernel and socket layer:

```bash
ss -s
ss -nti
nstat -az Tcp\* Ip\*
awk '{for (i=1; i<=NF; i++) printf strtonum("0x" $i) (i==NF ? "\n" : " ")}' /proc/net/softnet_stat | column -t
mpstat -P ALL 1
sar -n DEV,TCP,ETCP 1
```

The most useful symptoms here are retransmits, receive queues that do not drain, softirq pressure on a small number of CPUs, and kernel backlog drops. If `ss -nti` shows retransmits on Logstash-to-Elasticsearch connections, that is more useful than a vague feeling that "the network is slow".

Third, test the application layer:

```bash
curl -s https://es01.example.net:9200/_nodes/stats/http,transport,indices,thread_pool?pretty
curl -s https://es01.example.net:9200/_cat/nodes?v
curl -s https://es01.example.net:9200/_cat/thread_pool/write?v
curl -s http://logstash01.example.net:9600/_node/stats/pipelines?pretty
```

The exact Elasticsearch thread pool names can vary across major versions, so check your version rather than copying old examples blindly. What matters is the pattern: correlate network counters with bulk latency, write rejections, indexing rate, queue growth, and node-level hot spots.

## Use iperf3, but use it properly

The first active test tool to reach for is `iperf3`.

The [iperf3 project](https://software.es.net/iperf/) describes it as a tool for measuring maximum achievable bandwidth on IP networks, with support for TCP, UDP, SCTP, buffer tuning, zero-copy, and JSON output. Red Hat's [RHEL network performance documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance) also uses `iperf3` for TCP throughput testing, while warning that synthetic test results can differ from real application throughput.

That warning matters. `iperf3` does not tell you how fast Elasticsearch can index. It tells you whether two hosts can move traffic cleanly across the network path.

Install the usual toolbox on the test hosts:

```bash
dnf install iperf3 ethtool sysstat pcp tuned bcc-tools tcpdump
```

Temporarily open the iperf3 port on the receiver:

```bash
firewall-cmd --add-port=5201/tcp --timeout=1h
iperf3 --server
```

Then test from each Logstash node to each Elasticsearch node:

```bash
iperf3 --client es01.example.net --time 60 --omit 5
iperf3 --client es01.example.net --parallel 4 --time 60 --omit 5
iperf3 --client es01.example.net --parallel 8 --time 60 --omit 5 --reverse
iperf3 --client es01.example.net --parallel 8 --time 60 --omit 5 --bidir
iperf3 --client es01.example.net --parallel 8 --time 60 --omit 5 --json
```

I like to run tests in a matrix:

| Source | Destination | Test |
|---|---|---|
| `logstash01` | `es01`, `es02`, `es03` | single stream, parallel streams, reverse |
| `logstash02` | `es01`, `es02`, `es03` | single stream, parallel streams, reverse |
| `es01` | `es02`, `es03` | transport network tests |
| `es02` | `es01`, `es03` | transport network tests |
| `es03` | `es01`, `es02` | transport network tests |

Single-stream testing is important because one TCP flow will often be limited by one CPU path, one NIC queue, one bond member, or one switch hash decision. Parallel-stream testing is important because Logstash and Elasticsearch normally use multiple connections and multiple flows.

For higher speed networks, Red Hat documents `--zerocopy` as useful when simulating zero-copy-capable applications or trying to reach very high single-stream throughput:

```bash
iperf3 --client es01.example.net --time 60 --omit 5 --zerocopy
```

Do not run one test and declare victory. Run a baseline, make one change, run the same test again, and keep the JSON results.

## When NIC bonding helps

Bonding NICs can help in two different ways: availability and aggregate throughput.

Those are not the same thing.

On RHEL, Red Hat's [network bonding documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/configuring-network-bonding_configuring-and-managing-networking) describes bonding as a way to aggregate interfaces into one logical interface for higher throughput or redundancy. It can be configured with `nmcli`, the RHEL web console, `nmtui`, `nmstatectl`, or RHEL system roles.

The practical modes to think about are:

| Bond mode | What it is good for | Switch requirement |
|---|---|---|
| `active-backup` | Redundancy. One link active, another takes over if it fails. | No special switch configuration. |
| `802.3ad` / LACP | Aggregate capacity across multiple flows and link redundancy. | LACP port channel on the switch. |
| `balance-xor` | Static aggregation with hash-based distribution. | Static EtherChannel, not LACP. |
| `balance-rr` | Round-robin packet distribution. | Static EtherChannel, but not a good default for ordered TCP workloads. |

For Elasticsearch and Logstash, `active-backup` is excellent when resilience matters more than throughput. It will not make one Logstash-to-Elasticsearch TCP connection faster, because only one physical link is active at a time.

LACP is the more interesting option for throughput. With `802.3ad`, the bond can distribute different flows across different physical links. This can help a logging stack because you have multiple Logstash hosts, multiple Elasticsearch nodes, HTTP bulk connections, and Elasticsearch transport connections. The aggregate traffic can spread.

But there is a catch: a single TCP flow usually lands on one member link. If you bond two 10 GbE ports with LACP, do not expect one TCP stream to become 20 Gbps. You should expect many flows to have up to 20 Gbps of aggregate headroom, if the switch and host hashing distribute those flows well.

That is why the `xmit_hash_policy` matters. Red Hat documents `layer3+4` as a transmit hash policy that considers IP addresses and ports for port selection. For Logstash and Elasticsearch, where the same hosts may maintain multiple TCP connections, that can distribute traffic better than a policy that only considers MAC addresses or IP addresses. The switch side has to be compatible with the design.

A RHEL 9.4 or later `nmcli` example for an LACP bond looks like this:

```bash
nmcli connection add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4"
nmcli connection add type ethernet port-type bond con-name bond0-eno1 ifname eno1 controller bond0
nmcli connection add type ethernet port-type bond con-name bond0-eno2 ifname eno2 controller bond0
nmcli connection modify bond0 ipv4.addresses 10.20.30.11/24 ipv4.gateway 10.20.30.1 ipv4.method manual
nmcli connection modify bond0 connection.autoconnect-ports 1
nmcli connection up bond0
cat /proc/net/bonding/bond0
```

On older RHEL versions, the `nmcli` terminology may use `master`, `slave-type`, and `connection.autoconnect-slaves` instead of `controller`, `port-type`, and `connection.autoconnect-ports`.

Two cautions are worth spelling out.

First, do not use NIC teaming for a new RHEL 9 design. Red Hat marks [NIC teaming as deprecated in RHEL 9](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/configuring-network-teaming_configuring-and-managing-networking) and recommends the bonding driver instead.

Second, bonding is not a substitute for buying the right NIC speed. If the cluster needs deterministic throughput above 10 Gbps between any two nodes, 25 GbE, 40 GbE, or 100 GbE is usually cleaner than hoping a bond will make every flow faster.

## Jumbo frames: useful, but only when boring

Jumbo frames can reduce packet overhead and CPU work for large contiguous data streams. Red Hat's [RHEL performance documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance) notes that a 9000 byte MTU reduces Ethernet frame overhead compared with the standard 1500 byte payload.

That sounds attractive for Elasticsearch bulk traffic and shard recovery traffic. It can be, especially on a dedicated backend network.

The problem is that every device in the path must agree: source NIC, switch ports, port channels, VLANs, routed interfaces, firewalls, destination NIC, and sometimes virtual switches. A partial jumbo-frame configuration is worse than no jumbo frames because it creates fragmentation, drops, and strange latency.

If you enable jumbo frames, do it on a dedicated ingest or cluster transport network and test it explicitly:

```bash
nmcli connection modify bond0 mtu 9000
nmcli connection up bond0
ip link show dev bond0
nstat -az IpReasm\*
ping -c1 -Mdo -s 8972 es02.example.net
```

The ping payload calculation for IPv4 is `MTU - 8 bytes ICMP header - 20 bytes IPv4 header`, so `8972` is the common test size for a 9000 byte MTU.

If there is any doubt, stay at 1500 MTU until you can test the whole path properly.

## RHEL tools that are worth using

RHEL has a very good network performance toolbox, and the best part is that much of it integrates with NetworkManager rather than disappearing after reboot.

### TuneD

`tuned` is the first RHEL-specific tool I would check.

Red Hat's [TuneD documentation for RHEL 9](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/monitoring_and_managing_system_status_and_performance/monitoring_and_managing_system_status_and_performance) lists `network-throughput` as a profile for streaming network throughput and `network-latency` as a profile focused on low-latency network performance.

For Elasticsearch ingest nodes, `network-throughput` is usually the more natural starting point:

```bash
dnf install tuned
systemctl enable --now tuned
tuned-adm list
tuned-adm active
tuned-adm profile network-throughput
tuned-adm verify
```

Do this in a maintenance window and measure before and after. TuneD profiles are useful, but they are still system-wide behaviour changes.

### NetworkManager and ethtool settings

The old habit was to run `ethtool -G` or `ethtool -K`, then forget that the setting would vanish on reboot. RHEL's [NetworkManager ethtool settings](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/configuring-ethtool-settings-in-networkmanager-connection-profiles_configuring-and-managing-networking) avoid that by storing offload, coalescing, ring buffer, and channel settings in the connection profile.

Useful checks:

```bash
ethtool -S ens1f0 | egrep -i 'drop|discard|error|miss|timeout'
ethtool -g ens1f0
ethtool -k ens1f0
ethtool --show-coalesce ens1f0
ethtool --show-channels ens1f0
```

If RX drops are rising and the NIC supports larger rings:

```bash
nmcli connection modify bond0 ethtool.ring-rx 4096
nmcli connection modify bond0 ethtool.ring-tx 4096
nmcli connection up bond0
```

If interrupt rate is too high and CPU is spending too much time handling packets, interrupt coalescing can improve throughput:

```bash
nmcli connection modify bond0 ethtool.coalesce-rx-frames 128
nmcli connection up bond0
```

That can add latency, so test it with real ingest traffic. For logging pipelines, a tiny increase in latency may be acceptable if it buys lower CPU usage and fewer drops. For latency-sensitive request paths, it might not be.

If the NIC has more channels available than it is using:

```bash
ethtool --show-channels ens1f0
nmcli connection modify bond0 ethtool.channels-combined 8
nmcli connection up bond0
```

Do not blindly set channels to the maximum. Match the NIC, CPU topology, interrupt distribution, and workload.

### irqbalance

On RHEL, `irqbalance` is enabled by default and should normally stay enabled. Red Hat warns that disabling it can hurt network throughput. If one CPU is doing all interrupt work, network performance can suffer even when the link itself is fine.

Check it:

```bash
systemctl status irqbalance
cat /proc/interrupts
mpstat -P ALL 1
```

If CPU 0 is much busier than everything else during network tests, interrupts and queues are worth investigating.

### PCP and sysstat

For historical evidence, use `sysstat` and Performance Co-Pilot.

`sar` is still excellent for quick before-and-after work:

```bash
sar -n DEV,TCP,ETCP 1
sar -n SOCK 1
```

Performance Co-Pilot is more powerful for longer investigations. Red Hat's [PCP data sheet](https://access.redhat.com/articles/3119481) describes it as a supported system-level performance monitoring suite with live and archived metrics, broad Linux coverage, and tools that overlap with familiar utilities such as `iostat`, `pidstat`, `vmstat`, and `mpstat`.

For an Elasticsearch cluster, PCP is useful because the performance problem may have happened at 2:00 AM during an index rollover, shard recovery, or log storm. Live commands are nice. Archived metrics are better.

> ## nmon and old-school capacity planning
>
> Before every platform had a time-series database attached to it, `nmon` was one of the nicer ways to get long-running performance evidence from Unix and Linux systems.
>
> The appeal was simple: start a collector, let it run through real business cycles, then analyse the generated files later. That made it useful for capacity planning because you could capture quiet periods, daily peaks, backup windows, month-end processing, and the awkward spikes that never happen while someone is watching a terminal.
>
> A typical collection pattern looked like this:
>
> ```bash
> nmon -f -s 60 -c 1440
> ```
>
> That records one sample every 60 seconds for 24 hours. For longer studies, you would schedule it from cron, keep the `.nmon` files, and compare days or weeks rather than arguing from one busy five-minute sample.
>
> The reporting workflow was also very practical. You could feed the generated files into a spreadsheet-based analyser, often a custom Excel workbook, and turn raw counters into charts for CPU, disk, memory, paging, network, and process behaviour. In more automated environments, the same kind of data could be pushed into an RRDTool-backed web view so teams could browse historical graphs without passing spreadsheets around.
>
> The lesson still applies even if the modern tooling is PCP, Prometheus, Grafana, Elastic monitoring, or a vendor platform: capacity planning needs history. One `iperf3` run can prove a network path. It cannot tell you whether the cluster runs out of IO every weekday at 9:15 AM.
>
> For this kind of troubleshooting, I think of the core tuning loop as an I/O triangle:
>
> - **Disk**: indexing, merging, translog writes, shard recovery, and queue spillover all become disk problems eventually.
> - **Network**: Logstash bulk requests, Elasticsearch transport traffic, replication, and recovery all need clean paths with low retransmits and no drops.
> - **Memory**: filesystem cache, JVM heap, Logstash queues, socket buffers, and paging decide whether the system absorbs bursts or thrashes.
>
> CPU still matters, but often as the tax paid to compress, encrypt, copy, interrupt, parse, merge, and garbage collect. If disk, network, and memory are all healthy, CPU tuning becomes much easier to reason about.

### BCC/eBPF tools

RHEL also ships BCC tooling for low-overhead network tracing. Red Hat's [BCC network tracing documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/configuring_and_managing_networking/network-tracing-using-the-bpf-compiler-collection_configuring-and-managing-networking) includes tools for TCP drops, retransmits, connection latency, TCP session summaries, softirq time, and per-connection throughput.

Useful examples:

```bash
/usr/share/bcc/tools/tcpretrans
/usr/share/bcc/tools/tcpdrop
/usr/share/bcc/tools/tcptop
/usr/share/bcc/tools/tcplife
/usr/share/bcc/tools/softirqs
```

These are not first-line tuning tools. They are excellent when normal counters say "there are retransmits" but not "which connections are doing it and when".

## TCP buffers, backlog, and drops

RHEL defaults are good for most systems, but high-throughput ingest can still hit kernel limits.

Red Hat's RHEL 9 [network performance tuning](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance) documentation highlights several areas that matter for fast NICs:

- NIC ring buffers
- network device backlog queues
- SoftIRQ budget
- TCP socket buffers
- TCP window scaling
- TCP SACK
- TCP timestamps
- Ethernet flow control

The important part is to tune based on counters.

If `/proc/net/softnet_stat` shows the second column incrementing over time, the kernel backlog queue is dropping frames. Red Hat suggests increasing `net.core.netdev_max_backlog` progressively and verifying that the counter stops increasing.

If the third column increments, SoftIRQ processing may not be getting enough budget. Red Hat documents `net.core.netdev_budget` and `net.core.netdev_budget_usecs` for that case.

If `ss -nti` shows receive queues growing while the application is not reading fast enough, the problem may be application CPU, JVM pauses, Logstash backpressure, or socket buffers. Do not assume it is the switch.

I would avoid throwing a generic "high performance sysctl.conf" at Elasticsearch. It is much better to record the counter that proves the problem, change one setting, and then record the counter again.

## A practical optimisation sequence

If I were working through this cluster, I would go in this order.

### 1. Prove the baseline

Record:

- NIC model, driver, firmware, speed, duplex
- switch port configuration
- MTU
- bond mode and hash policy
- Elasticsearch `network.host`, `http.port`, and `transport.port`
- Logstash Elasticsearch output hosts
- current TuneD profile
- `irqbalance` status
- NIC counters before and after a busy period

This is dull work, which is exactly why it pays off.

### 2. Remove obvious design bottlenecks

Make sure both Logstash servers can write to all appropriate Elasticsearch endpoints. Avoid routing all bulk traffic through one node unless it is a deliberate coordinating or load-balanced design.

If Elasticsearch nodes have multiple interfaces, make sure HTTP and transport publish addresses are correct and stable.

If the switch uplinks are oversubscribed, no Linux tuning will make that disappear.

### 3. Test every important path with iperf3

Run single-stream and parallel-stream tests between:

- each Logstash host and each Elasticsearch host
- each Elasticsearch host and every other Elasticsearch host

Save JSON output where possible. Test both directions. Run tests while watching `sar`, `mpstat`, `ethtool -S`, and switch counters.

### 4. Fix physical and link-layer errors first

Any growing CRC, frame, carrier, FIFO, missed, or dropped counters deserve attention before sysctl tuning. Replace cables, check optics, verify switch port settings, check LACP state, and update NIC firmware or drivers if necessary.

### 5. Choose the right bonding mode

Use `active-backup` if the requirement is failover.

Use `802.3ad` if the requirement is aggregate throughput across many flows and the switch is configured for LACP.

Use `layer3+4` hashing only after confirming it fits the switch and network design.

Do not expect bonding to make one TCP flow faster than one member link.

### 6. Apply RHEL tuning carefully

For throughput-oriented ingest nodes:

```bash
tuned-adm profile network-throughput
```

For rising NIC drops:

```bash
ethtool -S ens1f0 | egrep -i 'drop|discard|error'
ethtool -g ens1f0
nmcli connection modify bond0 ethtool.ring-rx 4096 ethtool.ring-tx 4096
nmcli connection up bond0
```

For interrupt and queue issues:

```bash
systemctl enable --now irqbalance
ethtool --show-channels ens1f0
mpstat -P ALL 1
cat /proc/interrupts
```

For jumbo frames:

```bash
nmcli connection modify bond0 mtu 9000
nmcli connection up bond0
ping -c1 -Mdo -s 8972 es02.example.net
nstat -az IpReasm\*
```

Make one change at a time. Keep the before-and-after output.

### 7. Validate with real ingest

Once the network path looks clean, validate with Elasticsearch itself.

[Rally](https://esrally.readthedocs.io/en/stable/) is the proper Elasticsearch benchmarking tool when you want repeatable indexing and search tests. It can run benchmarks, record results, compare runs, and attach telemetry. Use `iperf3` to prove the network path; use Rally or a controlled Logstash replay to prove ingest performance.

During the test, watch:

- Logstash pipeline throughput and queue growth
- Elasticsearch indexing throughput
- bulk request latency
- write rejections or `429` responses
- node CPU and disk IO
- network retransmits
- NIC drops
- shard recovery or relocation traffic

If improving the network does not improve ingest, the bottleneck is probably somewhere else. That is still a useful result.

## What I would recommend for the three-node cluster

For a three-node Elasticsearch cluster fed by two Logstash servers, my default recommendation would be:

- Use at least 10 GbE for the Logstash-to-Elasticsearch and Elasticsearch transport paths; prefer 25 GbE or faster if daily ingest is high or recoveries must complete quickly.
- Put Elasticsearch transport traffic on a reliable, low-latency backend VLAN or subnet where possible.
- Configure Logstash to use all suitable Elasticsearch HTTP endpoints.
- Use `active-backup` bonding where resilience is the main goal.
- Use LACP bonding where aggregate throughput across many flows is needed, with switch configuration and hash policy tested.
- Avoid NIC teaming on new RHEL 9 builds; use bonding.
- Consider jumbo frames only on a fully controlled path.
- Use `network-throughput` TuneD as a tested profile, not as folklore.
- Keep `irqbalance` enabled unless there is a very specific reason not to.
- Use NetworkManager to persist `ethtool` ring, coalescing, offload, and channel settings.
- Use PCP or sysstat to keep historical evidence.
- Use BCC tools when you need to trace TCP retransmits, drops, and connection behaviour.

Most importantly, keep the work evidence-driven.

The best network optimisation is not a magic set of kernel parameters. It is a repeatable loop:

```text
measure -> change one thing -> test again -> keep or revert -> document
```

In an Elasticsearch logging platform, that discipline matters because the network is only one part of the ingest path. Logstash batching, Elasticsearch shard layout, disk IO, JVM heap, filesystem cache, replicas, refresh intervals, and cluster hot spots can all look like "network slowness" from a distance.

Get close enough to the problem, and the tuning usually becomes much less mysterious.