Logstash output traffic significantly higher than input (minimal transformations) + TCP RST

Hi,

I'm investigating a situation where Logstash receives a certain amount of data, but the traffic sent from Logstash to Elasticsearch is significantly larger, even though I perform minimal transformations.

Setup

  • Inputs from multiple Winlogbeat agents
  • Output to Elasticsearch.
  • The pipeline performs only basic operations (data stream routing, lowercasing, and removing a few fields)
  • Compression is enabled for the Elasticsearch output.

Logstash pipeline:


input {
  beats {
    port => 15044
    ssl_enabled => true
    ssl_client_authentication   => "optional"
    ssl_certificate_authorities => "/appl/logstashdata/certs/http_ca.crt"
    ssl_certificate             => "/appl/logstashdata/certs/testlogstash1/testlogstash1.crt"
    ssl_key                     => "/appl/logstashdata/certs/testlogstash1/testlogstash1.key"
  }
}

filter {
  if [agent][version] =~ /^8\./ {
    mutate {
      replace => {
        "[data_stream][type]" => "logs"
        "[data_stream][dataset]" => "winlogbeat_8"
        "[data_stream][namespace]" => "default"
      }
    }
  } else if [agent][version] =~ /^9\./ {
    mutate {
      replace => {
        "[data_stream][type]" => "logs"
        "[data_stream][dataset]" => "winlogbeat_9"
        "[data_stream][namespace]" => "default"
      }
    }
  } else if [agent][version] =~ /^10\./ {
    mutate {
      replace => {
        "[data_stream][type]" => "logs"
        "[data_stream][dataset]" => "winlogbeat_10"
        "[data_stream][namespace]" => "default"
      }
    }
  } else {
    mutate {
      replace => {
        "[data_stream][type]" => "logs"
        "[data_stream][dataset]" => "winlogbeat_fallback"
        "[data_stream][namespace]" => "default"
      }
    }
  }

  mutate {
    lowercase => [
      "[log][level]",
      "[agent][hostname]",
      "[host][name]",
      "[winlog][computer_name]",
      "[winlog][event_data][SubjectDomainName]"
    ]
  }


  mutate {
    remove_field => [
      "[@version]",
      "[agent][id]",
      "[agent][name]",
      "[agent][type]",
      "[ecs][version]",
      "[winlog][event_data][Binary]",
      "[event][original]"
    ]
  }

}

output {
  elasticsearch {
    hosts => [
      "https://testelastic1:9200",
      "https://testelastic2:9200",
      "https://testelastic3:9200"
    ]

    api_key => "${ES_LOGSTASH_API_KEY}"

    ssl_enabled => true
    ssl_certificate_authorities => "/appl/logstashdata/certs/http_ca.crt"
    ssl_verification_mode => "full"

    data_stream => "true"

    compression_level => 7
  }
}


Here is the winlogbeat.yml configuration:

winlogbeat.event_logs: 
  - name: Application 
    ignore_older: 24h 
  - name: System 
    ignore_older: 24h 
  - name: Security 
    ignore_older: 24h 
  - name: Windows PowerShell 
    ignore_older: 24h 
  - name: Microsoft-Windows-PowerShell/Operational 
    ignore_older: 24h 
  - name: Microsoft-Windows-Windows Defender/Operational 
    ignore_older: 24h
 
fields:
  project:
    name: "codera"
  env: "test"
fields_under_root: true

output.logstash:
  compression_level: 7
  bulk_max_size: 32
  loadbalance: true
  hosts:
    - "testlogstash1:15044"
    - "testlogstash2:15044"
  ssl:
    enabled: true
    ...
 
logging.level: warning
logging.to_eventlog: true
monitoring.enabled: false

Observation

  • Verified via vnstat and Packetbeat on the Logstash servers.

Screenshot from testlogstash1

  • Input traffic (from Beats to Logstash) is noticeably smaller than output traffic (from Logstash to Elasticsearch), even with compression enabled.

  • The difference is surprisingly large.

Question

Why does Logstash generate so much more outbound traffic than it receives?

Could this be caused by:

  • Metadata or structure overhead in the bulk API? (see the sketch after this list)

  • Something else in how Logstash handles events?
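
For context, my understanding is that each event in a _bulk request is preceded by an action/metadata line, so a single event ends up looking roughly like this before compression (the values are illustrative; real Winlogbeat documents carry far more fields):

{ "create": { "_index": "logs-winlogbeat_8-default" } }
{ "@timestamp": "2025-01-01T09:00:00.000Z", "data_stream": { "type": "logs", "dataset": "winlogbeat_8", "namespace": "default" }, "host": { "name": "winhost01" }, "winlog": { "computer_name": "winhost01", "event_id": "4624" }, "message": "An account was successfully logged on." }

So there is some per-event overhead from the action/metadata line, but I would not expect it to explain a large gap on its own.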

Can you share both logstash.yml and pipelines.yml? Is this your only pipeline?

What do you have in the Logstash logs? Any errors regarding retries?

Hello @leandrojmp,

Logstash Configuration & OS

pipelines.yml

- pipeline.id: winlogbeat
  path.config: "/etc/logstash/pipelines/winlogbeat.conf"

logstash.yml

path.data: /appl/logstashdata
path.logs: /appl/logstashlog
pipeline.workers: 2
queue.type: persisted
queue.max_bytes: 4096mb

Logstash JVM bootstrap flags:

JVM bootstrap flags: [-Xms6g, -Xmx6g, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djruby.compile.invokedynamic=true, -XX:+HeapDumpOnOutOfMemoryError, -Djava.security.egd=file:/dev/urandom, -Dlog4j2.isThreadContextMapInheritable=true, -Dlogstash.jackson.stream-read-constraints.max-string-length=200000000, -Dlogstash.jackson.stream-read-constraints.max-number-length=10000, -Djruby.regexp.interruptible=true, -Djdk.io.File.enableADS=true, --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED, --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED, --add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED, --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED, --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED, --add-opens=java.base/java.security=ALL-UNNAMED, --add-opens=java.base/java.io=ALL-UNNAMED, --add-opens=java.base/java.nio.channels=ALL-UNNAMED, --add-opens=java.base/sun.nio.ch=ALL-UNNAMED, --add-opens=java.management/sun.management=ALL-UNNAMED, -Dio.netty.allocator.maxOrder=11]

OS

free -mh
               total        used        free      shared  buff/cache   available
Mem:            15Gi       3.7Gi       6.1Gi       293Mi       6.3Gi        11Gi
Swap:             0B          0B          0B

nproc
2

CPU is 97% idle.

Log Inspection

From the Winlogbeat agent logs, it appears that Logstash is unexpectedly closing the connection. Logs are being sent from around 35 Windows servers to two identical Logstash instances.

These two messages occur very frequently:

Failed to publish events caused by: EOF

failed to publish events: write tcp WINLOGBEAT_IP:PORT->LOGSTASH_IP:15044: wsasend: An existing connection was forcibly closed by the remote host

I don’t see any information in the Logstash logs indicating that it’s closing the connection — no errors or warnings either.

Running Logstash With Debug Logging

So I ran Logstash with log.level: debug.
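
As a side note, instead of restarting with global debug logging, verbosity can also be raised for just the Beats input at runtime through the Logstash monitoring API on port 9600 (a sketch; the exact logger names are my assumption based on the plugin's class names):

# raise verbosity for the Beats input only
curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' \
  -d '{"logger.logstash.inputs.beats": "DEBUG", "logger.org.logstash.beats": "DEBUG"}'

# revert to the levels defined in log4j2.properties
curl -XPUT 'localhost:9600/_node/logging/reset?pretty'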

Out of thousands of events, I selected two cases where Winlogbeat (WINLOGBEAT_IP:PORT) failed to send data to Logstash (port 15044) with the error:

An existing connection was forcibly closed by the remote host

Logstash DEBUG logs show no sign of closing the connection — the last entries at 09:01:15 indicate normal communication.

Winlogbeat Logs – Connection Error

Selected Logstash Logs – No Disconnect Logged

However, I don’t see any signs of retransmission in the logs.
I’ll take a look using tcpdump as well.
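
For the capture, something along these lines on the Logstash host should be enough to see resets live and keep a pcap for Wireshark (the interface name and file path are illustrative):

# full packets on the Beats port, saved for later analysis in Wireshark
tcpdump -i ens192 -s 0 -w /tmp/beats_15044.pcap 'tcp port 15044'

# quick live view of RST packets only
tcpdump -i ens192 -n 'tcp port 15044 and (tcp[tcpflags] & tcp-rst) != 0'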

I’ve identified that TCP retransmissions and duplicate ACKs are occurring on both sides: between Winlogbeat agents and Logstash, as well as between Logstash and Elasticsearch.

However, based on system metrics and mtr results, the network between Logstash and Elasticsearch shows 0% packet loss, minimal latency (~0.2 ms), and no jitter. Both Logstash and Elasticsearch are reporting low CPU and memory usage.

Despite this, retransmissions are still observed in the packet captures on both segments of the pipeline.

Current Linux kernel TCP buffer settings on both the Logstash host and the Elasticsearch nodes:
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
net.core.rmem_max = 212992 # ≈208 kB hard ceiling
net.core.wmem_max = 212992 # ≈208 kB hard ceiling
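
Per-connection TCP state on the Logstash host can also be checked live with ss, which shows the congestion window, retransmit counters and advertised windows per socket (a sketch):

# connections toward Elasticsearch
ss -tino 'dport = :9200'

# connections on the Beats listener
ss -tino 'sport = :15044'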

Based on the screenshots, the Elasticsearch cluster appears to be healthy and performing well. CPU usage, JVM heap, indexing/search rates, and latencies are all within normal ranges. There are no alerts, and the cluster status is green, indicating no issues with performance or resource utilization.


You don't need packet loss to have retransmissions. If something is flipping the order of packets then the congestion window may be closed and that will severely restrict throughput, whilst also increasing retransmissions.

Do you normally collect anything from the Logstash machine, or did you just configure an agent for this troubleshooting?

From what I’ve seen, this doesn’t look like a packet reordering issue: no out_of_order packets show up in the captures (tcp.analysis.out_of_order in Wireshark).

But the connection between Winlogbeat agents and Logstash is clearly unstable, probably due to delays or congestion. Here’s what’s actually happening:

  • Logstash sends TCP RST packets, which means it’s force-closing connections instead of doing a proper TCP FIN handshake.
  • Winlogbeat agents send TCP retransmissions and spurious retransmissions, suggesting they’re not getting ACKs in time and assume the packets were lost.
  • Logstash replies with TCP Duplicate ACKs, basically saying “I’m still waiting for the same segment.”
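
Counting those symptoms per capture is straightforward with tshark (a sketch; the pcap filename is illustrative):

# resets, duplicate ACKs and (spurious) retransmissions seen in the capture
tshark -r beats_15044.pcap -Y 'tcp.flags.reset == 1' | wc -l
tshark -r beats_15044.pcap -Y 'tcp.analysis.duplicate_ack' | wc -l
tshark -r beats_15044.pcap -Y 'tcp.analysis.retransmission || tcp.analysis.spurious_retransmission' | wc -l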

To help with possible congestion, I bumped the TCP buffers and changed the congestion control algorithm:

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 262144 16777216"
sysctl -w net.ipv4.tcp_congestion_control=bbr
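
These sysctl -w changes are runtime only; to keep them across reboots they can go into a drop-in file (the filename is illustrative, and bbr assumes the tcp_bbr module is available on the kernel in use):

# /etc/sysctl.d/90-logstash-tcp.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.ipv4.tcp_congestion_control = bbr

# apply without a reboot
sysctl --system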

I also considered the possibility that some intermediate network device might be terminating the connections between Winlogbeat agents and Logstash. To test this, I temporarily adjusted the TCP keepalive settings:

sysctl -w net.ipv4.tcp_keepalive_time=30
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=5

However, this didn’t resolve the issue either.

Moreover, based on the packet captures, it’s clear that Logstash is reducing its TCP receive window size.

Logstash still sends TCP RST packets.

No, I’m not collecting any additional logs from Logstash. Logstash is only receiving Windows logs from remote Winlogbeat agents.

You shared a screenshot with Logstash logs in Elasticsearch, so how are those logs arriving there?

Using tcpdump, I monitored traffic on the Logstash host on port 9200 too, and I observed TCP RST packets being sent both from Logstash to Elasticsearch and from Elasticsearch back to Logstash.

The TCP window size during these events was 63 or 128 bytes.

The same applies to tcp.analysis.duplicate_ack (in Wireshark) — I’m seeing duplicate ACKs in both directions on that connection as well.
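
One thing worth double-checking is whether that 63/128 is the raw window field or the value after window scaling; tshark can print both side by side (a sketch; the pcap filename is illustrative):

tshark -r es_9200.pcap -Y 'tcp.flags.reset == 1 || tcp.analysis.duplicate_ack' \
  -T fields -e frame.time -e ip.src -e tcp.window_size_value -e tcp.window_size_scalefactor -e tcp.window_size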

Ah, you’re right — I didn’t realize that. These logs are being sent directly from Filebeat to the Elasticsearch cluster, not through Logstash.

Here is the filebeat.yml configuration:

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

setup.template.settings:
  index.number_of_shards: 1

setup.kibana:
  host: "https://<logstash-host>:5601"
  ssl.certificate_authorities: "/etc/filebeat/certs/<ca-cert>.crt"

output.elasticsearch:
  hosts: ['https://<es-node1>:9200', 'https://<es-node2>:9200', 'https://<es-node3>:9200']
  preset: balanced
  protocol: "https"
  api_key: "${ES_API_KEY}"
  ssl:
    enabled: true
    ca_trusted_fingerprint: "<fingerprint>"
  compression_level: 7


logging.level: warning

with the logstash module enabled in /etc/filebeat/modules.d/logstash.yml:

- module: logstash
  log:
    enabled: true
    var.paths:
      - /appl/logstashlog/logstash-plain.log

  slowlog:
    enabled: true
    var.paths:
      - /appl/logstashlog/logstash-slowlog-plain.log

So Filebeat is also collecting Logstash logs and sending them directly to Elasticsearch.
There are no errors reported in the Filebeat logs.

Since I increased the timeout values in the Logstash configuration:

In the input section for Beats:

input {
  beats {
    port => 15044
    client_inactivity_timeout => 300   # raised from the 60-second default
  }
}

And in the output section for Elasticsearch:

output {
  elasticsearch {
    timeout => 120   # request timeout toward Elasticsearch; the default is 60 seconds
  }
}

the number of

“wsasend: An existing connection was forcibly closed by the remote host”

messages from Winlogbeat agents has significantly decreased — see the attached screenshot.

The current timeout values feel excessively high to me.
The link has a capacity of only 4 MB, but it’s not even utilized halfway.
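
To rule out the path itself, raw TCP throughput between the hosts can be measured with iperf3, independently of Logstash (a sketch; run the server side first):

# on the receiving end, e.g. the Logstash host
iperf3 -s

# from the sending end: a 30-second throughput test
iperf3 -c testlogstash1 -t 30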

I’ve performed further analysis using tcpdump and Wireshark. The number of TCP duplicate ACKs and retransmissions between the Winlogbeat agents and Logstash is now minimal.

However, the situation between Logstash and the Elasticsearch cluster remains the same — there are still many TCP Duplicate ACKs and also RST packets being sent from Logstash to the Elasticsearch nodes.

So, if you have other things running on the same machine sending data to Elasticsearch as well, the amount of data sent will always be higher than the amount of data received.

For example, if you have Logstash logging set to DEBUG, this will send a lot of data, as debug logs are noisy.

Same thing if you are collecting metrics with metricbeat or packetbeat, both generate a lot of data.

What exactly do you have running on this VM sending data to Elasticsearch?

There was another thread about the RSTs recently. I posted some tcpdump analysis. It was related to channels which have no data (idle) for a given period. In that case it was filebeat <—> logstash. One side seemed to use a keepalive counter, the other (logstash) didn’t.

That other case, like this one, wasn’t impacting me, I was just curious. And there was a workaround.

But if you’d consider the RSTs from logstash rather than proper FINs a bug, either practically troublesome or just poor networking etiquette, you might wish to open a bug :bug:


This seems to be expected behavior from Logstash — it closes idle connections with a TCP RST when client_inactivity_timeout is reached. Technically it’s valid, but not very “clean” compared to a graceful FIN/ACK close.

In an ideal setup, Beats agents would send TCP keepalives during idle periods to keep the connection open. Since they don’t, Logstash interprets the silence as inactivity and forcefully closes the connection.

Not a critical issue, but it can cause confusion during traffic analysis or connection troubleshooting. Increasing the timeout helped in my case.


IIRC filebeat did send keepalives, every 15 seconds, and logstash ACKed them. But it kept no keepalive timer (you can see this with netstat), so it effectively ignored them and then sent the RST.


You’re right @RainTown — I checked it in Wireshark and confirmed that Winlogbeat sends a TCP Keep-Alive approximately every 15 seconds, and Logstash responds with a Keep-Alive ACK.

So it can be said that Logstash’s client_inactivity_timeout ignores TCP-level keep-alives and only considers application-level data as activity — meaning that even if the TCP connection is alive, Logstash may still close it if no logs are sent?

Yep, that was my conclusion too. So if you have a sometimes-quiet channel, you need to set a large client_inactivity_timeout. Not ideal, but is that a bug given there is a way to work around it? I know some network people don't really like long-running TCP sessions if no real data is being exchanged. I don't really have a strong (informed) view.
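
For anyone landing on this thread later, the workaround boils down to raising the idle timeout on the Beats input, for example (the value is illustrative):

input {
  beats {
    port => 15044
    # keep quiet connections open longer; application data resets this timer, TCP keepalives do not
    client_inactivity_timeout => 3600
  }
}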