Capturing and Uploading JVM Thread Dumps (or other logs) in Kubernetes with Fluent Bit
If you’re reading this, you’re probably already taking manual thread dumps of running Java apps in your Kubernetes cluster (or at least know how). Recently, though, we needed to debug an issue that was causing our app container to restart randomly. Nothing was obviously wrong with JVM metrics, resource utilization, etc., so we decided that a thread dump taken at the exact moment before the container restarts could help identify the problem (SPOILER: it did). We found we could capture that thread dump with Fluent Bit, the lightweight and powerful log processor/forwarder we already run as a DaemonSet to collect container logs from our cluster. For this solution, however, we deployed Fluent Bit as a sidecar alongside the app to collect and forward the thread dump on a per-app basis.
In this article, I’ll walk through our solution to capture a thread dump just before an app container terminates, and then immediately process and upload the file to an S3 bucket — all within two minutes of the container receiving a termination signal.
Prerequisites
- A Kubernetes cluster ≥ v1.14
- An S3 bucket able to receive uploads from your cluster or application
- Basic knowledge of Kubernetes and Docker
This simple Spring Boot app (taken from Baeldung) exposes one endpoint and returns a static message:
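The original snippet isn’t reproduced here, so as a minimal sketch (class and endpoint names are assumptions, and it assumes the standard spring-boot-starter-web dependency), the app might look like:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sketch of a minimal Spring Boot app in the style of the
// Baeldung example: one endpoint returning a static message.
@SpringBootApplication
@RestController
public class MessageServerApplication {

    // Single endpoint; the path and message are assumptions for illustration.
    @GetMapping("/message")
    public String getMessage() {
        return "Hello from message-server!";
    }

    public static void main(String[] args) {
        SpringApplication.run(MessageServerApplication.class, args);
    }
}
```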
To keep things simple, I won’t go into detail about build specifications with Maven or Gradle for this app. I’ll skip ahead to the point where we have a compiled JAR in the target/ directory, ready to containerize.
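As a sketch, a Dockerfile for that JAR might look like the following; the base image and JAR name are assumptions, not our exact build:

```dockerfile
# Hypothetical Dockerfile sketch; base image and JAR name are assumptions.
# A full JDK image (rather than a JRE) is used so that jstack is available
# inside the container for the preStop thread dump.
FROM eclipse-temurin:17-jdk
WORKDIR /app
COPY target/message-server-1.0.0.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
```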
Let’s build and push this image to Docker Hub or your own repository:
➜ docker build -t yourdockerrepo/message-server:1.0.0 .
➜ docker push yourdockerrepo/message-server:1.0.0
Now we’re ready to deploy the application with Fluent Bit as a sidecar container (in the same pod with a shared volume). The general design of this deployment looks something like this:
Things to note
- Authentication with AWS is handled in our cluster with kube2iam, as indicated by the iam.amazonaws.com/role annotation. This is only required if you need to authenticate before uploading the thread dump file to S3.
- terminationGracePeriodSeconds is set to 90 seconds instead of the default 30 seconds to allow some time for Fluent Bit to execute.
- The preStop container lifecycle hook allows us to execute some logic as soon as the pod begins terminating, before the container receives the termination signal (SIGTERM).
- I used jstack to take the thread dump and save it in the shared emptyDir volume, in the /var/log/jvm/ directory that Fluent Bit watches.
- The app container then sleeps for 2 minutes so the fluent-bit container can process and forward the log file to S3. This sleep step is important; without it, the pod would terminate as soon as the preStop stage for this container completed.
- The fluent-bit container also sleeps for 2 minutes while it processes/forwards the log file. NOTE: although both containers sleep for 2 minutes, terminationGracePeriodSeconds will not allow the pod to live beyond 90 seconds after termination begins.
- Fluent Bit watches for files added to the shared emptyDir volume, and executes its configuration (detailed below) for any new files it detects that match its input criteria.
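The notes above can be sketched as a fragment of the Deployment’s pod spec. This is a minimal sketch, not our exact manifest: the container names, image tags, IAM role name, and dump filename are assumptions.

```yaml
# Hypothetical pod spec fragment illustrating the design described above.
spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: thread-dump-uploader  # kube2iam role (assumed name)
    spec:
      terminationGracePeriodSeconds: 90
      volumes:
        - name: jvm-dumps
          emptyDir: {}
      containers:
        - name: message-server
          image: yourdockerrepo/message-server:1.0.0
          volumeMounts:
            - name: jvm-dumps
              mountPath: /var/log/jvm
          lifecycle:
            preStop:
              exec:
                # Dump the JVM's threads (PID 1 in the container) into the
                # shared volume, then sleep so Fluent Bit can pick up the file.
                command: ["/bin/sh", "-c",
                  "jstack 1 > /var/log/jvm/thread-dump-$(date +%s).log; sleep 120"]
        - name: fluent-bit
          image: fluent/fluent-bit:1.9
          volumeMounts:
            - name: jvm-dumps
              mountPath: /var/log/jvm
          lifecycle:
            preStop:
              exec:
                # Keep the sidecar alive while it processes/forwards the dump.
                command: ["/bin/sh", "-c", "sleep 120"]
```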
This ConfigMap is how Fluent Bit knows how to process the thread dump file and stream the output to a file in an S3 bucket. Here we use the tail input plugin, which begins reading at the head of any thread dump file in the /var/log/jvm/ directory that matches the configured path pattern. The output configuration targets an S3 bucket called message-server-thread-dumps. I kept the upload timeout fairly short so that a failed upload fails quickly.
Fluent Bit output is always JSON. Since thread dumps are not really useful to us as JSON, I enabled multiline configuration so that Fluent Bit reads the entire thread dump file into the first JSON value as one very long string. We then parse that string with another script after retrieving the output file from S3 (I won’t detail how we do this here).
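A Fluent Bit configuration along these lines might look like the sketch below. The file match pattern, region, parser name, and timeout value are assumptions; the multiline parser itself would need to be defined in a separate parsers file.

```ini
# Hypothetical fluent-bit.conf sketch; match pattern, region, and parser
# name are assumptions, not our exact configuration.
[INPUT]
    Name              tail
    Path              /var/log/jvm/*.log
    Read_from_Head    On
    Multiline         On
    # "dump_start" is a hypothetical multiline parser defined in a parsers file;
    # it matches the first line of a jstack dump so the whole file becomes one record.
    Parser_Firstline  dump_start

[OUTPUT]
    Name              s3
    Match             *
    bucket            message-server-thread-dumps
    region            us-east-1
    # Kept short so a failed upload fails quickly, well within the grace period.
    upload_timeout    1m
```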
With that, a thread dump is taken and uploaded to S3 any time the app container is terminated, whether the pod is being deleted or the container is killed while the pod is running.