Using DVR Playback for ClickHouse® Incident Response

July 1, 2025 | 3 min read | Quantrail Team
At 2 AM, PagerDuty fires. Queries on your ClickHouse® cluster are running slowly, and client timeouts are spiking. By the time you open your laptop, the immediate symptoms may have resolved themselves. But what caused them?

The problem with real-time dashboards

Traditional monitoring dashboards show you what is happening right now. When an incident has already passed or is intermittent, real-time data is not enough. You need to see what happened 30 minutes ago, correlate metrics across that window, and pinpoint the root cause.

How DVR playback works

CHOps DVR playback records cluster metrics continuously. When you need to investigate, you open the playback interface and scrub through time like a video player. The timeline shows system metrics (CPU, memory, disk I/O, merge operations, query counts) at per-second granularity.
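To make the record-then-scrub idea concrete, here is a minimal conceptual sketch in Python. It is not how CHOps is implemented; the class name `MetricRecorder`, its methods, and the `cpu_pct` metric are all hypothetical. It assumes one sample per second arriving in timestamp order.

```python
from bisect import bisect_right
from collections import deque


class MetricRecorder:
    """Keeps the most recent samples and lets a caller seek to a past moment."""

    def __init__(self, max_samples):
        # A bounded deque drops the oldest sample once the buffer is full.
        self.samples = deque(maxlen=max_samples)

    def record(self, ts, metrics):
        # Samples are assumed to arrive in increasing timestamp order.
        self.samples.append((ts, metrics))

    def at(self, ts):
        """Return the latest sample taken at or before ts, or None."""
        timestamps = [t for t, _ in self.samples]
        i = bisect_right(timestamps, ts)
        return self.samples[i - 1][1] if i else None


# One hour of history at one sample per second (hypothetical sizing).
recorder = MetricRecorder(max_samples=3600)
for second in range(10):
    recorder.record(second, {"cpu_pct": 10 + second})

# "Scrubbing" to t=4.5 shows the frame recorded at t=4.
frame = recorder.at(4.5)
```

Seeking to a timestamp returns the last frame recorded at or before it, which is the behaviour you would expect from a video scrubber.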

A real investigation workflow

Here is a typical incident investigation using DVR:

Step 1: Set the time window. Your alert fired at 2:03 AM. Set the playback window from 1:45 AM to 2:15 AM to see what led up to the incident and what happened after.
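The window arithmetic is simple enough to script if you want the same padding on every incident. A small sketch, assuming 18 minutes of lead-in and 12 minutes of follow-up (the values that reproduce the 1:45 to 2:15 window above); the function name and the calendar date are placeholders.

```python
from datetime import datetime, timedelta


def playback_window(alert_time,
                    before=timedelta(minutes=18),
                    after=timedelta(minutes=12)):
    """Return (start, end) of the window to review around an alert."""
    return alert_time - before, alert_time + after


# Placeholder date; only the 2:03 AM alert time comes from the example.
alert = datetime(2025, 7, 1, 2, 3)
start, end = playback_window(alert)
print(start.time(), end.time())
```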

Step 2: Look for the trigger. Scrub through the timeline. You notice memory usage on node 3 spikes from 60% to 95% at 1:58 AM. That is your starting point.
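Spotting the jump by eye works, but the same check can be expressed in a few lines. The sketch below assumes you have exported memory readings as (timestamp, percent) pairs; the export format, the function name, and the readings other than the 60% to 95% jump at 1:58 are illustrative.

```python
def find_spikes(samples, min_jump):
    """Return (timestamp, previous, current) wherever the value
    rises by at least min_jump between consecutive samples."""
    spikes = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur - prev >= min_jump:
            spikes.append((ts, prev, cur))
    return spikes


# Per-minute readings for node 3, simplified from per-second data.
node3_memory = [("01:56", 59), ("01:57", 60), ("01:58", 95), ("01:59", 94)]
spikes = find_spikes(node3_memory, min_jump=20)
```

Here `spikes` holds a single entry for 01:58, the moment memory jumped from 60 to 95.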

Step 3: Correlate with queries. Switch to the query view for that time window. At 1:57 AM, a large JOIN query started running. It consumed 40 GB of memory and forced other queries to queue.
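Correlation here means asking which queries were running at the moment of the spike and which of them used the most memory. A rough sketch of that filter, assuming query records with start time, end time, and peak memory; the `QueryRecord` type, the query IDs, and every value except the 1:57 start and 40 GB figure are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QueryRecord:
    query_id: str
    started: datetime
    finished: datetime
    peak_memory_gb: float


def queries_active_at(queries, moment):
    """Queries running at `moment`, largest memory consumer first."""
    active = [q for q in queries if q.started <= moment <= q.finished]
    return sorted(active, key=lambda q: q.peak_memory_gb, reverse=True)


def at(hour, minute):
    # Placeholder date for the example timeline.
    return datetime(2025, 7, 1, hour, minute)


queries = [
    QueryRecord("small-select", at(1, 50), at(1, 51), 0.2),
    QueryRecord("big-join", at(1, 57), at(2, 4), 40.0),
    QueryRecord("dashboard", at(1, 58), at(1, 59), 1.5),
]
suspects = queries_active_at(queries, at(1, 58))
```

The first entry in `suspects` is the query to open in Step 4.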

Step 4: Trace the query. Click the query to see its full text, the user who ran it, and which client application submitted it. In this case, a batch analytics job ran a query without a memory limit.

Step 5: Identify the fix. The root cause is clear: an unbounded query from the analytics pipeline. The fix has two parts: set max_memory_usage in the analytics user's settings profile, and add an alert for queries exceeding 20 GB.
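As a sketch of the first half of that fix, the helper below builds the ClickHouse statement that caps memory for a settings profile. It assumes the profile is managed through SQL-driven access control; the profile name `analytics_profile` and the 32 GiB limit are placeholders, since the right limit depends on your workload. Note that max_memory_usage is expressed in bytes.

```python
def memory_limit_statement(profile, limit_gib):
    """Build an ALTER SETTINGS PROFILE statement.

    max_memory_usage is specified in bytes, so convert from GiB.
    `profile` is interpolated as-is; pass only trusted identifiers.
    """
    limit_bytes = int(limit_gib * 1024 ** 3)
    return (
        f"ALTER SETTINGS PROFILE {profile} "
        f"SETTINGS max_memory_usage = {limit_bytes}"
    )


# Hypothetical profile name and limit.
statement = memory_limit_statement("analytics_profile", 32)
print(statement)
```

Review the generated statement before running it against the cluster, and choose a limit above the alert threshold so the alert fires before queries start failing.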

From incident to prevention

DVR playback turns reactive firefighting into structured investigation. Instead of guessing, you have a recording of exactly what happened. The flame graph profiler adds another layer: if the query was slow, you can see exactly which operations consumed time.

Combining DVR with alerting

Set up alerts for the conditions that DVR helped you identify, for example memory usage above 85%, query duration above 60 seconds, or merge queue depth above 100. When the next alert fires, you already know where to look in the DVR timeline.
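Those three thresholds translate directly into a rule table. The sketch below is a generic illustration, not the CHOps alerting configuration; the metric names and the `breached` helper are hypothetical.

```python
# Thresholds from the conditions above; metric names are hypothetical.
THRESHOLDS = {
    "memory_pct": 85,
    "query_duration_s": 60,
    "merge_queue_depth": 100,
}


def breached(sample, thresholds=THRESHOLDS):
    """Names of the metrics in `sample` that exceed their threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if sample.get(name, 0) > limit
    ]


sample = {"memory_pct": 95, "query_duration_s": 12, "merge_queue_depth": 100}
alerts = breached(sample)
```

Only `memory_pct` is reported for this sample: a merge queue depth of exactly 100 does not count as "above 100".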