Skip to main content

Data Volume Maintenance

This guide provides SQL queries for maintaining ClickHouse data volume in your Langfuse deployment. Use these queries to monitor disk usage, clean up old data, and verify retention policies.

Connect to ClickHouse

To execute these queries, you need access to the ClickHouse pod. Here's how to connect:

Find the ClickHouse Pod

kubectl get pods -n langfuse | grep clickhouse

Connect to ClickHouse Pod

kubectl exec -it langfuse-clickhouse-shard0-X -n langfuse -- /bin/bash

Replace X with your shard number.

Get ClickHouse Password

kubectl get secret langfuse-clickhouse -n langfuse -o jsonpath='{.data.admin-password}' | base64 --decode; echo

Connect using ClickHouse Client

Inside the pod, connect to ClickHouse using the password from above:

clickhouse-client --password <password_from_above>

1. Disk Usage Analysis

Top Tables by Disk Size

This query identifies which tables are consuming the most disk space.

Database Types

ClickHouse contains two types of databases:

  • default – Langfuse application database containing business data (Langfuse uses default as the database name by default):

    • observations
    • traces
    • blob_storage_file_log
    • scores
    • Other Langfuse tables and views
  • system – ClickHouse internal database containing metadata (Located in the system database):

    • trace_log
    • zookeeper_log
    • metric_log
    • opentelemetry_span_log
    • Other ClickHouse tables
SELECT
database,
`table`,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY
database,
`table`
ORDER BY
database ASC,
sum(bytes_on_disk) DESC;

List All Tables

Displays tables sorted by size, including their engine type, row count, and total size.

Key Columns:

  • engine:
    • MergeTree / Replicated...: Real tables that store data.
    • View: Virtual tables (saved queries) that take up 0 bytes.
  • total_rows: The number of records in the table.

Query to list all Langfuse application tables in the default database:

SELECT
name AS table_name,
engine,
total_rows,
formatReadableSize(total_bytes) AS size
FROM system.tables
WHERE database = 'default'
ORDER BY total_bytes DESC;

Data Distribution by Time Period

The following queries help you understand how data is distributed over time and identify which periods consume the most storage.

Choose the appropriate query based on your needs:

note

Per-day compressed size is not available because ClickHouse partitions data by month (PARTITION BY toYYYYMM()).

Compressed Size by Month (Fast)

Shows actual compressed disk usage by month. Reads partition metadata from system.parts.

SELECT
partition AS month,
sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS compressed_size
FROM system.parts
WHERE database = 'default' AND table = 'observations' AND active
GROUP BY partition
ORDER BY partition ASC;

By Day: Row Count (Fast)

Shows row count per day. Executes instantly by reading indices only.

SELECT
toDate(start_time) AS day,
count() AS rows
FROM default.observations
GROUP BY day
ORDER BY day ASC;

Uncompressed Size by Day (Heavy)

SELECT
toDate(start_time) AS day,
count() AS rows,
formatReadableSize(sum(length(toString(input)) + length(toString(output)))) AS approx_size
FROM default.observations
GROUP BY day
ORDER BY day ASC;
Date Column Names

Check the date column name for your table:

  • default.observations uses start_time
  • default.traces and default.scores uses timestamp

To verify the date column for other tables, see Table Structure section.


2. Data Cleanup

This section covers both manual and automatic data cleanup strategies.

Manual Data Deletion

If you need to clean up data manually (e.g., before applying a new TTL or for testing), use the ALTER ... DELETE command.

Important

This operation is a Mutation. It is asynchronous and resource-intensive. ClickHouse effectively rewrites the data parts without the deleted rows.

Always use toDate() or specific date strings. Using non-deterministic functions like now() or today() can cause errors in replicated tables.

Delete Data Older Than a Specific Date

-- Delete all records older than a specific date
ALTER TABLE default.observations
DELETE WHERE toDate(start_time) < toDate('2025-07-13');

Check Mutation Status

Since deletion is not instant, check the progress here:

SELECT command, is_done
FROM system.mutations
WHERE table = 'observations'
ORDER BY create_time DESC
LIMIT 5;

TTL Monitoring

Monitor automatic data deletion through Time-To-Live (TTL) policies.

First: Verify TTL is Configured

Before monitoring, check if TTL exists in your table definition: Table Structure. Look for the TTL line – if missing, automatic deletion is not configured.

Verify Old Data Deletion

Shows the 15 oldest days with data to check if TTL is deleting old records as expected.

SELECT
toDate(start_time) AS day,
count() AS rows
FROM default.observations
GROUP BY day
ORDER BY day ASC
LIMIT 15;

TTL Expiration Status

Check when ClickHouse will delete expired data. ClickHouse stores data in physical files called parts. Each part contains multiple rows, and TTL is checked during background merges.

SELECT
partition,
name AS part_name,
toDateTime(delete_ttl_info_min) AS min_ttl,
toDateTime(delete_ttl_info_max) AS max_ttl
FROM system.parts
WHERE database = 'default' AND table = 'observations' AND active
ORDER BY min_ttl;
Column Meaning

Since each partition contains multiple rows with different timestamps:

  • min_ttl: When the oldest row in this partition will expire
  • max_ttl: When the newest row in this partition will expire
  • partition: The monthly partition (format: YYYYMM)
How to Interpret Status

Compare min_ttl with the current date/time:

min_ttl ValueStatusWhat it means
1970-01-01TTL Not ProcessedTTL rule exists but hasn't been evaluated yet. ClickHouse will process it during the next background merge.
Past DateExpired (Pending)Data should be deleted but still exists on disk. ClickHouse deletes it during the next merge. This is normal "lazy cleanup" behavior.
Future DateActiveData is within retention period. It will be automatically deleted when this date is reached.

Why isn't expired data deleted immediately? ClickHouse performs TTL cleanup during background merges to avoid impacting query performance.

Force Immediate Cleanup

If expired data still occupies disk space and you need to free it immediately:

OPTIMIZE TABLE default.observations FINAL;

This forces ClickHouse to merge all parts and apply TTL rules immediately.


3. Table Structure

Use this command to view the full table definition. This is critical for:

  1. Column Names: Finding the correct date column (e.g., start_time vs timestamp).
  2. TTL Verification: Checking if a retention policy is currently configured.
SHOW CREATE TABLE default.observations;
What to Look For
  • PARTITION BY: How data is split (usually by month).
  • TTL: The automatic deletion rule (e.g., TTL toDateTime(start_time) + INTERVAL 60 DAY DELETE). If this line is missing, no retention is active.