Friday, December 19, 2025

DuckDB: very useful where least expected

If you haven't heard about DuckDB yet, you definitely have to check it out. It is a fascinating piece of technology for quick data exploration and analysis. But today we are going to talk about a somewhat surprising but exceptionally useful area where DuckDB could be tremendously helpful: dealing with chatty HTTP/REST services.

One of the best things about DuckDB is that it is just a single binary (per OS/arch) with no additional dependencies, so the installation process is a breeze.
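
If you want to try it right away, the official install script or your package manager will get you a working binary in seconds (the commands below are a sketch and may vary per platform, so double check against duckdb.org):

$ curl https://install.duckdb.org | sh
$ # or, on macOS, via Homebrew
$ brew install duckdb
$ duckdb --version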

Let me set the stage here. I have been working with OpenSearch (and Elasticsearch) for years; those are great search engines with very rich HTTP/REST APIs. Production-grade clusters consist of hundreds of nodes, and at this scale nearly every cluster-wide HTTP/REST endpoint invocation returns unmanageable JSON blobs. Wouldn't it be cool to transform such JSON blobs into a structured, queryable form? Like a relational table, for example, so we could run SQL queries over it without writing a single line of code? It is absolutely doable with DuckDB and its JSON Processing Functions.
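
Just to give a flavour of how little ceremony is involved, here is a trivial warm-up with a made-up payload (the JSON below is purely illustrative, not a real OpenSearch response):

$ echo '{"cluster_name":"demo","nodes":{"n1":{"name":"node-1"},"n2":{"name":"node-2"}}}' | duckdb -c "
  SELECT cluster_name, nodes FROM read_json_auto('/dev/stdin')"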

As an exercise, we are going to play with the Nodes API, which returns a detailed per-node response in a deeply nested JSON structure:

{
  "cluster_name" : "...",
  "_nodes" : {
     ...
  },
  "nodes" : {
    <node1> : {
        ...
    },
    <node2> : {
        ...
    },
    ...
    <nodeN> : {
        ...
    }
  }
}

Ideally, what we want is to flatten this structure into a table where each row represents an individual node and each JSON key becomes a column of its own. To put things in context, each node structure contains nested arrays and objects; we will not traverse those recursively (it is possible, but it requires more complex transformations). With that, let us start our exploration journey!

The first step is the simplest: extract the nodes collection of objects from the Nodes API response and feed it directly into DuckDB.

$ curl "https://localhost:9200/_nodes?pretty" -u admin:<password> -k --raw -s | duckdb -c "
  WITH nodes AS (
    SELECT key as id, value FROM  read_json_auto('/dev/stdin') AS r, json_each(r, '$.nodes')
  )
  SELECT * FROM nodes"

We would get back something like this (the OpenSearch cluster I use for tests has only two nodes, hence we see only two rows):

┌──────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│          id          │                                                                 value                                                                  │
│       varchar        │                                                                  json                                                                  │
├──────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ YQzVQ_kHT-WnYTReh0…  │ {"name":"opensearch-node2","transport_address":"10.89.0.3:9300","host":"10.89.0.3","ip":"10.89.0.3","version":"3.0.0","build_type":"…  │
│ J9a5OM8STainCdkaLm…  │ {"name":"opensearch-node1","transport_address":"10.89.0.2:9300","host":"10.89.0.2","ip":"10.89.0.2","version":"3.0.0","build_type":"…  │
└──────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Although we could stop right here (DuckDB lets you query JSON values easily), we would like to do something more useful with the value blob: turn it into a table with real columns. There are many ways to accomplish that in DuckDB; some require the precise JSON structure (schema) to be provided, others require a predefined list of keys upfront (a quick sketch of that route is shown below for comparison). We will stick to dynamic exploration instead and extract all keys from the actual JSON data.
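
For comparison, the predefined-keys route could look roughly like this (a sketch with a few hand-picked keys; json_extract_string pulls scalar values out of the value blob):

$ curl "https://localhost:9200/_nodes?pretty" -u admin:<password> -k --raw -s | duckdb -c "
  WITH nodes AS (
    SELECT key as id, value FROM read_json_auto('/dev/stdin') AS r, json_each(r, '$.nodes')
  )
  SELECT id,
         json_extract_string(value, '$.name')    AS name,
         json_extract_string(value, '$.host')    AS host,
         json_extract_string(value, '$.version') AS version
  FROM nodes"

Back to the dynamic route: extracting all keys from the actual JSON data looks like this.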

$ curl "https://localhost:9200/_nodes?pretty" -u admin:<password> -k --raw -s | duckdb -c "
  WITH nodes AS (
    SELECT key as id, value FROM  read_json_auto('/dev/stdin') AS r, json_each(r, '$.nodes')
  ),
  all_keys AS (
    SELECT distinct(unnest(json_keys(value))) AS key FROM nodes
  )
  SELECT * FROM all_keys"

In the version of OpenSearch I am running, 22 unique keys (JSON field names) are returned; an excerpt of the output is below:

┌────────────────────────────────┐
│              key               │
│            varchar             │
├────────────────────────────────┤
│ plugins                        │
│ jvm                            │
│ host                           │
│ version                        │
│ build_hash                     │
│ ...                            │
│ modules                        │
│ build_type                     │
│ os                             │
│ transport                      │
│ search_pipelines               │
│ attributes                     │
├────────────────────────────────┤
│            22 rows             │
└────────────────────────────────┘

Good progress so far, but we need to go the last mile and build a relational table out of these pieces. This is where DuckDB's powerful PIVOT statement comes in very handy.
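
If PIVOT is new to you, here is a minimal, self-contained illustration on a toy key/value table (entirely made-up data, only to show the shape of the statement):

$ duckdb -c "
  CREATE TABLE kv(id VARCHAR, key VARCHAR, val VARCHAR);
  INSERT INTO kv VALUES
    ('n1', 'host', '10.0.0.1'), ('n1', 'version', '3.0.0'),
    ('n2', 'host', '10.0.0.2'), ('n2', 'version', '3.0.0');
  PIVOT kv ON key USING first(val) GROUP BY id;"

Each distinct value of key becomes a column, and the rows collapse to one per id. Applied to our nodes, the full pipeline becomes: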

$ curl "https://localhost:9200/_nodes?pretty" -u admin:<password> -k --raw -s | duckdb -c "
  WITH nodes AS (
    SELECT key as id, value FROM  read_json_auto('/dev/stdin') AS r, json_each(r, '$.nodes')
  ),
  all_keys AS (
    SELECT distinct(unnest(json_keys(value))) AS key FROM nodes
  ),
  keys AS (
    SELECT * FROM all_keys WHERE key not in ['plugins', 'modules']
  )
  SELECT id, node.* FROM nodes, (PIVOT keys ON(key) USING first(json_extract(value, '$.' || key))) as node"

And here we are:

┌──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┬───────────────────┬─────────┐
│          id          │     aggregations     │      attributes      │      build_hash      │ … │     thread_pool      │ total_indexing_buf…  │      transport       │ transport_address │ version │
│       varchar        │         json         │         json         │         json         │   │         json         │         json         │         json         │       json        │  json   │
├──────────────────────┼──────────────────────┼──────────────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┼───────────────────┼─────────┤
│ YQzVQ_kHT-WnYTReh0…  │ {"adjacency_matrix…  │ {"shard_indexing_p…  │ "dc4efa821904cc2d7…  │ … │ {"remote_refresh_r…  │ 53687091             │ {"bound_address":[…  │ "10.89.0.3:9300"  │ "3.0.0" │
│ J9a5OM8STainCdkaLm…  │ {"adjacency_matrix…  │ {"shard_indexing_p…  │ "dc4efa821904cc2d7…  │ … │ {"remote_refresh_r…  │ 53687091             │ {"bound_address":[…  │ "10.89.0.2:9300"  │ "3.0.0" │
├──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┴───────────────────┴─────────┤
│ 2 rows                                                                                                                                                                      21 columns (9 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
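
Now that each node is a regular row with real columns, ordinary SQL applies on top of it. As a quick sketch (reusing the exact same CTEs), counting nodes per version could look like this:

$ curl "https://localhost:9200/_nodes?pretty" -u admin:<password> -k --raw -s | duckdb -c "
  WITH nodes AS (
    SELECT key as id, value FROM read_json_auto('/dev/stdin') AS r, json_each(r, '$.nodes')
  ),
  all_keys AS (
    SELECT distinct(unnest(json_keys(value))) AS key FROM nodes
  ),
  keys AS (
    SELECT * FROM all_keys WHERE key not in ['plugins', 'modules']
  )
  SELECT node.version, count(*) AS node_count
  FROM nodes, (PIVOT keys ON(key) USING first(json_extract(value, '$.' || key))) as node
  GROUP BY node.version"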

As we set out to do, we have transformed each node from a JSON blob into a structured table. There is one subtle quirk to mention: the additional step that filters out the plugins and modules fields from the transformation. DuckDB seems to have difficulties pivoting those:

Binder Error:
PIVOT is not supported in correlated subqueries yet
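
If you do need the contents of plugins (or modules), one workaround is to expand them in a separate query rather than inside the pivot. A rough sketch, assuming the same lateral json_each pattern works one level down and that each plugin entry carries a name field:

$ curl "https://localhost:9200/_nodes?pretty" -u admin:<password> -k --raw -s | duckdb -c "
  WITH nodes AS (
    SELECT key as id, value FROM read_json_auto('/dev/stdin') AS r, json_each(r, '$.nodes')
  )
  SELECT nodes.id, json_extract_string(p.value, '$.name') AS plugin
  FROM nodes, json_each(nodes.value, '$.plugins') AS p"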

I hope you find this useful: typical enterprise-grade HTTP/REST services often throw a pile of JSON at you, so go ahead and try to make sense of it!

I πŸ‡ΊπŸ‡¦ stand πŸ‡ΊπŸ‡¦ with πŸ‡ΊπŸ‡¦ Ukraine.