
Data Engineering With Dagster Part Eight: Metadata

Dig into materialization metadata, inline visualizations, and best practices for asset observability using Dagster.

by Niklas Heringer · Apr 05, 2025


Metadata in Dagster: The Two Worlds

Metadata is data about data - and while that sounds meta (and it is), it’s also the secret to making your pipelines understandable, traceable, and collaborative.

Imagine you're running a bakery:

  • The cookies = your data
  • The label saying “baked today at 7:32am, by Sam” = your metadata

In Dagster, metadata powers both visibility and observability. There are two types you’ll work with:

  • Definition metadata → fixed context like descriptions and groupings
  • Materialization metadata → dynamic info like row counts, timestamps, or even rendered charts from each run

We’ll explore both, starting with how to better describe your data for your team - and future you.
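To make the distinction concrete, here's a minimal sketch of one asset carrying both kinds of metadata (the asset name and values are illustrative):

import dagster as dg

@dg.asset(
    description="Definition metadata: fixed context, written once in code.",
    group_name="examples",  # also definition metadata
)
def example_asset() -> dg.MaterializeResult:
    # Materialization metadata: computed fresh on every run
    return dg.MaterializeResult(
        metadata={"Number of records": dg.MetadataValue.int(42)}
    )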


Definition Metadata: Describe Your Assets

Dagster gives you two clean ways to document your assets:

Option 1: Python Docstrings

Add a triple-quoted string at the top of your asset function - Dagster will pick it up:

import dagster as dg

@dg.asset
def taxi_zones_file() -> None:
    """
    The raw CSV file for the taxi zones dataset. Sourced from the NYC Open Data portal.
    """

These show up in the Dagster UI - perfect for quick inline docs.

Option 2: The description= Parameter

You can also add a description directly in the decorator:

@dg.asset(
    description="The raw CSV file for the taxi zones dataset. Sourced from the NYC Open Data portal."
)
def taxi_zones_file() -> None:
    """This docstring won’t be shown in the UI."""

If both exist, Dagster uses the description=. Handy if you want internal vs. external descriptions.

Where You See It

  • In the Assets tab under the asset name
  • In the Global Asset Lineage graph, when you hover over a node

Documentation becomes part of your code - and your UI.


Grouping Assets: Don’t Let It Get Messy

As your project grows, your assets multiply. It’s time to group them - not just visually, but functionally.

You can group assets:

  • Individually using the group_name= parameter
  • By module, using load_assets_from_modules(..., group_name=...)

Both result in clearly labeled boxes in the Dagster UI - and more maintainable job selection.

Grouping Individual Assets

@dg.asset(group_name="raw_files")
def taxi_zones_file() -> None:
    ...
@dg.asset(group_name="ingested")
def taxi_trips() -> None:
    ...

Grouping Entire Modules

In definitions.py:

import dagster as dg

from .assets import metrics  # wherever your metric assets live; adjust to your layout

metric_assets = dg.load_assets_from_modules(
    modules=[metrics],
    group_name="metrics"
)

This is clean, scalable, and ideal for large projects.
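For completeness, here's a hedged sketch of how the loaded assets might be wired into the project's Definitions object (the trips module and overall layout are assumptions carried over from earlier parts of this series):

import dagster as dg

from .assets import metrics, trips  # assumed module layout

trip_assets = dg.load_assets_from_modules(modules=[trips])
metric_assets = dg.load_assets_from_modules(
    modules=[metrics],
    group_name="metrics"
)

defs = dg.Definitions(
    assets=[*trip_assets, *metric_assets],
)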


Mini Excursus: Practice Grouping

Try it like this:

# assets/trips.py
@dg.asset(group_name="raw_files")
def taxi_trips_file() -> None:
    ...

@dg.asset(group_name="ingested")
def taxi_trips() -> None:
    ...

# definitions.py
from .assets import requests  # the project's requests assets module, not the HTTP client

request_assets = dg.load_assets_from_modules(
    modules=[requests],
    group_name="requests"
)

Now, in the UI, assets are visually separated into raw_files, ingested, and requests.

No more asset soup. Just order.


Materialization Metadata: Add Context to Each Run

While definition metadata is about what the asset is, materialization metadata is about what happened during a run.

This might include:

  • Number of records processed
  • Size of output files
  • Execution timestamps
  • Even preview charts or markdown

Let’s implement a common one: row count.


Adding Row Count to taxi_trips_file

In assets/trips.py:

import requests
import pandas as pd
import dagster as dg

# constants and monthly_partition come from earlier parts of this series;
# adjust the import paths to your project layout
from . import constants
from ..partitions import monthly_partition

@dg.asset(
    partitions_def=monthly_partition,
    group_name="raw_files",
)
def taxi_trips_file(context) -> dg.MaterializeResult:
    """
    The raw parquet files for the taxi trips dataset. Sourced from the NYC Open Data portal.
    """
    partition_date_str = context.partition_key
    month_to_fetch = partition_date_str[:-3]

    raw_trips = requests.get(
        f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
    )

    file_path = constants.TAXI_TRIPS_TEMPLATE_FILE_PATH.format(month_to_fetch)

    with open(file_path, "wb") as output_file:
        output_file.write(raw_trips.content)

    num_rows = len(pd.read_parquet(file_path))

    return dg.MaterializeResult(
        metadata={
            "Number of records": dg.MetadataValue.int(num_rows)
        }
    )
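Row counts are just the start. MaterializeResult accepts many metadata types; here's a hedged sketch of a few others you could attach inside the same asset (the "Size (MB)" entry and the os.path.getsize call are illustrative additions, not part of the course code):

import os

# file_path, num_rows, and month_to_fetch are the variables from the asset above
file_size_mb = os.path.getsize(file_path) / 1024 / 1024

return dg.MaterializeResult(
    metadata={
        "Number of records": dg.MetadataValue.int(num_rows),
        "Size (MB)": dg.MetadataValue.float(file_size_mb),
        "Source file": dg.MetadataValue.path(file_path),
        "Source URL": dg.MetadataValue.url(
            f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month_to_fetch}.parquet"
        ),
    }
)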

Mini Excursus: Do It for taxi_zones_file

@dg.asset(group_name="raw_files")
def taxi_zones_file() -> dg.MaterializeResult:
    raw_taxi_zones = requests.get(
        "https://community-engineering-artifacts.s3.us-west-2.amazonaws.com/dagster-university/data/taxi_zones.csv"
    )

    with open(constants.TAXI_ZONES_FILE_PATH, "wb") as output_file:
        output_file.write(raw_taxi_zones.content)

    num_rows = len(pd.read_csv(constants.TAXI_ZONES_FILE_PATH))

    return dg.MaterializeResult(
        metadata={
            "Number of records": dg.MetadataValue.int(num_rows)
        }
    )

Viewing in the UI

Once materialized:

  • Open Global Asset Lineage
  • Select taxi_trips_file
  • View row count per partition as a graph

[Screenshot: per-partition row counts plotted in the Dagster UI]

Real-time dashboards. Zero setup.


Markdown Metadata: Inline Chart Previews

Some metadata is visual.

Let’s embed a generated chart (e.g. from adhoc_request) as a rendered Markdown image in the UI.

Here’s how:

import base64

import dagster as dg

# Inside your asset (e.g. adhoc_request), after the chart has been saved to file_path:
with open(file_path, "rb") as file:
    image_data = file.read()

# Encode the image and embed it in Markdown as a data URI
base64_data = base64.b64encode(image_data).decode("utf-8")
md_content = f"![Image](data:image/jpeg;base64,{base64_data})"

return dg.MaterializeResult(
    metadata={
        "preview": dg.MetadataValue.md(md_content)
    }
)

Dagster renders it directly in the UI:

[Screenshot: the Markdown chart preview rendered in the Dagster UI]

It’s like adding screenshots to your pipeline logs - but built in.
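To see the whole flow in one place, here's a minimal end-to-end sketch: an illustrative asset (not the course's adhoc_request) that renders a placeholder chart with matplotlib and attaches it as a Markdown preview:

import base64

import dagster as dg
import matplotlib.pyplot as plt


@dg.asset(group_name="metrics")
def trips_chart() -> dg.MaterializeResult:
    """Illustrative asset: renders a chart and attaches it as Markdown metadata."""
    file_path = "trips_chart.png"

    # Placeholder chart; a real asset would plot actual trip data
    plt.bar(["Jan", "Feb", "Mar"], [120, 98, 143])
    plt.savefig(file_path)
    plt.close()

    with open(file_path, "rb") as file:
        image_data = file.read()

    base64_data = base64.b64encode(image_data).decode("utf-8")
    md_content = f"![Image](data:image/png;base64,{base64_data})"

    return dg.MaterializeResult(
        metadata={"preview": dg.MetadataValue.md(md_content)}
    )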

