ML System with DataHub
Why Integrate Your ML System with DataHub?
As a data practitioner, you may find it challenging to keep track of your ML experiments, models, and their relationships. DataHub makes this easier by providing a central place to organize and track your ML assets.
This guide will show you how to integrate your ML workflows with DataHub. With this integration, you can easily find and share ML models across your organization, track how models evolve over time, and understand how training data connects to each model. Most importantly, it enables seamless collaboration on ML projects by making everything discoverable and connected.
Goals Of This Guide
In this guide, you'll learn how to:
- Create your basic ML components (models, experiments, runs)
- Connect these components to build a complete ML system
- Track relationships between models, data, and experiments
Core ML Concepts
Here's what you need to know about the key components, based on MLflow's terminology:
- Experiments are collections of training runs for the same project, like all attempts to build a churn predictor
- Training Runs are attempts to train a model within an experiment, capturing parameters and results
- Models organize related model versions together, like all versions of your churn predictor
- Model Versions are successful training runs registered for production use
The hierarchy works like this:
- Every run belongs to an experiment
- Successful runs can become model versions
- Model versions belong to a model group
- Not every run becomes a model version
Here's how DataHub and MLflow terms map to each other. For more details, see the MLflow integration doc:
| DataHub | MLflow | Description |
|---|---|---|
| ML Model Group | Model | Collection of related model versions |
| ML Model | Model Version | Specific version of a trained model |
| ML Training Run | Run | Single training attempt |
| ML Experiment | Experiment | Project workspace |
Basic Setup
To follow this tutorial, you'll need DataHub Quickstart deployed locally. For detailed steps, see the DataHub Quickstart Guide.
Next, set up the Python client for DataHub.
Create a token in the DataHub UI and replace <your_token> with your token:
from mlflow_dh_client import MLflowDatahubClient
client = MLflowDatahubClient(token="<your_token>")
Throughout this guide, we'll show how to verify changes using GraphQL queries.
You can run these queries in the DataHub UI at http://localhost:9002/api/graphiql.
Create Simple ML Entities
Let's create the basic building blocks of your ML system. These components will help you organize your ML work and make it discoverable by your team.
Create Model Group
A model group contains different versions of a similar model. For example, all versions of your "Customer Churn Predictor" would go in one group.
- Simple Version
- Detailed Version
client.create_model_group(
group_id="airline_forecast_models_group",
)
import datahub.metadata.schema_classes as models  # schema classes ship with the acryl-datahub package

client.create_model_group(
group_id="airline_forecast_models_group",
properties=models.MLModelGroupPropertiesClass(
name="Airline Forecast Models Group",
description="Group of models for airline passenger forecasting",
created=models.TimeStampClass(
time=1628580000000, actor="urn:li:corpuser:datahub"
),
),
)
Let's verify that our model group was created:
- UI
- GraphQL
query {
mlModelGroup(
urn:"urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_models_group,PROD)"
) {
name
description
}
}
The response will show your model group's details:
{
"data": {
"mlModelGroup": {
"name": "airline_forecast_models_group",
"description": "Group of models for airline passenger forecasting"
}
}
}
Create Model
Next, let's create a model. In DataHub, an ML Model corresponds to an MLflow model version: a specific trained model that's ready for deployment.
- Simple Version
- Detailed Version
client.create_model(
model_id="arima_model",
version="1.0",
)
client.create_model(
model_id="arima_model",
properties=models.MLModelPropertiesClass(
name="ARIMA Model",
description="ARIMA model for airline passenger forecasting",
customProperties={"team": "forecasting"},
trainingMetrics=[
models.MLMetricClass(name="accuracy", value="0.9"),
models.MLMetricClass(name="precision", value="0.8"),
],
hyperParams=[
models.MLHyperParamClass(name="learning_rate", value="0.01"),
models.MLHyperParamClass(name="batch_size", value="32"),
],
externalUrl="https:localhost:5000",
created=models.TimeStampClass(
time=1628580000000, actor="urn:li:corpuser:datahub"
),
lastModified=models.TimeStampClass(
time=1628580000000, actor="urn:li:corpuser:datahub"
),
tags=["forecasting", "arima"],
),
version="1.0",
alias="champion",
)
Let's verify our model:
- UI
- GraphQL
query {
mlModel(
urn:"urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)"
) {
name
description
versionProperties {
version {
versionTag
}
}
}
}
The response will show your model's details:
{
"data": {
"mlModel": {
"name": "arima_model",
"description": "ARIMA model for airline passenger forecasting",
"versionProperties": {
"version": {
"versionTag": "1.0"
}
}
}
}
}
Create Experiment
An experiment helps organize multiple training runs for a specific project.
- Simple Version
- Detailed Version
client.create_experiment(
experiment_id="airline_forecast_experiment",
)
client.create_experiment(
experiment_id="airline_forecast_experiment",
properties=models.ContainerPropertiesClass(
name="Airline Forecast Experiment",
description="Experiment to forecast airline passenger numbers",
customProperties={"team": "forecasting"},
created=models.TimeStampClass(
time=1628580000000, actor="urn:li:corpuser:datahub"
),
lastModified=models.TimeStampClass(
time=1628580000000, actor="urn:li:corpuser:datahub"
),
),
)
Verify your experiment:
- UI
- GraphQL
query {
container(
urn:"urn:li:container:airline_forecast_experiment"
) {
name
description
properties {
customProperties
}
}
}
Check the response:
{
"data": {
"container": {
"name": "Airline Forecast Experiment",
"description": "Experiment to forecast airline passenger numbers",
"properties": {
"customProperties": {
"team": "forecasting"
}
}
}
}
}
Create Training Run
A training run captures all details about a specific model training attempt.
- Simple Version
- Detailed Version
client.create_training_run(
run_id="simple_training_run_4",
)
from mlflow_dh_client import RunResultType  # assumed: RunResultType is exported by the same example client module

client.create_training_run(
run_id="simple_training_run_4",
properties=models.DataProcessInstancePropertiesClass(
name="Simple Training Run 4",
created=models.AuditStampClass(
time=1628580000000, actor="urn:li:corpuser:datahub"
),
customProperties={"team": "forecasting"},
),
training_run_properties=models.MLTrainingRunPropertiesClass(
id="simple_training_run_4",
outputUrls=["s3://my-bucket/output"],
trainingMetrics=[models.MLMetricClass(name="accuracy", value="0.9")],
hyperParams=[models.MLHyperParamClass(name="learning_rate", value="0.01")],
externalUrl="https:localhost:5000",
),
run_result=RunResultType.FAILURE,
start_timestamp=1628580000000,
end_timestamp=1628580001000,
)
Verify your training run:
- UI
- GraphQL
query {
dataProcessInstance(
urn:"urn:li:dataProcessInstance:simple_training_run_4"
) {
name
created {
time
}
properties {
customProperties
}
}
}
Check the response:
{
"data": {
"dataProcessInstance": {
"name": "Simple Training Run 4",
"created": {
"time": 1628580000000
},
"properties": {
"customProperties": {
"team": "forecasting"
}
}
}
}
}
Define Entity Relationships
Now let's connect these components to create a comprehensive ML system. These connections enable you to track model lineage, monitor model evolution, understand dependencies, and search effectively across your ML assets.
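The snippets in the following sections reference URN variables (model_urn, model_group_urn, run_urn, experiment_urn) without defining them. Here's a minimal sketch of how you might define them as plain strings, following the URN formats used in the GraphQL queries throughout this guide:

# URN strings mirror the formats shown in the GraphQL queries above;
# adjust the platform, IDs, and environment to match your own entities.
model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)"
model_group_urn = "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_models_group,PROD)"
run_urn = "urn:li:dataProcessInstance:simple_training_run_4"
experiment_urn = "urn:li:container:airline_forecast_experiment"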
Add Model To Model Group
Connect your model to its group:
client.add_model_to_model_group(model_urn=model_urn, group_urn=model_group_urn)
- UI
- GraphQL
View model versions in the Model Group under the Models section:
Find group information in the Model page under the Group tab:
query {
mlModel(
urn:"urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)"
) {
name
properties {
groups {
urn
properties {
name
}
}
}
}
}
Check the response:
{
"data": {
"mlModel": {
"name": "arima_model",
"properties": {
"groups": [
{
"urn": "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_model_group,PROD)",
"properties": {
"name": "Airline Forecast Model Group"
}
}
]
}
}
}
}
Add Run To Experiment
Connect a training run to its experiment:
client.add_run_to_experiment(run_urn=run_urn, experiment_urn=experiment_urn)
- UI
- GraphQL
Find your runs in the Experiment page under the Entities tab:
See the experiment details in the Run page:
query {
dataProcessInstance(
urn:"urn:li:dataProcessInstance:simple_training_run"
) {
name
parentContainers {
containers {
urn
properties {
name
}
}
}
}
}
View the relationship details:
{
"data": {
"dataProcessInstance": {
"name": "Simple Training Run",
"parentContainers": {
"containers": [
{
"urn": "urn:li:container:airline_forecast_experiment",
"properties": {
"name": "Airline Forecast Experiment"
}
}
]
}
}
}
}
Add Run To Model
Connect a training run to its resulting model:
client.add_run_to_model(model_urn=model_urn, run_urn=run_urn)
This relationship enables you to:
- Track which runs produced each model
- Understand model provenance
- Debug model issues
- Monitor model evolution
- UI
- GraphQL
Find the source run in the Model page under the Summary tab:
See related models in the Run page under the Lineage tab:
query {
mlModel(
urn:"urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)"
) {
name
properties {
mlModelLineageInfo {
trainingJobs
}
}
}
}
View the relationship:
{
"data": {
"mlModel": {
"name": "arima_model",
"properties": {
"mlModelLineageInfo": {
"trainingJobs": [
"urn:li:dataProcessInstance:simple_training_run_test"
]
}
}
}
}
}
Add Run To Model Group
Create a direct connection between a run and a model group:
client.add_run_to_model_group(model_group_urn=model_group_urn, run_urn=run_urn)
This connection lets you:
- View model groups in the run's lineage
- Query training jobs at the group level
- Track training history for model families
- UI
- GraphQL
See model groups in the Run page under the Lineage tab:
query {
mlModelGroup(
urn:"urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_model_group,PROD)"
) {
name
properties {
mlModelLineageInfo {
trainingJobs
}
}
}
}
Check the relationship:
{
"data": {
"mlModelGroup": {
"name": "airline_forecast_model_group",
"properties": {
"mlModelLineageInfo": {
"trainingJobs": [
"urn:li:dataProcessInstance:simple_training_run_test"
]
}
}
}
}
}
Add Dataset To Run
Track input and output datasets for your training runs:
client.add_input_datasets_to_run(
run_urn=run_urn,
dataset_urns=[str(input_dataset_urn)]
)
client.add_output_datasets_to_run(
run_urn=run_urn,
dataset_urns=[str(output_dataset_urn)]
)
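The input_dataset_urn and output_dataset_urn values above are placeholders. One way to construct them, shown here as a sketch with hypothetical dataset names, is the make_dataset_urn helper from the DataHub SDK:

from datahub.emitter.mce_builder import make_dataset_urn

# Hypothetical example datasets; substitute your own platform and table names.
input_dataset_urn = make_dataset_urn(platform="snowflake", name="passenger_history", env="PROD")
output_dataset_urn = make_dataset_urn(platform="s3", name="forecast_output", env="PROD")

Note that make_dataset_urn returns a plain string, so the str() wrapper in the snippet above is harmless.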
These connections help you:
- Track data lineage
- Understand data dependencies
- Ensure reproducibility
- Monitor data quality impacts
Find dataset relationships in the Lineage tab of either the Dataset or Run page:
Full Overview
Here's your complete ML system with all components connected:
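As a recap, here's a minimal end-to-end sketch that strings together the calls covered in this guide, using the same example IDs and the URN variables defined earlier:

# Create the building blocks
client.create_model_group(group_id="airline_forecast_models_group")
client.create_model(model_id="arima_model", version="1.0")
client.create_experiment(experiment_id="airline_forecast_experiment")
client.create_training_run(run_id="simple_training_run_4")

# Connect everything into one lineage graph
client.add_model_to_model_group(model_urn=model_urn, group_urn=model_group_urn)
client.add_run_to_experiment(run_urn=run_urn, experiment_urn=experiment_urn)
client.add_run_to_model(model_urn=model_urn, run_urn=run_urn)
client.add_run_to_model_group(model_group_urn=model_group_urn, run_urn=run_urn)
client.add_input_datasets_to_run(run_urn=run_urn, dataset_urns=[str(input_dataset_urn)])
client.add_output_datasets_to_run(run_urn=run_urn, dataset_urns=[str(output_dataset_urn)])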
You now have a complete lineage view of your ML assets, from training data through runs to production models!
What's Next?
To see this integration in action and learn about real-world use cases:
- Watch our Townhall demo on MLflow integration with DataHub
- Join our Slack community for discussions
- Read our MLflow integration doc for more details