Sunday, November 30, 2025

Why XGBoost + MLflow?

XGBoost (eXtreme Gradient Boosting) is a popular gradient boosting library for structured data. MLflow provides native integration with XGBoost for experiment tracking, model management, and deployment.


This integration supports both the native XGBoost API and the scikit-learn compatible interface, making it easy to track experiments and deploy models regardless of which API you prefer. Both are shown below.



import mlflow
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Enable autologging - captures everything automatically
mlflow.xgboost.autolog()

# Load and prepare data
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Prepare data in XGBoost format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Train with the native API - MLflow automatically logs everything!
with mlflow.start_run():
    model = xgb.train(
        params={
            "objective": "reg:squarederror",
            "max_depth": 6,
            "learning_rate": 0.1,
        },
        dtrain=dtrain,
        num_boost_round=100,
        evals=[(dtrain, "train"), (dtest, "test")],
    )
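After the run finishes, launch the tracking UI with mlflow ui and open http://127.0.0.1:5000 to browse the captured parameters, metrics, and artifacts.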




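The same workflow applies to the scikit-learn compatible interface. A minimal sketch, assuming a recent MLflow version (1.20+) in which mlflow.xgboost.autolog() also covers the scikit-learn estimators; the hyperparameter values simply mirror the native-API example above:

import mlflow
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Enable autologging (also patches the sklearn-compatible estimators)
mlflow.xgboost.autolog()

# Load data
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train with the scikit-learn compatible API - no DMatrix needed
with mlflow.start_run():
    model = xgb.XGBRegressor(
        objective="reg:squarederror",
        max_depth=6,
        learning_rate=0.1,
        n_estimators=100,
    )
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)])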



What Gets Logged

When autologging is enabled, MLflow automatically captures the following (a sketch of loading the logged model back appears after the list):


Parameters: All booster parameters and training configuration

Metrics: Training and validation metrics for each boosting round (one series per dataset passed via evals)

Feature Importance: Multiple importance types (weight, gain, cover) with visualizations

Model: The trained model, serialized in XGBoost's native format and logged as an MLflow model

Artifacts: Feature importance plots and JSON data
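
All of this lands in the run's artifact store, so a logged model can be pulled back later for inference. A minimal sketch, assuming autologging stored the model under its default "model" artifact path; RUN_ID is a placeholder for a real run ID:

import mlflow
from sklearn.datasets import load_diabetes

# Placeholder: copy the run ID from the MLflow UI or mlflow.last_active_run()
run_id = "RUN_ID"

# Load the logged model back in its native XGBoost flavor
model = mlflow.xgboost.load_model(f"runs:/{run_id}/model")

# Or load it as a generic pyfunc for framework-agnostic scoring
pyfunc_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
preds = pyfunc_model.predict(load_diabetes().data[:5])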

