Train PyTorch models at scale with Azure Machine Learning

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you learn how to train, hyperparameter tune, and deploy a PyTorch model by using the Azure Machine Learning Python SDK v2.

You use example scripts to build a deep learning neural network (DNN) that classifies chicken and turkey images, based on PyTorch's transfer learning tutorial. Transfer learning is a technique that applies knowledge gained from solving one problem to a different but related problem. Transfer learning shortens the training process by requiring less data, time, and compute resources than training from scratch. To learn more about transfer learning, see Deep learning vs. machine learning.

Whether you're training a deep learning PyTorch model from the ground up or bringing an existing model into the cloud, use Azure Machine Learning to scale out open-source training jobs by using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.

Prerequisites

- An Azure subscription. If you don't have one already, create a free account.
- Run the code in this article by using either an Azure Machine Learning compute instance or your own Jupyter notebook.
  - Azure Machine Learning compute instance (no downloads or installation necessary):
    - Complete the Quickstart: Get started with Azure Machine Learning to create a dedicated notebook server preloaded with the SDK and the sample repository.
    - Under the Samples tab in the Notebooks section of your workspace, find a completed and expanded notebook by navigating to this directory: SDK v2/sdk/python/jobs/single-step/pytorch/train-hyperparameter-tune-deploy-with-pytorch
  - Your Jupyter notebook server:
    - Install the Azure Machine Learning SDK (v2).
    - Download the training script file pytorch_train.py.

You can also find a completed Jupyter notebook version of this guide on the GitHub samples page.

Set up the job

This section sets up the job for training by loading the required Python packages, connecting to a workspace, creating a compute resource to run a command job, and creating an environment to run the job.

Connect to the workspace

First, connect to your Azure Machine Learning workspace. The workspace is the top-level resource for the service. It provides a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

Use DefaultAzureCredential to access the workspace. This credential can handle most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see azure.identity package or Set up authentication for more available credentials.

# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to use a browser to sign in and authenticate, uncomment the following code and use it instead.

# Handle to the workspace
# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

Next, get a handle to the workspace by providing your subscription ID, resource group name, and workspace name. To find these parameters:

1. Look for your workspace name in the upper-right corner of the Azure Machine Learning studio toolbar.
2. Select your workspace name to show your resource group and subscription ID.
3. Copy the values for your resource group and subscription ID into the code.

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you can use to manage other resources and jobs.
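
Rather than pasting the identifiers into the notebook, you might prefer to read them from environment variables. The following is a minimal sketch and isn't part of the sample notebook. The variable names AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, and AZURE_ML_WORKSPACE and the helper workspace_settings are assumptions chosen for illustration. The returned dictionary uses the same keyword names that the MLClient call in this section takes.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
_ENV_TO_KWARG = {
    "AZURE_SUBSCRIPTION_ID": "subscription_id",
    "AZURE_RESOURCE_GROUP": "resource_group_name",
    "AZURE_ML_WORKSPACE": "workspace_name",
}


def workspace_settings(environ=None):
    """Return MLClient keyword arguments read from environment variables."""
    environ = os.environ if environ is None else environ
    # Fail early with a clear message if any value is unset or empty
    missing = [name for name in _ENV_TO_KWARG if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: environ[name] for name, kwarg in _ENV_TO_KWARG.items()}
```

With the variables set, the handle could then be created as MLClient(credential=credential, **workspace_settings()).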

Note

Creating MLClient doesn't connect the client to the workspace. The client initialization is lazy and waits for the first time it needs to make a call. In this article, this call happens during compute creation.

Create a compute resource to run the job

Azure Machine Learning needs a compute resource to run a job. This resource can be single or multinode machines with Linux or Windows OS, or a specific compute fabric like Spark.

In the following example script, you provision a Linux compute cluster. See the Azure Machine Learning pricing page for the full list of VM sizes and prices. Because this example needs a GPU cluster, pick the Standard_NC4as_T4_v3 VM size and create an Azure Machine Learning compute cluster.

from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_NC4AS_T4_V3",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How long (in seconds) an idle node keeps running after the job ends
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's begin_create_or_update method
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)

Create a job environment

To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning environment encapsulates the dependencies, such as software runtime and libraries, needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

Azure Machine Learning allows you to either use a curated (or ready-made) environment or create a custom environment by using a Docker image or a Conda configuration. In this article, you reuse the curated Azure Machine Learning environment AzureML-acpt-pytorch-2.8-cuda12.6. Use the latest version of this environment by using the @latest directive.

curated_env_name = "AzureML-acpt-pytorch-2.8-cuda12.6@latest"

Configure and submit your training job

This section introduces the training data and then shows how to run a training job by using a training script that you provide. You build the training job by configuring the command that runs the training script. Then, you submit the training job to run in Azure Machine Learning.

Obtain the training data

You can use the dataset in this zipped file. This dataset consists of about 120 training images each for two classes (turkeys and chickens), with 100 validation images for each class. The images are a subset of the Open Images v5 Dataset. The training script pytorch_train.py downloads and extracts the dataset.

Prepare the training script

In the prerequisites section, you provided the training script pytorch_train.py. In practice, you should be able to take any custom training script as is and run it with Azure Machine Learning without having to modify your code.

The provided training script downloads the data, trains a model, and registers the model.
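
The job passes values to the script as command-line flags. As a rough sketch of what such an interface can look like, the following parser accepts --num_epochs and --output_dir, which appear in the command in this article, plus --learning_rate and --momentum, which are assumed here to mirror the job inputs. The function name parse_args and the default values are illustrative, and the actual pytorch_train.py might define its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse training hyperparameters from a list of command-line tokens."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--num_epochs", type=int, default=25)
    parser.add_argument("--output_dir", type=str, default="./outputs")
    # Assumed flags, named after the job inputs in this article
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--momentum", type=float, default=0.9)
    return parser.parse_args(argv)


# Parse an explicit argument list, as the job's command line would supply it
args = parse_args(["--num_epochs", "30", "--output_dir", "./outputs"])
print(args.num_epochs, args.output_dir, args.learning_rate, args.momentum)
```

Flags that the command line omits fall back to the parser defaults, so a value reaches the script only when the command passes it.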

Build the training job

Now that you have all the assets required to run your job, build it by using the Azure Machine Learning Python SDK v2. For this example, create a command.

An Azure Machine Learning command is a resource that specifies all the details needed to execute your training code in the cloud. These details include the inputs and outputs, type of hardware to use, software to install, and how to run your code. The command contains information to execute a single command.

Configure the command

Use the general purpose command to run the training script and perform your desired tasks. Create a command object to specify the configuration details of your training job.

from azure.ai.ml import command
from azure.ai.ml import Input

job = command(
    inputs=dict(
        num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs"
    ),
    compute=gpu_compute_target,
    environment=curated_env_name,
    code="./src/",  # location of source code
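    # Pass every tunable input to the script (assumes pytorch_train.py accepts --learning_rate and --momentum)
    command="python pytorch_train.py --num_epochs ${{inputs.num_epochs}} --learning_rate ${{inputs.learning_rate}} --momentum ${{inputs.momentum}} --output_dir ${{inputs.output_dir}}",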
    experiment_name="pytorch-birds",
    display_name="pytorch-birds-image",
)

The inputs for this command include the number of epochs, learning rate, momentum, and output directory.
For the parameter values:
1. Provide the compute cluster gpu_compute_target = "gpu-cluster" that you created for running this command.
2. Provide the curated environment that you initialized earlier.
3. If you're not using the completed notebook in the Samples folder, specify the location of the pytorch_train.py file.
4. Configure the command line action itself. In this case, the command is python pytorch_train.py. You can access the inputs and outputs in the command via the ${{ ... }} notation.
5. Configure metadata such as the display name and experiment name, where an experiment is a container for all the iterations one does on a certain project. All the jobs submitted under the same experiment name appear next to each other in Azure Machine Learning studio.
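
To illustrate what the ${{ ... }} notation amounts to, the following sketch substitutes input values into a command template with a regular expression. It's an illustration only, not how Azure Machine Learning resolves the expressions internally. The helper name render_command is hypothetical, and the sketch handles only ${{inputs.<name>}} placeholders.

```python
import re

# Matches ${{inputs.<name>}}, allowing optional spaces inside the braces
_PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(template, inputs):
    """Replace each ${{inputs.<name>}} placeholder with its value."""

    def replace(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"No input named {name!r}")
        return str(inputs[name])

    return _PLACEHOLDER.sub(replace, template)


template = (
    "python pytorch_train.py --num_epochs ${{inputs.num_epochs}} "
    "--output_dir ${{inputs.output_dir}}"
)
inputs = dict(num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs")
rendered = render_command(template, inputs)
print(rendered)  # python pytorch_train.py --num_epochs 30 --output_dir ./outputs
```

In this sketch, learning_rate and momentum are declared as inputs but don't appear in the rendered command, because the template doesn't reference them. An input reaches the script only if the command string references it.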

Submit the job

Now, submit the job to run in Azure Machine Learning. This time, use create_or_update on ml_client.jobs.

ml_client.jobs.create_or_update(job)

When the job finishes, it registers a model in your workspace as a result of training. It also outputs a link for viewing the job in Azure Machine Learning studio.

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use an .amlignore or .gitignore file to exclude it, or don't include it in the source directory.
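
Because the whole source directory is copied, it can help to check what the directory contains before you submit. The following sketch lists every file under a directory together with its size. The helper name list_source_files is hypothetical, and the listing doesn't apply any ignore-file rules, so it shows the directory contents before exclusions.

```python
from pathlib import Path


def list_source_files(source_dir):
    """Return (relative_path, size_in_bytes) for each file under source_dir."""
    root = Path(source_dir)
    files = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            files.append((path.relative_to(root).as_posix(), path.stat().st_size))
    return files
```

For example, calling list_source_files("./src/") and printing each entry gives a quick way to spot large or unexpected files before the job uploads them.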

What happens during job execution

As the job executes, it goes through the following stages:

Preparing: A Docker image is created according to the environment you defined. The process uploads the image to the workspace's container registry and caches it for later runs. The process also streams logs to the job history, so you can view them to monitor progress. If you specify a curated environment, the process uses the cached image that backs that curated environment.
Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.
Running: The process uploads all scripts in the src script folder to the compute target. It mounts or copies data stores. It executes the script. The process streams outputs from stdout and the ./logs folder to the job history. You can use these outputs to monitor the job.

Tune model hyperparameters

You trained the model with one set of parameters. Now, see if you can further improve the accuracy of your model. Tune and optimize your model's hyperparameters by using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space to search during training. Replace some of the parameters passed to the training job with special inputs from the azure.ai.ml.sweep package.

Since the training script uses a learning rate schedule to decay the learning rate every several epochs, you can tune the initial learning rate and the momentum parameters.

from azure.ai.ml.sweep import Uniform

# Reuse the command job (job) created earlier. Calling it as a function lets you override its inputs.
job_for_sweep = job(
    learning_rate=Uniform(min_value=0.0005, max_value=0.005),
    momentum=Uniform(min_value=0.9, max_value=0.99),
)

Then, configure sweep on the command job by using some sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, random sampling tries different configuration sets of hyperparameters in an attempt to maximize the primary metric, best_val_acc.
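
To get a feel for what random sampling over these ranges produces, the following sketch draws a few candidate configurations with Python's random module. It only mimics the idea, because the sweep service does its own sampling. The function name sample_trials, the trial count, and the seed are illustrative assumptions.

```python
import random

# Search space bounds taken from the Uniform ranges defined earlier
SEARCH_SPACE = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}


def sample_trials(space, n_trials, seed=None):
    """Draw n_trials independent configurations, uniform within each range."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        for _ in range(n_trials)
    ]


trials = sample_trials(SEARCH_SPACE, n_trials=8, seed=0)
for i, trial in enumerate(trials, start=1):
    print(f"trial {i}: lr={trial['learning_rate']:.5f}, momentum={trial['momentum']:.3f}")
```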

You also define an early termination policy, the BanditPolicy, to terminate poorly performing runs early. The BanditPolicy terminates any run that doesn't fall within the slack factor of the primary evaluation metric. You apply this policy every epoch (since the best_val_acc metric is reported every epoch and evaluation_interval=1). The first policy evaluation is delayed until after the first 10 epochs (delay_evaluation=10).

from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
    compute="gpu-cluster",
    sampling_algorithm="random",
    primary_metric="best_val_acc",
    goal="Maximize",
    max_total_trials=8,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(
        slack_factor=0.15, evaluation_interval=1, delay_evaluation=10
    ),
)
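
To make the slack factor concrete: when the goal is to maximize the metric, a Bandit policy compares each run against best / (1 + slack_factor), where best is the highest value reported at that evaluation interval. The following sketch works through that arithmetic with the slack_factor=0.15 used in this article. The function names and the sample accuracies are made up for illustration. The service, not this code, makes the actual termination decision.

```python
def bandit_threshold(best_metric, slack_factor):
    """Lowest metric value a run can report without being terminated (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if the run falls outside the allowed slack."""
    return run_metric < bandit_threshold(best_metric, slack_factor)


best = 0.92  # hypothetical best best_val_acc at this interval
threshold = bandit_threshold(best, slack_factor=0.15)
print(f"threshold: {threshold:.3f}")
for acc in (0.75, 0.85):
    print(f"run with acc={acc}: terminate={should_terminate(acc, best, 0.15)}")
```

With these sample numbers, a run at 0.75 falls below the threshold and would be stopped, while a run at 0.85 would continue.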

Now, submit this job as before. This time, you're running a sweep job that sweeps over your training job.

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

You can monitor the job by using the studio user interface link that's presented during the job run.

Find the best model

When all the runs finish, find the run that produced the model with the highest accuracy.

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # Let's get the model from this run
    model = Model(
        # the script stores the model as "outputs"
        path="azureml://jobs/{}/outputs/artifacts/paths/outputs/".format(best_run),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )

else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

Deploy the model as an online endpoint

Now you can deploy your model as an online endpoint—that is, as a web service in the Azure cloud.

To deploy a machine learning service, you typically need:

The model assets that you want to deploy. These assets include the model's file and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input request (an entry script). This entry script receives data submitted to a deployed web service and passes it to the model. After the model processes the data, the script returns the model's response to the client. The script is specific to your model and must understand the data that the model expects and returns. When you use an MLflow model, Azure Machine Learning automatically creates this script for you.

For more information about deployment, see Deploy and score a machine learning model with managed online endpoint using Python SDK v2.

Create a new online endpoint

As a first step to deploying your model, create your online endpoint. The endpoint name must be unique in the entire Azure region. For this article, create a unique name by using a universally unique identifier (UUID).

import uuid

# Creating a unique name for the endpoint
online_endpoint_name = "aci-birds-endpoint-" + str(uuid.uuid4())[:8]

from azure.ai.ml.entities import ManagedOnlineEndpoint

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Classify turkey/chickens using transfer learning with PyTorch",
    auth_mode="key",
    tags={"data": "birds", "method": "transfer learning", "framework": "pytorch"},
)

endpoint = ml_client.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

After you create the endpoint, retrieve it as follows:

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Deploy the model to the endpoint

Deploy the model with the entry script. An endpoint can have multiple deployments. By using rules, the endpoint can direct traffic to these deployments.

In the following code, you create a single deployment that handles 100% of the incoming traffic. The code uses an arbitrary color name blue for the deployment. You can also use any other name such as green or red for the deployment.

The code to deploy the model to the endpoint:

Deploys the best version of the model that you registered earlier.
Scores the model by using the score.py file.
Uses the curated environment (that you specified earlier) to perform inferencing.

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)

online_deployment_name = "aci-blue"

# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name=online_deployment_name,
    endpoint_name=online_endpoint_name,
    model=model,
    environment=curated_env_name,
    code_configuration=CodeConfiguration(code="./score/", scoring_script="score.py"),
    instance_type="STANDARD_NC4AS_T4_V3",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()

Note

Expect this deployment to take some time to finish.

Test the deployed model

Now that you deployed the model to the endpoint, you can predict the output of the deployed model by using the invoke method on the endpoint.

To test the endpoint, use a sample image for prediction. First, display the image.

# install pillow if PIL cannot be imported
%pip install pillow
import json
from PIL import Image
import matplotlib.pyplot as plt

%matplotlib inline
plt.imshow(Image.open("test_img.jpg"))

Create a function to format and resize the image.

# install torch and torchvision if needed
%pip install torch
%pip install torchvision

import torch
from torchvision import transforms


def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )

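    image = Image.open(image_file)
    # ToTensor already returns a torch.Tensor, so no extra torch.tensor() copy is needed
    image = data_transforms(image).float()
    # Add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
    image = image.unsqueeze(0)
    return image.numpy()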

Format the image and convert it to a JSON file.

image_data = preprocess("test_img.jpg")
input_data = json.dumps({"data": image_data.tolist()})
with open("request.json", "w") as outfile:
    outfile.write(input_data)
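
Before you call the endpoint, you might want to confirm that the payload has the batch shape that the preprocessing implies: one image, three channels, 224 by 224 pixels. The following sketch computes the shape of a nested list loaded from JSON by using only the standard library. The helper names nested_shape and payload_shape are assumptions, and the check inspects only the first element at each level, so it assumes a regular (non-ragged) array.

```python
import json


def nested_shape(value):
    """Return the shape of a regular nested list, following the first element."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def payload_shape(json_text, key="data"):
    """Return the shape of the array stored under key in a JSON document."""
    return nested_shape(json.loads(json_text)[key])


# Tiny stand-in payload: 1 image, 1 channel, 2 x 2 pixels
example = json.dumps({"data": [[[[0.0, 0.0], [0.0, 0.0]]]]})
print(payload_shape(example))  # (1, 1, 2, 2)
```

For the request file written in this section, reading request.json and passing its text to payload_shape would be expected to give (1, 3, 224, 224).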

Invoke the endpoint with this JSON and print the result.

# test the blue deployment
result = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="request.json",
    deployment_name=online_deployment_name,
)

print(result)

Clean up resources

If you don't need the endpoint anymore, delete it to stop using the resource. Make sure no other deployments are using the endpoint before you delete it.

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

Note

Expect this cleanup to take a bit of time to finish.

In this article, you trained and registered a deep learning neural network by using PyTorch on Azure Machine Learning. You also deployed the model to an online endpoint.
