Prepare for Exam 70-774 - Performing Cloud Data Science with Azure Machine Learning

I recently passed Exam 70-774 - Performing Cloud Data Science with Azure Machine Learning and thought it may be helpful to provide some guidance on how best to prepare for the exam. It is worth noting that, at the time of writing, the exam focuses on Azure Machine Learning Studio for the most part. It does not cover the newer Azure Machine Learning Services, but does require some knowledge of Azure Cognitive Services and Microsoft CNTK.

The exam questions are drawn from four areas, which line up with the syllabus outlined on the exam page.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

You will want to become familiar with the various ways to import and export data within Azure Machine Learning.

Importing data

There are six ways to import data:

  1. Uploading from a local file via ML Studio
  2. Using the Import module
  3. Using an Execute Script module
  4. Using data saved as a dataset from another experiment
  5. Using sample datasets
  6. Entering data manually with the Enter Data Manually module

Azure ML supports several file formats and data types, and it is important to be aware of the common ones. Check the list here.

Import Data Module

It is key to become familiar with the options for this module. It supports the following ways to bring data into an Azure ML Studio experiment:

  • A Web URL using HTTP
  • Hadoop using HiveQL
  • Azure blob storage
  • Azure table
  • Azure SQL database or SQL Server on Azure VM
  • On-premises SQL Server database
  • A data feed provider (currently OData)
  • Azure Cosmos DB

Worth noting:

  • Web URL has to be public (authentication is not supported)
  • An on-premises SQL Server (or one locked down in an Azure VM) can be accessed through the Data Management Gateway.
  • Accessing data from Hadoop is done via Hive
  • For Azure SQL, Hadoop, SQL Server, and Cosmos DB, you can pass a query to the Import Data module to filter your data.
  • The Use cached results option caches the data in the Azure ML experiment after the first run to speed up performance. Use this if the data does not change often.

Exporting Data

You can export data in one of the following ways:

  1. Export Data module
  2. Execute Script module
  3. Convert to CSV and download the dataset locally

Export Data Module

This module supports exporting data to the following:

  • Hive Table in Hadoop Cluster
  • Azure SQL Database
  • Azure Table
  • Azure Blob Storage

Explore and summarize data

In Azure ML Studio experiments there are several ways to explore and summarize data.

Worth noting:

Cleanse data for Azure Machine Learning

There are various ways to clean data right within Azure ML Studio. Some of the reasons for cleaning data would be:

  • Data is incomplete (missing data)
  • Data is noisy (contains incorrect values or outliers)
  • Data is inconsistent (contains discrepancies)
  • There is too much data.

Here is a great document on details of cleaning data.

Worth noting:

  • The Clean Missing Data Module supports cleaning missing data in various ways.

  • Understand the options you can use to clean missing data, including MICE and Probabilistic PCA.
  • You can have Azure ML generate a missing value indicator column to flag which values were replaced.
  • There are two outputs from the Clean Missing Data module. One is the cleaned dataset, the other is the transform that you can reuse elsewhere.
  • You can identify and handle outliers using the Clip Values module.
  • If you want to remove a column, you can use either Clean Missing Data or the Select Columns in Dataset module.
  • You can use the Partition and Sample module to reduce the size of your dataset while maintaining the same ratios.
  • There is a Remove Duplicate Rows module.

Perform feature engineering

There is a great document here around feature engineering.

Worth noting:

Develop Machine Learning Models

There are many algorithms supported in Azure ML studio. I would highly recommend using this cheat sheet as a starting point.

Worth noting:

  • You can solve the following problem types in Azure ML:
    1. Classification (Is this A or B?)
    2. Regression (How much/many?)
    3. Clustering (How are these related?)
    4. Anomaly Detection (Is this weird?)
  • All are supervised learning except for clustering with K-means, which is unsupervised. Supervised just means your training data has labels.
  • It is worth understanding the Train Matchbox Recommender for generating recommendations (either by user or user and item).
  • You can use the Tune Model Hyperparameters module to automatically search for good parameters for your model.
  • Use the Split Data module to split your data into training and testing sets.
  • Use the Cross Validate Model module to understand the quality of your dataset and whether your model is affected by variations in it. Think of this module as an enhancement over simply splitting and testing.
  • You can compare the scores of two algorithms using the Evaluate Model module.
  • Understand score and evaluation metrics. This post does a great job breaking this down.
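
As a rough illustration of the cross-validation and evaluation points above, the sketch below partitions rows into k folds and computes the common binary classification metrics from actual and scored labels. It is plain Python written for this post, not the code behind the Cross Validate Model or Evaluate Model modules; the function names and sample labels are assumptions.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs so each row is tested exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def binary_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall and F1 for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, only for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
scored = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(actual, scored)

splits = list(k_fold_indices(n_rows=10, k=5))
```

With a single split, the metrics depend on which rows happen to land in the test set. Averaging the same metrics over every fold shows how sensitive the model is to that choice.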

Operationalize and Manage Azure Machine Learning services

  • Understand how to create a predictive web service and deploy it. See here

  • To reduce or change the columns used for input or output in the deployed web service, use the Select Columns in Dataset module.
  • A deployed web service offers both a Request/Response mode and a batch mode, and you can interact with both through a REST API.
  • Understand the outputs from the deployed predictive model. These will be “Scored Labels” and “Scored Probabilities”. Note that the scored probability is the probability that the result belongs to the positive class.

    Azure Machine Learning automatically decides which of the two classes in the dataset is the positive class. If the class labels are Boolean or integers, then the ‘true’ or ‘1’ labeled instances are assigned the positive class. If the labels are strings, as in the case of the income dataset, the labels are sorted alphabetically and the first level is chosen to be the negative class while the second level is the positive class.

  • You can set up your experiment for retraining.
  • You can connect to your published Azure ML experiment from Excel.
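
The positive class rule quoted above can be sketched in a few lines of Python. This is only my reading of that passage, not Azure ML's actual code: the function name is hypothetical, integer labels are assumed to be 0/1, "alphabetically" is approximated with Python's default string ordering, and the sample label strings are assumed for illustration.

```python
def positive_class(labels):
    """Pick the positive class for a two-class label column.

    Booleans: True is positive. Integers: 1 is positive (assumes 0/1 labels).
    Strings: sort the two labels; the second one is positive.
    """
    distinct = set(labels)
    if len(distinct) != 2:
        raise ValueError("expected exactly two distinct class labels")
    # Check bool before int, because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in distinct):
        return True
    if all(isinstance(v, int) for v in distinct):
        if 1 not in distinct:
            raise ValueError("integer labels are assumed to be 0/1")
        return 1
    if all(isinstance(v, str) for v in distinct):
        return sorted(distinct)[1]
    raise TypeError("labels must all be bool, int, or str")


# Hypothetical label columns, only for illustration.
income_positive = positive_class(["<=50K", ">50K", "<=50K"])
flag_positive = positive_class([0, 1, 1, 0])
```

Knowing which label ends up as the positive class matters when you read the scored probability, since that probability always refers to the positive class.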

Leverage other services for Machine Learning

Finally, Microsoft just released a book to help with this exam - check it out here