I recently passed Exam 70-774 - Performing Cloud Data Science with Azure Machine Learning and thought it might be helpful to provide some guidance on how best to prepare for the exam. It is worth noting that, at the time of writing, the exam focuses on Azure Machine Learning Studio for the most part. It does not cover the newer Azure Machine Learning service, but does require some knowledge of Azure Cognitive Services and Microsoft CNTK.
The exam questions are drawn from 4 areas, which line up with the syllabus outlined on the exam page.
Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning
You will want to become familiar with the various ways to import and export data within Azure Machine Learning.
Importing data
There are six ways to import data:
- Uploading from a local file via ML Studio
- Using the Import module
- Using an Execute Script module
- Using data saved as a dataset from another experiment
- Using sample datasets
- Entering data manually with the Enter Data Manually module
Azure ML supports several file formats and data types - definitely important to be aware of the common ones. Check the list here
Import Data Module
It is key to become familiar with the options for this module. It supports the following ways to bring data into an Azure ML Studio experiment:
- A Web URL using HTTP
- Hadoop using HiveQL
- Azure blob storage
- Azure table
- Azure SQL database or SQL Server on Azure VM
- On-premises SQL Server database
- A data feed provider, OData currently
- Azure Cosmos DB
Worth noting:
- Web URL has to be public (authentication is not supported)
- An on-premises SQL Server (or one locked down in an Azure VM) is accessed through the Data Management Gateway.
- Accessing data from Hadoop is done via Hive
- For Azure SQL, Hadoop, SQL Server and Cosmos DB you can pass in a query as part of the Import Data module to filter your data.
- The Use cached results option caches the data in the Azure ML experiment after the first run to speed up performance. Use this if the data does not change often.
Exporting Data
You can export data in one of the following ways:
- Export Data Module.
- Execute Script module
- Convert to CSV and download dataset locally.
Export Data Module
This module supports exporting data to the following:
- Hive Table in Hadoop Cluster
- Azure SQL Database
- Azure Table
- Azure Blob Storage
Explore and summarize data
In Azure ML Studio experiments there are several ways to explore and summarize data.
Worth Noting:
- There is a Summarize Data module that is worth understanding.
- This video does a decent job highlighting how you can explore data in Azure ML Studio.
- There are two group by modules: Group Data into Bins and Group Categorical Values
- You can convert columns that contain categorical values into a series of binary indicator columns that can more easily be used as features in a machine learning model using the Convert To Indicator Values module.
- You can use the Execute Python Script and Execute R Script modules for the following:
  - Custom summaries
  - Custom visualizations
- You can import zipped packages for use with Execute R Script and Execute Python Script modules.
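As a rough illustration of a custom summary, here is a minimal sketch of an Execute Python Script body. The module calls an `azureml_main` function with the datasets wired to its input ports as pandas DataFrames and expects a tuple of DataFrames back. The statistics chosen and the sample data are my own assumptions, purely for illustration.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point called by the Execute Python Script module.

    dataframe1 is the dataset connected to the first input port.
    Returns a one-element tuple holding the summary table.
    """
    numeric = dataframe1.select_dtypes(include=["number"])

    # One row per numeric column with a few hand-picked statistics.
    summary = pd.DataFrame({
        "column": list(numeric.columns),
        "missing": numeric.isnull().sum().values,
        "mean": numeric.mean().values,
        "median": numeric.median().values,
        "p95": numeric.quantile(0.95).values,
    })
    return (summary,)


# Hypothetical sample data for a local dry run outside ML Studio.
sample = pd.DataFrame({
    "age": [25, 32, None, 51],
    "hours": [40, 45, 38, 60],
    "occupation": ["a", "b", "a", "c"],
})
print(azureml_main(sample)[0])
```

Running the function locally like this is a handy way to debug a script before pasting it into the module.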
Cleanse data for Azure Machine Learning
There are various ways to clean data right within Azure ML Studio. Some of the reasons for cleaning data would be:
- Data is incomplete (missing data)
- Data is noisy (contains incorrect values or outliers)
- Data is inconsistent (contains discrepancies)
- There is too much data.
Here is a great document on details of cleaning data.
Worth noting:
- The Clean Missing Data module supports cleaning missing data in various ways.
  - Understand the options you can use to clean missing data, including MICE and Probabilistic PCA.
  - You can have Azure ML generate a missing value indicator column to flag which entries had a value replaced.
  - There are two outputs from the Clean Missing Data module. One is the cleaned dataset, the other is the transform that you can reuse elsewhere.
- You can identify and handle outliers using the Clip Values module.
- If you want to remove a column you could use Clean Missing Data or you could use the Select Columns in Dataset module.
- You can use the Partition and Sample module to reduce the size of your dataset while maintaining the same ratios.
- There is a Remove Duplicate Rows module.
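To make the cleaning steps above more concrete, here is a small plain-Python sketch of what mean substitution with an indicator column, value clipping and duplicate removal do conceptually. It is not how the Studio modules are implemented; the function names, the `_IsMissing` suffix and the sample rows are assumptions for illustration only.

```python
from statistics import mean


def clean_missing_with_mean(rows, column):
    """Replace missing (None) values in `column` with the column mean and
    add a boolean flag column recording which entries were filled."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = mean(present)
    cleaned = []
    for r in rows:
        was_missing = r[column] is None
        new_row = dict(r)
        new_row[column] = fill if was_missing else r[column]
        new_row[column + "_IsMissing"] = was_missing
        cleaned.append(new_row)
    return cleaned


def clip_values(rows, column, lower, upper):
    """Clamp values outside [lower, upper] to the nearest bound."""
    return [dict(r, **{column: min(max(r[column], lower), upper)}) for r in rows]


def remove_duplicate_rows(rows, key_columns):
    """Keep the first row seen for each combination of key column values."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(r[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Hypothetical sample data.
rows = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},      # missing value
    {"id": 3, "income": 900000},    # outlier
    {"id": 3, "income": 900000},    # duplicate of the row above
    {"id": 4, "income": 50000},
]
rows = remove_duplicate_rows(rows, ["id"])
rows = clean_missing_with_mean(rows, "income")
rows = clip_values(rows, "income", 0, 100000)
for r in rows:
    print(r)
```

Note that the order matters: filling with the mean before clipping lets the outlier pull the fill value up, which is one reason to think about the sequence of cleaning modules in an experiment.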
Perform feature engineering
There is a great document here around feature engineering.
Worth noting:
- There are various modules for merging datasets in Azure ML Studio.
- There are three modules to help automatically select the right features:
- The Principal Component Analysis module can help reduce data dimensionality. The module analyzes your data and creates a reduced feature set that captures most of the information contained in the dataset, but in a smaller number of features.
- The Edit Metadata module allows you to alter metadata about columns in your dataset, e.g. identifying which column is the label (or class), what the column data type is, etc.
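For intuition on what Principal Component Analysis does, here is a minimal NumPy sketch (assuming NumPy is available). It centers the data, finds the principal directions via SVD and projects onto the top few. The function name `pca_reduce` and the sample matrix are hypothetical, and this is not the Studio module's implementation.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (rows = samples, columns = features) onto its top
    n_components principal components.

    Returns the projected data and the fraction of total variance
    explained by each component that was kept.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    projected = centered @ Vt[:n_components].T
    return projected, explained[:n_components]


# Hypothetical data: the second feature is roughly twice the first,
# so one component should capture nearly all of the variance.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
reduced, ratio = pca_reduce(X, n_components=1)
print(reduced.shape, ratio)
```

The explained variance ratio is a quick check on how much information survives the reduction.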
Develop Machine Learning Models
There are many algorithms supported in Azure ML Studio. I would highly recommend using this cheat sheet as a starting point.
Worth noting:
- You can solve the following problem types in Azure ML:
  - Classification (Is this A or B?)
  - Regression (How much or how many?)
  - Clustering (How are these related?)
  - Anomaly detection (Is this weird?)
- All are supervised learning except for K-means, which is unsupervised. Supervised just means your training data has labels.
- It is worth understanding the Train Matchbox Recommender for generating recommendations (either by user or user and item).
- You can use the Tune Model Hyperparameters module to search for the best parameter settings for your model.
- Use the Split Data module to split your data into training and testing sets.
- Use the Cross-Validate Model module to understand the quality of your dataset and whether your model is affected by variations in it. Think of this module as an enhancement over simply splitting and testing.
- You can compare the scores of two algorithms using the Evaluate Model module.
- Understand score and evaluation metrics. This post does a great job breaking this down.
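The common binary classification metrics are easy to work out by hand, which helps when reading evaluation output. Below is a minimal sketch that counts the confusion matrix from actual labels and scored probabilities and derives accuracy, precision, recall and F1. The 0.5 threshold, the `>=` comparison and the sample values are assumptions for illustration.

```python
def confusion_counts(actual, scored_probabilities, threshold=0.5):
    """Count true/false positives and negatives for a binary classifier.

    `actual` holds 1 for the positive class and 0 for the negative class.
    A probability at or above `threshold` counts as a positive prediction.
    """
    tp = fp = tn = fn = 0
    for label, prob in zip(actual, scored_probabilities):
        predicted_positive = prob >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """Derive the usual metrics from the confusion matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels and scores.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
print(evaluation_metrics(*confusion_counts(actual, probs)))
```

Raising the threshold generally trades recall for precision, which is the trade-off the ROC and precision/recall curves visualize.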
Operationalize and Manage Azure Machine Learning services
- Understand how to create a predictive web service and deploy it. See here
- To reduce or change the columns used for input or output in the deployed web service, use the Select Columns in Dataset module.
- A deployed web service has both a Request/Response mode and a Batch mode, and you can interact with both via a REST API.
- Understand the outputs from the deployed predictive model. These will be “Scored Labels” and “Scored Probabilities”. Note that Scored Probabilities is the probability that the result belongs to the positive class.
Azure Machine Learning automatically decides which of the two classes in the dataset is the positive class. If the class labels are Boolean or integers, then the ‘true’ or ‘1’ labeled instances are assigned the positive class. If the labels are strings, as in the case of the income dataset, the labels are sorted alphabetically and the first level is chosen to be the negative class while the second level is the positive class.
- You can set up your experiment for retraining.
- You can connect to your published Azure ML experiment from Excel.
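To give a feel for the Request/Response mode, here is a sketch that builds the HTTP request and reads the scored columns out of a response, using only the standard library. The JSON shapes below are what I recall from the sample code on a classic web service's API help page, so treat them as assumptions and check your own service's help page. The URL, API key, input/output port names and sample response are placeholders.

```python
import json
import urllib.request

# Placeholders: copy the real values from your web service dashboard.
SERVICE_URL = "https://example.invalid/execute?api-version=2.0&details=true"
API_KEY = "<your-api-key>"


def build_request(column_names, rows, url=SERVICE_URL, api_key=API_KEY):
    """Build (but do not send) a Request/Response call.

    The body layout (Inputs -> input1 -> ColumnNames/Values) is assumed.
    """
    body = {
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def extract_scores(response_json, output_name="output1"):
    """Pull (label, probability) pairs out of an assumed response shape."""
    table = response_json["Results"][output_name]["value"]
    names = table["ColumnNames"]
    label_i = names.index("Scored Labels")
    prob_i = names.index("Scored Probabilities")
    return [(row[label_i], float(row[prob_i])) for row in table["Values"]]


request = build_request(["age", "hours-per-week"], [["39", "40"]])
# To actually call the service:
#   with urllib.request.urlopen(request) as resp:
#       result = json.loads(resp.read().decode("utf-8"))

# Hypothetical response, hard-coded so the sketch runs offline.
result = {"Results": {"output1": {"type": "table", "value": {
    "ColumnNames": ["age", "Scored Labels", "Scored Probabilities"],
    "Values": [["39", ">50K", "0.83"]],
}}}}
print(extract_scores(result))
```

Batch mode works differently (you submit a job that reads from and writes to storage), so this sketch only covers the single-call case.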
Leverage other services for Machine Learning
- Get an understanding of neural networks. Microsoft CNTK is a framework for building neural networks. Think of it as similar to TensorFlow.
- In Azure, the N-series VMs come with NVIDIA GPUs and can be used for training a neural network.
- Check out the Azure AI Gallery - it was just recently renamed and so may also be called Cortana Intelligence Gallery.
- Gain an understanding of how you can use HDInsight for Machine Learning
- Understand how to enable support for R in SQL Server and how you would call an R script with data in SQL.
- Understand the following key Azure Cognitive Services:
Finally, Microsoft just released a book to help with this exam - check it out here