Machine Learning with Spark on Google Cloud Dataproc
In this lab you implement logistic regression using Apache Spark's machine learning library. Spark runs on a Dataproc cluster, where you develop a model from a multivariable dataset.
Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. Dataproc integrates easily with other Google Cloud services, giving you a powerful and complete platform for data processing, analytics, and machine learning.
Apache Spark is an analytics engine for large-scale data processing. Logistic regression is available as a module in Apache Spark's machine learning library, MLlib. Spark MLlib, also called Spark ML, includes implementations for most standard machine learning algorithms, such as k-means clustering, random forests, alternating least squares, decision trees, and support vector machines. Spark can run on a Hadoop cluster, such as one managed by Dataproc, to process very large datasets in parallel.
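As a quick preview of what the DataFrame-based Spark ML API looks like (this lab itself uses the RDD-based pyspark.mllib API later on), here is a minimal, self-contained sketch; the labels and feature values are invented purely to show the shape of the API:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("spark-ml-preview").getOrCreate()

# Tiny invented dataset: each row pairs a 0/1 label with a feature vector.
train = spark.createDataFrame([
    (1.0, Vectors.dense([10.0, 5.0])),
    (0.0, Vectors.dense([45.0, 30.0])),
    (1.0, Vectors.dense([2.0, 0.0])),
    (0.0, Vectors.dense([60.0, 25.0])),
], ["label", "features"])

# Fit a logistic regression model and inspect the learned parameters.
lr = LogisticRegression(maxIter=10)
model = lr.fit(train)
print(model.coefficients, model.intercept)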
The base dataset used in this lab is retrieved from the US Bureau of Transportation Statistics. It provides historical information about internal flights in the United States and can be used to demonstrate a wide range of data science concepts and techniques. This lab provides the data as a set of CSV-formatted text files.
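Once a Spark session is available (see Task 2), CSV files like these can be loaded directly into a Spark DataFrame. The sketch below uses a placeholder bucket path, not the exact location used later in the lab:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-flights-csv").getOrCreate()

# Placeholder path; substitute the bucket and prefix where your CSV files actually live.
csv_path = "gs://YOUR-BUCKET-dsongcp/flights/*.csv"

# header=True reads column names from the first row; inferSchema=True guesses column types.
flights = spark.read.csv(csv_path, header=True, inferSchema=True)
flights.printSchema()
print(flights.count())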
Objectives
Create a training dataset for machine learning using Spark
Develop a logistic regression machine learning model using Spark
Evaluate the predictive behavior of a machine learning model using Spark on Dataproc
Task 1. Create a Dataproc cluster
Normally, the first step in writing Hadoop jobs is to get a Hadoop installation going. This involves setting up a cluster, installing Hadoop on it, and configuring the cluster so that the machines all know about one another and can communicate with one another in a secure manner.
Then, you’d start the YARN and MapReduce processes and finally be ready to write some Hadoop programs. On Google Cloud, Dataproc makes it convenient to spin up a Hadoop cluster that is capable of running MapReduce, Pig, Hive, Presto, and Spark.
If you are using Spark, Dataproc offers a fully managed, serverless Spark environment – you simply submit a Spark program and Dataproc executes it. In this way, Dataproc is to Apache Spark what Dataflow is to Apache Beam. In fact, Dataproc and Dataflow share backend services.
In this section, you create a VM and then use it to create a Dataproc cluster.
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$ export BUCKET_NAME=$PROJECT_ID-dsongcp
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$ cat create_cluster.sh
#!/bin/bash
if [ "$#" -ne 2 ]; then
echo "Usage: ./create_cluster.sh bucket-name region"
exit
fi
PROJECT=$(gcloud config get-value project)
BUCKET=$1
REGION=$2
EMAIL=$3
INSTALL=gs://$BUCKET/flights/dataproc/install_on_cluster.sh
# upload install file
sed "s/CHANGE_TO_USER_NAME/dataproc/g" install_on_cluster.sh > /tmp/install_on_cluster.sh
gsutil cp /tmp/install_on_cluster.sh $INSTALL
# create cluster
gcloud dataproc clusters create ch6cluster \
--enable-component-gateway \
--region ${REGION} --zone ${REGION}-a \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 --num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--optional-components JUPYTER --project $PROJECT \
--initialization-actions=$INSTALL \
--scopes https://www.googleapis.com/auth/cloud-platform
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$ ./create_cluster.sh $BUCKET_NAME us-central1
Copying file:///tmp/install_on_cluster.sh [Content-Type=text/x-sh]...
/ [1 files][ 487.0 B/ 487.0 B]
Operation completed over 1 objects/487.0 B.
Waiting on operation [projects/qwiklabs-gcp-00-67c3ead7c845/regions/us-central1/operations/fe620bc0-255b-309f-b9aa-c492d922af1d].
Waiting for cluster creation operation...
WARNING: No image specified. Using the default image version. It is recommended to select a specific image version in production, as the default image version may change at any time.
WARNING: Consider using Auto Zone rather than selecting a zone manually. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone
WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
WARNING: Permissions are missing for the default service account '702429846243-compute@developer.gserviceaccount.com', missing permissions: [storage.buckets.get, storage.objects.create, storage.objects.delete, storage.objects.get, storage.objects.list, storage.objects.update] on the project 'projects/qwiklabs-gcp-00-67c3ead7c845'. This usually happens when a custom resource (ex: custom staging bucket) or a user-managed VM Service account has been provided and the default/user-managed service account hasn't been granted enough permissions on the resource. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#VM_service_account.
WARNING: The firewall rules for specified network or subnetwork would allow ingress traffic from 0.0.0.0/0, which could be a security risk.
Waiting for cluster creation operation...done.
Created [https://dataproc.googleapis.com/v1/projects/qwiklabs-gcp-00-67c3ead7c845/regions/us-central1/clusters/ch6cluster] Cluster placed in zone [us-central1-a].
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$ gcloud dataproc clusters list --region='us-central1'
NAME PLATFORM WORKER_COUNT PREEMPTIBLE_WORKER_COUNT STATUS ZONE SCHEDULED_DELETE
ch6cluster GCE 2 RUNNING us-central1-a
student-00-72f7c6ef0163@startup-vm:~/data-science-on-gcp/06_dataproc$
JupyterLab on Dataproc
In the Cloud Console, on the Navigation menu, click Dataproc. You may have to click More Products and scroll down.
In the Cluster list, click on the cluster name to view cluster details.
Click the Web Interfaces tab and then click JupyterLab towards the bottom of the right pane.
In the Notebook launcher section click Python 3 to open a new notebook.
To use a notebook, you enter commands into a cell. Run the commands in a cell either by pressing Shift + Enter or by clicking the triangle on the notebook's top menu to Run selected cells and advance.
Task 2. Set up the bucket and start a PySpark session
Specify the Google Cloud Storage bucket where your raw files are hosted:
PROJECT = !gcloud config get-value project
PROJECT = PROJECT[0]
BUCKET = PROJECT + '-dsongcp'
import os
os.environ['BUCKET'] = PROJECT + '-dsongcp'
Run the cell by either pressing Shift + Enter, or clicking the triangle on the Notebook top menu to Run selected cells and advance.
Note: After pasting commands into the Jupyter notebook cell, always run the cell to execute the command and advance to the next cell.
Create a Spark session using the following code block:
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext('local', 'logistic')

spark = SparkSession \
    .builder \
    .appName("Logistic regression w/ Spark ML") \
    .getOrCreate()
Once that code is added at the start of any Spark Python script, any code developed using the interactive Spark shell or Jupyter notebook will also work when launched as a standalone script.
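For instance, a standalone version of this notebook might look like the hypothetical logistic.py below, which could be submitted to the cluster with spark-submit or gcloud dataproc jobs submit pyspark instead of being run cell by cell:

#!/usr/bin/env python3
# logistic.py -- hypothetical standalone version of the notebook code
from pyspark.sql import SparkSession
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext('local', 'logistic')
    spark = SparkSession \
        .builder \
        .appName("Logistic regression w/ Spark ML") \
        .getOrCreate()

    # ... the remaining notebook cells would go here unchanged ...

    spark.stop()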
Create a Spark DataFrame for training
Enter the following commands into a new cell:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
Run the cell.
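These two imports are typically used together: LabeledPoint pairs a label with a feature vector, and LogisticRegressionWithLBFGS.train fits a model to an RDD of such points. As a minimal sketch with made-up labels and feature values (the lab goes on to build its real training examples from the flight data):

# Toy training RDD; the labels and feature values are invented for illustration only.
examples = sc.parallelize([
    LabeledPoint(1.0, [10.0, 5.0]),
    LabeledPoint(0.0, [45.0, 30.0]),
    LabeledPoint(1.0, [2.0, 0.0]),
    LabeledPoint(0.0, [60.0, 25.0]),
])

# Fit a logistic regression model with the L-BFGS optimizer.
model = LogisticRegressionWithLBFGS.train(examples, iterations=10)

# Predict the class (0 or 1) for a new feature vector.
print(model.predict([12.0, 4.0]))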