Processing Data with Google Cloud Dataflow

 

Analytics and visualization


Cloud Dataflow = Azure Databricks: Managed platform for streaming batch data based on open-source Apache products.

Data Studio & Looker = Power BI: Business intelligence tools that build visualizations, perform ad hoc analysis, and develop business insights from data.

BigQuery = SQL Server Analysis Services: Provides a serverless, interactive query service that uses standard SQL for analyzing data.

Big Data & Analytics:

Dataproc = Azure HDInsight / Azure Synapse Analytics / Azure Databricks: Managed Apache Spark-based analytics platform.



DATAFLOW: Processing Data with Google Cloud Dataflow


In this lab, you simulate a real-time, real-world dataset from a historical dataset. You use Python and Dataflow to process a simulated dataset from a set of text files, and then use BigQuery to store and analyze the resulting data.

The historical dataset this lab uses is from the US Bureau of Transport Statistics website, which provides historic information about internal flights in the United States.

Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes via Java and Python APIs with the Apache Beam SDK. Dataflow provides a serverless architecture that can be used to shard and process very large batch datasets, or high volume live streams of data, in parallel.
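To make that concrete, here is a minimal, self-contained Beam pipeline in Python. It is an illustration only, not one of the lab's df*.py scripts: with no runner options it executes locally on the DirectRunner, and the same code can be handed to Dataflow by supplying Google Cloud pipeline options.

# Minimal Apache Beam pipeline, for illustration only (not a lab script).
# Without runner options it runs locally on the DirectRunner; pass Dataflow
# pipeline options to run the identical code as a managed Dataflow job.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "create"  >> beam.Create(["AA,2015-01-04,JFK", "DL,2015-01-04,ATL"])
     | "carrier" >> beam.Map(lambda line: line.split(",")[0])
     | "count"   >> beam.combiners.Count.PerElement()
     | "print"   >> beam.Map(print))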

BigQuery is a RESTful web service that enables interactive analysis of massive datasets working in conjunction with Cloud Storage.

Objectives
  • Configure a Python application to create a simulated real-time data stream from historical data.
  • Use Apache Beam to test the Dataflow pipeline locally.
  • Use Apache Beam to process data using Dataflow to create a simulated real-time dataset.
  • Query the simulated real-time data stream using BigQuery.
student-01-2251dc43ebeb@lab-vm-ql:~/$ history
    1  git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp/
    2  cd data-science-on-gcp/04_streaming/transform/
    7  ./install_packages.sh 
    9  gzip -d airports.csv.gz
   11  head -n1000 airports.csv > temp.csv
   13  mv airports.csv airports.csv.backup
   14  mv temp.csv airports.csv
   16  gzip airports.csv
   18  vi df05.py 
   21  ./df05.py 
   23  head -3 all_events-00000-of-00001 
   24  head -3 all_flights-00000-of-00001 
   25  export PROJECT_ID=$(gcloud info --format='value(config.project)')
   26  export BUCKET=${PROJECT_ID}-ml
   27  ./stage_airports_file.sh $BUCKET
   28  ./df06.py --project $PROJECT_ID --bucket $BUCKET
   31  ./df07.py -p $PROJECT_ID -b $BUCKET -r us-west1
student-01-2251dc43ebeb@lab-vm-ql:~/$ 


Prepare your environment


Clone the Data Science sample
In the Cloud Console, on the Navigation menu, click Compute Engine > VM instances.

Click the SSH button next to the lab-vm-ql VM to launch a terminal and connect.

In the terminal, enter the following command to clone the repository:

git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp/
Change to the repository source directory for this lab:

cd ~/data-science-on-gcp/04_streaming/transform
Install required packages:

./install_packages.sh

Note: When you run these commands, ignore errors related to Google utilities and incompatible packages.
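As a quick, optional sanity check (not part of the lab instructions), you can confirm that the Apache Beam SDK that install_packages.sh is expected to install actually imports:

# Optional sanity check, not part of the lab: confirm the Apache Beam SDK
# installed by install_packages.sh is importable and print its version.
import apache_beam as beam
print("Apache Beam version:", beam.__version__)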





Configure BigQuery and Dataflow for your project.



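The pipeline scripts in this lab write their results into a BigQuery dataset, so that dataset must exist in your project. Below is a minimal sketch of creating it from Python with the google.cloud.bigquery client; the dataset name flights is an assumption based on this lab, and your environment may already provide it (or you may prefer the bq command-line tool).

# Sketch: create the BigQuery dataset the pipeline will write into.
# The name "flights" is an assumption -- use whatever dataset the lab scripts expect.
from google.cloud import bigquery

client = bigquery.Client()                      # project taken from your gcloud credentials
dataset = bigquery.Dataset(f"{client.project}.flights")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists
print(f"Dataset {dataset.dataset_id} is ready")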


The Cloud Storage bucket 

The historical flights text data files are already in the bucket.
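You can confirm the input files are present before launching anything. Here is a small sketch using the google.cloud.storage client; the bucket name follows the lab's <project-id>-ml convention, and the flights prefix is an assumption, so adjust it to wherever the CSV files actually live.

# Sketch: list the historical flights files in the lab bucket.
# The bucket name follows the lab convention <project-id>-ml; the "flights"
# prefix is an assumption -- change it to match where the files really are.
from google.cloud import storage

client = storage.Client()
bucket_name = f"{client.project}-ml"
for blob in client.list_blobs(bucket_name, prefix="flights"):
    print(blob.name, blob.size)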






So far, you've been reading and writing files from a single local location.

Once you start to run your code in production, in a serverless environment, the concept of a local “location” no longer makes sense. You have to read from and write to Cloud Storage.

Also, because this is structured data, it is preferable to read from and write to BigQuery.
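In Beam, that pattern looks roughly like the sketch below. It is not the lab's df06.py; the bucket, table, and schema names are placeholders.

# Sketch of the Cloud Storage -> BigQuery pattern; all names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(temp_location="gs://YOUR-BUCKET/tmp")

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "read"  >> beam.io.ReadFromText("gs://YOUR-BUCKET/flights/raw/*.csv")
     | "parse" >> beam.Map(lambda line: {"raw_line": line})
     | "write" >> beam.io.WriteToBigQuery(
           "YOUR_PROJECT:flights.raw_lines",
           schema="raw_line:STRING",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))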

Now copy the airport geolocation file to your Cloud Storage bucket.

This file identifies the physical location of each airport, which is needed to convert the local time fields to universal time.
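The idea behind that conversion: each flight's departure and arrival times are local to the airport, so you use the airport's latitude/longitude to look up its time zone and shift the timestamp to UTC. A rough sketch of one way to do this is below, using timezonefinder and pytz; these libraries and the example values are illustrative assumptions, not necessarily what the lab scripts use.

# Sketch: convert an airport-local timestamp to UTC using the airport's
# latitude/longitude. timezonefinder and pytz are illustrative choices;
# the lab's own scripts may use different libraries or field names.
from datetime import datetime

import pytz
from timezonefinder import TimezoneFinder

tf = TimezoneFinder()

def as_utc(local_dt: datetime, lat: float, lng: float) -> datetime:
    tz_name = tf.timezone_at(lng=lng, lat=lat)     # e.g. "America/New_York"
    return pytz.timezone(tz_name).localize(local_dt).astimezone(pytz.utc)

# JFK is roughly at (40.64, -73.78); a 07:30 local winter departure is 12:30 UTC.
print(as_utc(datetime(2015, 1, 4, 7, 30), lat=40.64, lng=-73.78))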


Task 3. Process the data using Dataflow

student-01-2251dc43ebeb@lab-vm-ql:~/data-science-on-gcp/04_streaming/transform$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
export BUCKET=${PROJECT_ID}-ml
./stage_airports_file.sh $BUCKET
./df06.py --project $PROJECT_ID --bucket $BUCKET


Copying file://airports.csv.gz [Content-Type=text/csv]...
/ [1 files][ 43.7 KiB/ 43.7 KiB]                                                
Operation completed over 1 objects/43.7 KiB.                                     
/usr/lib/google-cloud-sdk/platform/bq/bq.py:42: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Waiting on bqjob_r5090e3b26112a6c5_00000189a05bf423_1 ... (1s) Current status: DONE
Correcting timestamps and writing to BigQuery dataset
/home/student-01-2251dc43ebeb/.local/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery.py:2664: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  temp_location = pcoll.pipeline.options.view_as(
/home/student-01-2251dc43ebeb/.local/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery_file_loads.py:1169: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  self.project = self.project or p.options.view_as(GoogleCloudOptions).project


Monitor the Dataflow job and inspect the processed data
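The final script in the command history (./df07.py -p $PROJECT_ID -b $BUCKET -r us-west1) launches the pipeline on the Dataflow service, so you can follow its progress in the Cloud Console under Navigation menu > Dataflow. Once the job starts writing output, you can spot-check the destination table; the flights.simevents name below is an assumption based on this lab, so adjust it to whatever your run actually produced.

# Sketch: peek at a few rows of the table the pipeline wrote to BigQuery.
# "flights.simevents" is an assumed dataset.table name -- adjust as needed.
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `flights.simevents` LIMIT 5").result()
for row in rows:
    print(dict(row))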









