Processing Data with Google Cloud Dataflow

 

Analytics and visualization


Cloud Dataflow = Azure Databricks: Managed platform for streaming batch data based on open-source Apache products.

Data Studio & Looker = Power BI: Business intelligence tools that build visualizations, perform ad hoc analysis, and develop business insights from data.

BigQuery = SQL Server Analysis Services: Provides a serverless, interactive query service that uses standard SQL for analyzing data.

Big Data & Analytics:

Dataproc = Azure HDInsight / Azure Synapse Analytics / Azure Databricks: Managed Apache Spark-based analytics platform.



DATAFLOW: Processing Data with Google Cloud Dataflow


In this lab, you simulate a real-time, real-world dataset from a historical dataset. You use Python and Dataflow to process a simulated dataset from a set of text files, and then use BigQuery to store and analyze the resulting data.

The historical dataset this lab uses is from the US Bureau of Transport Statistics website, which provides historic information about internal flights in the United States.

Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes via Java and Python APIs with the Apache Beam SDK. Dataflow provides a serverless architecture that can be used to shard and process very large batch datasets, or high volume live streams of data, in parallel.
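To make that concrete, here is a minimal, self-contained Beam pipeline in Python. It is an illustration only, not one of the lab's df*.py scripts: with no runner options it executes locally on the DirectRunner, and the same code can be handed to Dataflow by supplying Google Cloud pipeline options.

# Minimal Apache Beam pipeline, for illustration only (not a lab script).
# Without runner options it runs locally on the DirectRunner; pass Dataflow
# pipeline options to run the identical code as a managed Dataflow job.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "create"  >> beam.Create(["AA,2015-01-04,JFK", "DL,2015-01-04,ATL"])
     | "carrier" >> beam.Map(lambda line: line.split(",")[0])
     | "count"   >> beam.combiners.Count.PerElement()
     | "print"   >> beam.Map(print))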

BigQuery is a RESTful web service that enables interactive analysis of massive datasets working in conjunction with Cloud Storage.

Objectives
  • Configure a Python application to create a simulated real-time data stream from historical data.
  • Use Apache Beam to test the Dataflow pipeline locally.
  • Use Apache Beam to process data using Dataflow to create a simulated real-time dataset.
  • Query the simulated real-time data stream using BigQuery.
student-01-2251dc43ebeb@lab-vm-ql:~/$ history
    1  git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp/
    2  cd data-science-on-gcp/04_streaming/transform/
    7  ./install_packages.sh 
    9  gzip -d airports.csv.gz
   11  head -n1000 airports.csv > temp.csv
   13  mv airports.csv airports.csv.backup
   14  mv temp.csv airports.csv
   16  gzip airports.csv
   18  vi df05.py 
   21  ./df05.py 
   23  head -3 all_events-00000-of-00001 
   24  head -3 all_flights-00000-of-00001 
   25  export PROJECT_ID=$(gcloud info --format='value(config.project)')
   26  export BUCKET=${PROJECT_ID}-ml
   27  ./stage_airports_file.sh $BUCKET
   28  ./df06.py --project $PROJECT_ID --bucket $BUCKET
   31  ./df07.py -p $PROJECT_ID -b $BUCKET -r us-west1
student-01-2251dc43ebeb@lab-vm-ql:~/$ 


Prepare your environment


Clone the Data Science sample
In the Cloud Console, on the Navigation menu, click Compute Engine > VM instances.

Click the SSH button next to the lab-vm-ql VM to launch a terminal and connect.

In the terminal, enter the following command to clone the repository:

git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp/
Change to the repository source directory for this lab:

cd ~/data-science-on-gcp/04_streaming/transform
Install required packages:

./install_packages.sh

Note: When you run these commands, ignore errors related to Google utilities and incompatible packages.
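As a quick, optional sanity check (not part of the lab instructions), you can confirm that the Apache Beam SDK that install_packages.sh is expected to install actually imports:

# Optional sanity check, not part of the lab: confirm the Apache Beam SDK
# installed by install_packages.sh is importable and print its version.
import apache_beam as beam
print("Apache Beam version:", beam.__version__)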





Configure BigQuery and Dataflow for your project.



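The pipeline scripts in this lab write their results into a BigQuery dataset, so that dataset must exist in your project. Below is a minimal sketch of creating it from Python with the google.cloud.bigquery client; the dataset name flights is an assumption based on this lab, and your environment may already provide it (or you may prefer the bq command-line tool).

# Sketch: create the BigQuery dataset the pipeline will write into.
# The name "flights" is an assumption -- use whatever dataset the lab scripts expect.
from google.cloud import bigquery

client = bigquery.Client()                      # project taken from your gcloud credentials
dataset = bigquery.Dataset(f"{client.project}.flights")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists
print(f"Dataset {dataset.dataset_id} is ready")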


The Cloud Storage bucket 

The historical flights text data files are already in the bucket.
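You can confirm the input files are present before launching anything. Here is a small sketch using the google.cloud.storage client; the bucket name follows the lab's <project-id>-ml convention, and the flights prefix is an assumption, so adjust it to wherever the CSV files actually live.

# Sketch: list the historical flights files in the lab bucket.
# The bucket name follows the lab convention <project-id>-ml; the "flights"
# prefix is an assumption -- change it to match where the files really are.
from google.cloud import storage

client = storage.Client()
bucket_name = f"{client.project}-ml"
for blob in client.list_blobs(bucket_name, prefix="flights"):
    print(blob.name, blob.size)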






So far, you've been reading and writing files from a single local location.

Once you start to run your code in production, in a serverless environment, the concept of a local “location” no longer makes sense. You have to read from and write to Cloud Storage.

Also, because this is structured data, it is preferable to read from and write to BigQuery.
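In Beam, that pattern looks roughly like the sketch below. It is not the lab's df06.py; the bucket, table, and schema names are placeholders.

# Sketch of the Cloud Storage -> BigQuery pattern; all names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(temp_location="gs://YOUR-BUCKET/tmp")

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "read"  >> beam.io.ReadFromText("gs://YOUR-BUCKET/flights/raw/*.csv")
     | "parse" >> beam.Map(lambda line: {"raw_line": line})
     | "write" >> beam.io.WriteToBigQuery(
           "YOUR_PROJECT:flights.raw_lines",
           schema="raw_line:STRING",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))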

Now copy the airport geolocation file to your Cloud Storage bucket.

This file identifies the physical location of each airport, which is needed to convert the local time fields to universal time.
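The idea behind that conversion: each flight's departure and arrival times are local to the airport, so you use the airport's latitude/longitude to look up its time zone and shift the timestamp to UTC. A rough sketch of one way to do this is below, using timezonefinder and pytz; these libraries and the example values are illustrative assumptions, not necessarily what the lab scripts use.

# Sketch: convert an airport-local timestamp to UTC using the airport's
# latitude/longitude. timezonefinder and pytz are illustrative choices;
# the lab's own scripts may use different libraries or field names.
from datetime import datetime

import pytz
from timezonefinder import TimezoneFinder

tf = TimezoneFinder()

def as_utc(local_dt: datetime, lat: float, lng: float) -> datetime:
    tz_name = tf.timezone_at(lng=lng, lat=lat)     # e.g. "America/New_York"
    return pytz.timezone(tz_name).localize(local_dt).astimezone(pytz.utc)

# JFK is roughly at (40.64, -73.78); a 07:30 local winter departure is 12:30 UTC.
print(as_utc(datetime(2015, 1, 4, 7, 30), lat=40.64, lng=-73.78))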


Task 3. Process the data using Dataflow

student-01-2251dc43ebeb@lab-vm-ql:~/data-science-on-gcp/04_streaming/transform$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
export BUCKET=${PROJECT_ID}-ml
./stage_airports_file.sh $BUCKET
./df06.py --project $PROJECT_ID --bucket $BUCKET


Copying file://airports.csv.gz [Content-Type=text/csv]...
/ [1 files][ 43.7 KiB/ 43.7 KiB]                                                
Operation completed over 1 objects/43.7 KiB.                                     
/usr/lib/google-cloud-sdk/platform/bq/bq.py:42: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Waiting on bqjob_r5090e3b26112a6c5_00000189a05bf423_1 ... (1s) Current status: DONE
Correcting timestamps and writing to BigQuery dataset
/home/student-01-2251dc43ebeb/.local/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery.py:2664: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  temp_location = pcoll.pipeline.options.view_as(
/home/student-01-2251dc43ebeb/.local/lib/python3.9/site-packages/apache_beam/io/gcp/bigquery_file_loads.py:1169: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  self.project = self.project or p.options.view_as(GoogleCloudOptions).project


Monitor the Dataflow job and inspect the processed data
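The final script in the command history (./df07.py -p $PROJECT_ID -b $BUCKET -r us-west1) launches the pipeline on the Dataflow service, so you can follow its progress in the Cloud Console under Navigation menu > Dataflow. Once the job starts writing output, you can spot-check the destination table; the flights.simevents name below is an assumption based on this lab, so adjust it to whatever your run actually produced.

# Sketch: peek at a few rows of the table the pipeline wrote to BigQuery.
# "flights.simevents" is an assumed dataset.table name -- adjust as needed.
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `flights.simevents` LIMIT 5").result()
for row in rows:
    print(dict(row))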









