Stream Data from PubSub to BigQuery using Dataflow

 ketan_patel@cloudshell:~ (new-user-learning)$ gcloud config list project
[core]
project = new-user-learning

Your active configuration is: [cloudshell-1078]
ketan_patel@cloudshell:~ (new-user-learning)$ 


ketan_patel@cloudshell:~ (new-user-learning)$ gcloud services list --enabled | grep -i dataflow

ketan_patel@cloudshell:~ (new-user-learning)$ gcloud services list --available | grep -i dataflow
NAME: dataflow.googleapis.com
TITLE: Dataflow API

ketan_patel@cloudshell:~ (new-user-learning)$ gcloud services  enable dataflow.googleapis.com
Operation "operations/acf.p2-457904926486-ab0c6ab9-cb9f-4367-99a0-01052a03314a" finished successfully.

ketan_patel@cloudshell:~ (new-user-learning)$ gcloud services list --enabled | grep -i dataflow
NAME: dataflow.googleapis.com
TITLE: Dataflow API
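
This walkthrough also uses PubSub, BigQuery, and Cloud Storage. Depending on the project, some of these APIs may already be enabled; if any of them is missing, you can enable them the same way. A minimal sketch:

$ gcloud services enable \
    pubsub.googleapis.com \
    bigquery.googleapis.com \
    storage.googleapis.com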

Google Dataflow is a managed service for stream and batch processing at scale. When you need to process large volumes of streaming data, such as clickstream events or data from IoT devices, Dataflow is the starting point for receiving all of that stream data. The data can then be sent on to storage (BigQuery, Bigtable, GCS) for further processing (for example, ML):

CREATE BUCKET (Storage):

kp@cloudshell:~ $ gsutil mb gs://ketanlearningbucket0818

Creating gs://ketanlearningbucket0818/...
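
The Dataflow job later in this post runs in us-central1 and uses this bucket for its staging files. If you would rather create the bucket in that same region (gsutil mb defaults to the US multi-region), a sketch of the alternative creation command, plus a quick check that the bucket exists:

$ gsutil mb -l us-central1 gs://ketanlearningbucket0818
$ gsutil ls -b gs://ketanlearningbucket0818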


CREATE PUBSUB TOPIC:

kp@cloudshell:~ $ gcloud pubsub topics create ketantopic0818

Created topic [projects/new-user-learning/topics/ketantopic0818].
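
The PubSub-to-BigQuery Dataflow template reads from the topic through a subscription it manages itself, so nothing more is required here. If you want to inspect messages by hand while testing, you can create your own subscription on the side (ketansub0818 below is just an example name):

$ gcloud pubsub subscriptions create ketansub0818 --topic=ketantopic0818
# pull a few messages to see what is arriving on the topic
$ gcloud pubsub subscriptions pull ketansub0818 --auto-ack --limit=5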





CREATE DATASET IN BIGQUERY:

kp@cloudshell:~$ bq mk ketandataset0818

Dataset 'new-user-learning:ketandataset0818' successfully created.
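
By default bq mk places the dataset in the US multi-region (pass --location to choose another). You can confirm it was created and see its details with:

$ bq ls                       # lists datasets in the current project
$ bq show ketandataset0818    # shows the dataset's location and other details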

Create a new table in this dataset to receive the messages that Dataflow streams from PubSub into BigQuery.

Here, let’s take a simple example of a message in JSON (JavaScript Object Notation) format, given below.

"name" : "Ketan",
"language" : "Cloud Eng" 
}

kp@cloudshell:~ $ bq mk ketandataset0818.table01 name:STRING,language:STRING

Table 'new-user-learning:ketandataset0818.table01' successfully created
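
Before wiring anything up, it can be worth double-checking that the table schema matches the fields in the JSON message; a quick way to inspect it:

$ bq show --schema --format=prettyjson new-user-learning:ketandataset0818.table01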

CONNECT PUBSUB TO BIGQUERY USING DATAFLOW: 

CREATE DATAFLOW JOB:

USE GUI TO CREATE DATAFLOW:

INSTEAD OF THE GUI, YOU CAN USE THE FOLLOWING CLI COMMAND TO CREATE IT (BUT IT'S MORE COMPLICATED):

$ gcloud dataflow jobs run Ketandataflowjob0818 \
    --gcs-location gs://dataflow-templates-us-central1/latest/PubSub_to_BigQuery \
    --region us-central1 \
    --staging-location gs://ketanlearningbucket0818/temp \
    --parameters inputTopic=projects/new-user-learning/topics/ketantopic0818,outputTableSpec=new-user-learning:ketandataset0818.table01



RUN JOB:
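
Whether you create the job from the GUI or with the gcloud command above, you can confirm from Cloud Shell that it is actually running; a sketch (JOB_ID comes from the list output):

# list active Dataflow jobs in the region and note the JOB_ID
$ gcloud dataflow jobs list --region=us-central1 --status=active
# show details for that job
$ gcloud dataflow jobs show JOB_ID --region=us-central1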

Now, go to PubSub and click on the “+ PUBLISH MESSAGE” button.

Copy and paste the following JSON message, or provide your own message based on the fields you added to the table.

"name" : "Ketan",
"language" : "Cloud Eng" 
}

Click the “PUBLISH” button. PubSub delivers the message to the Dataflow job, which writes it into the BigQuery table (using the Cloud Storage bucket for its temporary staging files).
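
If you prefer the command line to the console button, the same message can also be published from Cloud Shell; an equivalent sketch:

$ gcloud pubsub topics publish ketantopic0818 \
    --message='{"name": "Ketan", "language": "Cloud Eng"}'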

PUBLISH 100 MESSAGES:


"name" : "Patel",
"language" : "GCP Cloud Eng" 
}
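
From the CLI, publishing this message 100 times is just a loop around the same publish command; a sketch (each iteration is one separate publish call):

for i in $(seq 1 100); do
  gcloud pubsub topics publish ketantopic0818 \
    --message='{"name": "Patel", "language": "GCP Cloud Eng"}'
done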

You can now check the data in the table in Google BigQuery.
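
A quick way to do that from Cloud Shell (rows arriving through the streaming pipeline can take a short while to show up, so re-run if the count is 0 at first):

# count the rows written by the Dataflow job
$ bq query --use_legacy_sql=false \
    'SELECT COUNT(*) AS row_count FROM `new-user-learning.ketandataset0818.table01`'
# look at a few rows
$ bq head -n 10 new-user-learning:ketandataset0818.table01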

