Pub/Sub is a message broker, similar - but with subtle differences - to other systems like Kafka, Apache ActiveMQ, or Azure Service Bus / Event Hubs.
It allows you to decouple and at the same time interconnect different services in the Google Cloud Platform (some services can publish information to a certain topic, while others can consume it).
But it does not provide persistent message storage on its own: by default, the message retention period is at most seven days.
The responsibility for storing those messages, if required, lies with the services consuming them.
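As a minimal sketch of what publishing and consuming looks like from the gcloud CLI (the topic deviceTopic, subscription deviceSub and the JSON payload are just hypothetical names for illustration):

# Create a topic and attach a subscription to it
gcloud pubsub topics create deviceTopic
gcloud pubsub subscriptions create deviceSub --topic deviceTopic

# Publish a sample message (any service with publish permissions can do this)
gcloud pubsub topics publish deviceTopic --message='{"deviceId":"sensor-1","tempC":21.5}'

# Pull and acknowledge the message, as a consumer would
gcloud pubsub subscriptions pull deviceSub --auto-ack --limit=1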
You can use any service that can consume messages from those Pub/Sub topics, but typically you will use a product like Dataflow, an advanced stream and batch data processing tool, which will allow you to process your device data and distribute it to different storage services if necessary.
If you are interested in analytics, the way to go should be either Bigtable or, better, BigQuery. There are several Dataflow templates available out of the box for storing your Pub/Sub information in BigQuery. Please consider reviewing this excellent related article as well.
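For instance, one of those templates streams messages from a topic straight into a BigQuery table. The sketch below assumes the Google-provided PubSub_to_BigQuery template; the project, topic, bucket and table names are placeholders you would replace with your own:

gcloud dataflow jobs run pubsub-to-bigquery \
    --gcs-location gs://dataflow-templates-us-central1/latest/PubSub_to_BigQuery \
    --region us-central1 \
    --staging-location gs://my-bucket/tmp/ \
    --parameters inputTopic=projects/my-project/topics/deviceTopic,outputTableSpec=my-project:iot_dataset.device_readings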
You can store your device information in other services as well, like Cloud SQL, a managed relational database service offering MySQL, PostgreSQL and SQL Server; Cloud Firestore (formerly Datastore), a flavor of NoSQL database; or even Cloud Storage, for storing your device information as plain, unstructured objects.
Although a bit dated now, please see the section "What Storage type?" in the extraordinary grumpygrace.dev flowcharts page for more information about the different storage options available.
Google Dataflow is a service for stream and batch processing at scale.
When you need to process large volumes of streaming data, such as clickstream data or data from IoT devices, Dataflow is the starting point for receiving it all.
The data can then be sent to storage (BigQuery, Bigtable, Cloud Storage) for further processing, for example machine learning. The walkthrough below covers:
- Creation of the Pub/Sub topic and subscription
- Creating the Dataflow pipeline
- Sending test temperature data to the Pub/Sub topic
- Viewing the processed data stored in Cloud Storage
1) CREATE THE PUB/SUB TOPIC AND SUBSCRIPTION:
ketan_patel@cloudshell:~ $ gcloud pubsub topics create tempSensorTopic
Created topic [projects/ketan-mvp/topics/tempSensorTopic].
ketan_patel@cloudshell:~ $ gcloud pubsub subscriptions create readTempSubscptn --topic tempSensorTopic
Created subscription [projects/ketan-mvp/subscriptions/readTempSubscptn].
ketan_patel@cloudshell:~ $
2) CREATE THE DATAFLOW JOB:
gcloud dataflow jobs run process-temperature \
    --gcs-location gs://dataflow-templates-us-central1/latest/Cloud_PubSub_to_GCS_Text \
    --region us-central1 \
    --staging-location gs://tempsensor1/tmp/ \
    --parameters inputTopic=projects/ketan-mvp/topics/tempSensorTopic,outputDirectory=gs://tempsensor1/data/,outputFilenamePrefix=df
RUN THE JOB.
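To complete steps 3 and 4 of the list above, you can publish a few test readings to the topic and then look at the files the Cloud_PubSub_to_GCS_Text template writes into the bucket. The message payloads below are illustrative, and the exact output filenames depend on the template's windowing, so the df* wildcard is an assumption:

3) SEND TEST TEMPERATURE DATA:
gcloud pubsub topics publish tempSensorTopic --message='{"deviceId":"sensor-1","tempC":21.5}'
gcloud pubsub topics publish tempSensorTopic --message='{"deviceId":"sensor-2","tempC":23.1}'

4) VIEW THE OUTPUT ON CLOUD STORAGE:
# The template writes windowed text files under the output directory, prefixed with "df"
gsutil ls gs://tempsensor1/data/
gsutil cat gs://tempsensor1/data/df*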