Ingesting Data Into The Cloud

Use a bash script to download selected data from a large public data set available on the internet. This data, made available on the US Bureau of Transportation Statistics (BTS) website, provides historical information about domestic flights in the United States.

The techniques used to ingest this data from the website into the cloud can be applied to other data sets that provide comprehensive real-world data but must be parsed and cleaned before they are useful.

Objectives
  • Retrieve initial data from the BTS website
  • Store the data in Cloud Storage
  • Load data into Google BigQuery














$ history
    1  gcloud auth list
    2  gcloud config list project
    3  git clone    https://github.com/GoogleCloudPlatform/data-science-on-gcp/
    4  cd data-science-on-gcp/
    5  ls -l
    6  mkdir data
    7  cd data
    8  curl https://www.bts.dot.gov/sites/bts.dot.gov/files/docs/legacy/additional-attachment-files/ONTIME.TD.201501.REL02.04APR2015.zip --output data.zip
    9  ls -l
   10  pwd
   11  unzip data.zip 
   12  ls -l
   13  head ontime.td.201501.asc 
   14  cat ../02_ingest/ingest_from_crsbucket.sh
   15  ls -l ../
   16  ls -l ../02_ingest/
   17  export PROJECT_ID=$(gcloud info --format='value(config.project)')
   18  gsutil mb -l us-central1 gs://${PROJECT_ID}-ml
   19  bash ../02_ingest/ingest_from_crsbucket.sh ${PROJECT_ID}-ml
   20  pwd
   21  cat ../02_ingest/bqload.sh
   22  bash ../02_ingest/bqload.sh ${PROJECT_ID}-ml 2015
   23  pwd
   24  ls -l


Clone the Data Science on Google Cloud repository


student_04_5c9933322c08@cloudshell:~ (qwiklabs-gcp-03-7459aec741fa)$ git clone \
   https://github.com/GoogleCloudPlatform/data-science-on-gcp/

Cloning into 'data-science-on-gcp'...
remote: Enumerating objects: 3462, done.
remote: Counting objects: 100% (362/362), done.
remote: Compressing objects: 100% (128/128), done.
remote: Total 3462 (delta 251), reused 240 (delta 234), pack-reused 3100
Receiving objects: 100% (3462/3462), 6.68 MiB | 19.71 MiB/s, done.
Resolving deltas: 100% (1857/1857), done.
student_04_5c9933322c08@cloudshell:~ (qwiklabs-gcp-03-7459aec741fa)$ cd data-science-on-gcp/
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 148
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 02_ingest
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 03_sqlstudio
drwxr-xr-x 6 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 04_streaming
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 05_bqnotebook
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 06_dataproc
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 07_sparkml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 08_bqml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 09_vertexai
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 10_mlops
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 11_realtime
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 12_fulldataset
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08   545 Jul 29 19:31 COPYRIGHT
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 81972 Jul 29 19:31 cover_edition2.jpg
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 11357 Jul 29 19:31 LICENSE
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08  2107 Jul 29 19:31 README.md
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp (qwiklabs-gcp-03-7459aec741fa)$ mkdir data
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp (qwiklabs-gcp-03-7459aec741fa)$ cd data


Task 2. Retrieve data from a website

Fetch a sample data file using curl
You will use curl to fetch the monthly data files that contain the raw data used to build your complete data set, known as the On-Time Performance data. You can download a pre-packaged data file for each month of a given year from the Bureau of Transportation Statistics website.




student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ curl https://www.bts.dot.gov/sites/bts.dot.gov/files/docs/legacy/additional-attachment-files/ONTIME.TD.201501.REL02.04APR2015.zip --output data.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14.5M  100 14.5M    0     0  23.1M      0 --:--:-- --:--:-- --:--:-- 23.1M


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ 


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 14940
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 15297207 Jul 29 19:32 data.zip


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ pwd
/home/student_04_5c9933322c08/data-science-on-gcp/data


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ unzip data.zip 
Archive:  data.zip
  inflating: ontime.td.201501.asc    

student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 93260
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 15297207 Jul 29 19:32 data.zip
-rw-rw-r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 80196338 Apr  3  2015 ontime.td.201501.asc

student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ head ontime.td.201501.asc 

AA|1|JFK|LAX|20150101|4|900|900|855|1230|1230|1237|0|0|390|402|-5|7|12|912|1230|N787AA|17|7|378||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150102|5|900|900|850|1230|1230|1211|0|0|390|381|-10|-19|-9|905|1202|N795AA|15|9|357||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150103|6|900|900|853|1230|1230|1151|0|0|390|358|-7|-39|-32|908|1138|N788AA|15|13|330||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150104|7|900|900|853|1230|1230|1218|0|0|390|385|-7|-12|-5|907|1159|N791AA|14|19|352||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150105|1|900|900|853|1230|1230|1222|0|0|390|389|-7|-8|-1|920|1158|N783AA|27|24|338||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150106|2|900|900|856|1235|1235|1300|0|0|395|424|-4|25|29|1021|1256|N799AA|85|4|335||0|0|25|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150107|3|900|900|859|1235|1235|1221|0|0|395|382|-1|-14|-13|928|1209|N784AA|29|12|341||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150108|4|900|900|856|1235|1235|1158|0|0|395|362|-4|-37|-33|922|1155|N787AA|26|3|333||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150109|5|900|900|901|1235|1235|1241|0|0|395|400|1|6|5|944|1237|N795AA|43|4|353||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150110|6|900|900|903|1235|1235|1235|0|0|395|392|3|0|-3|940|1225|N790AA|37|10|345||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ 
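As the `head` output shows, the extracted `.asc` file is pipe-delimited rather than comma-delimited. A quick sanity check on files like this is to count the delimited fields with awk; the sketch below uses a truncated copy of the first record for illustration (the real records have many more fields):

```shell
# Count '|'-separated fields with awk. LINE is a shortened copy of the
# first record above; the actual records contain many more fields.
LINE='AA|1|JFK|LAX|20150101|4|900|900|855'
echo "$LINE" | awk -F'|' '{print NF}'   # → 9
```

Running the same one-liner against `head -1 ontime.td.201501.asc` tells you whether every record carries the full column count before you invest in writing a schema.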





Download custom data from a storage bucket



Snapshots of the custom BTS data have been organized and saved in the data-science-on-gcp public storage bucket. A script provided in the repo downloads them into your own bucket.


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ cat ../02_ingest/ingest_from_crsbucket.sh
#!/bin/bash

if [ "$#" -ne 1 ]; then
    echo "Usage: ./ingest_from_crsbucket.sh  destination-bucket-name"
    exit
fi

BUCKET=$1
FROM=gs://data-science-on-gcp/edition2/flights/raw
TO=gs://$BUCKET/flights/raw

CMD="gsutil -m cp "
for MONTH in `seq -w 1 12`; do
  CMD="$CMD ${FROM}/2015${MONTH}.csv"
done
CMD="$CMD ${FROM}/201601.csv $TO"

echo $CMD
$CMD
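One detail worth noting in the script above: `seq -w` equalizes widths by zero-padding, which is what makes the generated month numbers line up with file names like `201501.csv`. A standalone sketch:

```shell
# seq -w zero-pads to a constant width, so months come out as 01..12,
# matching the YYYYMM naming of the CSV files in the source bucket.
for MONTH in $(seq -w 1 12); do
  echo "2015${MONTH}.csv"
done
# prints 201501.csv through 201512.csv
```

Without `-w`, the loop would produce `20151.csv` for January and the object names would not match.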


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l ../
total 152
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 02_ingest
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 03_sqlstudio
drwxr-xr-x 6 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 04_streaming
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 05_bqnotebook
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 06_dataproc
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 07_sparkml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 08_bqml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 09_vertexai
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 10_mlops
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 11_realtime
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:31 12_fulldataset
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08   545 Jul 29 19:31 COPYRIGHT
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 81972 Jul 29 19:31 cover_edition2.jpg
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08  4096 Jul 29 19:32 data
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 11357 Jul 29 19:31 LICENSE
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08  2107 Jul 29 19:31 README.md
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l ../02_ingest/
total 32
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 2932 Jul 29 19:31 bqload.sh
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08  782 Jul 29 19:31 download.sh
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08  354 Jul 29 19:31 ingest_from_crsbucket.sh
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08  796 Jul 29 19:31 ingest.sh
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 monthlyupdate
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08  319 Jul 29 19:31 raw_download.sh
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 1967 Jul 29 19:31 README.md
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08  321 Jul 29 19:31 upload.sh


To run the script, create a single-region bucket:

student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
gsutil mb -l us-central1 gs://${PROJECT_ID}-ml
Creating gs://qwiklabs-gcp-03-7459aec741fa-ml/...



student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ 


Run the download script, using your bucket name as the argument:



student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ bash ../02_ingest/ingest_from_crsbucket.sh ${PROJECT_ID}-ml

gsutil -m cp gs://data-science-on-gcp/edition2/flights/raw/201501.csv gs://data-science-on-gcp/edition2/flights/raw/201502.csv gs://data-science-on-gcp/edition2/flights/raw/201503.csv gs://data-science-on-gcp/edition2/flights/raw/201504.csv gs://data-science-on-gcp/edition2/flights/raw/201505.csv gs://data-science-on-gcp/edition2/flights/raw/201506.csv gs://data-science-on-gcp/edition2/flights/raw/201507.csv gs://data-science-on-gcp/edition2/flights/raw/201508.csv gs://data-science-on-gcp/edition2/flights/raw/201509.csv gs://data-science-on-gcp/edition2/flights/raw/201510.csv gs://data-science-on-gcp/edition2/flights/raw/201511.csv gs://data-science-on-gcp/edition2/flights/raw/201512.csv gs://data-science-on-gcp/edition2/flights/raw/201601.csv gs://qwiklabs-gcp-03-7459aec741fa-ml/flights/raw
Copying gs://data-science-on-gcp/edition2/flights/raw/201501.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201502.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201503.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201504.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201505.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201506.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201507.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201508.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201509.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201510.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201511.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201512.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201601.csv [Content-Type=text/csv]...
- [13/13 files][  2.5 GiB/  2.5 GiB] 100% Done                                  
Operation completed over 13 objects/2.5 GiB.                                     



Task 3. Load data into Google BigQuery

For larger files, it's better to use gsutil to ingest the files into Cloud Storage because gsutil takes advantage of multithreaded, resumable uploads and is better suited to the public internet.

This is what you did in the previous section when you used gsutil to copy the extracted flights CSV files to Cloud Storage.
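The ingest script also illustrates a handy echo-then-execute pattern: it assembles the full `gsutil -m cp` command as a string and prints it before running it, which makes dry runs easy. A minimal sketch of that pattern (the bucket name here is hypothetical, and the actual copy is left commented out):

```shell
# Assemble the multithreaded copy command, print it for inspection,
# then execute it (execution is commented out in this sketch).
BUCKET=example-bucket   # hypothetical destination bucket
CMD="gsutil -m cp *.csv gs://${BUCKET}/flights/raw/"
echo "$CMD"
# $CMD   # uncomment to actually run the copy
```

Printing the command first lets you verify the expansion of variables and wildcards before touching any cloud resources, exactly as the script's `echo $CMD` line does.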

Return to Cloud Shell to load the CSV files into BigQuery.

In Cloud Shell, examine the bqload.sh script:


This script loads the data from the Cloud Storage bucket to BigQuery.


student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ cat ../02_ingest/bqload.sh

#!/bin/bash

if [ "$#" -ne 2 ]; then
    echo "Usage: ./bqload.sh  csv-bucket-name YEAR"
    exit
fi

BUCKET=$1
YEAR=$2

SCHEMA=Year:STRING,Quarter:STRING,Month:STRING,DayofMonth:STRING,DayOfWeek:STRING,FlightDate:DATE,Reporting_Airline:STRING,DOT_ID_Reporting_Airline:STRING,IATA_CODE_Reporting_Airline:STRING,Tail_Number:STRING,Flight_Number_Reporting_Airline:STRING,OriginAirportID:STRING,OriginAirportSeqID:STRING,OriginCityMarketID:STRING,Origin:STRING,OriginCityName:STRING,OriginState:STRING,OriginStateFips:STRING,OriginStateName:STRING,OriginWac:STRING,DestAirportID:STRING,DestAirportSeqID:STRING,DestCityMarketID:STRING,Dest:STRING,DestCityName:STRING,DestState:STRING,DestStateFips:STRING,DestStateName:STRING,DestWac:STRING,CRSDepTime:STRING,DepTime:STRING,DepDelay:STRING,DepDelayMinutes:STRING,DepDel15:STRING,DepartureDelayGroups:STRING,DepTimeBlk:STRING,TaxiOut:STRING,WheelsOff:STRING,WheelsOn:STRING,TaxiIn:STRING,CRSArrTime:STRING,ArrTime:STRING,ArrDelay:STRING,ArrDelayMinutes:STRING,ArrDel15:STRING,ArrivalDelayGroups:STRING,ArrTimeBlk:STRING,Cancelled:STRING,CancellationCode:STRING,Diverted:STRING,CRSElapsedTime:STRING,ActualElapsedTime:STRING,AirTime:STRING,Flights:STRING,Distance:STRING,DistanceGroup:STRING,CarrierDelay:STRING,WeatherDelay:STRING,NASDelay:STRING,SecurityDelay:STRING,LateAircraftDelay:STRING,FirstDepTime:STRING,TotalAddGTime:STRING,LongestAddGTime:STRING,DivAirportLandings:STRING,DivReachedDest:STRING,DivActualElapsedTime:STRING,DivArrDelay:STRING,DivDistance:STRING,Div1Airport:STRING,Div1AirportID:STRING,Div1AirportSeqID:STRING,Div1WheelsOn:STRING,Div1TotalGTime:STRING,Div1LongestGTime:STRING,Div1WheelsOff:STRING,Div1TailNum:STRING,Div2Airport:STRING,Div2AirportID:STRING,Div2AirportSeqID:STRING,Div2WheelsOn:STRING,Div2TotalGTime:STRING,Div2LongestGTime:STRING,Div2WheelsOff:STRING,Div2TailNum:STRING,Div3Airport:STRING,Div3AirportID:STRING,Div3AirportSeqID:STRING,Div3WheelsOn:STRING,Div3TotalGTime:STRING,Div3LongestGTime:STRING,Div3WheelsOff:STRING,Div3TailNum:STRING,Div4Airport:STRING,Div4AirportID:STRING,Div4AirportSeqID:STRING,Div4WheelsOn:STRING,Div4TotalGTime:STRING,Div4LongestGTime:STRING,Div4WheelsOff:STRING,Div4TailNum:STRING,Div5Airport:STRING,Div5AirportID:STRING,Div5AirportSeqID:STRING,Div5WheelsOn:STRING,Div5TotalGTime:STRING,Div5LongestGTime:STRING,Div5WheelsOff:STRING,Div5TailNum:STRING

# create dataset if not exists
PROJECT=$(gcloud config get-value project)
#bq --project_id $PROJECT rm -f ${PROJECT}:dsongcp.flights_raw
bq --project_id $PROJECT show dsongcp || bq mk --sync dsongcp

for MONTH in `seq -w 1 12`; do

CSVFILE=gs://${BUCKET}/flights/raw/${YEAR}${MONTH}.csv
bq --project_id $PROJECT  --sync \
   load --time_partitioning_field=FlightDate --time_partitioning_type=MONTH \
   --source_format=CSV --ignore_unknown_values --skip_leading_rows=1 --schema=$SCHEMA \
   --replace ${PROJECT}:dsongcp.flights_raw\$${YEAR}${MONTH} $CSVFILE

done
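One detail in `bqload.sh` deserves a closer look: the `\$${YEAR}${MONTH}` suffix is a BigQuery partition decorator, so each `bq load --replace` overwrites only that month's partition of `flights_raw`, making the loop safe to re-run. A pure-shell sketch of how the decorator string expands (no cloud calls):

```shell
# Expand the partition decorator for each month; combined with --replace,
# re-running a load rewrites only the targeted monthly partition.
YEAR=2015
for MONTH in $(seq -w 1 12); do
  TABLE="dsongcp.flights_raw\$${YEAR}${MONTH}"
  echo "$TABLE"
done
# prints dsongcp.flights_raw$201501 ... dsongcp.flights_raw$201512
```

The backslash before `$` keeps the shell from treating `$2015` as a variable reference, so the literal `$` reaches bq as part of the table spec.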


Run the script, using your bucket name and the year as arguments:

student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ bash ../02_ingest/bqload.sh ${PROJECT_ID}-ml 2015

Your active configuration is: [cloudshell-19651]
BigQuery error in show operation: Not found: Dataset qwiklabs-gcp-03-7459aec741fa:dsongcp
Dataset 'qwiklabs-gcp-03-7459aec741fa:dsongcp' successfully created.
Waiting on bqjob_r59cc60adf08d2a61_00000189a3290634_1 ... (15s) Current status: DONE   
Waiting on bqjob_r6d28d5e84ea17b26_00000189a3294f4f_1 ... (15s) Current status: DONE   
Waiting on bqjob_r78f3117d49aee045_00000189a329981f_1 ... (23s) Current status: DONE   
Waiting on bqjob_r5f0600e3d9db58b9_00000189a329fffb_1 ... (23s) Current status: DONE   
Waiting on bqjob_r5ce4ab23901948f3_00000189a32a689f_1 ... (15s) Current status: DONE   
Waiting on bqjob_r186860fc9484e7dc_00000189a32ab02c_1 ... (23s) Current status: DONE   
Waiting on bqjob_r5ac1e696e008660c_00000189a32b181d_1 ... (15s) Current status: DONE   
Waiting on bqjob_r1c7deb8ca210d16a_00000189a32b60d2_1 ... (23s) Current status: DONE   
Waiting on bqjob_r6801039bf7ecdb8_00000189a32bc7b1_1 ... (23s) Current status: DONE   
Waiting on bqjob_r10d59c89cf43ece1_00000189a32c2e9b_1 ... (23s) Current status: DONE   
Waiting on bqjob_r2e636e0fb59732f0_00000189a32c9579_1 ... (15s) Current status: DONE   
Waiting on bqjob_r27e364f2740b7991_00000189a32cdd77_1 ... (23s) Current status: DONE   
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ pwd
/home/student_04_5c9933322c08/data-science-on-gcp/data
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 93260
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 15297207 Jul 29 19:32 data.zip
-rw-rw-r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 80196338 Apr  3  2015 ontime.td.201501.asc
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ 


At this point, the CSV files are in Cloud Storage and the raw data has been loaded into BigQuery.




Summary
  • Fetch raw data from a website in CSV format and perform some basic text actions to tidy it up.
  • Copy the data to a Cloud Storage bucket.
  • Load the data from the Cloud Storage bucket into BigQuery, where it can be reused easily.
