Use a bash script to download selected data from a large public data set available on the internet. This data, made available on the US Bureau of Transportation Statistics (BTS) website, provides historical information about domestic flights in the United States.
The techniques used to ingest this data from the website into the cloud can be applied to other data sets that provide comprehensive real-world data but must be parsed and cleaned before they are useful.
Objectives
- Retrieve initial data from the BTS website
- Store the data in Cloud Storage
- Load data into Google BigQuery
$ history
1 gcloud auth list
2 gcloud config list project
3 git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp/
4 cd data-science-on-gcp/
5 ls -l
6 mkdir data
7 cd data
8 curl https://www.bts.dot.gov/sites/bts.dot.gov/files/docs/legacy/additional-attachment-files/ONTIME.TD.201501.REL02.04APR2015.zip --output data.zip
9 ls -l
10 pwd
11 unzip data.zip
12 ls -l
13 head ontime.td.201501.asc
14 cat ../02_ingest/ingest_from_crsbucket.sh
15 ls -l ../
16 ls -l ../02_ingest/
17 export PROJECT_ID=$(gcloud info --format='value(config.project)')
18 gsutil mb -l us-central1 gs://${PROJECT_ID}-ml
19 bash ../02_ingest/ingest_from_crsbucket.sh ${PROJECT_ID}-ml
20 pwd
21 cat ../02_ingest/bqload.sh
22 bash ../02_ingest/bqload.sh ${PROJECT_ID}-ml 2015
23 pwd
24 ls -l
$
Clone the Data Science on Google Cloud repository
student_04_5c9933322c08@cloudshell:~ (qwiklabs-gcp-03-7459aec741fa)$ git clone \
https://github.com/GoogleCloudPlatform/data-science-on-gcp/
Cloning into 'data-science-on-gcp'...
remote: Enumerating objects: 3462, done.
remote: Counting objects: 100% (362/362), done.
remote: Compressing objects: 100% (128/128), done.
remote: Total 3462 (delta 251), reused 240 (delta 234), pack-reused 3100
Receiving objects: 100% (3462/3462), 6.68 MiB | 19.71 MiB/s, done.
Resolving deltas: 100% (1857/1857), done.
student_04_5c9933322c08@cloudshell:~ (qwiklabs-gcp-03-7459aec741fa)$ cd data-science-on-gcp/
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 148
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 02_ingest
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 03_sqlstudio
drwxr-xr-x 6 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 04_streaming
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 05_bqnotebook
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 06_dataproc
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 07_sparkml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 08_bqml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 09_vertexai
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 10_mlops
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 11_realtime
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 12_fulldataset
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 545 Jul 29 19:31 COPYRIGHT
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 81972 Jul 29 19:31 cover_edition2.jpg
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 11357 Jul 29 19:31 LICENSE
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 2107 Jul 29 19:31 README.md
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp (qwiklabs-gcp-03-7459aec741fa)$ mkdir data
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp (qwiklabs-gcp-03-7459aec741fa)$ cd data
Task 2. Retrieve data from a website
Fetch a sample data file using curl
You will use curl to fetch the monthly CSV files that contain the raw data used to build your complete data set. This data set, known as the On-Time Performance data, can be downloaded as a pre-configured file for each month of a given year from the Bureau of Transportation Statistics.
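The lab fetches a single month below. If you later need several months, the same curl call can be wrapped in a loop; note that these legacy BTS filenames embed a release tag and date (here REL02.04APR2015) that differ from file to file, so this sketch assumes you have looked up each exact archive name on the BTS site first:

#!/bin/bash
# Sketch only: download and extract a list of known BTS archive names.
# The REL/date suffix varies per month; verify each filename on the BTS site.
BASE=https://www.bts.dot.gov/sites/bts.dot.gov/files/docs/legacy/additional-attachment-files
for F in ONTIME.TD.201501.REL02.04APR2015.zip; do   # append further filenames here
  curl -f "${BASE}/${F}" --output "${F}" && unzip -o "${F}"
done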
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ curl https://www.bts.dot.gov/sites/bts.dot.gov/files/docs/legacy/additional-attachment-files/ONTIME.TD.201501.REL02.04APR2015.zip --output data.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 14.5M 100 14.5M 0 0 23.1M 0 --:--:-- --:--:-- --:--:-- 23.1M
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 14940
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 15297207 Jul 29 19:32 data.zip
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ pwd
/home/student_04_5c9933322c08/data-science-on-gcp/data
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ unzip data.zip
Archive: data.zip
inflating: ontime.td.201501.asc
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 93260
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 15297207 Jul 29 19:32 data.zip
-rw-rw-r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 80196338 Apr 3 2015 ontime.td.201501.asc
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ head ontime.td.201501.asc
AA|1|JFK|LAX|20150101|4|900|900|855|1230|1230|1237|0|0|390|402|-5|7|12|912|1230|N787AA|17|7|378||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150102|5|900|900|850|1230|1230|1211|0|0|390|381|-10|-19|-9|905|1202|N795AA|15|9|357||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150103|6|900|900|853|1230|1230|1151|0|0|390|358|-7|-39|-32|908|1138|N788AA|15|13|330||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150104|7|900|900|853|1230|1230|1218|0|0|390|385|-7|-12|-5|907|1159|N791AA|14|19|352||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150105|1|900|900|853|1230|1230|1222|0|0|390|389|-7|-8|-1|920|1158|N783AA|27|24|338||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150106|2|900|900|856|1235|1235|1300|0|0|395|424|-4|25|29|1021|1256|N799AA|85|4|335||0|0|25|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150107|3|900|900|859|1235|1235|1221|0|0|395|382|-1|-14|-13|928|1209|N784AA|29|12|341||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150108|4|900|900|856|1235|1235|1158|0|0|395|362|-4|-37|-33|922|1155|N787AA|26|3|333||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150109|5|900|900|901|1235|1235|1241|0|0|395|400|1|6|5|944|1237|N795AA|43|4|353||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
AA|1|JFK|LAX|20150110|6|900|900|903|1235|1235|1235|0|0|395|392|3|0|-3|940|1225|N790AA|37|10|345||0|0|0|0|0|0|0|0|0||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|||0|0|0|0|
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$
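The file is pipe-delimited and has no header row. A quick way to sanity-check its structure, using standard tools (not part of the lab transcript):

# Count the pipe-separated fields in the first record
awk -F'|' '{print NF; exit}' ontime.td.201501.asc

# Count the records for the month
wc -l ontime.td.201501.asc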
Download custom data from a storage bucket
Snapshots of the custom BTS data have been organized and saved in the data-science-on-gcp public storage bucket. A script is provided in the repo to download them.
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ cat ../02_ingest/ingest_from_crsbucket.sh
#!/bin/bash
if [ "$#" -ne 1 ]; then
echo "Usage: ./ingest_from_crsbucket.sh destination-bucket-name"
exit
fi
BUCKET=$1
FROM=gs://data-science-on-gcp/edition2/flights/raw
TO=gs://$BUCKET/flights/raw
CMD="gsutil -m cp "
for MONTH in `seq -w 1 12`; do
CMD="$CMD ${FROM}/2015${MONTH}.csv"
done
CMD="$CMD ${FROM}/201601.csv $TO"
echo $CMD
$CMD
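The script builds a single long gsutil command by looping over zero-padded month numbers; seq -w pads its output to a constant width, which matches the YYYYMM naming of the CSV files. For example:

# seq -w zero-pads to the width of the largest value:
$ seq -w 1 12 | head -3
01
02
03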
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l ../
total 152
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 02_ingest
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 03_sqlstudio
drwxr-xr-x 6 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 04_streaming
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 05_bqnotebook
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 06_dataproc
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 07_sparkml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 08_bqml
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 09_vertexai
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 10_mlops
drwxr-xr-x 3 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 11_realtime
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 12_fulldataset
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 545 Jul 29 19:31 COPYRIGHT
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 81972 Jul 29 19:31 cover_edition2.jpg
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:32 data
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 11357 Jul 29 19:31 LICENSE
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 2107 Jul 29 19:31 README.md
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l ../02_ingest/
total 32
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 2932 Jul 29 19:31 bqload.sh
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 782 Jul 29 19:31 download.sh
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 354 Jul 29 19:31 ingest_from_crsbucket.sh
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 796 Jul 29 19:31 ingest.sh
drwxr-xr-x 2 student_04_5c9933322c08 student_04_5c9933322c08 4096 Jul 29 19:31 monthlyupdate
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 319 Jul 29 19:31 raw_download.sh
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 1967 Jul 29 19:31 README.md
-rwxr-xr-x 1 student_04_5c9933322c08 student_04_5c9933322c08 321 Jul 29 19:31 upload.sh
To run the script, create a single-region bucket:
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ export PROJECT_ID=$(gcloud info --format='value(config.project)')
gsutil mb -l us-central1 gs://${PROJECT_ID}-ml
Creating gs://qwiklabs-gcp-03-7459aec741fa-ml/...
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$
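Bucket names are globally unique, so prefixing with the project ID avoids collisions. Optionally, confirm the bucket exists before running the copy (an extra check, not part of the lab):

gsutil ls -b gs://${PROJECT_ID}-ml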
Run the download script using your bucket name as the argument:
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ bash ../02_ingest/ingest_from_crsbucket.sh ${PROJECT_ID}-ml
gsutil -m cp gs://data-science-on-gcp/edition2/flights/raw/201501.csv gs://data-science-on-gcp/edition2/flights/raw/201502.csv gs://data-science-on-gcp/edition2/flights/raw/201503.csv gs://data-science-on-gcp/edition2/flights/raw/201504.csv gs://data-science-on-gcp/edition2/flights/raw/201505.csv gs://data-science-on-gcp/edition2/flights/raw/201506.csv gs://data-science-on-gcp/edition2/flights/raw/201507.csv gs://data-science-on-gcp/edition2/flights/raw/201508.csv gs://data-science-on-gcp/edition2/flights/raw/201509.csv gs://data-science-on-gcp/edition2/flights/raw/201510.csv gs://data-science-on-gcp/edition2/flights/raw/201511.csv gs://data-science-on-gcp/edition2/flights/raw/201512.csv gs://data-science-on-gcp/edition2/flights/raw/201601.csv gs://qwiklabs-gcp-03-7459aec741fa-ml/flights/raw
Copying gs://data-science-on-gcp/edition2/flights/raw/201501.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201502.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201503.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201504.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201505.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201506.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201507.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201508.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201509.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201510.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201511.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201512.csv [Content-Type=text/csv]...
Copying gs://data-science-on-gcp/edition2/flights/raw/201601.csv [Content-Type=text/csv]...
- [13/13 files][ 2.5 GiB/ 2.5 GiB] 100% Done
Operation completed over 13 objects/2.5 GiB.
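Optionally, list the destination to confirm all 13 files arrived (a quick check, not shown in the lab transcript):

gsutil ls -l gs://${PROJECT_ID}-ml/flights/raw/
# the listing ends with a TOTAL line summarizing object count and size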
Task 3. Load data into Google BigQuery
For larger files, it's better to use gsutil to ingest the files into Cloud Storage because gsutil takes advantage of multithreaded, resumable uploads and is better suited to the public internet.
This is what you did in the previous section when you used gsutil to copy the extracted flights CSV files to Cloud Storage.
Return to Cloud Shell to load the CSV files into BigQuery.
In Cloud Shell, examine the bqload.sh script:
This script loads the data from the Cloud Storage bucket to BigQuery.
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ cat ../02_ingest/bqload.sh
#!/bin/bash
if [ "$#" -ne 2 ]; then
echo "Usage: ./bqload.sh csv-bucket-name YEAR"
exit
fi
BUCKET=$1
YEAR=$2
SCHEMA=Year:STRING,Quarter:STRING,Month:STRING,DayofMonth:STRING,DayOfWeek:STRING,FlightDate:DATE,Reporting_Airline:STRING,DOT_ID_Reporting_Airline:STRING,IATA_CODE_Reporting_Airline:STRING,Tail_Number:STRING,Flight_Number_Reporting_Airline:STRING,OriginAirportID:STRING,OriginAirportSeqID:STRING,OriginCityMarketID:STRING,Origin:STRING,OriginCityName:STRING,OriginState:STRING,OriginStateFips:STRING,OriginStateName:STRING,OriginWac:STRING,DestAirportID:STRING,DestAirportSeqID:STRING,DestCityMarketID:STRING,Dest:STRING,DestCityName:STRING,DestState:STRING,DestStateFips:STRING,DestStateName:STRING,DestWac:STRING,CRSDepTime:STRING,DepTime:STRING,DepDelay:STRING,DepDelayMinutes:STRING,DepDel15:STRING,DepartureDelayGroups:STRING,DepTimeBlk:STRING,TaxiOut:STRING,WheelsOff:STRING,WheelsOn:STRING,TaxiIn:STRING,CRSArrTime:STRING,ArrTime:STRING,ArrDelay:STRING,ArrDelayMinutes:STRING,ArrDel15:STRING,ArrivalDelayGroups:STRING,ArrTimeBlk:STRING,Cancelled:STRING,CancellationCode:STRING,Diverted:STRING,CRSElapsedTime:STRING,ActualElapsedTime:STRING,AirTime:STRING,Flights:STRING,Distance:STRING,DistanceGroup:STRING,CarrierDelay:STRING,WeatherDelay:STRING,NASDelay:STRING,SecurityDelay:STRING,LateAircraftDelay:STRING,FirstDepTime:STRING,TotalAddGTime:STRING,LongestAddGTime:STRING,DivAirportLandings:STRING,DivReachedDest:STRING,DivActualElapsedTime:STRING,DivArrDelay:STRING,DivDistance:STRING,Div1Airport:STRING,Div1AirportID:STRING,Div1AirportSeqID:STRING,Div1WheelsOn:STRING,Div1TotalGTime:STRING,Div1LongestGTime:STRING,Div1WheelsOff:STRING,Div1TailNum:STRING,Div2Airport:STRING,Div2AirportID:STRING,Div2AirportSeqID:STRING,Div2WheelsOn:STRING,Div2TotalGTime:STRING,Div2LongestGTime:STRING,Div2WheelsOff:STRING,Div2TailNum:STRING,Div3Airport:STRING,Div3AirportID:STRING,Div3AirportSeqID:STRING,Div3WheelsOn:STRING,Div3TotalGTime:STRING,Div3LongestGTime:STRING,Div3WheelsOff:STRING,Div3TailNum:STRING,Div4Airport:STRING,Div4AirportID:STRING,Div4AirportSeqID:STRING,Div4WheelsOn:STRING,Div4TotalGTime:STRING,Div4LongestGTime:STRING,Div4WheelsOff:STRING,Div4TailNum:STRING,Div5Airport:STRING,Div5AirportID:STRING,Div5AirportSeqID:STRING,Div5WheelsOn:STRING,Div5TotalGTime:STRING,Div5LongestGTime:STRING,Div5WheelsOff:STRING,Div5TailNum:STRING
# create dataset if not exists
PROJECT=$(gcloud config get-value project)
#bq --project_id $PROJECT rm -f ${PROJECT}:dsongcp.flights_raw
bq --project_id $PROJECT show dsongcp || bq mk --sync dsongcp
for MONTH in `seq -w 1 12`; do
    CSVFILE=gs://${BUCKET}/flights/raw/${YEAR}${MONTH}.csv
    bq --project_id $PROJECT --sync \
        load --time_partitioning_field=FlightDate --time_partitioning_type=MONTH \
        --source_format=CSV --ignore_unknown_values --skip_leading_rows=1 --schema=$SCHEMA \
        --replace ${PROJECT}:dsongcp.flights_raw\$${YEAR}${MONTH} $CSVFILE
done
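Each pass through the loop loads one month's CSV into a monthly partition of the flights_raw table: the \$${YEAR}${MONTH} suffix is BigQuery's partition decorator, and --replace makes a rerun for that month idempotent. For January 2015, the command expands to roughly this (schema elided):

bq --project_id $PROJECT --sync \
    load --time_partitioning_field=FlightDate --time_partitioning_type=MONTH \
    --source_format=CSV --ignore_unknown_values --skip_leading_rows=1 --schema=$SCHEMA \
    --replace ${PROJECT}:dsongcp.flights_raw\$201501 \
    gs://${BUCKET}/flights/raw/201501.csv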
Run the script using your bucket name and the year as arguments:
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ bash ../02_ingest/bqload.sh ${PROJECT_ID}-ml 2015
Your active configuration is: [cloudshell-19651]
BigQuery error in show operation: Not found: Dataset qwiklabs-gcp-03-7459aec741fa:dsongcp
Dataset 'qwiklabs-gcp-03-7459aec741fa:dsongcp' successfully created.
Waiting on bqjob_r59cc60adf08d2a61_00000189a3290634_1 ... (15s) Current status: DONE
Waiting on bqjob_r6d28d5e84ea17b26_00000189a3294f4f_1 ... (15s) Current status: DONE
Waiting on bqjob_r78f3117d49aee045_00000189a329981f_1 ... (23s) Current status: DONE
Waiting on bqjob_r5f0600e3d9db58b9_00000189a329fffb_1 ... (23s) Current status: DONE
Waiting on bqjob_r5ce4ab23901948f3_00000189a32a689f_1 ... (15s) Current status: DONE
Waiting on bqjob_r186860fc9484e7dc_00000189a32ab02c_1 ... (23s) Current status: DONE
Waiting on bqjob_r5ac1e696e008660c_00000189a32b181d_1 ... (15s) Current status: DONE
Waiting on bqjob_r1c7deb8ca210d16a_00000189a32b60d2_1 ... (23s) Current status: DONE
Waiting on bqjob_r6801039bf7ecdb8_00000189a32bc7b1_1 ... (23s) Current status: DONE
Waiting on bqjob_r10d59c89cf43ece1_00000189a32c2e9b_1 ... (23s) Current status: DONE
Waiting on bqjob_r2e636e0fb59732f0_00000189a32c9579_1 ... (15s) Current status: DONE
Waiting on bqjob_r27e364f2740b7991_00000189a32cdd77_1 ... (23s) Current status: DONE
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ pwd
/home/student_04_5c9933322c08/data-science-on-gcp/data
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$ ls -l
total 93260
-rw-r--r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 15297207 Jul 29 19:32 data.zip
-rw-rw-r-- 1 student_04_5c9933322c08 student_04_5c9933322c08 80196338 Apr 3 2015 ontime.td.201501.asc
student_04_5c9933322c08@cloudshell:~/data-science-on-gcp/data (qwiklabs-gcp-03-7459aec741fa)$
At this point, the CSV files are in Cloud Storage and the raw data is in BigQuery.
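To spot-check the load, you can count rows per month directly in BigQuery (an optional query, not part of the lab):

bq query --use_legacy_sql=false \
    'SELECT EXTRACT(MONTH FROM FlightDate) AS month, COUNT(*) AS flights
     FROM dsongcp.flights_raw
     GROUP BY month ORDER BY month'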
Summary
- Fetch raw data from a website in CSV format and perform some basic text actions to tidy it up.
- Copy the data to a Cloud Storage bucket.
- Load the data from the Cloud Storage bucket into BigQuery, where it can be reused easily.