This library provides a straightforward data workflow between Jupyter notebooks, Google Drive, and Google Cloud Platform. The library contains three modules:
- google_authentification: Gives authorization to access Google Drive and Google Cloud Platform
- google_drive: Provides the necessary operations on Google Drive, such as creating files, moving files, adding images to a Google Doc, and adding/loading data to/from a Google Spreadsheet
- google_platform: Adds, loads, and deletes files in Google Cloud Storage and BigQuery
Install library
!pip install git+git://github.com/thomaspernet/GoogleDrive-python
Update library
!pip install --upgrade git+git://github.com/thomaspernet/GoogleDrive-python
The motivation behind this library is to automate the data workflow as follows:
One particular objective is to archive the summary statistics or the output of exploratory data analysis in Google Drive.
The module connect_service handles the authorization to perform operations on Google Drive. To connect to the Google services, you need to download a credential file and a service account file with the appropriate authorization.
The credential file grants the authorization to view, read, or write files in Google Drive. The service account file grants the same kind of permissions for Google Cloud Storage and BigQuery. Note that you can change these authorizations in GCP.
There are two different sets of files to create to grant access to Google Cloud:
- credential and token
- service_account
To connect to Google products, you need a credential. The only way to obtain this credential is to go through the data consent screen.
Here are the steps:
- Go to this page to create your credentials
- At the top of the page, select the OAuth consent screen tab. Select an Email address, enter an App name if not already set, and click the Save button.
- Select the Credentials tab, click the Create credentials button and select OAuth client ID.
- Select the application type Other, enter the name "connect_py," and click the Create button.
- Click OK to dismiss the resulting dialog.
- Click the file_download (Download JSON) button to the right of the client ID.
- Move this file to your working directory and rename it credential.json.
The first time you use the library, credential.json is used to generate a unique token through the Google authentication window. During authentication, Google displays a warning that the app is not verified. Click Advanced and proceed; the app's name is the one you defined previously. If the authentication succeeds, you should see "The authentication flow has completed, you may close this window."
After Google authenticates you, a new file named token.pickle is created in the current directory. You need this token every time you connect with the library. Store it in a safe but accessible place.
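For reference, the sketch below shows what this first-time flow typically looks like with the google-auth-oauthlib package, in case you want to generate token.pickle manually. The library performs an equivalent step for you, so this is optional, and the scope below is only an example.
import pickle
from google_auth_oauthlib.flow import InstalledAppFlow

# Example scope; adjust it to the access you need.
SCOPES = ['https://www.googleapis.com/auth/drive']

# Opens the Google consent window in the browser and returns the credentials.
flow = InstalledAppFlow.from_client_secrets_file('credential.json', SCOPES)
creds = flow.run_local_server(port=0)

# Save the token so later connections can reuse it without a new consent.
with open('token.pickle', 'wb') as token:
    pickle.dump(creds, token)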
To create a service account, go to the IAM tab in the iam-admin panel.
- Click Add
- Add three roles: BigQuery Data Viewer, BigQuery Job User, and Viewer
- Go to the service accounts tab
- Select the user you've just created
- Click the three dots on the right side of the window and click Create
- Select JSON and click Create
The service account JSON file is downloaded. You need this file each time you want to connect to GCP. The filename looks like valid-pagoda-XXXXXX.json if you haven't renamed your project.
These roles give enough flexibility to read data in GCS and BigQuery. You can tailor the roles for each user, for instance, to allow only certain users to access some data.
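As a side note, the service account file can also be loaded directly with the official Google clients. The snippet below is only a sketch of the standard google-cloud-storage and google-cloud-bigquery usage, not this library's API; it can help you verify that the file and the roles work.
from google.oauth2 import service_account
from google.cloud import storage, bigquery

# Load the service account key downloaded above.
credentials = service_account.Credentials.from_service_account_file(
    'valid-pagoda-XXXXXX.json')

# Build GCS and BigQuery clients with the same credentials.
storage_client = storage.Client(project=credentials.project_id,
                                credentials=credentials)
bq_client = bigquery.Client(project=credentials.project_id,
                            credentials=credentials)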
Now that you have downloaded the authorization files, you can create the connection. The module connect_service_local provides a quick way to connect to Google Drive and Google Cloud Platform.
- To connect to Google Drive, use the function get_service
- To connect to GCP, use the function get_storage_client
from GoogleDrivePy.google_authentification import connect_service_local
You need to initialize the connection with connect_service_local. There are three arguments:
- path_credential: Path to the credential and the token. Required to connect to Google Drive
- path_service_account: Path to the service account file. Needed to connect to GCP
- scope: Required to set up the first connection, i.e., to download the token
During your first connection, define the path to the credential and the scope.
#pathcredential = '/content/gdrive/My Drive/PATH TO CREDENTIAL/'
pathcredential = '/PATH TO CREDENTIAL/'
scopes = ['https://www.googleapis.com/auth/documents.readonly',
'https://www.googleapis.com/auth/drive',
'https://www.googleapis.com/auth/spreadsheets.readonly']
To initialize the connection to Google Drive only, pass the credential path. The scope is only needed if the token is not available yet.
cs = connect_service_local.connect_service_local(path_credential =pathcredential,
scope = scopes)
The function get_service() returns the Google Drive, Google Docs, and Google Sheets services.
service_drive = cs.get_service()
To initialize the connection to GCP only, pass the path to the service account file. Note that the path should include the filename, which looks like valid-pagoda-XXXXXX.json.
path_serviceaccount = '/PATH TO CREDENTIAL/FILENAME.json'
cs = connect_service_local.connect_service_local(path_service_account =path_serviceaccount)
The function get_storage_client() returns the Google Cloud Storage and Google BigQuery services.
service_gcp = cs.get_storage_client()
You can create a connection for both services:
### Scope is not required since the token is already created
cs = connect_service_local.connect_service_local(path_credential =pathcredential,
path_service_account =path_serviceaccount)
Then use get_service and get_storage_client to connect to the different services.
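For example, once cs is initialized with both paths, the two services can be retrieved in one go:
service_drive = cs.get_service()       # Google Drive, Docs, and Sheets services
service_gcp = cs.get_storage_client()  # Google Cloud Storage and BigQuery services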
There is a module connect_service_colab
to connect to Google Drive or GCP from Google Colab.
After the connection with Google Drive is established, you can use the module connect_drive to perform the following operations:
- Google Drive:
  - Upload file: upload_file_root
  - Find folder ID: find_folder_id
  - Find file ID: find_file_id
  - Move file: move_file
- Google Doc:
  - Find/create doc: access_google_doc
  - Add image to doc: add_image_to_doc
  - Add bullet point: add_bullet_to_doc
- Google Spreadsheet:
  - Add data: add_data_to_spreadsheet
  - Upload data: upload_data_from_spreadsheet
  - Find latest row: getLatestRow
  - Find number of columns: getColumnNumber
  - Find both rows and columns: getRowAndColumns
All functions are in the connect_drive module.
from GoogleDrivePy.google_drive import connect_drive
To use one of the functions above, you need to pass the authorization obtained with get_service:
gdr = connect_drive.connect_drive(service_drive)
- Upload file
f = open("test.txt","w+")
for i in range(10):
f.write("This is line %d\r\n" % (i+1))
f.close()
Check if the file is created locally.
from __future__ import print_function
import os
path = '.'
files = os.listdir(path)
for name in files:
    print(name)
To upload the file to the root of Google Drive, we can use the function upload_file_root. The function has two arguments:
- mime_type: The MIME type of the file. You can use MIME types to filter query results or have your app listed in the Chrome Web Store list of apps that can open specific file types (see the list of MIME types)
- file_name: Name of the file
It returns the ID of the newly created file.
mime_type = "text/plain"
file_name = "test.txt"
gdr.upload_file_root(mime_type, file_name)
- Find Folder
gdr.find_folder_id(folder_name = "FOLDER_NAME")
- Find file
gdr.find_file_id(file_name = "FILE_NAME")
- Move the file to a folder
gdr.move_file(file_name = 'FILE_NAME', folder_name = 'FOLDER_NAME')
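Putting the Drive functions together, a small end-to-end sketch could look like the following. The folder name reports is only a placeholder for an existing folder in your Drive.
# Upload a local file to the Drive root, keeping the returned file ID.
file_id = gdr.upload_file_root("text/plain", "test.txt")
# Look up the target folder ID (placeholder folder name).
folder_id = gdr.find_folder_id(folder_name = 'reports')
# Move the uploaded file into that folder.
gdr.move_file(file_name = 'test.txt', folder_name = 'reports')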
- Find doc
gdr.access_google_doc(doc_name = 'FILE_NAME')
- Add image to doc
This function adds an image to a Google Doc.
gdr.add_image_to_doc(image_name = 'FILE_NAME', doc_name = 'DOC_NAME')
- Add bullet point
gdr.add_bullet_to_doc(doc_name = 'document_test',
name_bullet = 'This is a long test')
- Add data
The function no longer requires the range of the data to be specified. The function add_data_to_spreadsheet has the following arguments:
- data: A pandas dataframe
- sheetID: ID of the spreadsheet to add the data to
- sheetName: Name of the sheet to add the data to. If it does not exist in the spreadsheet, a new sheet is added
- detectRange: Boolean, True by default. Automatically detects where to paste the data. If an existing table is detected in the sheet, the data is appended to it. Otherwise, a new table is created with the pandas dataframe header as header
- rangeData: Set to None by default. The user can supply a custom range, which is useful to paste a table column-wise. Note that the user needs to include the header in the range
The function checks if data exists starting from cell A1.
gdr.add_data_to_spreadsheet(data,
sheetID,
sheetName,
detectRange,
rangeData)
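For example, to append a small dataframe to a sheet named summary (the sheet ID below is a placeholder for your own spreadsheet ID):
import pandas as pd

data = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
gdr.add_data_to_spreadsheet(data = data,
                            sheetID = 'SHEET_ID',
                            sheetName = 'summary',
                            detectRange = True,
                            rangeData = None)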
- Upload data
If to_dataframe = False, the function returns JSON; otherwise, it returns a pandas dataframe.
gdr.upload_data_from_spreadsheet(sheetID, sheetName,
to_dataframe = False)
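For example, to load the sheet back as a pandas dataframe (placeholder sheet ID and sheet name):
df = gdr.upload_data_from_spreadsheet(sheetID = 'SHEET_ID',
                                      sheetName = 'summary',
                                      to_dataframe = True)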
- Find latest row
gdr.getLatestRow(sheetID, sheetName)
- Find number of columns
gdr.getColumnNumber(sheetID, sheetName)
- Find both the latest row and the number of columns
gdr.getRowAndColumns(sheetID, sheetName)
Google Cloud Platform functions are available in the module connect_cloud_platform and are accessible with the service returned by get_storage_client.
- Google Cloud Storage:
  - Upload file to a bucket: upload_blob
  - Delete file from bucket: delete_blob
  - Download file from bucket: download_blob
  - List buckets: list_bucket
  - List all files in a bucket: list_blob
- BigQuery:
  - Add data to a table with automatic format detection: move_to_bq_autodetect
  - Add data to a table with a predefined SQL schema: upload_bq_predefined_sql
  - Load data: upload_data_from_bigquery
  - Delete table: delete_table
  - List datasets: list_dataset
  - List tables in a dataset: list_tables
from GoogleDrivePy.google_platform import connect_cloud_platform
To access GCP, you need to explicitly tell the module which project to use and pass the authorization:
project = 'PROJECT NAME'
gcp = connect_cloud_platform.connect_console(project = project,
service_account = service_gcp)
Note that this service is also accessible from Colab. If you use Colab, add colab = True:
gcp = connect_cloud_platform.connect_console(project = project,
service_account = service_gcp,
colab = True)
To try the function, create a pandas dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)),
columns=list('ABCD'))
df.to_csv("test.csv", index = False)
Go to GCS and create a bucket named machine_learning_teaching with a subfolder library_test.
- Upload a file to a bucket
To upload files to GCS, you need to grant more privileges to the user. Go to iam-admin, select the user, and add the role Storage Admin.
Note that the library does not yet raise a dedicated error if the user does not have the privilege to write to a bucket. If you get the message Not found: bucket name BUCKET_NAME, it is most likely due to a privilege restriction (see the check sketched after the upload example below).
gcp.upload_blob(bucket_name, destination_blob_name, source_file_name)
- bucket_name: Name of the bucket
- destination_blob_name: Name of the subfolder in the bucket; the function saves the file under the source file name
- source_file_name: Local path to the source file. If the blob is not found, it is created automatically with the blob name.
bucket_name = 'machine_learning_teaching'
destination_blob_name = 'test_library'
source_file_name = 'test.csv'
gcp.upload_blob(bucket_name, destination_blob_name, source_file_name)
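If the upload fails with the Not found message above, a quick way to tell whether the problem is the bucket name or the privileges is to look up the bucket directly with the official google-cloud-storage client. This check is outside the library and assumes the google-cloud-storage package is installed.
from google.cloud import storage

# Build a standalone client from the service account file used earlier.
client = storage.Client.from_service_account_json(path_serviceaccount)

# lookup_bucket returns None if the bucket does not exist or is not visible.
if client.lookup_bucket(bucket_name) is None:
    print("Bucket not visible: check the name and the account's privileges")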
- List bucket
gcp.list_bucket()
- List files in bucket
gcp.list_blob(bucket = 'machine_learning_teaching')
- Download file
gcp.download_blob(bucket_name = 'machine_learning_teaching',
destination_blob_name = 'test_library',
source_file_name = 'test.csv')
- Delete file
Only the user with full control of GCS storage can delete files.
gcp.delete_blob(bucket_name, destination_blob_name)
- Add file to a dataset
You need to create a dataset in BigQuery. You can add a table to the dataset.
There are two ways to transfer data from GCS to BigQuery:
- move_to_bq_autodetect: Automatically detects the format of the variables
- upload_bq_predefined_sql: Uses a user-defined format for the variables through a SQL schema
Once again, make sure the user has the right to create a table in the dataset. Go to iam-admin and change the role BigQuery Data Viewer to BigQuery Admin
gcp.move_to_bq_autodetect(dataset_name, name_table, bucket_gcs)
The function uploads a CSV file from Google Cloud Storage to Google BigQuery.
- dataset_name: Name of the dataset
- name_table: Name of the table created in the dataset
- bucket_gcs: Folder and subfolder path in GCS
dataset_name = 'tuto'
name_table = 'test'
bucket_gcs = 'machine_learning_teaching/test_library/test.csv'
gcp.move_to_bq_autodetect(dataset_name, name_table, bucket_gcs)
The dataframe was saved with integer columns only, so the predefined schema is:
SQL_schema = [
['A', 'INTEGER'],
['B', 'INTEGER'],
['C', 'INTEGER'],
['D', 'INTEGER']
]
gcp.upload_bq_predefined_sql(dataset_name='library',
name_table='test_1',
bucket_gcs='machine_learning_teaching/test_library/test.csv',
sql_schema=SQL_schema)
Other formats are available, such as STRING and FLOAT.
Make sure to choose the right format and that the data does not have issues; otherwise, the upload will fail.
- Load data from BigQuery
- Each SQL line should be wrapped in quotes, with a whitespace before the closing quote
- The location must match that of the dataset(s) referenced in the query.
query = (
"SELECT * "
"FROM library.test_1 "
)
gcp.upload_data_from_bigquery(query = query, location = 'US')
- List dataset
gcp.list_dataset()
- List tables
gcp.list_tables(dataset = 'library')
- Delete table
gcp.delete_table(dataset_name = 'library', name_table = 'test')
If you have any questions, you can contact me at [email protected].