
GoogleDrive-python

This library provides a straightforward data workflow between Jupyter notebooks, Google Drive, and Google Cloud Platform. The library contains three modules:

  • google_authentification: Gives authorization to access Google Drive and Google Cloud Platform
  • google_drive: Provides the necessary operations on Google Drive, such as creating files, moving files, adding images to a Google Doc, and adding/loading data to/from a Google Spreadsheet
  • google_platform: Adds, loads, and deletes files in Google Cloud Storage and BigQuery
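
The three modules map to the following imports (these import paths are the ones used in the examples later in this README):

from GoogleDrivePy.google_authentification import connect_service_local
from GoogleDrivePy.google_drive import connect_drive
from GoogleDrivePy.google_platform import connect_cloud_platform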

Install library

!pip install git+https://github.com/thomaspernet/GoogleDrive-python

Update library

!pip install --upgrade git+https://github.com/thomaspernet/GoogleDrive-python

The motivation behind this library is to automate the data workflow.

One particular objective is to archive the summary statistics or the output of exploratory data analysis in Google Drive.

Connect service module

The module connect_service handles the authorization to perform operations on Google Drive. To connect to the Google services, you need to download a credential file and a service account file with the appropriate authorizations.

The credential file gives the authorization to view, read, or write files in Google Drive. The service account file permits the same operations in Google Cloud Storage and BigQuery. Note that you can change these authorizations in GCP.

Configure authorization

There are two different files to create to give access to Google Cloud:

  • credential and token
  • service_account

Credential and token

To connect to Google products, you need a credential. The only way to get this credential is to accept the data consent screen.

Here are the steps:

  1. Go to this page to create your credentials
  2. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter an App name if not already set, and click the Save button.
  3. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  4. Select the application type Other, enter the name "connect_py," and click the Create button.
  5. Click OK to dismiss the resulting dialog.
  6. Click the file_download (Download JSON) button to the right of the client ID.
  7. Move this file to your working directory and rename it credential.json.

The first time you use the library, you will use credential.json to generate a unique token through the Google authentication window. During authentication, Google warns that the app is not verified; click Advanced and proceed. The app's name is the one you defined previously. If the authentication succeeds, you should see The authentication flow has completed, you may close this window.

After Google authenticates you, a new file named token.pickle is created in the current directory. You need this token every time you connect with the library, so store it in a safe but accessible place.
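
If you want to confirm that the token was generated, here is a minimal sketch, assuming the library stores a standard google-auth Credentials object in token.pickle (the usual pattern for this OAuth flow):

import os
import pickle

token_path = 'token.pickle'  # adjust to wherever you stored the token
if os.path.exists(token_path):
    with open(token_path, 'rb') as f:
        creds = pickle.load(f)
    # valid/expired are attributes of google-auth Credentials objects
    print('Token found. Valid:', getattr(creds, 'valid', 'unknown'))
else:
    print('No token yet; the first connection creates it from credential.json')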

Service account

To create a service account, go to the IAM tab in the iam-admin panel.

  1. Click Add
  2. Add three roles: BigQuery Data Viewer, BigQuery Job User, and Viewer
  3. Go to the Service accounts tab
  4. Select the user you've just created
  5. Click the three dots on the right side of the window and click Create key
  6. Select JSON and click Create

The service account JSON file is downloaded. You need this file each time you want to connect to GCP. The filename looks like valid-pagoda-XXXXXX.json if you haven't renamed your project.

These roles give enough flexibility to read data in GCS and BigQuery. You can tailor the roles for each user, for instance to allow only certain users to access some data.
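
As an optional sanity check, you can try loading the service account file directly with the google-auth package before using it with this library (this step is independent of GoogleDrivePy; the path below is a placeholder):

from google.oauth2 import service_account

key_path = '/PATH TO CREDENTIAL/valid-pagoda-XXXXXX.json'  # your downloaded key file
credentials = service_account.Credentials.from_service_account_file(key_path)
print(credentials.service_account_email, credentials.project_id)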

Connect to Google Drive/GCP

Now that you have downloaded the authorization files, you can create the connection. The module connect_service_local provides a quick way to connect to Google Drive and Google Cloud Platform.

  • To connect to Google Drive, use the function get_service
  • To connect to GCP, use the function get_storage_client

from GoogleDrivePy.google_authentification import connect_service_local

You need to initialize the connection with connect_service_local. There are three arguments:

  • path_credential: Path to the credential and the token. Required to connect to Google Drive
  • path_service_account: Path to the service account file. Needed to connect to GCP
  • scope: Required to set up the first connection, i.e., download the token

Google Drive

During your first connection, define the path to the credential and the scope.

# pathcredential = '/content/gdrive/My Drive/PATH TO CREDENTIAL/'
pathcredential = '/PATH TO CREDENTIAL/'
scopes = ['https://www.googleapis.com/auth/documents.readonly',
          'https://www.googleapis.com/auth/drive',
          'https://www.googleapis.com/auth/spreadsheets.readonly']

To initialize the connection to Google Drive only, pass the credential path. Use scope only if the token is not available yet.

cs = connect_service_local.connect_service_local(path_credential =pathcredential,
                                                 scope = scopes)

The function get_service() returns the Google Drive, Google Docs, and Google Sheets services.

service_drive = cs.get_service()

GCS

To initialize the connection to GCS only, pass the path to the service account file. Note that the path should include the filename, which looks like valid-pagoda-XXXXXX.json.

path_serviceaccount = '/PATH TO CREDENTIAL/FILENAME.json'
cs = connect_service_local.connect_service_local(path_service_account =path_serviceaccount)

The function get_storage_client() returns the Google Cloud Storage and Google BigQuery services.

service_gcp = cs.get_storage_client()

You can create a connection for both services:

# Scope is not required since the token is already created
cs = connect_service_local.connect_service_local(path_credential =pathcredential,
                                                 path_service_account =path_serviceaccount)

Then use get_service and get_storage_client to connect to the different services.

There is a module connect_service_colab to connect to Google Drive or GCP from Google Colab.

Quickstart

Google Drive Service

After the connection with Google Drive is established, you can use the module connect_drive to perform the following operations:

  • Google Drive:
    • Upload file: upload_file_root
    • Find folder ID: find_folder_id
    • Find file ID: find_file_id
    • Move file: move_file
  • Google Doc
    • Find/create doc: access_google_doc
    • Add image to doc: add_image_to_doc
    • Add bullet point: add_bullet_to_doc
  • Google Spreadsheet
    • Add data: add_data_to_spreadsheet
    • Upload data: upload_data_from_spreadsheet
    • Find latest row: getLatestRow
    • Find number columns: getColumnNumber
    • Both columns and rows: getRowAndColumns

All functions are in the connect_drive module

from GoogleDrivePy.google_drive import connect_drive

To use one of the functions above, you need to pass the authorization obtained with get_service:

gdr = connect_drive.connect_drive(service_drive)

Google Drive

  1. Upload file
f = open("test.txt","w+")
for i in range(10):
     f.write("This is line %d\r\n" % (i+1))
f.close() 

Check if the file is created locally.

from __future__ import print_function
import os
 
path = '.'
 
files = os.listdir(path)
for name in files:
    print(name)

To upload the file to the root of Google Drive, we can use the function upload_file_root. The function has two arguments:

  • mime_type: The MIME type of the file. You can use MIME types to filter query results or have your app listed in the Chrome Web Store list of apps that can open specific file types (see the list of MIME types)
  • file_name: Name of the file

It returns the ID of the newly created file.

mime_type = "text/plain"
file_name = "test.txt"
gdr.upload_file_root(mime_type, file_name)
  2. Find folder


gdr.find_folder_id(folder_name = "FOLDER_NAME")
  3. Find file
gdr.find_file_id(file_name = "FILE_NAME")
  4. Move the file to a folder
gdr.move_file(file_name = 'FILE_NAME', folder_name = 'FOLDER_NAME')

Google doc

  1. Find doc
gdr.access_google_doc(doc_name = 'FILE_NAME')
  2. Add image to doc

This function adds an image to a Google Doc.

gdr.add_image_to_doc(image_name = 'FILE_NAME', doc_name = 'DOC_NAME')
  3. Add bullet point
gdr.add_bullet_to_doc(doc_name = 'document_test',
                      name_bullet = 'This is a long test')
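
Putting the Google Drive and Google Docs functions together, here is a hedged sketch of the archiving workflow mentioned in the introduction: save a chart locally, upload it to Drive, then insert it into a document. The file and document names are placeholders, and it assumes add_image_to_doc looks the image up by name once it has been uploaded to Drive.

import matplotlib.pyplot as plt

# Save a chart locally (placeholder figure)
plt.plot([1, 2, 3], [4, 5, 6])
plt.savefig('summary_chart.png')

# Upload the image to the root of Google Drive, then attach it to a document
gdr.upload_file_root('image/png', 'summary_chart.png')
gdr.add_image_to_doc(image_name = 'summary_chart.png', doc_name = 'document_test')
gdr.add_bullet_to_doc(doc_name = 'document_test',
                      name_bullet = 'Summary chart added automatically')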

Google Spreadsheet

  1. Add data

We updated the function so that there is no longer any need to specify the range of the data. The function add_data_to_spreadsheet has the following arguments:

  • data: A pandas dataframe
  • sheetID: ID of the spreadsheet to add the data to
  • sheetName: Name of the sheet to add the data to. If it does not exist in the spreadsheet, a new sheet is added
  • detectRange: Boolean, True by default. Automatically detects where to paste the data. If an existing table is detected in the sheet, the data is appended to it. Otherwise, a new table is created with the pandas dataframe columns as header
  • rangeData: Set to None by default. The user can provide a custom range, which is useful to paste a table column-wise. Note that the user needs to include the header in the range

The function checks whether data already exists starting from cell A1.

gdr.add_data_to_spreadsheet(data,
                        sheetID,
                        sheetName,
                        detectRange,
                        rangeData)
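
As a concrete example, here is a call with a small example dataframe and the default automatic range detection. The spreadsheet ID is a placeholder, and the keyword names simply mirror the argument list above.

import pandas as pd

stats = pd.DataFrame({'variable': ['A', 'B'], 'mean': [0.5, 1.2]})  # example data
sheetID = 'YOUR_SPREADSHEET_ID'  # placeholder: the ID from the spreadsheet URL
gdr.add_data_to_spreadsheet(data = stats,
                            sheetID = sheetID,
                            sheetName = 'summary_stats',
                            detectRange = True,
                            rangeData = None)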
  2. Upload data

If to_dataframe = False, the function returns JSON; otherwise it returns a pandas dataframe.

gdr.upload_data_from_spreadsheet(sheetID, sheetName,
                                 to_dataframe = False)
  3. Find latest row
gdr.getLatestRow(sheetID, sheetName)
  4. Find number of columns
gdr.getColumnNumber(sheetID, sheetName)
  5. Find both latest row and number of columns
gdr.getRowAndColumns(sheetID, sheetName)

Google Cloud Platform

Google Cloud Platform functions are available in the module connect_cloud_platform and accessible through get_storage_client.

  • Google Cloud Storage
    • Upload file to a bucket: upload_blob
    • Delete file from bucket: delete_blob
    • Download file from bucket: download_blob
    • List buckets: list_bucket
    • List all files in a bucket: list_blob
  • Big Query
    • Add data to table with automatic format detection: move_to_bq_autodetect
    • Add data to table with predefined SQL: upload_bq_predefined_sql
    • Load data: upload_data_from_bigquery
    • Delete table: delete_table
    • List datasets: list_dataset
    • List tables in a dataset: list_tables

from GoogleDrivePy.google_platform import connect_cloud_platform

To access GCP, you need to explicitly tell the library which project to use and pass the authorization:

project = 'PROJECT NAME'
gcp = connect_cloud_platform.connect_console(project = project, 
                                             service_account = service_gcp)

Note that this service is also accessible from Colab. If you use Colab, add colab = True:

gcp = connect_cloud_platform.connect_console(project = project, 
                                             service_account = service_gcp,
                                             colab = True)

Google Storage

To try the functions, create a pandas dataframe:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)),
                  columns=list('ABCD'))
df.to_csv("test.csv", index = False)

In GCS, create a bucket named machine_learning_teaching and a subfolder test_library.

  1. Upload a file to a bucket

To upload files to GCS, you need to grant the user more privileges. Go to iam-admin, select the user, and add the new role Storage Admin.

Note that the library does not yet raise an explicit error if the user does not have the privilege to write to a bucket. If the user gets the message Not found: bucket name BUCKET_NAME, it is most likely because of a privilege restriction.

gcp.upload_blob(bucket_name, destination_blob_name, source_file_name)
  • bucket_name: Name of the bucket
  • destination_blob_name: Name of the subfolder in the bucket. The function saves the file under the source file name
  • source_file_name: Path to the local source file. If the blob is not found, it is created automatically with the blob name

bucket_name = 'machine_learning_teaching'
destination_blob_name = 'test_library'
source_file_name = 'test.csv'
gcp.upload_blob(bucket_name, destination_blob_name, source_file_name)
  2. List buckets
gcp.list_bucket()
  3. List files in a bucket
gcp.list_blob(bucket = 'machine_learning_teaching')
  4. Download file
gcp.download_blob(bucket_name = 'machine_learning_teaching',
                  destination_blob_name = 'test_library',
                  source_file_name = 'test.csv')
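
To check that the download worked, you can read the file back with pandas and compare its shape with the dataframe created above (this assumes the blob is downloaded to test.csv in the working directory):

import pandas as pd

df_check = pd.read_csv('test.csv')
print(df_check.shape)  # expected (100, 4) for the dataframe created earlier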
  5. Delete file

Only the user with full control of GCS storage can delete files.

gcp.delete_blob(bucket_name, destination_blob_name)

Big Query

  1. Add file to a dataset

You need to create a dataset in BigQuery before you can add a table to it.

There are two ways to transfer data from GCS to BigQuery:

  • move_to_bq_autodetect: Auto-detects the format of the variables
  • upload_bq_predefined_sql: The user predefines the format of the variables using SQL

Once again, make sure the user has the right to create a table in the dataset. Go to iam-admin and change the role BigQuery Data Viewer to BigQuery Admin.

Auto detect

gcp.move_to_bq_autodetect(dataset_name, name_table, bucket_gcs)

The function uploads a CSV file from Google Cloud Storage to Google BigQuery:

  • dataset_name: Name of the dataset
  • name_table: Name of the table created in the dataset
  • bucket_gcs: Folder and subfolder from GCS

dataset_name = 'tuto'
name_table = 'test'
bucket_gcs = 'machine_learning_teaching/test_library/test.csv'
gcp.move_to_bq_autodetect(dataset_name, name_table, bucket_gcs)

Predefined SQL

The dataframe we saved earlier contains only integer columns, so we predefine the schema as follows:

SQL_schema = [
    ['A', 'INTEGER'],
    ['B', 'INTEGER'],
    ['C', 'INTEGER'],
    ['D', 'INTEGER']
]
gcp.upload_bq_predefined_sql(dataset_name='library',
                             name_table='test_1',
                             bucket_gcs='machine_learning_teaching/test_library/test.csv',
                             sql_schema=SQL_schema)

Other formats are available:

  • STRING
  • FLOAT

Make sure to choose the right format and that the data does not have issues; otherwise the upload will fail.
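
One way to avoid a mismatch is to derive the schema from the pandas dtypes before uploading. Here is a minimal sketch; the helper below is not part of the library, and the mapping only covers the three types listed above.

def pandas_to_bq_schema(df):
    """Map pandas dtypes to the BigQuery types listed above (INTEGER, FLOAT, STRING)."""
    mapping = {'int64': 'INTEGER', 'float64': 'FLOAT', 'object': 'STRING'}
    return [[col, mapping.get(str(dtype), 'STRING')] for col, dtype in df.dtypes.items()]

SQL_schema = pandas_to_bq_schema(df)
print(SQL_schema)  # [['A', 'INTEGER'], ['B', 'INTEGER'], ['C', 'INTEGER'], ['D', 'INTEGER']]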

  2. Load data from BigQuery
  • Each SQL line should be wrapped in quotes, with a whitespace before the closing quote
  • The location must match that of the dataset(s) referenced in the query.
query = (
    "SELECT * "
    "FROM library.test_1 "
)
gcp.upload_data_from_bigquery(query = query, location = 'US')
  3. List datasets
gcp.list_dataset()
  4. List tables
gcp.list_tables(dataset = 'library')
  5. Delete table
gcp.delete_table(dataset_name = 'library', name_table = 'test')

If you have any questions, you can contact me at [email protected].
