gaapi4py's Introduction

gaapi4py

Google Analytics Reporting API v4 for Python 3

Prerequisites

To use this library, you need a project in Google Cloud Platform and a service account key that has access to the Google Analytics account you want to get data from.
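
For example, you can point the client at your key by setting the standard GOOGLE_APPLICATION_CREDENTIALS environment variable before creating the client; the path below is a placeholder:

import os

# Placeholder path; point this at your own service account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/keyfile.json'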

Quick Start

from gaapi4py import GAClient
# if GOOGLE_APPLICATION_CREDENTIALS is set:
c = GAClient() 
# or you may specify keyfile path:
c = GAClient(json_keyfile="path/to/keyfile.json")


request_body = {
    'view_id': '123456789',
    'start_date': '2019-01-01',
    'end_date': '2019-01-31',
    'dimensions': {
        'ga:sourceMedium',
        'ga:date'
    },
    'metrics': {
        'ga:sessions'
    },
    'filter': 'ga:sourceMedium==google / organic' # optional filter clause
}

response = c.get_all_data(request_body)

response['info'] # sampling and "golden" metadata

response['data'] # Pandas dataframe that contains data from GA
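
The info dict mirrors the metadata of the underlying API report. Below is a minimal sketch of a sampling guard, assuming the keys match the raw Reporting API v4 fields (isDataGolden, samplesReadCounts); check your installed version if these names differ:

info = response['info']
# isDataGolden is False while the numbers may still change (assumed key name)
if not info.get('isDataGolden', True):
    print('Data is not golden yet; rerun later for final numbers')
# samplesReadCounts is present only when the report is sampled (assumed key name)
if info.get('samplesReadCounts'):
    print('Report is sampled:', info['samplesReadCounts'])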

If you want to make many requests to a specific view or over a specific date range, you can set the view ID and date ranges once for all future requests:

# Pass arguments to class init
c = GAClient(view_id="123456789", start_date="2019-09-01", end_date="2019-09-07") 
# or use methods to override the view ID or date ranges
c.set_view_id('123456789')
c.set_dateranges('2019-01-01', '2019-01-31')

request_body_1 = {
    'dimensions': {
        'ga:sourceMedium',
        'ga:date'
    },
    'metrics': {
        'ga:sessions'
    }
}

request_body_2 = {
    'dimensions': {
        'ga:deviceCategory',
        'ga:date'
    },
    'metrics': {
        'ga:sessions'
    }
}

response_1 = c.get_all_data(request_body_1)
response_2 = c.get_all_data(request_body_2)

Avoid sampling by taking data day-by-day

Important! The Google Analytics Reporting API has a limit of 100 requests per 100 seconds. If you iterate over a large range of days, consider adding sleep(1) at the end of the loop to avoid hitting this limit.

from datetime import date, timedelta
from time import sleep

import pandas as pd
from gaapi4py import GAClient

c = GAClient(view_id='123456789')

start_date = date(2019,7,1)
end_date = date(2019,7,14)

df_list = []
iter_date = start_date
while iter_date <= end_date:
    c.set_dateranges(iter_date, iter_date)
    response = c.get_all_data({
        'dimensions': {
            'ga:sourceMedium',
            'ga:deviceCategory'
        },
        'metrics': {
            'ga:sessions'
        }
    })
    df = response['data']
    df['date'] = iter_date
    df_list.append(df)
    iter_date = iter_date + timedelta(days=1)
    sleep(1)

all_data = pd.concat(df_list, ignore_index=True)

Avoid "maximum 7 dimensions" restriction

If you store a session ID and/or hit ID as custom dimensions (see the example implementation on Simo Ahava's blog), you can circumvent the restriction on the maximum number of dimensions and metrics in one report. Example below:

If sampling starts to appear, try breaking the set of dimensions into smaller parts and running separate queries on them.

one_day = date(2019,7,1)
c.set_dateranges(one_day, one_day)

SESSION_ID_CD_INDEX = '2'
HIT_ID_CD_INDEX = '5'

session_id = 'dimension' + SESSION_ID_CD_INDEX
hit_id = 'dimension' + HIT_ID_CD_INDEX


#Get session scope data
response_1 = c.get_all_data({
    'dimensions': {
        'ga:' + session_id,
        'ga:sourceMedium',
        'ga:campaign',
        'ga:keyword',
        'ga:adContent',
        'ga:userType',
        'ga:deviceCategory'
    },
    'metrics': {
        'ga:sessions'
    }
})

response_2 = c.get_all_data({
    'dimensions': {
        'ga:' + session_id,
        'ga:landingPagePath',
        'ga:secondPagePath',
        'ga:exitPagePath',
        'ga:pageDepth',
        'ga:daysSinceLastSession',
        'ga:sessionCount'
    },
    'metrics': {
        'ga:hits',
        'ga:totalEvents',
        'ga:bounces',
        'ga:sessionDuration'
    }
})

all_data = response_1['data'].merge(response_2['data'], on=session_id, how='left')

all_data.rename(index=str, columns={
    session_id: 'session_id'
}, inplace=True)

all_data.head()

# Get hit scope data
hits_response_1 = c.get_all_data({
    'dimensions': {
        'ga:' + session_id,
        'ga:' + hit_id,
        'ga:pagePath',
        'ga:previousPagePath',
        'ga:dateHourMinute'
    },
    'metrics': {
        'ga:hits',
        'ga:totalEvents',
        'ga:pageviews'
    }
})

hits_response_2 = c.get_all_data({
    'dimensions': {
        'ga:' + session_id,
        'ga:' + hit_id,
        'ga:eventCategory',
        'ga:eventAction',
        'ga:eventLabel'
    },
    'metrics': {
        'ga:totalEvents'
    }
})

all_hits_data = hits_response_1['data'].merge(hits_response_2['data'],
                                              on=[session_id, hit_id],
                                              how='left')
all_hits_data.rename(index=str, columns={
    session_id: 'session_id',
    hit_id: 'hit_id'
}, inplace=True)

all_hits_data.head()

gaapi4py's People

Contributors

oleg-dt, ptrvtch


gaapi4py's Issues

DEPRECATION NOTICE: Add info to the README saying that this library is for Universal Analytics, which is deprecated on 2023-07-01

As far as I can tell, this library is for Universal Analytics (UA), which is deprecated on 2023-07-01 (evidence: it uses view IDs, and examples contain 'UA' in the property ID). UA's final API version was the Analytics Reporting API v4, which this library appears to use.

The current version of Google Analytics is Google Analytics 4 (GA4), whose Data API is v1. This library is incompatible with GA4, which replaced UA in 2022-2023.
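
For reference, here is a minimal sketch of a comparable query against the GA4 Data API v1 using Google's official google-analytics-data package; the property ID is a placeholder, and this is not part of gaapi4py:

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS
request = RunReportRequest(
    property='properties/123456789',  # placeholder GA4 property ID
    dimensions=[Dimension(name='sessionSourceMedium')],
    metrics=[Metric(name='sessions')],
    date_ranges=[DateRange(start_date='2023-01-01', end_date='2023-01-31')],
)
response = client.run_report(request)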

oauth2client error

I get the error:

ImportError: cannot import name 'ServiceAccountCredentials' from 'oauth2client.service_account'

which may be because oauth2client is deprecated
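
A likely fix is to build credentials with the maintained google-auth package instead of oauth2client; the following is a sketch of what loading the key file could look like (not gaapi4py's actual internals):

from google.oauth2 import service_account

# google-auth replacement for oauth2client's ServiceAccountCredentials
SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
credentials = service_account.Credentials.from_service_account_file(
    'path/to/keyfile.json', scopes=SCOPES
)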

Incorrect Records

Hi,

Firstly, many thanks for developing and sharing this library!

Below is the output returned from my query:

  1. What does it mean by Data is not golden?
  2. The dataframe size is different from the sampled size. 936 rows as opposed to 96272 rows. Is it a bug?

Data is not golden
Data is sampled! Sampling size: 72.84%, 96272 rows were taken out of 132174

response['data']
[936 rows x 6 columns]
