bigquery-plugin's Introduction

Warning

This plugin has been deprecated in favor of our BigQuery destination for batch exports.

Google BigQuery Plugin

Send events to a BigQuery database on ingestion.

Installation

  1. Visit 'Plugins' in PostHog
  2. Find this plugin in the repository, or install it from https://github.com/PostHog/bigquery-plugin
  3. Configure the plugin
    1. Upload your Google Cloud key .json file. (See below for permissions and how to retrieve this.)
    2. Enter your Dataset ID
    3. Enter your Table ID
  4. Watch events roll into BigQuery

BigQuery permissions and service account setup

To set up the right permissions for the BigQuery plugin, you'll need:

  1. A service account.
  2. A dataset which has permissions allowing the service account to access it.

Here's how to set these up so that the PostHog plugin has access only to the table it needs:

  1. Create a service account. Keep the JSON key file generated at the end of these steps for setting up the plugin, and make a note of the service account's name.

  2. Create a role which has only the specific permissions the PostHog BigQuery plugin requires (listed below), or use the built-in BigQuery Data Owner role. If you create a custom role, you will need:

    • bigquery.datasets.get
    • bigquery.tables.create
    • bigquery.tables.get
    • bigquery.tables.list
    • bigquery.tables.updateData
  3. Create a dataset within a BigQuery project (ours is called posthog, but any name will do).

  4. Follow the instructions on granting access to a dataset in BigQuery to ensure your new service account has been granted either the role you created or the "BigQuery Data Owner" permission.

    (Screenshot: the BigQuery SQL workspace in the Google Cloud console.)

    Use the Share Dataset button to share your dataset with your new service account, assigning either the BigQuery Data Owner role or the custom role you created above. In the screenshot below, we've used a custom role called PostHog Ingest (a SQL alternative to the console flow is sketched after these steps).

    (Screenshot: sharing the dataset with the service account in the BigQuery console.)
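
If you prefer SQL to the console flow, the same dataset-level grant can be expressed with BigQuery's GRANT statement. This is a minimal sketch assuming a project called my-project, a dataset called posthog, and a service account named posthog-export (all hypothetical names); it grants the predefined BigQuery Data Owner role. If you want to grant your custom role instead, the Share Dataset flow above is the safer route.

-- Hypothetical project, dataset, and service account names; adjust to yours.
GRANT `roles/bigquery.dataOwner`
ON SCHEMA `my-project.posthog`
TO "serviceAccount:posthog-export@my-project.iam.gserviceaccount.com";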

That's it! Once you've done the steps above, your data should start flowing from PostHog to BigQuery.

Troubleshooting

Duplicate events

In rare cases, duplicate events can appear in BigQuery. This happens when a network error makes an export look like it failed even though it actually reached BigQuery, so the export is retried.

While this shouldn't happen, if you find duplicate events in BigQuery, follow these Google Cloud docs to manually remove them.

Here is an example query, based on the Google Cloud docs, that returns the data with duplicates removed (one row per uuid); a sketch for writing the result back follows the query:

WITH

-- first add a row number, one for each uuid
raw_data AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY uuid) AS ROW_NUMBER
  FROM
    `<project_id>.<dataset>.<table>`
  WHERE
    DATE(timestamp) = '<YYYY-MM-DD>'
),

-- now just filter for one row per uuid
raw_data_deduplicated AS (
  SELECT
    * EXCEPT(ROW_NUMBER)
  FROM
    raw_data
  WHERE
    ROW_NUMBER = 1
)

SELECT
  *
FROM
  raw_data_deduplicated
;
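
The query above only returns the deduplicated rows; to actually remove the duplicates you still need to write the result back. A minimal sketch, reusing the placeholder names from the query above, is to rewrite the table keeping one row per uuid. Note that this rewrites the whole table, drops the single-day restriction, and assumes the table has no partitioning or clustering options you need to preserve.

-- Sketch only: overwrites the table with one row per uuid.
-- Replace <project_id>, <dataset> and <table> as in the query above.
CREATE OR REPLACE TABLE `<project_id>.<dataset>.<table>` AS
SELECT *
FROM `<project_id>.<dataset>.<table>`
WHERE TRUE  -- some BigQuery versions require a WHERE/GROUP BY/HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY uuid) = 1;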

bigquery-plugin's People

Contributors

andrewm4894, edhgoose, hazzadous, macobo, marconlp, mariusandra, neilkakkar, tiina303, timgl, twixes, xvello, yakkomajuri

bigquery-plugin's Issues

What permissions are required?

I'd love to extend the README to include examples of what permissions a service account should have. I hope to work on this myself, but if anyone can help describe them, we'd appreciate it.

We'd love to know:
a) Which specific permissions are required
b) How to constrain them so that the Service Account only has access to the posthog dataset (and not other datasets in the same account).

BigQuery is very flaky on local / self-hosted

The BigQuery client takes > 30s for insert requests when run locally. We need to investigate why this is happening, since if we're facing this, other self-hosted users would be too.

What's mystifying is that it works perfectly fine on cloud - there's never* a timeout.

While we don't lose data, thanks to the retry job queues, it would be nice to figure out why this is happening in the first place.


More context: See #9

Partition table based on timestamp

I just started using PostHog's BigQuery integration and noticed the created table has no partitioning whatsoever.
Given that in most cases this data can grow a lot, I think it would be ideal to create the table with a timestamp partition by date or month, so that queries can be optimized and some money saved on the BigQuery side.
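
For illustration only, a date-partitioned events table could be declared roughly like this; the project, dataset, and column names below are hypothetical and are not the plugin's actual schema:

-- Hypothetical DDL: a date-partitioned events table (illustrative columns only).
CREATE TABLE `my-project.posthog.events`
(
  uuid STRING,
  event STRING,
  properties STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp)
OPTIONS (description = 'PostHog events, partitioned by event date');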

TypeError [ERR_INVALID_ARG_TYPE]

There was a type error in sentry: https://sentry.io/organizations/posthog/issues/2203245089/?project=5592816&query=is%3Aunassigned+is%3Aunresolved

Message:

[___4] 🔔 TypeError [ERR_INVALID_ARG_TYPE]: The "string" argument must be of type string or an instance of Buffer or ArrayBuffer. Received type boolean (true)
    at Function.byteLength (buffer.js:728:11)
    at BufferWriter.write_string_buffer [as string] (/code/plugins/node_modules/protobufjs/src/writer_buffer.js:70:27)
    at Function.encode (/code/plugins/node_modules/@posthog/plugin-server/dist/idl/protos.js:48:31)
    at Function.encodeDelimited (/code/plugins/node_modules/@posthog/plugin-server/dist/idl/protos.js:65:21)
    at EventsProcessor.createEvent (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/process-event.js:346:47)
    at EventsProcessor.captureEE (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/process-event.js:251:27)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async EventsProcessor.processEvent (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/process-event.js:81:22)
    at async Object.ingestEvent (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/ingest-event.js:33:9)
    at async /code/plugins/node_modules/@posthog/plugin-server/dist/worker/worker.js:40:40
    at async /code/plugins/node_modules/@posthog/piscina/dist/src/worker.js:141:26 {
  code: 'ERR_INVALID_ARG_TYPE'
}

worth adding a bq_timestamp?

I'm just wondering if it would be worth adding something like a bq_ingested_timestamp to the events being sent to BQ, so that users could monitor the difference between when an event was created and when it landed in BigQuery.

To me, that would be something I would monitor in BQ, raising errors or alerts if the BQ data started getting stale or slow in some way.

Not sure how relevant that is in terms of how events flow through the system, but I was just thinking.
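
For example, if such a bq_ingested_timestamp column existed, a monitoring query could compare it with the event timestamp along these lines (table and column names are hypothetical):

-- Hypothetical monitoring query; assumes a bq_ingested_timestamp column exists.
SELECT
  DATE(timestamp) AS event_date,
  AVG(TIMESTAMP_DIFF(bq_ingested_timestamp, timestamp, MINUTE)) AS avg_lag_minutes,
  MAX(TIMESTAMP_DIFF(bq_ingested_timestamp, timestamp, MINUTE)) AS max_lag_minutes
FROM `my-project.posthog.events`
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY event_date
ORDER BY event_date;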

I would love to anonymise IP address (or exclude it)

We'd prefer if we weren't storing PII in our BigQuery dataset. I think most of it is excluded by virtue of only using user_id and other properties, but I note that the IP address is included in the events.

It'd be awesome to have a feature where the IP address is either anonymised (as a hash perhaps?) or removed.

I think a hash would be backwards compatible (for storing), as the schema is just a string?
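
As a rough illustration of the backwards-compatibility point: a SHA-256 hash (whether computed in the plugin before export or in BigQuery afterwards) is still just a string, so it would fit the existing column type:

-- Illustration only: hashing an IP still produces a plain STRING value.
SELECT TO_HEX(SHA256('203.0.113.42')) AS hashed_ip;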

Route events to different BigQuery tables

Feature request from a customer:

Right now all the events just go to one table, but that can become a bit of a bottleneck - it would be cool to be able to define a filter of some sort and route events to multiple output tables in BQ.

We are starting to have many different types of events, and they are all landing in one biggish table in BQ, which is becoming a little messy on the ETL side - it would be great if I could just define that some events go to table A in BQ and some go to table B, etc.
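
Until something like that exists, one possible workaround on the BigQuery side (hypothetical names, and assuming the export table has an event column) is to layer per-event-type views over the single export table:

-- Hypothetical workaround: a view that splits out one event type.
CREATE OR REPLACE VIEW `my-project.posthog.pageview_events` AS
SELECT *
FROM `my-project.posthog.events`
WHERE event = '$pageview';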

Expected Behaviour for Existing Tables without defined schema

Problem
=> We added the BQ plugin to our PostHog instance and tried to load data into an existing table called user_events.
=> First, I did not add any schema at all, as the docs don't say one is required. However, the plugin then throws an error and fails:
setupPlugin failed with Error: Can not get metadata for table. Please check if the table schema is defined

However, looking at the code I do not think this is expected behaviour:

if (!metadata.schema || !metadata.schema.fields) {
    throw new Error('Can not get metadata for table. Please check if the table schema is defined.')
}

Assumption
Why wouldn't it just proceed to the field-matching step here, as opposed to throwing an error? I would expect the plugin to treat this just like any other schema difference (since there is no schema, it differs from the desired PostHog schema) and therefore create all the required columns. Is there a particular reason the plugin throws in that case?

Right now, I have to either add a dummy schema beforehand or let the script create the table. Having the script handle a table without a schema does not seem to work. Am I missing something?

Thanks for your explanation 🙏

Missing user properties in exported events

We're testing out this BigQuery export app but noticed that custom user properties (https://posthog.com/docs/getting-started/user-properties) are missing from the data exported to BigQuery. This is important information for us to export, as it allows us to better map our data for our internal users.

The default $set and $set_once keys and values are present, but no custom user properties are; here is an example of such custom user properties as seen on the "Events" page in PostHog:
(Screenshot from 2023-06-30: custom user properties shown on the Events page in PostHog.)

Could it be an issue in this code area, where it might be missing pulling in some other "event" property? Or is something else going on?

Thanks!
