bigquery-plugin's Introduction

Warning

This plugin has been deprecated in favor of our BigQuery destination for batch exports.

Google BigQuery Plugin

Send events to a BigQuery database on ingestion.

Installation

  1. Visit 'Plugins' in PostHog
  2. Find this plugin in the repository, or install it from https://github.com/PostHog/bigquery-plugin
  3. Configure the plugin
    1. Upload your Google Cloud key .json file. (See below for permissions and how to retrieve this.)
    2. Enter your Dataset ID
    3. Enter your Table ID
  4. Watch events roll into BigQuery

BigQuery permissions and service account setup

To set up the right permissions for the BigQuery plugin, you'll need:

  1. A service account.
  2. A dataset which has permissions allowing the service account to access it.

Here's how to set these up so that the PostHog plugin has access only to the table it needs:

  1. Create a service account. Keep the JSON key file generated at the end of these steps for setting up the plugin, and make a note of the service account's name.

  2. Create a role which has only the specific permissions the PostHog BigQuery plugin requires (listed below), or use the built-in BigQuery Data Owner role. If you create a custom role, you will need:

    • bigquery.datasets.get
    • bigquery.tables.create
    • bigquery.tables.get
    • bigquery.tables.list
    • bigquery.tables.updateData
  3. Create a dataset within a BigQuery project (ours is called posthog, but any name will do).

  4. Follow the instructions on granting access to a dataset in BigQuery to ensure your new service account has been granted either the role you created or the "BigQuery Data Owner" permission.

    (Screenshot: the BigQuery SQL workspace in the Google Cloud console.)

    Use the Share Dataset button to share your dataset with your new service account, assigning either the BigQuery Data Owner role or the custom role you created above. In the screenshot below, we've used a custom role called PostHog Ingest (a SQL alternative to the console flow is sketched after these steps).

    (Screenshot: sharing the dataset with the service account in the BigQuery console.)
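
If you prefer SQL to the console flow, the same dataset-level grant can be expressed with BigQuery's GRANT statement. This is a minimal sketch assuming a project called my-project, a dataset called posthog, and a service account named posthog-export (all hypothetical names); it grants the predefined BigQuery Data Owner role. If you want to grant your custom role instead, the Share Dataset flow above is the safer route.

-- Hypothetical project, dataset, and service account names; adjust to yours.
GRANT `roles/bigquery.dataOwner`
ON SCHEMA `my-project.posthog`
TO "serviceAccount:posthog-export@my-project.iam.gserviceaccount.com";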

That's it! Once you've done the steps above, your data should start flowing from PostHog to BigQuery.

Troubleshooting

Duplicate events

In rare cases, duplicate events can appear in BigQuery. This happens when a network error makes an export look like it failed even though it actually reached BigQuery, so the export is retried.

While this shouldn't happen, if you find duplicate events in BigQuery, follow these Google Cloud docs to manually remove them.

Here is an example query, based on the Google Cloud docs, that returns the data with duplicates removed (one row per uuid); a sketch for writing the result back follows the query:

WITH

-- first add a row number, one for each uuid
raw_data AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY uuid) AS ROW_NUMBER
  FROM
    `<project_id>.<dataset>.<table>`
  WHERE
    DATE(timestamp) = '<YYYY-MM-DD>'
),

-- now just filter for one row per uuid
raw_data_deduplicated AS (
  SELECT
    * EXCEPT(ROW_NUMBER)
  FROM
    raw_data
  WHERE
    ROW_NUMBER = 1
)

SELECT
  *
FROM
  raw_data_deduplicated
;
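
The query above only returns the deduplicated rows; to actually remove the duplicates you still need to write the result back. A minimal sketch, reusing the placeholder names from the query above, is to rewrite the table keeping one row per uuid. Note that this rewrites the whole table, drops the single-day restriction, and assumes the table has no partitioning or clustering options you need to preserve.

-- Sketch only: overwrites the table with one row per uuid.
-- Replace <project_id>, <dataset> and <table> as in the query above.
CREATE OR REPLACE TABLE `<project_id>.<dataset>.<table>` AS
SELECT *
FROM `<project_id>.<dataset>.<table>`
WHERE TRUE  -- some BigQuery versions require a WHERE/GROUP BY/HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY uuid) = 1;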

bigquery-plugin's People

Contributors

andrewm4894, edhgoose, hazzadous, macobo, marconlp, mariusandra, neilkakkar, tiina303, timgl, twixes, xvello, yakkomajuri

bigquery-plugin's Issues

What permissions are required?

I'd love to extend the README to include examples of what permissions a service account should have. I hope to work on this myself, but if anyone can help describe them, we'd appreciate it.

We'd love to know:
a) Which specific permissions are required
b) How to constrain them so that the Service Account only has access to the posthog dataset (and not other datasets in the same account).

BigQuery is very flaky on local / self-hosted

The BigQuery client takes > 30s for insert requests when run locally. We need to investigate why this is happening, since if we're facing this, other self-hosted users would be too.

What's mystifying is that it works perfectly fine on cloud - there's never* a timeout.

While we don't lose data, thanks to the retry job queues, it would be nice to figure out why this is happening in the first place.


More context: See #9

Partition table based on timestamp

I just started using PostHog's BigQuery integration and noticed the created table has no partitioning whatsoever.
Given that in most cases this data can grow a lot, I think it would be ideal to create the table with a timestamp partition by date or month, so that queries can be optimized and some money saved on the BigQuery side.
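
For illustration only, a date-partitioned events table could be declared roughly like this; the project, dataset, and column names below are hypothetical and are not the plugin's actual schema:

-- Hypothetical DDL: a date-partitioned events table (illustrative columns only).
CREATE TABLE `my-project.posthog.events`
(
  uuid STRING,
  event STRING,
  properties STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp)
OPTIONS (description = 'PostHog events, partitioned by event date');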

TypeError [ERR_INVALID_ARG_TYPE]

There was a type error in sentry: https://sentry.io/organizations/posthog/issues/2203245089/?project=5592816&query=is%3Aunassigned+is%3Aunresolved

Message:

[___4] 🔔 TypeError [ERR_INVALID_ARG_TYPE]: The "string" argument must be of type string or an instance of Buffer or ArrayBuffer. Received type boolean (true)
    at Function.byteLength (buffer.js:728:11)
    at BufferWriter.write_string_buffer [as string] (/code/plugins/node_modules/protobufjs/src/writer_buffer.js:70:27)
    at Function.encode (/code/plugins/node_modules/@posthog/plugin-server/dist/idl/protos.js:48:31)
    at Function.encodeDelimited (/code/plugins/node_modules/@posthog/plugin-server/dist/idl/protos.js:65:21)
    at EventsProcessor.createEvent (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/process-event.js:346:47)
    at EventsProcessor.captureEE (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/process-event.js:251:27)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async EventsProcessor.processEvent (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/process-event.js:81:22)
    at async Object.ingestEvent (/code/plugins/node_modules/@posthog/plugin-server/dist/ingestion/ingest-event.js:33:9)
    at async /code/plugins/node_modules/@posthog/plugin-server/dist/worker/worker.js:40:40
    at async /code/plugins/node_modules/@posthog/piscina/dist/src/worker.js:141:26 {
  code: 'ERR_INVALID_ARG_TYPE'
}

worth adding a bq_timestamp?

I'm just wondering if it would be worth adding something like a bq_ingested_timestamp to the events being sent to BQ, so that users could monitor the difference between when an event was created and when it landed in BigQuery.

To me, that would be something I would monitor in BQ, raising errors or alerts if the BQ data started getting stale or slow in some way.

Not sure how relevant that is in terms of how events flow through the system, but I was just thinking.
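
For example, if such a bq_ingested_timestamp column existed, a monitoring query could compare it with the event timestamp along these lines (table and column names are hypothetical):

-- Hypothetical monitoring query; assumes a bq_ingested_timestamp column exists.
SELECT
  DATE(timestamp) AS event_date,
  AVG(TIMESTAMP_DIFF(bq_ingested_timestamp, timestamp, MINUTE)) AS avg_lag_minutes,
  MAX(TIMESTAMP_DIFF(bq_ingested_timestamp, timestamp, MINUTE)) AS max_lag_minutes
FROM `my-project.posthog.events`
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY event_date
ORDER BY event_date;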

I would love to anonymise IP address (or exclude it)

We'd prefer if we weren't storing PII in our BigQuery dataset. I think most of it is excluded by virtue of only using user_id and other properties, but I note that the IP address is included in the events.

It'd be awesome to have a feature where the IP address is either anonymised (as a hash perhaps?) or removed.

I think a hash would be backwards compatible (for storing), as the schema is just a string?
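
As a rough illustration of the backwards-compatibility point: a SHA-256 hash (whether computed in the plugin before export or in BigQuery afterwards) is still just a string, so it would fit the existing column type:

-- Illustration only: hashing an IP still produces a plain STRING value.
SELECT TO_HEX(SHA256('203.0.113.42')) AS hashed_ip;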

Route events to different BigQuery tables

Feature request from a customer:

Right now all the events just go to one table, but that can become a bit of a bottleneck - it would be cool to be able to define a filter of some sort and route events to multiple output tables in BQ.

We are starting to have many different types of events, and they are all landing in one biggish table in BQ, which is becoming a little messy on the ETL side - it would be great if I could just define that some events go to table A in BQ and some go to table B, etc.
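
Until something like that exists, one possible workaround on the BigQuery side (hypothetical names, and assuming the export table has an event column) is to layer per-event-type views over the single export table:

-- Hypothetical workaround: a view that splits out one event type.
CREATE OR REPLACE VIEW `my-project.posthog.pageview_events` AS
SELECT *
FROM `my-project.posthog.events`
WHERE event = '$pageview';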

Expected Behaviour for Existing Tables without defined schema

Problem
=> We added the BQ plugin to our PostHog instance and tried to load data into an existing table called user_events.
=> First, I did not add any schema at all, as the docs don't say one is required. However, the plugin then throws an error and fails:
setupPlugin failed with Error: Can not get metadata for table. Please check if the table schema is defined

However, looking at the code I do not think this is expected behaviour:

if (!metadata.schema || !metadata.schema.fields) {
    throw new Error('Can not get metadata for table. Please check if the table schema is defined.')
}

Assumption
Why wouldn't it just proceed to the field-matching step here, as opposed to throwing an error? I would expect the plugin to treat this just like any other schema difference (since there is no schema, it differs from the desired PostHog schema) and therefore create all the required columns. Is there a particular reason the plugin throws in that case?

Right now, I have to either add a dummy schema beforehand or let the script create the table. Having the script handle a table without a schema does not seem to work. Am I missing something?

Thanks for your explanation 🙏

Missing user properties in exported events

We're testing out this BigQuery export app but noticed that custom user properties (https://posthog.com/docs/getting-started/user-properties) are missing from the data exported to BigQuery. This is important information for us to export, as it allows us to better map our data for our internal users.

The default $set and $set_once keys and values are present, but no custom user properties are; here is an example of such custom user properties as seen on the "Events" page in PostHog:
(Screenshot from 2023-06-30: custom user properties shown on the Events page in PostHog.)

Could it be an issue in this code area, where it might be missing pulling in some other "event" property? Or is something else going on?

Thanks!
