airgrid / edgekit

Open source, privacy focused client side library for the creation and monetisation of online audiences.

Home Page: https://edgekit.org

License: MIT License

JavaScript 1.36% TypeScript 98.03% Makefile 0.60%

privacy edge programmatic prebid prebidjs advertising rtb iab online-audiences audience-definitions

edgekit's People

bausk antondanilenko dragosromaniuc naripok codywl manigandham josepowera jretza

edgekit's Issues

Bootstrap the IAB Tier 2 Interest audience taxonomy.

Summary

Prior to inviting the industry for some much-needed debate on audience definitions, we would like to bootstrap a simple taxonomy of audience definitions:

we will begin with purely interest audiences
granularity will be tier 1 + tier 2
this equates to 30 unique audiences

The initial version of the definitions will consist of:

ttl; a time to live, how long the user remains in the audience after being matched.
a look back window; how far back we should check the user's local page view history.
an occurrence number; how many times a user must have viewed a page matching the interest definition.
keywords; a set of words which describe the interest.

A simple example

Below is an audience definition which says that a user must have viewed at least 3 pages within the past 7 days, each containing at least one of the keywords, to be matched into that interest audience for the next 7 days.

const audience = {
  id: 719,
  name: 'Interest | Travel',
  ttl: 7,
  lookback: 7,
  occurrence: 3,
  keywords: ['beach', 'mojito', 'palms', 'spain', '...']
}
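The matching rule described above can be sketched as follows. This is a minimal illustration, not EdgeKit's implementation: the `KeywordAudience` and `KeywordPageView` shapes, the `matchesAudience` helper, and the reading of `lookback` as a number of days are all assumptions made for the example.

```typescript
// Hypothetical shapes; field names follow the example definition above.
interface KeywordAudience {
  id: number;
  name: string;
  ttl: number;        // days the user stays in the audience once matched
  lookback: number;   // days of page view history to inspect (assumed unit)
  occurrence: number; // minimum number of matching page views
  keywords: string[];
}

interface KeywordPageView {
  ts: number;         // Unix timestamp in milliseconds (assumed)
  keywords: string[];
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when enough page views inside the lookback window
// contain at least one of the audience keywords.
function matchesAudience(
  audience: KeywordAudience,
  pageViews: KeywordPageView[],
  now: number = Date.now()
): boolean {
  const windowStart = now - audience.lookback * DAY_MS;
  const hits = pageViews.filter(
    (pageView) =>
      pageView.ts >= windowStart &&
      pageView.keywords.some((kw) => audience.keywords.indexOf(kw) !== -1)
  );
  return hits.length >= audience.occurrence;
}
```

The `ttl` field is not used by the check itself; it would only decide how long a positive result is kept.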

IAB Taxonomy `Interest`

| Unique ID | Parent ID | Name | Tier 1 | Tier 2 |
| --- | --- | --- | --- | --- |
| 243 | 206 | Interest \| Automotive | Interest | Automotive |
| 258 | 206 | Interest \| Books and Literature | Interest | Books and Literature |
| 268 | 206 | Interest \| Business and Finance | Interest | Business and Finance |
| 338 | 206 | Interest \| Careers | Interest | Careers |
| 344 | 258 | Interest \| Education | Interest | Education |
| 348 | 206 | Interest \| Family and Relationships | Interest | Family and Relationships |
| 359 | 206 | Interest \| Fine Art | Interest | Fine Art |
| 368 | 206 | Interest \| Food & Drink | Interest | Food & Drink |
| 381 | 206 | Interest \| Health and Medical Services | Interest | Health and Medical Services |
| 406 | 206 | Interest \| Healthy Living | Interest | Healthy Living |
| 422 | 206 | Interest \| Hobbies & Interests | Interest | Hobbies & Interests |
| 457 | 206 | Interest \| Home & Garden | Interest | Home & Garden |
| 466 | 206 | Interest \| Medical Health | Interest | Medical Health |
| 467 | 206 | Interest \| Movies | Interest | Movies |
| 481 | 206 | Interest \| Music and Audio | Interest | Music and Audio |
| 522 | 206 | Interest \| News and Politics | Interest | News and Politics |
| 534 | 206 | Interest \| Personal Finance | Interest | Personal Finance |
| 541 | 206 | Interest \| Pets | Interest | Pets |
| 550 | 206 | Interest \| Pharmaceuticals, Conditions, and Symptoms | Interest | Pharmaceuticals, Conditions, and Symptoms |
| 581 | 206 | Interest \| Pop Culture | Interest | Pop Culture |
| 583 | 206 | Interest \| Real Estate | Interest | Real Estate |
| 595 | 206 | Interest \| Religion & Spirituality | Interest | Religion & Spirituality |
| 606 | 206 | Interest \| Shopping | Interest | Shopping |
| 607 | 206 | Interest \| Sports | Interest | Sports |
| 676 | 206 | Interest \| Style & Fashion | Interest | Style & Fashion |
| 687 | 206 | Interest \| Technology & Computing | Interest | Technology & Computing |
| 706 | 206 | Interest \| Television | Interest | Television |
| 719 | 206 | Interest \| Travel | Interest | Travel |
| 733 | 206 | Interest \| Video Gaming | Interest | Video Gaming |

References

IAB Data Transparency Framework
Full IAB taxonomy: IABTL-Audience-Taxonomy-1.0-5.16.19-Final.xlsx

Consider audience definition version.

Summary

The current logic for audience versions is not complete: we do not "un-match" a user if they do not conform to a bumped audience version.

Detail

if a user has matched version 1, and the most recent audience version available is 1, we should not run the check.
if a user has matched version 1, and the most recent audience version available is 2, we should run the check.
- if they do not match the new version (2), but had matched version 1, we must remove this audience from the matched set.
- if they matched version 1 and now match version 2, they remain in the audience but with the updated version.
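The rules above could be sketched like this. The `VersionedDefinition` and `MatchedRecord` shapes and the `reconcileMatches` helper are hypothetical names used only for illustration; the `check` callback stands in for whatever the engine does to evaluate one definition.

```typescript
// Hypothetical shapes, not taken from the EdgeKit source.
interface VersionedDefinition { id: string; version: number; }
interface MatchedRecord { id: string; version: number; }

function reconcileMatches(
  definitions: VersionedDefinition[],
  matched: MatchedRecord[],
  check: (definition: VersionedDefinition) => boolean
): MatchedRecord[] {
  const result: MatchedRecord[] = [];
  for (const definition of definitions) {
    const prior = matched.filter((m) => m.id === definition.id)[0];
    if (prior !== undefined && prior.version === definition.version) {
      // Already matched on the latest version: keep it, skip the check.
      result.push(prior);
      continue;
    }
    // Never matched, or matched on an older version: run the check.
    if (check(definition)) {
      result.push({ id: definition.id, version: definition.version });
    }
    // A failed check on a bumped version drops the audience (un-match).
  }
  return result;
}
```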

There is code to cache audiences in local storage, but our internal implementation does not use this mechanism, as we rely on the browser cache for GET requests. I could see this feature being useful in future, or for other users of the library, but I suggest we remove it while it is unused, as it will be trivial to re-implement.

RFC: Remove pageFeatureGetters API, in favour of direct setting of features.

Summary

EdgeKit right now allows the user to pass an Array of pageFeatureGetters which are objects containing a name and func key:

const getHtmlKeywords = {
  name: 'keywords',
  func: () => ['kw1', 'kw2'],
};

This seems to add a layer of complexity and boilerplate, as EdgeKit then simply executes these functions. If a user's function failed, they would have to debug EdgeKit's internals to find out why.

The current API adds boilerplate
Adds a layer without any real utility or abstraction
No real error checking / handling

Proposal

Expose a new method which creates a more familiar set() based API.

Rather than providing a list of functions which resolve to features, we would allow users to set features directly:

edkt.setPageFeatures('keywords', ['kw1', 'kw2']);

We could also return a pointer to the feature if for example it needed to be updated:

const pageFeaturesPointer = edkt.setPageFeatures('dwell', 20);
// some time passes
pageFeaturesPointer.update('dwell', 30);

Add internal engine `AND` matching condition capability

The EngineCondition interface carries an any attribute which is not currently used.
It could be used to add AND matching logic to the internal engine, by swapping the evaluation condition in the internal filterPageViews method.

Uniques are refreshed set for every page view.

Summary

This commit has introduced a bug: we set a user as "unique" on each page view, because we simply overwrite the store rather than use the previous merging strategy.

The expected behaviour is that a user should be a unique only the first time they are matched. Following this they are not unique until the ttl expires, at which point, if they are matched again, they can once more become a unique.
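One possible merging strategy with this behaviour is sketched below. The `MatchedAudienceRecord` shape, the `isUnique` flag and the `mergeMatchedAudiences` helper are assumptions for illustration and do not reflect the actual store code.

```typescript
// Hypothetical record shape for the matched audience store.
interface MatchedAudienceRecord {
  id: string;
  matchedAt: number; // ms timestamp of the first match
  expiresAt: number; // matchedAt + ttl, in ms
  isUnique: boolean; // true only on the run that first matched the user
}

function mergeMatchedAudiences(
  stored: MatchedAudienceRecord[],
  matchedIds: string[],
  ttlMs: number,
  now: number
): MatchedAudienceRecord[] {
  // Keep unexpired records; they were counted before, so they are no longer unique.
  const alive = stored
    .filter((record) => record.expiresAt > now)
    .map((record) => ({ ...record, isUnique: false }));
  const aliveIds = alive.map((record) => record.id);
  // Only audiences without a live record become new uniques.
  const fresh = matchedIds
    .filter((id) => aliveIds.indexOf(id) === -1)
    .map((id) => ({ id, matchedAt: now, expiresAt: now + ttlMs, isUnique: true }));
  return [...alive, ...fresh];
}
```

Merging instead of overwriting keeps the original `matchedAt` and `expiresAt`, so a repeat match inside the ttl does not reset the clock.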

Remove backward compat code from matchedAudienceStore

Summary

A change has been introduced which converts matchedAudiences from being stored as an array to an object. Some extra code has been added so that this does not break existing clients.

This issue removes that code once most clients have been updated to the new version of the code.

The code has been marked with a TODO and a link to this issue.

Add cosine similarity as a vector distance metric.

Summary

This issue extends the query filters by adding a cosine similarity metric which can be calculated between two vectors of equal length.

Detail

this can re-use the internal dot product calculation.
adds another query filter type
updates the audience definitions types / docs with this new value.
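As a reference for the metric itself, here is a minimal sketch of cosine similarity built on a dot product. The function names are illustrative, and returning 0 for a zero-length vector is a choice made for this example, not something the issue specifies.

```typescript
// Dot product of two equal-length vectors.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('vectors must be of equal length');
  }
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// cos(a, b) = (a . b) / (|a| * |b|), where |a| = sqrt(a . a).
function cosineSimilarity(a: number[], b: number[]): number {
  const norms = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
  // Assumption: treat a zero vector as dissimilar to everything.
  return norms === 0 ? 0 : dotProduct(a, b) / norms;
}
```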

Improving jest e2e testing.

Summary

Our current testing goal is to attack the exposed APIs of the library.

The power of edgekit comes into play when a user builds a page history over time, representing their own local state. We then want to run edkt against this state in different forms and test the output.

A design question here is how best to simulate the growth of a user's data points in their local state (storage).

Idea 1:

create hardcoded states, initialise the test by inserting a new state each time
run models at each state and check for the matched audiences
note: each check of matched audiences adds a new page view to the state; this could be bypassed by passing an error into the pageFeatureGetter.

Add support for local audience caching.

Summary

EdgeKit audience definitions can also arrive from the server. To lessen the burden on servers, we can cache definitions locally and just let the server know which audience definitions we have stored.

Requirements

audience definitions should contain a flag, if they are to be cached locally
edgekit would first check local cache, and send along some meta to the server
server can reply with any new audience definitions or return an empty response
the rest of the evaluation would run as normal.

Add support for vector comparisons in engine.

Summary

The current implementation of engine is designed for direct string comparisons: if('sport' === 'sport'). To expand the ways in which we are able to define audiences, we would like to extend the functionality to compare two vectors, using this as a filtering criterion within the user's local storage.

Requirements

ToDo

Proposed API

ToDo

Uncaught (in promise) Error: TCF API is missing

Summary

index.js:7 Uncaught (in promise) Error: TCF API is missing

We should handle this error properly.

Add support for lookback from audience definition.

Summary

Audience definitions have a lookback field, which stipulates how far back in a user's local data we can check to find matching page views. Currently engine does not use this field when doing its evaluation and consumes the user's full data.

Requirements

Each audience definition can have its own lookback window, which is a value in seconds; we must consider only the values which fit within this window.
We should support a value which tells engine to use the full local data (0, -1, Infinity, ... ? )
An entry in the /docs folder, we should document these features as part of the issue itself.
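A sketch of how the window could be applied before evaluation. The `withinLookback` helper is hypothetical, page view timestamps are assumed to be in seconds, and using `0` to mean "use the full local data" is just one of the candidate sentinel values listed above.

```typescript
// Keeps only page views that fall inside the lookback window.
// Assumptions: `ts`, `now` and `lookback` are all expressed in seconds,
// and a lookback of 0 means "no limit".
function withinLookback<T extends { ts: number }>(
  pageViews: T[],
  lookback: number,
  now: number
): T[] {
  if (lookback === 0) {
    return pageViews;
  }
  const cutoff = now - lookback;
  return pageViews.filter((pageView) => pageView.ts >= cutoff);
}
```

The engine would then evaluate the definition against the returned subset instead of the user's full history.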

Add support for audience versions.

Summary

Related to: #22

This issue adds support for audience definition versions, primarily aimed at dynamic audience definitions which will evolve rapidly. We would send cached audience definition versions to the server, which would decide whether they should be updated.

Requirements:

add a version field to the audience definitions
store which audience definition version a user was matched against
if the audience definition version is incremented, the user should be un-matched and checked again against the new version

Refactor & organise tests, create a testing framework.

Summary

The tests need to be extended, organised and kept consistent.

Details

ongoing POC for property based tests
testing modules.

Double-counting page views on multi-query audience definitions

edgekit/src/engine/evaluate.ts

Lines 16 to 21 in aab2d7a

const filteredPageViews = filter.queries.flatMap((query) =>
  pageViews.filter((pageView) => {
    const pageFeatures = pageView.features[query.queryProperty];
    return queryMatches(query, pageFeatures);
  })
);

This operation double counts page views on multiple query audience definitions.
It should return a single pageView instance if matched by any of the defined queries.
It should have, ideally, its own testing suite also.
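One way to return each page view at most once is to invert the loops: filter the page views, and ask whether any query matches. The sketch below uses simplified stand-in types and a stand-in `queryMatches`; the real signatures in evaluate.ts may differ.

```typescript
// Simplified stand-ins for the engine types (assumed for illustration).
interface StringQuery { queryProperty: string; queryValue: string[]; }
interface StoredPageView { ts: number; features: Record<string, string[]>; }

// Stand-in matcher: true when any page feature appears in the query values.
function queryMatches(query: StringQuery, pageFeatures: string[] | undefined): boolean {
  return (pageFeatures ?? []).some((f) => query.queryValue.indexOf(f) !== -1);
}

// Each page view is kept once if at least one query matches it.
function filterPageViews(
  queries: StringQuery[],
  pageViews: StoredPageView[]
): StoredPageView[] {
  return pageViews.filter((pageView) =>
    queries.some((query) =>
      queryMatches(query, pageView.features[query.queryProperty])
    )
  );
}
```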

Fix GH action for automated building, test & publish.

Summary

This issue adds a GH action for automated build, test & publish.

Details

on PR, build and test
on merge to master, build, test and publish to NPM

Update documentation for page feature setting & metadata

Summary

Several refactoring efforts are ongoing in edgekit. This issue tracks the major changes, to ensure we update all the necessary documentation once they are complete:

#89
GDPR module

Support audience definitions with multiple target vectors.

Summary

Currently we support audience definitions composed of a single vector. This issue extends the engine module to allow for audience definitions which may contain multiple vectors, each with their own threshold value.

Details

A (shortened) audience definition currently looks like the following:

{
  vector: [0.1, 0.1, 0.2],
  threshold: 0.8,
  occurrences: 3
}

Which is then compared to some page views stored locally:

[
  {vector: [0.1, 0.1, 0.2]},
  {vector: [0.2, 0.3, 0.2]},
  {vector: [0.4, 0.2, 0.2]},
]

We compare by calculating a distance metric between the definition vector and the page view vector.

A user matches an audience when more page views have scored above the threshold than the occurrences value.

This issue adds support for an audience definition containing multiple target vectors, each with its own threshold:

{
  targets: [
    {vector, threshold}, 
    {vector, threshold}, 
  ],
  occurrences: 3
}

The occurrences value is now global across the array of targets.
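A rough sketch of how a multi-target definition could be evaluated. Several things here are assumptions rather than the engine's behaviour: cosine similarity is used as the metric, a page view that passes several targets is counted once, and the `>=` comparison against `occurrences` is one reading of the rule (a strict `>` is the other).

```typescript
interface Target { vector: number[]; threshold: number; }
interface MultiTargetDefinition { targets: Target[]; occurrences: number; }
interface VectorPageView { vector: number[]; }

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  const dot = (x: number[], y: number[]) =>
    x.reduce((sum, value, i) => sum + value * y[i], 0);
  const norms = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return norms === 0 ? 0 : dot(a, b) / norms;
}

function matchesDefinition(
  definition: MultiTargetDefinition,
  pageViews: VectorPageView[]
): boolean {
  // A page view is a hit when it exceeds the threshold of any target.
  const hits = pageViews.filter((pageView) =>
    definition.targets.some(
      (target) => cosine(pageView.vector, target.vector) > target.threshold
    )
  );
  // The occurrences count is shared across all targets.
  return hits.length >= definition.occurrences;
}
```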

Additional

this needs thorough testing
there are a few areas which can be refactored as part of this issue;
- simplification of engine condition creator.
- translation layer from audience definition to engine condition.
- both of these parts are doing a similar job and have duplicated code which is getting a little painful to follow.

References

a good place to start is to check out the audience definitions in the tests.

Add support for logistic regression audience matching.

Summary

This issue adds support for logistic regression audience matching, by extending engine to compare vectors using a set of coefficients and a bias.
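For orientation, a logistic regression match boils down to passing a weighted sum through the sigmoid function and comparing the result to a cut-off. The sketch below assumes a hypothetical `LogisticModel` shape with a `threshold` field; neither the names nor the threshold are taken from the issue.

```typescript
// Hypothetical model shape; only coefficients and bias come from the issue text.
interface LogisticModel {
  coefficients: number[];
  bias: number;
  threshold: number; // assumed probability cut-off for a match
}

function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

// p = sigmoid(coefficients . features + bias)
function predictProbability(model: LogisticModel, features: number[]): number {
  if (features.length !== model.coefficients.length) {
    throw new Error('feature vector and coefficients must have equal length');
  }
  const z = model.coefficients.reduce(
    (sum, coefficient, i) => sum + coefficient * features[i],
    model.bias
  );
  return sigmoid(z);
}

function matchesModel(model: LogisticModel, features: number[]): boolean {
  return predictProbability(model, features) >= model.threshold;
}
```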

On the creation of a comprehensive testing framework.

Summary

This issue will outline a proposal for the creation of a comprehensive testing framework.

Here we will discuss the ideas behind the testing suite, its structure, coverage goals and technologies.

The main goal is to create a healthy environment, favouring development efficiency and velocity while maintaining robustness, simplicity and reduced onboarding effort.

Ideological basis

Ideally, a test suite should outline the desired behaviour of a program. Think of documentation that can be executed.
It should be legible and meaningful, and it should cover as broad an input/output space as possible.

Some guidelines can be used to assure such qualities:

Write tests for behaviours, not for implementation details;
- it should read as documentation and be clear about what is the expected behaviour it tests for;
- a test suite should be useful while developing new features, as well as while refactoring and optimizing existing ones;
- it should not break nor require changes while the behaviours it describes remain unchanged;
Use tests as an aid for thinking;
- writing tests along with implementation helps to outline the desired behaviour of a program while assuring its subjects are well divided and easily testable;
- writing tests before implementation and seeing them fail helps to keep false negative counts as low as possible;
Generate your tests when you can;
- generative testing helps to give you confidence that the implementation follows the expected behaviour for a broad input/output space;
- it helps developers find edge cases they had not thought of;
Test as many paths as possible;
- the unit test suite should be complete, with at least one unit test for each exposed API;
- the end-to-end test suite should cover at least the 'happy paths' of the application, using mocks to simulate the error scenarios while keeping execution time low;
Keep test execution time low;
- if the test suite becomes a burden on development, if it breaks the state of flow, developers will not use it;
- ideally, a unit test suite should take no more than 4-5 seconds to run, and it should run every time a file is saved;
- for end-to-end tests, the time requirements are less restricted as it will run less often;

Structure

Ideally, the test suite structure should be unambiguous and consistent.

In my view, things should be aggregated on a module basis.
Each module would be its own test suite, while fixtures, generators and helper methods would be extracted for reuse.
Something like the following (this is just an example):

src
|-main
|  |-index.ts
|  |-moduleFuncs.ts
|  |-moreFuncs.ts
|-someModule
|  |-index.ts
|-utils
   |-index.ts

test
|-fixtures
|  |-someMock.json
|  |-someOther.json
|-helpers
|  |-generators
|  |  |-domain.ts
|  |  |-application.ts
|  |-utils
|  |  |-setupNCleaning.ts
|  |  |-mockFactories.ts
|-unit
|  |-main
|  |  |-index.test.ts
|  |  |-moduleFuncs.test.ts
|  |  |-moreFuncs.test.ts
|  |-someModule
|  |  |-index.test.ts
|  |-utils
|    |-index.test.ts
|-e2e
   |-main
   |  |-index.test.ts
   |-someModule
   |  |-index.test.ts 
   |-snapshots
   |  |-snap.json
.
.
.

Inside each testing module, suites should be organized at the function/class level.
They should be written in a descriptive/declarative manner, specifying the desired behaviour for testing.
Ideally, the code should explain itself. If there is a need for too many comments, the test item probably can be divided into more than one, more descriptive items.

Something like:

describe('methodOne surface behaviour', () => {
  beforeAll(() => {
    setup()
  })
  afterAll(() => {
    teardown()
  })
  it('should not throw errors for valid inputs', () => {
    method.call(validInput)
  })
  it('should throw on invalid inputs', () => {
    expect(() => {
      method.call(invalidInput)
    }).toThrow(/invalid input/)
  })
  it('should return valid objects', () => {
    const rep = method.call(validInput)
    expect(rep).toEqual(expectedRep)
  })
})

describe('methodTwo surface behaviour', () => {
  ...
})

Coverage

As uncle Bob Martin would say (demand):

Every single line of code that you write should be tested

While not every target can always be achieved, we should still have a target to aim for. Mainly, the implementation should be done (when possible) in a way that makes each method easily testable, meaning deterministic and pure. There should also be as little branching as possible.

Roadmap

I suggest we proceed as follows (in order of importance):

organize the suite structure
clean current unit tests, extracting shared helpers, mocks and fixtures
write/improve tests targeting missing execution environments
substitute end-to-end testing framework
create a strategy for improving test coverage

Progress

Organize the suite structure

POCs and related stuff:

Issues with the current testing setup

Remove string array matching code paths.

Summary

We are no longer doing any matching on strings, so we should remove that code path and its tests.

Automate build, test & release.

Summary

Automate the process of build, test and release as a github action when pushing to master, with the additional constraint that the final git commit message must be of the format: release: X.X.X.

Additionally run the workflow on any new PRs to ensure that tests are passing.

Move IAB audience creation process into a separate repo.

Summary

This issue decouples the audiences from the edgekit project, to create better separation of the two concepts, as they are not directly dependent on each other.

This also allows the new repository to focus more on the ML process needed to create audience definitions in a transparent manner.

Detail

WIP

Add support for consent checking via the TCF 2.0 CMP.

Summary

We should have native support / integration for checking of consent from a TCF registered CMP.

The CMP provides a way for vendors to check whether they have consent for the processing of data under GDPR, i.e. whether a user has given access or not.

Details

There is some setup and checking that needs to be done:

is the API available?
does GDPR apply?

After this we must provide a way for edgekit to check if consent has been granted, this is done via the following function:

__tcfapi('getTCData', 2, (tcData, success) => {
  if(success) {
    // do something with tcData
  } else {
    // do something else
  }
}, [1,2,3]);

The final array passed in would be vendors for whom we are checking consent. Since this library is available to be used by multiple vendors this part should be configurable on the init stage of edgekit.

Once edgekit has checked if it has consent we need to either execute our logic, or not based on the response.

Nothing should happen before we are certain that we have consent.
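The checks above could be wrapped in a small helper such as the sketch below. The `checkConsent` function and the `TCDataLike` and `TcfApi` types are hypothetical; only `gdprApplies` and `vendor.consents` from the TCF 2.0 `TCData` object are used. Treating "GDPR does not apply" as permission to run, and a missing API as no consent, are assumptions a real integration would need to confirm.

```typescript
// Minimal subset of the TCData object used by this sketch.
interface TCDataLike {
  gdprApplies?: boolean;
  vendor: { consents: Record<number, boolean> };
}

type TcfApi = (
  command: 'getTCData',
  version: 2,
  callback: (tcData: TCDataLike, success: boolean) => void,
  vendorIds?: number[]
) => void;

function checkConsent(
  tcfapi: TcfApi | undefined,
  vendorIds: number[],
  onResult: (granted: boolean) => void
): void {
  // Is the API available? If not, report "no consent" instead of throwing.
  if (typeof tcfapi !== 'function') {
    onResult(false);
    return;
  }
  tcfapi('getTCData', 2, (tcData, success) => {
    if (!success) {
      onResult(false);
      return;
    }
    // Does GDPR apply? Assumption: if it does not, consent is not required.
    if (tcData.gdprApplies === false) {
      onResult(true);
      return;
    }
    // Every configured vendor must have consent.
    onResult(vendorIds.every((id) => tcData.vendor.consents[id] === true));
  }, vendorIds);
}
```

In a browser the first argument would be the global `__tcfapi` function, and the vendor list would come from the edgekit init configuration. Passing the API in as a parameter keeps the helper testable without a CMP on the page.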

Refs

Full API doc.
Vendor list
Guardian, running TCF 2.0

Skip computation on previously matched audiences

As of now, every run of edgekit's matching engine checks every passed pageView against every passed audienceDefinition.

// edkt.run :: (PageView[], AudienceDefinition[]) -> MatchedAudiences[]
const pageViews = viewStore.getCopyOfPageViews();
// this iterates over all the pageView/audienceDefinition instances
const matchedAudiences = engine.getMatchingAudiences(
  audienceDefinitions,
  pageViews
);
matchedAudienceStore.setMatchedAudiences(matchedAudiences);

We would want to skip audienceDefinitions that were previously matched and have not changed, to save computation.
Something along the lines of:

const pageViews = viewStore.getCopyOfPageViews();
const previouslyMatched = matchedAudienceStore.getCopyOfMatchedAudiences();

// this will remove prev matched / not changed audience definitions
const audienceDefinitionCandidates = getAudienceDefinitionCandidates(
  audienceDefinitions,
  previouslyMatched
);

// this will only iterate over new audienceDefinitions / old ones with a bumped version
const matchedAudiences = engine.getMatchingAudiences(
  audienceDefinitionCandidates,
  pageViews
);

matchedAudienceStore.appendMatchedAudiences(matchedAudiences); // append, do not overwrite
matchedAudienceStore.removeOldMatchedAudienceDefinitions(); // we have to clean the old items in some way
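The candidate selection step might look like the following. `getAudienceDefinitionCandidates` does not exist in the library yet, so the signature and the `CandidateDefinition` and `StoredMatch` shapes are guesses made for illustration.

```typescript
// Hypothetical shapes; the real types may carry more fields.
interface CandidateDefinition { id: string; version: number; }
interface StoredMatch { id: string; version: number; }

// Keeps definitions the user has not matched yet, plus definitions
// whose version differs from the one the user was matched on.
function getAudienceDefinitionCandidates(
  definitions: CandidateDefinition[],
  matched: StoredMatch[]
): CandidateDefinition[] {
  return definitions.filter(
    (definition) =>
      !matched.some(
        (m) => m.id === definition.id && m.version === definition.version
      )
  );
}
```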

Decouple edgekit from the audience definitions.

Summary

Right now the audience definitions form an internal component of the library. There is no way to select/de-select audiences, or to completely avoid the pre-created definitions and insert your own custom definitions.

Requirements

decouple audience definitions
support tree shaking for consumer builds & bundles (unused audiences should not be in bundle)
allow library user to:
- pull in individual audience definitions
- pass none of the predefined audiences

Proposed API

Consumer wishes to use the full taxonomy packaged with EdgeKit:

import { edkt, audienceDefinitions } from '@airgrid/edgekit';

const edktConfig = {
  audienceDefinitions,
  pageFeatureGetters,
};

edkt.run(edktConfig);

Proof-of-concept for implementation of property-based testing using fast-check

Summary

This issue proposes the implementation of property-based testing and specification using fast-check.
Property-based testing is a type of generative testing (fuzzing) used to assure properties about the behaviour of the implementation's API.
It differs from example-based testing in the sense that, while example-based testing assures that the implementation can handle certain hand-tailored examples without failing, property-based testing assures that the implementation behaves as desired for (supposedly) any input it could take.

From fast-check documentation:

A property can be summarized by: for any (a, b, ...) such as precondition(a, b, ...) holds, property(a, b, ...) is true.
Property-based testing frameworks try to discover inputs (a, b, ...) causing property(a, b, ...) to be false. If they find such, they have to reduce the counterexample to a minimal counterexample.

Advantages

Generative testing has several advantages over manual test writing, e.g.:

It drastically reduces the occurrence of runtime errors, as it provides wider path coverage by testing against possible inputs the developers had not thought of;
It favours and stimulates code reusability (the generators are usually composed of smaller ones and shared);
It increases development velocity, as thinking about properties is faster than trying to think about enough relevant test inputs;
It forces the test suite to be kept up to date (as it has much better case coverage and will readily show errors in implementations that are out of spec);
It serves as a good base for implementation optimization and refactoring of code. If you know you have some code functioning as it should, you can compare implementations throughout the input space, assuring the outputs match;
It shifts the mentality of the developer from 'trying to come up with cases where the implementation breaks' to 'specifying the desired behaviour of the API';

Here is what it looks like when testing one of our methods:

import * as fc from 'fast-check';
import { evaluateCondition } from '../../src/engine/evaluate';
import {
  EngineCondition,
  QueryFilterComparisonType,
  CosineSimilarityFilter,
  PageView,
} from '../../types';

// Implementation of the generators mimicking
// the type definitions.
// These generators would be shared throughout
// the test suite. Write once; use everywhere.

// NumberArray
const vectorGen = fc.array(fc.float(), 128, 128)

// VectorQueryValue
const queryValueGen = fc.record({
  vector: vectorGen,
  threshold: fc.float()
})

// AudienceQueryDefinition
const queryArrayGen = fc.array(
  fc.record({
    featureVersion: fc.integer(),
    queryProperty: fc.constant('topicDist'),
    queryFilterComparisonType:
      fc.constant(QueryFilterComparisonType.COSINE_SIMILARITY),
    queryValue: queryValueGen
  })
)

// EngineCondition<AudienceDefinitionFilter>
const cosineSimilarityConditionGen  =
  fc.record({
  filter: fc.record({
    any: fc.boolean(),
    queries: queryArrayGen,
  }),
  rules: fc.array(
    fc.record({
      reducer: fc.record({
        name: fc.constant('count')
      }),
      matcher: fc.record({
        name: fc.constant('ge'),
        args: fc.constant(1),
      })
    })
  )
})

// PageView[]
const pageViewsGen = fc.array(
  fc.record({
    ts: fc.integer(),
    features: fc.record({
      topicDist: fc.record({
        version: fc.integer(),
        value: vectorGen,
      })
    })
  })
)

describe('Engine evaluateCondition method behaviour', () => {
  // fc.assert is wrapped in a test case so that jest reports it as a test
  it('returns a boolean and never matches on empty input', () => {
    fc.assert(
      fc.property(
        cosineSimilarityConditionGen,
        pageViewsGen,
        (condition, pageViews) => {
          const result = evaluateCondition(
            condition as EngineCondition<CosineSimilarityFilter>,
            pageViews as PageView[]
          )

          // should always return a boolean value and not throw errors
          expect(typeof result).toBe('boolean')

          // the checks below should pass, but the actual implementation
          // uses rules.every, so whenever the rules array is empty
          // evaluateCondition will return true
          //
          // this is a good case for demonstrating the
          // runner's shrinking capabilities anyway.

          // if there are no pageViews to match, don't match at all
          if (pageViews.length === 0) {
            expect(result).toBe(false)
          }

          // if there are no queries, don't match at all
          if (condition.filter.queries.length === 0) {
            expect(result).toBe(false)
          }

          // TODO specify remaining desired behaviours
        }
      )
    );
  });
});

The implementation of the specs should not interfere with the current testing suite, and it can be adopted gradually.
However, it would at some point render the manual case checking nearly useless, at which point the latter could be discarded.

If the presented generator syntax does not please, JSON syntax can also be used with the help of json-schema-fast-check.

Disadvantages

Incurs a shift in thinking pattern and can be problematic for new developers to grok, at first.

References

Remove multi-package setup, moving to a mono-package.

Summary

The current project is set up using lerna to manage multiple packages, which are pulled into a core package designed for consumers to use. The idea here was to:

create clear boundaries between various modules
allow code re-use in other repositories

However, it seems this adds an unnecessary development overhead in the following areas:

publishing
testing
tooling setup
refactor

We have therefore decided to switch back to a mono-package to increase development speed in the early stages of the project and make it more accessible for other JS/TS devs to contribute.
