airgrid / edgekit
Open source, privacy-focused, client-side library for the creation and monetisation of online audiences.
Home Page: https://edgekit.org
License: MIT License
Before inviting the industry to some much-needed debate on audience definitions, we would like to bootstrap a simple taxonomy of audience definitions:
interest audiences
tier 1 + tier 2
The initial version of the definitions will consist of:
The audience definition below says that a user must have viewed at least 3 pages within the past 7 days, each containing at least one of the keywords, to be matched into that interest-based audience for the next 7 days.
const audience = {
id: 719,
name: 'Interest | Travel',
ttl: 7,
lookback: 7,
occurrence: 3,
keywords: ['beach', 'mojito', 'palms', 'spain', '...']
}
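As a minimal sketch of how such a definition could be evaluated, the function below counts the page views inside the lookback window that share at least one keyword with the audience. The `KeywordPageView` shape, the `matchesKeywordAudience` name, and the reading of `ttl` and `lookback` as days are assumptions made for illustration; they are not taken from the library.

```typescript
// Hypothetical shapes, for illustration only.
interface KeywordAudience {
  id: number;
  name: string;
  ttl: number; // assumed unit: days a match stays valid
  lookback: number; // assumed unit: days of history to inspect
  occurrence: number; // minimum number of matching page views
  keywords: string[];
}

interface KeywordPageView {
  ts: number; // Unix timestamp in milliseconds (assumption)
  keywords: string[];
}

const DAY_MS = 24 * 60 * 60 * 1000;

// True when at least `occurrence` page views inside the lookback window
// contain at least one of the audience keywords.
function matchesKeywordAudience(
  audience: KeywordAudience,
  pageViews: KeywordPageView[],
  now: number
): boolean {
  const windowStart = now - audience.lookback * DAY_MS;
  const hits = pageViews.filter(
    (pv) =>
      pv.ts >= windowStart &&
      pv.keywords.some((kw) => audience.keywords.includes(kw))
  );
  return hits.length >= audience.occurrence;
}
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to test.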
Interest
Unique ID | Parent ID | Name | Tier 1 | Tier 2 |
---|---|---|---|---|
243 | 206 | Interest \| Automotive | Interest | Automotive |
258 | 206 | Interest \| Books and Literature | Interest | Books and Literature |
268 | 206 | Interest \| Business and Finance | Interest | Business and Finance |
338 | 206 | Interest \| Careers | Interest | Careers |
344 | 258 | Interest \| Education | Interest | Education |
348 | 206 | Interest \| Family and Relationships | Interest | Family and Relationships |
359 | 206 | Interest \| Fine Art | Interest | Fine Art |
368 | 206 | Interest \| Food & Drink | Interest | Food & Drink |
381 | 206 | Interest \| Health and Medical Services | Interest | Health and Medical Services |
406 | 206 | Interest \| Healthy Living | Interest | Healthy Living |
422 | 206 | Interest \| Hobbies & Interests | Interest | Hobbies & Interests |
457 | 206 | Interest \| Home & Garden | Interest | Home & Garden |
466 | 206 | Interest \| Medical Health | Interest | Medical Health |
467 | 206 | Interest \| Movies | Interest | Movies |
481 | 206 | Interest \| Music and Audio | Interest | Music and Audio |
522 | 206 | Interest \| News and Politics | Interest | News and Politics |
534 | 206 | Interest \| Personal Finance | Interest | Personal Finance |
541 | 206 | Interest \| Pets | Interest | Pets |
550 | 206 | Interest \| Pharmaceuticals, Conditions, and Symptoms | Interest | Pharmaceuticals, Conditions, and Symptoms |
581 | 206 | Interest \| Pop Culture | Interest | Pop Culture |
583 | 206 | Interest \| Real Estate | Interest | Real Estate |
595 | 206 | Interest \| Religion & Spirituality | Interest | Religion & Spirituality |
606 | 206 | Interest \| Shopping | Interest | Shopping |
607 | 206 | Interest \| Sports | Interest | Sports |
676 | 206 | Interest \| Style & Fashion | Interest | Style & Fashion |
687 | 206 | Interest \| Technology & Computing | Interest | Technology & Computing |
706 | 206 | Interest \| Television | Interest | Television |
719 | 206 | Interest \| Travel | Interest | Travel |
733 | 206 | Interest \| Video Gaming | Interest | Video Gaming |
The current logic for audience versions is not complete: we do not "un-match" a user if they do not conform to a bumped audience version.
There is code to cache audiences in local storage, but our internal implementation does not utilise this mechanism, as we use the browser cache for GET requests. Although I could see this feature being useful in future or for other users of the library, I suggest we remove it while it is unused, as it will be trivial to re-implement.
EdgeKit right now allows the user to pass an Array of pageFeatureGetters, which are objects containing a name and a func key:
const getHtmlKeywords = {
name: 'keywords',
// returns the features for the current page
func: () => ['kw1', 'kw2'],
};
This seems to add a layer of complexity and boilerplate, as EdgeKit then simply executes these functions. For someone wanting to use the project, a failure in their function would mean having to debug the internals.
Expose a new method which creates a more familiar set()-based API.
Rather than providing a list of functions which resolve to features, we would allow users to set features directly:
edkt.setPageFeatures('keywords', ['kw1', 'kw2']);
We could also return a pointer to the feature if for example it needed to be updated:
const pageFeaturesPointer = edkt.setPageFeatures('dwell', 20);
// some time passes
pageFeaturesPointer.update('dwell', 30);
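One possible shape for this proposal, sketched with an in-memory store. The `PageFeatureStore` class, the `FeatureValue` type, and the `get` accessor are hypothetical names introduced here; only `setPageFeatures` and `update` come from the proposal above.

```typescript
// Hypothetical types for illustration; not part of the current library.
type FeatureValue = string | number | string[] | number[];

interface PageFeaturePointer {
  update(name: string, value: FeatureValue): void;
}

class PageFeatureStore {
  private features = new Map<string, FeatureValue>();

  // Sets a feature directly and returns a handle for later updates.
  setPageFeatures(name: string, value: FeatureValue): PageFeaturePointer {
    this.features.set(name, value);
    return {
      update: (n, v) => {
        this.features.set(n, v);
      },
    };
  }

  // Reads the current value of a feature, if any.
  get(name: string): FeatureValue | undefined {
    return this.features.get(name);
  }
}

const store = new PageFeatureStore();
store.setPageFeatures('keywords', ['kw1', 'kw2']);
const pointer = store.setPageFeatures('dwell', 20);
// some time passes
pointer.update('dwell', 30);
```

Because the caller computes the value before handing it over, any failure happens in the caller's own code rather than inside the library.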
The EngineCondition interface carries an any attribute which is not currently being used.
It can be used to add internal engine matching AND logic by swapping the evaluation condition in the internal filterPageViews method.
This commit has introduced a bug which means that we set a user as "unique" on each page view, as we simply overwrite the store rather than use the previous merging strategy.
The expected behaviour is that a user should be unique only the first time they are matched. Following this they are not unique until the ttl expires, at which point, if they are matched again, they can once more become a unique.
There has been a change introduced which converts matchedAudiences from being stored as an array to an object; this means some extra code has been added so that the change does not break clients.
This issue removes that code once most clients have been updated by accessing the new version of the code.
The code has been marked with a TODO and a link to this issue.
This issue extends the query filters by adding a cosine similarity metric which can be calculated between two vectors of equal length.
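For reference, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. A minimal sketch follows; the function names are illustrative, and returning 0 for a zero vector is a choice made here, since the metric is undefined in that case.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must be of equal length');
  }
  const normA = Math.sqrt(dot(a, a));
  const normB = Math.sqrt(dot(b, b));
  // Undefined for zero vectors; this sketch returns 0 in that case.
  if (normA === 0 || normB === 0) return 0;
  return dot(a, b) / (normA * normB);
}
```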
Our current testing goal is to attack the exposed APIs of the library.
The power of edgekit comes into play where a user builds a page history over time, representing their own local state; we then want to run edkt against this state in different forms and test the output.
A design question here is how best to simulate the growth of a user's data points in their local state (storage).
error into the pageFeatureGetter.
EdgeKit audience definitions can also arrive from the server. To lessen the burden on servers, we can cache definitions locally and just let the server know which audience definitions we have stored locally.
The current implementation of engine is designed for direct string comparisons: if('sport' === 'sport'). To expand the ways in which we are able to define audiences, we would like to extend the functionality to compare two vectors, using this as a filtering criterion within the user's local storage.
ToDo
ToDo
index.js:7 Uncaught (in promise) Error: TCF API is missing
We should handle this error properly.
Audience definitions have a lookback field, which stipulates how far back in a user's local data we can check to find matching page views. Currently engine does not use this field when doing its evaluation and consumes the user's full data.
lookback window which is a value in seconds, we must consider only the values which fit within this window.
0, -1, Infinity, ... ? )
/docs folder, we should document these features as part of the issue itself.
Related to: #22
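A minimal sketch of applying a lookback window before evaluation, assuming page view timestamps are stored in seconds. The `filterByLookback` name and the `TimedPageView` shape are hypothetical. The sketch deliberately does not settle how special values such as 0 or -1 should behave; Infinity simply keeps every page view up to `nowSeconds`.

```typescript
// Hypothetical shape: only the timestamp matters for this sketch.
interface TimedPageView {
  ts: number; // assumed unit: seconds since the Unix epoch
}

// Keeps the page views that fall inside [now - lookback, now].
function filterByLookback<T extends TimedPageView>(
  pageViews: T[],
  lookbackSeconds: number,
  nowSeconds: number
): T[] {
  const windowStart = nowSeconds - lookbackSeconds;
  return pageViews.filter(
    (pv) => pv.ts >= windowStart && pv.ts <= nowSeconds
  );
}
```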
This issue adds support for audience definition versions, which is primarily aimed at dynamic audience definitions that will evolve rapidly. We would send cached audience definition versions to the server, where a decision would be made on whether they should be updated.
The tests need to be extended, organised and kept consistent.
edgekit/src/engine/evaluate.ts
Lines 16 to 21 in aab2d7a
This operation double counts page views on multiple query audience definitions.
It should return a single pageView instance if matched by any of the defined queries.
It should have, ideally, its own testing suite also.
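The sketch below illustrates one way double counting can arise and how a single pass avoids it. It is not the code from evaluate.ts; the `SimplePageView` shape and both function names are invented for the illustration.

```typescript
// Hypothetical shapes for illustration.
interface SimplePageView {
  ts: number;
  topic: string;
}
type PageViewQuery = (pv: SimplePageView) => boolean;

// Filtering once per query and concatenating the results counts a
// page view once for every query it satisfies.
function filterPerQuery(
  pageViews: SimplePageView[],
  queries: PageViewQuery[]
): SimplePageView[] {
  const out: SimplePageView[] = [];
  for (const query of queries) {
    out.push(...pageViews.filter(query));
  }
  return out;
}

// A single pass keeps each page view at most once, as soon as any
// query matches it.
function filterAnyQuery(
  pageViews: SimplePageView[],
  queries: PageViewQuery[]
): SimplePageView[] {
  return pageViews.filter((pv) => queries.some((query) => query(pv)));
}
```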
This issue adds a GH action for automated build, test & publish.
There are various parts of refactoring ongoing in edgekit; this issue tracks the major changes to ensure we update all the documentation needed once it is complete:
Currently we support audience definitions composed of a single vector; this issue extends the engine module to allow for audience definitions which may contain multiple vectors, each with their own threshold value.
A (shortened) audience definition currently looks like the following:
{
vector: [0.1, 0.1, 0.2],
threshold: 0.8,
occurrences: 3
}
Which is then compared to some page views stored locally:
[
{vector: [0.1, 0.1, 0.2]},
{vector: [0.2, 0.3, 0.2]},
{vector: [0.4, 0.2, 0.2]},
]
We compare by calculating a distance metric between the definition vector and the page view vector.
A user matches an audience when more page views have been above the threshold than the occurrences value.
This issue adds support for an audience definition containing multiple target vectors, each with its own threshold:
{
targets: [
{vector, threshold},
{vector, threshold},
],
occurrences: 3
}
The occurrences value is now global across the array of targets.
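A rough sketch of how matching against multiple targets could work, under stated assumptions: cosine similarity is used as the metric, a page view counts once if it scores above the threshold of any target, and the audience matches when the count reaches the occurrences value. The type and function names are hypothetical, and the exact comparison operators are a guess.

```typescript
// Hypothetical shapes mirroring the definition above.
interface Target {
  vector: number[];
  threshold: number;
}
interface MultiTargetAudience {
  targets: Target[];
  occurrences: number;
}
interface VectorPageView {
  vector: number[];
}

// Cosine similarity, assumed here as the comparison metric.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

function matchesMultiTargetAudience(
  audience: MultiTargetAudience,
  pageViews: VectorPageView[]
): boolean {
  // Each page view is counted at most once, whichever target it hits.
  const hits = pageViews.filter((pv) =>
    audience.targets.some(
      (target) => cosine(target.vector, pv.vector) > target.threshold
    )
  );
  // Assumption: reaching the occurrences value is enough to match.
  return hits.length >= audience.occurrences;
}
```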
This issue adds support for logistic regression audience matching, by extending engine to compare vectors using a set of coefficients and a bias.
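As a sketch of what this comparison involves: a logistic regression model scores a page view vector by taking the weighted sum of its components plus the bias and passing the result through the sigmoid function. The `LogisticModel` shape and the idea of comparing the probability against a threshold are assumptions for illustration.

```typescript
// Hypothetical model shape; the threshold field is an assumption.
interface LogisticModel {
  coefficients: number[];
  bias: number;
  threshold: number;
}

function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

// Probability that the vector belongs to the audience.
function predictProbability(model: LogisticModel, vector: number[]): number {
  if (vector.length !== model.coefficients.length) {
    throw new Error('Vector and coefficients must be of equal length');
  }
  const z = vector.reduce(
    (sum, x, i) => sum + x * model.coefficients[i],
    model.bias
  );
  return sigmoid(z);
}

// A page view counts as a hit when its probability exceeds the threshold.
function isHit(model: LogisticModel, vector: number[]): boolean {
  return predictProbability(model, vector) > model.threshold;
}
```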
This issue will outline a proposal for the creation of a comprehensive testing framework.
Here we will discuss the ideas behind the testing suite, its structure, coverage goals and technologies.
The main goal is to create a healthy environment, favouring development efficiency and velocity while maintaining robustness, simplicity and reduced onboarding effort.
Ideally, a test suite should outline the desired behaviour of a program. Think of documentation that can be executed.
It should be legible and meaningful, and it should cover as much of the input/output space as possible.
Some guidelines can be used to assure such qualities:
Ideally, the test suite structure should be unambiguous and consistent.
In my view, things should be aggregated on a module basis.
Each module would be its own test suite, while fixtures, generators and helper methods would be extracted for reuse.
Something like the following (this is just an example):
src
|-main
| |-index.ts
| |-moduleFuncs.ts
| |-moreFuncs.ts
|-someModule
| |-index.ts
|-utils
|-index.ts
test
|-fixtures
| |-someMock.json
| |-someOther.json
|-helpers
| |-generators
| | |-domain.ts
| | |-application.ts
| |-utils
| | |-setupNCleaning.ts
| | |-mockFactories.ts
|-unit
| |-main
| | |-index.test.ts
| | |-moduleFuncs.test.ts
| | |-moreFuncs.test.ts
| |-someModule
| | |-index.test.ts
| |-utils
| |-index.test.ts
|-e2e
|-main
| |-index.test.ts
|-someModule
| |-index.test.ts
|-snapshots
| |-snap.json
.
.
.
Inside each testing module, suites should be organized at the function/class level.
They should be written in a descriptive/declarative manner, specifying the desired behaviour for testing.
Ideally, the code should explain itself. If there is a need for too many comments, the test item probably can be divided into more than one, more descriptive items.
Something like:
describe('methodOne surface behaviour', () => {
beforeAll(() => {
setup()
})
afterAll(() => {
teardown()
})
it('should not throw errors for valid inputs', () => {
method.call(validInput)
})
it('should throw on invalid inputs', () => {
expect(() => {
method.call(invalidInput)
}).toThrow(/invalid input/)
})
it('should return valid objects', () => {
const rep = method.call(validInput)
expect(rep).toEqual(expectedRep)
})
})
describe('methodTwo surface behaviour', () => {
...
})
As uncle Bob Martin would say (demand):
Every single line of code that you write should be tested
While not every target can always be achieved, we should still have a target to aim for. Mainly, the implementation should be done (when possible) in a way that makes each method easily testable, meaning deterministic and pure. There should also be as little branching as possible.
I suggest we proceed as follows (in order of importance):
POCs and related stuff:
Issues with the current testing setup
We are no longer doing any matching on strings, so we should remove that code path and its tests.
Automate the process of build, test and release as a GitHub action when pushing to master, with the additional constraint that the final git commit message must be of the format: release: X.X.X.
Additionally run the workflow on any new PRs to ensure that tests are passing.
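The commit message constraint could be checked with a small helper such as the one below. This is only a sketch: it assumes that each X stands for one or more digits and that only the first line of the message is inspected, and the `parseReleaseVersion` name is made up for the example.

```typescript
// Assumption: "X.X.X" means three dot-separated groups of digits.
const RELEASE_PATTERN = /^release: (\d+)\.(\d+)\.(\d+)$/;

// Returns the version string for a release commit, or null otherwise.
function parseReleaseVersion(commitMessage: string): string | null {
  const firstLine = commitMessage.split('\n')[0].trim();
  const match = RELEASE_PATTERN.exec(firstLine);
  return match ? `${match[1]}.${match[2]}.${match[3]}` : null;
}
```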
This issue decouples the audiences from the edgekit project to create better separation of the two concepts, as they are not directly dependent on each other.
This also allows the new repository to focus more on the ML process needed to create audience definitions in a transparent manner.
WIP
We should have native support / integration for checking consent from a TCF-registered CMP.
The CMP basically provides a way for vendors to check if they have consent for the processing of data under GDPR, i.e. whether a user has given consent or not.
There is some setup and checking that needs to be done:
After this we must provide a way for edgekit to check if consent has been granted, this is done via the following function:
__tcfapi('getTCData', 2, (tcData, success) => {
if(success) {
// do something with tcData
} else {
// do something else
}
}, [1,2,3]);
The final array passed in lists the vendors for whom we are checking consent. Since this library is available to be used by multiple vendors, this part should be configurable at the init stage of edgekit.
Once edgekit has checked if it has consent we need to either execute our logic, or not based on the response.
Nothing should happen before we are certain that we have consent.
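A minimal sketch of a consent gate built around this call. The `checkConsent` function, the `TCDataLike` type (a small subset of the TCData object) and the decision to treat a missing `__tcfapi` as "no consent" instead of throwing are assumptions made here, not existing edgekit behaviour.

```typescript
// Subset of the TCData object used by this sketch.
interface TCDataLike {
  vendor: { consents: Record<number, boolean> };
}

type TcfApi = (
  command: 'getTCData',
  version: 2,
  callback: (tcData: TCDataLike, success: boolean) => void,
  vendorIds?: number[]
) => void;

// Calls onResult(true) only when every configured vendor has consent.
function checkConsent(
  tcfApi: TcfApi | undefined,
  vendorIds: number[],
  onResult: (hasConsent: boolean) => void
): void {
  if (typeof tcfApi !== 'function') {
    // No CMP on the page: report "no consent" rather than throwing.
    onResult(false);
    return;
  }
  tcfApi(
    'getTCData',
    2,
    (tcData, success) => {
      const granted =
        success &&
        vendorIds.every((id) => tcData.vendor.consents[id] === true);
      onResult(granted);
    },
    vendorIds
  );
}
```

In a browser the first argument would be `window.__tcfapi`, and the callback would decide whether the matching logic runs at all. Note that an empty vendor list counts as granted in this sketch, which may not be the desired default.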
As of now, every run of edgekit's matching engine will check for matching pageView/audienceDefinitions for every passed item.
// edkt.run :: (PageView[], AudienceDefinition[]) -> MatchedAudiences[]
const pageViews = viewStore.getCopyOfPageViews();
// this iterates over all the pageView/audienceDefinition instances
const matchedAudiences = engine.getMatchingAudiences(
audienceDefinitions,
pageViews
);
matchedAudienceStore.setMatchedAudiences(matchedAudiences);
We would want to skip audienceDefinitions that were previously matched and not changed, to save computation.
Something along the lines of:
const pageViews = viewStore.getCopyOfPageViews();
const prevMatchedAudiences = matchedAudienceStore.getCopyOfMatchedAudiences();
// this will remove prev matched / not changed audience definitions
const audienceDefinitionCandidates = getAudienceDefinitionCandidates(
audienceDefinitions,
prevMatchedAudiences
);
// this will only iterate over new audienceDefinitions / old ones with a bumped version
const newMatchedAudiences = engine.getMatchingAudiences(
audienceDefinitionCandidates,
pageViews
);
matchedAudienceStore.appendMatchedAudiences(newMatchedAudiences); // append, do not overwrite
matchedAudienceStore.removeOldMatchedAudienceDefinitions(); // we have to clean the old items in some way
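The candidate selection step could look roughly like the sketch below. The `VersionedDefinition` and `MatchedAudienceRecord` shapes are assumptions; the only requirement is that both carry an id and a version that can be compared.

```typescript
// Hypothetical minimal shapes for illustration.
interface VersionedDefinition {
  id: string;
  version: number;
}
interface MatchedAudienceRecord {
  id: string;
  version: number;
}

// Keeps definitions that were never matched, or whose version differs
// from the version recorded at match time.
function getAudienceDefinitionCandidates<T extends VersionedDefinition>(
  definitions: T[],
  matched: MatchedAudienceRecord[]
): T[] {
  const matchedVersions = new Map<string, number>();
  for (const record of matched) {
    matchedVersions.set(record.id, record.version);
  }
  return definitions.filter(
    (definition) => matchedVersions.get(definition.id) !== definition.version
  );
}
```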
Right now the audience definitions form an internal component of the library; there is no way to select/de-select audiences, or to completely avoid the pre-created definitions and insert your own custom definitions.
Consumer wishes to use the full taxonomy packaged with EdgeKit:
import { edkt, audienceDefinitions } from '@airgrid/edgekit';
const edktConfig = {
audienceDefinitions,
pageFeatureGetters,
};
edkt.run(edktConfig);
This issue proposes the implementation of property-based testing and specification using fast-check.
Property-based testing is a type of generative testing (fuzzing) used to assure properties about the behaviour of the implementation's API.
It differs from example-based testing in the sense that, while example-based testing assures that the implementation can handle certain hand-tailored examples without failing, property-based testing assures that the implementation behaves as desired for (supposedly) any input it could take.
From fast-check documentation:
A property can be summarized by: for any (a, b, ...) such as precondition(a, b, ...) holds, property(a, b, ...) is true.
Property-based testing frameworks try to discover inputs (a, b, ...) causing property(a, b, ...) to be false. If they find such, they have to reduce the counterexample to a minimal counterexample.
Generative testing has several advantages over manual test writing, e.g.:
Here is what it looks like when testing one of our methods:
import * as fc from 'fast-check';
import { evaluateCondition } from '../../src/engine/evaluate';
import {
EngineCondition,
QueryFilterComparisonType,
CosineSimilarityFilter,
PageView,
} from '../../types';
// Implementation of the generators mimicking
// the type definitions.
// This generators would be shared throughout
// the test suite. Write once; use everywhere.
// NumberArray
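// fixed length of 128 (constraints-object form, fast-check 2.x and later)
const vectorGen = fc.array(fc.float(), { minLength: 128, maxLength: 128 })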
// VectorQueryValue
const queryValueGen = fc.record({
vector: vectorGen,
threshold: fc.float()
})
// AudienceQueryDefinition
const queryArrayGen = fc.array(
fc.record({
featureVersion: fc.integer(),
queryProperty: fc.constant('topicDist'),
queryFilterComparisonType:
fc.constant(QueryFilterComparisonType.COSINE_SIMILARITY),
queryValue: queryValueGen
})
)
// EngineCondition<AudienceDefinitionFilter>
const cosineSimilarityConditionGen =
fc.record({
filter: fc.record({
any: fc.boolean(),
queries: queryArrayGen,
}),
rules: fc.array(
fc.record({
reducer: fc.record({
name: fc.constant('count')
}),
matcher: fc.record({
name: fc.constant('ge'),
args: fc.constant(1),
})
})
)
})
// PageView[]
const pageViewsGen = fc.array(
fc.record({
ts: fc.integer(),
features: fc.record({
topicDist: fc.record({
version: fc.integer(),
value: vectorGen,
})
})
})
)
describe('Engine evaluateCondition method behaviour', () => {
// Jest needs at least one test inside a describe block
it('returns a boolean and does not match on empty input', () => {
fc.assert(
fc.property(
cosineSimilarityConditionGen,
pageViewsGen,
(
condition,
pageViews,
) => {
const result = evaluateCondition(
condition as EngineCondition<CosineSimilarityFilter>,
pageViews as PageView[]
)
// should always return a boolean value and not throw errors
expect(typeof result).toBe('boolean')
// the below should pass, but in the actual implementation
// using rules.every, whenever the rule
// array is empty, the evaluateCondition
// will return true
//
// this is a good case for demonstrating the
// runner's shrinking capabilities anyways.
// if there are no pageViews to match, don't match at all
if (pageViews.length === 0) {
expect(result).toBe(false)
}
// if there are no queries, don't match at all
if (condition.filter.queries.length === 0) {
expect(result).toBe(false)
}
// TODO specify remaining desired behaviours
})
);
});
});
The implementation of the specs should not interfere with the current testing suite, and it can be gradually adopted.
However, it would, at some point, render the manual case checking nearly useless, at which point it could be discarded.
If the presented generator syntax does not appeal, JSON syntax can also be used with the help of json-schema-fast-check.
The current project is set up using lerna to manage multiple packages which are pulled into a core package designed for consumers to utilise. The idea here was to:
However, it seems this adds an unnecessary development overhead in the following areas:
We have therefore decided to switch back to a mono-package to increase development speed in the early stages of the project and make it more accessible for other JS/TS devs to contribute.