oDL

Open Downloads (oDL) is built on the simple idea that podcast download numbers should be consistent and transparent across hosting companies, networks, and analytics providers.

oDL is an open source package that contains a simple spec, blacklists, and code to prepare log files and count podcast downloads in a scalable way.

oDL's goal is to move the podcast industry forward collectively by introducing a layer of trust between podcasters, providers, and advertisers by removing the black box approach and replacing it with an open system to count and verify download numbers.

Quickstart

oDL is in the comment phase of development. For now, you'll need to clone and install from source.

$ git clone git@github.com:open-downloads/odl.git && cd odl
$ virtualenv .
$ source bin/activate
$ python setup.py install
$ ipython

> from odl import prepare, pipeline
> prepare.run('path/to/events-input.csv', 'path/to/events.odl.avro')
> pipeline.run('path/to/events.odl.avro', 'path/to/events-output')
Running the oDL pipeline from path/to/events.odl.avro to path/to/events-output

oDL run complete.

Downloads: 13751

path/to/events-output/count.txt
path/to/events-output/hourly.csv
path/to/events-output/episodes.csv
path/to/events-output/apps.csv

Overview

oDL is a fundamentally new paradigm in podcasting, so we wanted to explain a little more about where oDL fits in the ecosystem.

Spec

An oDL Download meets the following criteria:

  • It's an HTTP GET request
  • The IP Address is not on the blacklist
  • The User Agent is not on the blacklist
  • It can't be a streaming request for the first 2 bytes (i.e. Range: bytes=0-1)
  • The User Agent/IP combination counts once per day (i.e. fixed 24 hour period starting at midnight UTC)

That's it. It's intentionally simple to reduce confusion for all involved.
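A minimal sketch of these criteria in Python (ip_blacklisted, ua_blacklisted, and seen_today are hypothetical stand-ins for the package's blacklist and deduplication logic; field names match the avro schema in the Prepare section):

def is_odl_download(event, ip_blacklisted, ua_blacklisted, seen_today):
    # It's an HTTP GET request.
    if event['http_method'] != 'GET':
        return False
    # Neither the IP nor the User Agent is on the blacklist.
    if ip_blacklisted(event['encoded_ip']) or ua_blacklisted(event['user_agent']):
        return False
    # Not a streaming probe for the first 2 bytes (Range: bytes=0-1).
    if event['byte_range_start'] == 0 and event['byte_range_end'] == 1:
        return False
    # The User Agent/IP combination counts once per UTC day.
    return not seen_today(event['encoded_ip'], event['user_agent'], event['timestamp'])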

Dealing with bots

Podcasts get many downloads from bots, servers, and things that just aren't human. We need to avoid counting these, but if everyone maintains their own blacklist, they can't count the same way.

oDL uses common, publicly available lists to decide which User Agents and IP Addresses should be blacklisted.
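For a single request, a blacklist check looks something like this (a sketch; blacklist.is_blacklisted is the helper exercised by the unit test quoted in the IPv6 issue below, and the result depends on the current lists):

> from odl import blacklist
> blacklist.is_blacklisted('203.0.113.7')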

Out of Scope

oDL focuses on accuracy in cross-provider download numbers. Analytics prefixes and hosting providers have access to different sets of data, so we take the intersection of this information. All counts are based on these seven data points (a sample event follows the list):

  • IP Address
  • User Agent
  • HTTP Method
  • Timestamp
  • Episode Identifier (ID/Enclosure URL)
  • Byte Range Start
  • Byte Range End
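For illustration, a single event reduced to these seven data points might look like this (all values invented):

event = {
    'encoded_ip': '58ecdbd64d2fa9844d29557a35955a58',  # salted hash, never the raw IP
    'user_agent': 'AppleCoreMedia/1.0.0.16G77 (iPhone; U; CPU OS 12_4 like Mac OS X)',
    'http_method': 'GET',
    'timestamp': '2019-08-01T12:34:56Z',  # RFC 3339, see the Prepare section
    'episode_id': 'https://example.com/audio/ep-42.mp3',
    'byte_range_start': 0,
    'byte_range_end': None,  # no Range header end
}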

oDL does not take into account bytes coming off the server or even percentage of episode streamed. These numbers tend to conflate listening behavior with downloads and only podcast players can reliably report listening behavior.

Our goal is to simply weed out requests that did not make a good faith effort to put an audio file on a podcast player.

Isn't this just IAB v2.0?

No, similar goals, but different tactics.

The IAB v2.0 spec is great, but it relies on wording around "Best Practices." We believe that a spec shouldn't have best practices. Two hosting providers, both of which are IAB v2.0 certified, could have up to a 10% difference in download counts. This creates confusion for publishers, podcasters, and advertisers alike as to whose number is "correct".

IAB v2.0 is also expensive: certification can cost a hosting provider up to $45k. Competition is important, and this hurdle creates an undue burden on smaller companies.

oDL takes a transparent, open source approach. By using common blacklists of IPs and User Agents we make the industry as a whole more accurate.

oDL and IAB v2.0 are not mutually exclusive. A provider may decide to do both.

Advanced Analytics

Many hosting providers offer advanced analytics and we are all about it. oDL is not meant to reduce innovation in the podcast analytics space. Simplecast is fingerprinting downloads, Backtracks has its own approach, and many others have interesting ideas about how to use download data.

Advanced analytics methodologies are not consistent across hosting providers. Some providers use a whitelist to allow more downloads from IPs shared by many users, or use shorter attribution windows. These methods, while taken in good faith, inflate download numbers relative to providers that take a stricter approach.

Ad-supported podcasters can make more or less money depending on which hosting provider they choose.

Our hope is that an oDL number sits beside the advanced analytics number; it may be higher or lower, but the podcaster knows it is consistent with their peers.

I'm a Hosting Provider and I want to support oDL.

oDL is a self-certifying spec, meaning that there is no formal process to become certified. The only requirement is that you let podcasters download their raw logs in the odl.avro format.

Users can then verify that the numbers reported in your dashboard are the same as oDL's. Our goal is to eventually add a hosted verification service at https://odl.dev.

Code

A spec isn't much without code to run it all. oDL ships with a full implementation for counting downloads against server logs. It's built using Apache Beam to scale from counting thousands of downloads on your laptop to millions using a data processing engine like Apache Spark or Google Cloud Dataflow.

Prepare

The first step in running oDL is to prepare the data for the download counting job. oDL uses the following avro schema for raw events:

Note: timestamps should be strings in RFC 3339 format

[{
    "name": "encoded_ip",
    "type": "string"
}, {
    "name": "user_agent",
    "type": "string"
}, {
    "name": "http_method",
    "type": "string"
}, {
    "name": "timestamp",
    "type": "string"
}, {
    "name": "episode_id",
    "type": "string"
}, {
    "name": "byte_range_start",
    "type": ["int", "null"]
}, {
    "name": "byte_range_end",
    "type": ["int", "null"]
}]
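Per the note above, one way to produce an RFC 3339 timestamp string in Python 3:

from datetime import datetime, timezone

# Yields e.g. '2019-08-01T12:34:56.789012+00:00', which is valid RFC 3339.
datetime.now(timezone.utc).isoformat()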

We ask that you encode the IP before creating the odl.avro file. We provide a helper to salt and hash the IP.

> from odl.prepare import get_ip_encoder
> encode = get_ip_encoder()
> encode('1.2.3.4')
'58ecdbd64d2fa9844d29557a35955a58'
> encode('1.2.3.5')
'1bcdcb404b16e046f3a13fc5563853d3'
> encode('1.2.3.5')
'1bcdcb404b16e046f3a13fc5563853d3'

By default get_ip_encoder uses a random salt, so hashes are only consistent within a single encoder. If you need to run oDL over multiple files, pass your own salt.

> encode = get_ip_encoder(salt="this is a super secret key")
> encode('1.2.3.4')

To actually write a file, you can use the following.

> from odl import prepare
> prepare.run('./path/to/log-events.json', './path/to/events.odl.avro', format="json")

prepare.run will raise an error if it can't create the output file.

Pipeline

The odl package uses Apache Beam under the hood to work on logfiles at any scale. On your local machine it looks like this:


> from odl import pipeline
> pipeline.run('path/to/events.odl.avro', 'path/to/events-output')

If you would like to run this through Google Cloud Dataflow on a large dataset, you would use the following:

Note: This costs money, since it's using Google Cloud Platform.

> pipeline.run('gs://events-bucket-name/events*',
  'gs://output-bucket-name/events-output',
  {'runner': 'DataflowRunner', 'project': 'gc-org-name'})

Closing

It's still early in the ball game, but we hope open, transparent counting will improve podcasting for all.


Issues

Install error on Linux - file does not exist

On a clean install, following the documentation, the command python setup.py install returns the following error:

installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/scripts-2.7
file '/home/username/wodl27/odl/odl/bin/odl.py' does not exist.

Although it is trivial to fix, this is still an error which requires a manual fix (mv ./odl/bin/odl ./odl/bin/odl.py).

Producing the AVRO schema with AWS Athena

This looks like a great project - thanks so much for it.

If you host on Cloudfront + Amazon S3, the simplest way to get logs out is to use AWS Athena to query your log files. Here's where I've got to so far with producing AVRO schema output using an AWS Athena SQL query...

SELECT lower(to_hex(md5(to_utf8(requestip)))) AS encoded_ip,
       useragent AS user_agent,
       method AS http_method,
       'timestamp-todo' AS timestamp,
       uri AS episode_id,
       *
FROM cloudfront_logs
WHERE uri LIKE '/audio/%' AND method='GET'
LIMIT 10

  1. It would be helpful in your spec to identify the format of your timestamp, please (AWS Athena gives you a date and a time, which will need some work).

  2. byte_range_start and byte_range_end are going to be interesting to extract, since these aren't in the logfile data. I can see whether it's a 200 or 206 response, and the total number of bytes transferred.

Would you be able to clarify what this means:

It can't be a streaming request for the first 2 bytes (i.e. Range: bytes=0-1)

...it would be good to see if I can do that filtering this end.
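On the timestamp point: CloudFront log date and time columns are UTC, so one hedged approach is to stitch them together after export (to_rfc3339 is a hypothetical helper, not part of oDL):

def to_rfc3339(date_str, time_str):
    # CloudFront logs e.g. date='2019-08-01', time='12:34:56', both UTC.
    return f'{date_str}T{time_str}Z'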

Once per 24 hours or once per UTC Day

I wanted to check and make sure I understand this wording:

The User Agent/IP combination counts once per day (i.e. fixed 24 hour period starting at midnight UTC)

So if someone starts streaming an episode at 23:59 UTC and completes streaming at 00:05 UTC the next day, this would count as 2 downloads?
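Under the spec's fixed-UTC-day wording, those two requests land in different day buckets, as a quick sketch shows:

from datetime import datetime

# Both timestamps parsed as UTC; date() is the deduplication bucket.
start = datetime.fromisoformat('2019-08-01T23:59:00+00:00').date()  # 2019-08-01
end = datetime.fromisoformat('2019-08-02T00:05:00+00:00').date()    # 2019-08-02
assert start != end  # separate buckets, so each request can count once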

Excluding bytes=0-1 requests

The readme says...

It can't be a streaming request for the first 2 bytes (i.e. Range: bytes=0-1)

Doesn't this discount a lot of genuine Apple Podcasts plays, given a certain server setup?

Say your podcast host receives a request for example.com/foo.mp3, then responds with a 302 redirect to a second server that actually hosts the file (eg a CDN).

If you try to stream an mp3 in Apple Podcasts (at least on iOS 12.4), the podcasts app sends a single Range: bytes=0-1 to example.com, then follows the 302 redirect to the CDN. Any subsequent requests are sent directly to the CDN.

In order to count such plays, you'd need access to the CDN logs. Is that expected? It would also seem to prevent tracking-prefix services (Chartable et al) from being able to provide accurate oDL numbers.

No IPv6 support - crash when query blacklist for an IPv6

Currently, querying the blacklist for an IPv6 address causes a failed assert in the patricia trie (and a crash). For example, if we add this line to the blacklist unit test:

assert not blacklist.is_blacklisted("2600::84")

we get the following error:

(odl) username@dev-XPS-15-9550:~/wodl27/odl/tests$ python -m unittest blacklist
python: patricia.c:605: patricia_search_best2: Assertion `prefix->bitlen <= patricia->maxbits' failed.
Aborted (core dumped)

In odl/blacklist.py, for IPv6, PyTricia should be initialized with a 128-bit prefix length: pytricia.PyTricia(128) instead of pytricia.PyTricia(). Alternatively, two databases could be used: db and dbv6.
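A minimal sketch of the two-database suggestion, assuming only the PyTricia API named above:

import pytricia

# Separate tries sized for each address family.
db = pytricia.PyTricia(32)     # IPv4
dbv6 = pytricia.PyTricia(128)  # IPv6

db['203.0.113.0/24'] = True    # hypothetical blacklisted ranges
dbv6['2600::/32'] = True

def is_blacklisted(ip):
    # Route the lookup to the trie matching the address family.
    return ip in (dbv6 if ':' in ip else db)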
