featurebase-examples's People

Contributors

ch7ck, graska, gthrone25, kordless


Forkers

graska

featurebase-examples's Issues

Cloud example for ingesting data with bulk inserts

Now that we have bulk insert capabilities in both the product and the cloud offering, we need an example that shows how to use Python to pull from something like Kafka and insert into FB using the new endpoints.

  • some simple graphing example using a standalone Flask app
  • examples of inserts using Python, with or without Kafka

Meroxa + FeatureBase Example

Meroxa is a code-first data application platform that enables developers to build and deploy data products quickly and easily. Meroxa's platform is designed to maximize the time developers spend building data products and to minimize the time spent maintaining fragile data systems.

FeatureBase is a real-time analytical database built on bitmaps. It is open source, in-memory, and provides SQL support, real-time updates, and analytical processing for your growing data.

With a Meroxa example, FeatureBase can index a wide variety of information for analytical processing on the data.

This example should illustrate how to sign up for both Meroxa's and FeatureBase's cloud offerings and deploy the provided code to demonstrate using both systems to ingest a moderate amount of data, likely stored in Kafka and S3.

Written in Python. Shows some tables and graphs, using the Snow example's dashboard.

Docker Container Example for running FeatureBase + Kafka

A Docker container is required for running the FeatureBase binary. This container should also run Kafka, Zookeeper, and the Kafka consumer, as seen here: https://github.com/FeatureBaseDB/featurebase-examples/tree/main/kafka-starter

  • create a Dockerfile that downloads and runs the current version of FeatureBase
  • Dockerfile also downloads and installs a version of Kafka
  • Dockerfile starts Zookeeper and Kafka, and exposes a port for sending data to Kafka as well as port 10101 for the FeatureBase UI
  • Dockerfile starts the Kafka consumer using a schema file, which may be uploaded

To enable a configurable schema file, create a simple Flask endpoint that accepts a JSON file via POST, saves it to the container, and then restarts the consumer. The consumer should start with a sample schema JSON (as seen in the kafka-starter example) and display the current schema. Here are the endpoints and sample outputs:

/schema [GET]
[
{
"name": "user_id",
"path": ["user_id"],
"type": "string"
},
{
"name": "name",
"path": ["name"],
"type": "string"
},
{
"name": "age",
"path": ["age"],
"type": "id"
}
]

/schema [POST] (validates JSON)
{"response": "OK"}

Invalid JSON:
{"response": "FAILED"}

The Flask endpoint should be exposed on port 20202.

A sample Python file that sends its schema to the port and then submits data via the Kafka libs is also needed. A README.md is to be written and posted to the /docker-example directory in this repo.
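The two endpoints above could be sketched as a minimal Flask app. The schema file path, the default schema, and the consumer-restart hook are assumptions for illustration, not the final implementation:

```python
import json
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical location for the uploaded schema inside the container.
SCHEMA_PATH = "schema.json"

# Sample starting schema, as in the kafka-starter example.
DEFAULT_SCHEMA = [
    {"name": "user_id", "path": ["user_id"], "type": "string"},
    {"name": "name", "path": ["name"], "type": "string"},
    {"name": "age", "path": ["age"], "type": "id"},
]

@app.route("/schema", methods=["GET"])
def get_schema():
    # Display the current schema, falling back to the sample.
    try:
        with open(SCHEMA_PATH) as f:
            return jsonify(json.load(f))
    except FileNotFoundError:
        return jsonify(DEFAULT_SCHEMA)

@app.route("/schema", methods=["POST"])
def post_schema():
    # Validate the JSON before saving it.
    try:
        schema = json.loads(request.get_data())
    except json.JSONDecodeError:
        return jsonify({"response": "FAILED"})
    with open(SCHEMA_PATH, "w") as f:
        json.dump(schema, f)
    # restart_consumer()  # hypothetical hook to restart the Kafka consumer
    return jsonify({"response": "OK"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=20202)
```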

Add a docker compose service for the csv consumer

The Docker examples, docker-simple and docker-cluster, both need a way to spin up a container, copy a CSV into it, and then run the consumer, per the instructions:

idk/molecula-consumer-csv \
--auto-generate \
--index=allyourbase \
--files=sample.csv

Once the container starts, in that docker compose context, the container should have access to the docker naming scheme and docker network.

Implement this in both examples and ensure the consumer can ingest the sample.csv file, which should be included in the example. All code should go under the respective example directories.

Weaviate + FeatureBase Mashup

This example would show how to integrate FeatureBase into a Weaviate project.

  • determine how to accelerate Weaviate searches using a lookup in FB
  • determine a data set that may be vectorized (embeddings from GPT-3, for example) as well as indexed by FB for a speed increase
  • determine which types of data sets are likely to be larger for this project
  • establish other integrations, such as with Solr or other indexing technologies

Multiplayer Matchmaking Example

Overview

Matchmaking in games is primarily a matter of segmenting a large population down to a small, well-fit set, sometimes as small as a single player for 1v1 competitive games.

Given a scope of potentially billions of players that must be narrowed to a set of fewer than 100, this should be a solid example for FeatureBase.

Scope

  1. Generate a large dataset of players with traditional matchmaking data points
  2. Build a simple API server that will take a player's id and return a set of potential matches
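The matching step could be sketched in plain Python against an in-memory pool; a real example would have FeatureBase do the segmentation. The Player fields, skill window, and region values here are hypothetical:

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    player_id: int
    skill: int   # e.g. an Elo-style rating
    region: str

# Hypothetical synthetic population; the real example would generate
# a much larger dataset and store it in FeatureBase.
random.seed(7)
PLAYERS = [
    Player(i, random.randint(800, 2400), random.choice(["na", "eu", "apac"]))
    for i in range(10_000)
]

def find_matches(player, pool, max_results=100, skill_window=50):
    """Segment the pool down to a small, well-fit candidate set."""
    candidates = [
        p for p in pool
        if p.region == player.region
        and abs(p.skill - player.skill) <= skill_window
        and p.player_id != player.player_id
    ]
    return candidates[:max_results]

me = PLAYERS[0]
matches = find_matches(me, PLAYERS)
```

The API server of step 2 would wrap `find_matches` behind an endpoint keyed on a player's id.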

AI Analytics Exploration

Implement an example that shows off running a model on the analytics data that is pulled out of FeatureBase.

  • needs a suitable dataset for ingestion
  • needs a model that can provide or query data (this will be done in the vision example)

Analytics Example

FeatureBase is an analytics engine, but we have yet to store our own metrics in our software. The intent of this issue is to create a project that provides generalized tracking of metrics from a variety of sources. These include, but are not limited to:

  • joins to the Discord server
  • repo stars
  • repo clones/other traffic
  • website visits
  • conversions to cloud signup
  • docker spinups (of the binary)
  • ingestion of data into cloud
  • other funnel activities (twitter mentions, etc.)

We'll need graphs, over time, so a few requirements include:

  • timeline charts
  • funnel view
  • permanent deployment to a server we run in cloud
  • use of cloud or standalone binary for storage of metrics
  • push endpoints /api/metrics/ POST support
  • query UI for arbitrary queries
  • timeline selector or time range limiting
  • ability to quickly add new metrics for those not managing the server

Please add comments or requirements to this ticket as needed.

Initially this work will be done in the featurebase-examples repo, but will be spun out once the bulk of the work is complete. At that time, the new repo can carry the requirements and issues for continued work on the project.

Schema synthesis using GPT-3

This project aims to implement basic schema synthesis for FeatureBase from sample data.

Use of the example would provide a run-once activity that:

  • Collects data from Kafka
  • Creates a schema to insert into FeatureBase
  • Ingests data from Kafka into that schema
  • Visualizes the data

Some notes about schema updates or changes would be needed.

Snow: A dashboard for querying and getting graphs of data.

This example would illustrate the use of querying FeatureBase for data and then displaying the data in a wide variety of different visualized graphs, using chart.js or other JS based graphing libraries.

[dashboard screenshot]

Uses a call to GPT-3 to synthesize the SQL.

Uses a simple Python Flask file to display the dashboard and handle calls to FB and GPT-3.

CRON-like trigger and ML pipeline

Evaluate existing technologies and choose a Python pipeline for triggering searches to FB and then running the resulting data against a model.

Additionally, take inferencing from a model and put the results in a pipeline.

High volume data synthesis + querying demo

This prototype (and others like it that are available) would be built from existing demos authored by engineering.

  • Jaffee's demo of high volume data and querying
  • Garrett's demo which graphs data ingested by FB

Large public data set and modeling inference ingestion

Various large public data sets exist on Kaggle, etc. which may be used with ML pipelines.

  • determine which data sets would produce large numbers of entries
  • add logging to pipelines to log the data to FeatureBase
  • use a visualizer example with the data and include queries that can be run to produce graphs

LLM conversational query to SQL3 framework

Build a framework that takes a conversational query and sends it to GPT-3's APIs. Use that query against an existing FB data set.

  1. provide a means to ingest a moderate amount of data into FB
  2. use the schema of the data in FB to synthesize a query from a conversational query entered by the user.
  3. output the resulting query and the results from FB

This issue is a prerequisite for #2.
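The query-synthesis step (2 above) could be sketched as a prompt builder; the schema shape and prompt wording here are assumptions. The resulting string would be sent to the GPT-3 completions API, and the returned SQL then run against FB:

```python
def build_sql_prompt(schema, question):
    """Assemble a prompt asking the model to emit a single SQL query.

    `schema` maps table names to column lists; in a real implementation
    it would be pulled from FeatureBase's schema rather than hand-written.
    """
    tables = "\n".join(
        f"CREATE TABLE {name} ({', '.join(cols)});"
        for name, cols in schema.items()
    )
    return (
        "Given the following tables:\n"
        f"{tables}\n"
        f"Write a single SQL query that answers: {question}\n"
        "SQL:"
    )

# Hypothetical table and question for illustration.
prompt = build_sql_prompt(
    {"players": ["id", "name", "age"]},
    "How many players are older than 30?",
)
```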

Base project for most examples

Using Snow and query synthesis from GPT-3, assemble a basic framework for implementing query->graph functions.

  • likely written in Python using Flask
  • basic layout and graphics for other prototypes
  • search box
  • display area (for graphs/viz and tables of data)

Create README guide for Examples

Guide should contain:

  • Installing and running FeatureBase (locally and assuming macOS and Linux)
  • Links to various examples (which live in directories)
  • Screenshots of some of the examples

Comprehensive smoke tests for the various Docker deployments.

We need more comprehensive smoke tests for the releases. Internal tests are continuing to improve, but with the complications brought by Docker deployments, we don't want to miss anything for the users.

  • build an example that deploys FB in different ways and runs heavy ingestion and querying tests
  • automate most of this so that the tests can be run by anyone at any time
  • consider using selenium to test some of the UI queries
  • explore this topic further by creating the directory example and then creating tickets for future work on it

ML Pipeline Logging and Inferencing Example

This issue replaces #14, #10, #4 and #8.

A large amount of data is desired for ingestion. Evaluate datasets which may lead to:

  1. Choosing a dataset on Kaggle that has a large amount of training data.
  2. Choosing a set of models from HuggingFace that can be run on the data set, for initial labeling.
  3. Building a Jupyter notebook that uses the labeled data + reports from FB to train a new model.
  4. Logging the training of the new model.
  5. Adding a cron-triggered process that does rollups.

Query exploration is desired for reporting by the user in a simple UI. This feature requires the graphing capability described in #2.

5B games of Set (the game)

This example inserts 5 billion game draws (initial draw) of Set, the game.

  • Generate data and insert into FB (how?)
  • Visualize the data and provide a simple UI for querying using PQL/SQL

A post has been started discussing how to model the data for insertion and querying.

Analyzing 5 Billion Games of Set with FeatureBase

FeatureBase is a real-time analytical database built on Roaring Bitmaps, which makes it well suited to running analytics on massive data sets. If you've never used FeatureBase before, you can get it running locally in about 5 minutes.

Today, we're going to take a look at using FeatureBase to simulate and analyze a very large number of Set games in real-time.

Set (the game)

Set is a card game designed by Marsha Falco in 1974 and published by Set Enterprises in 1991. The deck consists of 81 unique cards that vary in four features across three possibilities for each kind of feature: number of shapes (one, two, or three), shape (diamond, squiggle, oval), shading (solid, striped, or open), and color (red, green, or purple).

In a game of Set, the cards are shuffled and 12 cards are drawn from the top of the deck and placed on the table. Game play then commences, with players racing to identify sets in the initial deal.

In this example, we are going to focus on the initial draw only. We won't be pulling cards and dealing new ones from the remainder of the deck, in other words. We'll simulate one billion draws of twelve cards (from a full deck) and then proceed to do one billion draws of fifteen cards (adding three cards each time) and so on until we have a total of five billion game draws.
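The draws described above can be sketched in a few lines, assuming the standard 81-card deck; the feature names are for illustration, and the real example would run a billion draws per hand size and batch the inserts into FB:

```python
import itertools
import random

# The 81 cards: every combination of four features with three values each.
COUNTS   = (1, 2, 3)
SHAPES   = ("squiggle", "diamond", "oval")
SHADINGS = ("solid", "striped", "open")
COLORS   = ("green", "purple", "red")

DECK = list(itertools.product(COUNTS, SHAPES, SHADINGS, COLORS))

def draw(n, rng=random):
    """Deal n distinct cards from a freshly shuffled full deck."""
    return rng.sample(DECK, n)

# One billion draws per hand size in the real example; a handful here.
games = [draw(size) for size in (12, 15, 18, 21, 24) for _ in range(10)]
```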

Thinking in Binary

There are 1,080 unique sets possible in the game. Let's think about this for a minute by creating a couple of large binary numbers to visualize. The first number will be 81 digits long and represent a single set of the 1,080 possible sets. We'll also put the different attribute headers at the top of this number to help us figure out which card a given binary place represents.

We'll use green, purpl (truncated to fit), and red for the colors. S will represent squiggles, ♦️ diamonds, and O (ohs) ovals.

set_0

<         solid           ><         open            ><         shaded          >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< green >< purpl ><  red  >< green >< purpl ><  red  >< green >< purpl ><  red  >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O>
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
123123123123123123123123123123123123123123123123123123123123123123123123123123123
---------------------------------------------------------------------------------
100000000100000000100000000000000000000000000000000000000000000000000000000000000

In this representation, we are saying we have a solid green squiggle of count one, a solid purple squiggle of count one, and a solid red squiggle of count one. This is a set because each attribute is either all different across the three cards (the colors in this example) or all the same (shading, count, and shape).

Now let's do one where the three cards have different shading, color, count and shape:

set_1

<         solid           ><         open            ><         shaded          >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< green >< purpl ><  red  >< green >< purpl ><  red  >< green >< purpl ><  red  >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O>
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
123123123123123123123123123123123123123123123123123123123123123123123123123123123
---------------------------------------------------------------------------------
001000000000000000000000000000000000000000000000000010000000000000100000000000000

Now we'll do a sample draw of 15 cards.

draw_0

<         solid           ><         open            ><         shaded          >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< green >< purpl ><  red  >< green >< purpl ><  red  >< green >< purpl ><  red  >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O>
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
123123123123123123123123123123123123123123123123123123123123123123123123123123123
---------------------------------------------------------------------------------
100011000100000000100000010100000000000010000010000000101100000100000010010000000
---------------------------------------------------------------------------------

Here's what that draw looks like, in a real game:

[image: the 15-card draw laid out on the table]

Now we AND the two numbers:

---------------------------------------------------------------------------------
100000000100000000100000000000000000000000000000000000000000000000000000000000000
100011000100000000100000010100000000000010000010000000101100000100000010010000000
---------------------------------------------------------------------------------
100000000100000000100000000000000000000000000000000000000000000000000000000000000

Given that the result is equivalent to the set we mentioned above, we have a match. There may be other sets present on the board, but we're going to switch to using decimal numbers to represent the different cards.

Modeling

As there are 81 total cards, we're going to use 0 through 80 to represent those cards. So, a sample draw of those same 15 cards above now becomes:

[0,4,5,9,18,25,27,40,46,54,56,57,63,70,73]

As for our sample set we chose, that becomes:

[0,9,18]
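With cards as decimal IDs, the AND test above can be sketched using Python's arbitrary-precision integers, with bit i standing for card i:

```python
def to_bits(cards):
    """Pack a list of card IDs (0..80) into a single 81-bit integer."""
    mask = 0
    for card in cards:
        mask |= 1 << card
    return mask

the_set = to_bits([0, 9, 18])  # the sample set from above
the_draw = to_bits([0, 4, 5, 9, 18, 25, 27, 40, 46,
                    54, 56, 57, 63, 70, 73])  # the sample 15-card draw

# The set is on the board iff AND-ing leaves the set's bits unchanged.
match = (the_set & the_draw) == the_set
```

This is exactly the kind of bitwise operation a bitmap index performs across billions of rows at once.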

We need a list of valid sets.
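That list can be derived rather than hard-coded. Treat each card ID 0..80 as four base-3 digits (one per feature); three cards form a set exactly when, for every feature, the digit sum is divisible by 3 (all same or all different). A sketch, assuming that digit encoding:

```python
from itertools import combinations

def features(card):
    """Decode a card ID (0..80) into its four base-3 feature digits."""
    return (card % 3, card // 3 % 3, card // 9 % 3, card // 27 % 3)

def is_set(a, b, c):
    # All same or all different per feature <=> each digit sum % 3 == 0.
    return all((fa + fb + fc) % 3 == 0
               for fa, fb, fc in zip(features(a), features(b), features(c)))

# Every pair of cards completes to exactly one set, and each set is
# counted three times: C(81, 2) / 3 = 1,080 valid sets.
VALID_SETS = [trio for trio in combinations(range(81), 3) if is_set(*trio)]

def sets_in_draw(draw):
    """All valid sets present among the drawn cards."""
    return [trio for trio in combinations(sorted(draw), 3) if is_set(*trio)]
```

With this list in hand, checking billions of draws reduces to AND-ing each draw's bitmap against the 1,080 set bitmaps.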
