featurebase-examples's People

Contributors

ch7ck, graska, gthrone25, kordless


Forkers

graska

featurebase-examples's Issues

Cloud example for ingesting data with bulk inserts

Now that we have bulk insert capabilities in both the product and the cloud offering, we need an example that shows how to use Python to pull from something like Kafka and insert into FB using the new endpoints.

  • some simple graphing example using a standalone Flask app
  • examples of inserts using Python, with or without Kafka

Meroxa + FeatureBase Example

Meroxa is a code-first data application platform that enables developers to build and deploy data products quickly and easily. Meroxa's platform is designed to maximize the time developers spend building data products and to minimize the time spent maintaining fragile data systems.

FeatureBase is a real-time analytical database built on bitmaps. It is open source, in-memory, and provides SQL support, real-time updates, and analytical processing for your growing data.

With a Meroxa example, FeatureBase can index a wide variety of information for analytical processing on the data.

This example should illustrate how to sign up for both Meroxa's and FeatureBase's cloud offerings and deploy the provided code to demonstrate using both systems to ingest a moderate amount of data, likely stored in Kafka and S3.

Written in Python. Shows some tables and graphs, using the Snow example's dashboard.

Docker Container Example for running FeatureBase + Kafka

A Docker container is required for running the FeatureBase binary. This container should also run Kafka, Zookeeper, and the Kafka consumer, as seen here: https://github.com/FeatureBaseDB/featurebase-examples/tree/main/kafka-starter

  • create a Dockerfile that downloads and runs the current version of FeatureBase
  • Dockerfile also downloads and installs a version of Kafka
  • Dockerfile starts Zookeeper and Kafka, and exposes a port for sending data to Kafka as well as port 10101 for the FeatureBase UI
  • Dockerfile starts the Kafka consumer using a schema file, which may be uploaded

To enable a configurable schema file, create a simple Flask endpoint that accepts a JSON file via POST, saves it to the container, and then restarts the consumer. The consumer should start with a sample schema JSON (as seen in the kafka-starter example) and display the current schema. Here are the endpoints and sample outputs:

/schema [GET]
[
{
"name": "user_id",
"path": ["user_id"],
"type": "string"
},
{
"name": "name",
"path": ["name"],
"type": "string"
},
{
"name": "age",
"path": ["age"],
"type": "id"
}
]

/schema [POST] (validates JSON)
{"response": "OK"}

Invalid JSON:
{"response": "FAILED"}

The Flask endpoint should be exposed on port 20202.

A sample Python file that sends its schema to the port and then submits data via the Kafka libs is also needed. A README.md is to be written and posted to the /docker-example directory in this repo.
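The two endpoints above could be sketched as a minimal Flask app. The schema file path, the default schema, and the consumer-restart hook are assumptions for illustration, not the final implementation:

```python
import json
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical location for the uploaded schema inside the container.
SCHEMA_PATH = "schema.json"

# Sample starting schema, as in the kafka-starter example.
DEFAULT_SCHEMA = [
    {"name": "user_id", "path": ["user_id"], "type": "string"},
    {"name": "name", "path": ["name"], "type": "string"},
    {"name": "age", "path": ["age"], "type": "id"},
]

@app.route("/schema", methods=["GET"])
def get_schema():
    # Display the current schema, falling back to the sample.
    try:
        with open(SCHEMA_PATH) as f:
            return jsonify(json.load(f))
    except FileNotFoundError:
        return jsonify(DEFAULT_SCHEMA)

@app.route("/schema", methods=["POST"])
def post_schema():
    # Validate the JSON before saving it.
    try:
        schema = json.loads(request.get_data())
    except json.JSONDecodeError:
        return jsonify({"response": "FAILED"})
    with open(SCHEMA_PATH, "w") as f:
        json.dump(schema, f)
    # restart_consumer()  # hypothetical hook to restart the Kafka consumer
    return jsonify({"response": "OK"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=20202)
```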

Add a docker compose service for the csv consumer

The Docker examples, docker-simple and docker-cluster, both need a way to spin up a container, copy a CSV into it, and then run the consumer, per the instructions:

idk/molecula-consumer-csv \
--auto-generate \
--index=allyourbase \
--files=sample.csv

Once the container starts, in that docker compose context, the container should have access to the docker naming scheme and docker network.

Implement this in both examples and ensure the consumer can ingest the sample.csv file, which should be included in the example. All code should go under the respective example directories.

Weaviate + FeatureBase Mashup

This example would show how to integrate FeatureBase into a Weaviate project.

  • determine how to accelerate Weaviate searches using a lookup in FB
  • determine a data set that may be vectorized (embeddings from GPT-3, for example) as well as indexed by FB for a speed increase
  • determine which types of data sets are likely to be larger for this project
  • establish other integrations, such as with Solr or other indexing technologies

Multiplayer Matchmaking Example

Overview

Matchmaking in games is primarily a matter of segmenting a large population down to a small, well-fit set, sometimes as small as a single player for 1v1 competitive games.

Given a scope of potentially billions of players that must be narrowed to a set of fewer than 100, this should be a solid example for FeatureBase.

Scope

  1. Generate a large dataset of players with traditional matchmaking data points
  2. Build a simple API server that will take a player's id and return a set of potential matches
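The matching step could be sketched in plain Python against an in-memory pool; a real example would have FeatureBase do the segmentation. The Player fields, skill window, and region values here are hypothetical:

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    player_id: int
    skill: int   # e.g. an Elo-style rating
    region: str

# Hypothetical synthetic population; the real example would generate
# a much larger dataset and store it in FeatureBase.
random.seed(7)
PLAYERS = [
    Player(i, random.randint(800, 2400), random.choice(["na", "eu", "apac"]))
    for i in range(10_000)
]

def find_matches(player, pool, max_results=100, skill_window=50):
    """Segment the pool down to a small, well-fit candidate set."""
    candidates = [
        p for p in pool
        if p.region == player.region
        and abs(p.skill - player.skill) <= skill_window
        and p.player_id != player.player_id
    ]
    return candidates[:max_results]

me = PLAYERS[0]
matches = find_matches(me, PLAYERS)
```

The API server of step 2 would wrap `find_matches` behind an endpoint keyed on a player's id.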

AI Analytics Exploration

Implement an example that shows off running a model on the analytics data that is pulled out of FeatureBase.

  • needs a suitable dataset for ingestion
  • needs a model that can provide or query data (this will be done in the vision example)

Analytics Example

FeatureBase is an analytics engine, but we have yet to store our own metrics in our software. The intent of this issue is to create a project that provides generalized tracking of metrics from a variety of sources. These include, but are not limited to:

  • joins to the Discord server
  • repo stars
  • repo clones/other traffic
  • website visits
  • conversions to cloud signup
  • docker spinups (of the binary)
  • ingestion of data into cloud
  • other funnel activities (twitter mentions, etc.)

We'll need graphs, over time, so a few requirements include:

  • timeline charts
  • funnel view
  • permanent deployment to a server we run in cloud
  • use of cloud or standalone binary for storage of metrics
  • push endpoints /api/metrics/ POST support
  • query UI for arbitrary queries
  • timeline selector or time range limiting
  • ability to quickly add new metrics for those not managing the server

Please add comments or requirements to this ticket as needed.

Initially this work will be done in the featurebase-examples repo, but will be spun out once the bulk of the work is complete. At that time, the new repo can carry the requirements and issues for continued work on the project.

Schema synthesis using GPT-3

This project aims to implement basic schema synthesis for FeatureBase from sample data.

Use of the example would provide a run-once activity that:

  • Collects data from Kafka
  • Creates a schema to insert into FeatureBase
  • Ingests data from Kafka into that schema
  • Visualizes the data

Some notes about schema updates or changes would be needed.

Snow: A dashboard for querying and getting graphs of data.

This example would illustrate the use of querying FeatureBase for data and then displaying the data in a wide variety of different visualized graphs, using chart.js or other JS based graphing libraries.

[dashboard screenshot]

Uses a call to GPT-3 to synthesize the SQL.

Uses a simple Python Flask file to display the dashboard and handle calls to FB and GPT-3.

CRON-like trigger and ML pipeline

Evaluate existing technologies and choose a Python pipeline for triggering searches to FB and then running the resulting data against a model.

Additionally, take inferencing from a model and put the results in a pipeline.

High volume data synthesis + querying demo

This prototype (and others like it that are available) would be built from existing demos authored by engineering.

  • Jaffee's demo of high volume data and querying
  • Garrett's demo which graphs data ingested by FB

Large public data set and modeling inference ingestion

Various large public data sets exist on Kaggle, etc. which may be used with ML pipelines.

  • determine which data sets would produce large numbers of entries
  • add logging to pipelines to log the data to FeatureBase
  • use a visualizer example with the data and include queries that can be run to produce graphs

LLM conversational query to SQL3 framework

Build a framework that takes a conversational query and sends it to GPT-3's APIs. Use that query against an existing FB data set.

  1. provide a means to ingest a moderate amount of data into FB
  2. use the schema of the data in FB to synthesize a query from a conversational query entered by the user.
  3. output the resulting query and the results from FB

This issue is a prerequisite for #2.
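The query-synthesis step (2 above) could be sketched as a prompt builder; the schema shape and prompt wording here are assumptions. The resulting string would be sent to the GPT-3 completions API, and the returned SQL then run against FB:

```python
def build_sql_prompt(schema, question):
    """Assemble a prompt asking the model to emit a single SQL query.

    `schema` maps table names to column lists; in a real implementation
    it would be pulled from FeatureBase's schema rather than hand-written.
    """
    tables = "\n".join(
        f"CREATE TABLE {name} ({', '.join(cols)});"
        for name, cols in schema.items()
    )
    return (
        "Given the following tables:\n"
        f"{tables}\n"
        f"Write a single SQL query that answers: {question}\n"
        "SQL:"
    )

# Hypothetical table and question for illustration.
prompt = build_sql_prompt(
    {"players": ["id", "name", "age"]},
    "How many players are older than 30?",
)
```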

Base project for most examples

Using Snow and query synthesis from GPT-3, assemble a basic framework for implementing query->graph functions.

  • likely written in Python using Flask
  • basic layout and graphics for other prototypes
  • search box
  • display area (for graphs/viz and tables of data)

Create README guide for Examples

Guide should contain:

  • Installing and running FeatureBase (locally and assuming macOS and Linux)
  • Links to various examples (which live in directories)
  • Screenshots of some of the examples

Comprehensive smoke tests for the various Docker deployments.

We need more comprehensive smoke tests for the releases. Internal tests are continuing to improve, but with the complications brought by Docker deployments, we don't want to miss anything for the users.

  • build an example that deploys FB in different ways and runs heavy ingestion and querying tests
  • automate most of this so that the tests can be run by anyone at any time
  • consider using selenium to test some of the UI queries
  • explore this topic further by creating the directory example and then creating tickets for future work on it

ML Pipeline Logging and Inferencing Example

This issue replaces #14, #10, #4 and #8.

A large amount of data is desired for ingestion. Evaluate datasets which may lead to:

  1. Choosing a dataset on Kaggle that has a large amount of training data.
  2. Choosing a set of models from HuggingFace that can be run on the data set, for initial labeling.
  3. Building a Jupyter notebook that uses the labeled data + reports from FB to train a new model.
  4. Logging the training of the new model.
  5. Adding a cron-triggered process that does rollups.

Query exploration is desired for reporting by the user in a simple UI. This feature requires the graphing capability described in #2.

5B games of Set (the game)

This example inserts 5 billion game draws (initial draw) of Set, the game.

  • Generate data and insert into FB (how?)
  • Visualize the data and provide a simple UI for querying using PQL/SQL

A post has been started discussing how to model the data for insertion and querying.

Analyzing 5 Billion Games of Set with FeatureBase

FeatureBase is a real-time analytical database built on Roaring Bitmaps, which makes it well suited to running analytics on massive data sets. If you've never used FeatureBase before, you can get it running locally in about 5 minutes.

Today, we're going to take a look at using FeatureBase to simulate and analyze a very large number of Set games in real-time.

Set (the game)

Set is a card game designed by Marsha Falco in 1974 and published by Set Enterprises in 1991. The deck consists of 81 unique cards that vary in four features across three possibilities for each kind of feature: number of shapes (one, two, or three), shape (diamond, squiggle, oval), shading (solid, striped, or open), and color (red, green, or purple).

In a game of Set, the cards are shuffled and 12 cards are drawn from the top of the deck and placed on the table. Game play then commences, with players racing to identify sets in the initial deal.

In this example, we are going to focus on the initial draw only. We won't be pulling cards and dealing new ones from the remainder of the deck, in other words. We'll simulate one billion draws of twelve cards (from a full deck) and then proceed to do one billion draws of fifteen cards (adding three cards each time) and so on until we have a total of five billion game draws.
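The draws described above can be sketched in a few lines, assuming the standard 81-card deck; the feature names are for illustration, and the real example would run a billion draws per hand size and batch the inserts into FB:

```python
import itertools
import random

# The 81 cards: every combination of four features with three values each.
COUNTS   = (1, 2, 3)
SHAPES   = ("squiggle", "diamond", "oval")
SHADINGS = ("solid", "striped", "open")
COLORS   = ("green", "purple", "red")

DECK = list(itertools.product(COUNTS, SHAPES, SHADINGS, COLORS))

def draw(n, rng=random):
    """Deal n distinct cards from a freshly shuffled full deck."""
    return rng.sample(DECK, n)

# One billion draws per hand size in the real example; a handful here.
games = [draw(size) for size in (12, 15, 18, 21, 24) for _ in range(10)]
```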

Thinking in Binary

There are 1,080 unique sets possible in the game. Let's think about this for a minute by creating a couple of large binary numbers to visualize. The first number will be 81 digits long and represent a single set of the 1,080 possible sets. We'll also put the different attribute headers at the top of this number to help us figure out which card a given binary place represents.

We'll use green, purpl (truncated to fit), and red for the colors. S will represent squiggles, ♦️ diamonds, and O (ohs) ovals.

set_0

<         solid           ><         open            ><         shaded          >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< green >< purpl ><  red  >< green >< purpl ><  red  >< green >< purpl ><  red  >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O>
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
123123123123123123123123123123123123123123123123123123123123123123123123123123123
---------------------------------------------------------------------------------
100000000100000000100000000000000000000000000000000000000000000000000000000000000

In this representation, we are saying we have a solid green squiggle of count one, a solid purple squiggle of count one, and a solid red squiggle of count one. This is a set because each attribute is either all different across the three cards (the colors in this example) or all the same (shading, count, and shape).

Now let's do one where the three cards have different shading, color, count and shape:

set_1

<         solid           ><         open            ><         shaded          >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< green >< purpl ><  red  >< green >< purpl ><  red  >< green >< purpl ><  red  >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O>
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
123123123123123123123123123123123123123123123123123123123123123123123123123123123
---------------------------------------------------------------------------------
001000000000000000000000000000000000000000000000000010000000000000100000000000000

Now we'll do a sample draw of 15 cards.

draw_0

<         solid           ><         open            ><         shaded          >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
< green >< purpl ><  red  >< green >< purpl ><  red  >< green >< purpl ><  red  >
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O><S><♦️><O>
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
123123123123123123123123123123123123123123123123123123123123123123123123123123123
---------------------------------------------------------------------------------
100011000100000000100000010100000000000010000010000000101100000100000010010000000
---------------------------------------------------------------------------------

Here's what that draw looks like, in a real game:

[image: the 15-card draw laid out on the table]

Now we AND the two numbers:

---------------------------------------------------------------------------------
100000000100000000100000000000000000000000000000000000000000000000000000000000000
100011000100000000100000010100000000000010000010000000101100000100000010010000000
---------------------------------------------------------------------------------
100000000100000000100000000000000000000000000000000000000000000000000000000000000

Given that the result is equivalent to the set we mentioned above, we have a match. There may be other sets present on the board, but we're going to switch to using decimal numbers to represent the different cards.

Modeling

As there are 81 total cards, we're going to use 0 through 80 to represent those cards. So, a sample draw of those same 15 cards above now becomes:

[0,4,5,9,18,25,27,40,46,54,56,57,63,70,73]

As for our sample set we chose, that becomes:

[0,9,18]
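With cards as decimal IDs, the AND test above can be sketched using Python's arbitrary-precision integers, with bit i standing for card i:

```python
def to_bits(cards):
    """Pack a list of card IDs (0..80) into a single 81-bit integer."""
    mask = 0
    for card in cards:
        mask |= 1 << card
    return mask

the_set = to_bits([0, 9, 18])  # the sample set from above
the_draw = to_bits([0, 4, 5, 9, 18, 25, 27, 40, 46,
                    54, 56, 57, 63, 70, 73])  # the sample 15-card draw

# The set is on the board iff AND-ing leaves the set's bits unchanged.
match = (the_set & the_draw) == the_set
```

This is exactly the kind of bitwise operation a bitmap index performs across billions of rows at once.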

We need a list of valid sets.
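That list can be derived rather than hard-coded. Treat each card ID 0..80 as four base-3 digits (one per feature); three cards form a set exactly when, for every feature, the digit sum is divisible by 3 (all same or all different). A sketch, assuming that digit encoding:

```python
from itertools import combinations

def features(card):
    """Decode a card ID (0..80) into its four base-3 feature digits."""
    return (card % 3, card // 3 % 3, card // 9 % 3, card // 27 % 3)

def is_set(a, b, c):
    # All same or all different per feature <=> each digit sum % 3 == 0.
    return all((fa + fb + fc) % 3 == 0
               for fa, fb, fc in zip(features(a), features(b), features(c)))

# Every pair of cards completes to exactly one set, and each set is
# counted three times: C(81, 2) / 3 = 1,080 valid sets.
VALID_SETS = [trio for trio in combinations(range(81), 3) if is_set(*trio)]

def sets_in_draw(draw):
    """All valid sets present among the drawn cards."""
    return [trio for trio in combinations(sorted(draw), 3) if is_set(*trio)]
```

With this list in hand, checking billions of draws reduces to AND-ing each draw's bitmap against the 1,080 set bitmaps.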
