georgia-tech-db / evadb

2.6K 27.0 260.0 158.89 MB

Database system for AI-powered apps

Home Page: https://evadb.ai/docs

License: Apache License 2.0

Python 99.28% Shell 0.68% Dockerfile 0.05%

eva video-analytics serving database labeling object-detection data-analysis ai chatgpt langchain

evadb's People

jaehobang sanjanag rishabhbhardwaj priyamraut kcaras asrayousuf saiprashanth173 jacksonf1 pgluss gaurav274 jhanavi vaishnavik22 imvinod snd96 alilakda manu-tej shubham07iiit priyankmadria auag92 akhileshsiddhanti jeremyhua18 swati21 albert-hen fzc2nothing ashwin1934 xzdandy sudev warhammer0 karan-sarkar vellichorlk blackburdai anirudh58 gitter-badger suryatejreddy kratosst yxyang847 hongyuchen1030 peytonhowell jkomskis 7kevin49 eloyekunle yemmyfolayan devshreebharatia akshayrdeodhar brandodecu quanchao yt0828 mrethanhw jarulraj kaushikravichandran aezexa ishsiva aavhad1910 aryan-rajoria bashhike tracli rajveerb zeokav borunsong sashiko-345 erickkbentz wgoodall01 yichenzhang21 ashmitaraju leungyukshing americast eric-shang luo12826 shreymodi13 blenature tushar-97 dungnmaster fblgit dingguangyao yanxiang-zhou aditmohan96 marhelia tangyuan233 vivek-mandal sameer-s hedaanirudh snigdha04 geraintcjy vivinperis techno-byte hhh21u luoj1 jaytoday patrickcurl alexxx-db dakouan18 evdcush goulash1971 jolks sekmet ohmygaugh-crypto bigfathead nightincode devdoshi aidasdir

evadb's Issues

UDFs - Object Detection

Need to report ROI, mAP for Faster RCNN.
We can use specialized neural networks, since we only have 4 classes to detect.

Add support for concurrent queries

In the current setup, we cannot run multiple queries even though execute_async returns control to the user immediately. We need execute_async to return a future or an id that the user can use to specify which query they want to wait on.
This workflow can be supported with the following API changes:

.execute_async() - returns a future
.fetch_all(future) - returns all the rows
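
A minimal sketch of the proposed workflow, assuming a thread pool on the client side. The Cursor class and its _run_query placeholder are hypothetical and only illustrate the execute_async / fetch_all contract described above; they are not EVA's actual implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class Cursor:
    """Hypothetical cursor; _run_query stands in for real query execution."""

    def __init__(self, max_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _run_query(self, query: str) -> list:
        # Placeholder: a real client would send the query to the server here.
        return [(query, i) for i in range(3)]

    def execute_async(self, query: str) -> Future:
        # Return immediately with a handle the caller can wait on later.
        return self._pool.submit(self._run_query, query)

    def fetch_all(self, future: Future) -> list:
        # Block until the query behind this particular future has finished.
        return future.result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


cursor = Cursor()
f1 = cursor.execute_async("SELECT id FROM A")
f2 = cursor.execute_async("SELECT id FROM B")
rows_b = cursor.fetch_all(f2)  # results can be fetched in any order
rows_a = cursor.fetch_all(f1)
cursor.close()
```

Because each call returns its own future, the user can have several queries in flight and choose which one to wait on.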

Potential bug in FastRCNNObjectDetector

In our current code in FastRCNNObjectDetector, we have in line 101:

pred_t = [pred_score.index(x) for x in pred_score if x > self.threshold][-1]

However, for some videos it is possible that none of the values in pred_score is greater than self.threshold. The list comprehension is then empty, so indexing it with [-1] raises an IndexError and breaks the code.

Suggested fix:
while pred_t is empty:
reduce threshold by 0.05 and repeat the step
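
A minimal sketch of the suggested fix, assuming the scores are plain Python floats. The function name last_index_above and the floor parameter are hypothetical additions: the floor makes the loop terminate even when there are no scores at all. The sketch also uses enumerate instead of pred_score.index(x), which avoids returning the wrong position when two scores are equal.

```python
def last_index_above(pred_score, threshold, step=0.05, floor=0.0):
    """Return (index, threshold_used) for the last score above the threshold.

    The threshold is lowered by `step` until at least one score qualifies.
    Returns (None, threshold_used) if nothing qualifies even at `floor`.
    """
    threshold_used = threshold
    while True:
        hits = [i for i, score in enumerate(pred_score) if score > threshold_used]
        if hits:
            return hits[-1], threshold_used
        if threshold_used <= floor:
            # Nothing detected at all; the caller must handle this case.
            return None, threshold_used
        threshold_used = max(floor, threshold_used - step)


index, used = last_index_above([0.9, 0.8, 0.3], threshold=0.5)
```

Returning None instead of lowering the threshold indefinitely lets the caller treat "no object detected" as a valid result rather than an error.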

UDF - Textual Data

We can use a multi-layer perceptron for textual data.
We can use a convolutional neural network to determine the color of the bounding box.

FutureWarning: pyarrow.localfs is deprecated as of 2.0.0

Pip automatically installs the latest pyarrow library, in which some functions are deprecated.

To suppress these warnings, users can pin an older pyarrow version.

Query Template for Filters

The purpose of this issue is to discuss how filters (i.e. specialized NNs used as stand-ins for more expensive object detection models, as described in papers like NoScope and BlazeIt) should fit into EVA's queries.

Naming Convention

template -> abstract*
ex) loader_template -> abstract_loader
uadetrac_loader would be correct

Expression Tree Enhancement

Make ExpressionType Enum auto() and add DELIMITER between different expression types.
Add a few basic utility functions:
Convert the expression tree to a list representation which can be used by the optimizer to reorder predicates.
Extract predicate constant from a Comparison predicate
Simplify predicate expression using sympy
Build an expression tree from a list of predicates.
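
A minimal sketch of two of the utilities listed above: converting an expression tree to a list of predicates and rebuilding a tree from such a list. The Expr dataclass and the three enum members are hypothetical stand-ins, not EVA's actual expression classes, and only conjunctions (AND) are flattened.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ExpressionType(Enum):
    # Hypothetical subset of expression types; auto() assigns the values.
    COMPARE_EQUAL = auto()
    COMPARE_GREATER = auto()
    LOGICAL_AND = auto()


@dataclass
class Expr:
    etype: ExpressionType
    children: tuple = ()
    value: object = None  # payload for leaf predicates


def to_predicate_list(expr):
    """Flatten nested ANDs into a flat list of leaf predicates."""
    if expr.etype is ExpressionType.LOGICAL_AND:
        predicates = []
        for child in expr.children:
            predicates.extend(to_predicate_list(child))
        return predicates
    return [expr]


def from_predicate_list(predicates):
    """Rebuild a left-deep AND tree from a list of predicates."""
    if not predicates:
        return None
    tree = predicates[0]
    for predicate in predicates[1:]:
        tree = Expr(ExpressionType.LOGICAL_AND, (tree, predicate))
    return tree


a = Expr(ExpressionType.COMPARE_EQUAL, value="label = 'car'")
b = Expr(ExpressionType.COMPARE_GREATER, value="area > 0.5")
c = Expr(ExpressionType.COMPARE_EQUAL, value="color = 'red'")
tree = Expr(ExpressionType.LOGICAL_AND, (a, Expr(ExpressionType.LOGICAL_AND, (b, c))))

flat = to_predicate_list(tree)                # [a, b, c]
reordered = from_predicate_list([c, a, b])    # optimizer picks a new order
```

The list form is what an optimizer would reorder; the tree is rebuilt only after the order has been chosen.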

Docker needs to export Port, so the external client can connect to the running EVA server.

Docker needs to start EVA server and export the port. @jiashenC

Uploading and using a custom model in Eva

Right now, Eva does not support using custom-trained PyTorch models for inference.
To add this feature, we will need a few enhancements:

Add support for uploading a custom model to the server.
Add functionality for uploading the class labels supported by the model, and the mapping between the model output and the labels.
Add functionality for loading a custom model in PyTorch.

Query Optimizer - Extensibility

The optimizer needs to be able to interpret complex queries, such as ones with parentheses.
Maybe we need to use an external logic library at this point?

Connection response on EVA server

Currently, the EVA server and client do not provide explicit messages to indicate that the connection is successfully established. Add these messages to the server and the client code.

Clean up old (deprecated) code

Extraordinary memory footprint by resnet50

Processing 10 frames uses around 9 GB of GPU memory.

Different UDF output format

The UDF output formats of the current master branch, eva-reuse, and my SSD object detector are all different. We need a common format.

System metrics collecting support

One metric we should support first is latency. Collect the end-to-end execution time and a detailed latency breakdown for each component (e.g., optimizer, data access, data transformation) when executing a query.
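
A minimal sketch of per-component latency collection, assuming a context manager wrapped around each stage. The LatencyCollector class and the component names are hypothetical; the sleeps stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyCollector:
    """Accumulates wall-clock time per named component."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def measure(self, component):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.totals[component] += time.perf_counter() - start


metrics = LatencyCollector()
with metrics.measure("end_to_end"):
    with metrics.measure("optimizer"):
        time.sleep(0.01)  # stand-in for query optimization
    with metrics.measure("data_access"):
        time.sleep(0.01)  # stand-in for reading frames

for component, seconds in metrics.totals.items():
    print(f"{component}: {seconds * 1000:.1f} ms")
```

Nesting the component timers inside the end-to-end timer gives both numbers from a single run.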

Replace panda dataframe with pyspark dataframe

Motivation:

Reduce dependency and code complexity.
For some operations (e.g., join), the current implementation based on pandas DataFrames relies on the assumption that the data fits in memory.

Filters - Curve instead of static values

The confidence level needs to be adjustable by users. In order to do so, filters need to report statistics that are along a curve instead of one static value.
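
A minimal sketch of what reporting a curve could look like, assuming each filter produces a confidence score per frame and that ground-truth labels are available for evaluation. The function name filter_curve and the choice of statistics (fraction of frames dropped, recall) are illustrative assumptions, not the project's agreed metrics.

```python
def filter_curve(scores, labels, thresholds):
    """For each threshold, report (threshold, reduction_rate, recall).

    scores: filter confidence per frame; labels: 1 if the frame is a true positive.
    """
    positives = sum(labels)
    curve = []
    for threshold in thresholds:
        kept = [label for score, label in zip(scores, labels) if score >= threshold]
        reduction = 1 - len(kept) / len(scores)  # fraction of frames filtered out
        recall = sum(kept) / positives if positives else 1.0
        curve.append((threshold, reduction, recall))
    return curve


scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 1, 0, 1]
curve = filter_curve(scores, labels, thresholds=[0.1, 0.5])
```

With the whole curve available, a user-chosen confidence level can be mapped to an operating threshold instead of relying on one fixed value.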

Loader - Keypoint Detection

Need the manually labelled keypoint dict from Siddharth

download.sh link for downloading xml file is broken

In /data/ua_detrac/download.sh,

wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=12xJc8S0Z7lYaAadsi2CoSK3WqH2OkUBu' -O DETRAC-Train_Annotations-XML.zip

fails for me. It is not able to download the XML file.

Loader - Color Detection

Color detection needs more work; possibly use some other library?

Enable caching in circle ci to reduce build time

Cache before-install and install steps.
https://circleci.com/docs/2.0/language-python/#cache-dependencies

Client does not use EVA parser to parse UPLOAD

Currently, the client does not use the EVA parser for reading the UPLOAD statement. As a result, we need to manually handle syntax and semantic errors in the statement. Currently, any errors in the syntax lead to the client shutting down. We need to either allow the client to use the EVA parser or add syntax conditions within the client as a special case for the UPLOAD statement.

To reproduce:

Run the server and connect the client.
Provide an invalid command beginning with the keyword UPLOAD. For example:

UPLOAD IMFILE 'data/ua_detrac/ua_detrac.mp4' PATH 'test_video.mp4';
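
A minimal sketch of the second option (checking the syntax inside the client), so that a malformed statement produces an error message instead of shutting the client down. The assumed statement shape, UPLOAD INFILE '<local path>' PATH '<remote path>';, and the helper names are assumptions for illustration; the real grammar lives in the EVA parser.

```python
import re

# Assumed statement shape: UPLOAD INFILE '<local path>' PATH '<remote path>';
UPLOAD_RE = re.compile(
    r"^\s*UPLOAD\s+INFILE\s+'([^']+)'\s+PATH\s+'([^']+)'\s*;?\s*$",
    re.IGNORECASE,
)


def parse_upload(statement):
    """Return (local_path, remote_path), or raise ValueError on bad syntax."""
    match = UPLOAD_RE.match(statement)
    if match is None:
        raise ValueError(f"Invalid UPLOAD statement: {statement!r}")
    return match.group(1), match.group(2)


def handle_client_input(statement):
    # Report the problem to the user instead of letting the client crash.
    try:
        return ("ok", parse_upload(statement))
    except ValueError as err:
        return ("error", str(err))


bad = handle_client_input("UPLOAD IMFILE 'data/ua_detrac/ua_detrac.mp4' PATH 'test_video.mp4';")
good = handle_client_input("UPLOAD INFILE 'data/ua_detrac/ua_detrac.mp4' PATH 'test_video.mp4';")
```

Reusing the EVA parser in the client would avoid keeping two definitions of the syntax in sync, so this regex is best seen as a stopgap.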

Disallow multiple queries on the same cursor to ensure cursor correctness

We shouldn't allow concurrent select queries on the same cursor to ensure correctness.

cursor.execute('query1')
cursor.execute('query2')

The above code shouldn't be allowed. The user has to clear the cursor buffer before running the second query. Otherwise, throw the following error:
InternalError: Unread result found
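
A minimal sketch of this guard, assuming the cursor buffers the result of the last query until it is fetched. The Cursor and InternalError classes here are hypothetical stand-ins, and the fake rows replace real query execution.

```python
class InternalError(Exception):
    """Raised when a cursor is reused before its result has been read."""


class Cursor:
    def __init__(self):
        self._pending = None  # buffered result of the last query, if unread

    def execute(self, query):
        if self._pending is not None:
            raise InternalError("Unread result found")
        # Placeholder for real execution: buffer a fake result.
        self._pending = [f"row for {query}"]

    def fetch_all(self):
        # Reading the result clears the buffer and frees the cursor.
        rows, self._pending = self._pending, None
        return rows if rows is not None else []


cursor = Cursor()
cursor.execute("query1")
try:
    cursor.execute("query2")
    error_message = None
except InternalError as err:
    error_message = str(err)

rows = cursor.fetch_all()   # clears the buffer
cursor.execute("query2")    # now allowed
```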

Loader - database

Use a database such as Postgres to read and load the necessary data easily. Maybe use pickle to load the data. Currently, loading takes far too long, and we shouldn't be repeating it every time.

UDF - Speed Detector

We can use an object tracker to determine speed.

PYSPARK does not use the same version of python as the conda environment.

NDARRAY doesn't support floats

https://github.com/georgia-tech-db/eva/blob/85977ecd2d3c483273a8069124fd5bc11f243f0c/src/catalog/schema_utils.py#L62

Incompatibility of eva environment with Jupyter notebook

In the current conda environment of Eva, the db_api connect method fails (with an asyncio error) when run from a Jupyter notebook. This issue only arises in the Eva environment and not in a fresh conda environment. Possible fixes include upgrading the Python version and checking for package conflicts.

Filter - Online pp making

Yao's paper mentions a way of making PPs on the fly. Need to look into that and implement.

Documentation Bug: src/storage/__init__.py executes code

The file src/storage/__init__.py executes code when creating a storage object. This causes problems with the documentation engine as it builds itself by importing each package (from the __init__.py files). This also causes some problems when building the documentation for some of the src/executor/ files such as plan_executor.py and storage_executor as those import the storage package as well.

Currently, the documentation skips this sub-directory, but in the weekly meeting we discussed a potential quick fix: move the code out of this __init__.py file into abstract_storage_engine.py and petastorm_storage_engine.py.

Populate GPU from torch cuda API

We already check whether GPU is available through the torch API. https://github.com/georgia-tech-db/eva/blob/09e8a98ca0d80a03d6563a268a2281d26f714819/src/utils/generic_utils.py#L79

Can we simply populate the GPU list from the torch API as well, instead of asking the user for manual configuration? https://github.com/georgia-tech-db/eva/blob/09e8a98ca0d80a03d6563a268a2281d26f714819/src/executor/execution_context.py#L70

It initially gave me some trouble because the required config is hidden in the code.
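
A minimal sketch of this idea, using torch.cuda.is_available() and torch.cuda.device_count(). The function name available_gpu_ids is hypothetical, and the fallback to an empty list when torch is not installed is an assumption made so the sketch runs anywhere.

```python
def available_gpu_ids():
    """Return the list of CUDA device indices visible to torch, or [] if none."""
    try:
        import torch
    except ImportError:
        # Assumption: treat a missing torch install as "no GPUs available".
        return []
    if not torch.cuda.is_available():
        return []
    return list(range(torch.cuda.device_count()))


gpu_ids = available_gpu_ids()
print(f"Detected GPUs: {gpu_ids}")
```

A manual config entry could still be kept as an optional override, for example to restrict EVA to a subset of the detected devices.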

Docker support for GPU is pending due to unclear dependency

Reduce video loading time

The LOAD DATA command currently takes a significant amount of time to load the video into the database. Integrate optimizations from the eva-reuse project into eva to improve loading time.

h5py support for dataset

Bigger datasets might not work on ordinary machines. We need to support reading from and writing to HDF5 files through h5py.

Filters

Filters need to output a curve, not a scalar value.

High priority Queries involving Object detection to be supported

UNNEST: #143
JOIN: TBD
ARRAY_FUNCTIONs: TBD (https://www.postgresql.org/docs/8.4/functions-array.html)

 
-- GET frames with pedestrians
SELECT id, frame
FROM DETRAC
WHERE ['pedestrian'] <@ ObjDet(frame).labels;

-- GET frames with a pedestrian and a car
SELECT id, frame
FROM DETRAC
WHERE ['pedestrian', 'car'] <@ ObjDet(frame).labels;

-- GET frames with more than 5 cars
SELECT id, frame
FROM DETRAC
WHERE array_count(ObjDet(frame).labels, 'car') > 5;

-- GET frames with 2 pedestrians and 5 cars
SELECT id, frame
FROM DETRAC
WHERE array_count(ObjDet(frame).labels, 'car') = 5 
       and array_count(ObjDet(frame).labels, 'pedestrian') = 2;

-- GET frames with red cars
SELECT id, frame
FROM DETRAC, UNNEST(ObjDet(frame)) as T(label, bbox) 
WHERE label = 'car' and COLOR(frame, bbox) = 'red';

-- GET frames with a car covering more than 50% of the frame area
SELECT id, frame
FROM DETRAC, UNNEST(ObjDet(frame)) as T(label, bbox) 
WHERE label = 'car' and AREA(frame, bbox) > 0.5;

-- GET bboxes of all red cars
SELECT id, frame, bboxes
FROM DETRAC, UNNEST(ObjDet(frame)) as T(label, bbox)
WHERE label = 'car' and COLOR(frame, bbox) = 'red'
GROUP BY id;

-- GET first 100 frames with red car
SELECT id, frame
FROM DETRAC, UNNEST(ObjDet(frame)) as T(label, bbox) 
WHERE label = 'car' and COLOR(frame, bbox) = 'red'
LIMIT 100;
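
A minimal sketch of the intended semantics of the two array predicates used in the queries above, written in Python for a single frame. The helper names contains_all and array_count mirror the query syntax but are illustrative only; they are not existing EVA functions.

```python
def contains_all(wanted, labels):
    """Semantics of `wanted <@ labels`: every wanted label appears in labels."""
    return set(wanted) <= set(labels)


def array_count(labels, value):
    """Number of times `value` occurs in the label array."""
    return labels.count(value)


# Labels an object detector might return for one frame.
frame_labels = ["car", "car", "pedestrian", "bus"]

has_pedestrian = contains_all(["pedestrian"], frame_labels)
has_pedestrian_and_car = contains_all(["pedestrian", "car"], frame_labels)
has_truck = contains_all(["truck"], frame_labels)
car_count = array_count(frame_labels, "car")
```

Containment ignores duplicates, so the counting queries need array_count rather than <@.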

Command line client does not support entering one query across multiple lines

Human-readable messages from server

Right now, when the client issues a command, the server does not provide a human-readable response to the request. This affects the usability of the system. We need to add request-dependent messages to the response.

Connection established - Messages on server and client when a connection is established.
Upload - Message on the server with size and name of video being uploaded. Message on the client with the size of the video uploaded.

pytest failed with ModuleNotFoundError: No module named 'test.util'; 'test' is not a package

The pytest importing is broken.

Error I got on latest master branch.

ModuleNotFoundError: No module named 'test.util'; 'test' is not a package

The only solution I have found so far is to delete the __init__.py file in the root directory. The official pytest documentation says it is not appropriate to put an __init__.py in the project root directory.

Check return condition in the CreateExecutor

In the create_executor.py, we have

if (self.node.if_not_exists):
    # check catalog if we already have this table
    return

Due to this condition, we are not able to actually create tables for metadata. I am temporarily disabling this for my tasks, but do we actually need it? If yes, how do we handle the creation of new tables?
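
A minimal sketch of how the check could behave once the catalog lookup is implemented: return early only when the table already exists, and create it otherwise. The Catalog class and execute_create function are hypothetical and only illustrate the control flow; they are not EVA's catalog API.

```python
class Catalog:
    """Hypothetical in-memory catalog used only for this sketch."""

    def __init__(self):
        self.tables = {}

    def table_exists(self, name):
        return name in self.tables

    def create_table(self, name, columns):
        self.tables[name] = list(columns)


def execute_create(catalog, name, columns, if_not_exists):
    """Create the table; return True if it was created, False if skipped."""
    if catalog.table_exists(name):
        if if_not_exists:
            return False  # table is already there, nothing to do
        raise ValueError(f"Table {name} already exists")
    catalog.create_table(name, columns)
    return True


catalog = Catalog()
created_first = execute_create(catalog, "metadata", ["id", "label"], if_not_exists=True)
created_again = execute_create(catalog, "metadata", ["id", "label"], if_not_exists=True)
```

With this flow, IF NOT EXISTS no longer blocks the first creation of a table; it only suppresses the error on later attempts.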

Upload fails when using with python notebook

Steps to reproduce:

Checkout the tutorial branch or PR #161
Start EVA server and jupyter notebook on ada1.
Run the object_detection notebook (specifically the upload cell)
The command fails with the following error:
[Errno 13] Permission denied: '/tmp/test_video.mp4'

Optimizing data access overhead

Replace Batch.append (underlying pandas.append) with Batch.concat (underlying pandas.concat) where possible.
Reduce the data transformation between pandas.DataFrame and spark.DataFrame?

Parenthesis Support
PP matching for negatives

Merge query optimizer

Faster RCNN crash due to no object detected

Not really an urgent bug; just keeping a note here.

An index-out-of-range error occurs when the Faster RCNN model does not detect any object.