ddkang / aidb
License: Apache License 2.0
Support multi-aggregation queries
Can we turn the prints into logging and make the tests print info? This can be done in another PR, but raise an issue in that case.
Originally posted by @ddkang in #107 (comment)
I have spent a lot of time on this but am unable to figure it out. Execution does not stop at this line to wait for the insertion to complete in the case of MySQL.
I am not able to understand this logic, or maybe I am missing a design choice.
self._result_query_stub = sqlalchemy.sql.select(output_cols)
result_query_stub selects the output columns, whereas the filtering condition is on the cache columns (the cache columns are all input-binding columns).
query = self._result_query_stub.where(
  sqlalchemy.sql.and_(
    *[getattr(self._cache_table.c, self.convert_column_name(col)) == getattr(inp_row, col)
      for col in self.binding.input_columns]
  )
)
I don't know why this should work.
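A possible explanation for why the lookup works at all: SQLAlchemy derives the FROM clause from every table referenced anywhere in the statement, including the WHERE clause, so filtering on cache-table columns implicitly brings the cache table into the query. A minimal sketch (the table and column names here are made up for illustration, not AIDB's actual schema; SQLAlchemy 1.4+ `select()` signature):

```python
# Sketch: select() over output columns can still be filtered on
# cache-table columns because SQLAlchemy adds every table referenced
# in the WHERE clause to the FROM clause.
import sqlalchemy as sa

metadata = sa.MetaData()
cache = sa.Table("cache", metadata,
                 sa.Column("frame", sa.Integer))
outputs = sa.Table("outputs", metadata,
                   sa.Column("frame", sa.Integer),
                   sa.Column("label", sa.String))

stub = sa.select(outputs.c.label)       # analogue of result_query_stub
query = stub.where(cache.c.frame == 7)  # filter on the cache table

# Both tables appear in FROM (an implicit cross join unless a join
# condition relates them), which is why the filter takes effect.
print(str(query))
```

Note that without an explicit join condition this compiles to a cross join between the two tables, which may or may not be the intended semantics here.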
The use case is that the ML model is updated.
For example,
Lines 523 to 526 in 9ae2dbe
Is there a test that takes more samples than available blobs? (e.g., an error target of 0.1% should just do a full scan and return the exact answer) #64
https://github.com/ddkang/aidb/blob/main/aidb/config/config.py#L241
Check to see if this is needed
If the user runs something like CREATE TABLE, it will currently fail, since we only accept SELECT queries. We need to pass all other valid SQL through to the underlying database.
get_where_str
get_join_str
Think about good design: where should these functions live?
We need to decide whether the configuration should be serializable or stored only in the database. If it is only stored in the database, we need a way to inspect it.
Also, load the packages only on demand.
There is redundancy in the integration test code; it needs refactoring.
Just making sure: if two different tables (not related by the input/output of an inference service) have a column with the same name, are we assuming that they both refer to the same thing?
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/engine/full_scan_engine.py#L28
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/config/config_types.py#L59-L64
table.foreign_keys has the structure {col.name: fk.target_fullname}.
I think line 28 should be changed to: for fkey in table1.foreign_keys.values():
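The distinction matters because iterating a dict directly yields only its keys. A small sketch with a hypothetical foreign-key mapping (the names below are illustrative, not taken from the repo):

```python
# If table.foreign_keys is a dict of {col_name: fk_target_fullname},
# iterating it directly yields column names, not FK targets.
foreign_keys = {"frame": "blobs.frame", "object_id": "objects.id"}

keys = [fk for fk in foreign_keys]              # column names (the bug)
targets = [fk for fk in foreign_keys.values()]  # FK targets (the fix)
```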
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/engine/full_scan_engine.py#L42-L45
If we have two input tables, objects and color, it will output:
'''
SELECT
FROM objects, color
INNER JOIN color ON
'''
I think line 44 should be changed to:
FROM {inp_tables[0]}
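A sketch of the suggested fix: start FROM with only the first input table and bring the remaining tables in via the explicit JOINs, so a table never appears both in the FROM list and in an INNER JOIN. The table names and join condition below are assumed for illustration:

```python
# Buggy version lists every table in FROM *and* joins one of them
# again; the fixed version anchors FROM on the first table only.
inp_tables = ["objects", "color"]
join_str = "INNER JOIN color ON objects.frame = color.frame"  # assumed

buggy = f"SELECT * FROM {', '.join(inp_tables)} {join_str}"
fixed = f"SELECT * FROM {inp_tables[0]} {join_str}"
```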
In the extreme case where all sampled blobs have the same aggregation value, num_samples will end up equal to 0. For example, suppose there is a predicate with a 1% positivity rate. After 1k samples, you will only get about 10 positive samples. If we are unlucky and they are all the same, our estimate will not be valid.
Are we assuming an ordering on these tables? I think this may give wrong results if the tables are not ordered correctly.
Currently, we check the elements in the cache one by one. Everything from the cache could be loaded into memory instead.
When the Hugging Face API is called many times (e.g., 1000) in a short time, it raises HTTPError: 429 Client Error: Too Many Requests for url.
https://github.com/ddkang/aidb-new/blob/3049bd4f52f6cc1ae56770f1158c174b373052ce/aidb/engine/full_scan_engine.py#L22-L26
Originally posted by @ttt-77 in #83 (comment)
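One common mitigation is exponential backoff on 429 responses. A hedged sketch, where call_api is a stand-in for the real client call (the helper name and exception handling here are assumptions, not the repo's code):

```python
# Retry a callable with exponential backoff when it fails with a
# rate-limit (429) error; re-raise anything else immediately.
import time

def with_backoff(call_api, retries=5, base_delay=1.0):
    for attempt in range(retries):
        try:
            return call_api()
        except Exception as e:
            # stand-in check for HTTPError 429; real code would
            # inspect the response status code
            if "429" not in str(e) or attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```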
Just to be sure that the proxy score implementation is correct: I would also suggest comparing the number of inference service calls for limit queries using these proxy scores against an adversarial proxy score (you could test with 1 - proxy_score). If the implementation is correct, the number of inference service calls will be fewer with good proxy scores.
Inference service calls are not inserted into the cache tables.
Currently, all the records from the cache table are fetched together and kept in memory. In the case of big tables, this can be a problem.
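One alternative is streaming the cache table in fixed-size chunks so memory stays bounded. A sketch using in-memory SQLite as a stand-in for the real cache table (table and column names are illustrative):

```python
# Fetch the cache table in chunks via fetchmany() instead of loading
# all rows at once; memory use is bounded by the chunk size.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cache (k INTEGER)")
con.executemany("INSERT INTO cache VALUES (?)",
                [(i,) for i in range(10)])

cur = con.execute("SELECT k FROM cache")
seen = 0
while True:
    chunk = cur.fetchmany(4)   # at most 4 rows in memory per step
    if not chunk:
        break
    seen += len(chunk)         # process the chunk here
```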
While testing, I encountered an error in the case of a composite PK-FK relation. Have you tested this with composite PK-FK relations? Can I push a bug fix?
Test the aggregation engine on MySQL and PostgreSQL.
Add the aggregation test to the GitHub Actions workflow; the test does not start when included in GitHub Actions together with the other tests.
Need to make a list of checks @continue-revolution
When filtering keys, the condition is 'keyA IN (1, 2, 3) AND keyB IN (2, 4, 6)' rather than '(keyA, keyB) IN ((1, 2), (2, 4), (3, 6))'. This will be a problem when the dataset has multiple blob keys.
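The problem is that per-column IN filters accept the whole cross product of the key sets, not only the intended key tuples. A small sketch of the false positive (key values are the ones from the example above):

```python
# Per-column IN filtering vs. tuple filtering for composite blob keys.
wanted = {(1, 2), (2, 4), (3, 6)}   # intended (keyA, keyB) pairs

def per_column_filter(a, b):
    # what 'keyA IN (1,2,3) AND keyB IN (2,4,6)' actually checks
    return a in {1, 2, 3} and b in {2, 4, 6}

# (1, 4) was never requested but passes the per-column filter.
false_positive = per_column_filter(1, 4) and (1, 4) not in wanted
```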
So far, I have not noticed any JSON request/response that uses a single number as a key; for batch requests, people normally request via a list.
However, if this does happen, we should modify flatten_json to branch on the type.
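A hypothetical sketch of what a type-branching flatten_json could look like (this is not the repo's implementation; the dotted-key convention is an assumption):

```python
# Flatten nested JSON into dotted keys, branching on value type so
# dicts, lists, and scalars are all handled uniformly.
def flatten_json(obj, prefix=""):
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(flatten_json(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            flat.update(flatten_json(v, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat
```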
For counts, what's wrong with setting it to the estimated count? For derived rows, I think we actually want to estimate the count and do proportional importance sampling or something similar (or control variates).
Originally posted by @ddkang in #107 (comment)
https://github.com/ddkang/aidb-new/blob/main/aidb_utilities/db_setup/blob_table.py#L54
This line throws this warning when the CSV file is large (279,249 rows). @akash17mittal, is this warning expected?
Use a count query on one of the columns inside the execute function of the full scan engine, e.g.:
f'''
SELECT COUNT({inp_col})
FROM {', '.join(inp_tables)}
{join_str};
'''
It was giving me 1000 as the answer instead of 605 on the objects00 table, with inp_col being objects00.frame.
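A plausible explanation for the inflated count: COUNT over a join counts joined rows, so frames with multiple matching rows are counted multiple times, while COUNT(DISTINCT col) counts unique keys. A sketch with tiny illustrative data (the numbers here are not from the actual objects00 table):

```python
# COUNT over a join vs. COUNT(DISTINCT) on the key column.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE blobs (frame INTEGER);
  CREATE TABLE objects00 (frame INTEGER);
  INSERT INTO blobs VALUES (1), (2);
  INSERT INTO objects00 VALUES (1), (1), (2);  -- two objects in frame 1
""")
joined = con.execute(
    "SELECT COUNT(objects00.frame) FROM blobs "
    "JOIN objects00 ON blobs.frame = objects00.frame").fetchone()[0]
distinct = con.execute(
    "SELECT COUNT(DISTINCT objects00.frame) FROM objects00").fetchone()[0]
```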
sqlglot has been updated quite a bit. We should sync sqlglot-aidb with upstream.
1. The inference service must be defined in config.
2. The input columns and output columns must exist in the database schema.
3. The output table must not be a blob table.
4. The input table must include the minimal set of primary key columns from the output table, to ensure that no primary key column in the output table is null.
5. The output column must be bound to only one inference service.
6. The table relations of the input tables and output tables must form a DAG.
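Check 6 (the table relations must form a DAG) can be sketched as a cycle test via Kahn's topological sort over an assumed edge list of table relations (this is an illustration of the check, not the repo's validator):

```python
# Return True iff the directed edges form a DAG (Kahn's algorithm:
# repeatedly remove zero-in-degree nodes; a cycle leaves nodes behind).
from collections import defaultdict, deque

def is_dag(edges):
    indeg = defaultdict(int)
    adj = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))
    queue = deque(n for n in nodes if indeg[n] == 0)
    seen = 0
    while queue:
        n = queue.popleft()
        seen += 1
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return seen == len(nodes)   # all nodes removed => no cycle
```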
Currently, the full scan engine runs all inference services. It should run only the inference services that are required by the query.
Check that the limit result is a subset of the full result; the current check may not work for floating-point numbers due to serialization issues.
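One way to make the subset check robust to float round-trips is to compare rows with a tolerance rather than exact equality. A hedged sketch (the helper name and tolerance are assumptions):

```python
# Tolerant membership test: a row is "in" a result set if some row
# matches every component within a tolerance (math.isclose), so float
# serialization noise does not cause false mismatches.
import math

def row_in(row, rows, tol=1e-9):
    return any(
        len(row) == len(r) and
        all(math.isclose(a, b, rel_tol=tol, abs_tol=tol)
            for a, b in zip(row, r))
        for r in rows)
```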