ddkang / aidb
License: Apache License 2.0
Support multi-aggregation queries
Can we turn the prints into logging and make the tests print info? This can be done in another PR, but raise an issue in that case.
Originally posted by @ddkang in #107 (comment)
I have spent a lot of time on this but am unable to figure it out. Execution does not stop at this line to wait for the insertion to complete in the case of MySQL.
I am not able to understand this logic, or maybe I am missing a design choice.
self._result_query_stub = sqlalchemy.sql.select(output_cols)
result_query_stub selects the output columns, whereas the filtering condition is on the cache columns (the cache columns are all input-binding columns).
query = self._result_query_stub.where(
  sqlalchemy.sql.and_(
    *[getattr(self._cache_table.c, self.convert_column_name(col)) == getattr(inp_row, col)
      for col in self.binding.input_columns]
  )
)
I don't know why this should work.
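A possible explanation for why the lookup works at all: SQLAlchemy derives the FROM clause from every table referenced anywhere in the statement, including the WHERE clause, so filtering on cache-table columns implicitly brings the cache table into the query. A minimal sketch (the table and column names here are made up for illustration, not AIDB's actual schema; SQLAlchemy 1.4+ `select()` signature):

```python
# Sketch: select() over output columns can still be filtered on
# cache-table columns because SQLAlchemy adds every table referenced
# in the WHERE clause to the FROM clause.
import sqlalchemy as sa

metadata = sa.MetaData()
cache = sa.Table("cache", metadata,
                 sa.Column("frame", sa.Integer))
outputs = sa.Table("outputs", metadata,
                   sa.Column("frame", sa.Integer),
                   sa.Column("label", sa.String))

stub = sa.select(outputs.c.label)       # analogue of result_query_stub
query = stub.where(cache.c.frame == 7)  # filter on the cache table

# Both tables appear in FROM (an implicit cross join unless a join
# condition relates them), which is why the filter takes effect.
print(str(query))
```

Note that without an explicit join condition this compiles to a cross join between the two tables, which may or may not be the intended semantics here.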
The use case is that the ML model is updated.
For example,
Lines 523 to 526 in 9ae2dbe
Is there a test that takes more samples than available blobs? (e.g., an error target of 0.1% should just do a full scan and return the exact answer) #64
https://github.com/ddkang/aidb/blob/main/aidb/config/config.py#L241
Check to see if this is needed
If the user runs something like CREATE TABLE, it will currently fail, since we only accept SELECT queries. We need to pass all other valid SQL through to the underlying database.
get_where_str
get_join_str
Think about good design: where should these functions live?
We need to decide whether the configuration should be serializable or stored only in the database. If it is only stored in the database, we need a way to inspect it.
Also, load the packages only on demand.
There is redundancy in the integration test code; it needs refactoring.
Just making sure: if two different tables (not related by the input/output of an inference service) have a column with the same name, are we assuming that they both refer to the same thing?
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/engine/full_scan_engine.py#L28
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/config/config_types.py#L59-L64
table.foreign_keys has the structure {col.name: fk.target_fullname}.
I think line 28 should be changed to: for fkey in table1.foreign_keys.values():
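The distinction matters because iterating a dict directly yields only its keys. A small sketch with a hypothetical foreign-key mapping (the names below are illustrative, not taken from the repo):

```python
# If table.foreign_keys is a dict of {col_name: fk_target_fullname},
# iterating it directly yields column names, not FK targets.
foreign_keys = {"frame": "blobs.frame", "object_id": "objects.id"}

keys = [fk for fk in foreign_keys]              # column names (the bug)
targets = [fk for fk in foreign_keys.values()]  # FK targets (the fix)
```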
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/engine/full_scan_engine.py#L42-L45
If we have two input tables, objects and color, it will output:
'''
SELECT
FROM objects, color
INNER JOIN color ON
'''
I think line 44 should be changed to:
FROM {inp_tables[0]}
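A sketch of the suggested fix: start FROM with only the first input table and bring the remaining tables in via the explicit JOINs, so a table never appears both in the FROM list and in an INNER JOIN. The table names and join condition below are assumed for illustration:

```python
# Buggy version lists every table in FROM *and* joins one of them
# again; the fixed version anchors FROM on the first table only.
inp_tables = ["objects", "color"]
join_str = "INNER JOIN color ON objects.frame = color.frame"  # assumed

buggy = f"SELECT * FROM {', '.join(inp_tables)} {join_str}"
fixed = f"SELECT * FROM {inp_tables[0]} {join_str}"
```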
In the extreme case where all sampled blobs have the same aggregation value, num_samples will end up equal to 0. For example, suppose there is a predicate with a 1% positivity rate. After 1k samples, you will only get about 10 positive samples. If we are unlucky and they are all the same, our estimate will not be valid.
Are we assuming an ordering on these tables? I think this may give wrong results if the tables are not ordered correctly.
Currently, we check the elements in the cache one by one. Everything from the cache could be loaded into memory instead.
When the Hugging Face API is called many times (e.g., 1000) in a short time, it raises HTTPError: 429 Client Error: Too Many Requests for url.
https://github.com/ddkang/aidb-new/blob/3049bd4f52f6cc1ae56770f1158c174b373052ce/aidb/engine/full_scan_engine.py#L22-L26
Originally posted by @ttt-77 in #83 (comment)
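One common mitigation is exponential backoff on 429 responses. A hedged sketch, where call_api is a stand-in for the real client call (the helper name and exception handling here are assumptions, not the repo's code):

```python
# Retry a callable with exponential backoff when it fails with a
# rate-limit (429) error; re-raise anything else immediately.
import time

def with_backoff(call_api, retries=5, base_delay=1.0):
    for attempt in range(retries):
        try:
            return call_api()
        except Exception as e:
            # stand-in check for HTTPError 429; real code would
            # inspect the response status code
            if "429" not in str(e) or attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```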
Just to be sure that the proxy score implementation is correct: I would also suggest comparing the number of inference service calls for limit queries using these proxy scores against an adversarial proxy score (you could test with 1 - proxy_score). If the implementation is correct, the number of inference service calls will be fewer with good proxy scores.
Inference service calls are not inserted into the cache tables.
Currently, all the records from the cache table are fetched together and kept in memory. In the case of big tables, this can be a problem.
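One alternative is streaming the cache table in fixed-size chunks so memory stays bounded. A sketch using in-memory SQLite as a stand-in for the real cache table (table and column names are illustrative):

```python
# Fetch the cache table in chunks via fetchmany() instead of loading
# all rows at once; memory use is bounded by the chunk size.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cache (k INTEGER)")
con.executemany("INSERT INTO cache VALUES (?)",
                [(i,) for i in range(10)])

cur = con.execute("SELECT k FROM cache")
seen = 0
while True:
    chunk = cur.fetchmany(4)   # at most 4 rows in memory per step
    if not chunk:
        break
    seen += len(chunk)         # process the chunk here
```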
While testing, I encountered an error in the case of a composite PK-FK relation. Have you tested this with composite PK-FK relations? Can I push a bug fix?
Test the aggregation engine on MySQL and PostgreSQL.
Add the aggregation test to the GitHub Actions workflow; the test does not start when included in GitHub Actions together with the other tests.
Need to make a list of checks @continue-revolution
When filtering keys, the condition is 'keyA IN (1, 2, 3) AND keyB IN (2, 4, 6)' rather than '(keyA, keyB) IN ((1, 2), (2, 4), (3, 6))'. This will be a problem when the dataset has multiple blob keys.
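The problem is that per-column IN filters accept the whole cross product of the key sets, not only the intended key tuples. A small sketch of the false positive (key values are the ones from the example above):

```python
# Per-column IN filtering vs. tuple filtering for composite blob keys.
wanted = {(1, 2), (2, 4), (3, 6)}   # intended (keyA, keyB) pairs

def per_column_filter(a, b):
    # what 'keyA IN (1,2,3) AND keyB IN (2,4,6)' actually checks
    return a in {1, 2, 3} and b in {2, 4, 6}

# (1, 4) was never requested but passes the per-column filter.
false_positive = per_column_filter(1, 4) and (1, 4) not in wanted
```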
So far, I have not noticed any JSON request/response that uses a single number as a key; for batch requests, people normally request via a list.
However, if this does happen, we should modify flatten_json to branch on the type.
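A hypothetical sketch of what a type-branching flatten_json could look like (this is not the repo's implementation; the dotted-key convention is an assumption):

```python
# Flatten nested JSON into dotted keys, branching on value type so
# dicts, lists, and scalars are all handled uniformly.
def flatten_json(obj, prefix=""):
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(flatten_json(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            flat.update(flatten_json(v, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat
```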
For counts, what's wrong with setting it to the estimated count? For derived rows, I think we actually want to estimate the count and do proportional importance sampling or something similar (or control variates).
Originally posted by @ddkang in #107 (comment)
https://github.com/ddkang/aidb-new/blob/main/aidb_utilities/db_setup/blob_table.py#L54
This line throws this warning when the CSV file is large (279,249 rows). @akash17mittal, is this warning expected?
Use a count query on one of the columns inside the execute function of the full scan engine, e.g.:
f'''
SELECT COUNT({inp_col})
FROM {', '.join(inp_tables)}
{join_str};
'''
It was giving me 1000 as the answer instead of 605 on the objects00 table, with inp_col being objects00.frame.
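A plausible explanation for the inflated count: COUNT over a join counts joined rows, so frames with multiple matching rows are counted multiple times, while COUNT(DISTINCT col) counts unique keys. A sketch with tiny illustrative data (the numbers here are not from the actual objects00 table):

```python
# COUNT over a join vs. COUNT(DISTINCT) on the key column.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE blobs (frame INTEGER);
  CREATE TABLE objects00 (frame INTEGER);
  INSERT INTO blobs VALUES (1), (2);
  INSERT INTO objects00 VALUES (1), (1), (2);  -- two objects in frame 1
""")
joined = con.execute(
    "SELECT COUNT(objects00.frame) FROM blobs "
    "JOIN objects00 ON blobs.frame = objects00.frame").fetchone()[0]
distinct = con.execute(
    "SELECT COUNT(DISTINCT objects00.frame) FROM objects00").fetchone()[0]
```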
sqlglot has been updated quite a bit. We should sync sqlglot-aidb with upstream.
1. The inference service must be defined in config.
2. The input columns and output columns must exist in the database schema.
3. The output table must not be a blob table.
4. The input table must include the minimal set of primary key columns from the output table, to ensure that no primary key column in the output table is null.
5. The output column must be bound to only one inference service.
6. The table relations of the input tables and output tables must form a DAG.
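Check 6 (the table relations must form a DAG) can be sketched as a cycle test via Kahn's topological sort over an assumed edge list of table relations (this is an illustration of the check, not the repo's validator):

```python
# Return True iff the directed edges form a DAG (Kahn's algorithm:
# repeatedly remove zero-in-degree nodes; a cycle leaves nodes behind).
from collections import defaultdict, deque

def is_dag(edges):
    indeg = defaultdict(int)
    adj = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))
    queue = deque(n for n in nodes if indeg[n] == 0)
    seen = 0
    while queue:
        n = queue.popleft()
        seen += 1
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return seen == len(nodes)   # all nodes removed => no cycle
```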
Currently, the full scan engine runs all inference services. It should run only the inference services that are required by the query.
Check that the limit result is a subset of the full result; the current check may not work for floating-point numbers due to serialization issues.
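One way to make the subset check robust to float round-trips is to compare rows with a tolerance rather than exact equality. A hedged sketch (the helper name and tolerance are assumptions):

```python
# Tolerant membership test: a row is "in" a result set if some row
# matches every component within a tolerance (math.isclose), so float
# serialization noise does not cause false mismatches.
import math

def row_in(row, rows, tol=1e-9):
    return any(
        len(row) == len(r) and
        all(math.isclose(a, b, rel_tol=tol, abs_tol=tol)
            for a, b in zip(row, r))
        for r in rows)
```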