aidb's People

Contributors

akash17mittal, continue-revolution, ddkang, ttt-77, zencodess


aidb's Issues

Issue in retrieving results from output tables

https://github.com/ddkang/aidb-new/blob/eb1394191d44b3b84b7658abc1ff83e872cbc54d/aidb/inference/bound_inference_service.py#L161

I am not able to understand this logic or maybe I am missing some design choice.

self._result_query_stub = sqlalchemy.sql.select(output_cols)

result_query_stub selects the output columns, whereas the filtering condition is on the cache columns (the cache columns are all of the input binding columns).

query = self._result_query_stub.where(
    sqlalchemy.sql.and_(
        *[getattr(self._cache_table.c, self.convert_column_name(col)) == getattr(inp_row, col)
          for col in self.binding.input_columns]
    )
)

I don't see why this should work.
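One way to see the concern is to run the same shape of query by hand. The sketch below uses the standard-library sqlite3 module rather than the project's SQLAlchemy code, and the table and column names (cache, output, frame, label) are hypothetical. It assumes the output table and the cache table are separate tables with no join condition between them; if the real code joins them elsewhere, the result differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cache (frame INTEGER PRIMARY KEY);  -- hypothetical cache table
CREATE TABLE output (frame INTEGER, label TEXT); -- hypothetical output table
INSERT INTO cache VALUES (1), (2);
INSERT INTO output VALUES (1, 'car'), (2, 'bus');
""")

# Output columns selected, filter only on the cache table: the two tables
# are cross joined, so every output row survives once the cache row exists.
unjoined = conn.execute(
    "SELECT output.label FROM output, cache "
    "WHERE cache.frame = ? ORDER BY output.label",
    (1,),
).fetchall()

# The same filter with an explicit join condition between the two tables.
joined = conn.execute(
    "SELECT output.label FROM output "
    "JOIN cache ON output.frame = cache.frame "
    "WHERE cache.frame = ?",
    (1,),
).fetchall()

print(unjoined)  # rows for both frames
print(joined)    # only the row for frame 1
```

If the behaviour in the repository matches the first query, the filter only checks that the input is cached and does not restrict which output rows come back.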

Configuration checks: schema only (no engines)

  • Blob table check: there must be logically one blob table (join keys allowed) for each unstructured data resource. Blob tables refer to base tables.
  • Reference check: if Table A references Table B, then Table A must reference all primary key columns of Table B. L135-L145
  • Table graph check: reject self-loops, one-edge loops, and longer cycles; model the tables as a directed graph. L124-L132 L153-L155
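The table graph check could be sketched as a depth-first search over a directed graph. This is a minimal illustration, not the code at the line ranges cited above; the function name find_cycle and the edge layout (a dict mapping each table to the tables it references) are assumptions made for the example.

```python
def find_cycle(edges):
    """Return one cycle as a list of table names, or None if there is none.

    edges maps a table name to the tables it references (hypothetical layout).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(edges):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


acyclic = find_cycle({"colors": ["objects"], "objects": ["blobs"]})
self_loop = find_cycle({"a": ["a"]})
two_table_loop = find_cycle({"a": ["b"], "b": ["a"]})
long_cycle = find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Returning the cycle itself, rather than a boolean, makes it easier to report which tables are involved in a configuration error.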

Serialize config

We need to decide whether the configuration should be serializable or stored only in the database. If it is stored only in the database, we need a way to inspect it.

There may be some small errors in full_scan_engine.py

https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/engine/full_scan_engine.py#L28
https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/config/config_types.py#L59-L64
table.foreign_keys has the structure {col.name: fk.target_fullname}.
Iterating over it directly yields the column names, not the targets, so I think line 28 should be changed to: for fkey in table1.foreign_keys.values():
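A small sketch of the difference, using a plain dict as a hypothetical stand-in for table.foreign_keys (the column and table names are invented for illustration):

```python
# Hypothetical stand-in for table.foreign_keys: {column name: target full name}.
foreign_keys = {"frame_id": "blobs.frame_id", "object_id": "objects.object_id"}

# Iterating over the dict itself yields the keys (the local column names).
iterated_directly = [fkey for fkey in foreign_keys]

# Iterating over .values() yields the referenced "table.column" targets.
iterated_values = [fkey for fkey in foreign_keys.values()]

# The referenced table names can then be read off the targets.
referenced_tables = sorted({target.split(".")[0] for target in foreign_keys.values()})
```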

https://github.com/ddkang/aidb-new/blob/88807f3cda41cc5ee37d7adf61681c5699bbabf6/aidb/engine/full_scan_engine.py#L42-L45
If we have two tables, objects and color, it will output
'''
SELECT
FROM objects, color
INNER JOIN color ON
'''
I think line 44 should be changed to
FROM {inp_tables[0]}
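As a rough sketch of the suggested fix, the query could name only the first table in FROM and add every other table through INNER JOIN. The helper build_join_query and its parameters are hypothetical, as are the join condition and column names; this is not the code in full_scan_engine.py.

```python
def build_join_query(inp_tables, join_conditions, columns):
    """Build a SELECT that joins inp_tables[1:] onto inp_tables[0].

    join_conditions[i] is the ON clause used when joining inp_tables[i + 1].
    """
    lines = [f"SELECT {', '.join(columns)}", f"FROM {inp_tables[0]}"]
    for table, condition in zip(inp_tables[1:], join_conditions):
        lines.append(f"INNER JOIN {table} ON {condition}")
    return "\n".join(lines)


query = build_join_query(
    ["objects", "color"],
    ["objects.object_id = color.object_id"],
    ["objects.object_id", "color.color"],
)
print(query)
```

With this shape, color appears once, in the INNER JOIN clause, instead of in both FROM and INNER JOIN.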

Case for 'num_samples' equal to 0 in approximate aggregation query

In the extreme case where all sampled blobs have the same aggregation value, 'num_samples' will be equal to 0. For example, take a predicate with a 1% positivity rate. After 1k samples, you'll get only about 10 positive samples. If we're unlucky and they all have the same value, our estimate won't be valid.
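The formula the engine uses for 'num_samples' is not shown in this issue. The sketch below assumes a common normal-approximation rule, n = ceil((z * s / error_target)^2), where s is the sample standard deviation; the function name required_samples and its parameters are hypothetical. Under that assumption, identical pilot values give s = 0 and therefore n = 0.

```python
import math
import statistics


def required_samples(values, z=1.96, error_target=0.05):
    """Assumed rule: n = ceil((z * s / error_target) ** 2), s = sample std dev."""
    s = statistics.stdev(values)
    return math.ceil((z * s / error_target) ** 2)


# About 10 positives out of 1k samples at a 1% positivity rate,
# all of which happen to share the same aggregation value.
identical_positives = [5.0] * 10
varied_positives = [1.0, 2.0, 3.0, 4.0, 5.0] * 2

n_identical = required_samples(identical_positives)
n_varied = required_samples(varied_positives)
```

One possible guard, again only a suggestion, is to enforce a minimum sample count so that a zero-variance pilot sample cannot end the sampling early.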

Test proxy score correctness

This is just to be sure that the proxy score implementation is correct. I would also suggest comparing the number of inference service calls for limit queries using these proxy scores against the number using an adversarial proxy score (for example, 1 - proxy_score). If the implementation is correct, the number of inference service calls will be lower with perfect proxy scores.

#50
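The comparison suggested here could be prototyped with a toy simulation. The sketch assumes a limit query visits blobs in descending proxy score order and calls the inference service until it has found `limit` matching rows; the function count_inference_calls and the synthetic labels are hypothetical and do not reflect the engine's actual code.

```python
def count_inference_calls(labels, proxy_scores, limit):
    """Count calls needed to find `limit` positives, visiting high scores first."""
    order = sorted(range(len(labels)), key=lambda i: proxy_scores[i], reverse=True)
    calls = 0
    found = 0
    for i in order:
        if found >= limit:
            break
        calls += 1          # one inference service call per visited blob
        found += labels[i]  # label is 1 if the blob satisfies the predicate
    return calls


# 100 blobs, 10 of which are positive.
labels = [1 if i % 10 == 0 else 0 for i in range(100)]
perfect_scores = [float(label) for label in labels]
adversarial_scores = [1.0 - score for score in perfect_scores]

perfect_calls = count_inference_calls(labels, perfect_scores, limit=5)
adversarial_calls = count_inference_calls(labels, adversarial_scores, limit=5)
```

In this toy setting the perfect scores need 5 calls and the adversarial scores need 95, which is the kind of gap a real test could assert on.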

Multiple blob key filtering issue

When filtering keys, the condition is 'keyA in (1, 2, 3) AND keyB in (2, 4, 6)' rather than '(keyA, keyB) in ((1, 2), (2, 4), (3, 6))'. The first form also matches unwanted combinations such as (1, 4), so this will be a problem when the dataset has multiple blob keys.
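The difference between the two conditions can be reproduced with a small sqlite3 example. The table name blobs and its contents are hypothetical. The pairwise filter is written as an OR of AND clauses, which is one portable way to express the tuple condition; it is not necessarily how the fix should look in the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blobs (keyA INTEGER, keyB INTEGER)")
rows = [(1, 2), (1, 4), (2, 4), (2, 6), (3, 6)]
conn.executemany("INSERT INTO blobs VALUES (?, ?)", rows)

wanted = [(1, 2), (2, 4), (3, 6)]
marks = ", ".join("?" for _ in wanted)

# Each key filtered independently: any combination of listed values matches.
independent = conn.execute(
    f"SELECT keyA, keyB FROM blobs "
    f"WHERE keyA IN ({marks}) AND keyB IN ({marks}) ORDER BY keyA, keyB",
    [a for a, _ in wanted] + [b for _, b in wanted],
).fetchall()

# Keys filtered as pairs: only the exact (keyA, keyB) combinations match.
pair_clause = " OR ".join("(keyA = ? AND keyB = ?)" for _ in wanted)
paired = conn.execute(
    f"SELECT keyA, keyB FROM blobs WHERE {pair_clause} ORDER BY keyA, keyB",
    [value for pair in wanted for value in pair],
).fetchall()
```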

Refactor Proxy Score Computation

          For counts, what's wrong with setting it to the estimated count? I think for derived rows, we want to estimate the count and do proportional importance sampling or something actually (or control variates)

Originally posted by @ddkang in #107 (comment)

Configuration checks: inference engine

1. The inference service must be defined in config.
2. The input columns and output columns must exist in the database schema.
3. The output table must not be a blob table.
4. The input table must include the minimal set of primary key columns from the output table.
   And to ensure that no primary key column in the output table is null, any column in the output table.
5. The output column must be bound to only one inference service.
6. The table relations of the input tables and output tables must form a DAG.
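As an illustration of check 5, the sketch below reports any output column that is bound to more than one inference service. The binding representation (a list of service name and output column pairs) and the function name are assumptions for the example, not the project's config types.

```python
def find_multiply_bound_columns(bindings):
    """Return {output column: [services]} for columns bound more than once.

    bindings is a list of (service_name, output_columns) pairs (hypothetical).
    """
    services_by_column = {}
    for service, output_columns in bindings:
        for column in output_columns:
            services_by_column.setdefault(column, set()).add(service)
    return {
        column: sorted(services)
        for column, services in services_by_column.items()
        if len(services) > 1
    }


valid_bindings = [
    ("detector", ["objects.object_id", "objects.label"]),
    ("colorizer", ["colors.color"]),
]
invalid_bindings = valid_bindings + [("detector_v2", ["objects.label"])]

no_conflicts = find_multiply_bound_columns(valid_bindings)
conflicts = find_multiply_bound_columns(invalid_bindings)
```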

Limit Engine Correctness Check

Check that the limit result is a subset of the full result. The current check may not work for floating point numbers due to serialization issues.
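One possible way to make the subset check robust to floating point noise is to compare numeric fields with a tolerance instead of exact equality. This is a sketch under the assumption that results are sequences of row tuples; the function names and tolerances are hypothetical.

```python
import math


def rows_match(row_a, row_b, rel_tol=1e-9, abs_tol=1e-9):
    """Compare two rows, using a tolerance for numeric fields."""
    if len(row_a) != len(row_b):
        return False
    for x, y in zip(row_a, row_b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        elif x != y:
            return False
    return True


def is_subset(limit_rows, full_rows):
    """True if every limit row matches a distinct row of the full result."""
    remaining = list(full_rows)
    for row in limit_rows:
        for index, candidate in enumerate(remaining):
            if rows_match(row, candidate):
                del remaining[index]  # each full row may be matched only once
                break
        else:
            return False
    return True


full_result = [(1, 0.1 + 0.2), (2, 0.5)]
subset_ok = is_subset([(1, 0.3)], full_result)
subset_bad = is_subset([(3, 0.3)], full_result)
```

The pairwise scan is quadratic, which should be acceptable for test-sized results but may need an index for large ones.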
