spyql's People

Contributors

areski, benji-york, dcmoura, dependabot[bot], hayashi-yudai, pyup-bot, recharte


spyql's Issues

Code improvements

Some cases where the code could be more pythonic:

  • some for loops where range(len(l)) might be replaced by enumerate
  • some functions where yield might be used to return a generator

Changes should be assessed for performance.
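
For illustration, a minimal sketch of both refactors (the names here are made up for the example, not from the codebase):

items = ["a", "b", "c"]

# index-based loop
for i in range(len(items)):
    print(i, items[i])

# more pythonic: enumerate yields index and item together
for i, item in enumerate(items):
    print(i, item)

# builds the full result list in memory
def squares_list(xs):
    result = []
    for x in xs:
        result.append(x * x)
    return result

# more pythonic: yield turns the function into a lazy generator
def squares(xs):
    for x in xs:
        yield x * x

print(squares_list([1, 2, 3]), list(squares([1, 2, 3])))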

TO pandas

Allow using pandas' writing functions to cover more formats.
Don't focus on performance; focus on bringing new formats to small datasets.
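
A minimal sketch of the idea, collecting the (small) result set in memory and delegating serialization to pandas; the function name and format mapping are assumptions, not spyql's actual writer API:

import pandas as pd

def write_with_pandas(rows, colnames, path, fmt="parquet"):
    # small datasets only: materialize everything into a DataFrame
    df = pd.DataFrame(rows, columns=colnames)
    # delegate the actual writing to pandas to gain extra formats
    writers = {"parquet": df.to_parquet, "excel": df.to_excel, "html": df.to_html}
    writers[fmt](path)

write_with_pandas([[1, "a"], [2, "b"]], ["id", "name"], "out.parquet")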

EXPLODE OUTER

Currently, EXPLODE has the following behaviour:

SELECT row.name, row.departments
FROM [
    {"name": "Alice", "departments": [1,4]},
    {"name": "Bob", "departments": [2]},
    {"name": "Charles", "departments": []}
]
EXPLODE row.departments
TO json

Results in:

{"name": "Alice", "departments": 1}
{"name": "Alice", "departments": 4}
{"name": "Bob", "departments": 2}

The Charles row is omitted because of the empty array.

TO DO

  • Implement the OUTER modifier that would include cases where departments is empty or NULL.

Result would be:

{"name": "Alice", "departments": 1}
{"name": "Alice", "departments": 4}
{"name": "Bob", "departments": 2}
{"name": "Charles", "departments": null}

Also:

  • Make sure EXPLODE without the OUTER modifier works with NULLs, and add a test
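
A minimal sketch of the intended semantics in plain Python (not spyql's actual implementation):

def explode(rows, key, outer=False):
    for row in rows:
        values = row.get(key)
        if not values:  # empty list or None
            if outer:
                # OUTER keeps the row, filling the exploded column with None
                yield {**row, key: None}
            continue
        for v in values:
            yield {**row, key: v}

rows = [
    {"name": "Alice", "departments": [1, 4]},
    {"name": "Bob", "departments": [2]},
    {"name": "Charles", "departments": []},
]
print(list(explode(rows, "departments", outer=True)))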

Additional output formats / writers

  • JIRA
    ||heading 1||heading 2||heading 3||
    |col A1|col A2|col A3|
    |col B1|col B2|col B3|

  • HTML

    <table>
      <tr><th>Month</th><th>Savings</th></tr>
      <tr><td>January</td><td>$100</td></tr>
    </table>
  • LaTeX

\begin{table}[]
\begin{tabular}{l|l|l}
id & name & add \\ \hline
1 & hello & www \\
2 & planet & biniam \\
3 & earth & com
\end{tabular}
\end{table}

  • Markdown

    | A | B |
    |---|---|
    | 1 | 2 |
    | 3 | 4 |
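
As a sketch, the JIRA variant is little more than joining cells with different delimiters for the header row (the writer name is hypothetical):

def to_jira(header, rows):
    lines = ["||" + "||".join(header) + "||"]
    lines += ["|" + "|".join(str(v) for v in row) + "|" for row in rows]
    return "\n".join(lines)

print(to_jira(["heading 1", "heading 2", "heading 3"],
              [["col A1", "col A2", "col A3"],
               ["col B1", "col B2", "col B3"]]))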

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Aggregations

Basic implementation of the GROUP BY clause using in-memory data structures and aggregation functions.
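
A minimal in-memory sketch of the idea, grouping rows by a key column and folding a single aggregate (not the actual spyql design):

from collections import defaultdict

def group_by_sum(rows, key, value):
    # accumulate one running sum per group, entirely in memory
    groups = defaultdict(int)
    for row in rows:
        groups[row[key]] += row[value]
    return dict(groups)

rows = [{"dep": 1, "n": 2}, {"dep": 1, "n": 3}, {"dep": 2, "n": 5}]
print(group_by_sum(rows, "dep", "n"))  # {1: 5, 2: 5}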

Column names with TEXT format

When using TEXT as an import format, I couldn't find a way of configuring the column name. The alias seems to be applied as long as we don't use * in the SELECT statement, which we have to do for TEXT files. Am I missing something?
Here's an example:

cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
INSERT INTO "domains"("col1") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');
cat /tmp/test.txt | spyql -Otable=domains 'Select 1 as colname from text To SQL' 
INSERT INTO "domains"("colname") VALUES (1),(1),(1),(1),(1);

Expected result would be:

cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
INSERT INTO "domains"("colname") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');

SELECT DISTINCT

Basic implementation of the DISTINCT modifier of the SELECT clause using in-memory data structures.
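
A minimal sketch: keep a set of already-seen rows and only yield the first occurrence (rows are assumed to be hashable tuples here):

def distinct(rows):
    seen = set()
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            yield row

print(list(distinct([(1, "a"), (1, "a"), (2, "b")])))  # [(1, 'a'), (2, 'b')]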

Recursive column selection in JSON input

Allow json->* and json->afield->* for horizontal explosion.

  • Gather the set of fields based on a sample of the input (keys not present in the sample will be excluded)
  • Replace the star (*) column by the set of columns
  • Optional: warn if a key not present in the sample is detected along the way
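
A sketch of the sampling step, collecting the union of keys over the first rows of the input (the function name is made up):

def star_columns(sample_rows):
    # union of keys across the sample, preserving first-seen order;
    # keys that never show up in the sample are excluded
    cols = []
    for row in sample_rows:
        for key in row:
            if key not in cols:
                cols.append(key)
    return cols

sample = [{"id": 1, "name": "a"}, {"id": 2, "ts": 99}]
print(star_columns(sample))  # ['id', 'name', 'ts']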

Extend Dictionaries

Add methods to NullSafeDict to help when working with dicts:

  • dict.with(k, v): returns a dict with the new/updated pair
  • dict.drop(k1, k2, k3): returns a dict without the given keys (does not fail if a key is not found)
  • dict.withKeyRename(old_key, new_key): returns a dict with the key renamed

Add a method to allow using dicts on CSVs, etc.:

  • row / row(): constructs a dict over _values (and caches it to allow multiple calls), e.g. row->acolumn
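
A hedged sketch of the three helpers on top of a plain dict (the real versions would live on NullSafeDict, and since with is a Python keyword the final names would have to differ):

class ExtendedDict(dict):
    def with_pair(self, k, v):
        # returns a new dict with the pair added/updated
        return ExtendedDict({**self, k: v})

    def drop(self, *keys):
        # returns a new dict without the given keys; missing keys are ignored
        return ExtendedDict({k: v for k, v in self.items() if k not in keys})

    def with_key_renamed(self, old_key, new_key):
        # returns a new dict with old_key renamed to new_key
        return self.drop(old_key).with_pair(new_key, self[old_key])

d = ExtendedDict({"a": 1, "b": 2})
print(d.with_pair("c", 3).drop("a", "x").with_key_renamed("b", "bee"))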

Interactive Spyql

Hi,

I am curious to see if we can implement an interactive spyql that can be called from within any script. Querying Python objects is something pretty powerful that should be brought directly into the Python language, like how C# has LINQ. I have started to work a bit on it and here's some pseudocode I had in mind:

class Q:
  def __init__(self, query):
    self._prs, self._string_regex = parse(clean_query(query))
  
  def __call__(self, **kwargs):
    pass

And usage can be as follows:

>>> q = Q("IMPORT numpy SELECT numpy.mean(data->salary) FROM data WHERE data->name == 'akash'")
>>> q(data = data)

I could hack it as you have done in main_test.py by using a CliRunner, but that seems unintuitive for interactive use. I'd like to know your thoughts on this and some pointers to start working on it!

Missing values handling

  • reading NULL/None (e.g. on CSV and JSON)
  • handling NULL/None in expression evaluation: catch None exceptions and return None (e.g. arithmetic); also covered in #6
  • writing NULL/None
  • coalesce function (quick way to return a value when NULL/None); see the sketch below
  • tests (NULL in numeric data types, NULL in strings, arithmetic, concatenation, etc.)
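
The coalesce part could be as small as this sketch:

def coalesce(*values):
    # return the first non-None value, or None if all are None
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, None, 42))  # 42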

FROM pandas / numpy

Try iterating over pandas and numpy arrays and check whether any adjustment is needed.
Load everything into memory and don't focus on performance... only to be used with small datasets. e.g.

python3 -m spyql.spyql "SELECT col1->IGNITION FROM [[row.to_dict()] for _, row in pd.read_csv('~/Downloads/gps_fb-ma-06-13.csv', nrows=10).iterrows()] TO json"

SELECT * Modifiers

Modifiers:

  • SELECT * EXCEPT
  • SELECT * REPLACE

SELECT * EXCEPT

A SELECT * EXCEPT statement specifies the names of one or more columns to exclude from the result. All matching column names are omitted from the output.

WITH orders AS
  (SELECT 5 as order_id,
  "sprocket" as item_name,
  200 as quantity)
SELECT * EXCEPT (order_id)
FROM orders;

+-----------+----------+
| item_name | quantity |
+-----------+----------+
| sprocket  | 200      |
+-----------+----------+

SELECT * REPLACE

A SELECT * REPLACE statement specifies one or more expression AS identifier clauses. Each identifier must match a column name from the SELECT * statement. In the output column list, the column that matches the identifier in a REPLACE clause is replaced by the expression in that REPLACE clause.

A SELECT * REPLACE statement does not change the names or order of columns. However, it can change the value and the value type.

WITH orders AS
  (SELECT 5 as order_id,
  "sprocket" as item_name,
  200 as quantity)
SELECT * REPLACE ("widget" AS item_name)
FROM orders;

+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5        | widget    | 200      |
+----------+-----------+----------+

WITH orders AS
  (SELECT 5 as order_id,
  "sprocket" as item_name,
  200 as quantity)
SELECT * REPLACE (quantity/2 AS quantity)
FROM orders;

+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5        | sprocket  | 100      |
+----------+-----------+----------+

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace

DESCRIBE dataset

Utility or keyword that automatically summarizes a dataset, e.g. by calculating # of non-null values, # of distinct values, average, min, max ...
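
A minimal sketch for a single numeric column (column handling, typing, and output format are all open questions):

def describe(values):
    non_null = [v for v in values if v is not None]
    return {
        "count": len(non_null),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": sum(non_null) / len(non_null) if non_null else None,
    }

print(describe([1, 2, 2, None, 5]))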

Create easy syntax for JSON fields

e.g. SELECT json->id, json->data->ts FROM json

Transform into json['id'] and json['data']['ts'], handling exceptions to return None in case of a key error.
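
A sketch of the target behaviour (the real transform would happen when the query is compiled, not via a helper like this):

def get_path(obj, *keys):
    # json->id, json->data->ts  ==>  get_path(json, 'id'), get_path(json, 'data', 'ts')
    try:
        for key in keys:
            obj = obj[key]
        return obj
    except (KeyError, TypeError):
        # missing key, or traversing into None, yields None instead of raising
        return None

doc = {"id": 7, "data": {"ts": 123}}
print(get_path(doc, "data", "ts"), get_path(doc, "data", "missing"))  # 123 None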

Exception handling for data issues

Exception handling is of the utmost importance. There are several situations where exceptions are raised due to the data and not to programming errors. These should be dealt with so that execution is not halted; instead, the evaluation should result in NULL/None values (e.g. NULL/None inputs, missing JSON fields, casting empty strings to numeric). Optionally, warnings should be sent to stderr. This issue should cover at least the following exceptions, always returning None:

  • Operations with None values (e.g. None + 1)
  • Accessing methods/properties of None objects (e.g. None.get(...))
  • Retrieving nonexistent keys from dictionaries (e.g. json['missing key'])

Note: in the future, a coalesce function should be available that is able to handle these exceptions and replace None with a value/expression (see #3).
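
One possible shape is a catch-all wrapper around each expression evaluation, as in this sketch (the warning format is made up):

import sys

def null_safe(fn):
    # the three exception types from the list above become None
    try:
        return fn()
    except (TypeError, AttributeError, KeyError) as e:
        print(f"warning: {e!r} -> NULL", file=sys.stderr)
        return None

print(null_safe(lambda: None + 1))             # None
print(null_safe(lambda: {"a": 1}["missing"]))  # None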

Key value json output and key value loader to emulate left joins

Allow loading dictionaries (e.g. key/value pairs from JSON files) and doing dictionary look-ups in the query.
This would cover a kind of LEFT JOIN with an equality condition, e.g.

SELECT adict[id] ... 

If id does not exist in adict, a NULL value should be returned.
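
With a plain Python dict, this is essentially dict.get, which already returns None for missing keys:

# emulate a LEFT JOIN lookup: a missing id yields None instead of failing
customers = {"1": "alice", "2": "bob"}
for order_id in ["1", "3"]:
    print(order_id, customers.get(order_id))  # '1 alice', then '3 None'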

Initialization Code

Command line option to allow running Python initialization code.
The most typical use cases are importing libraries, initializing variables, and defining functions.

Real-world example

Could you please give an example of how to convert this real-world JSON to SQLite:

{
    "tradeHistory":{
       "orders":[
          {
             "id":"aaaa",
             "name":"a",
             "subOrders":[
                
             ],
             "cash":null
          },
          {
             "id":"bbbb",
             "name":"b",
             "subOrders":[
                
             ],
             "cash":null
          }
       ],
       "pageCount":2,
       "isSuperWikifolio":false
    },
    "tooltips":{
       "download-price":"<div>aaaa</div>",
       "download-account-statement":"<div>bbbb</div>"
    }
}

The table should hold the orders only. I suppose I'd have to convert it to single-line JSON with jq, but that already fails.

Multiple EXPLODEs

Allow multiple explosions in a single EXPLODE statement.

Some possible syntaxes:

  • EXPLODE json->i_am_an_array[]->i_am_an_object->i_am_another_array[]
  • EXPLODE json->i_am_an_array, json->i_am_an_array->i_am_an_object->i_am_another_array

The 2nd is easier to parse and allows for explosions in different branches, but it is more verbose.
It is also explicit about the order of explosions.

Other issues

This is a backlog to be converted into issues as we go 😅 🔨

Backlog (after 1st release)

Tentative (but unlikely)

Next

  • prev object/dict/list/function (choose) with data from the previous line, e.g. prev(col1) or prev->col1
  • EXECUTE clause to execute python statements for each row. e.g. create a variable to simplify query: EXECUTE cust = json->order->customer SELECT cust->id, cust->name, cust->age FROM JSON.

Later / ideas

  • Other file formats: ARFF, XML, YAML, ...
  • When writing to SQL, CREATE TABLE IF NOT EXISTS (data types vary with the DB; see csvsql: https://github.com/wireservice/csvkit)
  • aggregation options (print every record, print every x sec, by window)
  • windows + watermarks (streaming)?
  • read from files (with wildcards)
  • window functions: percentile, median (only on sorted input)
  • when reading from files, get file size info to make planning decisions

Support reading and writing of JSON objects (not JSON lines)

Currently SPyQL only allows reading and writing JSON lines. Writing JSON arrays can be done with dict_agg, aggregating everything into an array and writing a JSON file with a single line.

The idea is to add a lines=True argument to the json and orjson writers and processors. The processor should be able to handle single-object files as well as arrays of objects or arrays of scalars. When lines is False, the processor should load the full input into memory and then parse it. While this is not ideal, it is the most straightforward implementation. In addition, JSON arrays shouldn't be used for large data; in those cases, JSON lines should be used instead.

The writer should write an array of objects when lines is False.
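
A sketch of the processor side under the proposed flag (the lines argument name comes from the issue; everything else is an assumption):

import json

def read_json(text, lines=True):
    if lines:
        # JSON lines: one object per line, streamed lazily
        return (json.loads(line) for line in text.splitlines() if line.strip())
    # lines=False: load the whole input, then normalize to a list of rows
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

print(list(read_json('{"a": 1}\n{"a": 2}')))
print(read_json('[{"a": 1}, {"a": 2}]', lines=False))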
