spyql's People

Contributors

areski, benji-york, dcmoura, dependabot[bot], hayashi-yudai, pyup-bot, recharte


spyql's Issues

Code improvements

Some cases where the code could be more pythonic:

  • some for loops where range(len(l)) might be replaced by enumerate
  • some functions where yield might be used to return a generator

Changes should be assessed for performance.
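
For illustration, a minimal sketch of both refactors (the names here are made up for the example, not from the codebase):

items = ["a", "b", "c"]

# index-based loop
for i in range(len(items)):
    print(i, items[i])

# more pythonic: enumerate yields index and item together
for i, item in enumerate(items):
    print(i, item)

# builds the full result list in memory
def squares_list(xs):
    result = []
    for x in xs:
        result.append(x * x)
    return result

# more pythonic: yield turns the function into a lazy generator
def squares(xs):
    for x in xs:
        yield x * x

print(squares_list([1, 2, 3]), list(squares([1, 2, 3])))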

TO pandas

Allow using pandas' writing functions to cover more formats.
Don't focus on performance; focus on bringing new formats to small datasets.
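
A minimal sketch of the idea, collecting the (small) result set in memory and delegating serialization to pandas; the function name and format mapping are assumptions, not spyql's actual writer API:

import pandas as pd

def write_with_pandas(rows, colnames, path, fmt="parquet"):
    # small datasets only: materialize everything into a DataFrame
    df = pd.DataFrame(rows, columns=colnames)
    # delegate the actual writing to pandas to gain extra formats
    writers = {"parquet": df.to_parquet, "excel": df.to_excel, "html": df.to_html}
    writers[fmt](path)

write_with_pandas([[1, "a"], [2, "b"]], ["id", "name"], "out.parquet")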

EXPLODE OUTER

Currently, EXPLODE has the following behaviour:

SELECT row.name, row.departments
FROM [
    {"name": "Alice", "departments": [1,4]},
    {"name": "Bob", "departments": [2]},
    {"name": "Charles", "departments": []}
]
EXPLODE row.departments
TO json

Results in:

{"name": "Alice", "departments": 1}
{"name": "Alice", "departments": 4}
{"name": "Bob", "departments": 2}

The Charles row is omitted because of the empty array.

TO DO

  • Implement the OUTER modifier that would include cases where departments is empty or NULL.

Result would be:

{"name": "Alice", "departments": 1}
{"name": "Alice", "departments": 4}
{"name": "Bob", "departments": 2}
{"name": "Charles", "departments": null}

Also:

  • Make sure EXPLODE without the OUTER modifier works with NULLs, and add a test
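
A minimal sketch of the intended semantics in plain Python (not spyql's actual implementation):

def explode(rows, key, outer=False):
    for row in rows:
        values = row.get(key)
        if not values:  # empty list or None
            if outer:
                # OUTER keeps the row, filling the exploded column with None
                yield {**row, key: None}
            continue
        for v in values:
            yield {**row, key: v}

rows = [
    {"name": "Alice", "departments": [1, 4]},
    {"name": "Bob", "departments": [2]},
    {"name": "Charles", "departments": []},
]
print(list(explode(rows, "departments", outer=True)))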

Additional output formats / writers

  • JIRA
    ||heading 1||heading 2||heading 3||
    |col A1|col A2|col A3|
    |col B1|col B2|col B3|

  • HTML

    <table>
      <tr><th>Month</th><th>Savings</th></tr>
      <tr><td>January</td><td>$100</td></tr>
    </table>
  • LaTeX

\begin{table}[]
\begin{tabular}{l|l|l}
id & name & add \\ \hline
1 & hello & www \\
2 & planet & biniam \\
3 & earth & com
\end{tabular}
\end{table}

  • Markdown

    | A | B |
    |---|---|
    | 1 | 2 |
    | 3 | 4 |
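
As a sketch, the JIRA variant is little more than joining cells with different delimiters for the header row (the writer name is hypothetical):

def to_jira(header, rows):
    lines = ["||" + "||".join(header) + "||"]
    lines += ["|" + "|".join(str(v) for v in row) + "|" for row in rows]
    return "\n".join(lines)

print(to_jira(["heading 1", "heading 2", "heading 3"],
              [["col A1", "col A2", "col A3"],
               ["col B1", "col B2", "col B3"]]))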

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Aggregations

Basic implementation of the GROUP BY clause using in-memory data structures and aggregation functions.
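
A minimal in-memory sketch of the idea, grouping rows by a key column and folding a single aggregate (not the actual spyql design):

from collections import defaultdict

def group_by_sum(rows, key, value):
    # accumulate one running sum per group, entirely in memory
    groups = defaultdict(int)
    for row in rows:
        groups[row[key]] += row[value]
    return dict(groups)

rows = [{"dep": 1, "n": 2}, {"dep": 1, "n": 3}, {"dep": 2, "n": 5}]
print(group_by_sum(rows, "dep", "n"))  # {1: 5, 2: 5}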

Column names with TEXT format

When using TEXT as an import format, I couldn't find a way of configuring the column name. The alias seems to be applied as long as we don't use * in the SELECT statement, which we have to do for TEXT files. Am I missing something?
Here's an example:

cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
INSERT INTO "domains"("col1") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');
cat /tmp/test.txt | spyql -Otable=domains 'Select 1 as colname from text To SQL' 
INSERT INTO "domains"("colname") VALUES (1),(1),(1),(1),(1);

Expected result would be:

cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
INSERT INTO "domains"("colname") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');

SELECT DISTINCT

Basic implementation of the DISTINCT modifier of the SELECT clause using in-memory data structures.
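
A minimal sketch: keep a set of already-seen rows and only yield the first occurrence (rows are assumed to be hashable tuples here):

def distinct(rows):
    seen = set()
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            yield row

print(list(distinct([(1, "a"), (1, "a"), (2, "b")])))  # [(1, 'a'), (2, 'b')]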

Recursive column selection in JSON input

Allow json->* and json->afield->* for horizontal explosion.

  • Gather the set of fields based on a sample of the input (keys not present in the sample will be excluded)
  • Replace the star (*) column by the set of columns
  • Optional: warn if a key not present in the sample is detected along the way
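
A sketch of the sampling step, collecting the union of keys over the first rows of the input (the function name is made up):

def star_columns(sample_rows):
    # union of keys across the sample, preserving first-seen order;
    # keys that never show up in the sample are excluded
    cols = []
    for row in sample_rows:
        for key in row:
            if key not in cols:
                cols.append(key)
    return cols

sample = [{"id": 1, "name": "a"}, {"id": 2, "ts": 99}]
print(star_columns(sample))  # ['id', 'name', 'ts']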

Extend Dictionaries

Add methods to NullSafeDict to help when working with dicts:

  • dict.with(k, v): returns a dict with the new/updated pair
  • dict.drop(k1, k2, k3): returns a dict without the given keys (does not fail if a key is not found)
  • dict.withKeyRename(old_key, new_key): returns a dict with the key renamed

Add a method to allow using dicts on CSVs, etc.:

  • row / row(): constructs a dict over _values (and caches it to allow multiple calls), e.g. row->acolumn
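
A hedged sketch of the three helpers on top of a plain dict (the real versions would live on NullSafeDict, and since with is a Python keyword the final names would have to differ):

class ExtendedDict(dict):
    def with_pair(self, k, v):
        # returns a new dict with the pair added/updated
        return ExtendedDict({**self, k: v})

    def drop(self, *keys):
        # returns a new dict without the given keys; missing keys are ignored
        return ExtendedDict({k: v for k, v in self.items() if k not in keys})

    def with_key_renamed(self, old_key, new_key):
        # returns a new dict with old_key renamed to new_key
        return self.drop(old_key).with_pair(new_key, self[old_key])

d = ExtendedDict({"a": 1, "b": 2})
print(d.with_pair("c", 3).drop("a", "x").with_key_renamed("b", "bee"))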

Interactive Spyql

Hi,

I am curious to see if we can implement an interactive spyql that can be called from within any script. Querying Python objects is something pretty powerful that should be brought directly into the Python language, like how C# has LINQ. I have started to work a bit on it and here's some pseudocode I had in mind:

class Q:
  def __init__(self, query):
    self._prs, self._string_regex = parse(clean_query(query))
  
  def __call__(self, **kwargs):
    pass

And usage can be as follows:

>>> q = Q("IMPORT numpy SELECT numpy.mean(data->salary) FROM data WHERE data->name == 'akash'")
>>> q(data = data)

I could hack it as you have done in main_test.py by using a CliRunner, but that seems unintuitive for interactive use. I'd like to know your thoughts on this and some pointers to start working on it!

Missing values handling

  • reading NULL/None (e.g. on CSV and JSON)
  • handling NULL/None in expression evaluation: catch None exceptions and return None (e.g. arithmetic); also covered in #6
  • writing NULL/None
  • coalesce function (quick way to return a value when NULL/None); see the sketch below
  • tests (NULL in numeric data types, NULL in strings, arithmetic, concatenation, etc.)
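
The coalesce part could be as small as this sketch:

def coalesce(*values):
    # return the first non-None value, or None if all are None
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, None, 42))  # 42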

FROM pandas / numpy

Try iterating over pandas and numpy arrays and check whether any adjustment is needed.
Load everything into memory and don't focus on performance... only to be used with small datasets. e.g.

python3 -m spyql.spyql "SELECT col1->IGNITION FROM [[row.to_dict()] for _, row in pd.read_csv('~/Downloads/gps_fb-ma-06-13.csv', nrows=10).iterrows()] TO json"

SELECT * Modifiers

Modifiers:

  • SELECT * EXCEPT
  • SELECT * REPLACE

SELECT * EXCEPT

A SELECT * EXCEPT statement specifies the names of one or more columns to exclude from the result. All matching column names are omitted from the output.

WITH orders AS
  (SELECT 5 as order_id,
  "sprocket" as item_name,
  200 as quantity)
SELECT * EXCEPT (order_id)
FROM orders;

+-----------+----------+
| item_name | quantity |
+-----------+----------+
| sprocket  | 200      |
+-----------+----------+

SELECT * REPLACE

A SELECT * REPLACE statement specifies one or more expression AS identifier clauses. Each identifier must match a column name from the SELECT * statement. In the output column list, the column that matches the identifier in a REPLACE clause is replaced by the expression in that REPLACE clause.

A SELECT * REPLACE statement does not change the names or order of columns. However, it can change the value and the value type.

WITH orders AS
  (SELECT 5 as order_id,
  "sprocket" as item_name,
  200 as quantity)
SELECT * REPLACE ("widget" AS item_name)
FROM orders;

+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5        | widget    | 200      |
+----------+-----------+----------+

WITH orders AS
  (SELECT 5 as order_id,
  "sprocket" as item_name,
  200 as quantity)
SELECT * REPLACE (quantity/2 AS quantity)
FROM orders;

+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5        | sprocket  | 100      |
+----------+-----------+----------+

Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace

DESCRIBE dataset

Utility or keyword that automatically summarizes a dataset, e.g. by calculating # of non-null values, # of distinct values, average, min, max ...
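
A minimal sketch for a single numeric column (column handling, typing, and output format are all open questions):

def describe(values):
    non_null = [v for v in values if v is not None]
    return {
        "count": len(non_null),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": sum(non_null) / len(non_null) if non_null else None,
    }

print(describe([1, 2, 2, None, 5]))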

Create easy syntax for JSON fields

e.g. SELECT json->id, json->data->ts FROM json

Transform into json['id'] and json['data']['ts'], handling exceptions to return None in case of a key error.
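
A sketch of the target behaviour (the real transform would happen when the query is compiled, not via a helper like this):

def get_path(obj, *keys):
    # json->id, json->data->ts  ==>  get_path(json, 'id'), get_path(json, 'data', 'ts')
    try:
        for key in keys:
            obj = obj[key]
        return obj
    except (KeyError, TypeError):
        # missing key, or traversing into None, yields None instead of raising
        return None

doc = {"id": 7, "data": {"ts": 123}}
print(get_path(doc, "data", "ts"), get_path(doc, "data", "missing"))  # 123 None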

Exception handling for data issues

Exception handling is of the utmost importance. There are several situations where exceptions are raised due to the data and not to programming errors. These should be dealt with so that execution is not halted; instead, the evaluation should result in NULL/None values (e.g. NULL/None inputs, missing JSON fields, casting empty strings to numeric). Optionally, warnings should be sent to stderr. This issue should cover at least the following exceptions, always returning None:

  • Operations with None values (e.g. None + 1)
  • Accessing methods/properties of None objects (e.g. None.get(...))
  • Retrieving nonexistent keys from dictionaries (e.g. json['missing key'])

Note: in the future, a coalesce function should be available that is able to handle these exceptions and replace None with a value/expression (see #3).
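
One possible shape is a catch-all wrapper around each expression evaluation, as in this sketch (the warning format is made up):

import sys

def null_safe(fn):
    # the three exception types from the list above become None
    try:
        return fn()
    except (TypeError, AttributeError, KeyError) as e:
        print(f"warning: {e!r} -> NULL", file=sys.stderr)
        return None

print(null_safe(lambda: None + 1))             # None
print(null_safe(lambda: {"a": 1}["missing"]))  # None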

Key value json output and key value loader to emulate left joins

Allow loading dictionaries (e.g. key/value pairs from JSON files) and doing dictionary look-ups in the query.
This would cover a kind of LEFT JOIN with an equality condition, e.g.

SELECT adict[id] ... 

If id does not exist in adict, a NULL value should be returned.
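
With a plain Python dict, this is essentially dict.get, which already returns None for missing keys:

# emulate a LEFT JOIN lookup: a missing id yields None instead of failing
customers = {"1": "alice", "2": "bob"}
for order_id in ["1", "3"]:
    print(order_id, customers.get(order_id))  # '1 alice', then '3 None'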

Initialization Code

Command line option to allow running Python initialization code.
The most typical use cases are importing libraries, initializing variables, and defining functions.

Real-world example

Could you please give an example of how to convert this real-world JSON to SQLite:

{
    "tradeHistory":{
       "orders":[
          {
             "id":"aaaa",
             "name":"a",
             "subOrders":[
                
             ],
             "cash":null
          },
          {
             "id":"bbbb",
             "name":"b",
             "subOrders":[
                
             ],
             "cash":null
          }
       ],
       "pageCount":2,
       "isSuperWikifolio":false
    },
    "tooltips":{
       "download-price":"<div>aaaa</div>",
       "download-account-statement":"<div>bbbb</div>"
    }
}

The table should hold the orders only. I suppose I'd have to convert it to single-line JSON with jq, but that already fails.

Multiple EXPLODEs

Allow multiple explosions in a single EXPLODE statement.

Some possible syntaxes:

  • EXPLODE json->i_am_an_array[]->i_am_an_object->i_am_another_array[]
  • EXPLODE json->i_am_an_array, json->i_am_an_array->i_am_an_object->i_am_another_array

The 2nd is easier to parse and allows for explosions in different branches, but it is more verbose.
It is also explicit about the order of explosions.

Other issues

This is a backlog to be converted into issues as we go 😅 🔨

Backlog (after 1st release)

Tentative (but unlikely)

Next

  • prev object/dict/list/function (choose) with data from the previous line, e.g. prev(col1) or prev->col1
  • EXECUTE clause to execute python statements for each row. e.g. create a variable to simplify query: EXECUTE cust = json->order->customer SELECT cust->id, cust->name, cust->age FROM JSON.

Later / ideas

  • Other file formats: ARFF, XML, YAML, ...
  • When writing to SQL, CREATE TABLE IF NOT EXISTS (data types vary with the DB; see csvsql: https://github.com/wireservice/csvkit)
  • aggregation options (print every record, print every x sec, by window)
  • windows + watermarks (streaming)?
  • read from files (with wildcards)
  • window functions: percentile, median (only on sorted input)
  • when reading from files, get file size info to make planning decisions

Support reading and writing of JSON objects (not JSON lines)

Currently SPyQL only allows reading and writing JSON lines. Writing JSON arrays can be done with dict_agg, aggregating everything into an array and writing a JSON file with a single line.

The idea is to add a lines=True argument to the json and orjson writers and processors. The processor should be able to handle single-object files as well as arrays of objects or arrays of scalars. When lines is False, the processor should load the full input into memory and then parse it. While this is not ideal, it is the most straightforward implementation. In addition, JSON arrays shouldn't be used for large data; in those cases, JSON lines should be used instead.

The writer should write an array of objects when lines is False.
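
A sketch of the processor side under the proposed flag (the lines argument name comes from the issue; everything else is an assumption):

import json

def read_json(text, lines=True):
    if lines:
        # JSON lines: one object per line, streamed lazily
        return (json.loads(line) for line in text.splitlines() if line.strip())
    # lines=False: load the whole input, then normalize to a list of rows
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

print(list(read_json('{"a": 1}\n{"a": 2}')))
print(read_json('[{"a": 1}, {"a": 2}]', lines=False))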
