dcmoura / spyql
Query data on the command line with SQL-like SELECTs powered by Python expressions
Home Page: https://spyql.readthedocs.io
License: MIT License
Some cases where the code could be more pythonic:
- `for` cycles where `range(len(l))` might be replaced by `enumerate`
- `yield` might be used to return a generator

Changes should be assessed for performance.
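The two patterns above can be sketched as follows (illustrative only, not code from the repo):

```python
# Instead of `for i in range(len(l)): ... l[i] ...`, iterate with enumerate:
def indexed_pairs(items):
    return [(i, item) for i, item in enumerate(items)]

# Instead of building a full list, return a generator via yield:
def squares(items):
    for x in items:
        yield x * x
```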
Allow using pandas writing functions to cover more output formats. Don't focus on performance; focus on bringing new formats to small datasets.
Currently explode has the following behaviour:
SELECT row.name, row.departments
FROM [
{"name": "Alice", "departments": [1,4]},
{"name": "Bob", "departments": [2]},
{"name": "Charles", "departments": []}
]
EXPLODE row.departments
TO json
Results in:
{"name": "Alice", "departments": 1}
{"name": "Alice", "departments": 4}
{"name": "Bob", "departments": 2}
The Charles row is omitted because of the empty array. Alternatively, the row could be kept, with a null value, when `departments` is empty or NULL. The result would be:
{"name": "Alice", "departments": 1}
{"name": "Alice", "departments": 4}
{"name": "Bob", "departments": 2}
{"name": "Charles", "departments": null}
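A minimal sketch of both behaviours (names and the `outer` flag are hypothetical, not spyql's actual API):

```python
def explode(rows, key, outer=False):
    """Explode a list-valued field into one row per element.

    With outer=False (current behaviour), rows whose list is empty or
    None are dropped. With outer=True (the proposal), they are kept
    with the exploded field set to None."""
    for row in rows:
        values = row.get(key)
        if not values:
            if outer:
                yield {**row, key: None}
            continue
        for v in values:
            yield {**row, key: v}

rows = [
    {"name": "Alice", "departments": [1, 4]},
    {"name": "Bob", "departments": [2]},
    {"name": "Charles", "departments": []},
]
```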
Also:
Ignore / error / warning on lines where the JSON is not valid
We started by putting some recipes in the README, but at some point we will have to write documentation.
Proposed tools:
Example:
JIRA
||heading 1||heading 2||heading 3||
|col A1|col A2|col A3|
|col B1|col B2|col B3|
HTML
| Month   | Savings |
|---------|---------|
| January | $100    |
\begin{table}[]
\begin{tabular}{l|l|l}
id & name & add \\ \hline
1 & hello & www \\
2 & planet & biniam \\
3 & earth & com
\end{tabular}
\end{table}
A | B |
---|---|
1 | 2 |
3 | 4 |
e.g.
INIT a = 10; b=2
SELECT col1 * a + b
FROM ...
e.g. for `json->a->b` the automatic column alias would be `b` or `a_b`
`abs(x)` -> `abs_x`

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
read for good examples:
Basic implementation of the group by clause using in-memory data structures and aggregation functions.
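An in-memory GROUP BY can be sketched like this (illustrative names, not spyql internals):

```python
from collections import defaultdict

def group_by(rows, key, agg):
    """In-memory GROUP BY: bucket rows by the key column, then apply an
    aggregation function to each bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row)
    return {k: agg(group) for k, group in buckets.items()}
```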
When using TEXT as an input format, I couldn't find a way of configuring the column name. The column alias seems to be evaluated only when we don't specify `*` in the SELECT statement, which we have to do for TEXT files. Am I missing something?
Here's an example:
cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
INSERT INTO "domains"("col1") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');
cat /tmp/test.txt | spyql -Otable=domains 'Select 1 as colname from text To SQL'
INSERT INTO "domains"("colname") VALUES (1),(1),(1),(1),(1);
Expected result would be:
cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
INSERT INTO "domains"("colname") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');
Basic implementation of the DISTINCT modifier of the SELECT clause using in-memory data structures.
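One in-memory approach (a sketch, not the repo's implementation): since rows are dicts and therefore unhashable, build a hashable key from the sorted items and keep the first occurrence, preserving input order.

```python
def distinct(rows):
    """In-memory DISTINCT: yield each unique row once, in input order."""
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            yield row
```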
Don't throw unnecessary stack traces for user errors.
Show contextual, meaningful information that helps users locate and fix errors.
Allow `json->*` and `json->afield->*` for horizontal explosion: the `*` column is replaced by the set of columns it expands to.

Basic implementation of the ORDER BY clause using in-memory data structures.
Take a look at csvstat: https://github.com/wireservice/csvkit
Add methods to `NullSafeDict` to help working with dicts:
- `dict.with(k, v)`: returns dict with new/updated pair
- `dict.drop(k1, k2, k3)`: returns dict without the given keys (does not fail if keys are not found)
- `dict.withKeyRename(old_key, new_key)`: returns dict with key renamed

Add a method to allow using dicts on CSVs, etc:
- `row` / `row()`: constructs a dict over `_values` (and caches the row to allow multiple calls), e.g. `row->acolumn`
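A sketch of the proposed methods on a plain `dict` subclass. Note that `with` is a reserved keyword in Python, so the method would need another name (`with_` below); all names here are illustrative.

```python
class NullSafeDict(dict):
    # `with` is a Python keyword, so the method is named `with_` here.
    def with_(self, k, v):
        """Return a new dict with the pair added/updated."""
        return NullSafeDict({**self, k: v})

    def drop(self, *keys):
        """Return a new dict without the given keys (missing keys are ignored)."""
        return NullSafeDict({k: v for k, v in self.items() if k not in keys})

    def with_key_rename(self, old_key, new_key):
        """Return a new dict with old_key renamed to new_key."""
        if old_key not in self:
            return NullSafeDict(self)
        return NullSafeDict(
            {(new_key if k == old_key else k): v for k, v in self.items()}
        )
```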
Hi,
I am curious to see if we can implement an interactive spyql that can be called from within any script. Querying Python objects is something pretty powerful that should be brought directly into the Python language, like how C# has LINQ. I have started to work a bit on it, and here's some pseudo code I had in mind:
class Q:
    def __init__(self, query):
        self._prs, self._string_regex = parse(clean_query(query))

    def __call__(self, **kwargs):
        pass
And usage can be as follows:
>>> q = Q("IMPORT numpy SELECT numpy.mean(data->salary) FROM data WHERE data->name == 'akash'")
>>> q(data = data)
Though I could hack it as you have done in main_test.py by using a CliRunner, that seems unintuitive for interactive use. I'd like to know your thoughts on this and get some pointers to start working on it!
Try iterating over pandas and numpy arrays and check if any adjustment is needed.
Load everything into memory and don't focus on performance... only to be used with small datasets. e.g.
python3 -m spyql.spyql "SELECT col1->IGNITION FROM [[row.to_dict()] for _, row in pd.read_csv('~/Downloads/gps_fb-ma-06-13.csv', nrows=10).iterrows()] TO json"
Modifiers:
A SELECT * EXCEPT statement specifies the names of one or more columns to exclude from the result. All matching column names are omitted from the output.
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * EXCEPT (order_id)
FROM orders;
+-----------+----------+
| item_name | quantity |
+-----------+----------+
| sprocket | 200 |
+-----------+----------+
A SELECT * REPLACE statement specifies one or more expression AS identifier clauses. Each identifier must match a column name from the SELECT * statement. In the output column list, the column that matches the identifier in a REPLACE clause is replaced by the expression in that REPLACE clause.
A SELECT * REPLACE statement does not change the names or order of columns. However, it can change the value and the value type.
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * REPLACE ("widget" AS item_name)
FROM orders;
+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5 | widget | 200 |
+----------+-----------+----------+
WITH orders AS
(SELECT 5 as order_id,
"sprocket" as item_name,
200 as quantity)
SELECT * REPLACE (quantity/2 AS quantity)
FROM orders;
+----------+-----------+----------+
| order_id | item_name | quantity |
+----------+-----------+----------+
| 5 | sprocket | 100 |
+----------+-----------+----------+
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace
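Over dict-shaped rows, both modifiers can be sketched in a few lines (hypothetical helpers, not spyql's implementation; `REPLACE` expressions are modeled as callables on the row):

```python
def select_except(row, excluded):
    """SELECT * EXCEPT: drop the named columns from a row."""
    return {k: v for k, v in row.items() if k not in excluded}

def select_replace(row, replacements):
    """SELECT * REPLACE: same columns and order, but values for matched
    columns are replaced by the result of an expression on the row."""
    return {
        k: (replacements[k](row) if k in replacements else v)
        for k, v in row.items()
    }
```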
Utility or keyword that automatically summarizes a dataset, e.g. by calculating the number of non-null values, number of distinct values, average, min, max, ...
Memory-based processor that only stores the top/bottom rows in queries of type:
ORDER BY ...
LIMIT x
where x
is not too large.
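The standard library's `heapq` already provides this bounded top/bottom-x behaviour; a sketch (function name is illustrative):

```python
import heapq

def order_by_limit(rows, key, x, descending=False):
    """ORDER BY ... LIMIT x without sorting everything: heapq keeps only
    about x candidates in memory while scanning the input."""
    if descending:
        return heapq.nlargest(x, rows, key=key)
    return heapq.nsmallest(x, rows, key=key)
```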
Go black!
e.g. SELECT json->id, json->data->ts FROM json
Transform to json['id'], json['data']['ts']
and handle exceptions to return None in case of key error.
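The translation target could be wrapped in a helper like this (a sketch, not spyql's actual code):

```python
def get_path(obj, *keys):
    """json->id style access: walk nested dicts, returning None instead of
    raising KeyError/TypeError when a key is missing along the path."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj
```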
Exception handling is of utmost importance. There are several situations where exceptions are raised due to the data and not to programming errors. These should be dealt with so that execution is not halted; instead, NULL/None values should result from the evaluation (e.g. NULL/None values, a missing JSON field, casting empty strings to numeric). Optionally, warnings should be sent to stderr. This issue should cover at least the following exceptions, always returning `None`:
- operations on None (e.g. `None + 1`)
- method calls on None (e.g. `None.get(...)`)
- missing JSON keys (e.g. `json['missing key']`)

Note: in the future, a `coalesce` function should be available that is able to handle these exceptions, replacing `None` by a value/expression (look into #3).
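One way such a `coalesce` could work is by taking the expression as a callable so the exception can be caught inside (a sketch under that assumption; the exception list is illustrative):

```python
def coalesce(thunk, default):
    """Evaluate thunk(); on a data-level exception, or if the result is
    None, fall back to the default value."""
    try:
        result = thunk()
    except (TypeError, AttributeError, KeyError, ValueError):
        return default
    return default if result is None else result
```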
e.g.
DO name = json->customer->name
SELECT name
FROM json
WHERE name.startswith('A')
`DO` might not be the best keyword; e.g., in Postgres it is used to execute a single block of code.
Allow loading dictionaries (e.g. key/value pairs from JSON files) and doing dictionary look-ups in the query. This would cover a kind of LEFT JOIN with an equality condition, e.g.
SELECT adict[id] ...
If `id` does not exist in `adict`, a NULL value should be returned.
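In plain Python the semantics reduce to `dict.get`, which returns None for missing keys; a sketch of the row-level behaviour (names are hypothetical):

```python
def left_join(rows, adict, key, out):
    """LEFT JOIN with an equality condition via dict look-ups: each row
    gains an `out` field; unmatched keys yield None."""
    for row in rows:
        yield {**row, out: adict.get(row.get(key))}
```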
Command line option to allow running Python initialization code.
Most typical cases should be importing libraries, initializing variables and defining functions.
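Internally this could boil down to `exec`-ing the user's code into a namespace that the query evaluator then uses as its globals (a sketch, assuming that design):

```python
def run_init_code(code):
    """Execute user-supplied initialization code (imports, variables,
    function definitions) and return the resulting namespace."""
    namespace = {}
    exec(code, namespace)
    return namespace
```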
Could you please give an example of how to convert this real-world JSON to SQLite:
{
"tradeHistory":{
"orders":[
{
"id":"aaaa",
"name":"a",
"subOrders":[
],
"cash":null
},
{
"id":"bbbb",
"name":"b",
"subOrders":[
],
"cash":null
}
],
"pageCount":2,
"isSuperWikifolio":false
},
"tooltips":{
"download-price":"<div>aaaa</div>",
"download-account-statement":"<div>bbbb</div>"
}
}
The table should hold the orders only. I suppose I'd have to convert it to single-line JSON with jq, but that already fails.
Allow multiple explosions in a single EXPLODE statement.
Some possible syntaxes:
EXPLODE json->i_am_an_array[]->i_am_an_object->i_am_another_array[]
EXPLODE json->i_am_an_array, json->i_am_an_array->i_am_an_object->i_am_another_array
The second is easier to parse and allows for explosions in different branches, but it is more verbose.
The second is also explicit about the order of explosions.
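Simplified to top-level list-valued fields, the second syntax (an ordered list of explosions) could be sketched as successive unrolls (illustrative only, and it sidesteps the nested-path parsing question):

```python
def explode_path(rows, *keys):
    """Apply successive explosions in order: each key names a list-valued
    field that is unrolled before the next key is processed."""
    if not keys:
        yield from rows
        return
    key, rest = keys[0], keys[1:]
    for row in rows:
        for value in row.get(key) or []:
            yield from explode_path([{**row, key: value}], *rest)
```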
https://stackabuse.com/command-line-arguments-in-python/
- `--unbuffered` (`-u`): e.g. calling flush on every write
- `--sample-size` (`-s`?): CSV, JSON, ...
- `--batch-size`: SQL
- `--quiet` (`-q`), `--verbose` (`-v`), or a log level
- `-w`/`-W` to control warnings: show/ignore/error
- add command line help
This is a backlog to be converted into issues as we go 😅 🔨
- instead of `datetime.datetime(2020,11,10,19,55,3)`, use `$2020-11-10 19:55:04$` or `to_datetime('2020-11-10 19:55:04')`
- `prev(col1)` or `prev->col1`
- `EXECUTE cust = json->order->customer SELECT cust->id, cust->name, cust->age FROM JSON`
- `.for handy search on strings`
Leave your comments for the JSON benchmark here.
Currently SPyQL only allows reading and writing JSON lines. Writing JSON arrays can be done using `dict_agg`, aggregating everything into an array and writing a JSON file with a single line.

The idea is to add an argument `lines=True` to the json and orjson writers and processors. The processor should be able to handle single-object files as well as arrays of objects or arrays of scalars. When `lines` is `False`, the processor should load the full input into memory and then parse it. While this is not ideal, it is the most straightforward implementation. In addition, JSON arrays shouldn't be used for large data; in those cases JSON lines should be used instead.

The writer should write an array of objects when `lines` is `False`.
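A sketch of the reader/writer pair with the proposed `lines` flag (standalone functions here; in spyql this would live in the processor/writer classes):

```python
import json

def read_json(text, lines=True):
    """Parse JSON lines (one object per line) or, with lines=False, a
    whole JSON document: a single object, or an array of objects/scalars.
    lines=False loads the full input into memory before parsing."""
    if lines:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

def write_json(records, lines=True):
    """Write JSON lines, or a single JSON array when lines=False."""
    if lines:
        return "\n".join(json.dumps(r) for r in records)
    return json.dumps(records)
```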
e.g. delimiter, auto on/off, header, etc
Using temporary files when needed