Comments (9)
That is good to know, thank you for your feedback! Let me think on it.
from spyql.
Right now we only support joining with a JSON file. The main data source can be in any data format, but the secondary data source must be JSON. You can translate a CSV to JSON, though. Here is an example: https://danielcmoura.com/blog/2022/spyql-cell-towers/
from spyql.
Proper JOINs are in the roadmap but it might take months until getting there, unless someone steps in.
from spyql.
Thank you, @dcmoura !
Proper JOINs would not be easy from scratch, but I would suggest using pandas like this:
import pandas as pd
left_df = pd.read_csv(left_file, dtype=str, keep_default_na=False)
right_df = pd.read_csv(right_file, dtype=str, keep_default_na=False)
df = left_df.merge(right_df, **kwargs)
df.to_csv(out_file, index=False)
pandas does not need to be a requirement of spyql.
An error can be returned if join feature is used but pandas is not installed.
Pipe feature (with "spy" output) for pandas dataframes can be omitted if difficult (in the first release).
from spyql.
Currently, I use pandas to join CSV files, which means I need to prepare and manage Python scripts.
If spyql supports joining CSV files, I can save more time.
from spyql.
Thank you, @dcmoura ! Proper JOINs would not be easy from scratch, but I would suggest using pandas like this:
Using pandas goes against the principles of SPyQL. Pandas adds a very large overhead, and loads everything into memory. Still, thank you for your suggestion.
In the meanwhile, I suggest you use something like the following. I am showing how can we JOIN two csv files using spyql.
$ cat example1.csv
id, name, age
1, Ana, 26
2, Jane, 31
3, Richard, 42
4, Samuel, 23
$ cat example2.csv
date, ammount, user_id
2022-02-01, 100.0, 3
2022-03-05, 25.1, 1
2022-03-15, 93.2, 1
2022-04-01, 50.0, 2
example1.csv
is a list of users. We want to add user info to example2.csv
. So, we will convert example1.csv
to JSON and then use it to JOIN with example2.csv
.
$ spyql "SELECT dict_agg(id, .) AS json FROM csv('example1.csv') TO json" > example1.json
$ spyql -Jusers=example1.json "SELECT *, users[user_id].name AS user_name, users[user_id].age AS user_age FROM csv('example2.csv') TO pretty"
date ammount user_id user_name user_age
---------- --------- --------- ----------- ----------
2022-02-01 100 3 Richard 42
2022-03-05 25.1 1 Ana 26
2022-03-15 93.2 1 Ana 26
2022-04-01 50 2 Jane 31
Hope this helps.
from spyql.
Thank you for your suggestion, @dcmoura !
- Hmmm, unfortunately, that seems a bit complicated to me. Hope syntax similar to SQL can be used, as supported by
clickhouse local
as follows:
./clickhouse local -q "SELECT u.full_name, h.text FROM file('hackernews.csv', CSVWithNames) h \
JOIN file('users.tsv', TSVWithNames) u ON (u.username = h.by) WHERE NOT empty(text) AND length(text) < 50"
References:
https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local
https://clickhouse.com/docs/en/operations/utilities/clickhouse-local/
- How about
modin
as an optional pandas alternative?modin
should be able to run pandas code (without modifying).modin
claims: "Modin is a drop-in replacement for pandas. ... Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory."
Install
pip install modin
Python code:
try:
import modin.pandas as pd
except Exception:
import pandas as pd
from spyql.
- Hmmm, unfortunately, that seems a bit complicated to me. Hope syntax similar to SQL can be used, as supported by
clickhouse local
Right now supporting JOINs (as in the SQL syntax) is not our top priority, but I hope we get there soon enough. Out of curiosity @Minyus, is there any particular advantage of SPyQL over clickhouse local for your use case?
- How about
modin
as an optional pandas alternative?modin
should be able to run pandas code (without modifying).
Thank you again for your suggestion, but we try to keep our list of dependencies as short as possible. And I would not put a core feature (such as a JOIN) depending on an optional package. The JOIN most probably will have to be implemented from scratch.
from spyql.
Advantages of SPyQL over clickhouse local for my use case are:
- SpyQL can be installed easily by pip
- SpyQL supports Python objects/functions (e.g. pathlib.Path)
from spyql.
Related Issues (20)
- Column names with TEXT format HOT 3
- Interactive Spyql HOT 2
- EXPLODE OUTER
- JSON benchmark HOT 17
- 404 status on downloading sample.csv file HOT 2
- Support reading and writing of JSON objects (not JSON lines) HOT 1
- Support for line comments
- `parse_structure` performance can be improved by less regular expressions HOT 2
- UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2854: character maps to <undefined> HOT 4
- [FR] COUNT(DISTINCT col1) ... GROUP BY HOT 3
- Cannot run CI/CD pipe on Python 3.11
- general text file support HOT 6
- Cannot dump JSON object with Null values to string HOT 2
- Function to dump columns that are dictionaries as json
- how to get distinct top level json keys HOT 4
- improve error messaging when I'm holding it wrong HOT 1
- [Feature request] Support multiple tables in FROM clause HOT 1
- [Feature request] Define data access interface HOT 1
- Double quotes HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from spyql.