uwdata / arquero Goto Github PK
View Code? Open in Web Editor NEWQuery processing and transformation of array-backed data tables.
Home Page: https://uwdata.github.io/arquero
License: BSD 3-Clause "New" or "Revised" License
Query processing and transformation of array-backed data tables.
Home Page: https://uwdata.github.io/arquero
License: BSD 3-Clause "New" or "Revised" License
Right now, if I do op.sum(colName)
and all the values of colName
are null
within a group, the aggregated value for that group is 0.0
. IMO it would be more sensible to return null
in this case.
Example:
dt = aq.from([{ foo: 1, bar: 1 }, { foo: 1, bar: 3 }, { foo: 2, bar: null }])
dt.groupby('foo').derive({bar: op.sum('bar')})
First congratulations for that impressive work, which i consider, being a R user and a D3 fan, as a huge step forward for live and sexy datavisualisation and dataflows!
It looks like Arquero, from CSV for instance, is able to infer column types (Date, Numeric, String...), as we can see with the view() display (columns right or left-alignement), or by testing values type with typeof
. Could that information be exposed in the table object, allowing for instance to test for numeric columns only?
A cool feature, starting from this, could be for instance:
select(1, aq.isNumeric())
or groupby(v1).rollup(*...here sum all numeric variables keeping same name...*)
I am used to this convenient R/dplyr syntax : summarise_if(is_numeric, sum)
or summarise( across(where(is_numeric), sum) )
First, this looks amazing and thank you for creating it.
I am could not find a built in method in the api docs to understand the dataframe/table structure that is created. In R:dplyr
the underlying dataframe/tibble provides a str
function to see type of each column vector. In Python:Pandas
, there is dtypes
method. In JS:arquero
, the underlying array is 'read as-is' and there is no-requirement/enforcement of a uniform type
for each column vector. So there is no built in-method to see what is the current column type being worked on?
I can extend it using the built-in extension. Just wanted to confirm that that is the current case.
Hi,
I'm trying to derive
many keys with different params
:
table
.params({date: 'Jan'})
.derive({Jan: (e,$) => concat($.date, e) })
.params({date: 'Feb'})
.derive({Feb: (e,$) => concat($.date, e) })
However the months can be 3, 6, or 12. So I need to calculate the derive function using a object that looks like
params = [{date: 'Jan'}];
derive = {
Jan : f()...,
Feb: f()...,
};
Without a fluent api I can imagine doing something like
table.
notfluent([
{params: params},
{derive: derive}
])
Is there a way to achieve this in Arquero? Alternatively I'll have to append the array to the existing Arquero table, do a spread, and rename the columns? Is that the solution that is advisable? This means I would do the calculations outside arquero, which might defeat the point?
As suggested by @bmschmidt in a comment on Observable.
Maybe it could be interesting to add an operation to recode the values of a categorical variable, similar to the fct_recode
function of the forcats
R package. I don't think this already exists in Arquero, but I may have missed something.
Such a function could be something like this :
recode = function(value, recoding, other) {
const keys = [...recoding.keys()];
if (keys.includes(value)) return recoding.get(value);
if (other !== undefined) return other;
return value;
}
The idea would be to pass a value, a Map
as recoding
argument, and an optional other
argument which, if defined, would be returned instead of value if value is not found in the Map
. So for example :
let recoding = new Map([["foo", "bar"],[1,2]])
recode("foo", recoding) // returns "bar"
recode(1, recoding) // returns 2
recode("hello", recoding) // returns "hello"
recode("hello", recoding, "other") // returns "other"
I'm not a good JavaScript developer so there may be a much better implementation, but I think such an operation could be very useful for dealing with categorical variabels with Arquero.
Thanks !
I am troubled with the fact that get() ignores current filter, making it (a little bit) difficult to access data in a filtered/ordered table.
And i am also having that strange result here (undefined when i expect 4 or ideally 5):
const dt = aq.table({ a: [1, 2, 3], b: ["4", "5", "6"] })
dt.filter(d => d.a > 1)
.derive({b: d => +d.b})
.column('b').get(0)
As an extension to the join functions,
i find these tidyverse functions really useful : https://dplyr.tidyverse.org/reference/rows.html
Especially 2 of them:
rows_update : like SQL UPDATE
rows_upsert : UPDATE and also INSERT missing rows
Hi,
I was trying to use the library in the context of the Ember.js application, however, the module won't load with the default setup. Here is the error message. Do you have any advice on how to avoid it?
ERROR in ./node_modules/arquero/src/arrow/encode/profiler.js 99:40
Module parse failed: Identifier directly after number (99:40)
You may need an appropriate loader to handle this file type.
| return Type.Float64;
| } else if (p.bigints === valid) {
> var _v = -p.min > p.max ? -p.min - 1n : p.max;
|
| return p.min < 0 ? _v < Math.pow(2, 63) ? Type.Int64 : error("BigInt exceeds 64 bits: ".concat(_v)) : p.max < Math.pow(2, 64) ? Type.Uint64 : error("BigInt exceeds 64 bits: ".concat(p.max));
@ ./node_modules/arquero/src/arrow/encode/data-from-objects.js 2:0-37 9:12-19
@ ./node_modules/arquero/src/arrow/encode/index.js
@ ./node_modules/arquero/src/index.js
Hello!
I just tripped over a small inaccuracy: in the documentation for fromCSV, the option object's property for specifying the delimiter is referenced as delimiter, whereas in the respective code it says delim:
arquero/src/format/from-csv.js
Line 36 in f4e03eb
Therefore code like aq.fromCSV(text, { delimiter: ';' })
does not work as intended.
I tried this:
import {table} from "arquero";
const myTable = table({
x: [1610109901000, 1610109902000, 1610109903000, 1610109904000],
y: [600, 610, 590, 602]
});
I got a warning from IntelliJ (both in javascript and in typescript): Invalid number of arguments, expected 2
This is because the names
parameter of the table
function is not marked as optional.
Lines 84 to 96 in 3083ad3
Add the following code to your jsdoc: @param {string[]?} names
I would also suggest to review the typings of all functions in your code π
Extremely minor documentation issue. All the [Source]
links on the page https://github.com/uwdata/arquero/blob/master/docs/api/verbs.md link to https://github.com/uwdata/arquero/blob/master/src/table/table.js, but the functions aren't actually listed there. The relevant code is actually at (eg) https://github.com/uwdata/arquero/blob/master/src/engine/select.js.
Thanks for the thorough docs, BTW.
@jheer and others, this is already looking like a great library for us at OmniSci, where we just added the ability to return SQL query results over the wire as Arrow buffers π π
Wondering if it is possible to serialize the results of an Arquero query to Arrow - it would make for a composable pipeline e.g. for use cases where you retrieve a dataset from OmniSci, wrangle it client-side with Arquero, and then either visualize it with Vega-based charts or things like deck.gl for 3d charts, for example.
We're happy to help in any way we can!
I love pivot
and fold
fonctions in Arquero, they are so powerful !
I just ran into few small difficulties.
Here is a use case:
t = aq.table({x:['01','01','01','02','02','02'], y:['1','2','3','1','2','3'], z:[123,256,854, 652,734,222]})
x | y | z |
---|---|---|
01 | 1 | 123 |
01 | 2 | 256 |
01 | 3 | 854 |
02 | 1 | 652 |
02 | 2 | 734 |
02 | 3 | 222 |
I pivot that table:
t.groupby(0).pivot(1,2)
The problem is that x
is now located on the right, i'd like it to stay in the first position.
1 | 2 | 3 | x |
---|---|---|---|
123 | 256 | 854 | 01 |
652 | 734 | 222 | 02 |
I am trying all these variants, but in vain, x
doesn't want to move ;)
t.groupby(0).pivot(1,2).relocate('x', {before:0}).view() ;
t.groupby(0).pivot(1,2).relocate('x', {before:1}).view() ;
t.groupby(0).pivot(1,2).relocate('x', {before:"1"}).view() ;
Maybe i am missing something there. Of course i can write a select.
The fact that new column names are not by default prefixed probably does not help.
My humble suggestions would be:
pivot
, make the groupby columns appear at firstprefix
option in pivot
(and fold
as well)For instance:
p = t.groupby(0).pivot(1,2, {prefix:'age_'})
could produce:
x | age_1 | age_2 | age_3 |
---|---|---|---|
01 | 123 | 256 | 854 |
02 | 652 | 734 | 222 |
and conversely:
p.fold(aq.range(1,3),{as:['y','z'], prefix:'age_'})
would take me back to the original table.
For the moment i use for pivot
something like:
p = t.groupby(0).pivot({ d: d => 'age_' + d.y },2, {prefix:'age_'})
But after a fold
i have to write something like this this, which i find not very elegant:
.derive({y: d => op.substring(d.y,4)})
cf. for inspiration (and old habits):
https://tidyr.tidyverse.org/reference/pivot_longer.html (aka fold) and names_prefix option
https://tidyr.tidyverse.org/reference/pivot_wider.html (aka pivot) and names_prefix option
I think I found a problem with table.antijoin()
, which is evidenced if you try to place this file on the arquero's test
directory and run it with node:
const aq = require('../')
const assert = require('assert')
const tb1 = aq.table({
id: [1, 2, 3]
})
const tb2 = aq.table({
id: [1, 2]
})
const tb3 = tb1.antijoin(tb2)
tb3.print()
assert.equal(tb3.numRows(), 1)
assert.equal(tb3.get('id', 0), 3)
This fails with the following error:
node:assert:119
throw new AssertionError(obj);
^
AssertionError [ERR_ASSERTION]: 1 == 3
at Object.<anonymous> (/Users/pedroteixeira/projects/deci/arquero/test/antijoin.js:17:8)
at Module._compile (node:internal/modules/cjs/loader:1108:14)
at Object.Module._extensions..js (node:internal/modules/cjs/loader:1137:10)
at Module.load (node:internal/modules/cjs/loader:973:32)
at Function.Module._load (node:internal/modules/cjs/loader:813:14)
at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
at node:internal/main/run_main_module:17:47 {
generatedMessage: true,
code: 'ERR_ASSERTION',
actual: 1,
expected: 3,
operator: '=='
}
The output table is:
βββββββββββ¬βββββ
β (index) β id β
βββββββββββΌβββββ€
β 0 β 3 β
βββββββββββ΄βββββ
But the assertion fails.
The problem goes away if I reify the result of the antijoin call:
const tb3 = tb1.antijoin(tb2).reify()
If you point me to where the problem lays in the code I can try to fix it.
With R, it is very convenient to create a column and specify where to insert it:
mutate( newvar = ..., .after = var1 )
hence avoiding to respecify the desired column order with a full select.
Could derive
encompass a new parameter similar to .after or .before?
Arquero does not seem to take into account a filter applied to an arrow table.
Here is my use case:
I have quite a big arrow table:
arrow_tb.count()
=> 6159296
I filter it with arrow functions, which is very fast, especially with dictionary-encoded columns:
arrow_filtered_tb = arrow_tb.filter(arrow.predicate.col('NIVGEO').eq('COM')
.and(arrow.predicate.col('CS1_8').eq('2'))
.and(arrow.predicate.col('AGEQ65').eq('60'))
.and(arrow.predicate.col('SEXE').eq('2') )
)
arrow_filtered_tb.count()
=> 34951
I load it as an Arquero table and compute its number of rows:
aq.fromArrow(arrow_filtered_tb).numRows()
=> 6159296, which is not what i expect
Apparently, the arrow method column.toArray() exports the whole unfiltered column.
The arrow scan method looks more effective, i'd like to avoid here the conversion to objects before converting to arquero table :
nbRowsFilteredAfterScan = {
let codgeo, nb, result = [] ;
arrow_filtered_tb.scan((idx) => {
result.push({'codgeo': codgeo(idx), 'nb': nb(idx)});
}, (batch) => {
codgeo = arrow.predicate.col('CODGEO').bind(batch);
nb = arrow.predicate.col('NB').bind(batch);
});
return aq.from(result).numRows() ;
}
=> 34951
Hi,
First of all, many thanks for this awesome library that makes working with tabular data much more natural. As an Observable newbie coming from R and dplyr, I really appreciate it.
A small question : when joining two tables, the key column(s) are duplicated in the resulting joined table. For example, if both tables are merged on the same key
column, the resulting table will have both a key_1
and a key_2
columns. In other languages or libraries I'm accustomed to the fact that the result only has a key
column, as the original one.
Please excuse me if it is the intended behavior or if I am misunderstanding something.
Thanks.
These are suggestions from a R practice, with handy functionalities i frequently use:
select
current possibilities such as selecting by name, indices, not
, range
are excellent. I'd like also: starts_with
, contains
, and the possibility to test on column types (for instance select all numeric columns)
rename
of course select in Arquero allows to rename, i have noticed that.
Within R, janitor::clean_names is very useful for tidying and fixing complex column_names that you can find, sometimes, in CSV files (with blanks, accented characters, caps) => DB compliant column names (one single world, lowercase, without accented or special chars...)
While trying to figure out how to set a lag programatically, I hit some unexpected behavior.
I wanted to do:
aq.table({"n": [1,2,3,4]})
.params({lag: 1})
.derive({prod: (d, $) => d.n - op.lag(d.n, $.lag)})
.view()
But it raises: TypeError: t.param is not a function
Probably I'm just not finding the right place to set params to an op. (Or need to write a new one?).
But even the following raises an error, although the op.lag
function does not ask for access to .params,
which seems less than ideal.
aq.table({"n": [1,2,3,4]})
.params({lag: 1})
.derive({prod: (d, $) => d.n * $.lag - op.lag(d.n, 1)})
.view()
Thanks again for this great library.
If you load a table using aq.fromArrow, date type information is preserved;
but after some operations, the dates revert to being represented as some kind of integers, not as datetime.
Apologies for not providing an example inline, because I'm not confident about generating
an Arrow date field in Javascript. In the notebook at https://observablehq.com/@bmschmidt/a-files,
you can see that date information is preserved after a filter
table.filter(d => d.country=="Mexico").select("birth date").view()
gives
birth date
--
1915-06-21T00:00:00.000Z
1900-04-14T00:00:00.000Z
1899-04-15T00:00:00.000Z
1886-04-12T00:00:00.000Z
But after a sample or slice, you get ints:
table.sample(10).select("birth date").view()
gives
birth date
--
-1841614848
1234978816
-275112960
689860608
Recently I'm doing some COVID-19 stats trying to utilize Arquero lib.
Most of my operations are perfectly ported to Arquero APIs and worked well.
One of my key complaint is the lack of easy moving average operations.
My current version is like:
.derive({ moving_death_7: d => (op.lag(d.death, 1) + op.lag(d.death, 2) + op.lag(d.death, 3) + op.lag(d.death, 4) + op.lag(d.death, 5) + op.lag(d.death, 6) + op.lag(d.death, 0)) / 7 })
It's currently unable to use for loops, and it looks awkward a bit. If I want to change to a 14d moving average, it will look even worse. Am I doing it wrong?
It would be awesome if one could convert a query to SQL so that it can run in a database. Would that be feasible or are there some obvious operations that are not easily translated to SQL?
It's only OK to use 'op.max' for aggregations, like in d=>op.max(d.foo)
,
But if I try to use it for common math calculations, like d=>op.max(1,2,3)
or d=>op.max([1,2,3])
, none of them will output expected values. It's still common usage for processing data like correcting negative values using like op.max(d.foo, 0)
.
I think Arquero should better separate the 'max' function with different keywords, like 'op.math_max' or 'op.agg_max'
pandas has df.sample(frac=1)
and df.sample(n=1)
; dplyr has t %>% sample_frac(1)
and t %>% sample_n(1)
.
Currently the recommended way to do this is either
t.sample(t.numRows())
(only on ungrouped tables).t.sample(() => op.count())
The first only works on ungrouped tables, and the second is a little wonky--I only found it in the docs by searching if there was an example of bootstrap sampling.
For example, in the case when dealing with U.S. Census FIPS values that are stored as numbers in a dataset, how would I convert the numeric values back to string values that are left padded with the correct number of zeros?
When I try doing something like:
table.derive({
"fips_state_str": d => `${d.fips_state}`.padStart(2, "0")
})
... Arquero throws an error. Looking at the API documentation and docs on Observable I don't see an example of how to accomplish this, but apologies if I missed or overlooked something. Thanks!
I hope that my question is not too silly, but I wondered what would be a reliable way to test if a given object is an Arquero Table (for example in the case of a function that would apply both to a table and to a d3 array).
I tried to use obj.constructor.name
but I believe this is not reliable because the result can change from a version to another due to code minimization.
Is there a "good" way to test for this ?
Many thanks !
It is useful to be able to select individual rows by index, including respect for sorting and grouping.
dplyr
uses head(n)
, tail(n)
, and slice(indices)
.
R does not respect negative indices in slice, but it would be nice to also be to say
table.slice([-1, 0]) to get the last and first elements, in that order, for each group.
The pandas function nth
has a better name than slice, but only selects a single row per group.
table view() method is convenient for showing/checking a slice of rows. I wish it could implement sortable columns and lazy browsing.
As a matter of fact, Observable released yesterday such a component: https://observablehq.com/@observablehq/input-table
I'd love he could operate on arquero/arrow tables (especially combined with a filtering select dropdown).
Would a rapprochement be conceivable?
Really loving Arquero so far! One thing I constantly run into is the need for a flatMap
or equivalently SQL array_agg
operation - Often grouping by a key, I want to collect the values into an array-valued output. I can use Javascript flatMap, but it would be great to have Arquero support this in its native ops
Apologies if this shouldn't be an issue, but I have been searching everywhere and haven't been able to understand what can I do with table expressions. Is this documented?
I want to derive columns programmatically. I have a dataset with a bunch of dates per each row, and I want to compute their relative position compared to a reference one. I know I can do:
dt.derive({
date1_rel: d => d.date1 - d.date_ref,
date2_rel: d => d.date2 - d.date_ref,
...
})
But I don't want to write that by hand. I tried then these cells on Observable:
columnFechas = [
"fis",
"fecha_de_muerte",
"fecha_reporte_web"
"fecha_recuperado",
"fecha_de_notificaci_n",
"fecha_diagnostico"
]
deriveFechas = Object.fromEntries(
columnFechas.map(f => [
`${f}_res`,
d => d[f] - d.fecha_reporte_web
])
)
dt.derive(deriveFechas)
But It throws an error saying Invalid column "f": "d[f]"
. I tried debugging the code and noticed that the function is a table expression, so for a reason I don't understand I cannot even console.log the attribute d. How do I get the column that is in the variable f
in a table expression. Documenting this could be important whenever you have column names with spaces
Here is a notebook replicating the issue
https://observablehq.com/d/2783d075d8079564#fechas
Seems related to #5, is there a way of doing op.get(column, d)
?
I have a table t1
with a risk
nominal column
and i want all distinct values in an array this is quite a common need i am having.
I really appreciate the new columnArray()
Table method.
I am actually hesitating between different paths, having difficulties to select and remind which would be the best:
[...new Set(t1.columnArray('risk'))] // short but not ideal for JS debutants...
t1.groupby('risk').count().columnArray('risk') // tradeoff, but i don't really need the count
t1.rollup({risk: op.array_agg_distinct('risk')}).get('risk', 0) // complex... could we avoid get here or use a get(0,0) ?
Maybe i am missing here something simplier.
Wish i could use something very simple such as t1.distinct('risk')
Error: Illegal function call: "Date.UTC(d.year, 0, 1)"
aq.table({ year: [2010, 2011, 2012, 2013, 2019, ] })
.derive({ date: d => Date.UTC(d.year, 0, 1) })
.view()
aq.fromCSV, with option autotype = true, removes leading 0 (as does d3.autotype, unfortunately).
"06075" => 6075
"01" => 1
Could a test on leading 0 be added to avoid string to number conversion here?
I currently fix this with a specific parse option:
parse: {zipcode: String}
but it would be simplier not having to care about this, it would allow for writing more generic function to parse CSV files.
First of all: Thanks for this great project. Itβs so great to finally see a good data wrangling framework in JavaScript land!
I have only worked with Arquero a couple of hours and stumbled upon an unexpected behavior that you might explain somewhere in the docs. Or it might be a bug.
Accessing individual cell values using .get(..., index)
does not seem to respect re-ordered tables. Hereβs an example:
const dt = aq.table({a: [1, 2, 3], b: [4, 5, 6]});
a b
1 4
2 5
3 6
dt.get("a", 0); // 1
const dto = dt.orderby(aq.desc("a"));
a b
3 6
2 5
1 4
dto.get("a", 0); // still 1 but expected 3
Any idea why that may be the case?
The following code (within an observable notebook cell) runs just fine:
{
let df = aq.from([{ a: 1 }, { a: 2 }]);
return df.derive({ b: row => row['a'] + 1 }).view(10);
}
But if I try and make the column name a variable, it fails:
{
let df = aq.from([{ a: 1 }, { a: 2 }]);
let col = 'a';
return df.derive({ b: row => row[col] + 1 }).view(10);
}
I suppose I'm just not understanding the scope in which the lambda gets evaluated, or something. What is the right way to use variable names within the lambdas that get passed to derive
?
Notebook: https://observablehq.com/@pwills/arquero-confusions
Is there a way to bind tables by columns, in the same way as concat
bind them by rows ?
For the moment the only way I found to do it is with something like :
let tmp1 = dt1.derive({ tmp: op.row_number() });
let tmp2 = dt2.derive({ tmp: op.row_number() });
tmp1
.join(tmp2)
.select(aq.not("tmp"));
But maybe there is a simpler way to do it ?
Thanks !
Hi,
This is quite a corner case, but if I try to join the following two tables :
dt1 = aq
.from([
{ key: 1, value1: 1 },
{ key: 2, value1: 2 },
{ key: 3, value1: 3 },
{ key: 4, value1: 4 }
])
.filter(d => d.key < 3)
.derive({ key: d => d.key })
dt2 = aq.from([
{ key: 1, value2: 1 },
{ key: 2, value2: 2 },
{ key: 5, value2: 5 }
])
The result has 4 rows, with undefined
values in the key
column for the rows filtered from dt1
:
dt1.join_left(dt2).view()
Here is a small Observable notebook reproducing the issue : https://observablehq.com/d/d85031b66c92460b
The problem is solved if I put the derive
before the filter
, or if I call reify
on dt1
before the join.
Maybe the good practice is just to systematically call reify
between a filter
and a derive
?
Thanks !
The API doc exposes as an example:
// create table from an input CSV loaded from 'url'
aq.fromCSV(await fetch(url).text())
But it does not seem to work.
Here are few variants:
let url ='https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/countries-etc-datapoints/ddf--datapoints--children_per_woman_total_fertility--by--geo--time.csv' ;
return aq.fromCSV( await fetch(url).text() ) // KO
return (await fetch(url)).text() // OK
return aq.fromCSV( (await fetch(url)).text() ) // KO
return aq.fromCSV( await ( (await fetch(url)).text() ) ) // OK
return fetch(url).then(d => d.text()).then(d => aq.fromCSV(d)) // OK
Actually i wish i could use simply:
aq.fromCSV(url)
like
d3.csv(url)
An extension to op.ntile() could prove useful to encode numeric values to categories from manual breaks.
Something similar to the R cut
function:
dens_code = cut( pop_density, breaks = c(0, 1000, 5000,20000,100000, Inf)...)
or d3.scaleThreshold()
jenks() and kmeans() are also useful clustering methods, we can borrow them from the simple statistics library
, but of course if they were in Arquero it would be convenient.
I noticed what i think are few minor typos in the doc (API reference):
https://uwdata.github.io/arquero/api/verbs
orderby
table.orderby('a', desc('b'))
=> table.orderby('a', aq.desc('b'))
table.fold(range(5, 8))
=> table.fold(aq.range(5, 8))
https://uwdata.github.io/arquero/api/op
op.includes
index: The integer index to start searcing from (default 0).
=> index: The integer index to start searching from (default 0).
Here are some assorted problems I encountered when using arquero:
ColumnTable
(see dist/types/table/column-table.d.ts
)
toCSV
, toHTML
, toJSON
, toMarkdown
options
has been generated as any
and not as specified in the jsdoc.arquero/src/table/column-table.js
Lines 204 to 211 in 55c7c13
options
parameter should be optional, now it is requiredTable.create
: the definition of config parameter should be refactored as a separate type or interface
Lines 51 to 57 in 55c7c13
Table.comparator
: should be more specific than just a Function
, e.g. type Comparator<T> = (a:T,b:T) => number
or something...
Lines 113 to 119 in 55c7c13
I want to check what should be a numeric column in my table (loaded from a CSV file), but it happens that it contains several non-numeric values : 'NA', 'ND', 'XXX'...
So i am trying a test such as:
isNaN(x)
With Arquero, i will use:
dt.filter(d => op.is_nan(d.mycol))
but it actually does not retain these 'NA', 'ND', 'XXX'...
beacause op.is_nan()
follows Number.isNaN()
, which behaves differently than isNaN()
.
Is this by design?
As an alternative, i am currently using:
dt.filter(d => +d.mycol != d.mycol)
How about adding those (sorted by descending interest for me):
table.sample(n)
)op.density(field, bandwidth)
),I have actually spotted them in vega-lite transforms collection.
This is an alternate solution related to the question at #33 about selecting certain subsets of columns based on types.
Arquero could have a metadata
slot on each column that borrows from Apache Arrow's support of table and column-level metadata. If my arrow column is of type arrow.utf8
with a metadata field saying "language": "English", if would be useful to have an arquero table derived from it at some point declare table.metadata = {"language": "English", "arrow_type": "utf8"}
. The pyarrow feather export functions do something similar with pandas frames: the feather metadata includes a description of the pandas dtypes.
If a function as argument to aq.matches
worked with reference to the full column (not just the name as for strings and regexes) @ericemc3's case in #33 could be expressed something like this:
table.select(aq.matches(col => col.metadata.arrow_type.match(/int|float/)))
col.metadata
would also conceivably be a useful place to expose information about what autotype inference in fromCSV and fromJSON decided to do.
Could try to find time for a pull request, but obviously this hinges on questions about how/whether you want to use slots on the column
object other than data
.
Thanks so much for releasing this project--I've been hoping for precisely this kind of engine for a while. Look forward to using it.
There's some relatively low-hanging fruit for a performance bump on dictionary-encoded columns. (Probably string columns, too, although it's a little trickier.) Currently, groupby
on a dictionary(utf-8 entries with int32 keys) column is quite slow because--if I'm tracing the code correctly--each individual row is being decoded from utf-8.
Similar optimizations should be possible for 'filter' and 'join' on dictionary columns, although there's probably more overhead involved.
FIrst off, thanks for building this - it's making my life a lot easier when I work with data in Observable π
I'm having some trouble working with dates. I build a data table via
df = aq.from([
{ date: '2020-01-01 00:00:00.000' },
{ date: '2020-01-02 00:00:00.000' }
])
I then add a custom function to shift the date forward 30 days, using moment
:
let shiftDateFunc = d => moment(d).add(30, 'days').format()
aq.addFunction("shiftDate", shiftDateFunc)
However, when I try to do
df.derive({shiftedDate: op.shiftDate('date')})
I get an error:
Expression parse error: Invalid date
Which is surprising, since I can map this function over the objects just fine:
df.objects().map(d => shiftDateFunc(d.date)) // succeeds, returns an array of strings
Why can I map this function over the object arrays successfully, but it fails when I register it via addFunction
? In my mental model the success of the former should imply that the latter will also work, but obviously I'm confused about something.
In previous versions, by using someTable.groupby('colA','colB').rollup(), it is possible to return a result table with all unique values with valid groups without actually rollup a value. However in 1.3.0, this is broken since the 'Object.entries are not allowed for undefined or null' error(depends on the browser). I have to manually add anything like op.count() to get it work, which I actually don't need.
I don't know if it is by purpose or by accident. But I do think the previous usage is no big deal for rollup() function.
This issue captures a missing feature that's
bothered me a little too; a quick and dirty way to join tables on shared keys as in tidyverse
's inner_join
or MySQL's NATURAL JOIN
. Especially without named arguments in JS, it's hard to
remember how to do this with existing joins.
In observable I've started adding a join_natural
function that mimics their behavior:
I wonder if you'd consider adding something similar as a method.
aq.internal.Table.prototype.join_natural = function(right) {
const names = this.columnNames()
return this.join(right, undefined, [names,right.columnNames().filter(d => names.indexOf(d) == -1)])
}
Would probably be best if it threw an error if there are no shared keys. (?)
Thanks again for your work--still greatly enjoying this library's contribution.
Arrow JS's compute library has the ability to optimize filter expressions that operate on a dictionary-encoded column. For example if foo
is a dictionary encoded string column and you're evaluating foo == "bar"
we first lookup the dictionary index for "bar" and then search for instances of that index. This can make a huge performance improvement, particularly for string columns where UTF-8 decoding is very expensive.
In Arrow I believe we only do it for equality expressions but you can imagine there are many other expressions that could benefit from such an optimization (foo in ("bar", "baz")
, lt, gt, ...). I never got a chance to build that out in Arrow, but now that arquero is here it seems like a much better place for it :)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. πππ
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google β€οΈ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.