uwdata / arquero Goto Github PK

View Code? Open in Web Editor NEW

1.2K 30.0 63.0 1.06 MB

Query processing and transformation of array-backed data tables.

Home Page: https://uwdata.github.io/arquero

License: BSD 3-Clause "New" or "Revised" License

JavaScript 100.00%

data query database table dataframe transform arrays

arquero's People

Contributors

Stargazers

Watchers

arquero's Issues

Aggregates should return null when all values are null

Right now, if I do op.sum(colName) and all the values of colName are null within a group, the aggregated value for that group is 0.0. IMO it would be more sensible to return null in this case.

Example:

dt = aq.from([{ foo: 1, bar: 1 }, { foo: 1, bar: 3 }, { foo: 2, bar: null }])
dt.groupby('foo').derive({bar: op.sum('bar')})

Support explicit column types.

First congratulations for that impressive work, which i consider, being a R user and a D3 fan, as a huge step forward for live and sexy datavisualisation and dataflows!

It looks like Arquero, from CSV for instance, is able to infer column types (Date, Numeric, String...), as we can see with the view() display (columns right or left-alignement), or by testing values type with typeof. Could that information be exposed in the table object, allowing for instance to test for numeric columns only?

A cool feature, starting from this, could be for instance:
select(1, aq.isNumeric())
or groupby(v1).rollup(*...here sum all numeric variables keeping same name...*)

I am used to this convenient R/dplyr syntax : summarise_if(is_numeric, sum) or summarise( across(where(is_numeric), sum) )

DataFrame Structure

First, this looks amazing and thank you for creating it.

I am could not find a built in method in the api docs to understand the dataframe/table structure that is created. In R:dplyr the underlying dataframe/tibble provides a str function to see type of each column vector. In Python:Pandas, there is dtypes method. In JS:arquero, the underlying array is 'read as-is' and there is no-requirement/enforcement of a uniform type for each column vector. So there is no built in-method to see what is the current column type being worked on?

I can extend it using the built-in extension. Just wanted to confirm that that is the current case.

Alternative to a fluent API

Hi,

I'm trying to derive many keys with different params:

table
  .params({date: 'Jan'})
  .derive({Jan: (e,$) => concat($.date, e) })
  .params({date: 'Feb'})
  .derive({Feb: (e,$) => concat($.date, e) })

However the months can be 3, 6, or 12. So I need to calculate the derive function using a object that looks like

params = [{date: 'Jan'}];
derive = { 
  Jan : f()...,
  Feb: f()...,
};

Without a fluent api I can imagine doing something like

table.
  notfluent([
    {params: params},
    {derive: derive}
  ])

Is there a way to achieve this in Arquero? Alternatively I'll have to append the array to the existing Arquero table, do a spread, and rename the columns? Is that the solution that is advisable? This means I would do the calculations outside arquero, which might defeat the point?

Add table iterator for row objects.

As suggested by @bmschmidt in a comment on Observable.

Add an op.recode function.

Maybe it could be interesting to add an operation to recode the values of a categorical variable, similar to the fct_recode function of the forcats R package. I don't think this already exists in Arquero, but I may have missed something.

Such a function could be something like this :

recode = function(value, recoding, other) {
  const keys = [...recoding.keys()];
  if (keys.includes(value)) return recoding.get(value);
  if (other !== undefined) return other;
  return value;
}

The idea would be to pass a value, a Map as recoding argument, and an optional other argument which, if defined, would be returned instead of value if value is not found in the Map. So for example :

let recoding = new Map([["foo", "bar"],[1,2]])
recode("foo", recoding)  // returns "bar"
recode(1, recoding)  // returns 2
recode("hello", recoding)  // returns "hello"
recode("hello", recoding, "other")  // returns "other"

I'm not a good JavaScript developer so there may be a much better implementation, but I think such an operation could be very useful for dealing with categorical variabels with Arquero.

Thanks !

Update table getters to respect filter and order criteria?

I am troubled with the fact that get() ignores current filter, making it (a little bit) difficult to access data in a filtered/ordered table.

And i am also having that strange result here (undefined when i expect 4 or ideally 5):

const dt = aq.table({ a: [1, 2, 3], b: ["4", "5", "6"] })

dt.filter(d => d.a > 1)
  .derive({b: d => +d.b})
  .column('b').get(0)

How to update table rows based on another table?

As an extension to the join functions,
i find these tidyverse functions really useful : https://dplyr.tidyverse.org/reference/rows.html

Especially 2 of them:
rows_update : like SQL UPDATE
rows_upsert : UPDATE and also INSERT missing rows

Build failures with Ember.js

Hi,
I was trying to use the library in the context of the Ember.js application, however, the module won't load with the default setup. Here is the error message. Do you have any advice on how to avoid it?

ERROR in ./node_modules/arquero/src/arrow/encode/profiler.js 99:40
Module parse failed: Identifier directly after number (99:40)
You may need an appropriate loader to handle this file type.
|     return Type.Float64;
|   } else if (p.bigints === valid) {
>     var _v = -p.min > p.max ? -p.min - 1n : p.max;
| 
|     return p.min < 0 ? _v < Math.pow(2, 63) ? Type.Int64 : error("BigInt exceeds 64 bits: ".concat(_v)) : p.max < Math.pow(2, 64) ? Type.Uint64 : error("BigInt exceeds 64 bits: ".concat(p.max));
 @ ./node_modules/arquero/src/arrow/encode/data-from-objects.js 2:0-37 9:12-19
 @ ./node_modules/arquero/src/arrow/encode/index.js
 @ ./node_modules/arquero/src/index.js

aq.fromCSV's "delimiter"-option is incorrectly documented

Hello!

I just tripped over a small inaccuracy: in the documentation for fromCSV, the option object's property for specifying the delimiter is referenced as delimiter, whereas in the respective code it says delim:

arquero/src/format/from-csv.js

Line 36 in f4e03eb

const delim = (options.delim == null ? ',' : options.delim + '').charCodeAt(0);

Therefore code like aq.fromCSV(text, { delimiter: ';' }) does not work as intended.

Typings should be fixed

Problem

I tried this:

import {table} from "arquero";

const myTable = table({
  x: [1610109901000, 1610109902000, 1610109903000, 1610109904000],
  y: [600, 610, 590, 602]
});

I got a warning from IntelliJ (both in javascript and in typescript): Invalid number of arguments, expected 2

This is because the names parameter of the table function is not marked as optional.

arquero/src/verbs/index.js

Lines 84 to 96 in 3083ad3

 /** 

  * Create a new table for a set of named columns. 

  * @param {object} columns 

  * The set of named column arrays. 

  * Object keys are the column names. 

  * The enumeration order of the keys determines the column indices. 

  * Object values must be arrays (or array-like values) of identical length. 

  * @return {ColumnTable} the instantiated table 

  * @example table({ colA: ['a', 'b', 'c'], colB: [3, 4, 5] }) 

  */ 

 export function table(columns, names) { 

 return new ColumnTable(mapObject(columns, x => x), names); 

 }

How to fix

Add the following code to your jsdoc: @param {string[]?} names

I would also suggest to review the typings of all functions in your code 😄

Wrong links in documentation

Extremely minor documentation issue. All the [Source] links on the page https://github.com/uwdata/arquero/blob/master/docs/api/verbs.md link to https://github.com/uwdata/arquero/blob/master/src/table/table.js, but the functions aren't actually listed there. The relevant code is actually at (eg) https://github.com/uwdata/arquero/blob/master/src/engine/select.js.

Thanks for the thorough docs, BTW.

Add Arrow output serialization.

@jheer and others, this is already looking like a great library for us at OmniSci, where we just added the ability to return SQL query results over the wire as Arrow buffers 👍 👍

Wondering if it is possible to serialize the results of an Arquero query to Arrow - it would make for a composable pipeline e.g. for use cases where you retrieve a dataset from OmniSci, wrangle it client-side with Arquero, and then either visualize it with Vega-based charts or things like deck.gl for 3d charts, for example.

We're happy to help in any way we can!

Fix ordering of integer-named columns.

I love pivot and fold fonctions in Arquero, they are so powerful !

I just ran into few small difficulties.
Here is a use case:

t = aq.table({x:['01','01','01','02','02','02'], y:['1','2','3','1','2','3'], z:[123,256,854, 652,734,222]})

x	y	z
01	1	123
01	2	256
01	3	854
02	1	652
02	2	734
02	3	222

I pivot that table:

t.groupby(0).pivot(1,2)

The problem is that x is now located on the right, i'd like it to stay in the first position.

1	2	3	x
123	256	854	01
652	734	222	02

I am trying all these variants, but in vain, x doesn't want to move ;)

t.groupby(0).pivot(1,2).relocate('x', {before:0}).view() ;
t.groupby(0).pivot(1,2).relocate('x', {before:1}).view() ;
t.groupby(0).pivot(1,2).relocate('x', {before:"1"}).view() ;

Maybe i am missing something there. Of course i can write a select.
The fact that new column names are not by default prefixed probably does not help.

My humble suggestions would be:

after a pivot, make the groupby columns appear at first
add a prefix option in pivot (and fold as well)

For instance:
p = t.groupby(0).pivot(1,2, {prefix:'age_'})

could produce:

x	age_1	age_2	age_3
01	123	256	854
02	652	734	222

and conversely:
p.fold(aq.range(1,3),{as:['y','z'], prefix:'age_'})

would take me back to the original table.

For the moment i use for pivot something like:
p = t.groupby(0).pivot({ d: d => 'age_' + d.y },2, {prefix:'age_'})

But after a fold i have to write something like this this, which i find not very elegant:
.derive({y: d => op.substring(d.y,4)})

cf. for inspiration (and old habits):
https://tidyr.tidyverse.org/reference/pivot_longer.html (aka fold) and names_prefix option
https://tidyr.tidyverse.org/reference/pivot_wider.html (aka pivot) and names_prefix option

result of table.antijoin is incoherent

I think I found a problem with table.antijoin(), which is evidenced if you try to place this file on the arquero's test directory and run it with node:

const aq = require('../')
const assert = require('assert')

const tb1 = aq.table({
  id: [1, 2, 3]
})

const tb2 = aq.table({
  id: [1, 2]
})

const tb3 = tb1.antijoin(tb2)

tb3.print()

assert.equal(tb3.numRows(), 1)
assert.equal(tb3.get('id', 0), 3)

This fails with the following error:

node:assert:119
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: 1 == 3
    at Object.<anonymous> (/Users/pedroteixeira/projects/deci/arquero/test/antijoin.js:17:8)
    at Module._compile (node:internal/modules/cjs/loader:1108:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1137:10)
    at Module.load (node:internal/modules/cjs/loader:973:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: 1,
  expected: 3,
  operator: '=='
}

The output table is:

┌─────────┬────┐
│ (index) │ id │
├─────────┼────┤
│    0    │ 3  │
└─────────┴────┘

But the assertion fails.

The problem goes away if I reify the result of the antijoin call:

const tb3 = tb1.antijoin(tb2).reify()

If you point me to where the problem lays in the code I can try to fix it.

Specify insertion point for derived columns.

With R, it is very convenient to create a column and specify where to insert it:
mutate( newvar = ..., .after = var1 )
hence avoiding to respecify the desired column order with a full select.

Could derive encompass a new parameter similar to .after or .before?

filtered arrow tables and conversion to arquero table: filter is lost

Arquero does not seem to take into account a filter applied to an arrow table.

Here is my use case:

I have quite a big arrow table:

arrow_tb.count()

=> 6159296

I filter it with arrow functions, which is very fast, especially with dictionary-encoded columns:

arrow_filtered_tb = arrow_tb.filter(arrow.predicate.col('NIVGEO').eq('COM') 
                            .and(arrow.predicate.col('CS1_8').eq('2'))
                            .and(arrow.predicate.col('AGEQ65').eq('60'))
                            .and(arrow.predicate.col('SEXE').eq('2') )
               )

arrow_filtered_tb.count()

=> 34951

I load it as an Arquero table and compute its number of rows:
aq.fromArrow(arrow_filtered_tb).numRows()

=> 6159296, which is not what i expect

Apparently, the arrow method column.toArray() exports the whole unfiltered column.

The arrow scan method looks more effective, i'd like to avoid here the conversion to objects before converting to arquero table :

nbRowsFilteredAfterScan = {
  let codgeo, nb, result = [] ;

  arrow_filtered_tb.scan((idx) => {
        result.push({'codgeo': codgeo(idx), 'nb': nb(idx)});
    }, (batch) => {
        codgeo = arrow.predicate.col('CODGEO').bind(batch);
        nb     = arrow.predicate.col('NB').bind(batch);
    });
  
  return aq.from(result).numRows() ;
}

=> 34951

Duplicated keys column after join

Hi,

First of all, many thanks for this awesome library that makes working with tabular data much more natural. As an Observable newbie coming from R and dplyr, I really appreciate it.

A small question : when joining two tables, the key column(s) are duplicated in the resulting joined table. For example, if both tables are merged on the same key column, the resulting table will have both a key_1 and a key_2 columns. In other languages or libraries I'm accustomed to the fact that the result only has a key column, as the original one.

Please excuse me if it is the intended behavior or if I am misunderstanding something.

Thanks.

Add additional selection helpers.

These are suggestions from a R practice, with handy functionalities i frequently use:

select
current possibilities such as selecting by name, indices, not, range are excellent. I'd like also: starts_with, contains, and the possibility to test on column types (for instance select all numeric columns)
rename
of course select in Arquero allows to rename, i have noticed that.
Within R, janitor::clean_names is very useful for tidying and fixing complex column_names that you can find, sometimes, in CSV files (with blanks, accented characters, caps) => DB compliant column names (one single world, lowercase, without accented or special chars...)

params in expr containing op.lag raises error

While trying to figure out how to set a lag programatically, I hit some unexpected behavior.

I wanted to do:

aq.table({"n": [1,2,3,4]})
  .params({lag: 1})
  .derive({prod: (d, $) => d.n - op.lag(d.n, $.lag)})
  .view()

But it raises: TypeError: t.param is not a function

Probably I'm just not finding the right place to set params to an op. (Or need to write a new one?).

But even the following raises an error, although the op.lag function does not ask for access to .params,
which seems less than ideal.

aq.table({"n": [1,2,3,4]})
  .params({lag: 1})
  .derive({prod: (d, $) => d.n * $.lag - op.lag(d.n, 1)})
  .view()

Date typing information from arrow lost after sample.

Thanks again for this great library.

If you load a table using aq.fromArrow, date type information is preserved;
but after some operations, the dates revert to being represented as some kind of integers, not as datetime.

Apologies for not providing an example inline, because I'm not confident about generating
an Arrow date field in Javascript. In the notebook at https://observablehq.com/@bmschmidt/a-files,
you can see that date information is preserved after a filter

table.filter(d => d.country=="Mexico").select("birth date").view()

gives

birth date
--
1915-06-21T00:00:00.000Z
1900-04-14T00:00:00.000Z
1899-04-15T00:00:00.000Z
1886-04-12T00:00:00.000Z

But after a sample or slice, you get ints:

table.sample(10).select("birth date").view()

gives

birth date
--
-1841614848
1234978816
-275112960
689860608

Easier moving average in windowed operations?

Recently I'm doing some COVID-19 stats trying to utilize Arquero lib.
Most of my operations are perfectly ported to Arquero APIs and worked well.
One of my key complaint is the lack of easy moving average operations.

My current version is like:

.derive({ moving_death_7: d => (op.lag(d.death, 1) + op.lag(d.death, 2) + op.lag(d.death, 3) + op.lag(d.death, 4) + op.lag(d.death, 5) + op.lag(d.death, 6) + op.lag(d.death, 0)) / 7 })

It's currently unable to use for loops, and it looks awkward a bit. If I want to change to a 14d moving average, it will look even worse. Am I doing it wrong?

Support Arquero to SQL query translation.

It would be awesome if one could convert a query to SQL so that it can run in a database. Would that be feasible or are there some obvious operations that are not easily translated to SQL?

Name collisions for op min and max functions.

It's only OK to use 'op.max' for aggregations, like in d=>op.max(d.foo),

But if I try to use it for common math calculations, like d=>op.max(1,2,3) or d=>op.max([1,2,3]), none of them will output expected values. It's still common usage for processing data like correcting negative values using like op.max(d.foo, 0).

I think Arquero should better separate the 'max' function with different keywords, like 'op.math_max' or 'op.agg_max'

Add sample_frac verb

pandas has df.sample(frac=1) and df.sample(n=1); dplyr has t %>% sample_frac(1) and t %>% sample_n(1).

Currently the recommended way to do this is either

t.sample(t.numRows()) (only on ungrouped tables).
t.sample(() => op.count())

The first only works on ungrouped tables, and the second is a little wonky--I only found it in the docs by searching if there was an example of bootstrap sampling.

How to left pad a string for a new column?

For example, in the case when dealing with U.S. Census FIPS values that are stored as numbers in a dataset, how would I convert the numeric values back to string values that are left padded with the correct number of zeros?

When I try doing something like:

table.derive({
  "fips_state_str": d => `${d.fips_state}`.padStart(2, "0")
})

... Arquero throws an error. Looking at the API documentation and docs on Observable I don't see an example of how to accomplish this, but apologies if I missed or overlooked something. Thanks!

Reliable way to test if an object is an Arquero table

I hope that my question is not too silly, but I wondered what would be a reliable way to test if a given object is an Arquero Table (for example in the case of a function that would apply both to a table and to a d3 array).

I tried to use obj.constructor.name but I believe this is not reliable because the result can change from a version to another due to code minimization.

Is there a "good" way to test for this ?

Many thanks !

Add head, tail, and/or slice verbs

It is useful to be able to select individual rows by index, including respect for sorting and grouping.

dplyr uses head(n), tail(n), and slice(indices).

R does not respect negative indices in slice, but it would be nice to also be to say
table.slice([-1, 0]) to get the last and first elements, in that order, for each group.

The pandas function nth has a better name than slice, but only selects a single row per group.

Table view() with sortable columns and lazy browsing

table view() method is convenient for showing/checking a slice of rows. I wish it could implement sortable columns and lazy browsing.

As a matter of fact, Observable released yesterday such a component: https://observablehq.com/@observablehq/input-table

I'd love he could operate on arquero/arrow tables (especially combined with a filtering select dropdown).
Would a rapprochement be conceivable?

Support for flatMap or arrayAgg

Really loving Arquero so far! One thing I constantly run into is the need for a flatMap or equivalently SQL array_agg operation - Often grouping by a key, I want to collect the values into an array-valued output. I can use Javascript flatMap, but it would be great to have Arquero support this in its native ops

Access a column programmatically on a table expression

Apologies if this shouldn't be an issue, but I have been searching everywhere and haven't been able to understand what can I do with table expressions. Is this documented?

I want to derive columns programmatically. I have a dataset with a bunch of dates per each row, and I want to compute their relative position compared to a reference one. I know I can do:

dt.derive({
  date1_rel: d => d.date1 - d.date_ref,
  date2_rel: d => d.date2 - d.date_ref,
 ...
})

But I don't want to write that by hand. I tried then these cells on Observable:

columnFechas = [
  "fis",
  "fecha_de_muerte",
  "fecha_reporte_web"
  "fecha_recuperado",
  "fecha_de_notificaci_n",
  "fecha_diagnostico"
]

deriveFechas = Object.fromEntries(
  columnFechas.map(f => [
    `${f}_res`,
    d =>  d[f] - d.fecha_reporte_web 
  ])
)

dt.derive(deriveFechas)

But It throws an error saying Invalid column "f": "d[f]". I tried debugging the code and noticed that the function is a table expression, so for a reason I don't understand I cannot even console.log the attribute d. How do I get the column that is in the variable f in a table expression. Documenting this could be important whenever you have column names with spaces

Here is a notebook replicating the issue

https://observablehq.com/d/2783d075d8079564#fechas

Seems related to #5, is there a way of doing op.get(column, d)?

Best (simple) way to get distinct values from a column ?

I have a table t1 with a risk nominal column
and i want all distinct values in an array this is quite a common need i am having.

I really appreciate the new columnArray() Table method.

I am actually hesitating between different paths, having difficulties to select and remind which would be the best:

[...new Set(t1.columnArray('risk'))]          // short but not ideal for JS debutants...
t1.groupby('risk').count().columnArray('risk')  // tradeoff, but i don't really need the count
t1.rollup({risk: op.array_agg_distinct('risk')}).get('risk', 0)  // complex... could we avoid get here or use a get(0,0) ?

Maybe i am missing here something simplier.

Wish i could use something very simple such as t1.distinct('risk')

Why Error: Illegal function call? and this library seems very limited unable to call general JS code?

Error: Illegal function call: "Date.UTC(d.year, 0, 1)"

aq.table({ year: [2010, 2011, 2012, 2013, 2019, ] })
  .derive({ date: d => Date.UTC(d.year, 0, 1) })
  .view()

fromCSV autotype: keep leading 0

aq.fromCSV, with option autotype = true, removes leading 0 (as does d3.autotype, unfortunately).

"06075" => 6075
"01" => 1

Could a test on leading 0 be added to avoid string to number conversion here?

I currently fix this with a specific parse option:
parse: {zipcode: String}
but it would be simplier not having to care about this, it would allow for writing more generic function to parse CSV files.

Get indices do not seem to respect table order

First of all: Thanks for this great project. It’s so great to finally see a good data wrangling framework in JavaScript land!

I have only worked with Arquero a couple of hours and stumbled upon an unexpected behavior that you might explain somewhere in the docs. Or it might be a bug.

Accessing individual cell values using .get(..., index) does not seem to respect re-ordered tables. Here’s an example:

const dt = aq.table({a: [1, 2, 3], b: [4, 5, 6]});

a	b
1	4
2	5
3	6

dt.get("a", 0); // 1

const dto = dt.orderby(aq.desc("a"));

a	b
3	6
2	5
1	4

dto.get("a", 0); // still 1 but expected 3

Any idea why that may be the case?

Update expression parser to support indirect column name references?

The following code (within an observable notebook cell) runs just fine:

{
  let df = aq.from([{ a: 1 }, { a: 2 }]);
  return df.derive({ b: row => row['a'] + 1 }).view(10);
}

But if I try and make the column name a variable, it fails:

{
  let df = aq.from([{ a: 1 }, { a: 2 }]);
  let col = 'a';
  return df.derive({ b: row => row[col] + 1 }).view(10);
}

I suppose I'm just not understanding the scope in which the lambda gets evaluated, or something. What is the right way to use variable names within the lambdas that get passed to derive?

Notebook: https://observablehq.com/@pwills/arquero-confusions

Bind tables by columns?

Is there a way to bind tables by columns, in the same way as concat bind them by rows ?

For the moment the only way I found to do it is with something like :

let tmp1 = dt1.derive({ tmp: op.row_number() });
let tmp2 = dt2.derive({ tmp: op.row_number() });
tmp1
    .join(tmp2)
    .select(aq.not("tmp"));

But maybe there is a simpler way to do it ?

Thanks !

undefined values in joined table after a filter and a derive

Hi,

This is quite a corner case, but if I try to join the following two tables :

dt1 = aq
  .from([
    { key: 1, value1: 1 },
    { key: 2, value1: 2 },
    { key: 3, value1: 3 },
    { key: 4, value1: 4 }
  ])
  .filter(d => d.key < 3)
  .derive({ key: d => d.key })

dt2 = aq.from([
  { key: 1, value2: 1 },
  { key: 2, value2: 2 },
  { key: 5, value2: 5 }
])

The result has 4 rows, with undefined values in the key column for the rows filtered from dt1 :

dt1.join_left(dt2).view()

Here is a small Observable notebook reproducing the issue : https://observablehq.com/d/d85031b66c92460b

The problem is solved if I put the derive before the filter, or if I call reify on dt1 before the join.

Maybe the good practice is just to systematically call reify between a filter and a derive ?

Thanks !

Support file loading from URL directly?

The API doc exposes as an example:

// create table from an input CSV loaded from 'url'
aq.fromCSV(await fetch(url).text())

But it does not seem to work.
Here are few variants:

  let url ='https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/countries-etc-datapoints/ddf--datapoints--children_per_woman_total_fertility--by--geo--time.csv' ;
  
  return aq.fromCSV( await fetch(url).text() )  // KO
  return (await fetch(url)).text() // OK
  return aq.fromCSV( (await fetch(url)).text() )  // KO
  return aq.fromCSV( await ( (await fetch(url)).text() ) ) // OK
  return fetch(url).then(d => d.text()).then(d => aq.fromCSV(d)) // OK

Actually i wish i could use simply:
aq.fromCSV(url)

like
d3.csv(url)

Thematic mapping utilities

An extension to op.ntile() could prove useful to encode numeric values to categories from manual breaks.
Something similar to the R cut function:
dens_code = cut( pop_density, breaks = c(0, 1000, 5000,20000,100000, Inf)...)
or d3.scaleThreshold()

jenks() and kmeans() are also useful clustering methods, we can borrow them from the simple statistics library, but of course if they were in Arquero it would be convenient.

doc typos

I noticed what i think are few minor typos in the doc (API reference):

https://uwdata.github.io/arquero/api/verbs
orderby
table.orderby('a', desc('b'))
=> table.orderby('a', aq.desc('b'))

table.fold(range(5, 8))
=> table.fold(aq.range(5, 8))

https://uwdata.github.io/arquero/api/op
op.includes
index: The integer index to start searcing from (default 0).
=> index: The integer index to start searching from (default 0).

op.is_nan() returns false for 'ND', etc.

I want to check what should be a numeric column in my table (loaded from a CSV file), but it happens that it contains several non-numeric values : 'NA', 'ND', 'XXX'...

So i am trying a test such as:

isNaN(x)

With Arquero, i will use:

dt.filter(d => op.is_nan(d.mycol))

but it actually does not retain these 'NA', 'ND', 'XXX'...

beacause op.is_nan() follows Number.isNaN(), which behaves differently than isNaN().

Is this by design?

As an alternative, i am currently using:
dt.filter(d => +d.mycol != d.mycol)

New statistical transforms

How about adding those (sorted by descending interest for me):

sample (table.sample(n))
density (op.density(field, bandwidth)),
regression,
loess...

I have actually spotted them in vega-lite transforms collection.

Column-level metadata/functional arguments to aq.matches

This is an alternate solution related to the question at #33 about selecting certain subsets of columns based on types.

Arquero could have a metadata slot on each column that borrows from Apache Arrow's support of table and column-level metadata. If my arrow column is of type arrow.utf8 with a metadata field saying "language": "English", if would be useful to have an arquero table derived from it at some point declare table.metadata = {"language": "English", "arrow_type": "utf8"}. The pyarrow feather export functions do something similar with pandas frames: the feather metadata includes a description of the pandas dtypes.

If a function as argument to aq.matches worked with reference to the full column (not just the name as for strings and regexes) @ericemc3's case in #33 could be expressed something like this:

table.select(aq.matches(col => col.metadata.arrow_type.match(/int|float/)))

col.metadata would also conceivably be a useful place to expose information about what autotype inference in fromCSV and fromJSON decided to do.

Could try to find time for a pull request, but obviously this hinges on questions about how/whether you want to use slots on the column object other than data.

Optimize groupby to use dictionary indices

Thanks so much for releasing this project--I've been hoping for precisely this kind of engine for a while. Look forward to using it.

There's some relatively low-hanging fruit for a performance bump on dictionary-encoded columns. (Probably string columns, too, although it's a little trickier.) Currently, groupby on a dictionary(utf-8 entries with int32 keys) column is quite slow because--if I'm tracing the code correctly--each individual row is being decoded from utf-8.

Here's a benchmark counting 2.5 million rows; in this case, it's about 4.7 seconds to count 1,000 distinct fields on character keys and 0.2 seconds on their integer equivalent.

Similar optimizations should be possible for 'filter' and 'join' on dictionary columns, although there's probably more overhead involved.

Working with date columns

FIrst off, thanks for building this - it's making my life a lot easier when I work with data in Observable 😄

I'm having some trouble working with dates. I build a data table via

df = aq.from([
  { date: '2020-01-01 00:00:00.000' },
  { date: '2020-01-02 00:00:00.000' }
])

I then add a custom function to shift the date forward 30 days, using moment:

let shiftDateFunc = d => moment(d).add(30, 'days').format()
aq.addFunction("shiftDate", shiftDateFunc)

However, when I try to do

df.derive({shiftedDate: op.shiftDate('date')})

I get an error:

Expression parse error: Invalid date

Which is surprising, since I can map this function over the objects just fine:

df.objects().map(d => shiftDateFunc(d.date)) // succeeds, returns an array of strings

Why can I map this function over the object arrays successfully, but it fails when I register it via addFunction? In my mental model the success of the former should imply that the latter will also work, but obviously I'm confused about something.

rollup() no longer allows empty inputs since 1.3.0

In previous versions, by using someTable.groupby('colA','colB').rollup(), it is possible to return a result table with all unique values with valid groups without actually rollup a value. However in 1.3.0, this is broken since the 'Object.entries are not allowed for undefined or null' error(depends on the browser). I have to manually add anything like op.count() to get it work, which I actually don't need.

I don't know if it is by purpose or by accident. But I do think the previous usage is no big deal for rollup() function.

Add join_natural method

This issue captures a missing feature that's
bothered me a little too; a quick and dirty way to join tables on shared keys as in tidyverse's inner_join
or MySQL's NATURAL JOIN. Especially without named arguments in JS, it's hard to
remember how to do this with existing joins.

In observable I've started adding a join_natural function that mimics their behavior:
I wonder if you'd consider adding something similar as a method.

  aq.internal.Table.prototype.join_natural = function(right) {
    const names = this.columnNames()
    return this.join(right, undefined, [names,right.columnNames().filter(d => names.indexOf(d) == -1)])
  }

Would probably be best if it threw an error if there are no shared keys. (?)

Thanks again for your work--still greatly enjoying this library's contribution.

Optimize for dictionary-encoded Arrow columns.

Arrow JS's compute library has the ability to optimize filter expressions that operate on a dictionary-encoded column. For example if foo is a dictionary encoded string column and you're evaluating foo == "bar" we first lookup the dictionary index for "bar" and then search for instances of that index. This can make a huge performance improvement, particularly for string columns where UTF-8 decoding is very expensive.

In Arrow I believe we only do it for equality expressions but you can imagine there are many other expressions that could benefit from such an optimization (foo in ("bar", "baz"), lt, gt, ...). I never got a chance to build that out in Arrow, but now that arquero is here it seems like a much better place for it :)

	/**
	* Format this table as a JavaScript Object Notation (JSON) string.
	* @param {JSONFormatOptions} options The formatting options.
	* @return {string} A JSON string.
	*/
	toJSON(options) {
	return toJSON(this, options);
	}

	* @param {object} [config] Configuration settings for the new table:
	* - data: The data payload to use.
	* - names: An ordered list of column names.
	* - filter: An additional filter bitset to apply.
	* - groups: The groupby specification to use (null for no groups).
	* - order: The orderby comparator to use (null for no order).
	* - params: Table expression parameters.

	/**
	* Returns the row order comparator function, if specified.
	* @return {Function} The row order comparator function.
	*/
	comparator() {
	return this._order;
	}

	/**
	* Create a new table for a set of named columns.
	* @param {object} columns
	* The set of named column arrays.
	* Object keys are the column names.
	* The enumeration order of the keys determines the column indices.
	* Object values must be arrays (or array-like values) of identical length.
	* @return {ColumnTable} the instantiated table
	* @example table({ colA: ['a', 'b', 'c'], colB: [3, 4, 5] })
	*/
	export function table(columns, names) {
	return new ColumnTable(mapObject(columns, x => x), names);
	}

uwdata / arquero Goto Github PK

arquero's People

Contributors

Stargazers

Watchers

Forkers

arquero's Issues

Problem

How to fix

Recommend Projects

Recommend Topics

Recommend Org