rtosholdings / riptable
64-bit multithreaded Python data analytics tools for NumPy arrays and datasets
Home Page: https://riptable.readthedocs.io/en/stable/
License: Other
rt.version
'1.0.19'
ds = rt.rt_dataset.Dataset({'f': [0, 1, 10, 100, 5]})
ds['g'] = ds['f'].astype(float)
ds
# f g
0 0 0.00
1 1 1.00
2 10 10.00
3 100 100.00
4 5 5.00
ds.dtypes
{'f': dtype('int64'), 'g': dtype('float64')}
rt.cut(ds['f'], [-100, 2, 11, 50, 1000])
Categorical([-100.0->2.0, -100.0->2.0, 2.0->11.0, 50.0->1000.0, 2.0->11.0]) Length: 5
FastArray([1, 1, 2, 4, 2], dtype=int8) Base Index: 1
FastArray([b'-100.0->2.0', b'2.0->11.0', b'11.0->50.0', b'50.0->1000.0'], dtype='|S12') Unique count: 4
rt.cut(ds['f'], [-100.0, 2.0, 11.0, 50.0, 1000.0]) == rt.cut(ds['f'], [-100, 2, 11, 50, 1000])
FastArray([ True, True, True, True, True])
However, the same np.inf bin edges on the integer column incorrectly filter out every element:
rt.cut(ds['f'], [-np.inf, 2, 11, 50, np.inf])
Categorical([Filtered, Filtered, Filtered, Filtered, Filtered]) Length: 5
FastArray([0, 0, 0, 0, 0], dtype=int8) Base Index: 1
FastArray([b'-inf->2.0', b'2.0->11.0', b'11.0->50.0', b'50.0->inf'], dtype='|S10') Unique count: 4
The same call on the float column works as expected:
rt.cut(ds['g'], [-np.inf, 2, 11, 50, np.inf])
Categorical([-inf->2.0, -inf->2.0, 2.0->11.0, 50.0->inf, 2.0->11.0]) Length: 5
FastArray([1, 1, 2, 4, 2], dtype=int8) Base Index: 1
FastArray([b'-inf->2.0', b'2.0->11.0', b'11.0->50.0', b'50.0->inf'], dtype='|S10') Unique count: 4
rt.TimeSpan wraps around at 24 hours instead of representing durations of a day or longer:
In [11]: rt.TimeSpan(23,'h')
Out[11]: TimeSpan(['23:00:00.000000000'])
In [12]: rt.TimeSpan(24,'h')
Out[12]: TimeSpan(['00:00:00.000000000'])
In [13]: rt.TimeSpan(25,'h')
Out[13]: TimeSpan(['01:00:00.000000000'])
Calling rolling_diff with an array whose length does not match the Categorical causes a seg fault that crashes the kernel.
In [1]: import riptable as rt
In [2]: import numpy as np
In [3]: c = rt.Cat(np.random.randint(0,10,1_000_000))
In [4]: x = np.random.randn(5)
In [5]: y = np.random.randn(1_000_000)
In [6]: c.rolling_diff(y)
# col_0
------ -----
0 nan
1 nan
2 nan
3 nan
4 nan
5 0.06
6 -0.23
7 nan
8 nan
9 0.06
10 0.36
11 -1.37
12 nan
13 2.49
14 -0.77
... ...
999985 1.81
999986 -0.58
999987 -0.44
999988 1.09
999989 -1.42
999990 1.26
999991 -0.90
999992 -1.03
999993 -0.50
999994 -0.32
999995 0.23
999996 -1.03
999997 1.57
999998 -2.12
999999 -2.43
[1000000 rows x 1 columns] total bytes: 7.6 MB
In [7]: c.rolling_diff(x)
Segmentation fault
I have some code which, in several places, builds up a logical (boolean) mask for the purpose of filtering a dataset; however, it builds up the inverse mask and applies logical-NOT (~my_mask) at the end, because the logic is easier to understand that way (vs. rewriting the logic just to avoid the final logical-NOT).
In these cases, it would be useful to have an in-place version of logical-NOT which accepts a boolean mask and inverts it in place, avoiding the additional memory allocation that comes from the normal ~ operator.
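As a possible stopgap, NumPy's ufunc machinery already supports this via the `out` parameter, and since FastArray subclasses np.ndarray the same pattern may work there too; a minimal sketch with plain NumPy:

```python
import numpy as np

mask = np.array([True, False, True, False])
# Invert the mask in place: no temporary array is allocated because the
# ufunc writes its result directly back into `mask` via `out=`.
np.logical_not(mask, out=mask)
print(mask.tolist())  # [False, True, False, True]
```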
Calling imatrix_make(cats=True) unexpectedly mutates the dtypes of the source Dataset (the int and categorical columns become float64):
ds = rt.Dataset({'f': [1,2,3,4]})
ds['g'] = [1.0, 2.2, 3.3, 4.4]
ds['h'] = list('abcd')
ds['m'] = rt.Categorical(ds['f'])
print(ds.dtypes)
print(ds)
print(ds.imatrix_make(cats=True))
print(ds.dtypes)
print(ds)
{'f': dtype('int64'), 'g': dtype('float64'), 'h': dtype('S1'), 'm': dtype('int8')}
#   f      g   h   m
-   -   ----   -   -
0   1   1.00   a   1
1   2   2.20   b   2
2   3   3.30   c   3
3   4   4.40   d   4
[[1. 1. 1. ]
[2. 2.2 2. ]
[3. 3.3 3. ]
[4. 4.4 4. ]]
{'f': dtype('float64'), 'g': dtype('float64'), 'h': dtype('S1'), 'm': dtype('float64')}
#      f      g   h      m
-   ----   ----   -   ----
0   1.00   1.00   a   1.00
1   2.00   2.20   b   2.00
2   3.00   3.30   c   3.00
3   4.00   4.40   d   4.00
FastArray.describe with custom quantiles fails when the quantile list has a different length than the default one.
fa = rt.FA(np.arange(1000))
fa.describe(q=[0.10, 0.25, 0.50, 0.75, .9, .99])
This happens with riptable version 1.5.2
Bug with small arrays (which don't compress) combined with a filter on load:
ds = Dataset({_k: list(range(_i *10, (_i +1) *10)) for _i, _k in enumerate(['a','b','c','d','e'])})
ds.save('test.sds')
load_sds('test.sds', filter=[False, False, True, True, True, False, False, False, False, False])
ds = rt.Dataset({'f': [1,2,3,4], 'g': [1.1, 2.2, np.nan, 4.4]})
ds[3, 'f'] = np.nan
ds['f2'] = 1
ds['g2'] = 1.0
ds['f3'] = ds['f'] + ds['f2']
ds['f4'] = ds['f'] * 2
ds['g3'] = ds['g'] + ds['g2']
ds[:, ['f3', 'f4', 'g3']]
#  | f3                   | f4 | g3
-- | -------------------- | -- | ----
0  | 2                    | 2  | 2.10
1  | 3                    | 4  | 3.20
2  | 4                    | 6  | nan
3  | -9223372036854775807 | 0  | 5.40
The last row above is wrong: adding 1 to an invalid int produces a very large negative int, and multiplying an invalid int by 2 produces 0.
In comparison, a float column with a nan value behaves as expected.
print(ds['f'].sum(), ds['f'].nansum(), ds['g'].sum(), ds['g'].nansum())
-9223372036854775802 6 nan 7.700000000000001
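The specific numbers are consistent with the invalid int64 being the INT64_MIN sentinel and arithmetic simply wrapping modulo 2**64 (my inference from the output above, not confirmed against the source):

```python
import numpy as np

sentinel = np.int64(-2**63)  # presumed invalid-int64 marker (INT64_MIN)
with np.errstate(over='ignore'):
    print(int(sentinel + np.int64(1)))  # -9223372036854775807, matching f3
    print(int(sentinel * np.int64(2)))  # 0: -2**64 wraps to 0, matching f4
```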
riptable does not seem to install from source within a fresh environment.
Steps to reproduce (all of this was run from the top level of the repo except where otherwise stated):
Create an environment from the following environment.yml:
name: rtdev
channels:
- conda-forge
- defaults
dependencies:
- python>=3.7
- numpy
- scipy
- numba
- python-dateutil
- pandas
- pre-commit
- black
- flake8
- pytest
- coverage
conda env create -f environment.yml
conda activate rtdev
Install riptable:
python setup.py install
Then attempt to import riptable from inside python (same result when running from the top of the repo or from other places):
python -c "import riptable"
which is met with the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/tmcclintock/Github/riptable/riptable/__init__.py", line 10, in <module>
from .rt_fastarray import FastArray, Threading, Recycle, Ledger
File "/Users/tmcclintock/Github/riptable/riptable/rt_fastarray.py", line 9, in <module>
import riptide_cpp as rc
ImportError: dlopen(/opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptide_cpp-1.5-py3.8-macosx-10.9-x86_64.egg/riptide_cpp.cpython-38-darwin.so, 2): Library not loaded: @rpath/libzstd.1.dylib
Referenced from: /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptide_cpp-1.5-py3.8-macosx-10.9-x86_64.egg/riptide_cpp.cpython-38-darwin.so
Reason: image not found
Of note: running pip install riptable again yields the following:
Requirement already satisfied: riptable in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptable-1.0.3-py3.8.egg (1.0.3)
Requirement already satisfied: numpy in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from riptable) (1.19.1)
Requirement already satisfied: riptide_cpp in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptide_cpp-1.5-py3.8-macosx-10.9-x86_64.egg (from riptable) (1.5)
Requirement already satisfied: ansi2html in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/ansi2html-1.5.2-py3.8.egg (from riptable) (1.5.2)
Requirement already satisfied: numba in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from riptable) (0.51.1)
Requirement already satisfied: six in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from ansi2html->riptable) (1.15.0)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from numba->riptable) (0.34.0)
Requirement already satisfied: setuptools in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from numba->riptable) (49.6.0.post20200814)
which indicates that riptide_cpp is installed, but for whatever reason the @rpath/libzstd.1.dylib image did not get put into the proper place on the path.
System details:
macOS 10.15.4
Python 3.8.5 (same issue with 3.7)
Suggested solution:
Somehow ensure that the installer for riptide_cpp works correctly, but I'm not entirely sure how to do this. Happy to raise this issue in that repo instead.
Note that I did not encounter this issue when installing riptable on Linux machines.
ds = Dataset()
ds.num = arange(4)
ds['_num'] = arange(4.0)
This should raise an error.
The current "Global" prefix requires the application to have SE_CREATE_GLOBAL_NAME permissions which typically requires the process to run as an Administrator. Having the option to not use the global shared memory namespace could make development and automated testing easier.
Selecting columns with a duplicated name from a Dataset silently drops the duplicate, while the equivalent fancy-index on the imatrix (and duplicated row selection) works as expected:
import riptable as rt
rt.version
'1.0.19'
ds = rt.rt_dataset.Dataset({'f': [1,2,3,4], 'g': [5,6,7,8]})
ds
# f g
0 1 5
1 2 6
2 3 7
3 4 8
ds[:, ['f', 'g', 'f']]
# f g
0 1 5
1 2 6
2 3 7
3 4 8
fast_arr = ds.imatrix_make()
fast_arr[:, [0,1,0]]
FastArray([[1, 5, 1],
[2, 6, 2],
[3, 7, 3],
[4, 8, 4]])
ds[[0,1,0], :]
# f g
0 1 5
1 2 6
2 1 5
fast_arr[[0,1,0], :]
FastArray([[1, 5],
[2, 6],
[1, 5]])
The same inputs produce different results from np.searchsorted and rt.searchsorted; np.searchsorted returns the correct result:
a = np.array([100, 101, 102, 102, 102, 103, 106, 106, 107, 107, 110, 118,134, 146, 146, 147, 147, 148, 148, 149])
b = np.array([134, 146, 147, 148, 149])
np.searchsorted(a, b)
array([12, 13, 15, 17, 19])
vs
a = rt.FastArray([100, 101, 102, 102, 102, 103, 106, 106, 107, 107, 110, 118, 134, 146, 146, 147, 147, 148, 148, 149])
b = rt.FastArray([134, 146, 147, 148, 149])
rt.searchsorted(a, b)
FastArray([12, 14, 15, 17, 19], dtype=int8)
n = 100_000_000
var64 = rt.FA(np.random.normal(loc=10, scale=5, size=n).astype('float64'))
var32 = rt.FA(var64.copy().astype('float32'))
cat = rt.Cat(np.random.choice(['cat1', 'cat2'], size=n, replace=True))
cat.nanmean(var32)  # repeated identical calls return slightly different results ("flutter")
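A plausible explanation (an assumption on my part, not a confirmed diagnosis): a multithreaded reduction over float32 data accumulates partial sums in a thread-dependent order, and float32 addition is not associative, so the result changes from run to run. The non-associativity itself is easy to demonstrate:

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)
# Same three terms, different association, different float32 answers:
print((a + b) - a)  # 0.0 -- the 1.0 is absorbed by rounding near 1e8
print((a - a) + b)  # 1.0
```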
import riptable as rt
import numpy as np
ds = rt.Dataset({'x':np.arange(1000),
'c':rt.Cat([0,1]*500)})
agg = ds.c.min(ds.x)
do_something_stupid = ds.c.shift(agg, 1)
I get a seg fault running this on linux, RipTable version 1.0.19
It would be good to provide a set of benchmarks for riptable. I would suggest adopting the NumPy benchmark suite from numpy/benchmarks which is built to use the Airspeed Velocity (ASV) benchmark system.
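For reference, ASV discovers benchmarks as classes whose time_* methods it times repeatedly after calling setup(); a minimal riptable-flavored sketch (the file name and the choice of reductions are purely illustrative) might look like:

```python
# benchmarks/bench_reduce.py -- hypothetical layout for an ASV suite
import numpy as np

class TimeReductions:
    def setup(self):
        # ASV calls setup() before timing each method.
        self.arr = np.random.rand(1_000_000)

    def time_sum(self):
        self.arr.sum()

    def time_nansum(self):
        np.nansum(self.arr)
```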
at minimum this entire module needs review
I am playing with riptable version 1.0.19. I came across a bizarre error when I use groupby.diff(): it seems that if the dataset has fewer than 2 billion rows I am fine, but I get arbitrary results when it has 3 billion rows. Please see the attached notebook for a reproducible (big) example (it uses up to 500 GB of RAM).
tsd_1m_tmp = tsd_1m.head(int(3e9))
tsd_1m_tmp['foo_delta'] = tsd_1m_tmp.groupby(['date', 'symbol'])['foo'].diff()
tsd_1m_tmp
Performing an outer merge with rt.merge2() on multiple keys, where the first key (for one of the datasets) is a string column, raises an AttributeError.
Repro:
ds1 = rt.Dataset({'f': ['1', '2', '3', '4'], 'g': [1, 2, 3, 4], 'h1': [1, 1, 1, 1]})
ds2 = rt.Dataset({'f': ['2', '3', '4', '5'], 'g': [2, 3, 4, 5], 'h2': [2, 2, 2, 2]})
ds1.merge2(ds2, on=['f', 'g'], how='outer')
Stack Trace:
AttributeError Traceback (most recent call last)
<ipython-input-28-d68d510b6b4b> in <module>
----> 1 ds3 = ds1.merge2(ds2, on=['f', 'g'], how='outer')
/my_python_env/lib/python3.7/site-packages/riptable/rt_dataset.py in merge2(self, right, on, left_on, right_on, how, suffixes, copy, indicator, columns_left, columns_right, validate, keep, high_card, hint_size)
3708 self, right, on=on, left_on=left_on, right_on=right_on, how=how,
3709 suffixes=suffixes, copy=copy, indicator=indicator, columns_left=columns_left, columns_right=columns_right,
-> 3710 validate=validate, keep=keep, high_card=high_card, hint_size=hint_size)
3711 merge2.__doc__ = rt_merge.merge2.__doc__
3712
/my_python_env/lib/python3.7/site-packages/riptable/rt_merge.py in merge2(left, right, on, left_on, right_on, how, suffixes, copy, indicator, columns_left, columns_right, validate, keep, high_card, hint_size, **kwargs)
2016 left_grouping, _ = get_or_create_keygroup(left_on_arrs, 0, how != 'left', False, high_card, hint_size)
2017 right_grouping, right_groupby_grouping =\
-> 2018 get_or_create_keygroup(right_on_arrs, 1, how != 'right', how == 'outer', high_card, hint_size)
2019
2020 if logger.isEnabledFor(logging.INFO):
/my_python_env/lib/python3.7/site-packages/riptable/rt_merge.py in get_or_create_keygroup(keycols, index, force_invalids, create_groupby_grouping, high_card, hint_size)
1995 # Need to copy 'isvalid' to avoid sharing the same instance with 'valid_join_tuples_mask',
1996 # which'll lead to them incorrectly having the same data (since we're updating in-place).
-> 1997 valid_groupby_tuples_mask = isvalid.copy()
1998 elif isvalid is not None:
1999 valid_groupby_tuples_mask |= isvalid
AttributeError: 'NoneType' object has no attribute 'copy'
When SDS Datasets are stacked and a new column (for example 'col1') is introduced midway through, it is gap-filled.
However, if the new column is selected with just include='col1', then since it is the only array, gap filling does not occur.
This is not necessarily a bug; I just want to report that the behavior here differs depending on whether the invalid value is in an int or a float column.
ds = rt.Dataset({'f': [1,2,3,4], 'g': [1.1, 2.2, np.nan, 4.4]})
ds[3, 'f'] = np.nan
print(ds)
#     f      g
-   ---   ----
0     1   1.10
1     2   2.20
2     3    nan
3   Inv   4.40
ds['f'].clip(2, 3)
FastArray([2, 2, 3, 2])  # the invalid value was clipped to 2
ds['g'].clip(2, 3)
FastArray([2. , 2.2, nan, 3. ])
rt.quantile returns internal data that is used to compute the quantile (the sorted array and counts), contrary to what the docstring specifies. Moreover, the actual quantiles returned are an np.array of dtype=object, which can cause issues down the road.
I would much rather have rt.quantile just return the quantiles: in general that's what is needed, and riptable should probably stay close to the numpy API whenever it can. Adding an extra keyword to request the extra data when needed is of course fine.
The issue I had with dtype=object is that rt.unique(quantiles) returns a string FastArray, which will wreak havoc when used later on. I don't know if I should open another issue for this?
>>> rt.__version__
'1.0.25'
>>> np.__version__
'1.18.5'
>>> x = rt.FA(np.arange(100))
>>> r = rt.quantile(x, [0.25, 0.75])
>>> r
(array([24.75, 74.25], dtype=object), FastArray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99], dtype=int32), (0, 0, 0))
>>> rt.unique(r[0])
FastArray([b'24.75', b'74.25'], dtype='|S5')
>>>
Running
tmp_array = np.random.randint(0, 5, 2_500_000_000)
tmp_cat = rt.Cat(tmp_array)
Then calling
tmp_cat.first(tmp_array)
gives garbage.
This seems to only occur for data larger than about 2 billion records.
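The ~2-billion threshold is suggestive (again, an assumption rather than a diagnosis): 2**31 - 1 is the largest signed 32-bit value, so a 32-bit index or counter somewhere in the grouping code would wrap for arrays of this size:

```python
import numpy as np

n = 2_500_000_000
print(n > 2**31 - 1)  # True: n no longer fits in a signed 32-bit index
# Casting such a length down to int32 silently wraps to a negative value:
wrapped = np.array([n], dtype=np.int64).astype(np.int32)[0]
print(int(wrapped))  # -1794967296
```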
It seems that groupby changes the type of categoricals back to strings, as here:
ds = rt.Dataset(dict(trader=['a', 'a', 'b', 'b', 'c', 'c'], shares=np.arange(6)))
ds.trader = rt.Categorical(ds.trader)
gds = ds.trader.sum(ds.shares)
print(ds.trader.dtype)
print(gds.trader.dtype)
int8
|S1
I get the same result if I do the groupby using the alternate approach/syntax:
gds = ds.gb('trader').shares.sum()
I have several files saved in shared memory (/dev/shm) even though I am certain I have never used the 'share' parameter in rt.save_sds() (I didn't know it existed until today). The files have been created at various dates between May and September 2020. I am unfortunately not able to write reproducible code, but I was wondering if this has been observed and people know the cause.
When doing Categorical.ema_normal with a filter, the filtered-out values are treated as zero.
r = np.arange(10, dtype=float) + 1000
tmp = rt.Dataset({'x': r,
'x_zero': r.copy(),
'fltr': rt.FA([True]*10)})
tmp.x_zero[5] = 0
tmp.fltr[5] = False
c = rt.Cat([0]*10)
t = r
tmp['ema_x'] = c.ema_normal(tmp['x'], time=t, decay_rate=1)
tmp['ema_x_zero'] = c.ema_normal(tmp['x_zero'], time=t, decay_rate=1)
tmp['ema_x_filtered'] = c.ema_normal(tmp['x'], time=t, filter=tmp.fltr, decay_rate=1)
print(tmp)
produces:
# x x_zero fltr ema_x ema_x_zero ema_x_filtered
- -------- -------- ----- -------- ---------- --------------
0 1,000.00 1,000.00 True 1,000.00 1,000.00 1,000.00
1 1,001.00 1,001.00 True 1,000.63 1,000.63 1,000.63
2 1,002.00 1,002.00 True 1,001.50 1,001.50 1,001.50
3 1,003.00 1,003.00 True 1,002.45 1,002.45 1,002.45
4 1,004.00 1,004.00 True 1,003.43 1,003.43 1,003.43
5 1,005.00 0.00 False 1,004.42 369.14 369.14
6 1,006.00 1,006.00 True 1,005.42 771.71 771.71
7 1,007.00 1,007.00 True 1,006.42 920.44 920.44
8 1,008.00 1,008.00 True 1,007.42 975.79 975.79
9 1,009.00 1,009.00 True 1,008.42 996.78 996.78
ema_x_filtered should be close to ema_x, without the big drop at index 5. The same issue happens with ema_decay. I am not sure about ema_weighted as I get an error running it.
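For reference, the unfiltered ema_x column above is reproduced by the recurrence ema_i = x_i + (ema_{i-1} - x_i) * exp(-decay_rate * (t_i - t_{i-1})). A sketch of the expected filtered behavior, under my reading of the intended semantics (skip filtered rows entirely and carry the EMA state forward -- this is not riptable's implementation):

```python
import numpy as np

def ema_normal_ref(x, t, decay_rate, fltr):
    # Filtered rows neither update the EMA state nor contribute a 0;
    # they simply carry the previous EMA value forward.
    out = np.full(len(x), np.nan)
    ema, last_t = np.nan, None
    for i in range(len(x)):
        if fltr[i]:
            if last_t is None:
                ema = x[i]
            else:
                w = np.exp(-decay_rate * (t[i] - last_t))
                ema = x[i] + (ema - x[i]) * w
            last_t = t[i]
        out[i] = ema
    return out

r = np.arange(10, dtype=float) + 1000
fltr = np.ones(10, dtype=bool)
fltr[5] = False
print(np.round(ema_normal_ref(r, r, 1.0, fltr), 2))
```

With this definition the value at index 5 stays at the index-4 EMA instead of dropping toward zero.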
In [1]: import riptable as rt
In [2]: c=rt.Cat((rt.arange(30) % 3)+1, {1: 'test', 2: 'test2', 3: 'test3'})
In [3]: c[0:5].filter().unique()
FastArray(['test', 'test2', 'test3'], dtype='<U5')
In [4]: c[0:5].unique()
This call normally fails.
In from_pandas in rt_dataset (around line 2669): if conversion with dtype='S' fails, retry with dtype='U'.
Along with shm_open, have a memory map feature
The multiprocessing module allows more than one thread to call into riptable's core. Use a concurrency flag to detect this and switch to single-threaded mode.
riptable has a lot of docstrings, but no dedicated docs. Since it is open source, we should take advantage of an open-source documentation platform like RTD.
I would be happy to do this, as I have created docs pages for other projects in SIG. Just let me know if we want a particular platform, or if RTD is fine.
In rt_meta we need to remove the dependency on ansi2html.
I would keep three versions of the data: data_plain, data_ansi, and data_html. I would remove the code in rt_display.py related to handling display of metadata (the DisplayText class). @zeneli
I might also change the test which checks for identical HTML output.
For the HTML tags, I am not sure whether we should force a color or what the preferred way to do it is.
Categorical.unique on a slice of the Categorical does not produce the unique categories used within the slice. Instead it returns a FastArray of both used and unused categories.
Repro
>>> import riptable as rt
>>> rt.Cat(list('xyyz'))[:3].unique() # incorrect behavior; 'z' should not be in the FastArray
FastArray([b'x', b'y', b'z'], dtype='|S1')
>>> rt.Cat(list('xyyz'))[:3].filter().unique() # correct behavior; 'z' is not in the FastArray
FastArray([b'x', b'y'], dtype='|S1')
In a conversation with some users of riptide, one user mentioned that for a few specialized cases they'd find it useful to have access to the "raw" fancy indices as calculated by the relational join algorithm in rt.merge2.
It should be relatively simple to refactor the current implementation of rt.merge2 to extract the part which sets up and calls the join implementation into a new function, e.g. rt.merge_indices(); the remaining code in rt.merge2() will just call this new function, then apply the calculated fancy indices to the selected columns of the left and right Datasets to produce the merged output Dataset.
This new function will provide a building block that could be used to implement other types of merge algorithms in the future, and perhaps even be used in rt.merge_asof(). Users of riptable will be able to use it to implement their own custom merge functions, or as a general key-alignment mechanism (e.g. if you wanted to implement a vectorized "bimap" data structure).
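The building block can be sketched in plain NumPy: once a join has computed its fancy indices, producing the merged columns is just a pair of gathers (rt.merge_indices and its exact return shape are hypothetical here; the indices below are hand-derived for illustration):

```python
import numpy as np

left_key = np.array([1, 2, 3, 4])
right_val = np.array(['b', 'c', 'd', 'e'])

# Suppose the (hypothetical) index-computing step returned these for an
# inner join of left_key against right keys [2, 3, 4, 5]:
left_idx = np.array([1, 2, 3])
right_idx = np.array([0, 1, 2])

# Any column from either side is materialized with a simple gather:
print(left_key[left_idx].tolist())    # [2, 3, 4]
print(right_val[right_idx].tolist())  # ['b', 'c', 'd']
```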
An example:
from riptable import *
x = 701389446541966656
y = asarray([x - 1, x, x + 1]).astype(np.uint64)
y.isin(x)
incorrectly returns [True, True, False].
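The likely culprit (an assumption on my part): comparing against a Python int via float64, which carries only 53 bits of mantissa, so x - 1 and x collide after the cast while x + 1 rounds away -- exactly matching the [True, True, False] result:

```python
x = 701389446541966656            # needs 60 bits; float64 carries only 53
print(float(x - 1) == float(x))   # True: x-1 rounds to the same float as x
print(float(x + 1) == float(x))   # False: x+1 rounds to the next float up
```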
Functions like rt.load_sds() accept a str- or bytes-typed path (or a sequence of such); however, we should also support Python's os.PathLike type in these functions to make calling them easier and more seamless for users who're making use of types from Python's pathlib (e.g. pathlib.Path).
All we'll really need to do to implement this support is to utilize the os.fspath() function to convert any path-like objects to a string representation before e.g. passing those strings down to some riptide_cpp function.
More info:
Extracting a single element of a FastArray returns a numpy array scalar, but it should probably return a riptable array scalar using one of the scalar types defined in rt_numpy.py.
I ran into this today when trying to implement a small new riptable feature I need for another project. The current implementation returning a numpy array scalar means the value loses semantic information about riptable's invalid values -- so an invalid value recognized by rt.isnan, when extracted, gives something not recognized as invalid.
Repro:
In [67]: rt.isnan(np.array([-128], dtype=np.int8).view(rt.FA)[0])
False
In [68]: np.array([-128], dtype=np.int8).view(rt.FA)[0]
-128
In [69]: isinstance(np.array([-128], dtype=np.int8).view(rt.FA)[0], np.generic)
True
In [70]: rt.isnan(np.array([-128], dtype=np.int8).view(rt.FA))
FastArray([ True])
In [71]: rt.isnan(np.array([-128], dtype=np.int8).view(rt.FA)[0])
False
In [72]: type(np.array([-128], dtype=np.int8).view(rt.FA)[0])
numpy.int8
As of riptable 1.0.9, the rt.isnotfinite function returns an np.ndarray rather than a FastArray. The values in the returned array also appear to be incorrect for integer-based arrays: they don't match the logical-NOT of what rt.isfinite returns when called with the same input.
Repro:
In [1]: import numpy as np
In [2]: import riptable as rt
# Floating-point example
In [3]: arr1 = np.array([-0.1, 0.0, 0.1, 0.2, np.nan, 0.4, np.inf, -np.inf, 0.7])
In [4]: rt.isfinite(arr1)
FastArray([ True, True, True, True, False, True, False, False, True])
In [5]: rt.isnotfinite(arr1)
array([False, False, False, False, True, False, True, True, False])
In [6]: ~rt.isfinite(arr1)
FastArray([False, False, False, False, True, False, True, True, False])
# Integer example (rt.isnotfinite not respecting invalids like rt.isfinite)
In [7]: arr2 = rt.FA([0, -128, -127, 120, 127], dtype=np.int8)
In [8]: rt.isfinite(arr2)
FastArray([ True, False, True, True, True])
In [9]: rt.isnotfinite(arr2)
array([False, False, False, False, False])
In [10]: ~rt.isfinite(arr2)
FastArray([False, True, False, False, False])
Address new test failures given the smaller set of dependencies and across increased variety of platforms.
Hello,
Riptable’s Categorical ignores the inplace argument of cat.fill_forward and fill_backward. Could you either implement this feature (preferred), or remove the keyword?
To reproduce:
fa = rt.FA([1., np.nan, 1.])
c = rt.Cat([1, 1, 1])
c.fill_forward(fa, inplace=True)
assert fa[1] == 1.  # fails: fa is not modified in place
Thanks,
--Marc
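Until that's implemented, a vectorized in-place forward fill can be sketched with plain NumPy (illustrative only: it ignores riptable's group boundaries, which cat.fill_forward respects):

```python
import numpy as np

def fill_forward_inplace(a):
    # Replace each NaN with the most recent non-NaN value, writing the
    # result back into `a` (leading NaNs stay NaN).
    valid = ~np.isnan(a)
    idx = np.where(valid, np.arange(len(a)), 0)
    np.maximum.accumulate(idx, out=idx)  # index of last valid element so far
    a[:] = a[idx]

fa = np.array([1.0, np.nan, 1.0])
fill_forward_inplace(fa)
print(fa.tolist())  # [1.0, 1.0, 1.0]
```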
The following throws an AttributeError ('Categorical' object has no attribute '_grouping') with riptable version 1.0.25:
import numpy as np
import riptable as rt
tds = rt.Dataset(dict(id_column=[1, 1, 1, 2, 2, 2], shares=np.arange(6), trader=['a', 'a', 'b', 'b', 'c', 'c']))
tds.trader = rt.Categorical(tds.trader)
gds = tds.gb('id_column').sum()
gds.col_rename('shares', 'shares_sum')
tds = rt.merge_lookup(tds, gds, on='id_column', suffixes=(False, False), copy=False)
print(tds)
The code above works if I remove the tds.trader = rt.Categorical(tds.trader) line (i.e. leave 'trader' as a string column), or if I remove copy=False from the merge_lookup call. For performance reasons, however, we often want these.
Hi,
One of the columns in my rt.Dataset wasn't a contiguous array (it contained float64s but had a stride of 24), and when I tried a left merge onto another dataset with columns_right=my_column, it failed. I fixed it by copying the array to make it contiguous, but this seems like a bug, not a feature.
Best,
Daniel
Invalid ints do not persist through common operations, e.g.:
lds = rt.Dataset(dict(a=['a', 'b', 'c']))
rds = rt.Dataset(dict(a=['a', 'b'], v=[1, 2]))
ds = rt.merge_lookup(lds, rds, on='a')
print(ds)
print()
print(ds.v.isna())
print()
print(ds.v + ds.v)
print()
print((ds.v * ds.v).isna())
#   a     v
-   -   ---
0   a     1
1   b     2
2   c   Inv
[False False True]
[2 4 0]
[False False False]
Is this fixable? Alternatively should we do away with invalid ints?
dataset.nansum(axis=1) returns a scalar (the sum of all numbers in the dataset) instead of a FastArray of length dataset.shape[0].
A user reported running into an error today when calling rt.merge2() with how='outer'. After a bit of investigation, it turned out the user had (in this specific example) a 'right' Dataset whose keyset was a subset of the 'left' Dataset's.
The key bug here was that calling FastArray.nansum() on an empty array while also explicitly specifying the output dtype caused a TypeError to be raised; specifically, FastArray._reduce_check() was attempting to convert None to an integer dtype, which failed. (The same case without specifying the dtype "succeeded" and returned nan because it fell into a logic case where the result was converted to float64.)
Version: riptable 1.0.25
Creating a normal, single-key Categorical with no filter shows all of the categories are included in .category_dict:
>>> strs = rt.Cat(['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ccccc', 'bbbbb', 'aaaaa', 'aaaaa', 'bbbbb', 'aaaaa'])
>>> strs.category_dict
{'key_0': FastArray([b'aaaaa', b'bbbbb', b'ccccc', b'ddddd'], dtype='|S5')}
Creating the same categorical, but providing a filter which masks out the first 'ccccc' and the only 'ddddd' eliminates 'ddddd' as a category; that's the expected behavior since there's only one occurrence -- the one we're filtering out.
>>> mask = rt.FA([True, True, False, False, True, True, True, True, True, True], dtype=np.bool)
>>> strs_filt = rt.Cat(['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ccccc', 'bbbbb', 'aaaaa', 'aaaaa', 'bbbbb', 'aaaaa'], filter=mask)
>>> strs_filt.category_dict
{'key_0': FastArray([b'aaaaa', b'bbbbb', b'ccccc'], dtype='|S5')}
In the example above, the elements where mask is False are also assigned to the 0 bin (representing invalid/NA/filtered).
However, when creating a multi-column ("multi-key") Categorical, the filter does not seem to be (entirely) respected -- the filtered elements still get a category created for them (even if that row is the only occurrence of that key-tuple). The category created for the filtered row does have all of its elements (in the category arrays) set to invalid/default/NA values. Another invariant that's broken here is that the created categories are not all unique -- there are multiple occurrences of some key-tuples (i.e. rows, if you vstacked the category arrays together into a Dataset).