riptable's People

Contributors

3keepmovingforward3, 972d5defe3218bd62b741e6a2f11f5b3, eoincondron, jack-pappas, orestzborowski-sig, rtosholdings-bot, ryan-gunderson, staffantj, tdimitri, thomasdimitri, tmcclintock, wenjuno, zeneli

riptable's Issues

BUG: rt.cut gives "Filtered" on an int column when bins contains np.inf

rt.__version__
'1.0.19'

ds = rt.rt_dataset.Dataset({'f': [0, 1, 10, 100, 5]})
ds['g'] = ds['f'].astype(float)
ds

# f g
0 0 0.00
1 1 1.00
2 10 10.00
3 100 100.00
4 5 5.00

ds.dtypes
{'f': dtype('int64'), 'g': dtype('float64')}

This gives the expected result, although note that in the resulting Categorical the bins are represented as floats. This is consistent with the docstring of rt.cut: "bins : ndarray of floats".

rt.cut(ds['f'], [-100, 2, 11, 50, 1000])
Categorical([-100.0->2.0, -100.0->2.0, 2.0->11.0, 50.0->1000.0, 2.0->11.0]) Length: 5
FastArray([1, 1, 2, 4, 2], dtype=int8) Base Index: 1
FastArray([b'-100.0->2.0', b'2.0->11.0', b'11.0->50.0', b'50.0->1000.0'], dtype='|S12') Unique count: 4

Passing float bins explicitly gives the same result:

rt.cut(ds['f'], [-100.0, 2.0, 11.0, 50.0, 1000.0]) == rt.cut(ds['f'], [-100, 2, 11, 50, 1000])
FastArray([ True, True, True, True, True])

However, if the bins contain np.inf, applying rt.cut to an int column gives the wrong result. This appears to be a bug:

rt.cut(ds['f'], [-np.inf, 2, 11, 50, np.inf])
Categorical([Filtered, Filtered, Filtered, Filtered, Filtered]) Length: 5
FastArray([0, 0, 0, 0, 0], dtype=int8) Base Index: 1
FastArray([b'-inf->2.0', b'2.0->11.0', b'11.0->50.0', b'50.0->inf'], dtype='|S10') Unique count: 4

Of course, applying rt.cut to a float column is fine even when the bins contain np.inf:

rt.cut(ds['g'], [-np.inf, 2, 11, 50, np.inf])
Categorical([-inf->2.0, -inf->2.0, 2.0->11.0, 50.0->inf, 2.0->11.0]) Length: 5
FastArray([1, 1, 2, 4, 2], dtype=int8) Base Index: 1
FastArray([b'-inf->2.0', b'2.0->11.0', b'11.0->50.0', b'50.0->inf'], dtype='|S10') Unique count: 4

TimeSpan does not display the day component of a timespan

In [11]: rt.TimeSpan(23,'h')

Out[11]: TimeSpan(['23:00:00.000000000'])

In [12]: rt.TimeSpan(24,'h')

Out[12]: TimeSpan(['00:00:00.000000000'])

In [13]: rt.TimeSpan(25,'h')

Out[13]: TimeSpan(['01:00:00.000000000'])

seg fault on rolling_diff

Calling rolling_diff with a misaligned array (one whose length does not match the Categorical) causes a segmentation fault that crashes the kernel.

In [1]: import riptable as rt
In [2]: import numpy as np
In [3]: c = rt.Cat(np.random.randint(0,10,1_000_000))
In [4]: x = np.random.randn(5)
In [5]: y = np.random.randn(1_000_000)
In [6]: c.rolling_diff(y)
     #   col_0
------   -----
     0     nan
     1     nan
     2     nan
     3     nan
     4     nan
     5    0.06
     6   -0.23
     7     nan
     8     nan
     9    0.06
    10    0.36
    11   -1.37
    12     nan
    13    2.49
    14   -0.77
   ...     ...
999985    1.81
999986   -0.58
999987   -0.44
999988    1.09
999989   -1.42
999990    1.26
999991   -0.90
999992   -1.03
999993   -0.50
999994   -0.32
999995    0.23
999996   -1.03
999997    1.57
999998   -2.12
999999   -2.43

[1000000 rows x 1 columns] total bytes: 7.6 MB

In [7]: c.rolling_diff(x)
Segmentation fault

In-place logical NOT function (rt.logical_noti)

I have some code which, in several places, builds up a logical (boolean) mask for the purpose of filtering a dataset. It builds the inverse mask and applies a logical NOT (~my_mask) at the end, because the logic is easier to understand that way than rewriting it just to avoid the final NOT.

In these cases, it would be useful to have an in-place version of logical NOT which accepts a boolean mask and inverts it in place, avoiding the extra memory allocation that comes with the normal ~ operator.
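Until such a function exists, a workaround sketch (numpy, not riptable API) is to use the ufunc out= parameter, which writes the inverted result back into the mask's own buffer:

import numpy as np
import riptable as rt

mask = rt.FA([True, False, True, False])

# Invert in place: the result is written back into mask's buffer,
# so no new array is allocated (unlike ~mask, which returns a copy).
np.logical_not(mask, out=mask)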

Calling imatrix_make changes the int columns to float in the original dataset

ds = rt.Dataset({'f': [1,2,3,4]})
ds['g'] = [1.0, 2.2, 3.3, 4.4]
ds['h'] = list('abcd')
ds['m'] = rt.Categorical(ds['f'])
print(ds.dtypes)
print(ds)
print(ds.imatrix_make(cats=True))
print(ds.dtypes)
print(ds)

{'f': dtype('int64'), 'g': dtype('float64'), 'h': dtype('S1'), 'm': dtype('int8')}

#   f      g   h   m
-   -   ----   -   -
0   1   1.00   a   1
1   2   2.20   b   2
2   3   3.30   c   3
3   4   4.40   d   4

[[1.  1.  1. ]
 [2.  2.2 2. ]
 [3.  3.3 3. ]
 [4.  4.4 4. ]]
{'f': dtype('float64'), 'g': dtype('float64'), 'h': dtype('S1'), 'm': dtype('float64')}

#      f      g   h      m
-   ----   ----   -   ----
0   1.00   1.00   a   1.00
1   2.00   2.20   b   2.00
2   3.00   3.30   c   3.00
3   4.00   4.40   d   4.00

FastArray.describe fails on custom quantiles

Calling FastArray.describe with custom quantiles fails when the quantile list has a different length than the default one.

fa = rt.FA(np.arange(1000))
fa.describe(q=[0.10, 0.25, 0.50, 0.75, .9, .99])

This happens with riptable version 1.5.2

saving small arrays and reading with filter

There is a bug when saving small arrays (which don't compress) and then loading with a filter:

ds = rt.Dataset({_k: list(range(_i * 10, (_i + 1) * 10)) for _i, _k in enumerate(['a', 'b', 'c', 'd', 'e'])})
ds.save('test.sds')
rt.load_sds('test.sds', filter=[False, False, True, True, True, False, False, False, False, False])

Invalid int does not propagate through addition and gives the wrong answer

ds = rt.Dataset({'f': [1,2,3,4], 'g': [1.1, 2.2, np.nan, 4.4]})

ds[3, 'f'] = np.nan
ds['f2'] = 1
ds['g2'] = 1.0

ds['f3'] = ds['f'] + ds['f2']
ds['f4'] = ds['f'] * 2
ds['g3'] = ds['g'] + ds['g2']

ds[:, ['f3', 'f4', 'g3']]

# | f3                   | f4 | g3
--|----------------------|----|-----
0 | 2                    | 2  | 2.10
1 | 3                    | 4  | 3.20
2 | 4                    | 6  | nan
3 | -9223372036854775807 | 0  | 5.40

The last row above is wrong: adding 1 to an invalid int gives a very large negative int, and multiplying an invalid int by 2 gives 0.
In comparison, a nan value in a float column behaves as expected.

print(ds['f'].sum(), ds['f'].nansum(), ds['g'].sum(), ds['g'].nansum())
-9223372036854775802 6 nan 7.700000000000001

Installation issue from a fresh environment

riptable does not seem to install from source within a fresh environment.

Steps to reproduce (all of this was run from the top level of the repo except where otherwise stated):

  1. Create an anaconda environment. As an example, here is a "developer's" environment file I created (environment.yml):
name: rtdev
channels:
  - conda-forge
  - defaults
dependencies:
  - python>=3.7
  - numpy
  - scipy
  - numba
  - python-dateutil
  - pandas
  - pre-commit
  - black
  - flake8
  - pytest
  - coverage
  2. Install and activate the environment
conda env create -f environment.yml
conda activate rtdev
  3. Install riptable
python setup.py install
  4. Attempt to import riptable from inside python (same result when running from the top of the repo or from other places):
python -c "import riptable"

which is met with the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/tmcclintock/Github/riptable/riptable/__init__.py", line 10, in <module>
    from .rt_fastarray import FastArray, Threading, Recycle, Ledger
  File "/Users/tmcclintock/Github/riptable/riptable/rt_fastarray.py", line 9, in <module>
    import riptide_cpp as rc
ImportError: dlopen(/opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptide_cpp-1.5-py3.8-macosx-10.9-x86_64.egg/riptide_cpp.cpython-38-darwin.so, 2): Library not loaded: @rpath/libzstd.1.dylib
  Referenced from: /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptide_cpp-1.5-py3.8-macosx-10.9-x86_64.egg/riptide_cpp.cpython-38-darwin.so
  Reason: image not found

Of note: running pip install riptable again yields the following:

Requirement already satisfied: riptable in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptable-1.0.3-py3.8.egg (1.0.3)
Requirement already satisfied: numpy in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from riptable) (1.19.1)
Requirement already satisfied: riptide_cpp in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/riptide_cpp-1.5-py3.8-macosx-10.9-x86_64.egg (from riptable) (1.5)
Requirement already satisfied: ansi2html in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages/ansi2html-1.5.2-py3.8.egg (from riptable) (1.5.2)
Requirement already satisfied: numba in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from riptable) (0.51.1)
Requirement already satisfied: six in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from ansi2html->riptable) (1.15.0)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from numba->riptable) (0.34.0)
Requirement already satisfied: setuptools in /opt/anaconda3/envs/rtdev/lib/python3.8/site-packages (from numba->riptable) (49.6.0.post20200814)

which indicates that riptide_cpp is installed, but for whatever reason the @rpath/libzstd.1.dylib image was not put into the proper place on the path.

System details:
macOS - 10.15.4
Python - 3.8.5 (same issue with 3.7)

Suggested solution:
Somehow ensure that the installer for riptide_cpp works correctly; I'm not entirely sure how to do this. Happy to raise this issue in that repo instead.

Note that I did not encounter this issue when installing riptable on Linux machines.

Provide option to not use the global shared memory namespace on Windows

The current "Global" prefix requires the application to have SE_CREATE_GLOBAL_NAME permissions which typically requires the process to run as an Administrator. Having the option to not use the global shared memory namespace could make development and automated testing easier.

inconsistent behavior: repeated reference to the same column versus row, dataset versus fast array

import riptable as rt
rt.__version__
'1.0.19'

ds = rt.rt_dataset.Dataset({'f': [1,2,3,4], 'g': [5,6,7,8]})
ds
# f g
0 1 5
1 2 6
2 3 7
3 4 8

Referencing column 'f' twice only returns it once:

ds[:, ['f', 'g', 'f']]
# f g
0 1 5
1 2 6
2 3 7
3 4 8

fast_arr = ds.imatrix_make()

In the FastArray, a repeated reference duplicates the column.

This is also the numpy and pandas behavior.

fast_arr[:, [0,1,0]]
FastArray([[1, 5, 1],
           [2, 6, 2],
           [3, 7, 3],
           [4, 8, 4]])

On the other hand, if I reference the same row multiple times, both the Dataset and the FastArray duplicate it:

ds[[0,1,0], :]

# f g
0 1 5
1 2 6
2 1 5

fast_arr[[0,1,0], :]
FastArray([[1, 5],
           [2, 6],
           [1, 5]])

searchsorted requires unique list

The same inputs produce different results from np.searchsorted and rt.searchsorted; np.searchsorted returns the correct results:
a = np.array([100, 101, 102, 102, 102, 103, 106, 106, 107, 107, 110, 118,134, 146, 146, 147, 147, 148, 148, 149])
b = np.array([134, 146, 147, 148, 149])
np.searchsorted(a, b)

array([12, 13, 15, 17, 19])
vs
a = rt.FastArray([100, 101, 102, 102, 102, 103, 106, 106, 107, 107, 110, 118, 134, 146, 146, 147, 147, 148, 148, 149])
b = rt.FastArray([134, 146, 147, 148, 149])
rt.searchsorted(a, b)
FastArray([12, 14, 15, 17, 19], dtype=int8)

categorical mean on float32 has low resolution

n = 100_000_000

var64 = rt.FA(np.random.normal(loc=10, scale=5, size=n).astype('float64'))
var32 = rt.FA(var64.copy().astype('float32'))
cat = rt.Cat(np.random.choice(['cat1', 'cat2'], size=n, replace=True))

cat.nanmean(var32)   # repeatedly evaluating this gives fluctuating results
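A workaround sketch until the accumulation is fixed (using var32 and cat from the repro above): cast the float32 column to float64 so the grouped reduction accumulates at higher precision.

stable_mean = cat.nanmean(var32.astype(np.float64))   # accumulate in float64; repeated calls should agree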

categorical groupby shift can segfault

import riptable as rt
import numpy as np

ds = rt.Dataset({'x': np.arange(1000),
                 'c': rt.Cat([0, 1] * 500)})
agg = ds.c.min(ds.x)
do_something_stupid = ds.c.shift(agg, 1)

I get a segmentation fault running this on Linux with riptable version 1.0.19.

groupby.diff error when array length > 3 billion

I am playing with riptable version 1.0.19. I came across a bizarre error when I use groupby.diff(): it seems that if the dataset has fewer than 2 billion rows I am fine, but I get arbitrary results when I have 3 billion rows. Please see the attached notebook for a reproducible (big) example (it uses up to 500 GB of RAM).

tsd_1m_tmp = tsd_1m.head(int(3e9))
tsd_1m_tmp['foo_delta'] = tsd_1m_tmp.groupby(['date', 'symbol'])['foo'].diff()
tsd_1m_tmp

multi-key outer merge fails when first key is a string column

Performing an outer merge with rt.merge2() on multiple keys fails when the first key (for one of the datasets) is a string column.

Repro:

ds1 = rt.Dataset({'f': ['1', '2', '3', '4'], 'g': [1, 2, 3, 4], 'h1': [1, 1, 1, 1]})
ds2 = rt.Dataset({'f': ['2', '3', '4', '5'], 'g': [2, 3, 4, 5], 'h2': [2, 2, 2, 2]})
ds1.merge2(ds2, on=['f', 'g'], how='outer')

Stack Trace:

AttributeError                            Traceback (most recent call last)
<ipython-input-28-d68d510b6b4b> in <module>
----> 1 ds3 = ds1.merge2(ds2, on=['f', 'g'], how='outer')


/my_python_env/lib/python3.7/site-packages/riptable/rt_dataset.py in merge2(self, right, on, left_on, right_on, how, suffixes, copy, indicator, columns_left, columns_right, validate, keep, high_card, hint_size)
   3708             self, right, on=on, left_on=left_on, right_on=right_on, how=how,
   3709             suffixes=suffixes, copy=copy, indicator=indicator, columns_left=columns_left, columns_right=columns_right,
-> 3710             validate=validate, keep=keep, high_card=high_card, hint_size=hint_size)
   3711     merge2.__doc__ = rt_merge.merge2.__doc__
   3712


/my_python_env/lib/python3.7/site-packages/riptable/rt_merge.py in merge2(left, right, on, left_on, right_on, how, suffixes, copy, indicator, columns_left, columns_right, validate, keep, high_card, hint_size, **kwargs)
   2016     left_grouping, _ = get_or_create_keygroup(left_on_arrs, 0, how != 'left', False, high_card, hint_size)
   2017     right_grouping, right_groupby_grouping =\
-> 2018         get_or_create_keygroup(right_on_arrs, 1, how != 'right', how == 'outer', high_card, hint_size)
   2019
   2020     if logger.isEnabledFor(logging.INFO):


/my_python_env/lib/python3.7/site-packages/riptable/rt_merge.py in get_or_create_keygroup(keycols, index, force_invalids, create_groupby_grouping, high_card, hint_size)
   1995                             # Need to copy 'isvalid' to avoid sharing the same instance with 'valid_join_tuples_mask',
   1996                             # which'll lead to them incorrectly having the same data (since we're updating in-place).
-> 1997                             valid_groupby_tuples_mask = isvalid.copy()
   1998                         elif isvalid is not None:
   1999                             valid_groupby_tuples_mask |= isvalid


AttributeError: 'NoneType' object has no attribute 'copy'

Inv int gets clipped, but nan in a float col doesn't

This is not necessarily a bug; I just want to report that the behavior differs depending on whether the invalid value is in an int or a float column.

ds = rt.Dataset({'f': [1,2,3,4], 'g': [1.1, 2.2, np.nan, 4.4]})
ds[3, 'f'] = np.nan

print(ds)

#     f      g
-   ---   ----
0     1   1.10
1     2   2.20
2     3    nan
3   Inv   4.40

ds['f'].clip(2, 3)
FastArray([2, 2, 3, 2])

ds['g'].clip(2, 3)
FastArray([2. , 2.2, nan, 3. ])

rt.quantile returns extraneous data, and the wrong dtype

rt.quantile returns internal data that is used to compute the quantile (the sorted array and counts), contrary to what the docstring specifies. Moreover, the actual quantiles returned are an np.array with dtype=object, which can cause issues down the road.

I would much rather have rt.quantile just return the quantiles: in general that's what is needed, and riptable should probably stay close to the numpy API whenever it can. Adding an extra keyword that requests the extra data when needed is fine, of course.

The issue I had with dtype=object is that rt.unique(quantiles) returns a string FastArray, which will wreak havoc when used later on. I don't know if I should enter another issue for this?

>>> rt.__version__
'1.0.25'
>>> np.__version__
'1.18.5'
>>> x = rt.FA(np.arange(100))
>>> r = rt.quantile(x, [0.25, 0.75])
>>> r
(array([24.75, 74.25], dtype=object), FastArray([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
           16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
           32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
           48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
           64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
           80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
           96, 97, 98, 99], dtype=int32), (0, 0, 0))
>>> rt.unique(r[0])
FastArray([b'24.75', b'74.25'], dtype='|S5')
>>>
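A workaround sketch based on the tuple observed above (not a documented contract): keep only the first element and cast away the object dtype before further use.

q_vals, _sorted, _counts = rt.quantile(x, [0.25, 0.75])
quantiles = rt.FA(q_vals.astype(np.float64))   # drop dtype=object
rt.unique(quantiles)                           # stays numeric instead of becoming a string FastArray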

Groupby changes type

It seems that groupby changes the type of categoricals back to strings, as here:

ds = rt.Dataset(dict(trader=['a', 'a', 'b', 'b', 'c', 'c'], shares=np.arange(6)))
ds.trader = rt.Categorical(ds.trader)
gds = ds.trader.sum(ds.shares)
print(ds.trader.dtype)
print(gds.trader.dtype)
int8
|S1

I get the same result if I do the groupby using the alternate approach/syntax:
gds = ds.gb('trader').shares.sum()

Shared memory appearing by error

I have several files saved in shared memory (/dev/shm) even though I am certain I have never used the 'share' parameter in rt.save_sds() (I didn't know it existed until today). The files were created at various dates between May and September 2020. I am unfortunately not able to write reproducible code, but I was wondering whether this has been observed before and whether anyone knows the cause.

When doing category.ema_normal with a filter, the filtered out values are treated as zero

When doing category.ema_normal with a filter, the filtered out values are treated as zero.

r = np.arange(10, dtype=float) + 1000
tmp = rt.Dataset({'x': r,
                  'x_zero': r.copy(),
                  'fltr': rt.FA([True]*10)})
tmp.x_zero[5] = 0
tmp.fltr[5] = False

c = rt.Cat([0]*10)
t = r

tmp['ema_x'] = c.ema_normal(tmp['x'], time=t, decay_rate=1)
tmp['ema_x_zero'] = c.ema_normal(tmp['x_zero'], time=t, decay_rate=1)
tmp['ema_x_filtered'] = c.ema_normal(tmp['x'], time=t, filter=tmp.fltr, decay_rate=1)
print(tmp)

produces:

#          x     x_zero    fltr      ema_x   ema_x_zero   ema_x_filtered
-   --------   --------   -----   --------   ----------   --------------
0   1,000.00   1,000.00    True   1,000.00     1,000.00         1,000.00
1   1,001.00   1,001.00    True   1,000.63     1,000.63         1,000.63
2   1,002.00   1,002.00    True   1,001.50     1,001.50         1,001.50
3   1,003.00   1,003.00    True   1,002.45     1,002.45         1,002.45
4   1,004.00   1,004.00    True   1,003.43     1,003.43         1,003.43
5   1,005.00       0.00   False   1,004.42       369.14           369.14
6   1,006.00   1,006.00    True   1,005.42       771.71           771.71
7   1,007.00   1,007.00    True   1,006.42       920.44           920.44
8   1,008.00   1,008.00    True   1,007.42       975.79           975.79
9   1,009.00   1,009.00    True   1,008.42       996.78           996.78

ema_x_filtered should be close to ema_x, without the big drop at index 5. The same issue happens with ema_decay. I am not sure about ema_weighted as I get an error running it.

Categorical unique when dirty, fails to call unique

In [1]: import riptable as rt
In [2]: c=rt.Cat((rt.arange(30) % 3)+1, {1: 'test', 2: 'test2', 3: 'test3'})
In [3]: c[0:5].filter().unique()
FastArray(['test', 'test2', 'test3'], dtype='<U5')

In [4]: c[0:5].unique()

This call normally fails.

Feature request: readthedocs

riptable has a lot of docstrings, but no dedicated docs. Since it is open source, we should take advantage of an open source documentation platform like RTD.

I would be happy to do this, as I have created docs pages for other projects in SIG. Just let me know if we want a particular platform, or if RTD is fine.

remove ansi2html dependency

In rt_meta we need to remove the dependency on ansi2html.

I would keep three versions of the data: data_plain, data_ansi, and data_html. I would remove the code in rt_display.py related to handling the display of metadata (the DisplayText class). @zeneli

I might also change the test which checks for identical HTML output.
For the HTML tags, I'm not sure whether we force a color or have a preferred way to do it.

BUG: Categorical.unique does not handle slicing correctly

Categorical.unique on a slice of the Categorical does not produce only the unique categories used within the slice. Instead it returns a FastArray of used and unused categories.

Repro

>>> import riptable as rt
>>> rt.Cat(list('xyyz'))[:3].unique()  # incorrect behavior; 'z' should not be in the FastArray
FastArray([b'x', b'y', b'z'], dtype='|S1')
>>> rt.Cat(list('xyyz'))[:3].filter().unique()  # correct behavior; 'z' is not in the FastArray
FastArray([b'x', b'y'], dtype='|S1')

Expose "raw" merge indices

In a conversation with some users of riptide, one user mentioned that for a few specialized cases they'd find it useful to have access to the "raw" fancy indices as calculated by the relational join algorithm in rt.merge2.

It should be relatively simple to refactor the current implementation of rt.merge2 to extract the part which sets up and calls the join implementation into a new function, e.g. rt.merge_indices(); the remaining code in rt.merge2() will just call this new function and then apply the calculated fancy indices to the selected columns of the left and right Datasets to produce the merged output Dataset.

This new function will provide a building block that could be used to implement some other types of merge algorithms in the future, and perhaps even used in rt.merge_asof(). Users of riptable will be able to use this to implement their own custom merge functions, or used as a general key-alignment mechanism (e.g. if you wanted to implement a vectorized "bimap" data structure).
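A hypothetical usage sketch of the proposed function (the name, signature, and return shape are assumptions, as are the ds_left/ds_right Datasets and their columns; none of this is existing riptable API):

# Hypothetical: rt.merge_indices() does not exist yet.
# left_idx / right_idx would be the fancy indices produced by the join algorithm.
left_idx, right_idx = rt.merge_indices(ds_left, ds_right, on='key', how='left')

# A caller could then apply the indices to whichever columns they need, e.g. to
# build a custom merged Dataset or a vectorized key-alignment ("bimap") structure.
aligned_value = ds_right.value[right_idx]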

Support os.PathLike in functions accepting paths

Functions like rt.load_sds() accept a str- or bytes-typed path (or a sequence of such); however, we should also support Python's os.PathLike type in these functions to make calling them easier and more seamless for users who're making use of types from Python's pathlib (e.g. pathlib.Path).

All we really need to do to implement this support is use the os.fspath() function to convert any path-like objects to a string representation before, for example, passing those strings down to a riptide_cpp function.
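A minimal sketch of that normalization step (the helper name is illustrative, not riptable code):

import os
from typing import Union

def _normalize_path(path: Union[str, bytes, os.PathLike]) -> Union[str, bytes]:
    # os.fspath() returns str/bytes unchanged and calls __fspath__() on
    # PathLike objects such as pathlib.Path, so downstream riptide_cpp code
    # only ever sees plain strings.
    return os.fspath(path)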

Extracting scalar from FastArray returns numpy scalar

Extracting a single element of a FastArray returns a numpy array scalar, but it should probably be returning a riptable array scalar using one of the scalar types defined in rt_numpy.py.

I ran into this today when trying to implement a small new riptable feature I need for another project. The current implementation returning a numpy array scalar means the value loses semantic information about riptable's invalid values -- so an invalid value recognized by rt.isnan, when extracted, gives something not recognized as an invalid.

Repro:

In [67]: rt.isnan(np.array([-128], dtype=np.int8).view(rt.FA)[0])
False

In [68]: np.array([-128], dtype=np.int8).view(rt.FA)[0]
-128

In [69]: isinstance(np.array([-128], dtype=np.int8).view(rt.FA)[0], np.generic)
True

In [70]: rt.isnan(np.array([-128], dtype=np.int8).view(rt.FA))
FastArray([ True])

In [71]: rt.isnan(np.array([-128], dtype=np.int8).view(rt.FA)[0])
False

In [72]: type(np.array([-128], dtype=np.int8).view(rt.FA)[0])
numpy.int8

isnotfinite does not match ~isfinite

As of riptable 1.0.9, the rt.isnotfinite function returns an np.ndarray rather than a FastArray. The values in the returned array also appear to be incorrect for integer-based arrays: they don't match the logical-NOT of what rt.isfinite returns when called with the same input.

Repro:

In [1]: import numpy as np

In [2]: import riptable as rt

# Floating-point example
In [3]: arr1 = np.array([-0.1, 0.0, 0.1, 0.2, np.nan, 0.4, np.inf, -np.inf, 0.7])

In [4]: rt.isfinite(arr1)
FastArray([ True,  True,  True,  True, False,  True, False, False,  True])

In [5]: rt.isnotfinite(arr1)
array([False, False, False, False,  True, False,  True,  True, False])

In [6]: ~rt.isfinite(arr1)
FastArray([False, False, False, False,  True, False,  True,  True, False])

# Integer example (rt.isnotfinite not respecting invalids like rt.isfinite)
In [7]: arr2 = rt.FA([0, -128, -127, 120, 127], dtype=np.int8)

In [8]: rt.isfinite(arr2)
FastArray([ True, False,  True,  True,  True])

In [9]: rt.isnotfinite(arr2)
array([False, False, False, False, False])

In [10]: ~rt.isfinite(arr2)
FastArray([False,  True, False, False, False])

fill_forward does not need inplace argument

Hello,

Riptable’s Categorical ignores the inplace argument of cat.fill_forward and fill_backward. Could you either implement this feature (preferred), or remove the keyword?

To reproduce:

fa = rt.FA([1.,np.nan, 1.])
c = rt.Cat([1, 1, 1])
c.fill_forward(fa, inplace=True)
assert(fa[1] == 1.)

Thanks,

--Marc

Merge with categorical and copy=False not working

The following throws an AttributeError ('Categorical' object has no attribute '_grouping') with riptable version 1.0.25:

import riptable as rt
import numpy as np

tds = rt.Dataset(dict(id_column=[1, 1, 1, 2, 2, 2], shares=np.arange(6), trader=['a', 'a', 'b', 'b', 'c', 'c']))
tds.trader = rt.Categorical(tds.trader)
gds = tds.gb('id_column').sum()
gds.col_rename('shares', 'shares_sum')
tds = rt.merge_lookup(tds, gds, on='id_column', suffixes=(False, False), copy=False)
print(tds)

The code above works if I leave 'trader' as a string column rather than a Categorical (i.e. remove the rt.Categorical line), or if I remove copy=False from the merge_lookup call. For performance reasons, however, we often want both.

left merge on another dataset with strided array fails

Hi,

One of the columns in my rt.Dataset wasn't a contiguous array (it contained float64s but had a stride of 24), and when I tried to left merge on another dataset, with columns_right=my_column, it failed. I fixed it by copying the array to make it contiguous, but this seems like a bug, not a feature.

Best,

Daniel
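A sketch of the reported scenario and workaround (the key and column names are illustrative, not from the original report):

import numpy as np
import riptable as rt

base = np.arange(30, dtype=np.float64)
strided = base[::3]                        # non-contiguous view: stride of 24 bytes

# Workaround: copy the strided view into contiguous memory before merging.
my_column = np.ascontiguousarray(strided)

left = rt.Dataset({'key': np.arange(10)})
right = rt.Dataset({'key': np.arange(10), 'my_column': my_column})
merged = left.merge2(right, on='key', how='left', columns_right=['my_column'])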

Invalid ints not working

Invalid ints do not persist through common operations, e.g.:

lds = rt.Dataset(dict(a=['a', 'b', 'c']))
rds = rt.Dataset(dict(a=['a', 'b'], v=[1, 2]))
ds = rt.merge_lookup(lds, rds, on='a')
print(ds)
print()

print(ds.v.isna())
print()
print(ds.v + ds.v)
print()
print((ds.v * ds.v).isna())

#   a     v
-   -   ---
0   a     1
1   b     2
2   c   Inv

[False False True]

[2 4 0]

[False False False]

Is this fixable? Alternatively should we do away with invalid ints?

merge2 broken for outer merge where right keyset is a subset of left keyset

A user reported running into an error today when calling rt.merge2() with how='outer'. After a bit of investigation, it turned out the user had (in this specific example) a 'right' Dataset whose keyset was a subset of the 'left' Dataset's keyset.

The key bug here was that calling FastArray.nansum() on an empty array while also explicitly specifying the output dtype caused a TypeError to be raised; specifically, FastArray._reduce_check() was attempting to convert None to an integer dtype, which failed. (The same case without specifying the dtype "succeeded" and returned nan because it fell into a logic case where the result was converted to float64.)
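A minimal sketch of the two triggers described above (the outer-merge shape and the empty-array reduction); the exact failure may differ by version:

import numpy as np
import riptable as rt

# Outer merge where the right keyset is a strict subset of the left keyset.
left = rt.Dataset({'k': [1, 2, 3, 4], 'a': [10, 20, 30, 40]})
right = rt.Dataset({'k': [2, 3], 'b': [200, 300]})
left.merge2(right, on='k', how='outer')              # reported to fail

# Underlying trigger: nansum of an empty array with an explicit output dtype.
rt.FA([], dtype=np.int64).nansum(dtype=np.int64)     # reported to raise TypeError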

Multi-column Categorical preserves filtered categories

Version: riptable 1.0.25

Creating a normal, single-key Categorical with no filter shows all of the categories are included in .category_dict:

>>> strs = rt.Cat(['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ccccc', 'bbbbb', 'aaaaa', 'aaaaa', 'bbbbb', 'aaaaa'])
>>> strs.category_dict
{'key_0': FastArray([b'aaaaa', b'bbbbb', b'ccccc', b'ddddd'], dtype='|S5')}

Creating the same categorical, but providing a filter which masks out the first 'ccccc' and the only 'ddddd' eliminates 'ddddd' as a category; that's the expected behavior since there's only one occurrence -- the one we're filtering out.

>>> mask = rt.FA([True, True, False, False, True, True, True, True, True, True], dtype=np.bool)
>>> strs_filt = rt.Cat(['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ccccc', 'bbbbb', 'aaaaa', 'aaaaa', 'bbbbb', 'aaaaa'], filter=mask)
>>> strs_filt.category_dict
{'key_0': FastArray([b'aaaaa', b'bbbbb', b'ccccc'], dtype='|S5')}

In the example above, the elements where mask is False are also assigned to the 0 bin (representing invalid/NA/filtered).

However, when creating a multi-column ("multi-key") Categorical, the filter does not seem to be (entirely) respected -- the filtered elements still get a category created for them (even if that row is the only occurrence of that key-tuple). The category created for the filtered row does have all of its elements (in the category arrays) set to invalid/default/NA values. Another invariant that's broken here is that the created categories are not all unique -- there are multiple occurrences of the key-tuples (i.e. rows if you vstacked the category arrays together into a Dataset).
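A repro sketch of the multi-key case, constructed to mirror the single-key example above (the described behavior is from the report, not verified output):

strs = ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ccccc', 'bbbbb', 'aaaaa', 'aaaaa', 'bbbbb', 'aaaaa']
nums = [1, 2, 3, 4, 3, 2, 1, 1, 2, 1]
mask = rt.FA([True, True, False, False, True, True, True, True, True, True], dtype=bool)

# Multi-key Categorical built with the same filter. Per the report, the filtered
# ('ddddd', 4) key-tuple still gets a category entry (with its key values set to
# invalid/NA), and the key-tuples in category_dict are not all unique.
multi_filt = rt.Cat([rt.FA(strs), rt.FA(nums)], filter=mask)
multi_filt.category_dict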
