Comments (14)
TODOs:
- config.page_size=1_000_000
- config.workdir=/tmp/tablite/
- config.disk_limit="10G"
- mkdir at first write. NOT before.
- central write_table function
  - create table yaml
  - write to /tables
  - create page index yaml
  - write to /page_index
- central write_page function
  - write to /pages
  - store data in .npy format
- rework Table.save to write .npz
- rework Table.load to read .npz
- rework MP functions to use path instead of h5path
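The "mkdir at first write" and central write_page items could look roughly like the sketch below. It assumes NumPy is installed. The `Config` class, the `write_page` signature, the `paginate` helper and the temporary workdir are hypothetical stand-ins for illustration, not tablite's actual API.

```python
import tempfile
from pathlib import Path

import numpy as np


class Config:
    # Hypothetical stand-in for the config values in the TODO list.
    page_size = 1_000_000
    workdir = Path(tempfile.gettempdir()) / "tablite_sketch"


def write_page(page_id, data, workdir=None):
    """Store one page as a .npy file; the pages folder is created only here."""
    root = Path(workdir) if workdir is not None else Config.workdir
    pages = root / "pages"
    pages.mkdir(parents=True, exist_ok=True)  # deferred until a write happens
    path = pages / f"{page_id}.npy"
    np.save(path, np.asarray(data))
    return path


def paginate(values, page_size=None):
    """Yield consecutive slices of at most page_size items."""
    size = page_size or Config.page_size
    for start in range(0, len(values), size):
        yield values[start:start + size]
```

Nothing touches the file system until `write_page` is called, so importing the module or creating an empty table leaves the workdir absent.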
from tablite.
Segment join and lookup into paginated operations to avoid out-of-memory errors.
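One way to picture a paginated join is sketched below in plain Python. It assumes rows are dicts and that the index built from the right-hand side fits in memory. The names `pages` and `paged_inner_join` are hypothetical and this is not how tablite implements its join.

```python
from itertools import islice


def pages(rows, page_size):
    """Yield lists of at most page_size rows from any iterable."""
    it = iter(rows)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page


def paged_inner_join(left_rows, right_rows, left_key, right_key, page_size=1_000_000):
    """Inner join that keeps the right-hand index and one left page in memory."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    for page in pages(left_rows, page_size):
        # One output page per input page; a real implementation
        # would write this page to disk before moving on.
        yield [{**left, **right} for left in page for right in index.get(left[left_key], [])]
```

Because results are produced one page at a time, peak memory is bounded by the index plus a single page rather than by the full joined table.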
from tablite.
Preferred approach for subclassing tables...
class MyTable(tablite.Table):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.x = kwargs.get("x", 42)  # <== special variable required on MyTable.

    def copy(self):
        # tablite.Table implements
        #   new = cls()  # self.x is now 42 !!!
        #   for name, column in self.items():
        #       new[name] = column
        # MyTable therefore implements:
        cp = super(MyTable, self).copy()
        cp.x = self.x  # updating to real x.
        return cp
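The effect can be reproduced without tablite. In the sketch below, `BaseTable` is a hypothetical stand-in that only mimics the "copy rebuilds via cls()" behaviour described in the comments above. It is not tablite's real class.

```python
class BaseTable:
    """Stand-in for tablite.Table: copy() rebuilds the object via type(self)()."""

    def __init__(self, columns=None, **kwargs):
        self.columns = dict(columns or {})

    def copy(self):
        new = type(self)()  # subclass attributes fall back to their defaults here
        new.columns = {name: list(col) for name, col in self.columns.items()}
        return new


class MyTable(BaseTable):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.x = kwargs.get("x", 42)

    def copy(self):
        cp = super().copy()
        cp.x = self.x  # carry the real value across
        return cp
```

Calling `BaseTable.copy` directly on a `MyTable` instance returns a `MyTable` whose `x` is back at the default, which is why the override is needed.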
from tablite.
To ensure a friendly import and override structure, tablite will use the following class hierarchy:
import level | .py module
---|---
0 | version, config, utils, datatypes, datasets
1 | base - holds classes: Table, Column, Page
2 | core - core functions: import, export, join, lookup, filter, ... (mp stuff)
3 | tools
Functions in core will include the option to drop in a modified tqdm if required.
from base import Table as BaseTable
class Table(BaseTable):
    ....
Table.import_file(...., add_your_favorite_tqdm_here...)
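A drop-in progress bar can be modelled as an optional callable that wraps the iterable. The sketch below is an assumption about the shape of such a hook. `import_rows`, its `progress` parameter and `CountingBar` are hypothetical names, not tablite's signature.

```python
def import_rows(lines, progress=None):
    """Parse comma-separated lines; `progress` wraps the iterable if given."""
    wrap = progress if progress is not None else (lambda iterable, **kwargs: iterable)
    return [line.rstrip("\n").split(",") for line in wrap(lines, desc="importing")]


class CountingBar:
    """Minimal tqdm-like wrapper that records how many items passed through."""

    def __init__(self):
        self.seen = 0
        self.desc = None

    def __call__(self, iterable, desc=None, **kwargs):
        self.desc = desc
        for item in iterable:
            self.seen += 1
            yield item
```

Since `tqdm.tqdm` itself takes an iterable plus a `desc` keyword, it could be passed as `progress` unchanged, and a subclass of it could be passed the same way.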
from tablite.
In the refactor we've decided to remove the slow operations pop and remove, but will keep remove_all.
from tablite.
Replace is refactored from using a single value to a mapping:
replaces values using mapping
example:
>>> t = Table(columns={'A': [1,2,3,4]})
>>> t['A'].replace({2:20,4:40})
>>> t[:]
np.ndarray([1,20,3,40])
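The mapping semantics can be illustrated on a plain list. This is a sketch of the idea only, assuming hashable values. The free function `replace` is hypothetical and says nothing about how the column method is implemented.

```python
def replace(values, mapping):
    """Return a new list where every value found in `mapping` is swapped."""
    return [mapping.get(value, value) for value in values]


original = [1, 2, 3, 4]
updated = replace(original, {2: 20, 4: 40})
```

Values without an entry in the mapping pass through unchanged, so a single call covers what previously required one call per value.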
from tablite.
Column.insert is being removed as it encourages the user to use slow operations.
It is better to perform the data manipulation in memory and drop the result into the column using col.extend(....) or col[a:b] = [result].
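The difference is easy to see on a plain Python list, used here as an assumed stand-in for a column. Both helper names are hypothetical.

```python
def insert_one_by_one(column, index, values):
    """Slow pattern: every insert shifts the tail of the column again."""
    for offset, value in enumerate(values):
        column.insert(index + offset, value)
    return column


def insert_as_block(column, index, values):
    """Preferred pattern: prepare the values first, then assign them in one step."""
    column[index:index] = values
    return column
```

Both functions give the same result, but the block version moves the tail once instead of once per inserted value.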
from tablite.
deprecated:
- copy_to_clipboard and copy_from_clipboard, as pyperclip doesn't seem to be maintained.
- class method from_dict, as Table(columns=dict) is now supported.
from tablite.
Column.histogram() will now return a dict with {key1: count1, key2: count2, ...} instead of two lists that people then have to zip into a dict anyway.
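The two return shapes can be compared with the standard library's `collections.Counter`. This is only an illustration of the output format. The function names are hypothetical and the sketch does not reflect tablite's internals.

```python
from collections import Counter


def histogram(values):
    """New style: one dict of value -> count."""
    return dict(Counter(values))


def histogram_as_lists(values):
    """Old style: parallel lists of keys and counts."""
    counts = Counter(values)
    return list(counts.keys()), list(counts.values())


keys, counts = histogram_as_lists(["a", "b", "a"])
```

Zipping the old-style result back together gives exactly the dict that the new-style call returns directly.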
from tablite.
deprecated table functions:
- reload_saved_tables. Use t.save(path) and Table.load(path)
- reset_storage. It's in temp.
- from_dict
deprecated column functions:
- to_numpy. Default table['name'] returns a numpy array. User should call table['name'].tolist() to get python lists (up to 6x slower)
Everything else remains.
from tablite.
staticmethod Table.head has been moved to tablite.tools where it belongs.
from tablite.
As Table now accepts the keyword `columns` as a dict:
t = Table(columns={'b':[4,5,6], 'c':[7,8,9]})
and the header/data combinations:
t = Table(header=['b','c'], data=[[4,5,6],[7,8,9]])
it is no longer necessary to write:
t = Table()
t['b'] = [4,5,6]
t['c'] = [7,8,9]
and the following assignment method is DEPRECATED:
t = Table()
t[('b','c')] = [ [4,5,6], [7,8,9] ]
which then produced a table with two columns:
t['b'] == [4,5,6]
t['c'] == [7,8,9]
from tablite.
missed feature: tablite must cache datatype on pages, so that type determination is near instant.
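A rough sketch of what caching datatypes per page could look like follows. The `Page` class and `column_dtypes` helper are hypothetical, and type names are recorded as strings purely for illustration.

```python
from collections import Counter


class Page:
    """Hypothetical page that records its value types once, at creation time."""

    def __init__(self, values):
        self.values = list(values)
        # Cached summary: type name -> number of values of that type.
        self.dtypes = Counter(type(value).__name__ for value in self.values)


def column_dtypes(pages):
    """Combine the cached per-page summaries without rescanning any values."""
    total = Counter()
    for page in pages:
        total.update(page.dtypes)
    return dict(total)
```

Determining a column's types then costs one small dict merge per page, independent of how many values each page holds.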
from tablite.
Complete in versions : 2023.6
from tablite.