Comments (6)
I tested with a set of 1483 files on a machine with 32GB of memory and 4 CPUs, and used the traceback
library to check for a memory leak. I ran 10 iterations and observed memory usage peaking at around 4GB. There were no obvious signs of a memory leak in this test.
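A minimal sketch of how a leak check over repeated iterations could look. It assumes Python's standard `tracemalloc` module, which is swapped in here and is not necessarily what was used in the test above. `workload` is a hypothetical stand-in for one run of the transform, and the tolerance values are arbitrary illustration defaults.

```python
import gc
import tracemalloc


def traced_memory_per_iteration(workload, iterations=10):
    """Run `workload` repeatedly; return traced memory (bytes) after each run."""
    tracemalloc.start()
    readings = []
    try:
        for _ in range(iterations):
            workload()  # hypothetical: one pass over the input files
            gc.collect()  # drop garbage so only live allocations are counted
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


def looks_like_leak(readings, tolerance=0.10, min_growth_bytes=1_000_000):
    """Heuristic: memory keeps growing past the reading taken after warm-up.

    The second reading is used as the baseline so that one-time caches
    filled during the first iteration are not counted as a leak.
    """
    baseline = readings[1] if len(readings) > 1 else readings[0]
    growth = readings[-1] - baseline
    return growth > max(tolerance * baseline, min_growth_bytes)
```

Readings that rise and then flatten, as reported above, would not be flagged. Readings that keep climbing on every iteration would be.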
from data-prep-kit.
I ran another test with the same 1483 files on a Podman VM with different memory configurations. The results are below.
It seems that around 4GB of available memory was needed to process all 1483 files successfully.
CPUs | Total Memory | Memory Used by Ray | Transform | Files Processed Successfully | Total Files | Job Status |
---|---|---|---|---|---|---|
4 | 8GB | 4.2GB | NOOP | 1483 | 1483 | Passed |
4 | 6GB | 3GB | NOOP | 910 | 1483 | Crashed |
4 | 4GB | 2GB | NOOP | 504 | 1483 | Crashed |
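To make the table easier to compare, the share of files completed in each configuration can be worked out directly. This is only arithmetic on the figures above. The helper name `completion_rate` is introduced here for illustration.

```python
# Figures copied from the table above:
# (total memory in GB, files processed successfully, total files)
runs = [(8, 1483, 1483), (6, 910, 1483), (4, 504, 1483)]


def completion_rate(processed, total):
    """Percentage of files processed before the job finished or crashed."""
    return 100.0 * processed / total


for total_gb, processed, total in runs:
    rate = completion_rate(processed, total)
    print(f"{total_gb}GB: {processed}/{total} files ({rate:.1f}%)")
# 8GB: 1483/1483 files (100.0%)
# 6GB: 910/1483 files (61.4%)
# 4GB: 504/1483 files (34.0%)
```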
@shahrokhDaijavad, @shivdeep-singh-ibm This is an important piece of information, not only for us but also for potential users. Can we please:
- Bring all of the results together in a separate document inside the project
- Once that document is in place, remove this issue
I agree, @blublinsky. I will create an md file called memorytest under doc with this information and link from the mac.md file to it.
@shahrokhDaijavad Great. Should it be called memory or endurance?
@blublinsky It's a combination of a memory leak test (usage peaks and flattens around 4GB, i.e., no leak) and an endurance test, which shows that with less total memory (4GB and 6GB) it is still possible to process roughly 500 or 900 files, respectively, before the job crashes. I will explain in the readme.
Related Issues (20)
- Running fdedup in the Notebook examples directory has a bug
- [Feature] pyarrow parquet write_table can save up to 30% storage with compression flag "ZSTD"
- [Feature] Enable an embeddable mode
- [Bug] Add transform to example notebook in context of Issue#283
- [Bug] Update documentation of repo level ordering transform
- [Bug] Add tests for repo level ordering module
- Improve ray store used in repo level ordering module.
- [Bug] Add kfp support in context of Issue#283
- [Bug] get_config_parameter returns without checking if the config value exists
- [Logging Feature] Logging INFO about completed x files in y mins should add (xx1 successfully and xx2 failed)
- [Bug] Resize behaves badly when there are lots of schema changes
- Tokenizer transform logs are filled with docs info and chuck index when parameter tkn_chunk_size is specified
- Add repo_name column to code2parquet tranform
- [Feature] Enable transform() to terminate all processing of documents across all instances
- [Feature] Capability to chunk text for RAG systems
- [Feature] Create vector embeddings
- Demo of data-prep-kit for RAG
- [Bug] Failing to publish repo_level_ordering
- [Feature] Create a demo notebook for RAG
- [Bug] pdf2parquet test failure when running locally (passes in github worflow)