Hi John,
We are actively looking into this and will post progress updates here. If you could share some details about the file system being indexed, that would be great. We suspect that some directories have a very large number of subdirectories, causing the queues to grow quickly. If you'd like to discuss via email, feel free to reach out to [email protected]
from gufi.
Hello Dominic!
Thanks for the reply! Just for public visibility, the file system I'm trying to index is BeeGFS. It is 8.1 PB usable and currently has 1,034,075,745 inodes in use. I'll email you directly with additional details.
John DeSantis
from gufi.
Hi. Please try out these two branches:
insitu: gufi_dir2index -C <count> ..., where count is the maximum number of subdirectories in each directory that are enqueued for parallel processing. Any remaining subdirectories are processed in the current directory's thread, reducing the number of active allocations. The right value depends on the distribution of branching factors in the tree being indexed.
deferred-queue: gufi_dir2index -M <bytes> ..., where bytes is a soft target size that the queues try to keep the overall memory footprint under. bytes is divided by the number of threads. Work is put on a separate per-thread queue that is not processed until the normal per-thread queue is emptied.
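To make the -C <count> idea concrete, here is a minimal single-threaded sketch in Python. It is not GUFI's implementation: the in-memory `tree` dict, the `walk` function, and the `max_enqueued` parameter are all hypothetical stand-ins, and the real tool uses a thread pool. It only shows how capping the number of subdirectories that get queued per directory keeps the queue short when one directory is very wide.

```python
from collections import deque


def walk(tree, root, max_enqueued):
    """Visit every directory in `tree` (dict: directory -> list of subdirectories).

    Returns (visited directories, peak queue length).
    `max_enqueued` plays the role of <count>; it is a hypothetical name.
    """
    queue = deque([root])
    visited = []
    peak = 1

    def process(directory):
        nonlocal peak
        visited.append(directory)
        subdirs = tree.get(directory, [])
        # Hand at most `max_enqueued` subdirectories to the shared queue ...
        queue.extend(subdirs[:max_enqueued])
        peak = max(peak, len(queue))
        # ... and descend into the rest right away, without queueing them.
        for sub in subdirs[max_enqueued:]:
            process(sub)

    while queue:
        process(queue.popleft())
    return visited, peak


# One very wide directory: 1,000 subdirectories directly under the root.
wide = {"/": [f"/d{i}" for i in range(1000)]}
all_queued, peak_unbounded = walk(wide, "/", max_enqueued=1000)
some_inline, peak_bounded = walk(wide, "/", max_enqueued=10)
print(peak_unbounded, peak_bounded)
```

Both calls visit the same 1,001 directories; only the peak queue length differs. The trade-off is that subdirectories handled inline are not available to other workers, which is why the best value depends on the shape of the tree.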
from gufi.
Hello,
Thank you! I am running the deferred-queue build now. I'll update this issue with the results once it is done.
John DeSantis
from gufi.
Hello,
I'm pleased to state that I've noticed stability in the memory footprint so far! I'll update again once the indexing finishes. Feel free to review the screenshot I've included to show that the deferred-queue branch has had a positive effect, including a small decrease in the amount of memory being used.
And for what it's worth, the directory in question has 38,574,219 subdirectories.
from gufi.
Hello,
I wanted to provide an update. Unfortunately, indexing on two specific directories still fails due to memory usage with either branch and various combinations of arguments, so we've reached out to the users in question to tidy up their directories.
Once we hear back from them I'll get a count of their current directory structures and verify items have been culled. I'll then re-run the indexing process (both methods) and report back.
Thanks again!
John DeSantis
from gufi.
Please give the reduce-work-size branch a try. The data structure that is enqueued has been reduced in size by over 40%. Additionally, if CMake detects the zlib library and -C (with no value this time) is passed to gufi_dir2index, the work items will be compressed.
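As a rough illustration of why compressing queued work items can help, the sketch below round-trips a path string through Python's zlib module. This is only an analogy: the real work item is a C structure whose layout is not shown in this thread, and the `pack`/`unpack` helpers and the example path are hypothetical.

```python
import zlib


def pack(path: str) -> bytes:
    """Compress a queued item before it sits in memory (hypothetical helper)."""
    return zlib.compress(path.encode("utf-8"))


def unpack(blob: bytes) -> str:
    """Restore the item when a worker picks it up (hypothetical helper)."""
    return zlib.decompress(blob).decode("utf-8")


# A made-up, repetitive path of the kind deep directory trees tend to produce.
item = "/scratch/project/" + "run_0001/" * 20 + "output"
packed = pack(item)
restored = unpack(packed)
print(len(item.encode("utf-8")), len(packed))
```

The gain depends on how repetitive the queued data is, and each item now costs a compress and a decompress call, so this trades CPU time for memory.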
from gufi.
Hello,
The reduce-work-size branch is extremely promising! So far, not only has the indexing process run longer (4x) for this specific directory, its memory footprint has been reduced by ~83%! Please see the screenshot below to see the fruits of your labor.
I'll update again once the process completes or is terminated.
from gufi.
Hello,
Both indexes completed without any logged errors. CPU and memory statistics for both processes are listed below:
56,591,355 directories & 80% reduction in memory footprint:
23962.18user 9703.66system 35:46:25elapsed 26%CPU (0avgtext+0avgdata 41947132maxresident)k 104162440inputs+3706836056outputs (24major+44473160minor)pagefaults 0swaps
38,574,219 directories & 91.5% reduction in memory footprint:
9857.33user 3419.66system 14:54:09elapsed 24%CPU (0avgtext+0avgdata 18088904maxresident)k 30993064inputs+1370928848outputs (2major+25345753minor)pagefaults 0swaps
We will need to provide more storage for the out of tree index though, as these two directories consumed all 3.7 TB allocated for the indexes.
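For readers who want the peak memory figures in more familiar units, the maxresident values in the time output above can be converted with a few lines of Python. This is a small sketch that assumes the value is reported in kilobytes of 1024 bytes; the `max_resident_gib` helper is a hypothetical name.

```python
import re


def max_resident_gib(time_output: str) -> float:
    """Extract the maxresident value (assumed to be KiB) and convert it to GiB."""
    match = re.search(r"(\d+)maxresident\)k", time_output)
    if match is None:
        raise ValueError("no maxresident field found")
    return int(match.group(1)) / (1024 ** 2)


# The two summary lines reported above, truncated after the memory field.
first = "23962.18user 9703.66system 35:46:25elapsed 26%CPU (0avgtext+0avgdata 41947132maxresident)k"
second = "9857.33user 3419.66system 14:54:09elapsed 24%CPU (0avgtext+0avgdata 18088904maxresident)k"

first_gib = max_resident_gib(first)
second_gib = max_resident_gib(second)
print(f"{first_gib:.2f} GiB, {second_gib:.2f} GiB")
```

Under that assumption the two runs peaked at about 40.00 GiB and 17.25 GiB of resident memory.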
Thank you!
John DeSantis
from gufi.
Hello,
Thanks so much for the challenging use case to help make the software better.
Certainly! I thoroughly appreciate the time and fixes the development team produced.
Potentially these things could help you and of course it would be nice to know if these tools can handle such a wide directory structure that your site has, something we could help ensure works.
Agreed. Currently, I've only been using gufi_find for querying the tree, so I'll need to read up on the available documentation (including recent slides) and look at some of the commit logs to take advantage of these options. Despite this, gufi_find is still a gem, as our current production need is to speed up file system purging. Once we have developed a proper workflow for this process, we'll most likely apply it to several other production file systems, and then begin to utilize GUFI's other features for richer file system accounting.
If there are any other features the development team would like to test against large, production file systems, let me know.
Thank you again!
John DeSantis
from gufi.
The changes made in the 3 branches have been merged into one branch and then merged with master via #125. Compression with zlib has been changed to use -e since -C was used by the insitu branch.
Please let us know if the combining of all of the changes broke something.
from gufi.
Hello,
The changes made in the 3 branches have been merged into one branch and then merged with master via #125. Compression with zlib has been changed to use -e since -C was used by the insitu branch.
Excellent!
Please let us know if the combining of all of the changes broke something.
Will do! I am running an index now. Also, I've deliberately skipped indexing the problematic directories for now since users are actively removing content; one of the logs had 1k+ lines stating that directories couldn't be opened due to the active file/directory removals.
Thanks again for such amazing responses from the development team!
John DeSantis
from gufi.
Gary,
Glad it is helping you. Yes, gufi_find is about 1/50th of the capability GUFI has.
Where's that "mind blown" gif???
Very complex things are pretty easy, like "make me a sorted histogram of all the file types in the entire system", and so are simple things like "find me the top 3 largest directories". Hope you find use for some of it…
Thanks for these suggestions, a seed has been planted...
John DeSantis
from gufi.
Hello,
Marking this as resolved! I am no longer seeing memory issues that I had seen previously. Thanks again for all of the development team's hard work.
John DeSantis
from gufi.
Gary,
Try
./gufi_query -n 1 -E "select path(uid,uid),* from entries;" /gufi_index/lo_scemo/ | head -10
And of course
./gufi_query -n 1 -E "select path(uid,uid) || '/' || name, size, uid, uidtouser(uid,20) from entries;" /gufi_index/lo_scemo/ | head -10
These work perfectly, thank you for the insight!
We need to produce a "SQL guide to GUFI"
I imagine we will because we have a tutorial coming up at MSST in May
A guide would be most beneficial, especially since (as you already said) there are a lot more data points available via gufi_query vs. the wrapper tools gufi_find, gufi_ls, etc. I'd love to attend MSST, but it's too short of a notice and I'm in the middle of a cross country move :(
You can look at the entire schema in the code, or just go to a dir and run
sqlite3 dbname
where dbname is the name of the database in that dir. Then something like .schema or something
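The same schema inspection can also be scripted. The sketch below uses Python's standard sqlite3 module to read the CREATE statements out of sqlite_master, which is roughly what the sqlite3 shell's .schema command prints. The `dump_schema` helper and the throwaway `demo` table are hypothetical; to look at a real index, pass the path of the database file inside one of its directories.

```python
import os
import sqlite3
import tempfile


def dump_schema(db_path):
    """Return (type, name, sql) for every object with a stored CREATE statement."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT type, name, sql FROM sqlite_master "
            "WHERE sql IS NOT NULL ORDER BY type, name"
        ).fetchall()
    finally:
        conn.close()


# Self-contained demo on a throwaway database (NOT the GUFI schema).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(demo_path)
    conn.execute("CREATE TABLE demo (name TEXT, size INTEGER)")
    conn.commit()
    conn.close()
    schema = dump_schema(demo_path)

for kind, name, sql in schema:
    print(kind, name)
    print(sql)
```

Note that sqlite3.connect creates the file if it does not exist, so double-check the path before pointing this at an index directory.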
Agreed, that's what I did since we all love to abuse and exploit SQL!
There is a LOT there and happy to describe if it helps
You may want to rescind that offer, hahaha!
Thanks again,
John DeSantis
from gufi.
I pushed some commits that restore the original path(), epath(), and fpath() sqlite functions. path() no longer takes in arguments. I have renamed the function with 2 arguments (summary.name and summary.rollupscore) to rpath(), which is meant for handling path names in rolled up indices. Its intended use in the -E query is:
SELECT rpath(summary.name, summary.rollupscore) || "/" || pentries.name
FROM summary, pentries
WHERE summary.inode == pentries.pinode;
However, because we are not in the business of forcing user queries to be in a certain format, it is easy to misuse rpath().
uidtouser, gidtogroup, and several other functions had an extra unexpected argument. Those extra arguments have been removed. They were meant to help with alignment of output, but were always unused and set to 0.
from gufi.
@tacc-desantis I just pushed GUFI-SQL.docx, Gary's guide on using GUFI's SQL schema and functions, to the docs directory. Please pull the latest version of GUFI to get all of the features mentioned in the guide.
from gufi.