
Comments (14)

joeyh commented on May 31, 2024

From my own past investigations in the 5-10 million range, git itself becomes quite slow with this many files, as any operation that changes the index causes it to write out a whole new index file, which is quite large when the index has to track so many files.

git-annex's use of its journal is partly to work around exactly that scalability problem of git; it allows it to build up a pile of changes which can then all be staged into the index in one go later. I'd be surprised if the necessary locking to allow for parallel operations on the journal was a major bottleneck. I guess that between 3 and 5 syscalls per file could be eliminated if the journal was left locked for the duration of the git annex add.

As to git-ls-files --others speed, some tests here comparing it with find -type f, it seems to be about half as fast. It has a good reason to look for .git subdirectories, as it should not traverse into nested repositories. I don't understand why it then goes on to look for .git/HEAD and .git/packed-refs, which can't be there when .git isn't. I suspect it could be sped up by 3 syscalls per directory traversed.
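If anyone wants to reproduce that comparison, a rough sketch (the repository path is hypothetical; assumes GNU find and strace are available):

# Compare traversal speed of git's untracked-file listing against plain find,
# then count syscalls to see where the extra per-directory work goes.
cd /path/to/big-worktree
time git ls-files --others --exclude-standard > /dev/null
time find . -path ./.git -prune -o -type f -print > /dev/null
strace -c -f git ls-files --others > /dev/null
strace -c -f find . -type f > /dev/null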


RichiH commented on May 31, 2024

This should probably be tagged git-annex.


yarikoptic commented on May 31, 2024

ok -- I have experimented a bit on the "simulated" HCP 500 subjects release dataset to make it feasible to import it into git/git-annex (although I haven't added any url per file yet; that will be the next simulation). To overcome the problem @joeyh described, I generated a git/git-annex repository per subject (which was easier on git), and then merged them all together to generate a monolithic beast.

It is on smaug:/mnt/scrap/datalad/mimic_repos/hcp-ls.20141020-500subjects/__all__ (the separate ones are upstairs). The longest command to run I believe was 'git annex merge', but overall it finished in a reasonable amount of time (under 48 h I believe ;) forgot to time it). Nevertheless it might be worthwhile looking into profiling annex merge (it was heavy on CPU, delegating from time to time to git commands) to see if it could be sped up a bit.
The resultant size of .git is 4GB... quite impressive given that it stores nothing besides the file listing and checksums, I guess. Possibly there is some overhead from still having the original remotes linked (rsyncing into a separate one now to explore ways to shrink). But considered in the light of the original storage required for this dataset (>12TB IIRC) it is small; nevertheless it forbids "lightweight" cloning for the sake of exploring just a small sub-set of files.

One note for the record: the git merge octopus strategy (used when merging all those multiple remotes at once) requires the presence of a common base commit. To not mess with grafts, I just created an initial empty commit in the target repository first.
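For the record, a rough sketch of that merge setup (heavily hedged: directory and remote names here are made up, this is not the exact command sequence used, and on newer git an octopus merge of unrelated branches may additionally need --allow-unrelated-histories):

# Target repository that will hold the monolithic merge.
git init __all__ && cd __all__
git annex init

# Empty initial commit so the octopus merge has something to sit on
# (avoids playing with grafts).
git commit --allow-empty -m "initial empty commit"

# Hook up each per-subject repository and fetch it.
git remote add sub100307 /path/to/100307    # ...repeated per subject
git fetch --all

# Octopus merge of all the subject branches, then merge the
# git-annex branches (the location-tracking metadata) as well.
git merge sub100307/master sub100408/master    # ...and so on
git annex merge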


yarikoptic commented on May 31, 2024

@joeyh Ignorant me wondered (just thinking out loud) how much space could have been saved if annex's meta-information storage in ascii files was

  1. using 1 level of redirection for uuids (yes, merges might require renumbering in case of conflicts etc)
  2. using binary files (timestamps would be much more concise ;))
  3. using a lightweight DB (sqlite?) per each style of file (e.g. log) which would join all the entries, thus minimizing the impact of having a large number of keyfiles

just wondered if such a game might be worth the candle, possibly resulting in a faster annex due to a smaller footprint?


joeyh commented on May 31, 2024

Yaroslav Halchenko wrote:

@joeyh Ignorant me wondered (just thinking out loud) how much space could have
been saved if annex's meta-information storage in ascii files was

  1. using 1 level of redirection for uuids (yes, merges might require renumbering in case of conflicts etc)

Yeah, it doesn't seem practical for a fully distributed system. Or at
least very complicated.

I have not measured it, but my assumption is that git's pack file
compression avoids the space overhead of uuids once the objects reach a
pack file.

  2. using binary files (timestamps would be much more concise ;))

I have thought about moving away from timestamps to vector clocks,
presumably based on the depth since the first commit on the git-annex
branch, but I don't understand vector clocks well enough yet.

  3. using a lightweight DB (sqlite?) per each style of file (e.g. log) which would join all the entries, thus minimizing the impact of having a large number of keyfiles

I don't think that checking a sqlite database into git is going to be a
win. You quickly get right back into "large files in git are bad"
territory.

see shy jo


joeyh commented on May 31, 2024

Nice approach to distribute the add like that.

Yaroslav Halchenko wrote:

Nevertheless it might be worthwhile looking into profiling annex merge (it was heavy on CPU, delegating from time to time to git commands) to see if it could be sped up a bit.

AFAIK, the only git command that git annex merge runs repeatedly is
git hash-object, which is run once per new object that is created by the
merge. So that could be sped up by either having git-annex generate the
git loose object file itself, or adding a hash-object --batch interface
to git.

However, the merge only needs to generate new objects when the two
branches have different versions of a file. I would not expect that to
happen a lot in your situation, unless lots of files with the same
checksum ended up being added in your separate repositories.

There's also the issue that git annex merge is currently using the String
data type throughout. Converting to a more efficient data type like
ByteString might speed it up significantly.

see shy jo


yarikoptic commented on May 31, 2024

re 3.: sure thing, we don't want anything non-git-friendly under git ;) it was more of a question: what is the current overhead compared to keeping the meta-information in a "proper DB"?
I also wonder whether the impact could be reduced if the number of nested directories (currently based on the first 6 hex digits of a checksum of the key) were reduced. With my use-case of around 6 mil files, I have around 4.7 mil of those directories at the 2nd level, with collisions of up to 6 .log files in a directory. I guess all that hierarchy was created for faster lookups of the corresponding meta-information files, but such a split (4096, 4096, up to 6) seems a bit suboptimal in terms of storage (and thus also performance, since it requires caching more directories/files) unless there are around a billion files (when things would probably get out of hand anyway ;) ). Maybe 4096/256 (i.e. using only 5 leading hash hexes, 3 in the leading directory and 2 in the subsequent one), generating up to 1,048,576 subdirectories (instead of the current 16,777,216), would be more sensible for typical loads, at the cost of more collisions when repositories have those hundreds of thousands of files. Again -- just thinking out loud ;-)
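A quick way to eyeball that collision distribution on the git-annex branch (a sketch; it assumes the usual xxx/yyy/KEY.log layout of per-key log files):

# For each second-level hash directory on the git-annex branch,
# count how many .log files it holds, and show the worst collisions.
git ls-tree -r --name-only git-annex \
    | awk -F/ '/\.log$/ && NF == 3 {print $1 "/" $2}' \
    | sort | uniq -c | sort -rn | head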


joeyh commented on May 31, 2024

Yaroslav Halchenko wrote:

re 3.: sure thing, we don't want anything non-git-friendly under git ;) it was more of a question: what is the current overhead compared to keeping the meta-information in a "proper DB"?
I also wonder whether the impact could be reduced if the number of nested directories (currently based on the first 6 hex digits of a checksum of the key) were reduced. With my use-case of around 6 mil files, I have around 4.7 mil of those directories at the 2nd level, with collisions of up to 6 .log files in a directory. I guess all that hierarchy was created for faster lookups of the corresponding meta-information files, but such a split (4096, 4096, up to 6) seems a bit suboptimal in terms of storage (and thus also performance, since it requires caching more directories/files) unless there are around a billion files (when things would probably get out of hand anyway ;) ). Maybe 4096/256 (i.e. using only 5 leading hash hexes, 3 in the leading directory and 2 in the subsequent one), generating up to 1,048,576 subdirectories (instead of the current 16,777,216), would be more sensible for typical loads, at the cost of more collisions when repositories have those hundreds of thousands of files. Again -- just thinking out loud ;-)

Are you talking about the hash directories in .git/annex/objects, or the
hash directories in the git-annex branch? The two have rather different
performance implications.

see shy jo


yarikoptic commented on May 31, 2024

apparently github decided not to post the email reply I sent days back, so here it comes: sorry for not being clear -- I was looking at the git-annex branch ATM and its impact on the size of the git repository itself, without any annexed file content being present.


joeyh commented on May 31, 2024

As far as the git-annex branch goes, the most space-efficient strategy is
probably to keep the size of the git tree objects that need to be changed
as small as possible. Each change to a large tree object creates a new
large-ish tree object that needs to be stored.

If delta compression works well, related versions of tree objects will
be delta compressed and not waste space. I don't know how reliably or how
well delta compression works on tree objects. Probably it works great, like
it does for source code, etc. Anyway, let's temporarily disregard it.

Changing xxx/yyy/zzz.log creates 1 new blob object, and 3 new tree objects.
How large are these 3 tree objects?

The average tree sizes, in number of entries, as a function of the number of annexed objects (reposize), are:

root tree: 4096 (unless repo is very small)
first level hash trees: reposize / 4096
second level hash trees: reposize / 4096 / 4096

total: 4096 + reposize / 4096 + reposize / 4096 / 4096

Each object sha stored in a tree takes approx 42 bytes. (40 byte sha1,
plus IIRC a few other bits.)

Let's calculate the bytes needed to store the changed tree objects,
when changing a single log file, for a range of repository sizes:

Prelude> let reposizes = [100000, 1000000, 10000000, 100000000] :: [Double]
Prelude> let treebytes n = (n * 42) / 1024
Prelude> let sz reposize = 4096 + reposize / 4096 + reposize / 4096 / 4096
Prelude> mapM_ print $ map (\n -> (n, sz n, treebytes (sz n))) reposizes
(100000.0,4120.4200229644775,169.0016025044024)
(1000000.0,4340.200229644775,178.016025044024)
(1.0e7,6538.002296447754,268.1602504402399)
(1.0e8,28516.02296447754,1169.602504402399)

So, for a repo with 10 million annexed objects, changing one log file will
require changing 3 tree objects and one blob object. And the 3 tree
objects, in total, will point to 6538 objects. And so will need 269
kb to store. And even with a repo of just 100k annexed objects, 169 kb
is needed to change that one log file. Now we hope delta compression works..

You instead suggest 4096 toplevel and then 256 second level.

Prelude> let sz reposize = 4096 + reposize / 4096 + reposize / 4096 / 256
Prelude> mapM_ print $ map (\n -> (n, sz n, treebytes (sz n))) reposizes
(100000.0,4120.509429931641,169.00526958703995)
(1000000.0,4341.094299316406,178.05269587039948)
(1.0e7,6546.9429931640625,268.52695870399475)
(1.0e8,28605.429931640625,1173.2695870399475)

There is very little difference between the two hashing strategies.

Let's instead consider getting rid of the second hash directory entirely:

Prelude> let sz reposize = 4096 + reposize / 4096
Prelude> mapM_ print $ map (\n -> (n, sz n, treebytes (sz n))) reposizes
(100000.0,4120.4140625,169.00135803222656)
(1000000.0,4340.140625,178.01358032226563)
(1.0e7,6537.40625,268.13580322265625)
(1.0e8,28510.0625,1169.3580322265625)

Space-wise, this behaves better than either other strategy, although
only marginally. Up to 100 million annexed files, the space used by all
three is so close to the same as to not matter.

This also avoids an extra tree object lookup when traversing to a log file.

If a lot of log files are being accessed, we can assume git (or the kernel)
will quickly cache the root tree object, and all 4096 first-level hash
directories.

The probability a second-level hash directory will be a cache hit depends
on how many items are in it; more is better.

current hashing

Prelude> let sndlevelsz reposize = reposize / 4096 / 4096
Prelude> mapM_ print $ map (\n -> (n, sndlevelsz n)) reposizes
(100000.0,5.9604644775390625e-3)
(1000000.0,5.9604644775390625e-2)
(1.0e7,0.5960464477539063)
(1.0e8,5.9604644775390625)

only 256 second level directories variant

Prelude> let sndlevelsz reposize = reposize / 4096 / 256
Prelude> mapM_ print $ map (\n -> (n, sndlevelsz n)) reposizes
(100000.0,9.5367431640625e-2)
(1000000.0,0.95367431640625)
(1.0e7,9.5367431640625)
(1.0e8,95.367431640625)

My guess from this is that 256 second-level directories would be at best marginally
more efficient, due to fewer cache misses. But still, with up to 1 million annexed
files, these second-level directories each have on average only about 1 item in them,
so the cache for them will always be cold.

So, eliminating the second-level hash directories would keep the size
the same, while probably improving time efficiency by some amount. It
seems like a win. However, I don't know how to realize these benefits
without sacrificing backwards compatibility. And, I'm supposed to be
looking at size, not speed...

Earlier I swept delta compression of tree objects under the rug. Let's
actually measure it! I'll do so by creating a git repository containing
4096 different files.

joey@darkstar:/tmp/new>for x in $(seq 1 4096); do echo $x ;echo $x > $x; done
joey@darkstar:/tmp/new>git add *
joey@darkstar:/tmp/new>git commit -m a
joey@darkstar:/tmp/new>du -hs .git/objects/
18M .git/objects/
joey@darkstar:/tmp/new>git gc --aggressive
joey@darkstar:/tmp/new>du -hs .git/objects/
280K .git/objects/

So far, we have 1 tree object with 4096 files. Let's make 4095 more, all
the same size, while adding no new file objects, by copying each of files
2..4096 over file 1 in turn and committing after each copy.

joey@darkstar:/tmp/new>for x in $(seq 2 4096); do cp $x 1; git commit -a -q -m a; done
joey@darkstar:/tmp/new>du -hs .git/objects/
418M .git/objects/
joey@darkstar:/tmp/new>git gc --aggressive
joey@darkstar:/tmp/new>du .git/objects/ -hs
1.6M .git/objects/

So, we have here 4096 tree objects, each pointing to 4096 files.
I think each tree object is around 168 kb in size, so 672 mb of trees
in all, although git manages to store them in less space
even as loose objects (probably because git compresses loose objects
too). And, it all packs down to 1.6 mb.

So, delta compression of tree objects works pretty great. Although this was
probably a best case for delta compressing tree objects; all the related
tree objects were available for packing together without other unrelated
objects to get in the way.

see shy jo


joeyh commented on May 31, 2024

Yaroslav Halchenko wrote:

It is on smaug:/mnt/scrap/datalad/mimic_repos/hcp-ls.20141020-500subjects/__all__
Resultant size of .git is 4GB...

Actually, .git/objects is "only" 2.6 gb. The rest is .git/index (1 gb) and
.git/annex/index (nearly as large), which would not impact cloning.
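That breakdown can be read straight off with du (a trivial sketch; standard git-annex repository layout assumed):

# Where the 4GB goes: packed objects vs. git's index vs. git-annex's own index.
du -sh .git/objects .git/index .git/annex/index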

see shy jo


yarikoptic commented on May 31, 2024

lovely exploration!!! thanks, even though I am yet to fully grasp it all and please feel free to tell me to stop ;-)...

So, eliminating the second-level hash directories would keep the size
the same, while probably improving time efficiency by some amount. It
seems like a win. However, I don't know how to realize these benefits
without sacrificing backwards compatibility. And, I'm supposed to be
looking at size, not speed...

if we are to look at the size... I wondered: what if, instead of abc/def/SHA256-123456.log, it was abc/def.log with all the keys logged in the same file? Theoretically I see an advantage for tree size/lookups (per your analysis, and the overall tree would be smaller since there are fewer files), and a disadvantage for storage (blobs), since an additional column with the key would need to be added to the .log (unless some header with line numbers is introduced), for performance, since a lookup within the file would be necessary, and for implementation (the custom merge facility would need adjusting). But given that collisions while using 6 hex digits are not very likely (I saw up to 6 in my experiment), I expect performance, even with the lookups, to be not that bad, while we would make even more use of git's ability to produce nice deltas on file modifications. Or am I missing some cornerstone(s)?

.git/objects is "only" 2.6 gb

is it possible to guesstimate somehow the breakdown there between how much is used by tree objects, blobs, and meta-information on commits/merges...? maybe it would then become feasible to assess the effects of my silly suggestion above (or maybe I should just simulate it as well...)

BTW -- I thought to quickly check how much is used by trees/blobs in master -vs- the git-annex branch but got confused -- any idea why .git/objects actually grew after I removed the master branch, leaving only git-annex, and ran gc?

(git)smaug:/tmp/100307.cloned-addurl/100307.cloned-nomaster[git-annex]git

$> git reflog  # is empty

$> git br -a
* git-annex

$> git gc --aggressive
Counting objects: 131283, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (123662/123662), done.
Writing objects: 100% (131283/131283), done.
Total 131283 (delta 47368), reused 83898 (delta 0)
Checking connectivity: 131283, done.

$> du -sck .git/objects                
15856   .git/objects
15856   total

$> git br -D master
Deleted branch master (was da447c1).

$> git gc --aggressive 
Counting objects: 119652, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (112058/112058), done.
Writing objects: 100% (119652/119652), done.
Total 119652 (delta 47366), reused 72265 (delta 0)
Checking connectivity: 119652, done.

$> du -sck .git/objects
61568   .git/objects
61568   total

N.B. I might have underused git gc --aggressive and just relied too much on git repack -a -d -f --window=100... so those 2.6gb sizes might shrink even more I guess


joeyh commented on May 31, 2024

Yaroslav Halchenko wrote:

lovely exploration!!! thanks, even though I am yet to fully grasp it all and
please feel free to tell me to stop ;-)...

Well, it occurred to me later that I left something out. I was doing an
analysis based on incrementally building a repository by adding/changing
files, i.e. the way git is typically used, which is what git-annex needs
to optimise for. But your use case is more like making 1 commit with a
million files in it. The numbers are different in that case.

So, if we just have 1 commit with everything in it, what are the sizes of the
committed tree objects in that case? If I'm calculating this correctly, the
math does something nice:

root tree: 4096 (unless repo is very small)
first level hash trees: reposize / 4096 * 4096
second level hash trees: reposize / 4096 / 4096 * 4096 * 4096

Eg, each 1st level hash tree has size reposize / 4096, and there are
4096 of them, and so the combined size of all of them is reposize.
And the same for second level.

Prelude> let reposizes = [100000, 1000000, 10000000, 100000000] :: [Double]
Prelude> let treebytes n = (n * 42) / 1024
Prelude> let sz reposize = 4096 + reposize * 2
Prelude> mapM_ print $ map (\n -> (n, sz n, treebytes (sz n))) reposizes
(100000.0,204096.0,8371.125)
(1000000.0,2004096.0,82199.25)
(1.0e7,2.0004096e7,820480.5)
(1.0e8,2.00004096e8,8203293.0)

So, in the 10 million annexed objects added in a single commit case,
the tree objects will take up around 801 mb (uncompressed).

And it doesn't matter how many hash buckets there are, the size is the same.

What if we get rid of the second level hash directory?

root tree: 4096 (unless repo is very small)
first level hash trees: reposize / 4096 * 4096

So that halves the space needed by the tree objects.

But, I don't know how to get rid of those second level directories w/o
sacrificing backwards compatibility or interoperability. I thought about
having a flag file in the git-annex branch that told git-annex not to use
those directories. But then merging two repositories, one with the
flag file and one without, would create a combined git-annex branch that
sometimes uses the second-level directories and sometimes not. A real mess.

[ Incidentally, my tree object sizes are slightly wrong because tree
objects also include filenames. That just adds 3 bytes to the trees that
contain the xxx hash subdirectories. The leaf trees, that contain the log
files, of course contain the full git-annex key, which is much larger. ]

if we are to look at the size... I wondered: what if, instead of
abc/def/SHA256-123456.log, it was abc/def.log with all the keys logged in the
same file

Up to a million annexed objects, the files would tend to be < 10000 lines
each (depending on the number of repositories). So it is scalable
memory/cpu wise. And I believe it will overall save space.

is it possible to guesstimate somehow the breakdown there between how much is
used by tree objects, blobs, and meta-information on commits/merges...? maybe
it would then become feasible to assess the effects of my silly suggestion
above (or maybe I should just simulate it as well...)

Use git-verify-pack -v, passing it the .git/objects/pack/*.pack file.
This will produce a list of the objects in the pack; you can grep for "tree" etc.

The next field after the object type is the object size, in bytes,
and the following field is the deltified size (the size in the pack).
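Concretely, something like this summarises a pack by object type using those fields (a sketch; the pack filename will differ):

# Field 2 is the object type, field 3 the uncompressed size,
# field 4 the size in the pack (after delta/zlib compression).
git verify-pack -v .git/objects/pack/pack-*.pack \
    | awk '$2 ~ /^(blob|tree|commit|tag)$/ {
               n[$2]++; sz[$2] += $3; packed[$2] += $4
           }
           END {
               for (t in n)
                   printf "%-6s %9d objects %14d bytes %14d in pack\n",
                          t, n[t], sz[t], packed[t]
           }'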

BTW -- I thought to quickly check how much is used by trees/blobs in master
-vs- the git-annex branch but got confused -- any idea why .git/objects actually
grew after I removed the master branch, leaving only git-annex, and ran gc?

N.B. I might have underused git gc --aggressive and just relied too much on git
repack -a -d -f --window=100... so those 2.6gb sizes might shrink even more I
guess

That, or maybe it created a new pack file and left the old enormous
pack as-is?
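One quick way to check that hypothesis (standard git, nothing repo-specific):

# More than one pack file lying around would explain the growth.
ls -lh .git/objects/pack/
git count-objects -v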

I also think that you could reduce the repo size further if you
squashed the history down to one commit, on both the git-annex branch
and on master. And then repacked it with optimal window size options.
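A hedged sketch of that squash-and-repack idea (it rewrites history, so try it on a throwaway clone first; the window size is just an example, and master could be squashed the same way):

# Replace the git-annex branch with a single commit holding its current tree.
squashed=$(git commit-tree -m 'squashed git-annex branch' 'git-annex^{tree}')
git update-ref refs/heads/git-annex "$squashed"

# Drop references to the old history and repack with a generous delta window.
git reflog expire --expire=now --all
git repack -a -d -f --window=250
git prune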

Also, you could use the SHA1 backend, rather than SHA256. This will save
something like 40 to 80 bytes per annexed file, so could easily save
40-80 mb or so if you have a million files.
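For completeness, one way to select the SHA1 backend for newly added files is via .gitattributes (a sketch; files already added would keep their SHA256 keys unless migrated with git annex migrate):

# Use the SHA1 backend for everything added from now on.
echo '* annex.backend=SHA1' >> .gitattributes
git add .gitattributes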

see shy jo


yarikoptic commented on May 31, 2024

I do not think we can do anything about this issue of git being hindered in repos with a huge number of files, so I will close this for now.

