It takes twice as long as a plain listing of present submodules.

Investigate performance of `next-status` with many subdatasets (datalad-next, CLOSED)

mih commented on September 3, 2024

Investigate performance of `next-status` with many subdatasets

from datalad-next.

Comments (3)

mih commented on September 3, 2024

Essentially all of the runtime comes from `_eval_submodule()`:

Test patch

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..6fe6e71 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -289,7 +289,7 @@ def _yield_repo_items(
             # TODO others?
         )
         # TODO possibly trim eval_submodule_state
-        _eval_submodule(path, item, eval_submodule_state)
+        #_eval_submodule(path, item, eval_submodule_state)
         if item.status:
             yield item

# with patch
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  0.84s user 0.13s system 101% cpu 0.962 total

# without the patch
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  95.98s user 26.99s system 110% cpu 1:51.23 total
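
The experiment above locates the hotspot by commenting out one call and timing the command again. A profiler is another way to reach the same kind of conclusion without editing the code. The following is only a minimal sketch using the standard-library `cProfile`; `cheap_step`, `expensive_step`, and `run_status` are hypothetical stand-ins and not part of datalad-next.

```python
import cProfile
import pstats
import time


def cheap_step() -> None:
    """Hypothetical stand-in for the inexpensive part of a status run."""
    sum(range(1000))


def expensive_step() -> None:
    """Hypothetical stand-in for a costly per-item call (e.g. a subprocess)."""
    time.sleep(0.01)


def run_status(n_items: int = 20) -> None:
    """Process n_items, calling both steps once per item."""
    for _ in range(n_items):
        cheap_step()
        expensive_step()


def cumulative_times(func) -> dict:
    """Profile func() and map each function name to its cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    return {name: ct for (_, _, name), (_, _, _, ct, _) in stats.stats.items()}


if __name__ == '__main__':
    times = cumulative_times(run_status)
    for name in ('cheap_step', 'expensive_step'):
        print(f'{name}: {times[name]:.4f}s cumulative')
```

Cumulative time is used here because the cost of a wrapper such as `expensive_step` sits in the calls it makes, not in its own body.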

mih commented on September 3, 2024

The culprit is how late we detect that a submodule is absent: only after `git rev-parse` has already been run for it.

The following patch tried swapping out the `iter_subproc()`-based `iter_git_subproc()` for a simpler `subprocess.run()`-style call (`call_git_lines()`). The difference was marginal, which is a big compliment to `iter_subproc()`!

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..9f0f7a8 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -13,6 +13,7 @@ from typing import Generator
 
 from datalad_next.runners import (
     CommandError,
+    call_git_lines,
     iter_git_subproc,
 )
 from datalad_next.itertools import (
@@ -414,18 +415,16 @@ def _get_submod_worktree_head(path: Path) -> tuple[bool, str | None, bool]:
         # its basis. it is not meaningful to track the managed branch in
         # a superdataset
         HEAD = corresponding_head
-    with iter_git_subproc(
-        ['rev-parse', '--path-format=relative',
-         '--show-toplevel', HEAD],
+    res = call_git_lines(
+        ['rev-parse', '--path-format=relative', '--show-toplevel', HEAD],
         cwd=path,
-    ) as r:
-        res = tuple(decode_bytes(itemize(r, sep=None, keep_ends=False)))
-        assert len(res) == 2
-        if res[0].startswith('..'):
-            # this is not a report on a submodule at this location
-            return False, None, adjusted
-        else:
-            return True, res[1], adjusted
+    )
+    assert len(res) == 2
+    if res[0].startswith('..'):
+        # this is not a report on a submodule at this location
+        return False, None, adjusted
+    else:
+        return True, res[1], adjusted
 
 
 def _eval_submodule(basepath, item, eval_mode) -> None:
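
For readers who want to try the check from this diff outside of datalad-next, here is a rough sketch. It assumes a reasonably recent Git that supports `--path-format=relative`, and it uses plain `subprocess.run()` instead of the datalad-next helpers. The names `interpret_rev_parse` and `probe_with_git` are made up for this illustration, and the sample values in the usage section are illustrative, not captured Git output.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def interpret_rev_parse(lines: list[str]) -> tuple[bool, str | None]:
    """Apply the same test as the patch to the two reported lines.

    The first line is the repository toplevel relative to the probed
    directory, the second is the resolved commit.
    """
    if len(lines) != 2:
        raise ValueError(f'expected 2 lines, got {len(lines)}')
    toplevel, commit = lines
    if toplevel.startswith('..'):
        # the toplevel is above the probed path, so the answer came from
        # an enclosing repository, not from a submodule at this location
        return False, None
    return True, commit


def probe_with_git(path: Path, head: str = 'HEAD') -> tuple[bool, str | None]:
    """Run the rev-parse call from the diff in `path` (requires git)."""
    proc = subprocess.run(
        ['git', 'rev-parse', '--path-format=relative',
         '--show-toplevel', head],
        cwd=path,
        capture_output=True,
        text=True,
        check=True,
    )
    return interpret_rev_parse(proc.stdout.splitlines())


if __name__ == '__main__':
    # illustrative inputs only
    print(interpret_rev_parse(['../..', 'abc123']))
    print(interpret_rev_parse(['./', 'abc123']))
```

Keeping the interpretation separate from the subprocess call makes the decision logic testable without a Git repository. Note that each probe still costs one process launch per candidate path.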

mih commented on September 3, 2024

Here is the patch. It returns early from `_eval_submodule()` when there is no `.git` at the item's path:

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..5e4a980 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -437,6 +436,14 @@ def _eval_submodule(basepath, item, eval_mode) -> None:
         return
 
     item_path = basepath / item.path
+
+    # this is the cheapest test for the theoretical chance that a submodule
+    # is present at `item_path`. This is beneficial even when we would only
+    # run a single call to `git rev-parse`
+    # https://github.com/datalad/datalad-next/issues/606
+    if not (item_path / '.git').exists():
+        return
+
     # get head commit, and whether a submodule is actually present,
     # and/or in adjusted mode
     subds_present, head_commit, adjusted = _get_submod_worktree_head(item_path)

❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  1.17s user 0.21s system 100% cpu 1.372 total

An 80x speedup for this extreme use case (1:51.23 down to 1.372 s total).
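
The idea behind the patch can be tried in isolation: a filesystem check for `.git` filters out paths that cannot hold a repository before any subprocess is started. This is only a sketch under simplified assumptions; `may_contain_submodule`, `candidates`, and `demo` are hypothetical names, and the directory layout is created in a temporary directory for demonstration.

```python
import tempfile
from pathlib import Path


def may_contain_submodule(item_path: Path) -> bool:
    """Cheap pre-check: a checked-out submodule has a `.git` entry.

    `.git` can be a directory or a file pointing to the real git
    directory, so `exists()` is used rather than `is_dir()`.
    """
    return (item_path / '.git').exists()


def candidates(basepath: Path, relpaths: list) -> list:
    """Keep only the paths worth an expensive per-submodule inspection."""
    return [p for p in relpaths if may_contain_submodule(basepath / p)]


def demo() -> list:
    """Build a small layout and report which paths pass the pre-check."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # `.git` as a directory
        (base / 'present_dir' / '.git').mkdir(parents=True)
        # `.git` as a file, the usual form inside a submodule worktree
        (base / 'present_file').mkdir()
        (base / 'present_file' / '.git').write_text(
            'gitdir: ../.git/modules/present_file\n')
        # an empty mount point for a submodule that is not checked out
        (base / 'absent').mkdir()
        return candidates(
            base, ['present_dir', 'present_file', 'absent', 'missing'])


if __name__ == '__main__':
    print(demo())
```

A positive result only means a repository may be present. The full inspection still has to confirm it, which is why the patch keeps the call to `_get_submod_worktree_head()` after the early return.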
