Comments (18)
How many refs in the repo?
from klaus.
> How many refs in the repo?
Not a single one as a regular file in the local testing repository, the one on the server has the master ref as a regular file, the rest are packed refs (207329 lines in that file).
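A raw line count of packed-refs can overstate the number of refs, because the file also contains a header comment and "^" lines holding the peeled targets of annotated tags. A minimal sketch that counts only the ref entries (the function name and the sample data are made up for illustration):

```python
import io


def count_packed_refs(lines):
    """Count ref entries in the contents of a packed-refs file."""
    count = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines, the "# pack-refs with: ..." header,
        # and "^<sha>" lines (peeled targets of annotated tags).
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        count += 1
    return count


# Made-up sample standing in for a real packed-refs file.
sample = io.StringIO(
    "# pack-refs with: peeled fully-peeled sorted\n"
    "1111111111111111111111111111111111111111 refs/heads/master\n"
    "2222222222222222222222222222222222222222 refs/tags/v1.0\n"
    "^3333333333333333333333333333333333333333\n"
)
print(count_packed_refs(sample))  # prints 2
```

On a real repository the same function could be fed an open file handle for the packed-refs file in the git directory.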
from klaus.
> How many refs in the repo?
> Not a single one as a regular file in the local testing repository, the one on the server has the master ref as a regular file, the rest are packed refs (207329 lines in that file).
The sheer number of refs is the problem here; it means 200k random object accesses.
from klaus.
What can we do to make it faster?
from klaus.
When I initially built Klaus, my use case was to have an easy way to browse commits locally, essentially a better UI for git log + git diff. I didn’t expect people to do crazy stuff like hosting thousands of repos (@jelmer :)) or repos with hundreds of thousands of commits.
from klaus.
> How many refs in the repo?
> Not a single one as a regular file in the local testing repository, the one on the server has the master ref as a regular file, the rest are packed refs (207329 lines in that file).
> The sheer number of refs is the problem here; it means 200k random object accesses.
The thing is, the site for the project itself loads reasonably well (or at least as well as I'd expect with that repository).
It's only the index page which has abysmal load times for me.
Without knowing the code I'd assume that the project page itself does a little more lifting compared to the index page which effectively retrieves the time of the last commit on the default branch(?) and the description.
It's really just that difference in load times that makes me think there is something happening in the code for the index that does more than needs to be done.
from klaus.
I started reading through the code just now and it struck me that "index" is used differently in the code than I used it in this issue; the page that is very slow is specifically the repo_list.
To be precise, I've pulled the whole thing into a venv and am fiddling with it now. The slow part is this line (when ordering is set to last_updated):
Line 79 in a6d322a
The load time of the repo list can be mitigated entirely for my use case by not querying last_updated for all refs, just for the default one (which IMHO is good enough for the repo list).
I.e. replacing the following code with [ b'HEAD' ], although the default_branch method that I've seen in some other issues should probably be used instead (though PR № 308 would make them equal anyway):
Line 64 in a6d322a
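To illustrate the difference being described, here is a minimal sketch that uses plain dictionaries as stand-ins for the repository. All names are hypothetical and this is not klaus's or dulwich's actual code. It counts how many object lookups each strategy needs:

```python
def last_updated_all_refs(refs, commit_time):
    """Newest commit time across every ref: one object lookup per ref."""
    return max((commit_time(oid) for oid in refs.values()), default=None)


def last_updated_head_only(refs, commit_time):
    """Commit time of HEAD only: a single object lookup."""
    oid = refs.get(b"HEAD")
    return commit_time(oid) if oid is not None else None


# Hypothetical stand-ins: ref name -> object id, object id -> commit time.
refs = {
    b"HEAD": "c3",
    b"refs/heads/main": "c3",
    b"refs/pull/1/head": "c1",
    b"refs/pull/2/head": "c2",
}
times = {"c1": 100, "c2": 200, "c3": 300}

lookups = []


def commit_time(oid):
    lookups.append(oid)  # record each simulated object access
    return times[oid]


print(last_updated_all_refs(refs, commit_time), len(lookups))   # prints: 300 4
lookups.clear()
print(last_updated_head_only(refs, commit_time), len(lookups))  # prints: 300 1
```

With roughly 200k refs the first variant would perform roughly 200k lookups, while the second stays at one. The trade-off is that the HEAD-only variant misses updates on any other branch.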
from klaus.
I wonder if this is related to determining the timestamp of the latest change to the repo. Maybe related to #248
from klaus.
If that’s indeed the case, the page should load much faster if you order by name instead of by last update.
from klaus.
> If that’s indeed the case, the page should load much faster if you order by name instead of by last update.
I tested that; however, it then hangs while rendering the template for the repo list, because the template still displays the timestamp. Ordering by name just moves the "lazily parse all refs of the entire repository" work into the template.
from klaus.
Do you want to try to hotfix the code so that it doesn’t look up any timestamps?
from klaus.
In the process of debugging this I made get_last_updated_at() (or whatever it's called) return 0, which worked nicely. Currently I'm running this patch, which only looks up the HEAD ref and works, as noted earlier:
From 10f646fb1e38eb1e4469398915a8e3010ddb07c6 Mon Sep 17 00:00:00 2001
From: benaryorg <[email protected]>
Date: Sun, 2 Apr 2023 10:33:45 +0000
Subject: [PATCH] retrieve only HEAD for last updated of repo

Signed-off-by: benaryorg <[email protected]>
---
 klaus/repo.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/klaus/repo.py b/klaus/repo.py
index 5033607..590dc6d 100644
--- a/klaus/repo.py
+++ b/klaus/repo.py
@@ -61,7 +61,7 @@ class FancyRepo(dulwich.repo.Repo):
         """Get datetime of last commit to this repository."""
         # Cache result to speed up repo_list.html template.
         # If self.get_refs() has changed, we should invalidate the cache.
-        all_refs = self.get_refs()
+        all_refs = [ b'HEAD', ]
         return cached_call(
             key=(id(self), "get_last_updated_at"),
             validator=all_refs,
-- 
2.39.2
The patch is far from production quality and I'm not sure about the implications.
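One possible implication, assuming cached_call recomputes only when the validator differs from the one stored with the cached value (the context comment in the patch suggests this, but it is not confirmed here): a constant validator such as [ b'HEAD', ] would never invalidate the cache. The sketch below is a hypothetical re-implementation of such a validator-based cache, not klaus's actual code:

```python
_NOTSET = object()
_cache = {}


def simple_cached_call(key, validator, producer):
    """Return the cached value for key, recomputing only if validator changed."""
    data, old_validator = _cache.get(key, (None, _NOTSET))
    if old_validator != validator:
        data = producer()
        _cache[key] = (data, validator)
    return data


# Each producer call yields the next fake "last updated" timestamp.
clock = iter([100, 200, 300])

# Validator derived from the refs: it changes when a ref moves,
# so the cached value is refreshed.
a = simple_cached_call("repo", {b"HEAD": "c1"}, lambda: next(clock))
b = simple_cached_call("repo", {b"HEAD": "c2"}, lambda: next(clock))

# Constant validator: it never differs from the stored one,
# so the first value is returned forever.
c = simple_cached_call("fixed", [b"HEAD"], lambda: next(clock))
d = simple_cached_call("fixed", [b"HEAD"], lambda: next(clock))

print(a, b, c, d)  # prints: 100 200 300 300
```

If that assumption holds, using the object id that HEAD currently resolves to as the validator would keep the lookup cheap while still refreshing the timestamp after a push.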
from klaus.
Maybe we can have a rule that stops looking up the timestamps of all refs and just uses HEAD for the timestamp if a repo has more than N refs.
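A minimal sketch of such a rule, with a made-up threshold and a plain dict standing in for the repository's refs (the function name and the default of 1000 are assumptions, not anything klaus defines):

```python
def refs_to_check(refs, max_refs=1000):
    """Return the ref names whose timestamps should be looked up.

    refs is a mapping of ref name -> object id. Above max_refs the
    function gives up on scanning everything and returns HEAD only.
    """
    if len(refs) > max_refs:
        return [b"HEAD"] if b"HEAD" in refs else []
    return list(refs)


small = {b"HEAD": "c2", b"refs/heads/main": "c2", b"refs/tags/v1": "c1"}
big = {b"HEAD": "c2"}
big.update({b"refs/pull/%d/head" % i: "c1" for i in range(10)})

print(refs_to_check(small, max_refs=5))  # all three refs
print(refs_to_check(big, max_refs=5))    # prints: [b'HEAD']
```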
from klaus.
Or something like: we stop looking at all refs and instead look at a hardcoded list of typical master branch names and the top K tags determined by sorting by name.
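As a rough sketch of that idea: the branch list and the value of K below are placeholders chosen for illustration, and sorting tags by name is purely lexicographic, so for example v1.10 sorts before v1.9.

```python
# Hypothetical list of typical default branch names.
DEFAULT_BRANCHES = [
    b"refs/heads/master",
    b"refs/heads/main",
    b"refs/heads/trunk",
    b"refs/heads/develop",
]


def candidate_refs(ref_names, top_k_tags=3):
    """Pick well-known branches plus the top K tags sorted by name."""
    names = set(ref_names)
    branches = [b for b in DEFAULT_BRANCHES if b in names]
    tags = sorted(
        (n for n in names if n.startswith(b"refs/tags/")), reverse=True
    )
    return branches + tags[:top_k_tags]


names = [
    b"refs/heads/main",
    b"refs/heads/feature",
    b"refs/tags/v1.0",
    b"refs/tags/v1.1",
    b"refs/tags/v1.2",
    b"refs/tags/v2.0",
    b"refs/pull/7/head",
]
print(candidate_refs(names, top_k_tags=2))
# prints: [b'refs/heads/main', b'refs/tags/v2.0', b'refs/tags/v1.2']
```

This bounds the number of lookups at len(DEFAULT_BRANCHES) + K regardless of how many refs the repository has.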
from klaus.
FWIW GitHub also seems to just give up beyond a certain number of refs; it just displays a handful for that repository.
from klaus.
Ah, so one caveat I've discovered so far happens when the default branch isn't set up 100% correctly.
I use gitolite under the hood, and with both the create-on-clone and the create-on-push methods the repository on the server side gets initialised with the git defaults (or configured settings), making HEAD point to refs/heads/master in this case.
Since all of my default branches are main, klaus fails to resolve HEAD when retrieving it and instead displays the repository as "no commits yet", making the link unclickable (opening it manually works just fine, though).
I should probably be using gitolite's symbolic-ref non-core command for that, but it might affect existing users, so I'd just like to throw that in here.
(This is something to be aware of in #308 at least.)
So personally I think a combination of the two approaches would be great: check HEAD first and, if that doesn't resolve to any commits, fall back to a list of usual default branch names.
That's just my opinion though.
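The combined approach could look roughly like the sketch below. The fallback list and function name are hypothetical, and the set of existing refs stands in for whatever the repository actually reports:

```python
# Hypothetical fallback order for when HEAD points at a missing branch.
FALLBACK_BRANCHES = (b"refs/heads/main", b"refs/heads/master")


def resolve_default_ref(head_target, existing_refs):
    """Return the ref to use for the 'last updated' lookup, or None.

    head_target is the ref HEAD symbolically points to; existing_refs
    is the collection of ref names that actually exist in the repo.
    """
    if head_target in existing_refs:
        return head_target
    for name in FALLBACK_BRANCHES:
        if name in existing_refs:
            return name
    return None


# HEAD points at master (the git default), but only main was ever pushed.
existing = {b"refs/heads/main", b"refs/tags/v1.0"}
print(resolve_default_ref(b"refs/heads/master", existing))
# prints: b'refs/heads/main'
```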
Edit: the symbolic-ref solution is ~fine~ except for the caching of the last_updated timestamp, presumably because it parses all refs.
If this issue is at some point satisfactorily fixed, maybe the caching can be removed (if it only has to look at about ten files and parse part of a pack instead of a whole lot more).
from klaus.
One thing I realised is that --mirror seems to download all the refs/pull/ refs as well. Are they useful for your use case?
from klaus.
> One thing I realised is that --mirror seems to download all the refs/pull/ refs as well. Are they useful for your use case?
Since I am specifically mirroring the repository, yes.
Edit: I realised that's a little short. What I am doing is keeping the history of the repository in case something upstream changes, whether that is a compromised ecosystem or just GitHub having issues, so I can use everything, including the PRs to keep things.
from klaus.