Giter Club home page Giter Club logo

Comments (13)

jankatins avatar jankatins commented on August 26, 2024

I'm not sure how the prompt (of your shell?) can be affected by nbstripoutput, but if you want fast, try jq: http://janschulz.github.io/windows-dev-environment.html -> "proper diffs and commits for notebooks" (sorry, my homepage seems to not put in anchors in headlines :-( )

from nbstripout.

michaelaye avatar michaelaye commented on August 26, 2024

oh, that's because my prompt (zsh->garrett) includes git status information, and the status is defined by git getting active, using the nbstripout filter to define the status for each notebook.

from nbstripout.

jankatins avatar jankatins commented on August 26, 2024

Huih, I've currently 38 ipynb in a folder with a jq based stripoutput filter (*.ipynb filter=stripoutput) and it's fast in cmder (which also tells me the current git status). in my case git status is also fast (instant?). cmder uses git status --porcelain 2>nul to check if the dir is dirty...

from nbstripout.

michaelaye avatar michaelaye commented on August 26, 2024

are u aware that git apparently is able to cache the status somehow? It's only slow for me for the first time entering the folder after I-dont-know-what-interval-or-action. After that it's seemless as well. 66 notebooks here though. ;)

from nbstripout.

jankatins avatar jankatins commented on August 26, 2024

Thats sounds more like a filesystem cache (the git status manpage doesn't say anything about a cache for status information)? Do you have a ssd or hdd?

from nbstripout.

michaelaye avatar michaelaye commented on August 26, 2024

ssd.

from nbstripout.

kynan avatar kynan commented on August 26, 2024

@michaelaye Do you have core.preloadindex enabled? That might be it:

Enable parallel index preload for operations like git diff

This can speed up operations like git diff and git status especially on filesystems like NFS that have weak caching semantics and thus
relatively high IO latencies. With this set to true, Git will do the index comparison to the filesystem data in parallel, allowing overlapping
IO’s.

@JanSchulz Thanks for the pointer to jq. Not sure I'd want to introduce such a heavy dependency, but will have a play with it.

from nbstripout.

stas00 avatar stas00 commented on August 26, 2024

FYI, I uploaded two versions of much faster nbstripout-like tools. They focus on being fast git strip out filters, without much else.

  1. nbstripout-fast - a pure python json version (it also supports rapidjson, but I found it to be slower)
  2. nbstripout-jq - jq version with a thin /bin/sh wrapper

As of this writing nbstripout-fast and nbstripout-jq are respectively 13 and 9 times faster than nbstripout with the notebooks I tested them with.

They aren't exactly the same as nbstripout, so you may need to apply a few minor tweaks if you need the exact behavior.

These were written for fastai - we ended up using the pure python version as it was the fastest.

Enjoy.

from nbstripout.

kynan avatar kynan commented on August 26, 2024

Thanks, @stas00! Feel free to send a PR for the rapidjson variant. I'd also be open to add the nbstripout-jq as a separate script.

from nbstripout.

stas00 avatar stas00 commented on August 26, 2024

@kynan, the python rewrite is very different internally - so PR won't be possible. It doesn't use any of the nbconvert/nbformat stuff. it's faster than rapidjson or jq. On the other hand it only does the stripping out (no installation tools, validation, etc.).

I first tried to get nbstripout to use rapidjson or ujson, but neither has an identical API to json. So that didn't work.

Given that most likely the strip out has to be as fast as possible for using git with many notebooks, i'd say the original nbstripout needs to be split into 2 pieces - one optimized for speed of the critical path of doing the strip out and another for installation/configuration/validation/etc. If you do that then you can just use nbstripout-fast as is - with a few tweaks to support keep_outputs and and a few other flags. And leave the rest of the features in the other script.

Also, nbstripout-fast is already written in the whitelist format as you suggested here, except the fields are hardcoded and will probably need to be configurable to make it generic enough for anybody to use without changes to the source code.

And yes, feel free to incorporate nbstripout-jq in your repo. it too will need some tweaks to be identical to nbstripout. We dropped the jq version since pure python json rewrite nbstripout-fast is faster.

p.s. if you decide to integrate/adopt the code the -d option keeps outputs and also preserves a few other metadata fields that we use for documentation notebooks, which differs from code notebooks. Obviously this won't be useful for a general user.

from nbstripout.

mlucool avatar mlucool commented on August 26, 2024

Hi,

This is a really excellent project, but like above I found it had too much overhead on larger repos. I wrote a pure rust version (with python bindings so it can be pip installed) located at https://github.com/deshaw/nbstripout-fast (happed to chose the same name as @stas00). My testing shows this is ~200x faster. Really intersted to hear your thoughts @kynan. This is not a true a 1:1 match, but all key features should be included and a few more added.

Is there a way in which this project could let users choose to use the rust version? Your install/setups scripts are great and clearly this project is very very popular. I think some sort of linkage there would be a net postive for the community.

from nbstripout.

kynan avatar kynan commented on August 26, 2024

This is a true a 1:1 match, but all key features should be included and a few more added.

Do you mean this is not a true 1:1 match?

Is there a way in which this project could let users choose to use the rust version?

In theory yes, however doesn't it feel a little odd to have one tool install another?

Can you give more detail on what exactly you're thinking of? How do you see nbstripouts installation work when choosing nbstripout-fast ? I also think this discussion should be moved to a separate issue.

from nbstripout.

mlucool avatar mlucool commented on August 26, 2024

Do you mean this is not a true 1:1 match?

Yes - sorry!

I also think this discussion should be moved to a separate issue.

Opened #179

from nbstripout.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.