gather's Introduction

nbgather: 🧽✨ Spit shine for Jupyter notebooks

Tools for cleaning code, recovering lost code, and comparing versions of code in Jupyter Lab.

Download the alpha extension with the following command:

jupyter labextension install nbgather

Then you can clean and compare versions of your code like so:

Code gathering tools can help you clean your code and review versions of results.

Want to try it out first? Play around with nbgather on an example notebook on BinderHub.

Did the install fail? Make sure Jupyter Lab is up-to-date, and that you are running Jupyter Lab from Python 3.

This project is in alpha: The code it collects will sometimes be more than you want. It saves a history of all the code you've executed, and the outputs that code produces, to the notebook's metadata. The user interface has a few quirks.

Help us make this a real, practical, and really useful tool. We welcome any and all feedback and contributions. We are particularly in need of the opinions and efforts of those with a penchant for hacking code analysis.

Usage Tips

Can it extract more precise slices of code? Yes. First submit a pull request telling us the desired extraction behavior, so we can incorporate this behavior into the tool.

Meanwhile, you can help the backend make more precise slices by telling the tool which functions don't modify their arguments. By default, the tool assumes that functions change all arguments they're called with, and the objects they're called on, with exceptions for some common APIs. To edit the slicing rules, open the Advanced Settings Editor in the Jupyter Lab Settings menu and choose the "nbgather" tab. In your user-defined settings, override moduleMap, following this format to specify which functions don't modify their arguments.

How do I clear the notebook's history? Open up your .ipynb file in a text editor, find the history key in the top-level metadata object, and set history to [].
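The manual edit described above can also be scripted. The following is a minimal sketch, assuming only what the tip states: that the history lives under a `history` key in the notebook's top-level `metadata` object. The function name `clear_gather_history` is made up for this example and is not part of nbgather.

```python
import json


def clear_gather_history(notebook_path):
    """Reset the top-level metadata "history" key of a notebook to [].

    Returns True if a history key was found and cleared, False otherwise.
    (Hypothetical helper; nbgather does not ship this function.)
    """
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)

    metadata = notebook.get("metadata", {})
    if "history" not in metadata:
        # Nothing to clear; leave the file untouched.
        return False

    metadata["history"] = []
    with open(notebook_path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)
        f.write("\n")
    return True
```

Close the notebook in Jupyter Lab before running this, and consider keeping a backup copy of the .ipynb file, since the script rewrites it in place.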

Contributing

To run the development version of nbgather, run:

git clone <this-repository-url>  # clone the repository
npm install                      # download dependencies
jupyter labextension link .      # install this package in Jupyter Lab
npm run watch                    # automatically recompile source code
jupyter lab --watch              # launch Jupyter Lab, automatically re-load extension

This requires npm version 4 or later, and was tested most recently with Node v9.5.0.

Submit all changes as a pull request. Feel free to contact the lead contributor (Andrew Head, [email protected]) if you have any questions about getting started with the code or about features or updates you'd like to contribute.

Also, make sure to format the code and test it before submitting a pull request, as described below:

Formatting the code

Before submitting a pull request with changed code, format the code files by running npm run format:all.

Testing the code

To run the tests from the command line, call:

npm run test

The first time you run tests, they will take about a minute to finish. The second time, and all subsequent times, the tests will take only a few seconds. The first test run takes longer because the Jest test runner transpiles dependencies like the '@jupyterlab' libraries into a dialect of JavaScript it expects before running the tests.

Troubleshooting

Here are some tips for dealing with build errors we've encountered while developing code gathering tools:

  • Errors about missing semicolons in React types files: upgrade the typescript and ts-node packages
  • Conflicting dependencies: upgrade either the Python Jupyter Lab package (you may need to upgrade to Python 3 to get the most recent version of Jupyter Lab) or the Jupyter Lab npm packages
  • Other build issues: we've found some issues can be solved by just deleting your node_modules/ directory and reinstalling it.

gather's People

Contributors

andrewhead, choldgraf, jasonsjiang, joyceerhl, microsoft-github-policy-service[bot], microsoftopensource, msftgits, rdeline, rdrahul

gather's Issues

Docs: Add getting-started docs

Is your feature request related to a problem? Please describe.
Have enough documentation that someone can figure out:

  • why they should use this tool
  • how to install it
  • an easy case where they can try it out
  • how to use each of the major functions (including gather to notebook, gather to cells, and gather to script)
  • a real-world case where they can see it working for a complex problem
  • a sense of how it works (so people can get a better conceptual model for what will happen when they gather code in complex notebooks)

Describe the solution you'd like
We should have a README that describes the purpose of the tool, installation instructions, GIFs of using the major functions and of a more complex scenario, and a brief write-up about the implementation of the tool.

Describe alternatives you've considered
If pressed for time, we could link to the video figure and paper. We may also want to put this information on a github.io page or a project webpage instead of in the README.

Additional context
N/A

Notebook instead of lab

This extension looks pretty good and will be very useful in many cases. But is it possible to use it directly with the plain Jupyter Notebook instead of Jupyter Lab?

Gather generates invalid Python code in simple scenario.

To repro with gather extension:

  1. Open a notebook with these two cells:
#%%
A = [0,1,2,3]
B = [4,5,6,7]
sum = 0
diff_sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]
    diff_sum += A[i] - B[i]

#%%
print(sum)
  2. Execute the first cell twice.
  3. Execute the second cell.
  4. Gather the second cell to a new notebook.

Expected: 2 cells

#%%
A = [0,1,2,3]
B = [4,5,6,7]
sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]

#%%
print(sum)

Actual: 3 cells

#%%
A = [0,1,2,3]
B = [4,5,6,7]
for i in range(min(len(A), len(B))):

#%%
A = [0,1,2,3]
B = [4,5,6,7]
sum = 0
for i in range(min(len(A), len(B))):
    sum += A[i] + B[i]

#%%
print(sum)

The first cell is invalid Python code and shouldn't be included in the final gather.
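One possible safeguard, separate from fixing the slicer itself, is to check each gathered cell for syntactic validity before emitting it. The sketch below uses Python's standard `ast` module. The helper names are hypothetical, and this is not how nbgather (which is written in TypeScript) handles the problem.

```python
import ast


def is_valid_python(source):
    """Return True if the cell text parses as Python, False otherwise."""
    try:
        ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


def drop_invalid_cells(cells):
    """Keep only the gathered cells that parse (hypothetical post-filter)."""
    return [cell for cell in cells if is_valid_python(cell)]


gathered = [
    "A = [0,1,2,3]\nB = [4,5,6,7]\nfor i in range(min(len(A), len(B))):",
    "A = [0,1,2,3]\nB = [4,5,6,7]\nsum = 0\n"
    "for i in range(min(len(A), len(B))):\n    sum += A[i] + B[i]",
    "print(sum)",
]
cleaned = drop_invalid_cells(gathered)
```

Applied to the three cells from the "Actual" output, this filter drops the first cell and keeps the other two. Note that a plain `ast.parse` check would also reject cells containing IPython magics, so a real filter would need to account for those.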

Discuss binder integration

Moving the Binder conversation here so that we're not just chatting in a closed PR :-)

re: repository size, it's OK to include extra cruft up to a certain point. If it's "can I include a bunch of notebooks that might be big-ish" that's totally fine. If it's "I want to include a 500MB dataset in my repository" that might be more of a performance hit. The biggest time-sink in Binder is the time it takes to "Docker pull" onto a node when somebody launches a repository for the first time on that node. After that, Binder uses the docker cache within the same node and it is much faster.

re: this repository or another repository, it can be whatever you'd like! Many people just use whichever repository they use for their documentation.

Function and class definition textSliceLines are incorrectly computed

Describe the bug
Originally reported in #23 for classes, but predates #28
The line after a class or function definition is included twice in the gathered program.

Here's an example illustrating the problem for funcdefs after executing the following program as a single cell in Jupyter:

def foo():
    print("Hello") 
def bar(): 
    foo()
bar() 

The gathered program looks like this:

def foo():
    print("Hello")
def bar():
def bar():
    foo()
bar()
bar()

Here's another example illustrating the problem for classes:

class Foo():
    class Bar():
        pass
v=1
Foo().Bar()

The gathered program looks like this:

class Foo():
    class Bar():
        pass
v=1
v=1
Foo().Bar()

Additional information
I'm somewhat out of my depth with debugging the parser, but it seems like the problem lies with the ILocation object that the parser returns, since for class and funcdefs, loc.last_column === 0 and loc.last_line is the true last line incremented by one. textSliceLines is then incorrectly computed based on the values of last_column and last_line in cellSlice.ts--in this case, lines 3 and 5 are included twice. I've added unit tests that illustrate this behavior here.

If I'm not mistaken, this means a fix will involve modifying the parser to return an ILocation object whose last_column and last_line properties have accounted for the dedent. Alternatively, the cellSlice.getTextSlice logic could be updated.
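The effect described above can be modelled in a few lines. The following sketch is illustrative only: `Location` is a simplified stand-in for the parser's ILocation (1-based lines), the location values are assumed from the description in this issue, and `text_slice_lines` here is a toy function rather than the code in cellSlice.ts. It shows the second option, compensating for the dedent when computing lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Simplified stand-in for the parser's ILocation (1-based lines)."""
    first_line: int
    last_line: int
    last_column: int


def spanned_lines(loc, adjust_for_dedent):
    """Line numbers a location covers, optionally undoing the dedent overshoot."""
    last = loc.last_line
    # A block that "ends" at column 0 of a later line really ended on the
    # previous line; the parser only saw the dedent on the following line.
    if adjust_for_dedent and loc.last_column == 0 and last > loc.first_line:
        last -= 1
    return range(loc.first_line, last + 1)


def text_slice_lines(source, locations, adjust_for_dedent=True):
    """Concatenate the source lines covered by each location, in order."""
    lines = source.split("\n")
    out = []
    for loc in locations:
        out.extend(lines[n - 1] for n in spanned_lines(loc, adjust_for_dedent))
    return out


program = 'def foo():\n    print("Hello")\ndef bar():\n    foo()\nbar()'
# Assumed parser output: both funcdefs overshoot by one line with last_column 0.
reported = [Location(1, 3, 0), Location(3, 5, 0), Location(5, 5, 5)]

buggy = text_slice_lines(program, reported, adjust_for_dedent=False)
fixed = text_slice_lines(program, reported, adjust_for_dedent=True)
```

Without the adjustment, `buggy` repeats lines 3 and 5, as in the gathered program shown above. With it, `fixed` is the original five lines.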

Gather API

I'm not clear as to how gather actually works, whether it does a static analysis of a notebook, or whether it looks at cell execution / state history and then tries to identify the executed cells that influenced the execution of a given cell.

If it's a static analysis, given a notebook ipynb file, and an identified code cell, whether "code cell 5 of 10" or a code cell identified by a particular cell tag, it would be useful to be able to call something like gather("mynotebook.ipynb", code_cell=5) or gather("mynotebook.ipynb", code_id="final_chart") and return gather code cells with the identified code cell as the last cell.
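To make the request concrete, here is a rough sketch of what a purely static, cell-level `gather` could look like. It is an assumption-laden illustration, not nbgather's algorithm or API: it takes a list of cell sources rather than an .ipynb path, uses a 0-based `code_cell` index, ignores execution history, and treats a whole cell as needed if it defines any name that a later kept cell uses.

```python
import ast


def names_defined_and_used(code):
    """Return (defined, used) name sets for one cell, found by walking its AST."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.alias):
            # "import a.b" binds "a"; "import a as x" binds "x".
            defined.add((node.asname or node.name).split(".")[0])
    return defined, used


def gather(cells, code_cell):
    """Return the cells needed to run cells[code_cell], ending with that cell.

    Hypothetical sketch: `cells` is a list of source strings in notebook order.
    """
    needed = set(names_defined_and_used(cells[code_cell])[1])
    keep = [code_cell]
    # Walk backwards, keeping any cell that defines a name we still need.
    for i in range(code_cell - 1, -1, -1):
        defined, used = names_defined_and_used(cells[i])
        if defined & needed:
            keep.append(i)
            needed |= used
    return [cells[i] for i in sorted(keep)]


cells = ["import re", "a = 1", "b = 2", "def f():\n    return a + 1", "print(f())"]
sliced = gather(cells, 4)
```

Here `sliced` keeps the `a = 1` cell, the definition of `f`, and the final `print(f())` cell, and leaves out the unused import and `b = 2`. A tool that also accounts for cells executed out of order or re-executed would need the execution log, which a static pass like this cannot see.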

Parser does not correctly set this.indents = [0] when this.indents is undefined

Describe the bug
In python3.jison on line 69, this.indents is incorrectly set. This corresponds to line 1064 in the generated python3.js file.

if (this.indents == undefined) this.indents == [0]; // should be this.indents = [0]

This causes the parser to throw a TypeError: Cannot read property 'length' of undefined at Object.anonymous [as performAction] on the next line when the parser attempts to access this.indents.length:

if (this.indents.length > 1) {
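For readers unfamiliar with why the stack has to be initialized, here is a small Python sketch of the usual indent-stack technique that such a lexer relies on. It is a generic illustration, assuming space-only indentation and no validation of inconsistent dedents. It is not a port of python3.jison.

```python
def indent_tokens(lines):
    """Emit INDENT/DEDENT/LINE tokens using a stack of indentation widths."""
    indents = [0]  # must start as [0]; every comparison below reads the top
    tokens = []
    for line in lines:
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > indents[-1]:
            indents.append(width)
            tokens.append("INDENT")
        while width < indents[-1]:
            indents.pop()
            tokens.append("DEDENT")
        tokens.append("LINE")
    # At end of input, close any blocks that are still open.
    while len(indents) > 1:
        indents.pop()
        tokens.append("DEDENT")
    return tokens


tokens = indent_tokens(["def foo():", "    print('Hello')", "bar()"])
```

The final loop plays the same role as the `this.indents.length > 1` check quoted above: if the stack was never assigned, there is nothing to take the length of.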

Slicing: Fails to include import for symbol used in method

Describe the bug

When a symbol is imported then used within a method, the import is not included even if the method is included in the slice. Originally reported by @barik.

To Reproduce

Here's an example of the part of a notebook where this bug occurs.

image

Expected behavior

import re should also be highlighted.

Desktop (please complete the following information):

  • OS: OSX Mojave
  • Browser: Chrome 73.0.3683.103

Idea: support gathering functions at point of definition

Can functions be gathered from the point they are defined?

In the following example, f cannot be gathered by clicking on the function name or on the cell in which it is defined.

a = 1
# new cell
def f():
    return a+1

However, if I now assign f to a variable, that variable can be gathered, which collects f and a as well, as expected.

a = 1
# new cell
def f():
    return a+1
# new cell
b = f

Desired behavior is that after a function definition is evaluated, the function name (i.e., f) is highlighted indicating it is available for gathering.

Dependencies of lambda functions with >0 args are not gathered

Describe the bug
UPDATE: After experimenting further it seems that program slicing fails specifically on a lambda function which accepts at least 1 argument.

To Reproduce
Steps to reproduce the behavior:

  1. Build and run Gather extension.
  2. Upload this notebook to Jupyter.
  3. Execute all cells.
  4. Try gathering the definitions and outputs for each case.

Expected Behavior
Dependencies of lambda functions should be gathered.

Screenshots
lambdas

Icons for gathering not appearing in version browser

Describe the bug

The icons for "Open in notebook" and "Copy to clipboard" in the version browser are not appearing.

image

To Reproduce

Click on a variable or output, then click on "Gather to... Version Browser". You should be able to see that the icons are missing on the gather buttons.

Expected behavior

The icons should appear on the buttons, like so (screenshot taken from the version of gather currently hosted on BinderHub):

image

Additional context

This problem was probably introduced when gather started using Jupyter Lab libraries version >1.0 and the paths to the Jupyter icons changed. I already fixed the paths for the icons for the buttons in the Jupyter notebook toolbar in this commit, so we probably just need to point the CSS classes of the buttons in the version browser interface to the classes defined in that commit.

Cleanup: Remove Jupyter notebook install option and source code

It's likely that installing the Jupyter notebook extension will be a pain for those trying out the demo, given that it will only work in a narrow band of Jupyter notebook versions. We may want to remove the Jupyter notebook extension as an option from the README, remove the nb directory of code, and only let people install the Jupyter Lab version.

If this is the case, we should leave a note in the README referring readers to a previous version of the project that included the Jupyter notebook implementation.

Make gathering conservative by default

Is your feature request related to a problem? Please describe.

Currently, the slicer assumes that methods don't modify their arguments. While this assumption is often correct, sometimes it's not. And when methods do modify their arguments, the gathered notebook will be missing these methods, and hence code needed to reproduce a result.

Describe the solution you'd like

Basically, more accurate slicing, that's more likely to gather code that might not be needed than to leave it out.

For the exact implementation, I suggest modifying the slicer to assume that:

  1. Methods change their arguments, unless otherwise noted
  2. Methods change the objects they're called on, unless otherwise noted

And providing an easy way for users to specify when methods don't modify their arguments. For example, they could provide a lightweight configuration file that looks like:

[
  {
    "obj-name": "m",
    "function-name": "fit",
    "does-not-modify": ["OBJECT"]
  }, {
    "function-name": "clean_data",
    "does-not-modify": [0, "auxiliary_data"]
  }
]

That is, a user could identify function calls that don't modify their arguments by the function-name, optionally by the obj-name (the name of the object the function is called on), and by a list of what the function does not modify. Each entry in that list can be the object the function was called on ("OBJECT"), a positional argument (e.g., 0 for the first argument), or a keyword argument (e.g., an argument named auxiliary_data).

The user could specify these rules of which methods don't modify their arguments in a Jupyter Lab setting editor. This could be populated with some defaults (e.g., some common Pandas data frame methods like df.head() and df.describe())
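As a rough illustration of how a slicer might consult rules in the format proposed above, the sketch below starts from the conservative default (everything is assumed modified) and subtracts whatever a matching rule declares untouched. The function `assumed_modified` and its parameters are hypothetical, and the matching semantics (exact match on function-name, and on obj-name when given) are an assumption.

```python
import json

# Rules in the format proposed in this issue.
RULES = json.loads("""
[
  {"obj-name": "m", "function-name": "fit", "does-not-modify": ["OBJECT"]},
  {"function-name": "clean_data", "does-not-modify": [0, "auxiliary_data"]}
]
""")


def assumed_modified(rules, function_name, obj_name=None, n_positional=0, keywords=()):
    """Return the targets a call is assumed to modify (hypothetical helper).

    Targets are "OBJECT", positional indices (ints), or keyword names (strs).
    """
    # Conservative default: every argument, and the receiver, may be modified.
    targets = set(range(n_positional)) | set(keywords)
    if obj_name is not None:
        targets.add("OBJECT")

    for rule in rules:
        if rule["function-name"] != function_name:
            continue
        if "obj-name" in rule and rule["obj-name"] != obj_name:
            continue
        targets -= set(rule["does-not-modify"])
    return targets


# m.fit(X, y): the rule says the receiver m is untouched, X and y still count.
fit_targets = assumed_modified(RULES, "fit", obj_name="m", n_positional=2)
```

With these rules, `m.fit(X, y)` is still assumed to modify both positional arguments but not `m`, while a call to `fit` on any other object falls back to the conservative default.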

Describe alternatives you've considered

The slicer could be improved to infer when functions modify their arguments. This would take some engineering effort that's not currently available.

The current implementation of the tool assumes that methods don't modify their arguments. I worry that this might make the tool unusable, as by default a lot of relevant code might be missing from slices.

Style: Use a formatting standard

Currently the TypeScript doesn't have a style guide, much less a consistent style. Using one could make it easier for contributors to know what style to use when contributing code. Consider using standard.

jupyterlab-extension fails due to HTTP Error 500

Describe the bug
Once I install this extension, the jupyterlab manager doesn't work:
image
The problem disappears when the ext is uninstalled.

To Reproduce
Steps to reproduce the behavior:

  1. Install nbgather by jupyter labextension install nbgather

Screenshots
Seen above

Desktop (please complete the following information):

  • OS: Red Hat Enterprise Linux Server release 7.6
  • Browser: Chrome
  • Version [e.g. 22]

** Jupyter
jupyter core : 4.5.0
jupyter-notebook : 6.0.0
qtconsole : not installed
ipython : 7.6.1
ipykernel : 5.1.1
jupyter client : 5.3.1
jupyter lab : 1.0.2
nbconvert : 5.5.0
ipywidgets : 7.5.0
nbformat : 4.4.0
traitlets : 4.3.2

Additional context
I recall seeing this buggy behavior with another extension, "autoversion". I didn't test it.

JupyterLab 2.2.0 not supported

My JupyterLab is version 2.2.0, the latest one.

I run the command

jupyter labextension install nbgather

The error looks like

An error occured.
ValueError: The extension "nbgather" does not yet support the current version of JupyterLab.

Conflicting Dependencies:
JupyterLab              Extension      Package
>=2.2.0 <2.3.0          >=1.2.0 <2.0.0 @jupyterlab/application
>=2.2.0 <2.3.0          >=1.2.0 <2.0.0 @jupyterlab/apputils
>=4.2.0 <4.3.0          >=3.2.0 <4.0.0 @jupyterlab/coreutils
>=2.2.0 <2.3.0          >=1.2.0 <2.0.0 @jupyterlab/docmanager
>=2.2.0 <2.3.0          >=1.2.0 <2.0.0 @jupyterlab/fileeditor
>=2.2.0 <2.3.0          >=1.2.1 <2.0.0 @jupyterlab/notebook
>=2.2.0 <2.3.0          >=1.2.0 <2.0.0 @jupyterlab/rendermime

Add a "Clear History" action

Is your feature request related to a problem? Please describe.
A user might want to clear the history of a notebook, e.g., if they executed a cell with some sensitive data that they don't want to have stored to the notebook file.

Describe the solution you'd like
Add an action to the interface that lets someone "Clear History". This would then reset the execution history log, and make sure that any history metadata saved with the notebook is emptied.

Describe alternatives you've considered
Currently, an analyst could open up the ipynb file on their own, and delete the metadata that includes the execution history.

Additional context
One additional benefit of this is reducing storage space for notebooks with very long histories.

Bump to support Jupyter Lab API 1.0

When setting up a development environment for nbgather, linking the extension fails because the Jupyter Lab package versions in package.json are not compatible with the version of Jupyter Lab that is now downloaded by pip by default.

The project should be refactored to use APIs from the Jupyter Lab project version >=1.0.

The extension "nbgather" does not yet support the current version of JupyterLab (for jupyterlab 3.0.14)

Describe the bug
The extension "nbgather" does not yet support the current version of JupyterLab for
jupyterlab 3.0.14 pyhd8ed1ab_0 conda-forge

To Reproduce
Run in a terminal:
jupyter labextension install nbgather

Expected behavior
how to install this package

Screenshots
$ jupyter labextension install nbgather
An error occured.
ValueError: The extension "nbgather" does not yet support the current version of JupyterLab.

Conflicting Dependencies:
JupyterLab              Extension      Package
>=3.0.9 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/application
>=3.0.7 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/apputils
>=3.0.7 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/codeeditor
>=3.0.7 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/codemirror
>=5.0.5 <5.1.0          >=3.2.0 <4.0.0 @jupyterlab/coreutils
>=3.0.9 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/docmanager
>=3.0.9 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/fileeditor
>=3.0.9 <3.1.0          >=1.2.1 <2.0.0 @jupyterlab/notebook
>=3.0.8 <3.1.0          >=1.2.0 <2.0.0 @jupyterlab/rendermime
See the log file for details: /tmp/jupyterlab-debug-fyx19196.log

Desktop (please complete the following information):

  • OS:
    NAME="Red Hat Enterprise Linux Server"
    VERSION="7.9 (Maipo)"
    ID="rhel"
    ID_LIKE="fedora"

Change Variable Highlighting Color

The light blue and pink highlighting that Gather uses look great in the default grey+white Jupyter Lab theme, but highlighting REALLY highlights when you are in Dark mode or the Material Darker theme. Is there a way to tone things down that I am missing? Thanks!

Dependencies of class properties are not gathered

Describe the bug
When a function references dependencies of class properties set either in the class constructor or class functions, those dependencies are not gathered.

To Reproduce
Steps to reproduce the behavior:

  1. Build and run Gather extension.
  2. Upload this notebook to Jupyter.
  3. Execute all cells.
  4. Try gathering the outputs for each case.

Expected behavior
Dependencies like variable declarations and module imports should be gathered.

Screenshots
gather_setstate_deps

Lightweight API for adding new `gather` commands

In Issue #16, @micahjsmith proposed a new type of gather functionality. nbgather could be made more extensible to let others build new plugins like that one, which do new things with the gathered cells.

I imagine the interface for adding would involve a registerGatherCommand function, where the caller provides a command title, an icon, and a callback that will get triggered and provided with the gathered cells.

Let's use this issue to track interest in having an API for registering new gather commands, and design discussions about what that API would look like.
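As a starting point for that discussion, here is one shape such a registry could take. It is sketched in Python for brevity, although the extension itself is TypeScript. Everything here is hypothetical: `register_gather_command`, `run_gather_command`, the icon string, and the choice to pass gathered cells as a list of source strings are illustrative assumptions, not an existing nbgather API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GatherCommand:
    """A user-registered action: a title, an icon name, and a callback."""
    title: str
    icon: str
    callback: Callable[[List[str]], object]


_commands: Dict[str, GatherCommand] = {}


def register_gather_command(title, icon, callback):
    """Register a new gather command under a unique title."""
    if title in _commands:
        raise ValueError(f"gather command already registered: {title!r}")
    _commands[title] = GatherCommand(title, icon, callback)


def run_gather_command(title, gathered_cells):
    """Invoke the callback for `title` with a copy of the gathered cells."""
    return _commands[title].callback(list(gathered_cells))


# Example plugin: join the gathered cells into one script string.
register_gather_command(
    "Gather to string", "hypothetical-icon-name", lambda cells: "\n\n".join(cells)
)
script = run_gather_command("Gather to string", ["a = 1", "print(a)"])
```

Keying commands by title keeps the sketch short. A real design would likely use a stable command ID and decide how commands appear in the toolbar.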

Turn off gather-from-revisions by default

Is your feature request related to a problem? Please describe.
If we let people explore the versions of slices that produced results by default, the tool may be unusable for long sessions. This is because outputs can take up a lot of memory and, when the notebook is saved, storage. To enable people to explore versions of slices with their output, we need to keep every version of every output produced.

Describe the solution you'd like
In the settings menu for the plugin, include a configuration option for turning on the "Gather from Revisions" action. Have a note next to the option that tells the user that if the feature is enabled, all outputs ever created will be saved to a notebook, which has the potential for increasing the size of the notebook many times. Add documentation for how to do this to the README.

Describe alternatives you've considered
Leave Gather to Revisions in by default. That said, our usability study suggested the tool is most useful for slicing and gathering code that was executed out of order, not gathering from revisions. I also feel that this feature, while exciting, hasn't yet reached the form where it will be obviously useful and easy-to-use in common cases where people want to use it. So for this tool to be useful, perhaps the feature of gathering from revisions should be opt-in only.

Additional context
N/A

Add optional, opt-in logging

Describe the solution you'd like
It would be great if we could collect usage data from people who are using code gathering tools in their work. This could help us decide on future improvements to this tool, and yield data that could be shared with other researchers and tool builders to help them build better notebook tools.

Data collection would be strictly opt-in, and would be off by default.

Additional context
To do this, we would need the following:

  • Fill out relevant institutional paperwork
  • A non-intrusive, easy-to-dismiss popup that appears when someone uses the tool, that asks if folks want to share their usage data with us, with a description of the data that will be collected, anonymization of the data, and research goals
  • Persistent storage of a user's choice. Use this choice to decide whether to report log events
  • A server (e.g., notebooks.berkeley.edu) that receives log events over a secure connection

Some of the events we might want to collect are:

  • when code gathering events get invoked, how large are the slices? Where in the notebook do they come from? Are these slices opened in notebooks, scripts, or as cells in existing notebooks?
