
labmlai / labml


πŸ”Ž Monitor deep learning model training and hardware usage from your mobile phone πŸ“±

Home Page: https://labml.ai

License: MIT License

Python 7.34% Makefile 0.07% Jupyter Notebook 87.05% Shell 0.07% TypeScript 4.86% HTML 0.05% SCSS 0.39% JavaScript 0.17% Jinja 0.01% Cython 0.01%
machine-learning deep-learning pytorch experiment analytics visualization tensorboard mobile keras tensorflow

labml's People

Contributors

adrien1018, dn6, fabvio, hnipun, hnipuncodify, lakshith-403, nmasnadithya, vpj


labml's Issues

Remove git commits/branches check

Hello there!

First of all, thanks for your library; I used it in my recent open source project!

Now, I want to share my criticism.

I have another project where we named our repo's remotes differently from the defaults: bars and upstream, so there is no origin.

So I had this error:

.labml.yml:

check_repo_dirty: false
experiments_path: '.labml'
web_api: 'secret'

Error:

Traceback (most recent call last):
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/src/train.py", line 395, in <module>
    train()
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/src/train.py", line 354, in train
    with experiment.record(name=MODEL_SAVE_NAME, exp_conf=args.__dict__) if args.labml else ExitStack():
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/experiment.py", line 388, in record
    create(name=name,
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/experiment.py", line 86, in create
    _create_experiment(uuid=uuid,
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/internal/experiment/__init__.py", line 511, in create_experiment
    _internal = Experiment(uuid=uuid,
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/internal/experiment/__init__.py", line 225, in __init__
    self.run.repo_remotes = list(repo.remote().urls)
    self.run.repo_remotes = list(repo.remote().urls)
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/remote.py", line 553, in urls
    raise ex
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/remote.py", line 529, in urls
    remote_details = self.repo.git.remote("get-url", "--all", self.name)
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 545, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 1011, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 828, in execute
    raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git remote get-url --all origin
  stderr: 'fatal: No such remote 'origin''

So my question №1 is: why does labml check git branches/remotes/commits? What is the idea behind this logic? I don't think a training-monitoring library for ML projects needs to do that. If a developer or data scientist wants to track git and prevent training because of uncommitted changes, they can implement that logic themselves.

Question №2: if I set check_repo_dirty: false, why does labml still check the repo? And what is the default value of this parameter?

Two possible suggestions:

  1. Put a condition before the try in labml/internal/experiment/__init__.py to skip all of this git code:
    if self.check_repo_dirty:
        try:
            repo = git.Repo(lab_singleton().path)
            self.run.repo_remotes = list(repo.remote().urls)
            self.run.commit = repo.head.commit.hexsha
            self.run.commit_message = repo.head.commit.message.strip()
            self.run.is_dirty = repo.is_dirty()
            self.run.diff = repo.git.diff()
        except git.InvalidGitRepositoryError:
            if not is_colab() and not is_kaggle():
                labml_notice(["Not a valid git repository: ",
                              (str(lab_singleton().path), Text.value)])
            self.run.commit = 'unknown'
            self.run.commit_message = ''
            self.run.is_dirty = True
            self.run.diff = ''
  2. Or remove all of this git tracking code entirely, or deprecate it.

Thanks!

Feature request: Allow setting listen address on command line & infer URL from request

Currently the app is fixed to listen on 0.0.0.0:5005. It would be great if the bind address and port could be set from the command line (e.g. labml app-server --bind-address=127.0.0.1 --port=5678).

Also, the webpage will always try to fetch data from localhost:5005, making it inconvenient to connect to the server from a non-local machine. The host URL should be inferred from the request instead of being hardcoded.
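A minimal sketch of how such flags might be parsed, using stdlib argparse; --bind-address and --port are the hypothetical options proposed above, not actual labml CLI arguments:

```python
import argparse

def parse_server_args(argv):
    """Parse hypothetical bind-address/port flags for an app server."""
    parser = argparse.ArgumentParser(prog="labml app-server")
    parser.add_argument("--bind-address", default="0.0.0.0",
                        help="interface to listen on")
    parser.add_argument("--port", type=int, default=5005,
                        help="port to listen on")
    return parser.parse_args(argv)

args = parse_server_args(["--bind-address", "127.0.0.1", "--port", "5678"])
print(args.bind_address, args.port)  # 127.0.0.1 5678
```

The defaults preserve today's behavior (0.0.0.0:5005), so adding the flags would be backwards compatible.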

Silent configs

Configs without multiple options and with explicitly specified values should be treated as silent by default. We could have an API to explicitly mark configs as silent or not.

We can treat non-silent configs as hyperparameters on the lab dashboard and when writing Tensorboard HParams.
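The proposed default rule could be sketched as follows; ConfigItem and its fields are hypothetical names for illustration, not the actual labml API:

```python
from dataclasses import dataclass, field

@dataclass
class ConfigItem:
    """Hypothetical config entry: silent unless it has alternative options
    or lacks an explicitly supplied value."""
    name: str
    options: list = field(default_factory=list)  # alternative calculators
    explicit_value: object = None
    silent: bool = None  # None means "decide by the default rule"

    def is_silent(self):
        if self.silent is not None:  # explicit marking wins
            return self.silent
        return not self.options and self.explicit_value is not None

# Non-silent items would be surfaced as hyperparameters.
lr = ConfigItem("learning_rate", options=["adam_lr", "sgd_lr"])
seed = ConfigItem("seed", explicit_value=42)
print(lr.is_silent(), seed.is_silent())  # False True
```

Under this rule, a fixed seed stays quiet while a config with competing options shows up as a hyperparameter, matching the behavior proposed above.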

tensorflow import?

I don't think the tensorflow import in the experiments.pytorch file is necessary - you can write to tensorboard without tensorflow.

Indeed, some of your usages are actually deprecated.

Implement Flash attention stable diffusion problems

I cannot adapt your code for flash attention in stable diffusion. Your arguments for the SpatialTransformers function are different from the CompVis repository, and I don't know whether you use different arguments or just renamed them (for example, there is a "depth" argument that I don't find in your version).

Tracker bug: UnicodeEncodeError: 'charmap' codec can't encode characters

After I stopped the training with tracker and started it again, I see the following error from experiment.record(name=args.experiment_name):

Traceback (most recent call last):
  File "model_combined.py", line 282, in <module>
    with experiment.record(name=args.experiment_name):
  File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\experiment.py", line 439, in record
    return start()
  File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\experiment.py", line 278, in start
    return _experiment_singleton().start(run_uuid=_load_run_uuid, checkpoint=_load_checkpoint)
  File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\internal\experiment\__init__.py", line 463, in start
    self.run.save_info()
  File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\internal\experiment\experiment_run.py", line 249, in save_info
    f.write(self.diff)
  File "C:\Users\me\anaconda3\envs\idio\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2827-2831: character maps to <undefined>
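The traceback shows f.write(self.diff) failing because, on Windows, open() without an explicit encoding falls back to the legacy locale code page (cp1252 here), which cannot represent every character a git diff may contain. A minimal sketch of the usual fix, passing encoding='utf-8' explicitly (whether labml's experiment_run.py should do this is exactly the point of this report; the path below is illustrative):

```python
import os
import tempfile

# A diff containing characters outside cp1252 / gbk code pages.
text = "git diff with non-latin text: ΞΈ β‰ˆ 0.5 δ½ ε₯½"

path = os.path.join(tempfile.mkdtemp(), "run.diff")

# Without an explicit encoding, Windows may pick cp1252 and raise
# UnicodeEncodeError; forcing UTF-8 makes the write portable.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    assert f.read() == text
```

Setting the PYTHONUTF8=1 environment variable is a user-side workaround that forces UTF-8 mode without code changes.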

502 - Bad Gateway

Hi,
since yesterday, I constantly receive the message '502 Bad Gateway' every time I launch a labml experiment, both from Jupyter notebook and from Colab. Here is an example:

(screenshot: Schermata da 2021-10-11 21-10-26)

Moreover, I get this error from https://app.labml.ai/runs:
(screenshot: Schermata da 2021-10-11 21-04-28)

Is there a problem with your app?

Thanks in advance.

Is this Open Source?

Hi,
The readme says this is Open Source, but where is the MIT license file?

Network error in comparison section

Issue: when runs that were added to the comparison section are then deleted, there is a network error plus a 404 error.


Run in app.labml.ai

How to reproduce: create 2 runs, add one to the other to compare, then delete the run that was added.

Tested in an incognito tab, so this is not a cache/cookies problem.

Columns and DataType Not Explicitly Set on line 133 of build_numpy_cache.py

Hello!

I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem: If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, it may silently continue to the next step even though the input is unexpected, which may cause errors later. The same applies to other data importing scenarios.
Solution: It is recommended to set the columns and DataType explicitly in data processing.
Impact: Readability

Example:

### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]

### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})

You can find the code related to this smell in this link: https://github.com/lab-ml/labml/blob/deea217e6d13d245d32ff904593876e5c56d3528/samples/stocks/build_numpy_cache.py#L123-L143.

I also found instances of this smell in other files, such as:

File: https://github.com/lab-ml/labml/blob/master/helpers/labml_helpers/datasets/csv.py#L16-L26 Line: 21

I hope this information is helpful!

Support for TF 2.0

The previous issue #2 suggested that migration to the TensorFlow 2.0 API was underway. Has it been completed, or is there room to work on it at the moment?

Thanks.

UnicodeEncodeError: 'gbk' codec can't encode character

Hello, first of all thank you for your open source work; I'm trying to learn to use this module recently. But today I got the following error when running the model; it seems like something goes wrong when labml reads the Git commit. What should I do to fix it?


500 Error Issue

Hi,

I see this error "Oops! Something went wrong 500 Seems like we are having issues right now".

I'm also unable to run the app locally. Please advise.

labml app-server gives me: labml: error: argument command: invalid choice: 'app-server' (choose from 'dashboard', 'capture', 'launch', 'monitor', 'service', 'service-run')

Checkpointing optimizers

Hi,
I am working with your framework. First of all, great job. It really saved me from the usual research mess :)
I have some questions about checkpointing. I've seen that each layer is saved in .npy format. However, this does not work for other objects that are based on state_dict, for example optimizers. For long training runs they should be saved with the model, since we don't want to retrain the whole model from scratch. I've looked into your checkpointing strategy here. Do you see any significant problem if, instead of saving all layers in .npy files, we directly save the state_dict?
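The state_dict-based approach described above can be sketched generically; TinyOptimizer is a stand-in assuming the object exposes state_dict()/load_state_dict() the way PyTorch optimizers do, and plain pickle stands in for torch.save:

```python
import pickle

class TinyOptimizer:
    """Stand-in for a torch.optim optimizer: all resumable state in a dict."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.step_count = 0

    def step(self):
        self.step_count += 1

    def state_dict(self):
        # Everything needed to resume, in one serializable dict.
        return {"lr": self.lr, "step_count": self.step_count}

    def load_state_dict(self, state):
        self.lr = state["lr"]
        self.step_count = state["step_count"]

opt = TinyOptimizer(lr=0.1)
opt.step(); opt.step()

blob = pickle.dumps(opt.state_dict())   # torch.save(...) in practice

restored = TinyOptimizer()
restored.load_state_dict(pickle.loads(blob))
print(restored.lr, restored.step_count)  # 0.1 2
```

Persisting the whole state_dict sidesteps the per-layer .npy layout, at the cost of tying the checkpoint format to the object's own serialization.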

Running issue...

Still updating app.labml.ai, please wait for it to complete...
I cannot visualize anything when I run the Colab example.

Failed to connect server

Hello, when I start the experiment, labml warns: "[WinError 10061] No connection could be made because the target machine actively refused it. Failed to connect: http://localhost:5005/api/v1/track?". Also, I can't use the command labml app-server to start the labml server in Anaconda Prompt, although I have installed the labml-app package.

How can I restart the thread again?

I am trying to use labml in my project. I ran my code and it produced an error. I tried to fix it and ran it again, but the second run produced the following error:


RuntimeError                              Traceback (most recent call last)

<ipython-input-42-c44305bd3335> in <module>()
    428 #
    429 if __name__ == '__main__':
--> 430     main()

10 frames

/usr/lib/python3.7/threading.py in start(self)
    846 
    847         if self._started.is_set():
--> 848             raise RuntimeError("threads can only be started once")
    849         with _active_limbo_lock:
    850             _limbo[self] = self

RuntimeError: threads can only be started once

I think the previous thread is still running!
How can I reset/kill/restart the thread to run again from the starting point?
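The restriction in the traceback above is Python's, not labml's: a threading.Thread object can only be started once, so "restarting" means constructing a fresh Thread (or, in a notebook, restarting the kernel so labml creates its background thread anew). A minimal stdlib sketch:

```python
import threading

def worker():
    pass  # stand-in for a background tracking loop

t = threading.Thread(target=worker)
t.start()
t.join()

# Calling start() on the same Thread object again raises RuntimeError.
try:
    t.start()
except RuntimeError as e:
    print(e)  # threads can only be started once

# The fix is to create a brand-new Thread object instead.
t2 = threading.Thread(target=worker)
t2.start()
t2.join()
```

In a Colab session this is why re-running the cell fails: the old thread object survives in the interpreter, so a kernel restart (or having the library recreate its thread) is needed.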

Hardware Naming In Monitor

I'm trying to monitor multiple machines' usage. Their names in the dashboard are always My Computer. It seems there is no option in configs.yaml for naming a machine.
