labmlai / labml
📊 Monitor deep learning model training and hardware usage from your mobile phone 📱
Home Page: https://labml.ai
License: MIT License
Hello, when I start the experiment, labml warns: "[WinError 10061] No connection could be made because the target machine actively refused it. Failed to connect: http://localhost:5005/api/v1/track?". I also can't use the command labml app-server to start the labml server in the Anaconda Prompt, although I have installed the labml-app package.
RuntimeError Traceback (most recent call last)
<ipython-input-42-c44305bd3335> in <module>()
428 #
429 if __name__ == '__main__':
--> 430 main()
10 frames
/usr/lib/python3.7/threading.py in start(self)
846
847 if self._started.is_set():
--> 848 raise RuntimeError("threads can only be started once")
849 with _active_limbo_lock:
850 _limbo[self] = self
RuntimeError: threads can only be started once
I think the previous thread is still running.
How can I reset/kill/restart the thread so it runs again from the starting point?
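For context: a Python Thread object can only be started once; "restarting" means constructing a new Thread around the same target. A minimal sketch:

```python
import threading

def worker():
    print("running")

# A Thread object can only be started once.
t = threading.Thread(target=worker)
t.start()
t.join()

# Calling t.start() again would raise RuntimeError("threads can only be
# started once"); to run the work again, build a fresh Thread instead.
t2 = threading.Thread(target=worker)
t2.start()
t2.join()
```

In labml's case that likely means the tracker's background thread has to be recreated (or the process restarted) rather than the old one reused.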
The app.labml.ai homepage currently displays some private runs that should not be visible to the public when the site is visited for the first time, even without logging in. It only asks for a login after a refresh or revisit.
Configs without multiple options and with explicitly specified values should be treated as silent by default. We could add an API to explicitly mark configs as silent or not.
Non-silent configs could then be treated as hyperparameters on the labml dashboard and when writing TensorBoard HParams.
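One way the silent-by-default rule could be decided, as a rough sketch (the is_silent helper and the dict fields are hypothetical, not labml's actual API):

```python
# Hypothetical sketch: a config is silent unless it offers multiple
# options or the user explicitly marks it as a hyperparameter.
def is_silent(config: dict) -> bool:
    if config.get('explicit_hparam'):       # explicitly marked via the proposed API
        return False
    if len(config.get('options', [])) > 1:  # multiple options imply a real choice
        return False
    return True

configs = [
    {'name': 'learning_rate', 'options': [], 'explicit_hparam': False},
    {'name': 'optimizer', 'options': ['adam', 'sgd'], 'explicit_hparam': False},
]
# Only non-silent configs would be shown as hyperparameters / sent to HParams.
hparams = [c['name'] for c in configs if not is_silent(c)]
```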
I cannot adapt your code for flash attention in Stable Diffusion. Your arguments to the SpatialTransformer function are different from the CompVis repository, and I don't know whether you use different arguments or just renamed them (for example, there is a "depth" argument that I don't find in your version).
Currently the app is fixed to listen on 0.0.0.0:5005. It would be great if the bind address and port could be set from the command line (e.g. labml app-server --bind-address=127.0.0.1 --port=5678).
Also, the webpage will always try to fetch data from localhost:5005, making it inconvenient to connect to the server from a non-local machine. The host URL should be inferred from the request instead of being hardcoded.
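A sketch of how such flags could be parsed (the --bind-address/--port names follow the suggestion above and are not labml's actual options):

```python
import argparse

# Hypothetical CLI sketch; labml does not currently expose these flags.
parser = argparse.ArgumentParser(prog='labml')
subparsers = parser.add_subparsers(dest='command')
app = subparsers.add_parser('app-server')
app.add_argument('--bind-address', default='0.0.0.0')
app.add_argument('--port', type=int, default=5005)

args = parser.parse_args(['app-server', '--bind-address', '127.0.0.1', '--port', '5678'])
bind = f'{args.bind_address}:{args.port}'  # would be handed to gunicorn as its bind target
```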
The previous issue #2 suggested that migration to the TensorFlow 2.0 API was underway. Has it been completed, or is there room to work on it at the moment?
Thanks.
I started a server using the labml app-server command; however, opening localhost:5005 from a browser keeps waiting. Using wget http://localhost:5005/api/v1/, it keeps displaying:
--2024-04-02 14:59:16 -- http://localhost:5005/api/v1/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:5005... connected.
HTTP request sent, awaiting response...
The output of app-server is as follows:
[2024-04-02 14:59:00 +0800] [822732] [INFO] Starting gunicorn 21.2.0
[2024-04-02 14:59:00 +0800] [822732] [INFO] Listening at: http://0.0.0.0:5005 (822732)
[2024-04-02 14:59:00 +0800] [822732] [INFO] Using worker: uvicorn.workers.
[2024-04-02 14:59:00 +0800] [822733] [INFO] Booting worker with pid: 822733
[2024-04-02 14:59:03 +0800] [822732] [INFO] Handling signal: winch
[2024-04-02 14:59:30 +0800] [822732] [CRITICAL] WORKER TIMEOUT (pid:822733)
[2024-04-02 14:59:31 +0800] [822732] [ERROR] Worker (pid:822733) was sent code 134!
[2024-04-02 14:59:31 +0800] [823163] [INFO] Booting worker with pid: 823163
Hi,
the readme says this project is open source, but where is the MIT license file?
After I stopped training with the tracker and started it again, I see the following error from experiment.record(name=args.experiment_name):
Traceback (most recent call last):
File "model_combined.py", line 282, in <module>
with experiment.record(name=args.experiment_name):
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\experiment.py", line 439, in record
return start()
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\experiment.py", line 278, in start
return _experiment_singleton().start(run_uuid=_load_run_uuid, checkpoint=_load_checkpoint)
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\internal\experiment\__init__.py", line 463, in start
self.run.save_info()
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\internal\experiment\experiment_run.py", line 249, in save_info
f.write(self.diff)
File "C:\Users\me\anaconda3\envs\idio\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2827-2831: character maps to <undefined>
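The failure is the Windows default codec: open() without an explicit encoding uses cp1252 there, which cannot represent every character a git diff may contain. A sketch of the fix (save_diff is an illustrative helper, not labml's actual function):

```python
from pathlib import Path

def save_diff(path: Path, diff: str) -> None:
    # Passing encoding='utf-8' avoids the cp1252 default on Windows,
    # so arbitrary characters in the diff can be written.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(diff)

save_diff(Path('source.diff'), 'δ-loss ≈ 0.01\n')  # characters outside cp1252 now round-trip
```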
logger.inspect(model)
should print a model summary when model is an instance of torch.nn.Module.
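A minimal sketch of the kind of summary this could print, using plain PyTorch (a stand-in for the requested behaviour, not labml's implementation):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# List each parameter tensor with its shape, then the total count.
for name, p in model.named_parameters():
    print(f'{name}: {tuple(p.shape)}')
total = sum(p.numel() for p in model.parameters())
print(f'total parameters: {total}')
```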
Issue: when runs that were added to the comparison section are deleted, there is a network error plus a 404 error.
How to reproduce: create two runs, add one to the other to compare, then delete the run that was added.
Tested in an incognito tab, so this is not a cache/cookies problem.
Hi,
I see this error: "Oops! Something went wrong. 500. Seems like we are having issues right now."
I'm also unable to run the app locally. Please advise.
labml app-server gives me: labml: error: argument command: invalid choice: 'app-server' (choose from 'dashboard', 'capture', 'launch', 'monitor', 'service', 'service-run')
Hi,
I am working with your framework. First of all, great job. It really saved me from the usual research mess :)
I have some questions about checkpointing. I've seen that each layer is saved in .npy format. However, this does not work for other objects based on a state_dict, for example optimizers. For long trainings they should be saved with the model, since we don't want to retrain the whole model from scratch. I've looked into your checkpointing strategy here. Do you see any significant problem if, instead of saving all layers in .npy files, we directly save the state_dict?
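For comparison, a sketch of the state_dict route the question proposes, using plain torch.save/torch.load (not labml's current checkpointing code):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# Save model and optimizer state together, so a long training run
# can be resumed instead of restarted from scratch.
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pt')

# Restoring both puts training back where it left off.
state = torch.load('checkpoint.pt')
model.load_state_dict(state['model'])
optimizer.load_state_dict(state['optimizer'])
```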
Hello there!
First of all thanks for your library, used it in my recent open source project!
Now, I want to share my criticism.
I have another project where we named our repo's remotes differently from the default: bars and upstream; there is no origin, as you can see.
So I got this error:
.labml.yml:
check_repo_dirty: false
experiments_path: '.labml'
web_api: 'secret'
Error:
Traceback (most recent call last):
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/src/train.py", line 395, in <module>
train()
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/src/train.py", line 354, in train
with experiment.record(name=MODEL_SAVE_NAME, exp_conf=args.__dict__) if args.labml else ExitStack():
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/experiment.py", line 388, in record
create(name=name,
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/experiment.py", line 86, in create
_create_experiment(uuid=uuid,
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/internal/experiment/__init__.py", line 511, in create_experiment
_internal = Experiment(uuid=uuid,
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/internal/experiment/__init__.py", line 225, in __init__
self.run.repo_remotes = list(repo.remote().urls)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/remote.py", line 553, in urls
raise ex
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/remote.py", line 529, in urls
remote_details = self.repo.git.remote("get-url", "--all", self.name)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 545, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 1011, in _call_process
return self.execute(call, **exec_kwargs)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 828, in execute
raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git remote get-url --all origin
stderr: 'fatal: No such remote 'origin''
So my question №1 is: why does labml check git branches/remotes/commits? What is the idea behind this logic? I think a library for monitoring ML training doesn't need to do that. If a developer or data scientist wants to track git and prevent training because of uncommitted changes, they can write that logic on their own.
Question №2: if I set check_repo_dirty: false, why does labml still check the repo? And what is the default value of this parameter?
Two possible suggestions: guard all this git code in labml/internal/experiment/__init__.py with the check_repo_dirty flag and a try block, e.g.:
if self.check_repo_dirty:
    try:
        repo = git.Repo(lab_singleton().path)
        self.run.repo_remotes = list(repo.remote().urls)
        self.run.commit = repo.head.commit.hexsha
        self.run.commit_message = repo.head.commit.message.strip()
        self.run.is_dirty = repo.is_dirty()
        self.run.diff = repo.git.diff()
    except git.InvalidGitRepositoryError:
        if not is_colab() and not is_kaggle():
            labml_notice(["Not a valid git repository: ",
                          (str(lab_singleton().path), Text.value)])
        self.run.commit = 'unknown'
        self.run.commit_message = ''
        self.run.is_dirty = True
        self.run.diff = ''
Thanks!
I can't do anything with labml, even something as simple as executing:
It is stuck forever, and I literally downloaded and executed your notebook https://colab.research.google.com/github/lab-ml/labml/blob/master/guides/monitor.ipynb
and it gets stuck at the sixth cell: tracker.set_queue('loss.train', 20, True)
What should I do, please?
Thanks for this great project! Would it be possible to also monitor/track disk I/O utilization?
When I make a function call:
experiment.add_pytorch_models(dict(model=conf.model))
AttributeError Traceback (most recent call last)
/tmp/ipykernel_1766/1758481103.py in
----> 1 experiment.add_pytorch_models(dict(model=conf.model))
AttributeError: module 'labml.experiment' has no attribute 'add_pytorch_models'
I don't think the tensorflow import in the experiments.pytorch file is necessary; you can write to TensorBoard without TensorFlow.
Indeed, some of your usages are actually deprecated.
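For reference, TensorBoard logs can be written with PyTorch's own SummaryWriter, with no TensorFlow installed (it only needs the tensorboard package); a minimal sketch:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/example')
for step in range(3):
    # Scalar summaries are written as standard TensorBoard event files.
    writer.add_scalar('loss/train', 1.0 / (step + 1), step)
writer.close()
```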
I'm trying to monitor multiple machines' usage. Their names in the dashboard are always "My Computer". It seems there is no option in configs.yaml for naming a machine.
Hello!
I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set
You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.
According to the paper, the smell is described as follows:
| Aspect | Description |
|---|---|
| Problem | If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, processing may silently continue to the next step even though the input is unexpected, which may cause errors later. The same applies to other data-importing scenarios. |
| Solution | It is recommended to set the columns and DataType explicitly in data processing. |
| Impact | Readability |
Example:
### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]
### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})
You can find the code related to this smell in this link: https://github.com/lab-ml/labml/blob/deea217e6d13d245d32ff904593876e5c56d3528/samples/stocks/build_numpy_cache.py#L123-L143.
I also found instances of this smell in other files, such as:
File: https://github.com/lab-ml/labml/blob/master/helpers/labml_helpers/datasets/csv.py#L16-L26, Line 21.
I hope this information is helpful!
Hi,
since yesterday, I constantly receive the message '502 Bad Gateway' every time I launch a labml experiment, both from a Jupyter notebook and from Colab. Here is an example:
Moreover, I get this error from https://app.labml.ai/runs :
Is there a problem with your app?
Thanks in advance.