labmlai / labml
📊 Monitor deep learning model training and hardware usage from your mobile phone 📱
Home Page: https://labml.ai
License: MIT License
Hello, when I start the experiment, labml warns: "[WinError 10061] No connection could be made because the target machine actively refused it. Failed to connect: http://localhost:5005/api/v1/track?". I also can't use the command labml app-server to start the labml server in the Anaconda Prompt, although I have installed the labml-app package.
RuntimeError Traceback (most recent call last)
<ipython-input-42-c44305bd3335> in <module>()
428 #
429 if __name__ == '__main__':
--> 430 main()
10 frames
/usr/lib/python3.7/threading.py in start(self)
846
847 if self._started.is_set():
--> 848 raise RuntimeError("threads can only be started once")
849 with _active_limbo_lock:
850 _limbo[self] = self
RuntimeError: threads can only be started once
I think the previous thread is still running.
How can I reset/kill/restart the thread so it runs again from the starting point?
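For context: a Python Thread object can only be started once; "restarting" means constructing a new Thread around the same target. A minimal sketch:

```python
import threading

def worker():
    print("running")

# A Thread object can only be started once.
t = threading.Thread(target=worker)
t.start()
t.join()

# Calling t.start() again would raise RuntimeError("threads can only be
# started once"); to run the work again, build a fresh Thread instead.
t2 = threading.Thread(target=worker)
t2.start()
t2.join()
```

In labml's case that likely means the tracker's background thread has to be recreated (or the process restarted) rather than the old one reused.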
The app.labml.ai homepage currently displays some private runs that should not be visible to the public when the site is visited for the first time, even without logging in. It only asks for a login after a refresh or revisit.
Configs without multiple options and with explicitly specified values should be treated as silent by default. We could add an API to explicitly mark configs as silent or not.
Non-silent configs could then be treated as hyperparameters on the labml dashboard and when writing TensorBoard HParams.
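One way the silent-by-default rule could be decided, as a rough sketch (the is_silent helper and the dict fields are hypothetical, not labml's actual API):

```python
# Hypothetical sketch: a config is silent unless it offers multiple
# options or the user explicitly marks it as a hyperparameter.
def is_silent(config: dict) -> bool:
    if config.get('explicit_hparam'):       # explicitly marked via the proposed API
        return False
    if len(config.get('options', [])) > 1:  # multiple options imply a real choice
        return False
    return True

configs = [
    {'name': 'learning_rate', 'options': [], 'explicit_hparam': False},
    {'name': 'optimizer', 'options': ['adam', 'sgd'], 'explicit_hparam': False},
]
# Only non-silent configs would be shown as hyperparameters / sent to HParams.
hparams = [c['name'] for c in configs if not is_silent(c)]
```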
I cannot adapt your code for flash attention in Stable Diffusion. Your arguments to the SpatialTransformer function are different from the CompVis repository, and I don't know whether you use different arguments or just renamed them (for example, there is a "depth" argument that I don't find in your version).
Currently the app is fixed to listen on 0.0.0.0:5005. It would be great if the bind address and port could be set from the command line (e.g. labml app-server --bind-address=127.0.0.1 --port=5678).
Also, the webpage will always try to fetch data from localhost:5005, making it inconvenient to connect to the server from a non-local machine. The host URL should be inferred from the request instead of being hardcoded.
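A sketch of how such flags could be parsed (the --bind-address/--port names follow the suggestion above and are not labml's actual options):

```python
import argparse

# Hypothetical CLI sketch; labml does not currently expose these flags.
parser = argparse.ArgumentParser(prog='labml')
subparsers = parser.add_subparsers(dest='command')
app = subparsers.add_parser('app-server')
app.add_argument('--bind-address', default='0.0.0.0')
app.add_argument('--port', type=int, default=5005)

args = parser.parse_args(['app-server', '--bind-address', '127.0.0.1', '--port', '5678'])
bind = f'{args.bind_address}:{args.port}'  # would be handed to gunicorn as its bind target
```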
The previous issue #2 suggested that migration to the TensorFlow 2.0 API was underway. Has it been completed, or is there room to work on it at the moment?
Thanks.
I started a server using the labml app-server command; however, opening localhost:5005 from a browser keeps waiting. Using wget http://localhost:5005/api/v1/, it keeps displaying:
--2024-04-02 14:59:16 -- http://localhost:5005/api/v1/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:5005... connected.
HTTP request sent, awaiting response...
The output of app-server is as follows:
[2024-04-02 14:59:00 +0800] [822732] [INFO] Starting gunicorn 21.2.0
[2024-04-02 14:59:00 +0800] [822732] [INFO] Listening at: http://0.0.0.0:5005 (822732)
[2024-04-02 14:59:00 +0800] [822732] [INFO] Using worker: uvicorn.workers.
[2024-04-02 14:59:00 +0800] [822733] [INFO] Booting worker with pid: 822733
[2024-04-02 14:59:03 +0800] [822732] [INFO] Handling signal: winch
[2024-04-02 14:59:30 +0800] [822732] [CRITICAL] WORKER TIMEOUT (pid:822733)
[2024-04-02 14:59:31 +0800] [822732] [ERROR] Worker (pid:822733) was sent code 134!
[2024-04-02 14:59:31 +0800] [823163] [INFO] Booting worker with pid: 823163
Hi,
the readme says this project is open source, but where is the MIT license file?
After I stopped training with the tracker and started it again, I see the following error from experiment.record(name=args.experiment_name):
Traceback (most recent call last):
File "model_combined.py", line 282, in <module>
with experiment.record(name=args.experiment_name):
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\experiment.py", line 439, in record
return start()
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\experiment.py", line 278, in start
return _experiment_singleton().start(run_uuid=_load_run_uuid, checkpoint=_load_checkpoint)
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\internal\experiment\__init__.py", line 463, in start
self.run.save_info()
File "C:\Users\miles\anaconda3\envs\idio\lib\site-packages\labml\internal\experiment\experiment_run.py", line 249, in save_info
f.write(self.diff)
File "C:\Users\me\anaconda3\envs\idio\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2827-2831: character maps to <undefined>
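The failure is the Windows default codec: open() without an explicit encoding uses cp1252 there, which cannot represent every character a git diff may contain. A sketch of the fix (save_diff is an illustrative helper, not labml's actual function):

```python
from pathlib import Path

def save_diff(path: Path, diff: str) -> None:
    # Passing encoding='utf-8' avoids the cp1252 default on Windows,
    # so arbitrary characters in the diff can be written.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(diff)

save_diff(Path('source.diff'), 'δ-loss ≈ 0.01\n')  # characters outside cp1252 now round-trip
```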
logger.inspect(model)
should print a model summary when model is an instance of torch.nn.Module.
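A minimal sketch of the kind of summary this could print, using plain PyTorch (a stand-in for the requested behaviour, not labml's implementation):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# List each parameter tensor with its shape, then the total count.
for name, p in model.named_parameters():
    print(f'{name}: {tuple(p.shape)}')
total = sum(p.numel() for p in model.parameters())
print(f'total parameters: {total}')
```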
Issue: when runs that were added to the comparison section are deleted, there is a network error plus a 404 error.
How to reproduce: create two runs, add one to the other to compare, then delete the run that was added.
Tested in an incognito tab, so this is not a cache/cookies problem.
Hi,
I see this error: "Oops! Something went wrong. 500. Seems like we are having issues right now."
I'm also unable to run the app locally. Please advise.
labml app-server gives me: labml: error: argument command: invalid choice: 'app-server' (choose from 'dashboard', 'capture', 'launch', 'monitor', 'service', 'service-run')
Hi,
I am working with your framework. First of all, great job. It really saved me from the usual research mess :)
I have some questions about checkpointing. I've seen that each layer is saved in .npy format. However, this does not work for other objects based on a state_dict, for example optimizers. For long trainings they should be saved with the model, since we don't want to retrain the whole model from scratch. I've looked into your checkpointing strategy here. Do you see any significant problem if, instead of saving all layers in .npy files, we directly save the state_dict?
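For comparison, a sketch of the state_dict route the question proposes, using plain torch.save/torch.load (not labml's current checkpointing code):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# Save model and optimizer state together, so a long training run
# can be resumed instead of restarted from scratch.
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pt')

# Restoring both puts training back where it left off.
state = torch.load('checkpoint.pt')
model.load_state_dict(state['model'])
optimizer.load_state_dict(state['optimizer'])
```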
Hello there!
First of all thanks for your library, used it in my recent open source project!
Now, I want to share my criticism.
I have another project where we named our repo's remotes differently from the default: bars and upstream; there is no origin, as you can see.
So I got this error:
.labml.yml:
check_repo_dirty: false
experiments_path: '.labml'
web_api: 'secret'
Error:
Traceback (most recent call last):
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/src/train.py", line 395, in <module>
train()
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/src/train.py", line 354, in train
with experiment.record(name=MODEL_SAVE_NAME, exp_conf=args.__dict__) if args.labml else ExitStack():
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/experiment.py", line 388, in record
create(name=name,
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/experiment.py", line 86, in create
_create_experiment(uuid=uuid,
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/internal/experiment/__init__.py", line 511, in create_experiment
_internal = Experiment(uuid=uuid,
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/labml/internal/experiment/__init__.py", line 225, in __init__
self.run.repo_remotes = list(repo.remote().urls)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/remote.py", line 553, in urls
raise ex
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/remote.py", line 529, in urls
remote_details = self.repo.git.remote("get-url", "--all", self.name)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 545, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 1011, in _call_process
return self.execute(call, **exec_kwargs)
File "/media/sviperm/9740514d-d8c8-4f3e-afee-16ce6923340c3/sviperm/Documents/Aurora/Aurora.ContextualMistakes/venv/lib/python3.9/site-packages/git/cmd.py", line 828, in execute
raise GitCommandError(command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git remote get-url --all origin
stderr: 'fatal: No such remote 'origin''
So my question №1 is: why does labml check git branches/remotes/commits? What is the idea behind this logic? I think a library for monitoring ML training doesn't need to do that. If a developer or data scientist wants to track git and prevent training because of uncommitted changes, they can write that logic on their own.
Question №2: if I set check_repo_dirty: false, why does labml still check the repo? And what is the default value of this parameter?
Two possible suggestions: guard all this git code in labml/internal/experiment/__init__.py with the check_repo_dirty flag and a try block, e.g.:
if self.check_repo_dirty:
    try:
        repo = git.Repo(lab_singleton().path)
        self.run.repo_remotes = list(repo.remote().urls)
        self.run.commit = repo.head.commit.hexsha
        self.run.commit_message = repo.head.commit.message.strip()
        self.run.is_dirty = repo.is_dirty()
        self.run.diff = repo.git.diff()
    except git.InvalidGitRepositoryError:
        if not is_colab() and not is_kaggle():
            labml_notice(["Not a valid git repository: ",
                          (str(lab_singleton().path), Text.value)])
        self.run.commit = 'unknown'
        self.run.commit_message = ''
        self.run.is_dirty = True
        self.run.diff = ''
Thanks!
I can't do anything with labml, even something as simple as executing:
It is stuck forever, and I literally downloaded and executed your notebook https://colab.research.google.com/github/lab-ml/labml/blob/master/guides/monitor.ipynb
and it gets stuck at the sixth cell: tracker.set_queue('loss.train', 20, True)
What should I do, please?
Thanks for this great project! Would it be possible to also monitor/track disk I/O utilization?
When I make a function call:
experiment.add_pytorch_models(dict(model=conf.model))
AttributeError Traceback (most recent call last)
/tmp/ipykernel_1766/1758481103.py in
----> 1 experiment.add_pytorch_models(dict(model=conf.model))
AttributeError: module 'labml.experiment' has no attribute 'add_pytorch_models'
I don't think the tensorflow import in the experiments.pytorch file is necessary; you can write to TensorBoard without TensorFlow.
Indeed, some of your usages are actually deprecated.
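For reference, TensorBoard logs can be written with PyTorch's own SummaryWriter, with no TensorFlow installed (it only needs the tensorboard package); a minimal sketch:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/example')
for step in range(3):
    # Scalar summaries are written as standard TensorBoard event files.
    writer.add_scalar('loss/train', 1.0 / (step + 1), step)
writer.close()
```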
I'm trying to monitor multiple machines' usage. Their names in the dashboard are always "My Computer". It seems there is no option in configs.yaml for naming a machine.
Hello!
I found an AI-Specific Code smell in your project.
The smell is called: Columns and DataType Not Explicitly Set
You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.
According to the paper, the smell is described as follows:
| Aspect | Description |
|---|---|
| Problem | If the columns are not selected explicitly, it is not easy for developers to know what to expect in the downstream data schema. If the datatype is not set explicitly, processing may silently continue to the next step even though the input is unexpected, which may cause errors later. The same applies to other data-importing scenarios. |
| Solution | It is recommended to set the columns and DataType explicitly in data processing. |
| Impact | Readability |
Example:
### Pandas Column Selection
import pandas as pd
df = pd.read_csv('data.csv')
+ df = df[['col1', 'col2', 'col3']]
### Pandas Set DataType
import pandas as pd
- df = pd.read_csv('data.csv')
+ df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})
You can find the code related to this smell in this link: https://github.com/lab-ml/labml/blob/deea217e6d13d245d32ff904593876e5c56d3528/samples/stocks/build_numpy_cache.py#L123-L143.
I also found instances of this smell in other files, such as:
File: https://github.com/lab-ml/labml/blob/master/helpers/labml_helpers/datasets/csv.py#L16-L26, Line 21.
I hope this information is helpful!
Hi,
since yesterday, I constantly receive the message '502 Bad Gateway' every time I launch a labml experiment, both from a Jupyter notebook and from Colab. Here is an example:
Moreover, I get this error from https://app.labml.ai/runs :
Is there a problem with your app?
Thanks in advance.