apple / corenet Goto Github PK

View Code? Open in Web Editor NEW

6.7K 6.7K 516.0 837 KB

CoreNet: A library for training deep neural networks

License: Other

Makefile 0.26% Python 99.72% Jupyter Notebook 0.02%

corenet's People

Contributors

Stargazers

Watchers

Forkers

shauryashaurya kustomzone tangyorigami djsmanchanda dfischer drahfa zhangzhuobys pratikdhanave jithinraj nikita9 mlubbad cvlabsio mohamedainab briancpark manfar evelynmitchell de30 gr4y41 mz0in strategist922 linecode eltociear hendricksjudy anask63 tony sparupat yorodm beinerches etyme jn7163 pprp petercao prahs codingbjorn yickling freedom-democracy hubayirp kingfener yldrmali f901107 furyhawk mjsrog chillyagi marnusvanwyk hippo0x0 alexyangle apollohuang1 slideicy m-i conchfeng yoshikaz1228 vickkyy shomu-maersk architectureofthings shubhamporiya rkp64 osabero gokul-06 lyphstyles ericismyeldestson codeduckky dafuk yuan776 akamil-etsy shahinsharifi shepardo yichi-lu danijelkecman skaiphd paperwave aakashapoorv fiditenemini abhiramvad techthiyanes cubatlin rickyhong 0iui0 waghts95 techventurebuilder faithfulnguyen nscooling thanhpham1987 skytodmoon ajinkyapuar wplayergy sporksenet-horatorbr shoesrox-85 vibrantman53 postil-z prasanth595 billite-jiggyough decentralizedbug charliechap3 kedokudo weedge cshang2017 dotlyhiyou hertera1 yingzi6776 vincentsider

corenet's Issues

When are you going to license as MIT or other FOSS license?

Why use interpreted language ?

Why use interpreted language ? An interpreted language is slower and less powerful. You should use a compiled language such as c++.

For example I had work on [this project] in c++.

You should definitely stop using interpreted language.

A HF/Docker/Modal reproducible training/inference example

Considering that it depends on specific torch (torch==2.2.1) and possibly CUDA, many MacBooks won't be able to run some of the examples. If you want to run tests and notebooks, you'll need lfs and so on - so it becomes an infra nightmare.

Is there any plan to create a template for training/inference on Docker / Modal.com, using say pytorch/pytorch:2.2.1-cuda12.1-cudnn8-devel?
Is there any plan to create a HuggingFace space on at least one of the 10+ demos?
I see that pip install with mlx support already requires huggingface_hub. Is there a reason why?

/bin/bash: corenet-train：未找到命令

When I run the train_a_new_model_on_a_new_dataset_from_scratch.ipynb file, I get the error /bin/bash: corenet-train: Command not found, how to solve it？

Streaming HuggingFace Datasets

Hi, is there a possibility to adjust the code such that downloading the datasets is not necessary but streaming HuggingFace datasets.
Even high-level guidelines would be nice!

'freeze_modules_based_on_opts()' is freezing module parameters twice

Example

opts = argparse.Namespace(**{"model.freeze_modules": "conv1"})


model = nn.Sequential(
     OrderedDict([
          ('conv1', nn.Conv2d(20,64,5)),
          ('conv2', nn.Conv2d(20,64,5))
        ])
)


print(freeze_modules_based_on_opts(opts, model))

where is the `corenet-train` entrypoint ?

as title

where can I get checkpoints

hello, where is checkpoints?

Do OpenELM's training datasets contain copyrighted material?

I'm very excited about the release of this model and the efforts the team went through to openly document seemingly every aspect of it. Thank you!

I wonder if any information can be given concerning the selection of training datasets. On https://machinelearning.apple.com/research/openelm it says:

our release includes the complete framework for training and evaluation of the language model on publicly available datasets

More specifically, on https://github.com/apple/corenet/blob/main/projects/openelm/README-pretraining.md it says:

OpenELM was pretrained on public datasets. Specifically, our pre-training dataset contains RefinedWeb, PILE, a subset of RedPajama, and a subset of Dolma v1.6.

Digging into RefinedWeb on https://huggingface.co/datasets/tiiuae/falcon-refinedweb/viewer/default/train?q=nytimes.com, it contains content from sources like nytimes.com and cnn.com.

This is not surprising: Because of the vast amounts of data needed to train LLMs (basically a snapshot of the internet), all training datasets will contain copyrighted material. LLMs are sort of a snapshot of humanity’s knowledge. References to copyrighted characters like Superman, Captain Kirk, Donald Duck, Bugs Bunny etc etc are part of that collective knowledge and references to them might pop up just about anywhere in a dataset. Getting a snapshot of humanity’s knowledge that is free of such references would be as impossible as removing the sugar from a cake after it has been baked.

So while the project only mentions "publicly available datasets" and never makes any claims to be "free of copyrighted material", can any information be shared about the selection process that went into choosing the datasets that were used to train OpenELM?

torchtext version issue

@team,

my Mac Specification - Mac M3 Max 128GB RAM and 4TB Storage.

I cloned the repo and trying to install below error occurs

ERROR: Could not find a version that satisfies the requirement torchtext==0.17.1 (from corenet) (from versions: 0.1.1, 0.2.0, 0.2.1, 0.2.3, 0.3.1, 0.4.0, 0.5.0, 0.6.0, 0.16.2, 0.17.2, 0.18.0)
ERROR: No matching distribution found for torchtext==0.17.1

I am able to get the torchtext 0.18.0 but again it is not working.

CatLIP checkpoints on the hub 🤗

Hey hey! - I'm VB, I work on the open source team at Hugging Face. Massive congratulations on the OpenELM release, it's quite refreshing to see such a brilliant open release from Apple.

I was going through the trained checkpoints, and wasn't able to find CatLIP checkpoints. It'd be nice if you can upload it to Hugging Face similar to OpenELM checkpoints.

Let me know if you need a hand with that.

Cheers!
VB

Instruct Template

Hi there, looking at OpenELM instruct template what I understood is that the template is something like this:

Is that right? With MLX the model is not properly working on instruct mode.

Corenet detect M2 GPU

I run the "train_a_new_model_on_a_new_dataset_from_scratch" notebook on my M2 and no GPUs are detected, the script moves to using CPUs. Is it possible to configure Corenet to detect the M2 GPU?

NameError: name 'corenet' is not defined OpenELM Parameter-Efficient Finetuning (PEFT)

Adopting OpenELM Parameter-Efficient Finetuning (PEFT).
the training encounters "NameError: name 'corenet' is not defined"

How to deal with?
Thanks!

Request for Access to OpenELM Training Logs

Hi CoreNet team,

Thank you for your fantastic work on CoreNet and the OpenELM model. I've been trying to understand ML infrastructure reliability, and I'm particularly interested in analyzing the training logs that were mentioned in your OpenELM paper.

However, I've searched online and couldn't find these logs available anywhere. Could you please guide me on how to access these logs, or consider making them available if possible?

Access to these logs would be incredibly beneficial for my research, and I believe it could also be valuable for the community interested in understanding the nuances of ML model training at scale.

Thank you for your time and consideration.