kagglehub's People

Contributors

jeward414, jplotts, lucyhe, mayankmalik-colab, mohami2000, neshdev, philmod, rosbo, wcuk

kagglehub's Issues

download data w/o auth!

I posted about this a while ago, here.

In short, we can do the following without any hassle, since these are public resources. Can we do the same with kagglehub?

import keras

# Download and extract a public archive by URL; no authentication required.
images = keras.utils.get_file(
    origin="https://huggingface.co/datasets/images.tar.gz",
    untar=True,
)

or

from huggingface_hub import hf_hub_download

# Download a single file from a public dataset repo; no authentication required.
hf_dataset_identifier = "{user_id}/data_id"
filename = "dataset.zip"
file_path = hf_hub_download(
    repo_id=hf_dataset_identifier,
    filename=filename,
    repo_type="dataset",
)
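
For comparison, here is a minimal sketch of what the equivalent might look like with kagglehub, assuming the installed version exposes kagglehub.dataset_download and the dataset is public:

import kagglehub

# Sketch only: download a public dataset by its "{user_id}/data_id" handle.
# Assumes dataset_download is available in the installed kagglehub version
# and that no credentials are needed for public data.
dataset_path = kagglehub.dataset_download("{user_id}/data_id")
print(dataset_path)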

Support for Proxy Configuration in kagglehub

I noticed that the Kaggle API library supports the use of a proxy specified in the kaggle.json file. Since kagglehub also reads its configuration from the same file, I would like to check whether there are any plans to support proxy configuration in kagglehub as well.

If so, I am willing to implement this feature myself. If the kagglehub maintainers are open to this addition, I would appreciate it if someone could review my pull request once it is submitted.
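
In the meantime, a minimal workaround sketch (an assumption, not kagglehub's documented API): kagglehub issues its HTTP calls through the requests library, which honors the standard HTTP_PROXY/HTTPS_PROXY environment variables, so a proxy can be set in the environment before kagglehub is used:

import os

# Workaround sketch: requests (used by kagglehub's HTTP client) picks up the
# standard proxy environment variables automatically. The proxy URL below is
# a hypothetical example.
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

import kagglehub  # subsequent kagglehub downloads/uploads should go through the proxy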

Unable to upload repos with more than 25 files

Hello, I am trying to upload a model that has more than 25 files and am hitting this error:

Traceback (most recent call last):
  File "/fsx/lewis/git/hf/h4/scripts/deployment/aimo/upload_model.py", line 77, in <module>
    main()
  File "/fsx/lewis/git/hf/h4/scripts/deployment/aimo/upload_model.py", line 68, in main
    kagglehub.model_upload(kaggle_handle, local_dir)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models.py", line 53, in model_upload
    create_model_instance_or_version(h, tokens, license_name, version_notes)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models_helpers.py", line 67, in create_model_instance_or_version
    raise (e)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models_helpers.py", line 61, in create_model_instance_or_version
    _create_model_instance(model_handle, files, license_name)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models_helpers.py", line 34, in _create_model_instance
    api_client.post(f"/models/{model_handle.owner}/{model_handle.model}/create/instance", data)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/clients.py", line 122, in post
    process_post_response(response_dict)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/exceptions.py", line 108, in process_post_response
    raise BackendError(response["error"], error_code)
kagglehub.exceptions.BackendError: The file count exceeds the maximum of 25

It would be nice if this constraint could be relaxed, since I typically shard my models into smaller files to speed up model loading in Kaggle notebooks.
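
A possible interim workaround sketch, assuming the checkpoint is a Transformers model sharded into many small safetensors files: re-save it with a larger max_shard_size so the upload directory stays under the 25-file limit.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only (not an official kagglehub recommendation): reload the sharded
# checkpoint and re-save it with bigger shards to reduce the file count.
model = AutoModelForCausalLM.from_pretrained("path/to/sharded-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/sharded-checkpoint")

model.save_pretrained("path/to/resharded-checkpoint", max_shard_size="10GB")
tokenizer.save_pretrained("path/to/resharded-checkpoint")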

The service storage has thrown an exception

I am getting an error when pushing a 7B model to Kaggle Hub that wasn't happening previously. I am using kagglehub==0.2.4, and my script to reproduce the error is:

import argparse
from pathlib import Path

import kagglehub
from huggingface_hub import snapshot_download


"""
Script to upload a Transformers model from the Hugging Face Hub to Kaggle.

To push to your Kaggle account, generate a `kaggle.json` file with your Kaggle API credentials and store it in ~/.kaggle/kaggle.json
See: https://github.com/Kaggle/kagglehub?tab=readme-ov-file#authenticate

Usage:

python upload_model.py \
    --model_id ORG/MODEL_ID \
    --revision REV
"""


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, help="Name of repository on the Hub in '<ORG>/<NAME>' format.")
    parser.add_argument(
        "--revision", type=str, default="main", help="Name of branch in repository to save experiments."
    )
    parser.add_argument(
        "--kaggle_handle",
        type=str,
        default=None,
        help="Kaggle handle to upload the model to. Should be in the format <KAGGLE_USERNAME>/<MODEL>/<FRAMEWORK>/<VARIATION>. Defaults to <KAGGLE_USERNAME>/{model_id}/transformers/{revision}",
    )
    args = parser.parse_args()

    # Download repo
    model_name = args.model_id.split("/")[-1]
    local_dir = Path(f"data/{model_name}-{args.revision}")
    local_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id=args.model_id,
        revision=args.revision,
        local_dir=local_dir,
        ignore_patterns=[
            "checkpoint-*",
            "pytorch_model*",
            ".git*",
            "*_results.json",
            "*.bin",
            "trainer_*",
        ],  # Kaggle doesn't allow uploads >25 files, so we need to heavily filter: https://github.com/Kaggle/kagglehub/issues/116
    )

    # If no handle is provided, default to <KAGGLE_USERNAME>/{model_name}/transformers/{revision}
    if args.kaggle_handle is None:
        kaggle_username = kagglehub.config.get_kaggle_credentials().username
        kaggle_handle = f"{kaggle_username}/{model_name}/transformers/{args.revision}"
    else:
        kaggle_handle = args.kaggle_handle

    print(f"Pushing to Kaggle Hub with handle {kaggle_handle} ...")
    kagglehub.model_upload(kaggle_handle, local_dir)

    print("Done!")

Here is the full stack trace:

Traceback (most recent call last):
  File "/fsx/lewis/git/hf/h4/scripts/deployment/aimo/upload_model.py", line 84, in <module>
    main()
  File "/fsx/lewis/git/hf/h4/scripts/deployment/aimo/upload_model.py", line 75, in main
    kagglehub.model_upload(kaggle_handle, local_dir)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models.py", line 53, in model_upload
    create_model_instance_or_version(h, tokens, license_name, version_notes)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models_helpers.py", line 67, in create_model_instance_or_version
    raise (e)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models_helpers.py", line 61, in create_model_instance_or_version
    _create_model_instance(model_handle, files, license_name)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/models_helpers.py", line 34, in _create_model_instance
    api_client.post(f"/models/{model_handle.owner}/{model_handle.model}/create/instance", data)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/clients.py", line 122, in post
    process_post_response(response_dict)
  File "/fsx/lewis/miniconda3/envs/h4/lib/python3.10/site-packages/kagglehub/exceptions.py", line 115, in process_post_response
    raise BackendError(response["error"], error_code)
kagglehub.exceptions.BackendError: The service storage has thrown an exception. HttpStatusCode is NotFound. No such object: kaggle-models-data/inbox/1002070/534392bb61a8dd2cdbab22e064199089/generation_config.json.lock

Upload for private model yields "No instances available"

Hello, thank you for making this utility lib for Kaggle!

I'm trying to upload a local model to Kaggle Hub and want it to be private. However, after following the instructions in the README, I am not able to import the model into a Kaggle notebook; instead I see a "No instances available" message on the notebook's Add Input tab:

[Screenshot from 2024-04-16 21:51:49 showing the "No instances available" message]

Steps to reproduce:

import kagglehub

handle = "lewtun/mistral-7b-sft/pyTorch/v1"
local_files = "./mistral-7b-sft/" # Just a fine-tuned Mistral 7B
kagglehub.model_upload(handle, local_files)

Is it possible that something is wrong with the variation being set during the upload? Thanks!

New logging forces user to have a home folder

This was not the case with 0.2.5.

To reproduce:

Dockerfile:

FROM python:3.12.3-slim

ARG UID=10001
RUN adduser \
    --disabled-password \
    --gecos "" \
    --shell "/sbin/nologin" \
    --no-create-home \
    --uid "${UID}" \
    appuser

RUN pip install kagglehub==0.2.6
USER $UID

ENTRYPOINT [ "python", "-c", "import kagglehub" ]
docker build -t kagglehub . && docker run --rm kagglehub
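
A possible workaround sketch, assuming the failure comes from kagglehub's logging setup trying to write under the missing home directory: point HOME at a writable location such as /tmp before importing kagglehub.

import os

# Hypothetical workaround (untested assumption): give the process a writable
# "home" so kagglehub's logging setup has somewhere to create its files.
os.environ["HOME"] = "/tmp"

import kagglehub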
