captchaagent / hcaptcha-model-factory

🏗 hCaptcha image label binary model factory (PyTorch Training, Cluster-based Auto Label Tools, Export ONNX model, ONNX model inference)

License: GNU General Public License v3.0

Python 89.55% Jupyter Notebook 10.45%

hcaptcha onnx opencv python pytorch auto-labeling clustering deep-learning kmeans onnx-models


hcaptcha-model-factory's Issues

feat(auto-label): Add example picture to the dataset

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
hCaptcha provides one example picture of the subject for each challenge. We could add it to the training data, which would grow the dataset by one image per captcha (9 challenge pictures + 1 example).

[bug]

This happens when I try to run python main.py --new.

Question about training?

How can I train incrementally, without reprocessing the entire set of pictures?
For example:
I trained on 50k images and then added 10k new ones. How do I retrain on just the new 10k instead of processing all 60k images?

How do I use the trained models in hcaptcha-challenger?

I tried replacing the .onnx model with my trained one, but I don't know if that's the right way to do it.

how to label data

The wiki says:
I think you do not need a label tool for this task... Just drag to the corresponding label folder is enough. It's easy, right?

I find this hard to understand. Can you explain in detail how to label data?

mistake

When I try to run it, it returns this error. What could be causing it?

[bug]

I'm having this issue:

C:\Users\Sky\hcaptcha-model-factory\src>python main.py new
prompt[en] --> sunflower
2022-12-22 21:28:50 | DEBUG - Diagnose task | task_name=sunflower
Use AI to automatically label datasets? {'y', 'n'} --> y
please put all the images in the `unlabel` folder and press any key to continue...
2022-12-22 21:29:02 | INFO - Found 9 images in C:\Users\Sky\hcaptcha-model-factory\data\sunflower\unlabel
2022-12-22 21:29:02 | DEBUG - Extracting embeddings...
C:\Users\Sky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2022-12-22 21:29:03 | INFO - Embeddings extracted
2022-12-22 21:29:03 | INFO - PCA..., shape of embs: (9, 512)
2022-12-22 21:29:03 | INFO - PCA done, shape of embs: (9, 9)
2022-12-22 21:29:03 | DEBUG - Clustering...
2022-12-22 21:29:03 | DEBUG - Clustering done
2022-12-22 21:29:03 | INFO - Saving labels...
2022-12-22 21:29:03 | DEBUG - Labels saved
2022-12-22 21:29:03 | SUCCESS - Auto labeling completed
Start automatic training? {'y', 'n'} --> y
2022-12-22 21:29:04 | DEBUG - Diagnose task | task_name=sunflower
2022-12-22 21:29:04 | ERROR - An error has been caught in function 'new', process 'MainProcess' (16996), thread 'MainThread' (4960):
Traceback (most recent call last):
  File "main.py", line 6, in <module>
    Fire(Scaffold)
  File "C:\Users\Sky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\Sky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\Sky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
> File "C:\Users\Sky\hcaptcha-model-factory\src\apis\scaffold.py", line 78, in new
    Scaffold.train(task=task)
  File "C:\Users\Sky\hcaptcha-model-factory\src\apis\scaffold.py", line 98, in train
    model = Scaffold._model or ResNet(
  File "C:\Users\Sky\hcaptcha-model-factory\src\factories\kernel.py", line 93, in __init__
    self._build_env()
  File "C:\Users\Sky\hcaptcha-model-factory\src\factories\resnet.py", line 64, in _build_env
    raise FileNotFoundError(
FileNotFoundError: The structure of the dataset is incomplete | dir=C:\Users\Sky\hcaptcha-model-factory\data\sunflower

I'm using Python 3.8.1.

torchvision==0.9.2
torch==1.8.2

I had to change those two because no matching version could be found.

torchvision==0.10.0
torch==1.9.0

That is what I changed them to.

feat(pending): web panel

feat(add): model VCS

[question] auto label error

Question
I'm using auto-label, but I always get a ValueError.

I saved the images with open(filename, "wb").write(bytes). Could that be the cause of the error?

Expected behavior


PS D:\python\hcaptcha-model-factory\src> py main.py new 
prompt[en] --> fish_underwater
2022-09-21 19:57:05 | DEBUG - Diagnose task | task_name=fish_underwater
Use AI to automatically label datasets? {'y', 'n'} --> y
please put all the images in the `unlabel` folder and press any key to continue...
2022-09-21 19:57:26 | INFO - Found 7 images in D:\python\hcaptcha-model-factory\data\fish_underwater\unlabel
2022-09-21 19:57:26 | DEBUG - Extracting embeddings...
2022-09-21 19:57:27 | INFO - Embeddings extracted
2022-09-21 19:57:27 | INFO - PCA..., shape of embs: (7, 512)
2022-09-21 19:57:27 | ERROR - An error has been caught in function '_CallAndUpdateTrace', process 'MainProcess' (8452), thread 'MainThread' (21844):
Traceback (most recent call last):
  File "main.py", line 6, in <module>
    Fire(Scaffold)
  File "C:\Users\luanon404\AppData\Local\Programs\Python\Python38\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\luanon404\AppData\Local\Programs\Python\Python38\lib\site-packages\fire\core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
> File "C:\Users\luanon404\AppData\Local\Programs\Python\Python38\lib\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\python\hcaptcha-model-factory\src\apis\scaffold.py", line 70, in new
    ClusterLabeler(data_dir=data_dir).run()
  File "D:\python\hcaptcha-model-factory\src\components\auto_label\cluster.py", line 65, in run
    self.embs = PCA(n_components=self.num_feat).fit_transform(self.embs)
  File "C:\Users\luanon404\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\decomposition\_pca.py", line 407, in fit_transform
    U, S, Vt = self._fit(X)
  File "C:\Users\luanon404\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\decomposition\_pca.py", line 457, in _fit
    return self._fit_full(X, n_components)
  File "C:\Users\luanon404\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\decomposition\_pca.py", line 475, in _fit_full
    raise ValueError(
ValueError: n_components=128 must be between 0 and min(n_samples, n_features)=7 with svd_solver='full'

Desktop (please complete the following information):

OS: windows
Version: latest
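
The ValueError above comes from asking PCA for 128 components when only 7 embeddings are available; scikit-learn requires n_components to be at most min(n_samples, n_features). A minimal sketch of a clamp follows. The helper name safe_n_components is hypothetical and not part of this repository, and the default of 128 is taken only from the error message. The later logs on this page show PCA reducing (9, 512) to (9, 9), which is consistent with a clamp of this kind, but the project's actual fix is not shown here.

```python
def safe_n_components(n_samples: int, n_features: int, requested: int = 128) -> int:
    """Clamp the requested PCA dimensionality to what the data allows.

    Hypothetical helper: `requested=128` mirrors the value in the error message.
    """
    if n_samples < 1 or n_features < 1:
        raise ValueError("need at least one sample and one feature")
    return min(requested, n_samples, n_features)


# The failing run had 7 embeddings with 512 features each.
print(safe_n_components(n_samples=7, n_features=512))  # 7
```

The result could then be passed as PCA(n_components=...). Collecting more images before auto-labeling avoids the problem altogether, since clustering 7 pictures is unlikely to be useful anyway.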


feat(frontend): workflow panel

Validation: Split Error

I'm getting this error:

Traceback (most recent call last):
  File "C:\Users\admin\Documents\GitHub\hcaptcha-model-factory\src\main.py", line 227, in <module>
    val()
  File "C:\Users\admin\Documents\GitHub\hcaptcha-model-factory\src\main.py", line 159, in val
    data = torchvision.datasets.ImageFolder(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\torchvision\datasets\folder.py", line 310, in __init__
    super().__init__(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\torchvision\datasets\folder.py", line 146, in __init__
    samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\torchvision\datasets\folder.py", line 190, in make_dataset
    return make_dataset(directory, class_to_idx, extensions=extensions, is_valid_file=is_valid_file)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\torchvision\datasets\folder.py", line 103, in make_dataset
    raise FileNotFoundError(msg)
FileNotFoundError: Found no valid file for the classes 0, 1. Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp

This happens when I try to validate my ONNX model.
I tried splitting the data beforehand and using another path, but nothing worked.
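
torchvision's ImageFolder raises this error when a class sub-folder contains no file with a supported extension. As a rough diagnostic, the sketch below counts the usable images per class folder. The function name count_valid_images is hypothetical, and the extension set is copied from the error message above rather than from torchvision itself.

```python
from pathlib import Path

# Extensions listed in the FileNotFoundError message above.
SUPPORTED = {".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp"}


def count_valid_images(root: str) -> dict:
    """Map each class sub-folder of `root` to its number of supported image files."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.rglob("*")
                if f.is_file() and f.suffix.lower() in SUPPORTED
            )
    return counts
```

Any class that maps to 0 would trigger the error. In the traceback the classes are named 0 and 1, so those two folders are the ones to inspect, for example for files saved with an unsupported extension such as .jfif.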

feat(pending): few shot learning

feat(pending): test cases and examples

feat(pending): multi-task training

[feat]

Could you add support for the jfif extension to your old project, the one where photos are manually separated into yes/bad? A lot of images come as jfif, and that project is very easy to use.

[question]

I wanted to use the old version, but it is no longer available. I think it was simpler.

Image Collecting [question]

Question
I tried using this, but I don't have enough images to work with. How can I get 1000+ images?

feat(todo): Collector

https://github.com/captcha-challenger/hcaptcha-whistleblower

[Question]

Describe the bug
6 legs idk

To Reproduce
Steps to reproduce the behavior:

Use the script like u normally would
What the fuck

Expected behavior
i expected it to magically work


da problem

2022-11-17 13:24:38 | INFO - Found 20 images in E:\papka\hcaptcha-model-factory\data\strawberry_cake\unlabel
2022-11-17 13:24:38 | DEBUG - Extracting embeddings...
2022-11-17 13:24:43 | INFO - Embeddings extracted
2022-11-17 13:24:43 | INFO - PCA..., shape of embs: (20, 512)
2022-11-17 13:24:43 | INFO - PCA done, shape of embs: (20, 20)
2022-11-17 13:24:43 | DEBUG - Clustering...
2022-11-17 13:24:44 | DEBUG - Clustering done
2022-11-17 13:24:44 | INFO - Saving labels...
2022-11-17 13:24:44 | DEBUG - Labels saved
2022-11-17 13:24:44 | SUCCESS - Auto labeling completed
Start automatic training? {'y', 'n'} --> y
2022-11-17 13:24:49 | DEBUG - Diagnose task | task_name=strawberry_cake
2022-11-17 13:24:49 | ERROR - An error has been caught in function 'new', process 'MainProcess' (4468), thread 'MainThre
Traceback (most recent call last):
  File "main.py", line 6, in <module>
    Fire(Scaffold)
  File "C:\Users\a\AppData\Local\Programs\Python\Python38\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\a\AppData\Local\Programs\Python\Python38\lib\site-packages\fire\core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\a\AppData\Local\Programs\Python\Python38\lib\site-packages\fire\core.py", line 681, in _CallAndUpdateTr
    component = fn(*varargs, **kwargs)
> File "E:\papka\hcaptcha-model-factory\src\apis\scaffold.py", line 78, in new
    Scaffold.train(task=task)
  File "E:\papka\hcaptcha-model-factory\src\apis\scaffold.py", line 98, in train
    model = Scaffold._model or ResNet(
  File "E:\papka\hcaptcha-model-factory\src\factories\kernel.py", line 93, in __init__
    self._build_env()
  File "E:\papka\hcaptcha-model-factory\src\factories\resnet.py", line 64, in _build_env
    raise FileNotFoundError(
FileNotFoundError: The structure of the dataset is incomplete | dir=E:\papka\hcaptcha-model-factory\data\strawberry_cake

Creating model

Hello.
I don't exactly understand how this works.
For example, if I wanted to create a model for trucks, would I do the following?
Put images that contain trucks in the "yes" folder and images that don't contain trucks in the "bad" folder.
Then run run.bat and wait.
Would this be correct?
Thanks!

adding on to model

I have a few questions. I am decent at Python but don't have much experience with AI.

How can I train a new model? I see the instructions, but they are very hard to understand.
How do I add on to the model in hcaptcha-challenger? I want to fix the model they have.
How can I test it, so I know how to use it?

[DOCS] ROOKIE FAQ

How to Install Requirements gracefully

👍 Before you start, create a Python 3.10+ virtual environment.

Install PyTorch

You need to download the latest version of torch, torchvision and torchaudio from Start Locally | PyTorch
```
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
```

Download additional dependencies

pip install -U numpy packaging protobuf onnxruntime opencv-python==4.5.5.62 pillow~=9.2.0 scikit-learn==1.0.1 fire~=0.4.0 loguru~=0.6.0 pyyaml~=6.0

FileNotFoundError: The structure of the dataset is incomplete

You need to collect challenge images from hCaptcha; in principle, the more pictures the better. For the first training round, provide at least 150 images (yes + bad combined).

download samples: sunflower.zip

You need to manually populate the yes and bad folders, then run the task with the trainval command.

~\hcaptcha-model-factory\data\sunflower\yes
~\hcaptcha-model-factory\data\sunflower\bad

python main.py trainval --task sunflower
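
Before running trainval, it can help to confirm that both label folders exist and are not empty. The sketch below is only a pre-flight check under that assumption. The function name missing_label_dirs is hypothetical, and the exact conditions the project tests before raising "The structure of the dataset is incomplete" are not shown on this page.

```python
from pathlib import Path


def missing_label_dirs(task_dir: str, labels=("yes", "bad")) -> list:
    """Return the label folders that are absent or contain no files under `task_dir`."""
    root = Path(task_dir)
    missing = []
    for label in labels:
        folder = root / label
        if not folder.is_dir() or not any(p.is_file() for p in folder.iterdir()):
            missing.append(label)
    return missing
```

For the sunflower task, an empty list from missing_label_dirs on the data\sunflower directory would mean that both folders are in place and populated.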

feat(add): scaffold pattern

problem

Torch not compiled with CUDA enabled

[Feature] Validation Mode for ONNX Model

It would be cool to be able to validate or check the accuracy of ONNX models.
Would this be possible?
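
One way to approach this, sketched under assumptions: score any binary predictor against a hand-labeled yes/bad directory pair. The function name evaluate and its signature are hypothetical and not part of the project. The predictor is passed in as a callable, so no particular ONNX runtime is assumed.

```python
from pathlib import Path
from typing import Callable


def evaluate(predict: Callable[[bytes], bool], yes_dir: str, bad_dir: str) -> dict:
    """Score a binary predictor against hand-labeled yes/bad folders."""
    correct = total = 0
    for folder, expected in ((yes_dir, True), (bad_dir, False)):
        for path in sorted(Path(folder).iterdir()):
            if not path.is_file():
                continue  # skip nested folders
            total += 1
            if predict(path.read_bytes()) == expected:
                correct += 1
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct, "accuracy": accuracy}
```

The predictor could wrap the classify function from the workflow script on this page, for example evaluate(lambda data: classify(net, data), yes_dir, bad_dir). The images used for scoring should not have been part of the training set, otherwise the accuracy will look better than it is.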

[question] Turn on CUDA

Hello. I can't turn on CUDA. Everything that needed to be installed is installed, and it works fine on the CPU, but I don't know how to switch it to the GPU.
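
A quick way to see which device PyTorch can actually use is torch.cuda.is_available(). The sketch below wraps it in a hypothetical helper, pick_device, that falls back to "cpu" when PyTorch is missing or has no usable GPU. It only diagnoses the situation and does not change how the project selects its device.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed at all
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the installed wheel is probably a CPU-only build. Reinstalling with a CUDA index URL, as in the pip command in the FAQ above, is the usual remedy.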

add a new photo extension

Most of the photos downloaded from hCaptcha come as jfif. Could you add this extension to your project?
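
Until the extension is supported, a possible workaround is to rename the files: JFIF files are JPEG-encoded, so giving them a .jpg suffix lets loaders that filter by extension pick them up. A minimal sketch follows, with a hypothetical helper name rename_jfif.

```python
from pathlib import Path


def rename_jfif(folder: str) -> int:
    """Rename *.jfif files in `folder` to *.jpg and return how many were renamed."""
    renamed = 0
    for src in sorted(Path(folder).iterdir()):
        if src.is_file() and src.suffix.lower() == ".jfif":
            dst = src.with_suffix(".jpg")
            if dst.exists():
                continue  # never overwrite an existing image
            src.rename(dst)
            renamed += 1
    return renamed
```

The function only touches the top level of the given folder, so it would need to be run once per label folder (for example yes and bad).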

feat(pending): motion workflow

Description: quick-train for the hCAPTCHA Animal challenge.
Target: expect to deliver the pluggable model within ten minutes.

doubt

This happens when I open the project:

[Workflow] Use the trained model to classify unlabeled dataset

# -*- coding: utf-8 -*-
# Time       : 2022/08/19 7:33
# Author     : QIN2DIM
# Github     : https://github.com/QIN2DIM
# Description: Use the trained model to classify unlabeled data sets
import os
import shutil

import cv2
import numpy as np


def load_model(path_model_onnx):
    if (
        not os.path.isfile(path_model_onnx)
        or not path_model_onnx.endswith(".onnx")
        or not os.path.getsize(path_model_onnx)
    ):
        raise RuntimeError
    return cv2.dnn.readNetFromONNX(path_model_onnx)


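def classify(net, data):
    """Return True when the model predicts the positive ("yes") class."""
    # Decode the raw image bytes into a 3-channel BGR array
    img_arr = np.frombuffer(data, np.uint8)
    img = cv2.imdecode(img_arr, flags=1)

    # Resize to 64x64, scale pixels to [0, 1], and convert BGR -> RGB
    img = cv2.resize(img, (64, 64))
    blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (64, 64), (0, 0, 0), swapRB=True, crop=False)

    net.setInput(blob)
    out = net.forward()
    # Output index 0 is treated as the positive class
    return bool(np.argmax(out, axis=1)[0] == 0)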


def run():
    """
    RecurTraining Motion workflow
    ---------

    bird_flying         # handled label name
     ├── _inner         # recur-output
     │    ├── yes
     │    └── bad
     ├── yes            # labeled dataset for train/val
     ├── bad            # labeled dataset for train/val
     └── *.jpg          # unlabeled dataset

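     1. Before anything else, manually classify about 100 images (positive and
        negative classes combined) and obtain the first ONNX model through the
        normal trainval workflow;
     2. Once you have accumulated more unlabeled images, use the recur workflow
        to label them with the model (binary image classification);
     3. Manually review the model output and correct the few misclassified
        images; you can change the label or delete the image;
     4. Merge the datasets: move the recur output in _inner/yes and _inner/bad
        into the already-classified data directories;
     5. Train again on the merged dataset.

     more: repeat the cycle to keep iterating on the model.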
    """
    # Path to the ONNX model
    model_path = "drawing_of_a_haunted_house.onnx"
    # Path to the unlabeled dataset
    dataset_dir = "drawing_of_a_haunted_house"
    # Path to the recur-output
    output_dir_yes = os.path.join(dataset_dir, "_inner/yes")
    output_dir_bad = os.path.join(dataset_dir, "_inner/bad")
    # Initialize output directory
    os.makedirs(output_dir_yes, exist_ok=True)
    os.makedirs(output_dir_bad, exist_ok=True)

    # Load the model from the previous iteration
    model = load_model(model_path)

    for index, fn in enumerate(img_fns := os.listdir(dataset_dir)):
        # skip nested folders
        img_src = os.path.join(dataset_dir, fn)
        if os.path.isfile(img_src):
            with open(img_src, "rb") as file:
                data = file.read()
            img_dst = os.path.join(output_dir_yes if classify(model, data) else output_dir_bad, fn)
            shutil.move(img_src, img_dst)
        if index % 50 == 0:
            print(f">> recur - progress=[{index}/{len(img_fns)}]")


if __name__ == "__main__":
    run()
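
Step 4 of the workflow described in the docstring (merging the reviewed recur output back into the labeled dataset) is not covered by the script. A minimal sketch of that step follows, assuming the directory layout from the docstring. The function name merge_recur_output is hypothetical, and the choice to skip files whose names already exist in the target folder is an assumption rather than project behavior.

```python
import shutil
from pathlib import Path


def merge_recur_output(dataset_dir: str) -> int:
    """Move reviewed files from _inner/yes and _inner/bad into yes and bad."""
    root = Path(dataset_dir)
    moved = 0
    for label in ("yes", "bad"):
        src_dir = root / "_inner" / label
        dst_dir = root / label
        if not src_dir.is_dir():
            continue
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(src_dir.iterdir()):
            dst = dst_dir / src.name
            # Skip nested folders and never overwrite an already-labeled image
            if src.is_file() and not dst.exists():
                shutil.move(str(src), str(dst))
                moved += 1
    return moved
```

This should only be run after the manual review in step 3, since everything it moves becomes training data for the next round.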

How do I help get new captchas solved quicker

Hello, I'm using your program and it works great!

I want to help you update your program more frequently.
I have already installed hcaptcha-model-factory, but when I try to train a model so that I can upload the files for you, it needs more pictures.
Can you give me a step-by-step guide on how you update your program?
A video would be nice, so that I can help you get updates out quicker!

Thank you!

Predict image with YOLO model "yolov5n6.onnx"

Dear sir, can you help me classify images with the model "yolov5n6.onnx"?
The other models I tried return a result of "0" or "1",
but with this model I don't know how to do it.
If possible, please give me example code.
Thanks for your support.