hezar's People

Contributors

arxyzan, hamedbabaei, pooya-mohammadi, ssbakh07


hezar's Issues

Take `tests` more seriously!

As of now, there are no proper test files in the tests folder. First, we need to figure out the right way to do it (specific to this library) and then start adding them. As this library follows a hierarchical abstract factory design, some base classes like models.Model or trainers.Trainer might be sensitive to changes and cause issues in their derived classes.
@pooya-mohammadi
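One concrete way to start is with base-class "contract" tests: check every derived class against the base class's interface, so changes to `Model` or `Trainer` that break subclasses surface in the test suite instead of at user call sites. A stdlib-only sketch of the pattern, with stand-in classes rather than hezar's real ones:

```python
# Stand-in classes for illustration; a real test would import hezar's
# actual Model base class and iterate over the model registry instead.
class Model:
    def predict(self, inputs):
        raise NotImplementedError

class TextClassificationModel(Model):
    def predict(self, inputs):
        return {"labels": ["positive"]}

def test_subclasses_override_predict():
    # If a base-class refactor breaks a derived class's override,
    # this fails in CI rather than in user code.
    for cls in Model.__subclasses__():
        assert cls.predict is not Model.predict, f"{cls.__name__} lacks predict"

test_subclasses_override_predict()
```

Because the hierarchy is an abstract factory, a handful of contract tests like this cover every derived class at once.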

Wildcard imports are bad practice!

As mentioned in the Python docs, wildcard imports are bad practice! Nevertheless, they show up in most well-known open source projects. As of now, Hezar does it this way: in every submodule, import everything explicitly if the submodule or file has fewer than five-ish public names; otherwise, list the needed names in __all__ and import with wildcards.
You can see this style mostly in utils and other files that contain a lot of classes, methods, etc., because otherwise it would be easy to miss some of them as we move forward. Also, IDEs and editors automatically warn the developer when new names are not added to __all__.
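The convention above can be sketched like this (module and function contents invented for illustration):

```python
# utils/text_utils.py (hypothetical submodule following the convention):
# only the names listed in __all__ are picked up by a wildcard import.
__all__ = ["normalize_spaces", "strip_prefix"]

def normalize_spaces(text):
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

def strip_prefix(text, prefix):
    return text[len(prefix):] if text.startswith(prefix) else text

def _internal_helper():
    # Not listed in __all__ (and underscore-prefixed), so
    # `from utils.text_utils import *` skips it.
    pass
```

With this in place, a wildcard import of the submodule exposes only the two public functions, and editors can flag any new public name that is missing from `__all__`.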

Integrate preprocessor loading/functioning in Model

Right now, in order to use a model to predict on raw data you have to do it like this:

from hezar import Model, Tokenizer

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"

model = Model.load(model_path)
tokenizer = Tokenizer.load(model_path)

example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
inputs = tokenizer(example, return_tensors="pt")

outputs = model.predict(inputs)

But for normal users, this multi-step flow might be confusing. What they probably want is something like this:

from hezar import Model

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"
example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load(model_path)
outputs = model.predict(example)
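A hedged sketch of how this could work internally, using stub classes rather than hezar's actual implementation: `Model.load` also loads the preprocessor from the same Hub path, and `predict` tokenizes raw strings before the forward pass.

```python
# Stub classes sketching the idea (not hezar's actual implementation).
class Tokenizer:
    @classmethod
    def load(cls, path):
        return cls()  # a real loader would fetch tokenizer files from `path`

    def __call__(self, texts):
        return {"input_ids": [[len(t)] for t in texts]}  # dummy encoding

class Model:
    @classmethod
    def load(cls, path):
        model = cls()
        model.preprocessor = Tokenizer.load(path)  # loaded alongside weights
        return model

    def predict(self, raw_inputs):
        inputs = self.preprocessor(raw_inputs)  # users never see this step
        return self.forward(inputs)

    def forward(self, inputs):
        return {"n_examples": len(inputs["input_ids"])}  # dummy forward pass

model = Model.load("hezarai/bert-fa-sentiment-digikala-snappfood")
outputs = model.predict(["raw text goes straight in"])
```

The explicit `Tokenizer.load` path from the first snippet can stay available for advanced users who want control over tokenization.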

Implement the metrics module

Regarding #12, we decided to write our own evaluation module in a "not re-inventing the wheel" manner.
First, we start with common metrics like "f1", "precision", and "recall" from scikit-learn and then move on to higher-level ones like "BLEU".
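A minimal sketch of what such a wrapper layer might look like (class names hypothetical; in practice `compute` would delegate to scikit-learn rather than the inline accuracy shown here):

```python
class Metric:
    """Hypothetical base class for thin metric wrappers."""
    name = None

    def compute(self, predictions, targets):
        raise NotImplementedError

class Accuracy(Metric):
    name = "accuracy"

    def compute(self, predictions, targets):
        # Inline accuracy keeps this sketch dependency-free; a real
        # wrapper would call e.g. sklearn.metrics.accuracy_score here.
        correct = sum(p == t for p, t in zip(predictions, targets))
        return {self.name: correct / len(targets)}

# A registry like this lets trainers look metrics up by name.
AVAILABLE_METRICS = {"accuracy": Accuracy}

metric = AVAILABLE_METRICS["accuracy"]()
result = metric.compute([1, 0, 1], [1, 1, 1])
```

Returning a dict keyed by metric name keeps the interface uniform across single- and multi-output metrics.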

Add Hezar to Hugging Face libraries

There are a lot of libraries supported on the Hugging Face Hub (full list).
In order to add Hezar to this list, we have to take a few steps.
The full guide is here, but these are the main steps to follow.

@pooya-mohammadi

Train ParsDistilBERT on LSCP (sequence labeling)

  1. Implement sequence labeling dataset (A torch.utils.data.Dataset subclass)
  2. Implement a data collator class for sequence labeling (@arxyzan )
  3. Add a sequence labeling training example to hezar/examples/train_sequence_labeling_example.py
  4. R&D on Trainer to be compatible with sequence labeling (hezar/hezar/trainers/trainer.py)
  5. Modify trainer to be compatible with sequence labeling
  6. Run trainer for ParsDistilBERT + LSCP
  7. Evaluate results (Baseline: f1: ~0.83) (check this @arxyzan)
  8. Push to Hub
  9. Load back and test on raw inputs
  10. Add notebook example here

Reconsider `trainers` design, functionalities, modules, etc.

Right now, there is only one Trainer class in trainers, and it has only been tested on text classification tasks.
A lot of other libraries implement a single similar class for all of their training, but this has some drawbacks for Hezar:

  • Hezar must support a wide range of models and tasks with specific training strategies
  • Even if we can put them all together in a single class, it'll cause either over-engineering or spaghetti-code.

Besides, the Trainer class is really naive right now and needs a lot of refactoring, in my opinion.
@pooya-mohammadi @arxyzan

Add data collator for sequence labeling

Add the sequence labeling data collator class as hezar.data.data_collators.SequenceLabelingDataCollator()
This class would be a reimplementation of the class transformers.data.data_collator.DataCollatorForTokenClassification()
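The core padding logic such a collator needs can be sketched with plain lists (real collators work on tensors); labels are padded with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions don't contribute to the loss.

```python
# Plain-list sketch of the padding logic a sequence labeling collator
# needs; the real class would operate on tensors.
def collate_sequence_labeling(batch, pad_token_id=0, label_pad_id=-100):
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, labels = [], []
    for example in batch:
        n_pad = max_len - len(example["input_ids"])
        input_ids.append(example["input_ids"] + [pad_token_id] * n_pad)
        labels.append(example["labels"] + [label_pad_id] * n_pad)
    return {"input_ids": input_ids, "labels": labels}

batch = [
    {"input_ids": [5, 6, 7], "labels": [1, 2, 0]},
    {"input_ids": [5], "labels": [1]},
]
collated = collate_sequence_labeling(batch)
```

This is the same idea `DataCollatorForTokenClassification` implements: token ids get the tokenizer's pad id, labels get the loss-ignored pad id.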

Add RoBERTa for sequence labeling

Add the RoBERTa model from Transformers for sequence labeling. See BertSequenceLabeling and RobertaTextClassification for reference.

Naming schema for the models on the Hub

Right now, the general schema is this:
<pretrained-model-name>-<language>-<task>-<dataset-name>. For example a Roberta model trained on snappfood/digikala comments
would be written as roberta-fa-sentiment-digikala-snappfood.
I can't say exactly why, but something about this schema feels off!

Add DistilBERT for sequence labeling

In this issue, a DistilBERT model with a token classification head must be added to the library's models.

  1. Implement the PyTorch module class (class name: DistilBertSequenceLabeling)
  2. Add to hezar/hezar/models/sequence_labeling/distilbert/distilbert_sequence_labeling.py
  3. Add DistilBertSequenceLabelingConfig to hezar/hezar/models/sequence_labeling/distilbert/distilbert_sequence_labeling_config.py
  4. Test on some random inputs
  5. Upload a raw sequence labeling DistilBERT model to the Hub and reload it from the Hub.

Note: For reference, see hezar/hezar/models/text_classification/distilbert/distilbert_text_classification.py and hezar/hezar/models/text_classification/distilbert/distilbert_text_classification_config.py

DoD: Clean reload from Hub
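As a hedged illustration of step 3, the config could be a small dataclass like this (field names and defaults are invented, not hezar's actual schema):

```python
from dataclasses import dataclass

# Illustrative config sketch; hezar's real config base class and
# field set may differ.
@dataclass
class DistilBertSequenceLabelingConfig:
    name: str = "distilbert_sequence_labeling"
    num_labels: int = 2
    dropout: float = 0.1
    pretrained_path: str = "distilbert-base-multilingual-cased"
```

Keeping the config a plain dataclass makes serialization to/from the Hub straightforward.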

Integrate HuggingFace evaluate into trainer?

evaluate is a pretty solid package for a lot of common evaluation techniques, but the problem is that its metrics are not shipped as regular Python modules; instead, loading any metric downloads its corresponding script (from somewhere I have not figured out yet!) and caches it in ~/.cache/huggingface/evaluate/.... The problem with this scheme is that each metric has its own specific arguments, and discovering them requires writing code like below:

import evaluate

f1 = evaluate.load("f1")
print(f1.inputs_description)  # this gives the docstring so that we see input arguments

I don't know why they chose to do it this way, but I think it can be problematic. Maybe implementing these modules explicitly in Hezar is a better approach.

*Right now, we use torchmetrics for evaluation.

Publish to PyPI

  1. Create a CI/CD workflow (GitHub Actions) to push to PyPI
  2. Set PyPI credentials in the repository secrets @arxyzan

Train ParsBERT on LSCP (sequence labeling)

  1. Implement sequence labeling dataset (A torch.utils.data.Dataset subclass)
  2. Implement a data collator class for sequence labeling (@arxyzan )
  3. Add a sequence labeling training example to hezar/examples/train_sequence_labeling_example.py
  4. R&D on Trainer to be compatible with sequence labeling (hezar/hezar/trainers/trainer.py)
  5. Modify trainer to be compatible with sequence labeling
  6. Run trainer for ParsBERT + LSCP
  7. Evaluate results (Baseline: f1: ~0.89)
  8. Push to Hub
  9. Load back and test on raw inputs
  10. Add notebook example here

Add BERT for sequence labeling

In this issue, a BERT model with a token classification head must be added to the library's models.

  1. Implement the PyTorch module class (class name: BertSequenceLabeling)
  2. Add to hezar/hezar/models/sequence_labeling/bert/bert_sequence_labeling.py
  3. Add BertSequenceLabelingConfig to hezar/hezar/models/sequence_labeling/bert/bert_sequence_labeling_config.py
  4. Test on some random inputs
  5. Upload a raw sequence labeling BERT model to the Hub and reload it from the Hub.

Note: For reference, see hezar/hezar/models/text_classification/bert/bert_text_classification.py and hezar/hezar/models/text_classification/bert/bert_text_classification_config.py

DoD: Clean reload from Hub

Complete `TextNormalizer` features

We need these functionalities in the TextNormalizer:

  • Add support for Regex replacements
  • Add support for zwnj handling in Persian
  • Add support for emoji detection and deletion
  • Add support for space correction
  • Add support for grammar issues (if possible)
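A minimal sketch covering the regex-replacement, ZWNJ, and space-correction items above, with an illustrative rule list rather than hezar's actual rules:

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, essential in Persian orthography

def normalize(text, rules=None):
    # Each rule is a (pattern, replacement) pair applied in order;
    # this rule list is illustrative, not TextNormalizer's real one.
    rules = rules or [
        (r"\s+", " "),              # space correction: collapse runs
        (rf"\s*{ZWNJ}\s*", ZWNJ),   # drop spaces around ZWNJ
    ]
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()
```

For example, `normalize("می \u200c رود")` joins the verb correctly as `"می\u200cرود"`; emoji deletion could be added as just another `(pattern, "")` rule.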

Add embeddings module

A word embeddings module is a must-have feature in Hezar, both ready-to-use models and trainers. More on this later...

Support for multi output metrics in the trainers

The compute_metrics method should return a flat, single-level dictionary like {"f1": ..., "recall": ...}.
Some metrics like seqeval output multiple values themselves. Right now, seqeval outputs its own recall, accuracy, f1, and precision, and it's better to surface all of them in the trainer. (Currently, the sequence labeling trainer only outputs f1.)
The challenge is, the Trainer sets up the metrics tracker in which the metrics are registered by their name from the AVAILABLE_METRICS list.
How can we register the proper metrics in the metrics tracker?
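One option is to flatten multi-value results into prefixed keys before registering them, so the tracker still only ever sees a flat dict. A hedged sketch, with key names merely illustrating seqeval's output shape (per-entity sub-dicts plus overall_* scalars):

```python
# Flatten seqeval-style nested results into the flat dict the metrics
# tracker expects. Key names are illustrative, not exact.
def flatten_metric_results(results, keep=("f1", "precision", "recall", "accuracy")):
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            # e.g. {"PER": {"f1": ...}} becomes {"PER.f1": ...}
            for sub_key, sub_value in value.items():
                if sub_key in keep:
                    flat[f"{key}.{sub_key}"] = sub_value
        elif key in keep or key.startswith("overall_"):
            flat[key] = value
    return flat

seqeval_like = {
    "PER": {"f1": 0.90, "precision": 0.88, "recall": 0.92},
    "overall_f1": 0.83,
}
flat = flatten_metric_results(seqeval_like)
```

Registering the flattened keys, rather than the bare names from AVAILABLE_METRICS, would be one way to let the tracker accommodate multi-output metrics.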

Git commit guidelines


These rules are inspired by reviewing big open source projects and their best practices and will evolve through time.
We all first come to an agreement on these and then add this to the docs.
@pooya-mohammadi @arxyzan

Functional best practices

The cardinal rule for creating good commits is to ensure there is only one "logical change" per commit. There are many reasons why this is an important rule:

  • The smaller the amount of code being changed, the quicker & easier it is to review & identify potential flaws.
  • If a change is found to be flawed later, it may be necessary to revert the broken commit. This is much easier to do if there are no other unrelated code changes entangled with the original commit.
  • When troubleshooting problems using Git's bisect capability, small well-defined changes will aid in isolating exactly where the code problem was introduced.
  • When browsing history using Git annotate/blame, small well-defined changes also aid in isolating exactly where & why a piece of code came from.

Things to avoid:

  • Mixing whitespace changes with functional code changes
  • Mixing two unrelated functional changes
  • Sending large new features in a single giant commit (smaller is better here!)

Styling best practices

A commit has a subject (the main message) and a body (the description). The body is separated from the subject by a blank line, like below:

Fix issues with amp in Trainer

Automatic mixed precision had some buggy behavior that ...

Providing the body is not necessary. Do it when you feel like it's needed.

General rules:

  • Use imperative mood in the subject. Commit message must be able to complete this sentence: "This commit will <commit message here>"
    • Wrong: "Adding support for ..."
    • Correct: "Add support for ..."
  • Keep it short and concise, preferably less than 50 characters.
  • Capitalize the subject line
  • Do NOT end the subject with a dot/period
  • Wrap body lines at 72 characters
  • Use the body to explain what and why you have done something
  • Do NOT put the paths of changed files in the subject; instead, try to provide enough info in the message so that people know what happened in your commit. Most IDEs and editors let you inspect the location and diffs by clicking on the commit message, so spelling paths out in the subject is not good practice and also hits the 50-character limit sooner.
    • Wrong: [models][text_classification][bert] Add text normalizing in preprocess()
    • Correct: Add text normalizing in preprocess for BertTextClassification
  • Use the body to explain what and why, NOT how! (You must provide the "how" part in docs or code not in the commit message!)
  • For commits that reference an issue or PR, write the proper commit subject followed by the reference in parentheses
    • Add NFKC normalizer (#9999)
    • Fix tokenizer encode() bug (#999)
  • Reference code in backticks:
    • `variable`
    • `method()`
    • `Class()`

Add type hints to function arguments?!

A lot of the methods do not have explicit type hints for their arguments. We don't necessarily need to add hints to every single one of them, but most of them should have them.
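A small before/after illustration with a hypothetical function:

```python
from typing import Optional

# Before (hypothetical function, untyped arguments):
#     def load(path, device=None):
#         ...

# After, with explicit hints on arguments and the return value:
def load(path: str, device: Optional[str] = None) -> dict:
    """Load an artifact from `path`, optionally targeting `device`."""
    return {"path": path, "device": device}
```

Hints like these also let IDEs and static checkers catch misuse in the derived classes without running anything.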

Avoid using Persian alone

Hezar: A seamless AI library for the Persian language or the Persian community.

I believe it would be better to be explicit about this all over the docs/README.

Suggestion on choosing Better Error types

raise ValueError(f"`{self.__class__.__name__}` has no attribute `{item}`!")

We can follow two options:

  1. Considering that the item getter is implemented by Hezar in a dataclass, we can create a new error type
  2. We can raise the same KeyError instead of ValueError

It definitely shouldn't be ValueError, because users are not entering any values! They are trying to get one!
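For attribute-style access specifically, Python's own convention is AttributeError (with KeyError reserved for item access), and either beats ValueError. A hedged sketch with an illustrative config class, not hezar's actual one:

```python
# Illustrative config class; hezar's real class may differ.
class Config:
    def __init__(self, **kwargs):
        self._params = dict(kwargs)

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        try:
            return self._params[item]
        except KeyError:
            raise AttributeError(
                f"`{self.__class__.__name__}` has no attribute `{item}`!"
            ) from None
```

Raising AttributeError here also keeps utilities like `hasattr` and `getattr(obj, name, default)` working as users expect.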

Implement SequenceLabelingDataset class

This issue is a sub-issue based on the first step in #36 and #39 (Implement sequence labeling dataset (A torch.utils.data.Dataset subclass)).

A ready-to-use dataset for sequence labeling (PoS tagging) has been added to the Hub: LSCP-500K

The dataset format in LSCP is going to be the standard format for all sequence labeling datasets and the dataset class to implement in this task is going to be the standard class for all sequence labeling datasets (PoS, NER, Chunk, etc).

To implement this class, you can use hezar.data.datasets.text_classification.text_classification_dataset.py as a solid reference.

This file would be better located at hezar.data.datasets.sequence_labeling.sequence_labeling_dataset.py. The dataset classes do not involve many Hezar-specific requirements; just make sure to include a SequenceLabelingDatasetConfig class and inherit from hezar.data.datasets.dataset.Dataset().
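A hedged skeleton of the class described above (field names, Hub id, and data layout are illustrative; the real class would inherit from hezar.data.datasets.dataset.Dataset and load LSCP from the Hub):

```python
from dataclasses import dataclass

@dataclass
class SequenceLabelingDatasetConfig:
    path: str = "hezarai/lscp"   # illustrative Hub id, not the real one
    tokens_field: str = "tokens"
    tags_field: str = "pos_tags"

class SequenceLabelingDataset:
    """Skeleton exposing the __len__/__getitem__ protocol that
    torch.utils.data.Dataset requires; loading/tokenization omitted."""

    def __init__(self, config, data):
        self.config = config
        self.data = data  # e.g. list of {"tokens": [...], "pos_tags": [...]}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        example = self.data[index]
        return {
            "tokens": example[self.config.tokens_field],
            "labels": example[self.config.tags_field],
        }
```

Keeping the tag field configurable is what lets the same class serve PoS, NER, and chunking datasets.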
