hezar's People

Contributors

arxyzan, hamedbabaei, pooya-mohammadi, ssbakh07


hezar's Issues

Take `tests` more seriously!

As of now, there are no proper test files in the tests folder. First, we need to figure out the right way to do it (specific to this library) and then start adding them. As this library follows a hierarchical abstract factory design, some base classes like models.Model or trainers.Trainer might be sensitive to changes and cause issues in their derived classes.
@pooya-mohammadi
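One concrete way to start is with base-class "contract" tests: check every derived class against the base class's interface, so changes to `Model` or `Trainer` that break subclasses surface in the test suite instead of at user call sites. A stdlib-only sketch of the pattern, with stand-in classes rather than hezar's real ones:

```python
# Stand-in classes for illustration; a real test would import hezar's
# actual Model base class and iterate over the model registry instead.
class Model:
    def predict(self, inputs):
        raise NotImplementedError

class TextClassificationModel(Model):
    def predict(self, inputs):
        return {"labels": ["positive"]}

def test_subclasses_override_predict():
    # If a base-class refactor breaks a derived class's override,
    # this fails in CI rather than in user code.
    for cls in Model.__subclasses__():
        assert cls.predict is not Model.predict, f"{cls.__name__} lacks predict"

test_subclasses_override_predict()
```

Because the hierarchy is an abstract factory, a handful of contract tests like this cover every derived class at once.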

Wildcard imports are bad practice!

As mentioned in the Python docs, wildcard imports are bad practice! Nevertheless, they show up in most well-known open source projects. As of now, Hezar does it this way: in every submodule, import everything explicitly if the submodule or file has fewer than five-ish public names; otherwise, list the needed names in __all__ and import with wildcards.
You can see this style mostly in utils and other files that contain a lot of classes, methods, etc., because otherwise it would be easy to miss some of them as we move forward. Also, IDEs and editors automatically warn the developer when new names are not added to __all__.
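The convention above can be sketched like this (module and function contents invented for illustration):

```python
# utils/text_utils.py (hypothetical submodule following the convention):
# only the names listed in __all__ are picked up by a wildcard import.
__all__ = ["normalize_spaces", "strip_prefix"]

def normalize_spaces(text):
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

def strip_prefix(text, prefix):
    return text[len(prefix):] if text.startswith(prefix) else text

def _internal_helper():
    # Not listed in __all__ (and underscore-prefixed), so
    # `from utils.text_utils import *` skips it.
    pass
```

With this in place, a wildcard import of the submodule exposes only the two public functions, and editors can flag any new public name that is missing from `__all__`.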

Integrate preprocessor loading/functioning in Model

Right now, in order to use a model to predict on raw data you have to do it like this:

from hezar import Model, Tokenizer

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"

model = Model.load(model_path)
tokenizer = Tokenizer.load(model_path)

example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
inputs = tokenizer(example, return_tensors="pt")

outputs = model.predict(inputs)

But for normal users, this multi-step flow might be confusing. What they probably want is something like this:

from hezar import Model

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"
example = ["هزار، کتابخانه‌ای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load(model_path)
outputs = model.predict(example)
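A hedged sketch of how this could work internally, using stub classes rather than hezar's actual implementation: `Model.load` also loads the preprocessor from the same Hub path, and `predict` tokenizes raw strings before the forward pass.

```python
# Stub classes sketching the idea (not hezar's actual implementation).
class Tokenizer:
    @classmethod
    def load(cls, path):
        return cls()  # a real loader would fetch tokenizer files from `path`

    def __call__(self, texts):
        return {"input_ids": [[len(t)] for t in texts]}  # dummy encoding

class Model:
    @classmethod
    def load(cls, path):
        model = cls()
        model.preprocessor = Tokenizer.load(path)  # loaded alongside weights
        return model

    def predict(self, raw_inputs):
        inputs = self.preprocessor(raw_inputs)  # users never see this step
        return self.forward(inputs)

    def forward(self, inputs):
        return {"n_examples": len(inputs["input_ids"])}  # dummy forward pass

model = Model.load("hezarai/bert-fa-sentiment-digikala-snappfood")
outputs = model.predict(["raw text goes straight in"])
```

The explicit `Tokenizer.load` path from the first snippet can stay available for advanced users who want control over tokenization.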

Implement the metrics module

Regarding #12, we decided to write our own evaluation module in a "not re-inventing the wheel" manner.
First, we start with common metrics like "f1", "precision", and "recall" from scikit-learn and then move on to higher-level ones like "BLEU".
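A minimal sketch of what such a wrapper layer might look like (class names hypothetical; in practice `compute` would delegate to scikit-learn rather than the inline accuracy shown here):

```python
class Metric:
    """Hypothetical base class for thin metric wrappers."""
    name = None

    def compute(self, predictions, targets):
        raise NotImplementedError

class Accuracy(Metric):
    name = "accuracy"

    def compute(self, predictions, targets):
        # Inline accuracy keeps this sketch dependency-free; a real
        # wrapper would call e.g. sklearn.metrics.accuracy_score here.
        correct = sum(p == t for p, t in zip(predictions, targets))
        return {self.name: correct / len(targets)}

# A registry like this lets trainers look metrics up by name.
AVAILABLE_METRICS = {"accuracy": Accuracy}

metric = AVAILABLE_METRICS["accuracy"]()
result = metric.compute([1, 0, 1], [1, 1, 1])
```

Returning a dict keyed by metric name keeps the interface uniform across single- and multi-output metrics.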

Add Hezar to Hugging Face libraries

There are a lot of libraries supported on the Hugging Face Hub (full list).
In order to add Hezar to this list, we have to take a few steps.
The full guide is here, but these are the main steps to follow.

@pooya-mohammadi

Train ParsDistilBERT on LSCP (sequence labeling)

  1. Implement sequence labeling dataset (A torch.utils.data.Dataset subclass)
  2. Implement a data collator class for sequence labeling (@arxyzan )
  3. Add a sequence labeling training example to hezar/examples/train_sequence_labeling_example.py
  4. R&D on Trainer to be compatible with sequence labeling (hezar/hezar/trainers/trainer.py)
  5. Modify trainer to be compatible with sequence labeling
  6. Run trainer for ParsDistilBERT + LSCP
  7. Evaluate results (Baseline: f1: ~0.83) (check this @arxyzan)
  8. Push to Hub
  9. Load back and test on raw inputs
  10. Add notebook example here

Reconsider `trainers` design, functionalities, modules, etc.

Right now, there is only one Trainer class in trainers, and it has only been tested on text classification tasks.
A lot of other libraries implement a single similar class for all of their training, but this has some drawbacks for Hezar:

  • Hezar must support a wide range of models and tasks with specific training strategies
  • Even if we can put them all together in a single class, it'll cause either over-engineering or spaghetti-code.

Besides, the Trainer class is really naive right now and needs a lot of refactoring, in my opinion.
@pooya-mohammadi @arxyzan

Add data collator for sequence labeling

Add the sequence labeling data collator class as hezar.data.data_collators.SequenceLabelingDataCollator()
This class would be a reimplementation of the class transformers.data.data_collator.DataCollatorForTokenClassification()
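The core padding logic such a collator needs can be sketched with plain lists (real collators work on tensors); labels are padded with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions don't contribute to the loss.

```python
# Plain-list sketch of the padding logic a sequence labeling collator
# needs; the real class would operate on tensors.
def collate_sequence_labeling(batch, pad_token_id=0, label_pad_id=-100):
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, labels = [], []
    for example in batch:
        n_pad = max_len - len(example["input_ids"])
        input_ids.append(example["input_ids"] + [pad_token_id] * n_pad)
        labels.append(example["labels"] + [label_pad_id] * n_pad)
    return {"input_ids": input_ids, "labels": labels}

batch = [
    {"input_ids": [5, 6, 7], "labels": [1, 2, 0]},
    {"input_ids": [5], "labels": [1]},
]
collated = collate_sequence_labeling(batch)
```

This is the same idea `DataCollatorForTokenClassification` implements: token ids get the tokenizer's pad id, labels get the loss-ignored pad id.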

Add RoBERTa for sequence labeling

Add the RoBERTa model from Transformers for sequence labeling. See BertSequenceLabeling and RobertaTextClassification for reference.

Naming schema for the models on the Hub

Right now, the general schema is this:
<pretrained-model-name>-<language>-<task>-<dataset-name>. For example a Roberta model trained on snappfood/digikala comments
would be written as roberta-fa-sentiment-digikala-snappfood.
I can't say exactly why, but something about this schema feels off!

Add DistilBERT for sequence labeling

In this issue, a DistilBERT model with a token classification head must be added to the library's models.

  1. Implement the PyTorch module class (class name: DistilBertSequenceLabeling)
  2. Add to hezar/hezar/models/sequence_labeling/distilbert/distilbert_sequence_labeling.py
  3. Add DistilBertSequenceLabelingConfig to hezar/hezar/models/sequence_labeling/distilbert/distilbert_sequence_labeling_config.py
  4. Test on some random inputs
  5. Upload a raw sequence labeling DistilBERT model to the Hub and reload it from the Hub.

Note: For reference, see hezar/hezar/models/text_classification/distilbert/distilbert_text_classification.py and hezar/hezar/models/text_classification/distilbert/distilbert_text_classification_config.py

DoD: Clean reload from Hub
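As a hedged illustration of step 3, the config could be a small dataclass like this (field names and defaults are invented, not hezar's actual schema):

```python
from dataclasses import dataclass

# Illustrative config sketch; hezar's real config base class and
# field set may differ.
@dataclass
class DistilBertSequenceLabelingConfig:
    name: str = "distilbert_sequence_labeling"
    num_labels: int = 2
    dropout: float = 0.1
    pretrained_path: str = "distilbert-base-multilingual-cased"
```

Keeping the config a plain dataclass makes serialization to/from the Hub straightforward.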

Integrate HuggingFace evaluate into trainer?

evaluate is a pretty solid package for a lot of common evaluation techniques, but the problem is that its metrics are not shipped as regular Python modules; instead, loading any metric downloads its corresponding script (from somewhere I have not figured out yet!) and caches it in ~/.cache/huggingface/evaluate/.... The problem with this scheme is that each metric has its own specific arguments, and discovering them requires writing code like below:

import evaluate

f1 = evaluate.load("f1")
print(f1.inputs_description)  # this gives the docstring so that we see input arguments

I don't know why they chose to do it this way, but I think it can be problematic. Maybe implementing these modules explicitly in Hezar is a better approach.

*Right now, we use torchmetrics for evaluation.

Publish to PyPI

  1. Create a CI/CD workflow (GitHub Actions) to push to PyPI
  2. Set PyPI credentials in the repository secrets @arxyzan

Train ParsBERT on LSCP (sequence labeling)

  1. Implement sequence labeling dataset (A torch.utils.data.Dataset subclass)
  2. Implement a data collator class for sequence labeling (@arxyzan )
  3. Add a sequence labeling training example to hezar/examples/train_sequence_labeling_example.py
  4. R&D on Trainer to be compatible with sequence labeling (hezar/hezar/trainers/trainer.py)
  5. Modify trainer to be compatible with sequence labeling
  6. Run trainer for ParsBERT + LSCP
  7. Evaluate results (Baseline: f1: ~0.89)
  8. Push to Hub
  9. Load back and test on raw inputs
  10. Add notebook example here

Add BERT for sequence labeling

In this issue, a BERT model with a token classification head must be added to the library's models.

  1. Implement the PyTorch module class (class name: BertSequenceLabeling)
  2. Add to hezar/hezar/models/sequence_labeling/bert/bert_sequence_labeling.py
  3. Add BertSequenceLabelingConfig to hezar/hezar/models/sequence_labeling/bert/bert_sequence_labeling_config.py
  4. Test on some random inputs
  5. Upload a raw sequence labeling BERT model to the Hub and reload it from the Hub.

Note: For reference, see hezar/hezar/models/text_classification/bert/bert_text_classification.py and hezar/hezar/models/text_classification/bert/bert_text_classification_config.py

DoD: Clean reload from Hub

Complete `TextNormalizer` features

We need these functionalities in the TextNormalizer:

  • Add support for Regex replacements
  • Add support for zwnj handling in Persian
  • Add support for emoji detection and deletion
  • Add support for space correction
  • Add support for grammar issues (if possible)
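A minimal sketch covering the regex-replacement, ZWNJ, and space-correction items above, with an illustrative rule list rather than hezar's actual rules:

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, essential in Persian orthography

def normalize(text, rules=None):
    # Each rule is a (pattern, replacement) pair applied in order;
    # this rule list is illustrative, not TextNormalizer's real one.
    rules = rules or [
        (r"\s+", " "),              # space correction: collapse runs
        (rf"\s*{ZWNJ}\s*", ZWNJ),   # drop spaces around ZWNJ
    ]
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()
```

For example, `normalize("می \u200c رود")` joins the verb correctly as `"می\u200cرود"`; emoji deletion could be added as just another `(pattern, "")` rule.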

Add embeddings module

A word embeddings module is a must-have feature in Hezar, both ready-to-use models and trainers. More on this later...

Support for multi output metrics in the trainers

The compute_metrics method should return a flat, single-level dictionary like {"f1": ..., "recall": ...}.
Some metrics like seqeval output multiple values themselves. Right now, seqeval outputs its own recall, accuracy, f1, and precision, and it's better to surface all of them in the trainer. (Currently, the sequence labeling trainer only outputs f1.)
The challenge is, the Trainer sets up the metrics tracker in which the metrics are registered by their name from the AVAILABLE_METRICS list.
How can we register the proper metrics in the metrics tracker?
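One option is to flatten multi-value results into prefixed keys before registering them, so the tracker still only ever sees a flat dict. A hedged sketch, with key names merely illustrating seqeval's output shape (per-entity sub-dicts plus overall_* scalars):

```python
# Flatten seqeval-style nested results into the flat dict the metrics
# tracker expects. Key names are illustrative, not exact.
def flatten_metric_results(results, keep=("f1", "precision", "recall", "accuracy")):
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            # e.g. {"PER": {"f1": ...}} becomes {"PER.f1": ...}
            for sub_key, sub_value in value.items():
                if sub_key in keep:
                    flat[f"{key}.{sub_key}"] = sub_value
        elif key in keep or key.startswith("overall_"):
            flat[key] = value
    return flat

seqeval_like = {
    "PER": {"f1": 0.90, "precision": 0.88, "recall": 0.92},
    "overall_f1": 0.83,
}
flat = flatten_metric_results(seqeval_like)
```

Registering the flattened keys, rather than the bare names from AVAILABLE_METRICS, would be one way to let the tracker accommodate multi-output metrics.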

Git commit guidelines


These rules are inspired by reviewing big open source projects and their best practices and will evolve through time.
We all first come to an agreement on these and then add this to the docs.
@pooya-mohammadi @arxyzan

Functional best practices

The cardinal rule for creating good commits is to ensure there is only one "logical change" per commit. There are many reasons why this is an important rule:

  • The smaller the amount of code being changed, the quicker & easier it is to review & identify potential flaws.
  • If a change is found to be flawed later, it may be necessary to revert the broken commit. This is much easier to do if there are no other unrelated code changes entangled with the original commit.
  • When troubleshooting problems using Git's bisect capability, small well-defined changes will aid in isolating exactly where the code problem was introduced.
  • When browsing history using Git annotate/blame, small well-defined changes also aid in isolating exactly where & why a piece of code came from.

Things to avoid:

  • Mixing whitespace changes with functional code changes
  • Mixing two unrelated functional changes
  • Sending large new features in a single giant commit (smaller is better here!)

Styling best practices

A commit has a subject (the main message) and a body (the description). The body is separated from the subject by a blank line, like below:

Fix issues with amp in Trainer

Automatic mixed precision had some buggy behavior that ...

Providing the body is not necessary. Do it when you feel like it's needed.

General rules:

  • Use imperative mood in the subject. Commit message must be able to complete this sentence: "This commit will <commit message here>"
    • Wrong: "Adding support for ..."
    • Correct: "Add support for ..."
  • Keep it short and concise, preferably less than 50 characters.
  • Capitalize the subject line
  • Do NOT end the subject with a dot/period
  • Wrap body lines at 72 characters
  • Use the body to explain what and why you have done something
  • Do NOT put the paths of changed files in the subject; instead, try to provide enough info in the message so that people know what happened in your commit. Most IDEs and editors let you inspect the location and diffs by clicking on the commit message, so spelling paths out in the subject is not good practice and also hits the 50-character limit sooner.
    • Wrong: [models][text_classification][bert] Add text normalizing in preprocess()
    • Correct: Add text normalizing in preprocess for BertTextClassification
  • Use the body to explain what and why, NOT how! (You must provide the "how" part in docs or code not in the commit message!)
  • For commits that reference an issue or PR, write the proper commit subject followed by the reference in parentheses
    • Add NFKC normalizer (#9999)
    • Fix tokenizer encode() bug (#999)
  • Reference code in backticks:
    • `variable`
    • `method()`
    • `Class()`

Add type hints to function arguments?!

A lot of the methods do not have explicit type hints for their arguments. We don't necessarily need to add hints to every single one of them, but most of them should have them.
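A small before/after illustration with a hypothetical function:

```python
from typing import Optional

# Before (hypothetical function, untyped arguments):
#     def load(path, device=None):
#         ...

# After, with explicit hints on arguments and the return value:
def load(path: str, device: Optional[str] = None) -> dict:
    """Load an artifact from `path`, optionally targeting `device`."""
    return {"path": path, "device": device}
```

Hints like these also let IDEs and static checkers catch misuse in the derived classes without running anything.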

Avoid using Persian alone

Hezar: A seamless AI library for the Persian language or the Persian community.

I believe it would be better to be explicit about this all over the docs/README.

Suggestion on choosing Better Error types

raise ValueError(f"`{self.__class__.__name__}` has no attribute `{item}`!")

We can follow two options:

  1. Considering that the item getter is implemented by Hezar in a dataclass, we can create a new error type
  2. We can raise the same KeyError instead of ValueError

It definitely shouldn't be ValueError, because users are not entering any values! They are trying to get one!
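For attribute-style access specifically, Python's own convention is AttributeError (with KeyError reserved for item access), and either beats ValueError. A hedged sketch with an illustrative config class, not hezar's actual one:

```python
# Illustrative config class; hezar's real class may differ.
class Config:
    def __init__(self, **kwargs):
        self._params = dict(kwargs)

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        try:
            return self._params[item]
        except KeyError:
            raise AttributeError(
                f"`{self.__class__.__name__}` has no attribute `{item}`!"
            ) from None
```

Raising AttributeError here also keeps utilities like `hasattr` and `getattr(obj, name, default)` working as users expect.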

Implement SequenceLabelingDataset class

This issue is a sub-issue based on the first step in #36 and #39 (Implement sequence labeling dataset (A torch.utils.data.Dataset subclass)).

A ready-to-use dataset for sequence labeling (PoS tagging) has been added to the Hub: LSCP-500K

The dataset format in LSCP is going to be the standard format for all sequence labeling datasets and the dataset class to implement in this task is going to be the standard class for all sequence labeling datasets (PoS, NER, Chunk, etc).

To implement this class, you can use hezar.data.datasets.text_classification.text_classification_dataset.py as a solid reference.

This file would be better located at hezar.data.datasets.sequence_labeling.sequence_labeling_dataset.py. The dataset classes do not involve many Hezar-specific requirements; just make sure to include a SequenceLabelingDatasetConfig class and inherit from hezar.data.datasets.dataset.Dataset().
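A hedged skeleton of the class described above (field names, Hub id, and data layout are illustrative; the real class would inherit from hezar.data.datasets.dataset.Dataset and load LSCP from the Hub):

```python
from dataclasses import dataclass

@dataclass
class SequenceLabelingDatasetConfig:
    path: str = "hezarai/lscp"   # illustrative Hub id, not the real one
    tokens_field: str = "tokens"
    tags_field: str = "pos_tags"

class SequenceLabelingDataset:
    """Skeleton exposing the __len__/__getitem__ protocol that
    torch.utils.data.Dataset requires; loading/tokenization omitted."""

    def __init__(self, config, data):
        self.config = config
        self.data = data  # e.g. list of {"tokens": [...], "pos_tags": [...]}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        example = self.data[index]
        return {
            "tokens": example[self.config.tokens_field],
            "labels": example[self.config.tags_field],
        }
```

Keeping the tag field configurable is what lets the same class serve PoS, NER, and chunking datasets.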
