hezarai / hezar
The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
Home Page: https://hezarai.github.io/hezar/
License: Apache License 2.0
As of now, there are no proper test files in the tests folder. First, we need to figure out the right way to do it (specific to this library) and then start adding them. Since this library follows a hierarchical abstract factory design, base classes like models.Model or trainers.Trainer might be sensitive to changes that cause issues in their derived classes.
@pooya-mohammadi
As mentioned in the Python docs, wildcard imports are not good practice! Nevertheless, they can be seen in most well-known open source projects. As of now in Hezar, we do it this way: in every submodule, import everything explicitly if the submodule or file has fewer than five or so properties; otherwise, provide the needed properties in __all__ and import with wildcards.
You can see this style mostly in utils and other files that contain a lot of classes, methods, etc., because otherwise it would be easy to miss some of them as we move forward. Also, IDEs and editors automatically warn the developer if new items are not added to __all__.
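For illustration, here's a minimal sketch of the pattern (the module and function names are made up, not Hezar's actual files):

```python
# utils/text_utils.py -- a hypothetical submodule with many members
__all__ = ["clean_text", "split_sentences"]  # only these are exported

def clean_text(text: str) -> str:
    return " ".join(text.split())

def split_sentences(text: str) -> list:
    return [s.strip() for s in text.split(".") if s.strip()]

def _internal_helper():
    ...  # not in __all__, so wildcard imports won't pick it up


# utils/__init__.py would then do:
# from .text_utils import *  # pulls in exactly what __all__ lists
```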
Right now, in order to use a model to predict on raw data you have to do it like this:
```python
from hezar import Model, Tokenizer

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"
model = Model.load(model_path)
tokenizer = Tokenizer.load(model_path)
example = ["هزار، کتابخانهای کامل برای به کارگیری آسان هوش مصنوعی"]
inputs = tokenizer(example, return_tensors="pt")
outputs = model.predict(inputs)
```
But, for normal users, this might be vague. What they might want is something like this:
```python
from hezar import Model

model_path = "hezarai/bert-fa-sentiment-digikala-snappfood"
example = ["هزار، کتابخانهای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load(model_path)
outputs = model.predict(example)
```
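For discussion, a rough sketch of how predict() could hide the preprocessing step. This is not Hezar's actual implementation: the lazy tokenizer loading and the forward call are assumptions (Tokenizer.load is the API from the snippet above).

```python
# A rough sketch, NOT Hezar's actual implementation: predict() accepts raw
# text, lazily loads the matching tokenizer from the same Hub path, and
# runs the forward pass itself. `forward` is an assumed model-specific method.
class Model:
    @classmethod
    def load(cls, hub_path: str) -> "Model":
        model = cls()
        model.hub_path = hub_path
        model._tokenizer = None
        return model

    def predict(self, inputs):
        if isinstance(inputs, (str, list)):  # raw text came in
            if self._tokenizer is None:
                self._tokenizer = Tokenizer.load(self.hub_path)
            inputs = self._tokenizer(inputs, return_tensors="pt")
        return self.forward(**inputs)  # model-specific forward + post-processing
```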
Regarding #12, we decided to write our own evaluation in a "not re-inventing the wheel" manner. First, we start with common metrics like "f1", "precision", and "recall" from scikit-learn, and then move on to higher-level ones like "BLEU", etc.
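As a starting point, the first step could be as thin as wrapping scikit-learn directly; a minimal sketch:

```python
# A minimal sketch of the scikit-learn step; plain binary classification
# labels are used for illustration.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(f1_score(y_true, y_pred))         # 0.8
print(precision_score(y_true, y_pred))  # 1.0
print(recall_score(y_true, y_pred))     # 0.666...
```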
A lot of libraries are supported on the Hugging Face Hub (full list). In order to add Hezar to this list, we have to take the following steps. The full guide is here, but these are the main steps.
hezar/examples/train_sequence_labeling_example.py
hezar/hezar/trainers/trainer.py
Right now, there is only one Trainer class in trainers, and it has only been tested on text classification tasks. A lot of other libraries have implemented a single trainer class for all of their training, but this has some drawbacks for Hezar. Besides, the Trainer class is a really naive one right now and needs a lot of refactoring, in my opinion.
@pooya-mohammadi @arxyzan
Add the sequence labeling data collator class as hezar.data.data_collators.SequenceLabelingDataCollator().
This class would be a reimplementation of transformers.data.data_collator.DataCollatorForTokenClassification(); a rough sketch of the idea follows.
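A minimal sketch of what such a collator could do, padding inputs and labels to the batch's max length. The field names and padding ids are assumptions; the -100 label padding convention is borrowed from the transformers collator.

```python
# A minimal sketch of SequenceLabelingDataCollator; field names and padding
# ids are assumptions. Labels are padded with -100 so loss functions like
# CrossEntropyLoss(ignore_index=-100) skip the padded positions.
import torch


class SequenceLabelingDataCollator:
    def __init__(self, pad_token_id: int = 0, label_pad_id: int = -100):
        self.pad_token_id = pad_token_id
        self.label_pad_id = label_pad_id

    def __call__(self, batch):
        max_len = max(len(item["input_ids"]) for item in batch)
        input_ids, labels, attention_mask = [], [], []
        for item in batch:
            pad_len = max_len - len(item["input_ids"])
            input_ids.append(item["input_ids"] + [self.pad_token_id] * pad_len)
            labels.append(item["labels"] + [self.label_pad_id] * pad_len)
            attention_mask.append([1] * len(item["input_ids"]) + [0] * pad_len)
        return {
            "input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels),
            "attention_mask": torch.tensor(attention_mask),
        }
```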
How can we access the preprocessors in the Trainer properly?
Add a RoBERTa model from Transformers for sequence labeling. See BertSequenceLabeling and RobertaTextClassification for reference.
Right now, the general schema is this: <pretrained-model-name>-<language>-<task>-<dataset-name>. For example, a RoBERTa model trained on SnappFood/Digikala comments would be written as roberta-fa-sentiment-digikala-snappfood.
I don't know why, but I feel like something is wrong with this!
In this issue, a DistilBERT model with a token classification head must be added to the library's models.
Note: For reference, see hezar/hezar/models/text_classification/distilbert/distilbert_text_classification.py and hezar/hezar/models/text_classification/distilbert/distilbert_text_classification_config.py
DoD: Clean reload from Hub
evaluate is a pretty solid package for a lot of common evaluation techniques, but the problem is that the metrics are not shipped as regular Python modules; instead, loading any metric downloads its corresponding script (from somewhere I have not figured out yet!) and caches it in ~/.cache/huggingface/evaluate/.... The problem with this scheme is that each metric has its own specific arguments, and accessing them requires writing code like the below:
```python
import evaluate

f1 = evaluate.load("f1")
print(f1.inputs_description)  # prints the docstring so that we can see the input arguments
```
I don't know why they chose to do it this way, but I think it can be problematic. Maybe implementing these modules explicitly in Hezar is a better approach.
*Right now, we use torchmetrics for evaluation.
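For reference, a minimal example of computing f1 with torchmetrics (API as of torchmetrics >= 0.11, where the task argument is required; the value shown uses the default macro average):

```python
# A minimal torchmetrics example for multiclass f1.
import torch
from torchmetrics import F1Score

f1 = F1Score(task="multiclass", num_classes=3)
preds = torch.tensor([0, 2, 1, 2])
target = torch.tensor([0, 1, 1, 2])
print(f1(preds, target))  # tensor(0.7778)
```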
In this thread, let's discuss how we can provide API documentation in an easy semi-automatic way.
Tools I have in mind so far are mkdocs, mkdocstrings, and pdoc.
Due to the limitations of GitHub for private repos regarding branch protection rules and some other constraints (especially limited time), let's just push to main until the public release.
@pooya-mohammadi @arxyzan
Add torchmetrics to pyproject.toml or tell me to do it
Implement the sequence labeling dataset (a torch.utils.data.Dataset subclass). See hezar/examples/train_sequence_labeling_example.py and hezar/hezar/trainers/trainer.py.
https://github.com/hezarai/hezar/tree/main/notebooks#notebooks
In this section, we provide a list of the notebooks that use Hezar. Most of these notebooks have a tutorial perspective on different features of Hezar.
I believe changing it to the following would be better:
In this section, a list of notebooks that use Hezar is provided...
A tokenizer might split a single word into sub-words, and since the model assigns a label to every token (not the whole word), we have to rejoin the split words and their labels correspondingly.
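A minimal sketch of one common way to do this rejoining, assuming we have the tokenizer's word_ids mapping (as exposed by Hugging Face fast tokenizers); the function name is illustrative, not Hezar's API:

```python
# An illustrative helper (not Hezar's API) that keeps the label of the first
# sub-token of each word, using the word_ids mapping from a fast tokenizer.
def align_labels(word_ids, token_labels):
    word_labels = []
    previous_word_id = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:  # special tokens like [CLS]/[SEP]/padding
            continue
        if word_id != previous_word_id:  # first sub-token of a new word
            word_labels.append(label)
        previous_word_id = word_id
    return word_labels


# "tokenization" gets split into "token" + "##ization"; both map to word id 1,
# so only the first sub-token's label is kept.
word_ids = [None, 0, 1, 1, None]  # [CLS] w0 token ##ization [SEP]
token_labels = ["O", "O", "B-X", "I-X", "O"]
print(align_labels(word_ids, token_labels))  # ['O', 'B-X']
```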
I just created the first tag, v0.13.0, to test the PyPI publish workflow and got this error:
Error: Parameter token or opts.auth is required
Run link: https://github.com/hezarai/hezar/actions/runs/5399096875
In this issue, a BERT model with a token classification head must be added to the library's models.
- Add BertSequenceLabeling to hezar/hezar/models/sequence_labeling/bert/bert_sequence_labeling.py
- Add BertSequenceLabelingConfig to hezar/hezar/models/sequence_labeling/bert/bert_sequence_labeling_config.py
Note: For reference, see hezar/hezar/models/text_classification/bert/bert_text_classification.py and hezar/hezar/models/text_classification/bert/bert_text_classification_config.py
DoD: Clean reload from Hub
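For discussion, a hedged sketch of what such a model could look like, built directly on transformers' BertModel; the class layout is an assumption modeled on the file names above, not Hezar's actual base classes or config system:

```python
# A hedged sketch of BertSequenceLabeling on top of transformers' BertModel;
# Hezar's real base classes and config are not shown here (assumption).
import torch.nn as nn
from transformers import BertModel


class BertSequenceLabeling(nn.Module):
    def __init__(self, pretrained_path: str, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained_path)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        return self.classifier(sequence_output)  # (batch, seq_len, num_labels)
```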
It is an old habit to add v at the beginning of a tag. Please remove it.
We need these functionalities in the TextNormalizer:
https://github.com/hezarai/hezar#use-a-model-from-hub
Instead of "Use a model from Hub", it would be better to say "Try a sentiment analysis model" or something like that, so when users see the header they get the main point!
Also, we may want to add more quick use cases of the library to this section in the future; categorizing them now would save us from extensive refactoring later!
Description coming soon... (This is a big one)
It makes sense to be able to push training results and reports, like evaluation results, to the Hub when pushing a trainer to the Hub.
This issue specifies the roadmap for adding CLIP to Hezar (model architecture, task, datasets, pretrained weights, etc.).
A word embeddings module is a must-have feature in Hezar! Both ready-to-use models and trainers. More on this later...
The compute_metrics method should return a flat, single-level dictionary like {"f1": ..., "recall": ...}.
Some metrics like seqeval can output multiple values themselves. Right now, seqeval outputs its own recall, accuracy, f1, and precision, and it's better to output all of them in the trainer. (Currently, the sequence labeling trainer only outputs f1.)
The challenge is that the Trainer sets up the metrics tracker, in which the metrics are registered by their name from the AVAILABLE_METRICS list.
How can we register the proper metrics in the metrics tracker?
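One possible direction is to flatten nested metric results into the flat dictionary compute_metrics expects; a small sketch (the seqeval-style nested output shown is illustrative):

```python
# A small sketch of flattening a multi-value metric result into the flat
# dict compute_metrics should return.
def flatten_metrics(results: dict, prefix: str = "") -> dict:
    flat = {}
    for key, value in results.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):  # e.g. per-entity scores from seqeval
            flat.update(flatten_metrics(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


nested = {"overall_f1": 0.91, "PER": {"precision": 0.90, "recall": 0.88, "f1": 0.89}}
print(flatten_metrics(nested))
# {'overall_f1': 0.91, 'PER.precision': 0.9, 'PER.recall': 0.88, 'PER.f1': 0.89}
```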
I attempted to test the workflow by creating a new tag and it raised an error.
See it here: https://github.com/hezarai/hezar/actions/runs/5399733975/jobs/9807244244#step:3:10
@pooya-mohammadi Can you check this?
To better organize our plans and goals, let's define some milestones here.
@arxyzan @pooya-mohammadi
These rules are inspired by reviewing big open source projects and their best practices, and they will evolve over time.
We should all first come to an agreement on these and then add them to the docs.
@pooya-mohammadi @arxyzan
References:
The cardinal rule for creating good commits is to ensure there is only one "logical change" per commit. There are many reasons why this is an important rule.
Things to avoid:
A commit has a subject (the main message) and a body (the description). The body is separated from the subject by a blank line, like below:
```
Fix issues with amp in Trainer

Automatic mixed precision had some buggy behavior that ...
```
Providing the body is not necessary. Do it when you feel like it's needed.
General rules:
- variable
- method()
- Class()
Line 62 in 1f987ac
vars is the sugar function for __dict__. However, you have defined your own kind of vars. It's totally OK.
A lot of the methods do not have explicit type hints for their arguments. We wouldn't necessarily need to provide type hints for every single one of them, but we should for most of them.
I believe it would be better to be explicit about this all over the docs/README.
Implement the load() method for Dataset so that Hub datasets can be loaded easily without configuration overhead.
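A minimal, self-contained sketch of a registry-based load(), mirroring how model classes are typically resolved in factory-style libraries; every name here is an illustrative assumption, not Hezar's actual internals:

```python
# A self-contained sketch of a registry-based Dataset.load(); all names
# (registry, decorator, config fields) are illustrative assumptions.
_DATASETS_REGISTRY = {}


def register_dataset(name):
    def wrapper(cls):
        _DATASETS_REGISTRY[name] = cls
        return cls
    return wrapper


class Dataset:
    @classmethod
    def load(cls, hub_path: str, **kwargs):
        # In practice the config would be fetched from the Hub repo; a local
        # dict stands in for it here.
        config = {"name": "text_classification", "path": hub_path}
        dataset_cls = _DATASETS_REGISTRY[config["name"]]
        return dataset_cls(config, **kwargs)


@register_dataset("text_classification")
class TextClassificationDataset(Dataset):
    def __init__(self, config, **kwargs):
        self.config = config


ds = Dataset.load("hezarai/some-dataset")  # hypothetical Hub path
```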
Line 47 in 1f987ac
We can follow two options:
Definitely, it shouldn't be a ValueError, because users are not entering any values! They are trying to get some!
This issue is a sub-issue based on the first step in #36 and #39 (implement the sequence labeling dataset, a torch.utils.data.Dataset subclass).
A ready-to-use dataset for sequence labeling (PoS tagging) has been added to the Hub: LSCP-500K.
The dataset format in LSCP is going to be the standard format for all sequence labeling datasets, and the dataset class to implement in this task is going to be the standard class for all sequence labeling datasets (PoS, NER, chunking, etc.).
To implement this class, you can use hezar.data.datasets.text_classification.text_classification_dataset.py as a solid reference. This file would better be located at hezar.data.datasets.sequence_labeling.sequence_labeling_dataset.py. The dataset classes do not have many Hezar-specific instructions; just make sure to include a SequenceLabelingDatasetConfig class and inherit from hezar.data.datasets.dataset.Dataset().
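To make the shape of the task concrete, here is a hedged sketch. The Hub path, column names, and config fields are assumptions, and the real class should inherit from Hezar's own Dataset as stated above (torch's Dataset is used here only to keep the sketch self-contained):

```python
# A hedged sketch of the sequence labeling dataset described above.
from dataclasses import dataclass

from datasets import load_dataset  # Hugging Face datasets
from torch.utils.data import Dataset


@dataclass
class SequenceLabelingDatasetConfig:
    path: str = "hezarai/lscp-500k"  # hypothetical Hub path for LSCP-500K
    tokens_field: str = "tokens"     # hypothetical column names
    tags_field: str = "pos_tags"


class SequenceLabelingDataset(Dataset):
    def __init__(self, config: SequenceLabelingDatasetConfig, split: str = "train"):
        self.config = config
        self.data = load_dataset(config.path, split=split)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        example = self.data[index]
        # Return raw tokens and tag ids; tokenization and label alignment can
        # be delegated to the data collator or a preprocessing step.
        return example[self.config.tokens_field], example[self.config.tags_field]
```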