microsoft / azureml-bert

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service

Home Page: https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/

License: MIT License

Languages: Jupyter Notebook 50.46%, Python 49.54%
Topics: azure-machine-learning, bert, nlp, pytorch, pretrained-models, finetuning, pretraining, bert-model, azureml-bert, tuning

azureml-bert's Introduction

BERT on Azure Machine Learning Service

This repo contains end-to-end recipes to pretrain and finetune the BERT (Bidirectional Encoder Representations from Transformers) language representation model using Azure Machine Learning service.

Update on 7/7/2020: 🛑 A more recent implementation of BERT pretraining, available at https://github.com/microsoft/onnxruntime-training-examples/tree/master/nvidia-bert, is significantly faster than the implementation in this repo. That implementation uses ONNX Runtime to accelerate training and can be used in GPU environments, including Azure Machine Learning service. Details on using ONNX Runtime to accelerate training of Transformer models like BERT and GPT-2 are available in the ONNX Runtime Training Technical Deep Dive blog post.

BERT

BERT is a language representation model that is distinguished by its capacity to effectively capture deep and subtle textual relationships in a corpus. In the original paper, the authors demonstrate that the BERT model can be easily adapted to build state-of-the-art models for a number of NLP tasks, including text classification, named entity recognition, and question answering. This repo provides notebooks that allow a developer to pretrain a BERT model from scratch on a corpus, as well as to fine-tune an existing BERT model to solve a specialized task. A brief introduction to BERT is available in this repo for a quick start.

Pretrain

Challenges in BERT Pretraining

Pretraining a BERT language representation model to the desired level of accuracy is quite challenging; as a result, most developers start from a BERT model that was pretrained on a standard corpus (such as Wikipedia) instead of training it from scratch. This strategy works well if the final model is trained on a corpus similar to the one used in the pretraining step; however, if the problem involves a specialized corpus that is quite different from the standard corpus, the results will not be optimal. Additionally, to advance language representation beyond BERT's accuracy, users will need to change the model architecture, training data, cost function, tasks, and optimization routines. All these changes need to be explored at large parameter and training-data sizes. In the case of BERT-large, this is substantial: the model has 340 million parameters and is trained over a very large document corpus. Training models of this size on GPUs requires distributed training support, but because configuring those distributed environments is complex and fragile, even expert tweaking can end up producing inferior trained models.

To address these issues, this repo publishes a workflow for pretraining BERT-large models. Developers can now build their own language representation models like BERT using their domain-specific data on GPUs, either with their own hardware or using Azure Machine Learning service. The pretraining recipe in this repo includes the dataset and preprocessing scripts, so anyone can experiment with building their own general-purpose language representation models beyond BERT. Overall, this is a stable, predictable recipe that converges to a good optimum, giving researchers a solid starting point for their own explorations.

Implementation

The pretraining recipe in this repo is based on the PyTorch Pretrained BERT v0.6.2 package from Hugging Face. The implementation includes optimization techniques such as gradient accumulation (gradients are accumulated over several smaller mini-batches before the model weights are updated) and mixed precision training. The notebook and Python modules for pretraining are available in the pretrain directory.
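As a minimal illustration of the gradient accumulation idea (a sketch with a toy model and random data, not the repo's actual training loop), gradients from several small mini-batches are accumulated before a single optimizer step, which emulates a larger effective batch size:

import torch
from torch import nn

# Toy setup; in the repo these would be the BERT model, its optimizer, and the real dataloader.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
data_loader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

accumulation_steps = 4  # effective batch = 8 * 4 = 32 examples per weight update

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(data_loader):
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()   # scale so the accumulated gradient is an average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one weight update per accumulation window
        optimizer.zero_grad()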

Data Preprocessing

Data preparation is one of the most important steps in any machine learning project. For BERT pretraining, a document-level corpus is needed, and the quality of the data used for pretraining directly impacts the quality of the trained models. To make data preprocessing easier and results repeatable, the preprocessing code is included in the repo. It can be used to preprocess the Wikipedia corpus or other datasets for pretraining. Refer to the data preparation for pretraining documentation for details.

Finetune

The finetuning recipe in this repo shows how to finetune the BERT language representation model using Azure Machine Learning service. The notebooks and Python modules for finetuning are available in the finetune directory. We finetune and evaluate our pretrained checkpoints against the following:

GLUE benchmark

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems. The BERT_Eval_GLUE.ipynb Jupyter notebook allows the user to run one of the pretrained checkpoints against these tasks on Azure ML.

Azure Machine Learning service

Azure Machine Learning service provides a cloud-based environment to prepare data and to train, test, deploy, manage, and track machine learning models. The service fully supports open-source technologies such as PyTorch, TensorFlow, and scikit-learn and can be used for any kind of machine learning, from classical ML to deep learning, both supervised and unsupervised.

Notebooks

The Jupyter notebooks in this repo use the AzureML Python SDK to submit pretraining and finetuning jobs. The repo contains the following notebooks for different activities (a job-submission sketch follows the tables below).

PyTorch Notebooks

Activity                  Notebook
Pretrain                  BERT_Pretrain.ipynb
GLUE finetune/evaluate    BERT_Eval_GLUE.ipynb

TensorFlow Notebooks

Activity                  Notebook
GLUE finetune/evaluate    Tensorflow-BERT-AzureML.ipynb
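The sketch below shows, under stated assumptions, how one of these notebooks might submit a training script with the AzureML Python SDK (SDK v1, which these notebooks use); the workspace configuration, compute target name, and entry script are placeholders rather than the notebooks' exact values.

from azureml.core import Workspace, Experiment
from azureml.train.dnn import PyTorch

ws = Workspace.from_config()                       # reads the workspace config.json
experiment = Experiment(workspace=ws, name="bert-pretraining")

estimator = PyTorch(source_directory=".",
                    compute_target="gpu-cluster",  # assumed AmlCompute cluster name
                    entry_script="train.py",       # placeholder entry script
                    node_count=1,
                    use_gpu=True)

run = experiment.submit(estimator)
run.wait_for_completion(show_output=True)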

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

azureml-bert's People

Contributors

aashna, aashnamsft, dlazesz, hosseinsarshar, jingyanwangms, krishansubudhi, krkusuk, llidev, microsoft-github-policy-service[bot], microsoftopensource, msftgits, raviskolli, sassbalint, skaarthik, xiaoyongzhu, xiaoyongzhumsft


azureml-bert's Issues

Pre-training time

Roughly how long does pretraining on 64x V100 GPUs take? Megatron claims to take ~3 days; is this roughly the same?

Bert Data for Pretraining: No such file or directory: 'bert_data/validation_512_only'

Hi, I have pretraining running, but it fails after the first epoch with the following error:

  File "/AzureML-BERT/pretrain/PyTorch/dataset.py", line 100, in __init__
    path = get_random_partition(self.dir_path, index)
  File "/AzureML-BERT/pretrain/PyTorch/dataset.py", line 33, in get_random_partition
    for x in os.listdir(data_directory)]
FileNotFoundError: [Errno 2] No such file or directory: 'bert_data/validation_512_only'

I created the Wiki pretraining data using the create_pretraining script. I do not see validation_512_only being generated.

dataloading error

I run into an attribute error while trying to load the bert training data:
AttributeError: Can't get attribute 'WikiNBookCorpusPretrainingDataCreator' on <module '__main__' from 'train.py'>

The data is retrieved from https://bertonazuremlwestus2.blob.core.windows.net/public/bert_data.tar.gz based on the notebook instruction (BERT_Pretrain.ipynb)

It seems this issue arises because the WikiNBookCorpusPretrainingDataCreator class is pickled into the data file upon creation but is not recognized while loading it, reflecting a mismatch between the loading code and the dataset.

Expected pretraining results (e.g., loss, next-sentence-prediction accuracy, etc.)

It would be great to get the expected pretraining results (e.g., loss, next-sentence-prediction accuracy, etc.) and learning curves from the 64x V100 training.

I found the fine-tuning task results on the blog, but it would be nice to compare the pretraining results as a "sanity check."

I don't have access to V100 GPUs for now so I'm using P40 GPUs to replicate/validate the pretraining pipeline.

@maxluk @aashnamsft any pretraining details that you guys can share? Even a screenshot of the AzureML metrics page would be helpful to compare and double-check.

pretrain/PyTorch/dataprep/create_pretraining.py fails to locate BertTokenizer

We don't have pretrain/PyTorch/pytorch_pretrained_bert.py.

$ python create_pretraining.py --input_dir=~/Data/bert_data/out3 --output_dir=/home/kaiida/Data/bert_data/out4
Traceback (most recent call last):
  File "create_pretraining.py", line 29, in <module>
    from pytorch_pretrained_bert.tokenization import BertTokenizer
ModuleNotFoundError: No module named 'pytorch_pretrained_bert'
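A hedged note, based on the Implementation section above rather than an answer from the maintainers: the pretraining recipe depends on the PyTorch Pretrained BERT v0.6.2 package from Hugging Face, so the missing module can usually be resolved by installing that package into the environment that runs create_pretraining.py.

# Assumed fix: install the package the script imports from, e.g.
#
#     pip install pytorch-pretrained-bert==0.6.2
#
# after which the import below should succeed.
from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)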

Initializing BertMultiTask model error

Hi, I am getting the following error when I run BERT pretraining:

09/09/2019 10:13:43 - INFO - logger - Vocabulary contains 30522 tokens
09/09/2019 10:13:43 - INFO - logger - Initializing BertMultiTask model
Traceback (most recent call last):
  File "AzureML-BERT/pretrain/PyTorch/train_nitin.py", line 361, in <module>
    summary_writer = summary_writer)
  File "/home/nigaregr/Documents/AzureML-BERT/pretrain/PyTorch/models.py", line 121, in __init__
    self.network.register_batch(BatchType.PRETRAIN_BATCH, "pretrain_dataset", loss_calculation=BertPretrainingLoss(self.bert_encoder, bert_config))
  File "/home/nigaregr/Documents/AzureML-BERT/pretrain/PyTorch/models.py", line 25, in __init__
    self.cls = BertPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight)
TypeError: __init__() takes 2 positional arguments but 3 were given

How can I get the IP of the master node?

When I try to get it via the environment variable AZ_BATCHAI_MPI_MASTER_NODE, I get the following error:

  File "src/scripts/submit_job/distributed.py", line 273, in set_environment_variables_for_nccl_backend
    os.environ['MASTER_ADDR'] = os.environ['AZ_BATCHAI_MPI_MASTER_NODE']
  File "/opt/conda/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'AZ_BATCHAI_MPI_MASTER_NODE'
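A hedged workaround sketch, not an official answer: fall back across the environment variables that different AzureML and Azure Batch backends have used for the master node address. Only AZ_BATCHAI_MPI_MASTER_NODE appears in this repo; the other variable names below are assumptions and may not be set in a given run.

import os

def get_master_node_address(default="127.0.0.1"):
    # Assumed fallback order; variables other than AZ_BATCHAI_MPI_MASTER_NODE
    # may not exist in your environment.
    for var in ("AZ_BATCHAI_MPI_MASTER_NODE", "AZ_BATCH_MASTER_NODE", "MASTER_ADDR"):
        value = os.environ.get(var)
        if value:
            return value.split(":")[0]  # AZ_BATCH_MASTER_NODE can include a port (host:port)
    return default

os.environ["MASTER_ADDR"] = get_master_node_address()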

Running Tensorflow notebook getting error

When running the notebook, I'm getting this error at the "Submit and Monitor your run" section. The status of the experiment is 'Failed'.

It started out OK and went to status 'Running', but after a while this error is presented.
Can someone point me in the right direction?
I just don't understand where to look for a solution. I'm new to Azure Machine Learning and don't understand the "amlcompute" reference in the error message.


Extracting pretrained model...
Start to download GLUE dataset...

Downloading and extracting CoLA...
Completed!
Downloading and extracting SST...
Completed!
Processing MRPC...
Traceback (most recent call last):
File "download_glue_data.py", line 137, in
sys.exit(main(sys.argv[1:]))
File "download_glue_data.py", line 129, in main
format_mrpc(args.data_dir, args.path_to_mrpc)
File "download_glue_data.py", line 61, in format_mrpc
urllib.request.urlretrieve(MRPC_TRAIN, mrpc_train_file)
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/azureml-envs/azureml_d830dfb29fc815326bc8cd6e3a484d0b/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

The experiment failed. Finalizing run...
Logging experiment finalizing status in history service
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.10145998001098633 seconds

Error occurred: Compute target amlcompute is in state deleting and thus not ready for runs.

How to register PyTorch BERT model, create image, and deploy as service?

I have successfully finetuned PyTorch BERT on GLUE on an Azure cluster, and now I am trying to deploy the resulting fine-tuned model as a web service. However, so far my attempts have been unsuccessful.

Registering model

The Jupyter notebook, despite containing a cell titled

"### Find and register the best model\n",

is not actually registering a model in the Azure workspace, as far as I can tell.

I tried to do so using the code
run.register_model("fine_tuned", "outputs")

(this should save pytorch_model.bin as a model in the workspace).

Configure and create the image

After the model is registered, I need to create a Docker image. To do so, I do

from azureml.core.model import Model
from azureml.core.image import ContainerImage

model = Model(ws, 'fine_tuned')

image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  docker_file="Dockerfile",
                                                  runtime="python",
                                                  conda_file="myenv.yml",
                                                  dependencies=["config.json", "vocab.txt"])

image = ContainerImage.create(name="scorer-image",
                              models=[model],
                              image_config=image_config,
                              workspace=ws)

image.wait_for_creation()

The code above succeeds, but the resulting image does not work when run locally and crashes with:

FileNotFoundError: [Errno 2] No such file or directory: 'azureml-models/fine_tuned/1/outputs/bert_config.json'

I suspect this happens because I am not uploading the config.json and vocab.txt files to the same directory as the model file pytorch_model.bin, so the model can't be loaded successfully, e.g. with BertForSequenceClassification.from_pretrained.

Does anybody have suggestions on how to handle this? Should I convert the PyTorch model to ONNX format?
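One possible direction (a hedged sketch, not the repo's documented deployment path): copy config.json and vocab.txt next to pytorch_model.bin before registering, so the whole folder is uploaded and the scoring script can call BertForSequenceClassification.from_pretrained on it. The folder name and file locations below are assumptions.

import shutil
from azureml.core.model import Model

model_dir = "outputs/fine_tuned_model"          # assumed folder containing pytorch_model.bin
for extra_file in ("config.json", "vocab.txt"):
    shutil.copy(extra_file, model_dir)          # keep tokenizer/config next to the weights

registered_model = Model.register(workspace=ws,
                                  model_name="fine_tuned",
                                  model_path=model_dir)  # the entire folder is registered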

Error occurred: No module named 'tqdm'

I am trying to run Pretrained-BERT-GLUE.ipynb. However, while executing it I get the error "Error occurred: No module named 'tqdm'". I tried to specify pip packages, but it is still not working. Can anyone help me, please?
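A hedged sketch of one way to add the missing package: the PyTorch estimator used by these notebooks accepts a pip_packages argument, so listing tqdm there should install it into the run environment. The other arguments below are placeholders, not the notebook's exact values.

from azureml.train.dnn import PyTorch

estimator = PyTorch(source_directory=".",
                    compute_target=compute_target,            # assumed existing compute target
                    entry_script="run_classifier_azureml.py", # placeholder entry script
                    pip_packages=["tqdm"],                    # installs tqdm into the run environment
                    use_gpu=True)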

GPU service with Azure machine learning

Hello,

firstly, thanks for your work and sharing.

My situation is that I have already implemented fine-tuning of BERT and XLNet for my research task. However, I need GPUs to run and train the models, so may I ask you several questions about the Azure service?

Is Azure Machine Learning the only way to get GPU compute?
Can I get GPUs from an Azure virtual machine instead?

Which approach lets me use GPUs without modifying my code?

Thanks a lot and looking forward to your reply!

Best,
Y

question on time estimates for 'reasonable' p/retraining of 'reasonable' dataset size

Hi,
This is awesome - just what we were looking for.

We need quick estimates of the time 'T' it will take to (p)retrain BERT-en-lg (take BERT and continue training with a custom new corpus; I assume that will be better than training from scratch) with a dataset of size 'S' on a GPU of type 'G'.

Let's say we use the GPUs and Wikipedia dataset suggested in this notebook to pretrain a new model until it performs as well as the original BERT-en-lg on GLUE (or any other task).

How long will that take?

thanks much!

Can't use "local" computing in BERT_Eval_SQUAD.ipynb

Hello,
Since I am only on the free tier, I cannot use the GPU cluster in Azure.
So I modified the AzureML-BERT/finetune/PyTorch/notebooks/BERT_Eval_SQUAD.ipynb script a little, like below:

from azureml.train.dnn import PyTorch
from azureml.core.runconfig import RunConfiguration
from azureml.core.container_registry import ContainerRegistry

run_user_managed = RunConfiguration()
run_user_managed.environment.python.user_managed_dependencies = True

# Define custom Docker image info
image_name = 'mcr.microsoft.com/azureml/bert:pretrain-openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04'

estimator = PyTorch(source_directory='../../../',
                    compute_target="local",
                     #Docker image
                    use_docker=True,
                    custom_docker_image=image_name,
                    user_managed=True,
                    script_params = {
                          '--bert_model':'bert-large-uncased',
                          "--model_file_location": checkpoint_path,
                          '--model_file': 'bert_encoder_epoch_245.pt',
                          '--do_train' : '',
                          '--do_predict': '',
                          '--train_file': train_path,
                          '--predict_file': dev_path,
                          '--max_seq_length': 512,
                          '--train_batch_size': 8,
                          '--learning_rate': 3e-5,
                          '--num_train_epochs': 2.0,
                          '--doc_stride': 128,
                          '--seed': 32,
                          '--gradient_accumulation_steps':4,
                          '--warmup_proportion':0.25,
                          '--output_dir': './outputs',
                          '--fp16':'',
                          #'--loss_scale':128,
                    },
                    entry_script='./finetune/run_squad_azureml.py',
                    node_count=1,
                    process_count_per_node=4,
                    distributed_backend='mpi',
                    use_gpu=True)

# path to the Python environment in the custom Docker image
estimator._estimator_config.environment.python.interpreter_path = '/opt/miniconda/envs/amlbert/bin/python'
run = experiment.submit(estimator)
from azureml.widgets import RunDetails
RunDetails(run).show()

However, when I run this script, I get this error:

"error": {
  "message": {
    "error_details": {
      "correlation": {
        "operation": "0b58ac6218ccb845aeae7d20056dfba1",
        "request": "U4WAXOUHsWo="
      },
      "environment": "koreacentral",
      "error": {
        "code": "UserError",
        "message": "Communicators are not supported for local runs."
      },
      "location": "koreacentral",
      "time": "2019-11-03T03:11:31.494157+00:00"
    },
    "status_code": 400,
    "url": "https://koreacentral.experiments.azureml.net/execution/v1.0/subscriptions/8170d900-06ad-4a1d-babd-1a30120ea257/resourceGroups/Bertpipeline/providers/Microsoft.MachineLearningServices/workspaces/BertSquad/experiments/BERT-SQuAD/localrun?runId=BERT-SQuAD_1572750686_983087f7"
  }
}

Is there any way to handle this issue?
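One workaround to try (a hedged sketch, not verified against this repo): for a single-machine local run, drop the MPI communicator settings, since distributed_backend='mpi' with multiple processes is what triggers "Communicators are not supported for local runs". Whether run_squad_azureml.py itself still expects MPI environment variables is a separate question.

estimator = PyTorch(source_directory='../../../',
                    compute_target="local",
                    use_docker=True,
                    custom_docker_image=image_name,
                    user_managed=True,
                    script_params=script_params,          # the same dictionary shown above
                    entry_script='./finetune/run_squad_azureml.py',
                    node_count=1,
                    process_count_per_node=1,             # single process, no communicator needed
                    use_gpu=True)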

Thanks for providing such an amazing repo!

Can I run this project on my local machine?

I have downloaded the training data by following the dataprep instructions and created the Docker environment according to the Dockerfile.
But I encounter the following problem when I try to pretrain:

(amlbert) root@2d453b9b839f:~/pretrain/PyTorch# python train.py  --config_file ../configs/bert-large-single-node.json --path /out
The arguments are: ['train.py', '--config_file', '../configs/bert-large-single-node.json', '--path', '/out']
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    local_rank = get_local_rank()
  File "/root/pretrain/PyTorch/azureml_adapter.py", line 27, in get_local_rank
    return int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
  File "/opt/miniconda/envs/amlbert/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'OMPI_COMM_WORLD_LOCAL_RANK'

Did I misunderstand anything?
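A hedged workaround sketch, under the assumption that the script is being launched directly rather than via mpirun: the training code reads OpenMPI rank variables, so for a single-GPU local run you could either launch it as "mpirun -np 1 python train.py ..." or set the variables yourself before train.py reads them.

import os

# Assumed values for a single process on a single node; these mimic what
# mpirun/OpenMPI would normally export.
os.environ.setdefault("OMPI_COMM_WORLD_LOCAL_RANK", "0")
os.environ.setdefault("OMPI_COMM_WORLD_RANK", "0")
os.environ.setdefault("OMPI_COMM_WORLD_SIZE", "1")
os.environ.setdefault("OMPI_COMM_WORLD_LOCAL_SIZE", "1")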

Using RDMA-capable nodes

Is there a reason for using Standard_NC24s_v3 rather than the RDMA-capable Standard_NC24rs_v3?

Examples on a custom dataset

The repo shows how to fine-tune on standard datasets, but not how to run on your own data; this limits the utility of the project.

Issues with finetune examples

(screenshot attached to the original issue; not reproduced here)

Dedicated vs. Low-priority servers

Hi, I just have a question. Does this code handle being pre-empted on Azure ML low-priority VMs? I've read that BERT is particularly sensitive and requires the model, optimizer state, and dataset shuffling to all be saved and restored if pre-empted during pre-training.

total languages supported

I was going through Microsoft Azure BERT.

I want to implement it in my work, but before that I want to know how many languages are supported here.

Tensor size doesn't match.

Hello!

We are working on custom-corpus BERT pretraining. I followed the guide about data preparation (the texts should be good now); however, running the notebook gives the following error:

2 items cleaning up...
Cleanup took 0.0017843246459960938 seconds
06/28/2020 11:53:45 - INFO - __main__ -   Exiting context: ProjectPythonPath
Traceback (most recent call last):
  File "train.py", line 482, in <module>
    eval_loss = train(index)
  File "train.py", line 132, in train
    batch = next(dataloaders[dataset_type])
  File "train.py", line 47, in <genexpr>
    return (x for x in DataLoader(dataset, batch_size=train_batch_size // 2 if eval_set else train_batch_size,
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/opt/miniconda/envs/amlbert/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 144 and 128 in dimension 1 at /pytorch/aten/src/TH/generic/THTensorMoreMath.cpp:1307

which I don't understand perfectly in the current context.

We also tried running with the English Wikipedia corpus data; still the same error. We have tried both the large-cased and multilingual-cased vocabularies.

ds.upload causing AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed

ds.upload(src_dir='./squad', target_path='./squad')
results in the following error:

~/miniconda3/envs/azureml/lib/python3.6/site-packages/azureml/_vendor/azure_storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    304         except AzureException as ex:
    305             retry_context.exception = ex
--> 306             raise ex
    307         except Exception as ex:
    308             retry_context.exception = ex

~/miniconda3/envs/azureml/lib/python3.6/site-packages/azureml/_vendor/azure_storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    290             # and raised as an azure http exception
    291             _http_error_handler(
--> 292                 HTTPError(response.status, response.message, response.headers, response.body))
    293
    294     # Parse the response

~/miniconda3/envs/azureml/lib/python3.6/site-packages/azureml/_vendor/azure_storage/common/_error.py in _http_error_handler(http_error)
    113     ex.error_code = error_code
    114
--> 115     raise ex
    116
    117

AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed

Is it possible to run scripts with PyTorch distributed launch utility?

In your tutorial, you mentioned that:

The original run_squad.py script uses the PyTorch distributed launch utility to launch multiple processes across nodes and GPUs. We prepared a modified version, run_squad_azureml.py, so that we can launch it based on the AzureML built-in MPI backend.

On the AzureML service, is it possible to use the PyTorch distributed launch utility to launch multi-process scripts? i.e.,

python -m torch.distributed.launch script.py

Training PyTorch on GLUE fails on my configuration

I am training on the CoLA and MRPC GLUE tasks (https://github.com/Microsoft/AzureML-BERT/blob/master/PyTorch/Pretrained-BERT-GLUE.ipynb), and the training fails with:

The experiment failed. Finalizing run...
Logging experiment finalizing status in history service
02/15/2019 23:15:46 - INFO - __main__ - Exiting context: DaskOnBatch
02/15/2019 23:15:46 - INFO - __main__ - Exiting context: OutputCollection
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.20153117179870605 seconds
02/15/2019 23:15:47 - INFO - __main__ - Exiting context: ProjectPythonPath
Traceback (most recent call last):
  File "azureml-setup/context_manager_injector.py", line 152, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "azureml-setup/context_manager_injector.py", line 88, in execute_with_context
    runpy.run_path(sys.argv[0], globals(), run_name="__main__")
  File "/opt/miniconda/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/miniconda/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/miniconda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "run_classifier_azureml.py", line 626, in <module>
    main()
  File "run_classifier_azureml.py", line 569, in main
    lr_this_step = args.learning_rate * warmup_linear(comm.global_step/t_total, args.warmup_proportion)
AttributeError: 'DistributedCommunicator' object has no attribute 'global_step'

in azureml-logs/80_driver_log.txt, without running any epoch. I am using azureml-sdk version 1.0.15 on a Data Science Ubuntu VM, and I am not sure what the requirements are. Thanks for sharing this example!

validation data is missing

Hi,

I am trying to run train.py to train BERT. The dataloader expects a 'validation_512_only' folder for validation. It seems this folder is missing from the zip file. I cannot find it in the corresponding bertonazuremlwestus2 blob storage either.

Is there another path to the validation folder? Could someone help update the docs?

Thanks.

BERTopic not working on AzureML

Hello, has anyone successfully got BERTopic running on AzureML?

Environment: Azure ML 3.8

Having installed BERTopic (pip install bertopic), I then use the following starter code (from the BERTopic GitHub):

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After running for around 4 minutes, this gives the following error:
UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None

Any support would be gratefully received!

Regards,
James

Example configuration file

Hello, I am trying to pretrain BERT.
Can you provide an example configuration file (e.g. bert-large-single-node.json)?
Thank you

broken modified run_classifier.py link for hyperdrive

Hi,
I am trying to follow the tutorial to finetune BERT and perform hyperparameter tuning. The instructions refer to adding logging of metrics to run_classifier.py and link to a modified run_classifier.py, but that hyperlink is broken. Can you please share the modified run_classifier.py training script that has the logging included for HyperDrive? I appreciate the help. I have copy-pasted the instructions from the tutorial below:

Further within run_classifier.py, we log the learning rate, and the epoch training and eval loss the model achieves:

run.log('lr', np.float(args.learning_rate))

run.log('train_mean_loss', mean_loss)
run.log('eval_mean_loss', mean_loss)
run.log('train_example_loss', mean_loss)
run.log('eval_example_loss', mean_loss)
These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Tune model hyperparameters" section.

Let's first copy the modified run_classifier.py into our local project directory (this link is broken).
Thanks,
Sriram
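Since the linked script is unavailable, here is a hedged sketch of the metric logging the quoted instructions describe; args, mean_loss, and the exact placement inside run_classifier.py are assumptions rather than the repo's actual code.

from azureml.core.run import Run

run = Run.get_context()                        # the AzureML run this script executes in
run.log('lr', float(args.learning_rate))       # hyperparameter value for this trial
run.log('train_mean_loss', float(mean_loss))   # logged once per training epoch
run.log('eval_mean_loss', float(mean_loss))    # logged once per evaluation pass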
