Hello!
I noticed that you modified some parts of the code from the Huggingface repo, and I was wondering if you could explain the reasoning behind those changes.
For example, what changes were made to optimization.py, and why? It also seems like there's some similarity to the MT-DNN code, and I'm curious what your thought process was.
And for tokenization.py, why did you reimplement end-to-end tokenization as FullTokenizer as opposed to the original BertTokenizer?
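For context on what I mean by end-to-end tokenization: my understanding is that the FullTokenizer from Google's original BERT release chains a basic (lowercasing/whitespace) step with greedy longest-match-first WordPiece. A minimal sketch of that pipeline, with an illustrative toy vocabulary (the names `wordpiece` and `full_tokenize` and the vocab entries are mine, not from either repo):

```python
def wordpiece(token, vocab, unk="[UNK]"):
    # Greedy longest-match-first WordPiece: repeatedly take the longest
    # prefix found in the vocab; non-initial pieces carry a "##" prefix.
    pieces, start = [], 0
    while start < len(token):
        end, cur = len(token), None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            # No match at all for this position: the whole token becomes [UNK].
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

def full_tokenize(text, vocab):
    # "End-to-end": basic normalization/splitting, then WordPiece per token.
    out = []
    for tok in text.lower().split():
        out.extend(wordpiece(tok, vocab))
    return out

vocab = {"un", "##aff", "##able", "hello"}
print(full_tokenize("Hello unaffable", vocab))
# ['hello', 'un', '##aff', '##able']
```

My (possibly wrong) impression is that BertTokenizer exposes the same two stages separately, so I'm curious whether the single end-to-end class was for convenience or something subtler.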
I just want to get a better understanding of the code, and I'd really appreciate a response!