I'm running into unexpected behavior of the text tokenizer, running this on Windows, P

Text Tokenizer is fragmenting words about min-dalle HOT 13 CLOSED

Kreevoz commented on August 22, 2024

Text Tokenizer is fragmenting words

from min-dalle.

Comments (13)

Kreevoz commented on August 22, 2024

Yep, just freshly cloned your repo to make sure it's all okay, and it is. Windows users may rejoice now!

kuprel commented on August 22, 2024

Strange, this is what I get from the tokenizer for that prompt:

tokenizing text
['Ġa']
['Ġcomfy']
['Ġchair']
['Ġthat']
['Ġlooks']
['Ġlike']
['Ġan']
['Ġavocado']
text tokens [0, 58, 29872, 2408, 766, 4126, 1572, 101, 16632, 2]

Kreevoz commented on August 22, 2024

Strange indeed... I can't explain this behavior either. It doesn't seem to be an issue with parsing the text from the command line. I'll try setting up a few virtual environments with different Python versions. 🤔

Kreevoz commented on August 22, 2024

Exact same result using Python 3.9 in a fresh virtual environment.
Edit: Tested 3.9.13 and 3.9.7, same behavior, so it is not a recent bug I suppose?

Do any packages here look out of order?
(Other than the jaxlib which I installed manually from https://github.com/cloudhan/jax-windows-builder )

(vENV39) PS Y:\min-dalle> pip freeze
absl-py==1.1.0
certifi==2022.6.15
charset-normalizer==2.0.12
chex==0.1.3
cycler==0.11.0
dm-tree==0.1.7
etils==0.6.0
flatbuffers==2.0
flax==0.4.2
fonttools==4.33.3
idna==3.3
importlib-resources==5.8.0
jax==0.3.14
jaxlib @ file:///Y:/jaxlib-0.3.14%2Bcuda11.cudnn82-cp39-none-win_amd64.whl
kiwisolver==1.4.3
matplotlib==3.5.2
msgpack==1.0.4
numpy==1.23.0
opt-einsum==3.3.0
optax==0.1.2
packaging==21.3
Pillow==9.1.1
pyparsing==3.0.9
python-dateutil==2.8.2
requests==2.28.0
scipy==1.8.1
six==1.16.0
toolz==0.11.2
torch==1.12.0+cu116
torchaudio==0.12.0+cu116
torchvision==0.13.0+cu116
typing_extensions==4.2.0
urllib3==1.26.9
zipp==3.8.0

Kreevoz commented on August 22, 2024

I've added a couple more print statements to see what the tokenizer is up to:

tokenizing text
['Ġ', 'a']
['Ġ', 'a']
['Ġ', 'a']
['Ġ', 'c', 'o', 'm', 'f', 'y']
['Ġ', 'c', 'o', 'm', 'f', 'y']
['Ġ', 'c', 'om', 'f', 'y']
['Ġ', 'com', 'f', 'y']
['Ġ', 'com', 'fy']
['Ġ', 'com', 'fy']
['Ġ', 'c', 'h', 'a', 'i', 'r']
['Ġ', 'c', 'h', 'a', 'i', 'r']
['Ġ', 'ch', 'a', 'i', 'r']
['Ġ', 'ch', 'a', 'ir']
['Ġ', 'ch', 'air']
['Ġ', 'chair']
['Ġ', 'chair']
['Ġ', 't', 'h', 'a', 't']
['Ġ', 't', 'h', 'a', 't']
['Ġ', 't', 'h', 'at']
['Ġ', 'th', 'at']
['Ġ', 'th', 'at']
['Ġ', 'l', 'o', 'o', 'k', 's']
['Ġ', 'l', 'o', 'o', 'k', 's']
['Ġ', 'l', 'o', 'ok', 's']
['Ġ', 'l', 'ook', 's']
['Ġ', 'look', 's']
['Ġ', 'look', 's']
['Ġ', 'l', 'i', 'k', 'e']
['Ġ', 'l', 'i', 'k', 'e']
['Ġ', 'l', 'ik', 'e']
['Ġ', 'l', 'ike']
['Ġ', 'like']
['Ġ', 'like']
['Ġ', 'a', 'n']
['Ġ', 'a', 'n']
['Ġ', 'an']
['Ġ', 'an']
['Ġ', 'a', 'v', 'o', 'c', 'a', 'd', 'o']
['Ġ', 'a', 'v', 'o', 'c', 'a', 'd', 'o']
['Ġ', 'a', 'v', 'o', 'c', 'ad', 'o']
['Ġ', 'a', 'v', 'oc', 'ad', 'o']
['Ġ', 'av', 'oc', 'ad', 'o']
['Ġ', 'av', 'oc', 'ado']
['Ġ', 'av', 'oc', 'ado']

Why would your tokenizer fail to complete the long words when executing on my hardware? 😵

kuprel commented on August 22, 2024

Not sure. It works properly in colab too: https://colab.research.google.com/github/kuprel/min-dalle/blob/main/min_dalle.ipynb

alexx-km commented on August 22, 2024

I've got the same issue, also on Windows (running on CPU only as I have an AMD GPU)... I'll check my packages if there are any other similarities between your setup and mine!

Kreevoz commented on August 22, 2024

I found a pattern for this bug!

Vocabulary entries that begin with Ġ are not being matched by the tokenizer.
Tokens that do not start with that symbol assemble successfully into long words. (I do not know why the list of tokens contains both types of entries.)

For instance: "project", "record", "management" will assemble into valid tokens.
But: "projections", "recordings", "manage" will not, because they are only listed as "Ġprojections", "Ġrecordings", "Ġmanage" in the json files.

That is why it breaks up words into such odd chunks. It can only pick the ones that start without that special character!

So there must be a platform difference between Linux and Windows in how that accented Ġ is parsed.
Can you account for this in your tokenizer? Can we strip that out?
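
For illustration, the pattern can be reproduced with a toy byte-pair merge loop. This is a minimal sketch, not min-dalle's actual tokenizer: the `bpe` function, the five-entry merge table, and the `"Ä\u00a0"` stand-in for a mis-read Ġ are all assumptions made up for the example. It shows that when the Ġ in the loaded merge table no longer compares equal to the Ġ the tokenizer prepends, every merge still succeeds except the final, Ġ-prefixed one.

```python
def bpe(word, merges):
    """Toy greedy byte-pair merge; earlier pairs in `merges` have priority."""
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = ["Ġ"] + list(word)  # the tokenizer prepends the space marker
    while len(tokens) > 1:
        pairs = list(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no adjacent pair is listed in the merge table
        i = pairs.index(best)
        tokens = tokens[:i] + [best[0] + best[1]] + tokens[i + 2:]
    return tokens


# Hypothetical merge table, just large enough to build "Ġchair".
MERGES = [("c", "h"), ("i", "r"), ("a", "ir"), ("ch", "air"), ("Ġ", "chair")]

# The same table with the marker replaced by a stand-in for a mis-read Ġ.
BAD = "Ä\u00a0"
GARBLED = [(a.replace("Ġ", BAD), b) for a, b in MERGES]

print(bpe("chair", MERGES))   # ['Ġchair']
print(bpe("chair", GARBLED))  # ['Ġ', 'chair']
```

The second result has the same shape as the fragmented output earlier in the thread: the word is assembled, but it never joins its leading Ġ.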

kuprel commented on August 22, 2024

Do you get that Ġ character when you run this in Python? print(chr(ord(" ") + 256))

Kreevoz commented on August 22, 2024

Affirmative.

>>> print(chr(ord(" ") + 256))
Ġ
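
For reference, that character is U+0120, which can be checked with the standard library alone. A small sketch: it only confirms that the interpreter builds the marker correctly in memory, so any mismatch would have to come from somewhere else, such as how the vocabulary files are read.

```python
import unicodedata

marker = chr(ord(" ") + 256)      # 32 + 256 = 288 = 0x120

print(hex(ord(marker)))           # 0x120
print(unicodedata.name(marker))   # LATIN CAPITAL LETTER G WITH DOT ABOVE
print(marker.encode("utf-8"))     # b'\xc4\xa0'
```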

kuprel commented on August 22, 2024

If you can figure out what will make it work on Windows, let me know. I don't have any Windows machines.

Kreevoz commented on August 22, 2024

Yes, I got a fix. It was one of those annoying OS-specific things indeed.

You need to explicitly specify that the json files get parsed as utf-8.

On Windows, open() falls back to the system locale's encoding unless one is specified (usually cp1252 or similar for English installs, different code pages for other languages). This causes the accented G to get garbled.
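
The garbling can be reproduced without touching any files. A minimal sketch, assuming the locale codec is cp1252 (it varies by install): the UTF-8 bytes of Ġ decode without error into two different characters, so a key such as "Ġchair" no longer matches.

```python
# UTF-8 encodes Ġ (U+0120) as the two bytes C4 A0.
utf8_bytes = "Ġchair".encode("utf-8")

# Decoding those bytes as cp1252 (assumed here as the locale codec) raises
# no error; it silently yields 'Ä' followed by a no-break space.
garbled = utf8_bytes.decode("cp1252")

print(repr(garbled))         # 'Ä\xa0chair'
print(garbled == "Ġchair")   # False
```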

The fix is easily added in lines 16, 18 and 20 in the ./min_dalle/min_dalle.py file:

        # The only change is encoding='utf8' on each open() call.
        with open(os.path.join(model_path, 'config.json'), 'r', encoding='utf8') as f:
            self.config = json.load(f)
        with open(os.path.join(model_path, 'vocab.json'), 'r', encoding='utf8') as f:
            vocab = json.load(f)
        with open(os.path.join(model_path, 'merges.txt'), 'r', encoding='utf8') as f:
            ...  # body of this block is unchanged

This should not negatively impact how the code executes under Linux. With the change, the output on Windows matches your examples and the tokens are correct.
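
The effect of the explicit encoding can be checked end to end with a throwaway file. This is a sketch under stated assumptions: `load_json_utf8` is a hypothetical helper, the one-entry file is a stand-in for the real vocab.json, and cp1252 stands in for whatever the Windows locale codec happens to be.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Hypothetical helper: read a JSON file as UTF-8 regardless of locale."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")

    # Stand-in vocab file with a raw (unescaped) Ġ, written as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"Ġchair": 2408}, f, ensure_ascii=False)

    vocab = load_json_utf8(path)

    # Simulate the locale default by forcing cp1252 (an assumption).
    with open(path, "r", encoding="cp1252") as f:
        vocab_cp1252 = json.load(f)

print("Ġchair" in vocab)          # True
print("Ġchair" in vocab_cp1252)   # False
```

Passing the encoding explicitly makes the result independent of the machine's locale, which is why the same call behaves identically on Linux and Windows.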

kuprel commented on August 22, 2024

Awesome, thanks. I just updated it. Does it work now?
