Comments (13)

Kreevoz commented on August 22, 2024

Yep, just freshly cloned your repo to make sure it's all okay, and it is. Windows users may rejoice now!

from min-dalle.

kuprel commented on August 22, 2024

Strange, this is what I get from the tokenizer for that prompt:

tokenizing text
['Ġa']
['Ġcomfy']
['Ġchair']
['Ġthat']
['Ġlooks']
['Ġlike']
['Ġan']
['Ġavocado']
text tokens [0, 58, 29872, 2408, 766, 4126, 1572, 101, 16632, 2]

Kreevoz commented on August 22, 2024

Strange indeed... I can't explain this behavior either. It doesn't seem to be an issue with parsing the text from the command line. I'll try setting up a few virtual environments with different Python versions. 🤔

Kreevoz commented on August 22, 2024

Exact same result using Python 3.9 in a fresh virtual environment.
Edit: Tested 3.9.13 and 3.9.7, same behavior, so it is not a recent bug I suppose?

Do any packages here look out of order?
(Other than jaxlib, which I installed manually from https://github.com/cloudhan/jax-windows-builder)

(vENV39) PS Y:\min-dalle> pip freeze
absl-py==1.1.0
certifi==2022.6.15
charset-normalizer==2.0.12
chex==0.1.3
cycler==0.11.0
dm-tree==0.1.7
etils==0.6.0
flatbuffers==2.0
flax==0.4.2
fonttools==4.33.3
idna==3.3
importlib-resources==5.8.0
jax==0.3.14
jaxlib @ file:///Y:/jaxlib-0.3.14%2Bcuda11.cudnn82-cp39-none-win_amd64.whl
kiwisolver==1.4.3
matplotlib==3.5.2
msgpack==1.0.4
numpy==1.23.0
opt-einsum==3.3.0
optax==0.1.2
packaging==21.3
Pillow==9.1.1
pyparsing==3.0.9
python-dateutil==2.8.2
requests==2.28.0
scipy==1.8.1
six==1.16.0
toolz==0.11.2
torch==1.12.0+cu116
torchaudio==0.12.0+cu116
torchvision==0.13.0+cu116
typing_extensions==4.2.0
urllib3==1.26.9
zipp==3.8.0

Kreevoz commented on August 22, 2024

I've added a couple more print statements to see what the tokenizer is up to:

tokenizing text
['Ġ', 'a']
['Ġ', 'a']
['Ġ', 'a']
['Ġ', 'c', 'o', 'm', 'f', 'y']
['Ġ', 'c', 'o', 'm', 'f', 'y']
['Ġ', 'c', 'om', 'f', 'y']
['Ġ', 'com', 'f', 'y']
['Ġ', 'com', 'fy']
['Ġ', 'com', 'fy']
['Ġ', 'c', 'h', 'a', 'i', 'r']
['Ġ', 'c', 'h', 'a', 'i', 'r']
['Ġ', 'ch', 'a', 'i', 'r']
['Ġ', 'ch', 'a', 'ir']
['Ġ', 'ch', 'air']
['Ġ', 'chair']
['Ġ', 'chair']
['Ġ', 't', 'h', 'a', 't']
['Ġ', 't', 'h', 'a', 't']
['Ġ', 't', 'h', 'at']
['Ġ', 'th', 'at']
['Ġ', 'th', 'at']
['Ġ', 'l', 'o', 'o', 'k', 's']
['Ġ', 'l', 'o', 'o', 'k', 's']
['Ġ', 'l', 'o', 'ok', 's']
['Ġ', 'l', 'ook', 's']
['Ġ', 'look', 's']
['Ġ', 'look', 's']
['Ġ', 'l', 'i', 'k', 'e']
['Ġ', 'l', 'i', 'k', 'e']
['Ġ', 'l', 'ik', 'e']
['Ġ', 'l', 'ike']
['Ġ', 'like']
['Ġ', 'like']
['Ġ', 'a', 'n']
['Ġ', 'a', 'n']
['Ġ', 'an']
['Ġ', 'an']
['Ġ', 'a', 'v', 'o', 'c', 'a', 'd', 'o']
['Ġ', 'a', 'v', 'o', 'c', 'a', 'd', 'o']
['Ġ', 'a', 'v', 'o', 'c', 'ad', 'o']
['Ġ', 'a', 'v', 'oc', 'ad', 'o']
['Ġ', 'av', 'oc', 'ad', 'o']
['Ġ', 'av', 'oc', 'ado']
['Ġ', 'av', 'oc', 'ado']

Why would your tokenizer fail to complete the long words when executing on my hardware? 😵
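
One way to read the trace above: the pieces merge step by step, as in byte-pair encoding, but the final merge that would attach the leading Ġ never happens. The following is a toy sketch of that behavior, not min-dalle's tokenizer. The function bpe_merge and the merge table are hypothetical and cover a single word. It only illustrates that if the entries containing Ġ fail to match, merging stops exactly where the trace stops.

```python
SPACE_MARK = "\u0120"  # the "Ġ" character that marks a leading space


def bpe_merge(symbols, merge_ranks):
    """Greedily apply the best-ranked adjacent merge until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Lowest rank means highest priority; unknown pairs rank as infinity.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table for one word, in priority order
# (an assumption for illustration, not the real merges.txt).
merges = [("c", "h"), ("i", "r"), ("a", "ir"), ("ch", "air"), (SPACE_MARK, "chair")]
intact_ranks = {pair: rank for rank, pair in enumerate(merges)}

# Same table, with the marker swapped for a stand-in for a misread character.
garbled_ranks = {
    tuple("?" if s == SPACE_MARK else s for s in pair): rank
    for pair, rank in intact_ranks.items()
}

word = [SPACE_MARK] + list("chair")
print(bpe_merge(word, intact_ranks))   # merges all the way to a single token
print(bpe_merge(word, garbled_ranks))  # stops with the marker left unmerged
```

With the intact table the word collapses into one token. With the altered table the result keeps the marker separate from 'chair', which is the shape of the output shown above.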

kuprel commented on August 22, 2024

Not sure. It works properly in Colab too: https://colab.research.google.com/github/kuprel/min-dalle/blob/main/min_dalle.ipynb

alexx-km commented on August 22, 2024

I've got the same issue, also on Windows (running on CPU only, as I have an AMD GPU)... I'll check my packages to see if there are any other similarities between your setup and mine!

Kreevoz commented on August 22, 2024

I found a pattern for this bug!

Vocabulary entries that begin with Ġ are not being matched by the tokenizer.
Tokens that do not start with that symbol assemble successfully into long words. (I do not know why the list of tokens contains both types of entries.)

For instance: "project", "record", "management" will assemble into valid tokens.
But: "projections", "recordings", "manage" will not, because they are only listed as "Ġprojections", "Ġrecordings", "Ġmanage" in the json files.

That is why it breaks words up into such odd chunks. It can only pick the entries that do not start with that special character!

So there must be a platform difference between Linux and Windows in how that accented Ġ is parsed.
Can you account for this in your tokenizer? Can we strip that out?

kuprel commented on August 22, 2024

Do you get that Ġ character when you run this in python? print(chr(ord(" ") + 256))

Kreevoz commented on August 22, 2024

Affirmative.

>>> print(chr(ord(" ") + 256))
Ġ
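
Note that this check only shows the interpreter can build and print the character. It says nothing about how bytes read from a file are decoded. A minimal sketch of the difference, using only the standard library (the variable names are illustrative):

```python
import unicodedata

marker = chr(ord(" ") + 256)  # the expression from the check above
print(hex(ord(marker)), unicodedata.name(marker))

# How the character is stored in a UTF-8 encoded file.
utf8_bytes = marker.encode("utf-8")
print(utf8_bytes)

# Decoding the same bytes with a legacy Windows code page gives different text.
misread = utf8_bytes.decode("cp1252")
print(repr(misread), misread == marker)
```

The character is a single code point in memory but two bytes in a UTF-8 file, and those two bytes read as two unrelated characters under cp1252.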

kuprel commented on August 22, 2024

If you can figure out what will make it work on Windows, let me know. I don't have any Windows machines.

Kreevoz commented on August 22, 2024

Yes, I've got a fix. It was indeed one of those annoying OS-specific things.

You need to explicitly specify that the model files (config.json, vocab.json and merges.txt) are read as UTF-8.

On Windows, open() defaults to the system locale encoding unless one is specified (usually cp1252 or similar for English installs, other code pages for other languages). This causes the Ġ to get garbled.

The fix is easily added in lines 16, 18 and 20 in the ./min_dalle/min_dalle.py file:

        with open(os.path.join(model_path, 'config.json'), 'r', encoding='utf8') as f: 
            self.config = json.load(f)
        with open(os.path.join(model_path, 'vocab.json'), 'r', encoding='utf8') as f:
            vocab = json.load(f)
        with open(os.path.join(model_path, 'merges.txt'), 'r', encoding='utf8') as f:

This should not negatively impact how the code executes under Linux. With the change, the output on Windows matches your examples and the tokens are correct.
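
The effect can be reproduced on any platform. The sketch below assumes a hypothetical one-entry vocabulary as a stand-in for vocab.json, and names cp1252 explicitly to simulate a Windows default, so it does not depend on the local locale:

```python
import json
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
print("default text encoding:", locale.getpreferredencoding(False))

# Hypothetical one-entry vocabulary, standing in for vocab.json.
vocab = {"\u0120manage": 0}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")
    # ensure_ascii=False writes the raw character instead of a \u0120 escape.
    with open(path, "w", encoding="utf8") as f:
        json.dump(vocab, f, ensure_ascii=False)

    # Simulate the Windows default by naming cp1252 explicitly.
    with open(path, "r", encoding="cp1252") as f:
        misread = json.load(f)
    # The fix: always decode the file as UTF-8.
    with open(path, "r", encoding="utf8") as f:
        correct = json.load(f)

print("\u0120manage" in misread)  # lookup fails on the misread vocabulary
print("\u0120manage" in correct)  # lookup succeeds with explicit UTF-8
```

The misread dictionary still loads without an error, which is why the problem only shows up later as unmatched tokens and not as an exception.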

kuprel commented on August 22, 2024

Awesome, thanks. I just updated it. Does it work now?
