sayakpaul / mlp-mixer-cifar10

Implements MLP-Mixer (https://arxiv.org/abs/2105.01601) with the CIFAR-10 dataset.

License: Apache License 2.0

Jupyter Notebook 100.00%

vision mlp tensorflow keras non-conv non-self-attention

mlp-mixer-cifar10's Introduction

MLP-Mixer-CIFAR10

This repository implements MLP-Mixer as proposed in "MLP-Mixer: An all-MLP Architecture for Vision". The paper introduces an all-MLP (multi-layer perceptron) architecture for computer vision tasks. Yannic Kilcher walks through the architecture in this video.

Experiments reported in this repository are on CIFAR-10.

What's included?

Distributed training with mixed-precision.
Visualization of the token-mixing MLP weights.
A TensorBoard callback to keep track of the learned linear projections of the image patches.


Notebooks

MLP_Mixer_Training.ipynb: MLP-Mixer utilities along with model training.
ResNet20.ipynb: Trains a ResNet20 for comparison purposes.
Visualization.ipynb: Visualizes the learned projections and token-mixing MLPs.

Note: These notebooks are runnable on Colab. If you don't have access to a tensor-core GPU, please disable the mixed-precision block while running the code.
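
As a minimal sketch of how the mixed-precision block could be made conditional instead of being disabled by hand: the helper name `choose_precision_policy` and the way the compute capability is obtained are assumptions for illustration, not code from the notebooks. It relies on the fact that NVIDIA Tensor Cores are available from compute capability 7.0 upward.

```python
def choose_precision_policy(compute_capability):
    """Pick a Keras dtype policy name from a GPU's compute capability.

    `compute_capability` is a (major, minor) tuple, or None when no GPU
    is present. Tensor Cores exist from compute capability 7.0 onward.
    """
    if compute_capability is not None and tuple(compute_capability) >= (7, 0):
        return "mixed_float16"
    return "float32"


# Hypothetical usage, assuming a TensorFlow version that provides these
# calls (kept as comments so the helper runs without TensorFlow):
#   gpus = tf.config.list_physical_devices("GPU")
#   details = tf.config.experimental.get_device_details(gpus[0]) if gpus else {}
#   policy = choose_precision_policy(details.get("compute_capability"))
#   tf.keras.mixed_precision.set_global_policy(policy)

print(choose_precision_policy((7, 5)))  # e.g. a T4, which has Tensor Cores
print(choose_precision_policy((3, 7)))  # e.g. a K80, which does not
print(choose_precision_policy(None))    # no GPU at all
```

Falling back to `float32` keeps the rest of the training code unchanged on hardware without Tensor Cores.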

Results

MLP-Mixer achieves competitive results. The figure below summarizes top-1 accuracies on the CIFAR-10 test set as the number of MLP-Mixer blocks varies.

Notable hyperparameters are:

Image size: 72x72
Patch size: 9x9
Hidden dimension for patches: 64
Hidden dimension for channels: 128
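
The image and patch sizes listed above fix the sequence length that the Mixer blocks operate on. The sketch below works through that arithmetic. The function names are hypothetical, and it assumes non-overlapping square patches on RGB images.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def patch_vector_length(patch_size, channels=3):
    """Number of values in one flattened patch (RGB by default)."""
    return patch_size * patch_size * channels


print(num_patches(72, 9))      # 8 patches per side -> 64 patches
print(patch_vector_length(9))  # 9 * 9 * 3 = 243 values per patch
```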

The table below reports the parameter counts for the different MLP-Mixer variants:

ResNet20 (0.571969 million parameters) achieves 78.14% under the exact same training configuration. Refer to this notebook for more details.

Models

You can reproduce the results reported above. The model files are available here.

Acknowledgements

ML-GDE Program for providing GCP credits.

mlp-mixer-cifar10's Issues

Could the number of patches differ from the token-mixing MLP dimension?

I am trying to change the model into the B/16 MLP-Mixer.
In this setting, the number of patches (sequence length) != the token-mixing MLP dimension.
But the code reports an error when it runs "x = layers.Add()([x, token_mixing])" because the two operands have different shapes.
For example,
B/16 settings:
image 32x32, hidden dimension 768, P*P = 16*16, token-mixing MLP dimension = 384, channel-mixing MLP dimension = 3072.
Thus the number of patches (sequence length) = 4, and the input table has shape (4, 768).
When the code runs x = layers.Add()([x, token_mixing]) in the token-mixing layer:
x shape = [4, 768], token_mixing shape = [384, 768]

It is strange that the MLP-Mixer paper can use settings where "number of patches (sequence length) != token-mixing MLP dimension".
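
In the paper's formulation, the token-mixing MLP has two dense layers: the first expands the patch axis to the token-mixing dimension and the second projects it back to the number of patches, so the residual addition sees matching shapes even when the two numbers differ. The sketch below only traces shapes. The helper names and the `project_back` flag are hypothetical, and leaving out the second projection is just one possible way to arrive at the shape reported in the issue above, not a statement about the repository's code.

```python
def dense(shape, units):
    """Shape after a dense layer, which acts on the last axis only."""
    return shape[:-1] + (units,)


def transpose(shape):
    """Shape after swapping the two axes of a 2-D table."""
    rows, cols = shape
    return (cols, rows)


def token_mixing_shape(x_shape, tokens_mlp_dim, project_back=True):
    """Trace shapes through a token-mixing MLP.

    x_shape is (num_patches, hidden_dim), without the batch axis.
    """
    patches, _ = x_shape
    h = transpose(x_shape)        # (hidden_dim, num_patches)
    h = dense(h, tokens_mlp_dim)  # expand the patch axis
    if project_back:
        h = dense(h, patches)     # project back to num_patches
    return transpose(h)           # back to (patch axis, hidden_dim)


x_shape = (4, 768)  # B/16 on 32x32 inputs, as in the issue
print(token_mixing_shape(x_shape, 384))                      # same shape as x
print(token_mixing_shape(x_shape, 384, project_back=False))  # mismatched shape
```

With the projection in place, the output shape equals the input shape for any token-mixing dimension, which is what the residual addition requires.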

Excuse me, why is the batch size None in the implementation?

Why does the accuracy drop after epoch 100/100 (from 91% to 71%)?

I trained the network (NUM_MIXER_LAYERS = 4).

At epoch 100:

Epoch 100/100

1/44 [..............................] - ETA: 1s - loss: 0.2472 - accuracy: 0.9160
3/44 [=>............................] - ETA: 1s - loss: 0.2424 - accuracy: 0.9162
5/44 [==>...........................] - ETA: 1s - loss: 0.2431 - accuracy: 0.9155
7/44 [===>..........................] - ETA: 1s - loss: 0.2424 - accuracy: 0.9154
9/44 [=====>........................] - ETA: 1s - loss: 0.2419 - accuracy: 0.9155
11/44 [======>.......................] - ETA: 1s - loss: 0.2423 - accuracy: 0.9150
13/44 [=======>......................] - ETA: 1s - loss: 0.2426 - accuracy: 0.9145
15/44 [=========>....................] - ETA: 1s - loss: 0.2430 - accuracy: 0.9142
17/44 [==========>...................] - ETA: 1s - loss: 0.2433 - accuracy: 0.9140
19/44 [===========>..................] - ETA: 1s - loss: 0.2435 - accuracy: 0.9138
21/44 [=============>................] - ETA: 0s - loss: 0.2438 - accuracy: 0.9136
23/44 [==============>...............] - ETA: 0s - loss: 0.2439 - accuracy: 0.9135
25/44 [================>.............] - ETA: 0s - loss: 0.2440 - accuracy: 0.9134
27/44 [=================>............] - ETA: 0s - loss: 0.2440 - accuracy: 0.9133
29/44 [==================>...........] - ETA: 0s - loss: 0.2442 - accuracy: 0.9132
31/44 [====================>.........] - ETA: 0s - loss: 0.2445 - accuracy: 0.9130
33/44 [=====================>........] - ETA: 0s - loss: 0.2447 - accuracy: 0.9129
35/44 [======================>.......] - ETA: 0s - loss: 0.2450 - accuracy: 0.9127
37/44 [========================>.....] - ETA: 0s - loss: 0.2454 - accuracy: 0.9125
39/44 [=========================>....] - ETA: 0s - loss: 0.2459 - accuracy: 0.9123
41/44 [==========================>...] - ETA: 0s - loss: 0.2463 - accuracy: 0.9121
43/44 [============================>.] - ETA: 0s - loss: 0.2469 - accuracy: 0.9119
44/44 [==============================] - 2s 46ms/step - loss: 0.2474 - accuracy: 0.9117 - val_loss: 1.1145 - val_accuracy: 0.7226

Then it still runs an extra training pass:
1/313 [..............................] - ETA: 24:32 - loss: 0.5860 - accuracy: 0.8125
8/313 [..............................] - ETA: 2s - loss: 1.2071 - accuracy: 0.6953
.....
313/313 [==============================] - ETA: 0s - loss: 1.0934 - accuracy: 0.7161
313/313 [==============================] - 12s 22ms/step - loss: 1.0934 - accuracy: 0.7161
Test accuracy: 71.61
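
The step counts in the log above hint at what each progress bar measures. The sketch below rests on assumptions not stated in the issue: a 45,000/5,000 train/validation split of the 50,000 CIFAR-10 training images and a training batch size of 1024. Under those assumptions, 44 steps correspond to one training epoch, while 313 steps correspond to a single pass over the 10,000 test images at the Keras default evaluation batch size of 32. On that reading, the second bar would be an evaluation pass rather than further training, and the 71.61% test accuracy is in line with the 72.26% validation accuracy at epoch 100, whereas 91% is the training accuracy.

```python
import math


def steps_per_pass(num_examples, batch_size):
    """Batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


# Assumed training setup: 45,000 images, batch size 1024.
print(steps_per_pass(45_000, 1024))

# CIFAR-10 test set: 10,000 images, Keras default evaluation batch size 32.
print(steps_per_pass(10_000, 32))

# The first-step accuracy of 0.8125 in the second bar equals 26 correct
# predictions out of 32, which is consistent with batches of 32.
print(26 / 32)
```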

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA

Excuse me, when I try to run it on the server, it prints:

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
2021-11-21 11:59:20.861052: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.

BTW, my TensorFlow version is 2.4.0. How can I fix this problem?
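
The message quoted above already spells out the remedy: set the auto-shard policy to DATA on an options object and apply it to the dataset. A minimal sketch of those steps follows. The helper name `shard_by_data` is hypothetical, and it takes the options object and the policy as arguments so that the snippet itself does not need TensorFlow to run.

```python
def shard_by_data(dataset, options, data_policy):
    """Ask tf.data to shard `dataset` by elements rather than by input files.

    dataset:     a tf.data.Dataset (anything exposing `with_options`)
    options:     a fresh tf.data.Options() instance
    data_policy: the AutoShardPolicy.DATA enum value
    """
    options.experimental_distribute.auto_shard_policy = data_policy
    return dataset.with_options(options)


# Hypothetical usage, following the warning text (requires TensorFlow;
# `train_ds` stands for whatever dataset is passed to model.fit):
#   import tensorflow as tf
#   train_ds = shard_by_data(
#       train_ds,
#       tf.data.Options(),
#       tf.data.experimental.AutoShardPolicy.DATA,
#   )
```

The same call would need to be applied to each dataset handed to the distributed training loop, for example the validation and test datasets as well.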
