Model Initialization

If possible, could you please share the exact initialization parameters passed into ReXNetV1 in order to create ReXNetV1_2.0? The options (and defaults) are:

  • input_ch=16
  • final_ch=180
  • width_mult=1.0
  • depth_mult=1.0
  • classes=1000
  • use_se=True
  • se_ratio=12
  • dropout_ratio=0.2
  • bn_momentum=0.9

My understanding is that width_mult should be set to 2. However, doing so and then attempting to load the provided weights for the 2.0 model leaves many saved weights misaligned with the declared model weights. The paper doesn't give clear guidance on this either, but I think it could be resolved easily, the way ResNet and EfficientNet provide convenience methods to build each version of their network, e.g.:
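Independently of any builder helper, one way to narrow this down might be to compare tensor shapes between the declared model and the saved checkpoint. The sketch below is only an illustration: `diff_state_shapes` is a hypothetical helper, and the parameter names and shapes are made up. Real ones would come from `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def diff_state_shapes(model_shapes, ckpt_shapes):
    """Compare two {parameter name: shape} mappings and report disagreements."""
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    mismatched = {
        k: (tuple(model_shapes[k]), tuple(ckpt_shapes[k]))
        for k in model_shapes
        if k in ckpt_shapes and tuple(model_shapes[k]) != tuple(ckpt_shapes[k])
    }
    return missing, unexpected, mismatched


# Hypothetical names and shapes, for illustration only.
model_shapes = {"features.0.weight": (64, 3, 3, 3), "output.1.weight": (1000, 2560, 1, 1)}
ckpt_shapes = {"features.0.weight": (64, 3, 3, 3), "output.1.weight": (1000, 2048, 1, 1)}

missing, unexpected, mismatched = diff_state_shapes(model_shapes, ckpt_shapes)
print(mismatched)
```

The first mismatched layer usually hints at which constructor argument differs from the one used to produce the checkpoint.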

The results of MNN and ONNX differ from PyTorch

Thanks for your great work.
I use ReXNet to train a classification network,
and I convert it from PyTorch -> ONNX -> MNN.
The ONNX and MNN results are the same, but they differ from the PyTorch result.

Can ReXNet be converted to ONNX normally while keeping the results the same?

ONNX export failed: Couldn't export Python operator SiLUJitImplementation

Thanks for sharing your wonderful work.

When I convert the model to ONNX, I encounter a problem: ONNX export failed: Couldn't export Python operator SiLUJitImplementation.
The conversion code is as follows:

import torch

def pth2onnx(model_name, model, input_shape):
    # Names attached to the graph's input and output tensors.
    input_names = ["input"]
    output_names = ["output"]
    # Dummy CPU input used only to trace the model.
    dummy_input = torch.randn(input_shape).cpu()
    m_model = model.cpu()
    torch.onnx.export(m_model, dummy_input, model_name, input_names=input_names, output_names=output_names, verbose=False, opset_version=11)

I am using the latest code.
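One workaround that may apply here, assuming the failure comes from a custom `torch.autograd.Function` implementing SiLU, is to swap the activation for one written with plain tensor ops before exporting. This is a sketch, not the repository's method: `ExportableSiLU` and `swap_modules` are hypothetical names, and the small model uses `nn.ReLU` as a stand-in for the activation class that needs replacing.

```python
import torch
import torch.nn as nn


class ExportableSiLU(nn.Module):
    """SiLU (Swish) written with plain tensor ops, so tracing sees only standard operators."""

    def forward(self, x):
        return x * torch.sigmoid(x)


def swap_modules(module, old_type, new_factory):
    """Recursively replace every child of type old_type with new_factory()."""
    for name, child in module.named_children():
        if isinstance(child, old_type):
            setattr(module, name, new_factory())
        else:
            swap_modules(child, old_type, new_factory)
    return module


# Stand-in model for illustration; nn.ReLU plays the role of the
# activation class that the exporter cannot handle.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
swap_modules(model, nn.ReLU, ExportableSiLU)
```

Because the replacement computes the same function, the trained weights would not need to change.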

One problem

In, is there an error in "layers = [ceil(0 * depth_mult) for element in layers]" (line 135)?
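To make the concern concrete, here is a small sketch comparing the expression as quoted with a version that multiplies each element, which is an assumption about what was intended. The `layers` values and both function names are illustrative only.

```python
from math import ceil


def scale_depths(layers, depth_mult):
    """Scale each stage's block count by depth_mult, rounding up (assumed intent)."""
    return [ceil(element * depth_mult) for element in layers]


def scale_depths_as_quoted(layers, depth_mult):
    """The expression as quoted in the question: every entry becomes 0."""
    return [ceil(0 * depth_mult) for element in layers]


layers = [1, 2, 2, 3, 3, 5]  # illustrative per-stage block counts
print(scale_depths(layers, 1.5))            # [2, 3, 3, 5, 5, 8]
print(scale_depths_as_quoted(layers, 1.5))  # [0, 0, 0, 0, 0, 0]
```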

Training recipe

Thanks for sharing your code.

I tried to train my own ReXNet using the recipe the repo provided:

./ 4 /imagenet/ --model rexnetv1 --rex-width-mult 1.0 --opt sgd --amp \
 --lr 0.5 --weight-decay 1e-5 \
 --batch-size 128 --epochs 400 --sched cosine \
 --remode pixel --reprob 0.2 --drop 0.2 --aa rand-m9-mstd0.5 

Your paper says you trained ReXNet with a stochastic depth rate of 0.2. However, the provided recipe does not use stochastic depth.

My question is: do I need to use stochastic depth in order to reproduce the results?


Are there any plans to update the rexnet-lite model?

Problem when using the ReXNet_V1-2.0x weights


An error occurs when the 2.0x weights are used with the code above.

With the same code, no problem occurred when I used the 1.0x weights.

So I expect the error lies somewhere other than my code.

Counting FLOPs

Thanks for sharing your wonderful work.

I am curious about counting FLOPs.
I found this, but it shows higher FLOPs when I use HardSwish instead of Swish.
Can you share your FLOPs counting script?

Thank you very much
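For reference, a minimal sketch of how convolution cost is commonly counted, under the assumption that one multiply-accumulate counts as one operation. This is not the authors' script. Counters also differ in how many operations they charge per element for an activation, which is left here as an explicit parameter and could be one reason two activations report different totals. Both function names are hypothetical.

```python
def conv2d_macs(in_ch, out_ch, kernel, out_h, out_w, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    return out_h * out_w * out_ch * (in_ch // groups) * kernel * kernel


def activation_ops(channels, height, width, ops_per_element):
    """Elementwise cost of an activation for a chosen per-element convention."""
    return channels * height * width * ops_per_element


# 3x3 stride-2 stem on a 224x224 RGB input, then a 3x3 depthwise layer.
stem = conv2d_macs(3, 32, 3, 112, 112)
depthwise = conv2d_macs(32, 32, 3, 112, 112, groups=32)
print(stem, depthwise)  # 10838016 3612672
```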

Pretrained models problem

When I run

import torch
import rexnetv1

model = rexnetv1.ReXNetV1(width_mult=1.0).cuda()
print(model(torch.randn(1, 3, 224, 224).cuda()))

I got this error:
TypeError: __init__() takes from 1 to 2 positional arguments but 3 were given

I did not change anything. How can I solve it?


In the README, the links for ReXNet 1.3, 1.5 and 2.0 lead to a 404 error.

Comparison with RegNet

Edit: (I separated this question from a previous issue, #3.)

This paper considers network design spaces similar to the approach taken in the recent RegNet paper (Designing Network Design Spaces). Are the principles from that paper congruent with yours?

The discussion of design methodology in this paper is reminiscent of the RegNet paper (Designing Network Design Spaces) previously published by FAIR. How does it compare with that paper's approach?

Unstable training

Thanks for your great work. I am now using ReXNet as the backbone for my own classification task, and I found that training is not stable (val_acc fluctuates by perhaps up to 1%). I used the Adam optimizer with ExponentialLR as the LR scheduler, without a warmup strategy. I would like to know how to make training stable. Can you give me some guidance? Looking forward to your reply.
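Since the setup above uses no warmup, one thing that could be tried is a linear warmup followed by cosine decay. The sketch below only computes the learning-rate curve. `lr_at` is a hypothetical helper, the step counts are arbitrary, and whether it removes the fluctuation is not something this example can show.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Arbitrary illustrative settings: 5 warmup steps out of 50.
schedule = [lr_at(s, base_lr=1e-3, warmup_steps=5, total_steps=50) for s in range(50)]
```

With PyTorch, the same function could be passed to `torch.optim.lr_scheduler.LambdaLR` by calling it with `base_lr=1.0`, since that scheduler expects a multiplier on the optimizer's initial learning rate.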

Improvements for ResNet


Thanks for the great work. I have several questions.

First, do the numbers in Table 7 include the training techniques mentioned in Appendix B.2?
Second, I'm wondering why the improvements for ResNet50 and VGG16 are much smaller than those for MobileNets (0.8% and 0.2% compared to 4%).


Rank Expansion & Training

While writing a TensorFlow implementation of ReXNet, I had a couple of questions regarding your proposed rank expansion and training methodologies.

First Question

First of all, regarding your implementation of the [Linear Bottleneck] residual connection: the code adds the input onto only the first in_channels channels of the bottleneck's sequential output.

if self.use_shortcut:
    out[:, 0:self.in_channels] += x

However, when reading the ResNet paper, the authors of the residual block said that when the input channels and output channels are different, it is necessary for x to be linearly projected to the same shape and then added to the residual function. My implementation of this can be seen as follows:

# Tensorflow implementation of ReXNet
# y = the linear bottleneck model (ConvBNSwish + ConvAct + Squeeze + etc.)
if use_shortcut:
    x = Conv2D(filters=self.out_channels, strides=self.stride,...)(_input)
    x = BatchNormalization(...)(x)
    y += x
return y

I was wondering if there was a particular reason why you did not opt to use a linear projection on the input, and if there are any significant performance differences between the two methodologies.
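For context, the slice-wise addition quoted above behaves like adding `x` padded with zero channels, which resembles the parameter-free zero-padding shortcut that the ResNet paper lists as option A, alongside the projection shortcut. The sketch below checks that equivalence numerically. The function names and tensor sizes are illustrative, not taken from the repository.

```python
import torch
import torch.nn.functional as F


def partial_shortcut(out, x):
    """Add x onto the first x.shape[1] channels of a copy of out."""
    out = out.clone()
    out[:, : x.shape[1]] += x
    return out


def zero_padded_shortcut(out, x):
    """Pad x with zero channels up to out's channel count, then add."""
    extra = out.shape[1] - x.shape[1]
    # F.pad lists pads from the last dim backwards: (W, W, H, H, C_front, C_back).
    return out + F.pad(x, (0, 0, 0, 0, 0, extra))


x = torch.randn(2, 16, 8, 8)    # block input with 16 channels
out = torch.randn(2, 27, 8, 8)  # block output with 27 channels
same = torch.allclose(partial_shortcut(out, x), zero_padded_shortcut(out, x))
```

Unlike a 1x1 projection followed by batch normalization, this form adds no parameters. Whether it performs differently from a projection is an empirical question the sketch does not answer.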

Second Question

Furthermore, I noticed in the [Linear Bottleneck] code that you only provided a depthwise operation without a pointwise convolution. The code provided says

ConvBNAct(out, in_channels=dw_channels, channels=dw_channels, kernel=3, stride=stride, pad=1,
                  num_group=dw_channels, active=False)

Does this mean this is not a depthwise separable convolution? Or is this also a method to reduce the representational bottleneck?

Thank you!


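Thank you for the good paper and the accompanying code. While looking into lightweight neural networks following MobileNet, I came across ReXNet and became interested because of its good performance. I then implemented it in TensorFlow to actually run it on a mobile phone, and a few questions came up along the way, so I am opening this issue.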


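A few questions that came up while studying and implementing rexnetv1


Thank you for your hard work on this research.
I am a working professional who has recently been studying rexnetV1 by implementing it by hand.
A few questions came up while writing the code, so I would like to ask them.

  1. Did you design stem_channel to be affected by width_multi as well?


As I understand it, the intent is that it is affected when width_multi is below 1.0 and fixed at 32 when it is 1.0 or above.
However, I do not understand why you divide by width_multi.
Or is there another reason?

  2. Why nn.Dropout rather than nn.Dropout2d?


Again as I understand it, nn.Dropout is applied at a point where the feature map is still in unflattened BxCxHxW form.
A convolution down to the user's n classes follows the dropout, so it seems dropout should act per channel.
With nn.Dropout, it seems the dropout would not be applied per channel and would behave in an unintended way. Is that not a problem?
(Isn't nn.Dropout meant for the flattened BxK form, where a fraction p of the K entries is dropped?)
If I have misunderstood, or if you had a different intention, I would appreciate your advice.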
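To illustrate the difference raised in the dropout question, the sketch below applies both modules to a B x C x H x W tensor of ones. It is not taken from the repository. One assumption worth checking against the model code: if the dropout sits after global average pooling, the map is B x C x 1 x 1, each channel holds a single element, and the two modules would then drop the same kind of unit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(1, 64, 4, 4)  # B x C x H x W feature map

elementwise = nn.Dropout(p=0.5)    # zeroes individual elements
channelwise = nn.Dropout2d(p=0.5)  # zeroes whole channels
elementwise.train()
channelwise.train()

y_elem = elementwise(x)
y_chan = channelwise(x)

# Fraction of spatial positions kept in each channel after Dropout2d:
# every channel is either fully kept (1.0) or fully dropped (0.0).
kept_per_channel = (y_chan != 0).float().mean(dim=(2, 3))
```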

I look forward to your reply. Thank you for the good paper.
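On the first question above, here is a small arithmetic sketch under a loudly stated assumption that the text of the question does not confirm: that the value obtained by dividing by width_multi is later multiplied by width_multi again when the stem convolution is built. Under that assumption the division cancels the later multiplication, so the stem would stay at 32 below 1.0 and scale above it. `stem_width` is a hypothetical helper.

```python
def stem_width(width_mult, base=32):
    """Stem channels when base is divided by width_mult (< 1.0 only) and then rescaled.

    Assumption for illustration: the stem conv uses round(stem_channel * width_mult).
    """
    stem_channel = base / width_mult if width_mult < 1.0 else base
    return int(round(stem_channel * width_mult))


for mult in (0.5, 0.75, 1.0, 1.5, 2.0):
    print(mult, stem_width(mult))
```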

RandAug + EraseAug + SE Block + Swish ?

At work, I am evaluating ReXNet as a candidate for in-house model tuning.

You trained the ReXNet models with RandAug, EraseAug, SE blocks and SiLU (Swish) activations.

None of these were used in MobileNetV2 training.

Since you argue that adjusting the number of channels per layer in MobileNetV2 is an important factor in improving performance, I trained ReXNet without those techniques and got 72.9-73.2% top-1 accuracy. To check whether this was a training problem, I then trained ReXNet with the above techniques enabled, and the result came out similar to the paper.

So my questions are:

  1. The paper's argument for 'Diminishing Representational Bottleneck' by adjusting the channel sizes does not seem conclusive. What do you think?

  2. Have you tried training MobileNetV2 models with the above techniques, without adjusting channel sizes?

Thanks. Looking forward to your response.

COCO ReXNet Model

Hi @dyhan0920 and other Contributors.
Thank you for making such a good repo public.

It's totally amazing that ReXNet models are better than EfficientNet models. Looks like we got another SOTA model!!!

I saw the profiling on the ImageNet and COCO datasets. I want to try evaluating those models but was not able to for COCO. So, could you provide the weights for the COCO versions of the ReXNet models?

GPU memory

Dear all,

Thanks for the nice work.
I have a question about GPU memory usage when running the code in inference mode.

I compared trainable parameters and GPU memory between the ResNet50 in torchvision from Facebook and ReXNetV1 for a torch tensor of shape [1, 3, 1080, 1920]. As a result, ReXNetV1 has fewer trainable parameters, but it requires more GPU memory than ResNet50.
Please note that the ReXNetV1 parameter width_mult is 1.0.
GPU memory usage was checked with the nvidia-smi command while running the code below.

  • Model parameter

    • ReXNetV1: 4,796,873
    • ResNet50: 25,557,032
  • GPU memory

    • ReXNetV1: 9,723MiB
    • ResNet50: 7,819MiB
  • Experiment environment

    • OS: Linux 16.0.4
    • torch version: 1.5.1
    • torchvision version: 0.6.1
    • GPU: single NVIDIA Titan RTX

Thus, I wonder why the ReXNetV1 requires more memory than the ResNet50. In other words, I wonder which module in the ReXNet V1 seems to use the most memory.
For reference, the code used in the experiment is below.

import torch
import torchvision.models as models
import rexnetv1

# Please select the model to be used in the experiment to measure GPU memory usage. 
# Comment out the model to be unused. (e.g # model = models.resnet50(pretrained=True).to('cuda'))
# Option 1: RexNetV1
model = rexnetv1.ReXNetV1(width_mult=1.0).to('cuda:0')
# Option 2: ResNet50
# model = models.resnet50(pretrained=True).to('cuda:0')

x = torch.randn([1, 3, 1080, 1920], dtype=torch.float).to('cuda:0')

for idx in range(100):
    y = model(x)

print('Model params.: {}'.format(sum(p.numel() for p in model.parameters() if p.requires_grad)))
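As a rough aid for reasoning about the numbers above: in float32 the reported parameter counts come to only about 18 MiB and 97 MiB, so the weights alone cannot account for several GiB. Activation tensors at 1080x1920 are much larger, the loop above runs without `torch.no_grad()` so intermediate activations may be kept for a backward pass, and nvidia-smi also counts memory held by PyTorch's caching allocator. The sketch below only estimates the size of a single activation tensor. `activation_mib` is a hypothetical helper and the channel widths are illustrative.

```python
def activation_mib(channels, height, width, bytes_per_element=4, batch=1):
    """Size in MiB of one activation tensor of shape (batch, channels, height, width)."""
    return batch * channels * height * width * bytes_per_element / 2**20


# A 1080x1920 input after one stride-2 layer, for two illustrative channel widths.
print(activation_mib(32, 540, 960))  # 63.28125
print(activation_mib(64, 540, 960))  # 126.5625
```

Layers that expand the channel count at high resolution would therefore weigh far more in memory than their parameter count suggests.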

Pretrained ReXNet-lite

Thank you so much for the awesome models.
Could you please provide rexnet_lite_1.3.pth,... ?
Thank you in advance.

Latency, throughput, GPU performance

Since ReXNet is based on MnasNet architectures (MobileNet v1 and v2), I guess that it suffers from the same low-throughput, low-GPU-performance issue.
Can you provide any numbers?

I am specifically interested in per-image and per-batch latency and throughput (images/sec) depending on hardware such as CPU, ARM processor, GPU, etc.

I know this is a lot to ask for, but I believe it will be valuable to other researchers too, and any kind of numbers would be helpful.

Thankfully, this paper focuses more on the design principles, which may be applicable to other GPU-friendly SOTA architectures such as TResNet or ResNeSt.

Thank you for the good paper and code.
The ReXNets mainly presented in the paper are MobileNet-based, so I expect they retain the GPU performance limits of the original models.

Has a comparison been made on this? I would be grateful if actual numbers (such as images processed per second on GPU for each batch size) could be provided.

Even if the models presented are limited in that respect, it is encouraging that the design methodology itself seems applicable to other GPU-efficient models.
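Until such numbers are available, latency and throughput can be measured locally. The sketch below is a generic timing helper, not something provided by the repository: `measure_throughput` is a hypothetical name and the workload is a stand-in. On a GPU, CUDA kernels run asynchronously, so a synchronization callable such as `torch.cuda.synchronize` would need to be passed as `sync` for the timings to be meaningful.

```python
import time


def measure_throughput(run_batch, batch_size, iters=20, warmup=5, sync=None):
    """Return (seconds per batch, images per second) for a callable that processes one batch."""
    for _ in range(warmup):
        run_batch()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    if sync is not None:
        sync()
    latency = (time.perf_counter() - start) / iters
    return latency, batch_size / latency


# Stand-in workload; with a real model this would be e.g. lambda: model(x).
latency, images_per_sec = measure_throughput(lambda: sum(range(10000)), batch_size=8)
```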


extract features

model = ReXNetV1(width_mult=1.5)
fea = model.extract_features(18)

TypeError: conv2d() received an invalid combination of arguments - got (int, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:

  • (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
    didn't match because some of the arguments have invalid types: (int, Parameter, NoneType, tuple, tuple, tuple, int)
  • (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
    didn't match because some of the arguments have invalid types: (int, Parameter, NoneType, tuple, tuple, tuple, int)

How can this be applied to segmentation?
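The error message lists an `int` as the first argument to `conv2d`, which suggests that `extract_features(18)` received an integer where an image tensor was expected. The sketch below shows the general pattern of passing a tensor and keeping intermediate feature maps, as a segmentation decoder would typically consume. The backbone is a stand-in built for illustration, not ReXNetV1, and `collect_features` is a hypothetical helper.

```python
import torch
import torch.nn as nn

# Stand-in backbone for illustration only; it is not ReXNetV1.
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
)


def collect_features(stages, image):
    """Run the stages in order and keep every intermediate feature map."""
    feats = []
    x = image
    for stage in stages:
        x = stage(x)
        feats.append(x)
    return feats


image = torch.randn(1, 3, 224, 224)  # a Tensor, not an int
with torch.no_grad():
    feats = collect_features(backbone, image)
shapes = [tuple(f.shape) for f in feats]
```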

Warning when converting to ONNX

When I convert ReXNet to ONNX, the following warnings occur:
F:\ TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator add_. This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory chunk, but are disjoint (e.g. are outputs of torch.split), this might still be safe.
out[:, 0:self.in_channels] += x
F:\ TracerWarning: There are 4 live references to the data region being modified when tracing in-place operator copy_ (possibly due to an assignment). This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory chunk, but are disjoint (e.g. are outputs of torch.split), this might still be safe.
out[:, 0:self.in_channels] += x

Could you help me?
I think this makes the ONNX result differ from the PyTorch one.
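Both warnings point at the in-place slice update. One possible way around it, sketched here as an assumption rather than a confirmed fix, is to build the same result without writing into `out`, using `torch.cat`. `shortcut_out_of_place` is a hypothetical name, and the tensor sizes are illustrative.

```python
import torch


def shortcut_out_of_place(out, x):
    """Add x to the first x.shape[1] channels of out without any in-place write."""
    c = x.shape[1]
    return torch.cat([out[:, :c] + x, out[:, c:]], dim=1)


x = torch.randn(1, 16, 8, 8)
out = torch.randn(1, 27, 8, 8)

# Reference result computed with the slice-assign form quoted in the warning.
reference = out.clone()
reference[:, 0:x.shape[1]] += x

match = torch.equal(reference, shortcut_out_of_place(out, x))
```

Since the arithmetic is identical, the trained weights would stay valid. Whether the exported ONNX output then matches PyTorch would still need to be checked, for example with the model in `eval()` mode during export.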

