motis's Introduction

Hi there 👋

😉 I am Siyu Ren.

🎓 I received my Bachelor's degree from Tongji University and my Ph.D. from Shanghai Jiao Tong University.

🔎 Currently, my research interests include efficient methods for NLP / large language models and techniques for the mechanistic understanding of LLMs.

📚 For my academic publications, please refer to https://drsy.github.io/.

motis's Issues

Cannot match up model encodings

Hi! Thanks for publishing this work, it's a great reference.

I'm trying to integrate a couple of different systems, and I need the model encodings to match. So far, I haven't been able to make that work:

Given this Python:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("image_1.png")).unsqueeze(0).float().to(device)
text = clip.tokenize(["a face", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    print(image_features.tolist()[0])

I'm trying to get the same array of floats out using Clip.mm's - (NSArray<NSNumber*>*)test_uiimagetomat:(UIImage*)image function. Try as I might, the outputs always differ, and I'm not sure where the difference comes from. As far as I can tell, the cvt methods do the same as the image preprocess, followed by normalisation with the values from CLIP.
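For reference, my understanding (from reading the openai/CLIP repository) is that the preprocess returned by clip.load for ViT-B/32 is roughly equivalent to the torchvision pipeline below; the bicubic resize is the step I would expect to differ most easily from an iOS-side implementation:

from PIL import Image
from torchvision.transforms import (Compose, Resize, CenterCrop, ToTensor,
                                    Normalize, InterpolationMode)

# Approximate equivalent of the `preprocess` returned by clip.load("ViT-B/32"):
# resize the short side to 224 with bicubic interpolation, centre-crop to
# 224x224, convert to an RGB float tensor, then normalise with CLIP's
# published mean/std.
clip_like_preprocess = Compose([
    Resize(224, interpolation=InterpolationMode.BICUBIC),
    CenterCrop(224),
    lambda img: img.convert("RGB"),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
              std=(0.26862954, 0.26130258, 0.27577711)),
])

image = clip_like_preprocess(Image.open("image_1.png")).unsqueeze(0)

If the Swift side resizes with a different interpolation (or crops in a different order), small per-dimension differences like the ones below would not surprise me.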

Here are some of the initial values from the Python code above:

[0.3502497971057892, 0.0028706961311399937, -0.46749746799468994, -0.14868411421775818, -0.03139263391494751, -0.4536064863204956

And from the Swift side:

[0.3193549513816833496, 0.0140316337347030640, -0.4410626888275146484, -0.0908056870102882385, -0.0415024310350418091, -0.4141347408294677734

I used the Quick Look preview while debugging the iOS code to save the image from the UIImage and confirm that the same image is being used. In both cases, I'm using the original ViT-B/32 CLIP image encoder. Strangely, the numbers above are roughly similar, but I'm not sure whether that's coincidental.
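One check that might help narrow this down: if the Swift output is just a preprocessing artefact (e.g. a different resize interpolation) rather than a genuinely different model, the cosine similarity between the two embeddings should still be very close to 1. A minimal sketch, using only the few values pasted above as placeholders; in practice you would compare the full 512-dimensional vectors:

import torch
import torch.nn.functional as F

# Placeholder values: only the first few dimensions pasted above.
python_features = torch.tensor([[0.3502, 0.0029, -0.4675, -0.1487]])
swift_features = torch.tensor([[0.3194, 0.0140, -0.4411, -0.0908]])

# A cosine similarity near 1.0 suggests the difference is only in image
# preprocessing; a much lower value would point at the model or weights.
print(F.cosine_similarity(python_features, swift_features).item())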

Any advice?

Wrong images returned when using the Android app

Hello, I have downloaded and installed the Android app on my phone, but whenever I try a text prompt, the returned image is always the first one. I'm wondering whether I installed the app incorrectly or whether the APK itself has a problem. Has anyone tried the app and gotten correct results?

Source code for Android sample

Hi,
The README links to the APK of an Android sample, but I couldn't find the source code. Has it been published somewhere already?

State_dict weights

Is there any chance you could upload the state_dict rather than the JIT files? Thanks.
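In the meantime, a workaround that appears to work is to extract a state_dict from the released TorchScript checkpoint yourself, since a loaded ScriptModule still exposes state_dict(). A minimal sketch, with placeholder file names:

import torch

# Load the released TorchScript (JIT) checkpoint; the file name is a placeholder.
jit_model = torch.jit.load("ViT-B-32.pt", map_location="cpu")

# A ScriptModule is still an nn.Module, so its raw weights can be saved and
# later loaded into a matching eager-mode model definition.
torch.save(jit_model.state_dict(), "ViT-B-32_state_dict.pt")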

Test result

Hi,
Are the text-image retrieval results in Table 1 of the paper obtained by fine-tuning and testing on the same dataset? Have you tried training on MSCOCO and testing on Flickr?

Training code

Could you provide the model distillation code (the two-stage approach described in the paper)?
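Even a rough sketch would help. My current guess at the first stage is plain embedding distillation, something like the loss below; this is only my own assumption, not the paper's actual code, and student_emb / teacher_emb are placeholder tensors from a student image encoder and the frozen CLIP teacher:

import torch.nn.functional as F

def stage1_distill_loss(student_emb, teacher_emb):
    # Guessed stage-1 objective: align the student's image embeddings with
    # the frozen CLIP teacher's embeddings on the unit sphere.
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(student_emb, teacher_emb)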

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.