
Comments (7)

mtalcott commented on May 23, 2024

@SecDWizar Thanks for the detailed report! I think I understand the problem. For an immediate fix, try again with "Refresh media items" checked (as it is by default) now that the daily quota has been reset. This should re-fetch media items and resolve the 403s (although it might actually take 2 days' quota, see below).

The Photos API returns a baseUrl that is used to 1) download a small version of the image and 2) get the size. Only 75k baseUrl requests are allowed per day before 429s are returned. These baseUrls also only stay valid for about an hour, so when the quota reset at midnight, it started returning 403s instead as the baseUrls previously fetched had expired.
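As a rough illustration of the baseUrl behaviour described above, the sketch below builds a downsized-image URL and maps the status codes mentioned in this thread to their likely cause. The `=w…-h…` suffix is the Photos Library API's way of requesting a resized image. The 250-pixel size is only inferred from the `-250.jpg` filenames in the log later in this thread, and `sized_image_url` and `explain_status` are hypothetical helper names, not functions from the tool.

```python
def sized_image_url(base_url: str, width: int = 250, height: int = 250) -> str:
    """Append the Photos Library API size suffix to a baseUrl."""
    return f"{base_url}=w{width}-h{height}"


def explain_status(status_code: int) -> str:
    """Map the HTTP status of a baseUrl request to the cause described in this thread."""
    if status_code == 200:
        return "ok"
    if status_code == 429:
        return "daily baseUrl quota exhausted; wait for the reset"
    if status_code == 403:
        return "baseUrl expired; refresh the media item to get a new one"
    return "unexpected status"


# Placeholder URL, not a real media item.
print(sized_image_url("https://lh3.googleusercontent.com/lr/EXAMPLE"))
print(explain_status(403))
```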

Based on the 2056 subtasks, I estimate your total library size is ~100k media items. The tool won't call baseUrls again for any media items it already has a size/image for, but it will use 1 call for the image and 1 call for the size, so it might require 3 days' total quota to get images and sizes for your whole library.
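The 3-day estimate follows from simple arithmetic, sketched here under the figures quoted in this comment (75k baseUrl calls per day, roughly 100k media items, 2 calls per item). The function name is made up for illustration.

```python
import math

DAILY_BASE_URL_QUOTA = 75_000  # per-day limit quoted in this comment


def days_of_quota(media_items: int, calls_per_item: int = 2) -> int:
    """Days of quota needed when every item costs `calls_per_item` baseUrl calls."""
    total_calls = media_items * calls_per_item
    return math.ceil(total_calls / DAILY_BASE_URL_QUOTA)


print(days_of_quota(100_000))     # image + size: 200,000 calls
print(days_of_quota(100_000, 1))  # image only: 100,000 calls
```

With both calls the estimate is 3 days; skipping the size call, as proposed below for a future option, would bring it down to 2.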

Longer term, I want to handle this situation more gracefully upfront with user-facing messaging, and provide an option to not fetch size so that the daily quota can all go towards downloading images.

from google-photos-deduper.

SecDWizar commented on May 23, 2024

Thank you very much for this explanation, now it's all clear (and those long guid urls make sense).

A couple of small questions:
How do you know what images it initially has, i.e. the list of items to process and fetch small versions and sizes for? What quota does building that list take? Does it refresh? Does the list refresh only when "Refresh media items" is checked and the task is restarted? If so, does that invalidate the whole list? And does getting it again count as just one API call?

  • sounds to me this should be true on all accounts, so why not just fetch this up automatically whenever those ephemeral (1h) links go stale?

How do you track changes, such as images added from the phone or deleted from it? Again, if getting all the media items is just one API call, then processing it (which can be expensive and needs a nice data structure, but this is done on my computer so who cares) should even be done periodically IMHO and in the background (well, maybe that's a technical challenge). With the changes you discussed, the GUI could even occasionally show a prompt that media item changes were detected: 100 more images added/scanned, 50 removed, etc. That would make the work (and the GUI) dynamic.

What are your thoughts?


mtalcott commented on May 23, 2024

The image metadata is stored in MongoDB. The port is exposed by default when you run docker-compose, so you can take a peek at the media_items it has gathered with a tool like https://www.mongodb.com/products/compass. Refreshing the media items pulls new data from the Photos API into the Mongo collection and spins off new tasks to store downsized images and get the size, each of which counts against that 75k/day quota I mentioned above.

> sounds to me this should be true on all accounts, so why not just fetch this up automatically whenever those ephemeral (1h) links go stale?

Yes, that's what it should probably do instead - plus notify the user that the quota was hit then keep going once the quota resets. I wasn't sure when the quota reset, so knowing it reset for you at midnight PST is helpful.

I also don't have an option yet to cancel the current task; that would be helpful once the quota is hit, in case you want to shut the whole thing down and try again later.

> How do you track changes, such as images added from the phone or deleted from it? Again, if getting all the media items is just one API call, then processing it (which can be expensive and needs a nice data structure, but this is done on my computer so who cares) should even be done periodically IMHO and in the background (well, maybe that's a technical challenge). With the changes you discussed, the GUI could even occasionally show a prompt that media item changes were detected: 100 more images added/scanned, 50 removed, etc. That would make the work (and the GUI) dynamic.

The Photos API makes this difficult. There's no way to filter by modified date or anything. In fact, it doesn't even tell you how many media items exist - it just returns a next page token and you have to keep iterating until there are no more pages. Getting the media item metadata is the least expensive part though, it's calling the baseUrls and storing the images that takes more time, which is why I've parallelized it.
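The page-token loop described above can be sketched as follows. The `mediaItems` and `nextPageToken` field names match the Photos Library API's list response; `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the fake pages exist only so the sketch runs offline.

```python
from typing import Callable, Iterator, Optional


def iter_media_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every media item, following nextPageToken until it is absent."""
    page_token = None
    while True:
        page = fetch_page(page_token)
        yield from page.get("mediaItems", [])
        page_token = page.get("nextPageToken")
        if not page_token:
            break  # no token means this was the last page


# Stand-in for the API: two pages keyed by the token that requests them.
FAKE_PAGES = {
    None: {"mediaItems": [{"id": "a"}, {"id": "b"}], "nextPageToken": "t1"},
    "t1": {"mediaItems": [{"id": "c"}]},
}

items = list(iter_media_items(lambda token: FAKE_PAGES[token]))
print(len(items))
```

Because the total is unknown until the last page arrives, the count can only be computed after a full pass, which is the limitation described above.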


mtalcott commented on May 23, 2024

@SecDWizar This should be addressed by #16.

The task will now fail once quota is hit, but progress is saved and it can be restarted the next day.
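A minimal sketch of the "stop at quota, keep progress, resume later" idea, not the actual change in #16: it skips items that already have a stored result and stops cleanly when a hypothetical `QuotaExceeded` error is raised. A plain dict stands in for the MongoDB collection, and every name here is an assumption made for illustration.

```python
class QuotaExceeded(Exception):
    """Raised by the (hypothetical) fetcher when the API answers 429."""


def process_pending(item_ids, fetch_size, store):
    """Fetch sizes for items missing from `store`; return False if quota ran out."""
    for item_id in item_ids:
        if item_id in store:
            continue  # already processed on an earlier run, costs no quota
        try:
            store[item_id] = fetch_size(item_id)
        except QuotaExceeded:
            return False  # everything stored so far is kept
    return True


def make_fetcher(budget):
    """Fake fetcher that allows `budget` calls, then raises QuotaExceeded."""
    remaining = {"calls": budget}

    def fetch(item_id):
        if remaining["calls"] == 0:
            raise QuotaExceeded(item_id)
        remaining["calls"] -= 1
        return len(item_id)  # placeholder for the real size

    return fetch


store = {}
ids = ["a", "bb", "ccc"]
finished_day_one = process_pending(ids, make_fetcher(2), store)
finished_day_two = process_pending(ids, make_fetcher(2), store)
print(finished_day_one, finished_day_two, store)
```

The second run spends quota only on the item that was still missing, which is why restarting the next day makes progress instead of starting over.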


SecDWizar commented on May 23, 2024

> @SecDWizar This should be addressed by #16.
>
> The task will now fail once quota is hit, but progress is saved and it can be restarted the next day.

Hi Hi,
Sorry for the late reply,

Is progress saved with or without "Refresh media items" checked?

I've pulled and rebuilt the images - that should do it, right?
Ever since then it has failed (I always tried with "Refresh media items" checked). I didn't see what happened in the logs, as I only looked a few days later, and by then it's hard to find in the logs.

image


SecDWizar commented on May 23, 2024

Found it. I think that's a different bug; want me to open a new issue for it?
The question still stands: is progress saved with or without "Refresh media items"? (I mean, whether checking "Refresh media items" zeros it, etc.)

worker_1  | [2023-09-14 13:09:26,333: ERROR/ForkPoolWorker-31] Task app.tasks.process_duplicates[75866370-e90a-4066-a894-42a29a6f3b12] raised unexpected: RuntimeError('Image decoding failed (unknown image type): /mnt/images/AMP2KI72Y-ZxNHXGmA_vS0fgKGuVV-R43RJHXupfrUWomAFDAwANBVTaFUTYkYpBr4PfkTlMuLACm1eDBOfpA4CoCRT30BPurQ-250.jpg')
worker_1  | Traceback (most recent call last):
worker_1  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 477, in trace_task
worker_1  |     R = retval = fun(*args, **kwargs)
worker_1  |   File "/usr/src/app/app/__init__.py", line 25, in __call__
worker_1  |     return self.run(*args, **kwargs)
worker_1  |   File "/usr/src/app/app/tasks.py", line 90, in process_duplicates
worker_1  |     results = task_instance.run()
worker_1  |   File "/usr/src/app/app/lib/process_duplicates_task.py", line 109, in run
worker_1  |     similarity_map = duplicate_detector.calculate_similarity_map()
worker_1  |   File "/usr/src/app/app/lib/duplicate_image_detector.py", line 61, in calculate_similarity_map
worker_1  |     embeddings = self._calculate_embeddings()
worker_1  |   File "/usr/src/app/app/lib/duplicate_image_detector.py", line 114, in _calculate_embeddings
worker_1  |     mp_image = mp.Image.create_from_file(storage_path)
worker_1  | RuntimeError: Image decoding failed (unknown image type): /mnt/images/AMP2KI72Y-ZxNHXGmA_vS0fgKGuVV-R43RJHXupfrUWomAFDAwANBVTaFUTYkYpBr4PfkTlMuLACm1eDBOfpA4CoCRT30BPurQ-250.jpg
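The traceback shows a stored `.jpg` file being rejected by the decoder. One way to narrow this down, offered as a sketch and not as the project's fix, is to check whether the file on disk starts with a known image signature. A file that does not might be an empty or non-image payload saved under a `.jpg` name; that is an assumption, as the thread does not confirm the cause. `sniff_image_type` is a hypothetical helper.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(path):
    """Return the image type suggested by the file's first bytes, or None."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


# Demo with two throwaway files: a JPEG header and an HTML error body.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jpg")
    bad = os.path.join(tmp, "bad.jpg")
    with open(good, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 12)
    with open(bad, "wb") as f:
        f.write(b"<html>403 Forbidden</html>")
    print(sniff_image_type(good), sniff_image_type(bad))
```

Running such a check over the image directory would list the files worth deleting and re-fetching before the duplicate detection step is retried.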


SecDWizar commented on May 23, 2024

> How do you track changes, such as images added from the phone or deleted from it? Again, if getting all the media items is just one API call, then processing it (which can be expensive and needs a nice data structure, but this is done on my computer so who cares) should even be done periodically IMHO and in the background (well, maybe that's a technical challenge). With the changes you discussed, the GUI could even occasionally show a prompt that media item changes were detected: 100 more images added/scanned, 50 removed, etc. That would make the work (and the GUI) dynamic.

> The Photos API makes this difficult. There's no way to filter by modified date or anything. In fact, it doesn't even tell you how many media items exist - it just returns a next page token and you have to keep iterating until there are no more pages. Getting the media item metadata is the least expensive part though, it's calling the baseUrls and storing the images that takes more time, which is why I've parallelized it.

In that case it should be done on a schedule and kept (in Mongo) IMHO, perhaps with a UI option to trigger that refresh and to set the schedule (once a day by default). From what I understand, the number of API calls equals the number of pages, right? How many entries are there per page? Say one has ~100K items: at 100 entries per page that'll be 1K calls, at 10 entries per page 10K calls, etc. So that's problematic...

I'm just thinking that images are added by the phone all the time. Also, sometimes people delete them.

Maybe it shouldn't be on a schedule, and "Refresh media items" would simply refresh it on demand, so once in a while you'd hit that and it would resync everything, right?
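For reference, the page-count arithmetic in the question above works out as below. The page sizes are the two hypothetical values from the question, not a statement of what the tool actually requests.

```python
import math


def list_calls_needed(media_items: int, page_size: int) -> int:
    """Number of list requests needed to page through `media_items` entries."""
    return math.ceil(media_items / page_size)


for page_size in (100, 10):
    print(page_size, list_calls_needed(100_000, page_size))
```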

