Comments (7)
@SecDWizar Thanks for the detailed report! I think I understand the problem. For an immediate fix, try again with "Refresh media items" checked (as it is by default) now that the daily quota has been reset. This should re-fetch media items and resolve the 403s (although it might actually take 2 days' quota, see below).
The Photos API returns a baseUrl that is used to 1) download a small version of the image and 2) get the size. Only 75k baseUrl requests are allowed per day before 429s are returned. These baseUrls also only stay valid for about an hour, so when the quota reset at midnight, the API started returning 403s instead because the previously fetched baseUrls had expired.
Based on the 2056 subtasks, I estimate your total library size is ~100k media items. The tool won't call baseUrls again for any media items it already has a size/image for, but it uses 1 call for the image and 1 call for the size per item (roughly 200k calls for ~100k items), so at 75k calls/day it might require 3 days' total quota to get images and sizes for your whole library.
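For illustration, the baseUrl pattern described above might be sketched like this in Python. The function and exception names are hypothetical (not the tool's actual code); the `=w<px>` size suffix is the documented Google Photos convention for requesting a downsized image:

```python
# Hypothetical sketch of using a Photos API baseUrl, per the
# constraints above: ~75k baseUrl calls/day (429 when exceeded)
# and URLs that expire after roughly an hour (403 afterwards).
import urllib.error
import urllib.request


class QuotaExceeded(Exception):
    """Raised when the daily baseUrl quota is exhausted (HTTP 429)."""


def thumbnail_url(base_url: str, width: int = 250) -> str:
    # Google Photos baseUrls take size parameters appended after '=',
    # e.g. '=w250' for a 250px-wide version.
    return f"{base_url}=w{width}"


def download_thumbnail(base_url: str, width: int = 250) -> bytes:
    try:
        with urllib.request.urlopen(thumbnail_url(base_url, width)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 429:
            # Daily quota hit: stop and resume after the reset.
            raise QuotaExceeded("daily baseUrl quota hit") from err
        if err.code == 403:
            # baseUrl likely expired (~1h lifetime): the caller should
            # re-fetch the media item for a fresh baseUrl and retry.
            raise
        raise
```

A caller hitting 403s would re-list the media item to obtain a fresh baseUrl rather than retrying the stale one.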
Longer term, I want to handle this situation more gracefully upfront with user-facing messaging, and provide an option to not fetch size so that the daily quota can all go towards downloading images.
from google-photos-deduper.
Thank you very much for this explanation - now it's all clear (and those long GUID URLs make sense).
A couple of small questions:
- How does the tool know which images it has, i.e. the list of items to process and fetch those small versions and sizes for? What quota does building that list take? Does the list refresh only when "Refresh media items" is checked and you restart? If so, does that invalidate the whole list, and does fetching it again count as just one API call?
- This sounds like it should be true on all accounts, so why not just re-fetch automatically whenever those ephemeral (1-hour) links go stale?
- How do you track changes - images added from the phone, or deleted from it? If getting all the media items is just one API call (per page), then processing the list (which can be expensive and needs a nice data structure, but runs on my computer, so who cares) could IMHO be done periodically in the background (maybe that's a technical challenge), and the GUI could occasionally prompt, along with the changes you discussed: "media item changes detected, 100 more images added/scanned, 50 removed", etc. That would make the workflow (and GUI) dynamic.
What are your thoughts?
The image metadata is stored in MongoDB. The port is exposed by default when you run docker-compose, so you can take a peek at the media_items it has gathered with a tool like https://www.mongodb.com/products/compass. Refreshing the media items pulls new data from the Photos API into the Mongo collection and spins off new tasks to store downsized images and get the size, each of which counts against that 75k/day quota I mentioned above.
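As a sketch of that bookkeeping (illustrative only: field names like `storage_path` and `size` are guesses, not the actual Mongo schema), a refresh pass could scan the stored media items and enqueue work only for items still missing data, since each missing field costs one baseUrl call against the 75k/day quota:

```python
# Illustrative sketch (not the tool's actual code) of deciding which
# stored media_items still need a baseUrl call after a refresh.

def pending_tasks(media_items):
    """Yield (item_id, task_name) pairs for items missing data."""
    for item in media_items:
        if item.get("storage_path") is None:
            yield item["id"], "store_image"
        if item.get("size") is None:
            yield item["id"], "fetch_size"


items = [
    {"id": "a", "storage_path": "/mnt/images/a-250.jpg", "size": 12345},
    {"id": "b", "storage_path": None, "size": None},
]
# Item "a" needs no further calls; item "b" needs both.
tasks = list(pending_tasks(items))
```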
> This sounds like it should be true on all accounts, so why not just re-fetch automatically whenever those ephemeral (1-hour) links go stale?
Yes, that's what it should probably do instead - plus notify the user when the quota is hit, then keep going once it resets. I wasn't sure when the quota reset, so knowing it reset for you at midnight PST is helpful.
I also don't have an option yet to cancel the current task; that would be helpful once the quota is hit, in case you want to shut the whole thing down and try again later.
> How do you track changes - images added from the phone, or deleted from it? If getting all the media items is just one API call (per page), then processing the list (which can be expensive and needs a nice data structure, but runs on my computer, so who cares) could IMHO be done periodically in the background (maybe that's a technical challenge), and the GUI could occasionally prompt, along with the changes you discussed: "media item changes detected, 100 more images added/scanned, 50 removed", etc. That would make the workflow (and GUI) dynamic.
The Photos API makes this difficult. There's no way to filter by modified date or anything. In fact, it doesn't even tell you how many media items exist - it just returns a next page token and you have to keep iterating until there are no more pages. Getting the media item metadata is the least expensive part though, it's calling the baseUrls and storing the images that takes more time, which is why I've parallelized it.
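The pagination described here can be sketched as follows. `fetch_page` is a stand-in for the real `mediaItems.list` call, which takes an optional page token and returns up to 100 items plus a `nextPageToken`; there is no total count, so you iterate until the token is absent:

```python
# Sketch of the nextPageToken pagination pattern described above.
# `fetch_page` is a hypothetical stand-in for mediaItems.list.

def list_all_media_items(fetch_page):
    items, token = [], None
    while True:
        page = fetch_page(page_token=token)
        items.extend(page.get("mediaItems", []))
        token = page.get("nextPageToken")
        if not token:
            # No token means this was the last page.
            return items


# Fake two-page response to show the control flow:
pages = {
    None: {"mediaItems": [{"id": "a"}, {"id": "b"}], "nextPageToken": "t1"},
    "t1": {"mediaItems": [{"id": "c"}]},
}
all_items = list_all_media_items(lambda page_token: pages[page_token])
```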
@SecDWizar This should be addressed by #16.
The task will now fail once quota is hit, but progress is saved and it can be restarted the next day.
Hi,
Sorry for the late reply.
Is progress saved with or without "Refresh media items" checked?
I've pulled and rebuilt the images - that should do it, right?
Ever since then it has failed (I always tried with "Refresh media items" checked). I didn't see what happened in the logs, as I always looked at them a few days later, and it's hard to find the error in the logs like that.
Found it. So I think that's a different bug - want me to open a new issue for it?
My earlier question still stands: is progress saved with or without "Refresh media items"? (I mean, does checking "Refresh media items" zero the saved progress?)
worker_1 | [2023-09-14 13:09:26,333: ERROR/ForkPoolWorker-31] Task app.tasks.process_duplicates[75866370-e90a-4066-a894-42a29a6f3b12] raised unexpected: RuntimeError('Image decoding failed (unknown image type): /mnt/images/AMP2KI72Y-ZxNHXGmA_vS0fgKGuVV-R43RJHXupfrUWomAFDAwANBVTaFUTYkYpBr4PfkTlMuLACm1eDBOfpA4CoCRT30BPurQ-250.jpg')
worker_1 | Traceback (most recent call last):
worker_1 | File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 477, in trace_task
worker_1 | R = retval = fun(*args, **kwargs)
worker_1 | File "/usr/src/app/app/__init__.py", line 25, in __call__
worker_1 | return self.run(*args, **kwargs)
worker_1 | File "/usr/src/app/app/tasks.py", line 90, in process_duplicates
worker_1 | results = task_instance.run()
worker_1 | File "/usr/src/app/app/lib/process_duplicates_task.py", line 109, in run
worker_1 | similarity_map = duplicate_detector.calculate_similarity_map()
worker_1 | File "/usr/src/app/app/lib/duplicate_image_detector.py", line 61, in calculate_similarity_map
worker_1 | embeddings = self._calculate_embeddings()
worker_1 | File "/usr/src/app/app/lib/duplicate_image_detector.py", line 114, in _calculate_embeddings
worker_1 | mp_image = mp.Image.create_from_file(storage_path)
worker_1 | RuntimeError: Image decoding failed (unknown image type): /mnt/images/AMP2KI72Y-ZxNHXGmA_vS0fgKGuVV-R43RJHXupfrUWomAFDAwANBVTaFUTYkYpBr4PfkTlMuLACm1eDBOfpA4CoCRT30BPurQ-250.jpg
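The traceback above is mediapipe's `mp.Image.create_from_file` failing on a file it can't decode (likely a corrupt or partially downloaded image). One defensive pattern (an assumption, not necessarily how the project fixed it) is to skip undecodable files instead of failing the whole task:

```python
# Illustrative sketch: tolerate undecodable images by skipping them,
# so one bad file doesn't abort the whole duplicate-detection task.

def safe_embeddings(paths, load_image):
    """Split paths into (loadable, skipped) using a loader that may raise."""
    ok, skipped = [], []
    for path in paths:
        try:
            load_image(path)  # e.g. mp.Image.create_from_file(path)
            ok.append(path)
        except RuntimeError:
            skipped.append(path)  # log and continue instead of crashing
    return ok, skipped


def fake_loader(path):
    # Stand-in for mp.Image.create_from_file, for demonstration only.
    if "bad" in path:
        raise RuntimeError("Image decoding failed (unknown image type)")
    return object()


ok, skipped = safe_embeddings(
    ["/mnt/images/good-250.jpg", "/mnt/images/bad-250.jpg"], fake_loader
)
```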
> How do you track changes - images added from the phone, or deleted from it? If getting all the media items is just one API call (per page), then processing the list (which can be expensive and needs a nice data structure, but runs on my computer, so who cares) could IMHO be done periodically in the background (maybe that's a technical challenge), and the GUI could occasionally prompt, along with the changes you discussed: "media item changes detected, 100 more images added/scanned, 50 removed", etc. That would make the workflow (and GUI) dynamic.
> The Photos API makes this difficult. There's no way to filter by modified date or anything. In fact, it doesn't even tell you how many media items exist - it just returns a next page token and you have to keep iterating until there are no more pages. Getting the media item metadata is the least expensive part though; it's calling the baseUrls and storing the images that takes more time, which is why I've parallelized it.
In that case it should be done on a schedule and kept in Mongo IMHO, perhaps even with a UI option to trigger that refresh and to set the schedule (once a day by default). From what I understand it takes one API call per page, right? How many entries per page? Say one has ~100k items: at 100 entries per page that's 1k calls; at 10 entries per page it's 10k calls, etc. So that's problematic...
I'm just thinking that the phone adds images all the time, and sometimes people delete them too.
Or maybe it shouldn't be on a schedule, and "Refresh media items" should simply refresh it on demand, so once in a while you'd hit that and it would resync everything, right?
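The resync idea above essentially boils down to a set difference between the ids already stored (e.g. in Mongo) and the ids returned by a fresh listing. A minimal sketch, with illustrative names:

```python
# Sketch of the change detection proposed above: after a refresh,
# diff stored media-item ids against freshly listed ids to report
# additions and removals (names here are illustrative).

def diff_libraries(stored_ids, fetched_ids):
    """Return (added, removed) id lists between two listings."""
    stored, fetched = set(stored_ids), set(fetched_ids)
    return sorted(fetched - stored), sorted(stored - fetched)


added, removed = diff_libraries({"a", "b", "c"}, {"b", "c", "d", "e"})
# Could drive a GUI prompt like "2 media items added, 1 removed".
```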