On initial index if "Download cap" is set then pages should only be fetched until it h


Don't grab all pages if download cap is enabled about tubesync HOT 7 CLOSED

OmgImAlexis commented on May 22, 2024

Don't grab all pages if download cap is enabled

from tubesync.

Comments (7)

meeb commented on May 22, 2024

While that upstream issue, if added, might help the logs, it likely won't remove the requirement that TubeSync index all YouTube video IDs in a playlist each time it runs an index, as YouTube does not guarantee to return them in chronological order when it gets crawled. The initial requirement with TubeSync is to "find all new video IDs", which still means indexing entire channels and playlists. This flag, if implemented upstream in youtube-dl, would likely just limit what's returned by extract_info() rather than limit what's actually requested from YouTube. If there is some way of enforcing chronological ordering that could be used for YouTube, it likely wouldn't be transferable to the other sites which will get support in TubeSync in the future either. Of course, if the youtube-dl devs, who admittedly have a far superior knowledge of the internals of YouTube APIs/front ends and their own codebase than I do, find a way to actually make this work properly I would implement it. In the foreseeable short term, however, you can expect TubeSync to index entire channels and playlists on every index and compare the upload dates to function properly. I'll still see if changing the log severity is sensible though, to stop annoying users who attempt to index very large channels a lot.
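To make the "find all new video IDs" step concrete, here is a minimal sketch, assuming the full index arrives as a plain list of ID strings in no particular order. The function name `find_new_video_ids` and both arguments are hypothetical and are not taken from the TubeSync codebase.

```python
def find_new_video_ids(indexed_ids, known_ids):
    """Return IDs seen in a full index that are not stored yet.

    indexed_ids: every video ID returned by one complete index (any order).
    known_ids:   IDs already stored from earlier indexes.
    The order of first appearance in indexed_ids is preserved.
    """
    known = set(known_ids)
    seen = set()
    new_ids = []
    for video_id in indexed_ids:
        # Skip IDs we already store and duplicates within this index.
        if video_id not in known and video_id not in seen:
            seen.add(video_id)
            new_ids.append(video_id)
    return new_ids
```

Because the returned order is not assured to be chronological, the comparison has to look at every ID rather than stopping at the first known one.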


DeftNerd commented on May 22, 2024

I hate to suggest a major refactor, but I wanted to give some ideas that might help with this problem.

Using youtube-dl to generate an index of all the videos and store them in a database to slowly download them does make sense, but when looking for new videos, tubesync seems to be configured to redownload the entire index of videos again to look for updates.

A more efficient method to look for new videos would be to use the integrated YouTube RSS feeds. They're always ordered by "published" date.

https://www.youtube.com/feeds/videos.xml?channel_id=someidhere

Adding an RSS/XML parser to the system might be a slight hassle, but it would significantly reduce the risk of youtube getting mad at excessive page indexing.
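As a rough idea of how small the parser could be, here is a sketch using only the Python standard library. It assumes the feed is Atom and that each entry carries a `yt:videoId` element and a `published` element. The two namespace URIs are my assumption about the feed format and should be checked against a real response, and `parse_feed` is a hypothetical name. Fetching the URL is left out so the parsing can be tried on a saved file.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for the YouTube channel feed; verify against a real feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}


def parse_feed(xml_text):
    """Return (video_id, published) tuples in the order the feed lists them."""
    root = ET.fromstring(xml_text)
    items = []
    for entry in root.findall("atom:entry", NS):
        video_id = entry.findtext("yt:videoId", namespaces=NS)
        published = entry.findtext("atom:published", namespaces=NS)
        if video_id:  # ignore entries without a video ID
            items.append((video_id, published))
    return items
```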


meeb commented on May 22, 2024

I'll track the RSS feature in #73 and the log level / log spam reduction options in #74 - I'll close this for now as I don't think there's anything left to add to the original issue, but feel free to comment or re-open it if you want to add more suggestions or comments.


meeb commented on May 22, 2024

While this would be nice, I don't think it's possible. If it is possible, it would require a lot more hacking into youtube-dl internals, which I'd like to avoid so that updating libraries etc. stays trivial. youtube-dl's extract_info() with extract_flat=True just returns all video IDs on a channel or playlist etc. From a quick check, there's no way to know the order of videos on a playlist or channel, so you can't just not crawl "page 22" or similar because "page 21" already has videos older than the download cap. You would need to index every video on a playlist/channel to find potentially new videos, which means checking every video's upload date against any set age caps. The resulting log message could be classed as a debug message rather than an info message though, so it can be suppressed that way...
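The cap check described above can be sketched roughly as follows. This assumes each video's metadata has already been fetched and reduced to a dict with `id` and `upload_date` keys, with `upload_date` as a `YYYYMMDD` string (the format youtube-dl reports). The function names and the `cap_days` parameter are hypothetical, not TubeSync's actual settings.

```python
from datetime import datetime, timedelta


def within_download_cap(upload_date, cap_days, now):
    """True if a 'YYYYMMDD' upload date is no older than cap_days before now."""
    uploaded = datetime.strptime(upload_date, "%Y%m%d")
    return now - uploaded <= timedelta(days=cap_days)


def select_for_download(videos, cap_days, now):
    """Check every video against the cap; the input order is not trusted."""
    return [
        video["id"]
        for video in videos
        if within_download_cap(video["upload_date"], cap_days, now)
    ]
```

Note that nothing here stops early: an old video in the list says nothing about the videos that come after it, which is why every entry is checked.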


OmgImAlexis commented on May 22, 2024

Yep looks like I need to add a PR to youtube-dl.

ytdl-org/youtube-dl#1816


OmgImAlexis commented on May 22, 2024

Dateafter shouldn’t download pages outside of the range.


meeb commented on May 22, 2024

Cheers for the suggestion!

I had noticed the RSS feeds, but compared to the current youtube-dl based method they don't actually reduce the number of requests made to YouTube that much. After getting a list of video IDs for a channel or playlist, TubeSync still needs to make one request per video to get its metadata, and these are the bulk of the requests to YouTube that seem to be triggering the rate limiting. For example, adding a channel with 1000 videos in it results in about 25 requests for indexing, then 1000 requests for metadata. Once indexed, it's "just" 25 requests per indexing interval, which is probably fine.
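The arithmetic behind those numbers, as a small sketch. The page size of 40 is purely an assumption inferred from "about 25 requests" for 1000 videos, and `estimate_requests` is a hypothetical helper, not something TubeSync provides.

```python
import math


def estimate_requests(video_count, page_size=40):
    """Return (index_requests, first_sync_requests) for a channel.

    page_size=40 is an assumed value, chosen only because 1000 / 40 = 25.
    The first sync needs one indexing pass plus one metadata request per video.
    """
    index_requests = math.ceil(video_count / page_size)
    first_sync_requests = index_requests + video_count
    return index_requests, first_sync_requests
```

For a 1000 video channel this gives 25 indexing requests and 1025 requests for the first sync, so the metadata requests dominate regardless of how the IDs are discovered.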

Additionally, unless I'm blind, I can't see any way to get more than the most recent 14 or so videos via RSS (there's no ?page=2 or similar accepted parameter I can find?), so while that would indeed work for picking up new content easily, it doesn't solve the initial "index all media on a channel" requirement.

Also I assume if a channel added > 14 videos between indexing it would have to fall back to the current way as well, which I guess is pretty unlikely but no doubt someone will find a channel that does this and trigger an edge case of missing content.
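One way the fallback could be decided, as a sketch under the assumption that the feed only exposes a small window of recent videos: if none of the IDs in the feed are already known, the window may have overflowed since the last index, so a full index is needed. `plan_update` and its arguments are hypothetical names.

```python
def plan_update(feed_ids, known_ids):
    """Return (new_ids, needs_full_index) for one feed check.

    feed_ids:  video IDs from the RSS feed, newest first.
    known_ids: video IDs already stored for this channel.
    """
    known = set(known_ids)
    new_ids = [video_id for video_id in feed_ids if video_id not in known]
    # No overlap between the feed and stored IDs means older new videos
    # could have dropped out of the feed window, so it cannot be trusted.
    needs_full_index = bool(feed_ids) and len(new_ids) == len(feed_ids)
    return new_ids, needs_full_index
```

A channel that has never been indexed has no known IDs, so this also falls back to the full index in that case.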

Using the feeds could shave off a few requests per day, but likely not enough to solve issues for anyone experiencing 429 rate limiting, for which I'll probably just have to add some 60 second delay between metadata requests to pad requests out for newly added channels or similar if people keep experiencing problems.
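A minimal sketch of such a delay, assuming metadata is fetched one video at a time through some callable. `fetch_with_delay`, `fetch` and `delay_seconds` are hypothetical names, and `sleep` is injectable so the pacing can be tested without actually waiting.

```python
import time


def fetch_with_delay(video_ids, fetch, delay_seconds=60, sleep=time.sleep):
    """Call fetch(video_id) for each ID, pausing between consecutive requests.

    No pause happens before the first request or after the last one.
    """
    results = {}
    for position, video_id in enumerate(video_ids):
        if position > 0:
            sleep(delay_seconds)  # pad requests out to avoid rate limiting
        results[video_id] = fetch(video_id)
    return results
```

At 60 seconds per request a newly added 1000 video channel would take roughly 16 to 17 hours to fetch metadata for, so the delay is a trade-off between speed and triggering 429 responses.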

I'll add it onto the future roadmap as a possible feature as using the feeds would be nicer to keep channels updated with new content. It won't replace anything too significant internally and it's also not that much work really, just use a different indexer once already indexed at least once. It wouldn't require any massive internal reworking.

