Comments (7)

meeb commented on May 22, 2024

While that upstream issue, if added, might help the logs, it likely won't remove the requirement for TubeSync to index all YouTube video IDs in a playlist each time it runs an index, as YouTube isn't guaranteed to return them in chronological order when crawled. The initial requirement with TubeSync is to "find all new video IDs", which still means indexing entire channels and playlists. This flag, if implemented upstream in youtube-dl, would likely just limit what's returned by extract_info() rather than limit what's actually requested from YouTube. If there is some way to enforce chronological ordering for YouTube, it likely wouldn't transfer to the other sites that TubeSync will support in the future either.

Of course, if the youtube-dl devs, who admittedly have far superior knowledge of the internals of YouTube's APIs/front ends and their own codebase than I do, find a way to make this work properly, I would implement it. In the foreseeable short term, however, you can expect TubeSync to index entire channels and playlists on every index and compare the upload dates to function properly. I'll still see if changing the log severity is sensible, though, to stop annoying users who attempt to index very large channels a lot.
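
As a rough illustration of the "find all new video IDs" step described above, here is a minimal sketch. The function name and the idea of comparing against a list of already-stored IDs are assumptions for illustration, not TubeSync's actual code. Because the index is not guaranteed to be chronological, every returned ID has to be compared.

```python
def find_new_video_ids(indexed_ids, known_ids):
    """Return IDs from a full index that are not already stored.

    Hypothetical helper: `indexed_ids` is the complete list returned by an
    index run, `known_ids` is whatever has been stored from earlier runs.
    Index order is preserved and duplicates are dropped.
    """
    known = set(known_ids)
    seen = set()
    new_ids = []
    for video_id in indexed_ids:
        # Every ID is checked because the index order cannot be trusted.
        if video_id not in known and video_id not in seen:
            seen.add(video_id)
            new_ids.append(video_id)
    return new_ids
```

The whole index is still needed as input, which is the point made in the comment: the comparison is cheap, the crawl is not.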

from tubesync.

DeftNerd commented on May 22, 2024

I hate to suggest a major refactor, but I wanted to give some ideas that might help with this problem.

Using youtube-dl to generate an index of all the videos, storing them in a database, and slowly downloading them does make sense, but when looking for new videos, TubeSync seems to redownload the entire index of videos to look for updates.

A more efficient method to look for new videos would be to use the integrated YouTube RSS feeds. They're always ordered by "published" date:

https://www.youtube.com/feeds/videos.xml?channel_id=someidhere

Adding an RSS/XML parser to the system might be a slight hassle, but it would significantly reduce the risk of YouTube getting mad at excessive page indexing.
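
For a sense of how small such a parser could be, here is a sketch using only Python's standard library. It assumes the feed is Atom XML with a `yt:videoId` element per entry under the namespace URIs shown; the sample document and the `parse_feed` name are made up for illustration, and a real feed should be checked before relying on these names.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the feed; verify against a real response.
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}

# Hand-written sample standing in for a downloaded feed body.
SAMPLE_FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:yt="http://www.youtube.com/xml/schemas/2015">'
    "<entry><yt:videoId>newest01</yt:videoId>"
    "<published>2024-05-20T10:00:00+00:00</published></entry>"
    "<entry><yt:videoId>older002</yt:videoId>"
    "<published>2024-05-01T10:00:00+00:00</published></entry>"
    "</feed>"
)


def parse_feed(xml_text):
    """Return (video_id, published) tuples in the order the feed lists them."""
    root = ET.fromstring(xml_text)
    videos = []
    for entry in root.findall("atom:entry", NAMESPACES):
        video_id = entry.findtext("yt:videoId", namespaces=NAMESPACES)
        published = entry.findtext("atom:published", namespaces=NAMESPACES)
        if video_id:
            videos.append((video_id, published))
    return videos


print(parse_feed(SAMPLE_FEED))
```

Fetching the feed itself is left out so the sketch stays runnable offline.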

meeb commented on May 22, 2024

I'll track the RSS feature in #73 and the log level / log spam reduction options in #74 - I'll close this for now as I don't think there's anything left to add to the original issue, but feel free to comment or re-open it if you want to add more suggestions or comments.

meeb commented on May 22, 2024

While this would be nice, I don't think it's possible. If it is possible, it would require a lot more hacking into youtube-dl internals, which I'd like to avoid so that updating libraries etc. stays trivial. youtube-dl's extract_info() with extract_flat=True just returns all video IDs on a channel or playlist. From a quick check, there's no way to know the order of videos on a playlist or channel, so you can't just skip crawling "page 22" or similar because "page 21" already has videos older than the download cap. You would need to index every video on a playlist/channel to find potentially new videos, which means checking every video's upload date against any set age caps. That message could be classed as a debug log message rather than an info log message, though, so it can be suppressed that way...
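
The per-video age check described above could look something like the sketch below. It assumes each entry already carries an `upload_date` string in youtube-dl's `YYYYMMDD` form (which in practice means the per-video metadata has been fetched); the function names and the dictionary shape are hypothetical.

```python
from datetime import date, datetime, timedelta


def within_age_cap(upload_date, max_age_days, today):
    """True if a YYYYMMDD upload date is no older than the cap."""
    uploaded = datetime.strptime(upload_date, "%Y%m%d").date()
    return today - uploaded <= timedelta(days=max_age_days)


def filter_by_age_cap(videos, max_age_days, today=None):
    """Keep only videos inside the age cap.

    The list is scanned in full: since the order is not guaranteed to be
    chronological, stopping at the first too-old entry could miss newer ones.
    """
    today = today or date.today()
    return [
        video
        for video in videos
        if within_age_cap(video["upload_date"], max_age_days, today)
    ]
```

Note that the old entry in the middle of an unordered list does not end the scan, which is exactly why a "stop at page 21" shortcut is unsafe.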

OmgImAlexis commented on May 22, 2024

Yep looks like I need to add a PR to youtube-dl.

ytdl-org/youtube-dl#1816

OmgImAlexis commented on May 22, 2024

Dateafter shouldn’t download pages outside of the range.

meeb commented on May 22, 2024

Cheers for the suggestion!

I had noticed the RSS feeds, but compared to the current youtube-dl based method they don't actually reduce the number of requests made to YouTube that much. After getting a list of video IDs for a channel or playlist, TubeSync still needs to make one request per video to get its metadata, and these are the bulk of the requests to YouTube that seem to be triggering the rate limiting. For example, adding a channel with 1000 videos results in about 25 requests for indexing, then 1000 requests for metadata. Once indexed, it's "just" 25 requests per indexing interval, which is probably fine.
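
The arithmetic in that example can be written out as a tiny sketch. The function name is made up, and the figures are the approximate ones quoted in the comment rather than measured values.

```python
def estimate_requests(video_count, index_requests):
    """Return (first_sync, later_sync) request counts for one source.

    First sync: the index requests plus one metadata request per video.
    Later syncs: only the index requests, assuming no new videos appeared.
    """
    first_sync = index_requests + video_count
    later_sync = index_requests
    return first_sync, later_sync


# Figures from the comment: 1000 videos, about 25 index requests.
print(estimate_requests(1000, 25))
```

The metadata requests dominate the first sync, so trimming the index requests alone changes the total very little.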

Additionally, unless I'm blind, I can't see any way to get more than the most recent 14 or so videos via RSS (there's no ?page=2 or similar accepted parameter that I can find), so while that would indeed work for picking up new content easily, it doesn't solve the initial "index all media on a channel" requirement.

Also, I assume if a channel added more than 14 videos between indexing runs it would have to fall back to the current method as well, which I guess is pretty unlikely, but no doubt someone will find a channel that does this and trigger an edge case of missing content.
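
One way to detect that edge case is sketched below, under the assumption that a feed-based indexer only sees a short window of recent IDs. If none of the IDs in that window are already known, there may be more new videos beyond it, so a full index is the safe choice. The function name and the "feed"/"full" return values are hypothetical.

```python
def choose_indexer(feed_ids, known_ids):
    """Decide whether the feed is enough or a full index is needed.

    Returns a (mode, new_ids) tuple where mode is "feed" or "full".
    """
    known = set(known_ids)
    new_ids = [video_id for video_id in feed_ids if video_id not in known]
    if not feed_ids or len(new_ids) == len(feed_ids):
        # Empty feed, or no overlap with stored IDs: the feed window may
        # not reach back far enough, so fall back to the full index.
        return "full", new_ids
    return "feed", new_ids
```

A source that has never been indexed has no known IDs, so it also lands on the full index path, matching the initial indexing requirement.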

Using the feeds could shave off a few requests per day, but likely not enough to solve issues for anyone experiencing 429 rate limiting. For that I'll probably just have to add a delay of around 60 seconds between metadata requests to space out requests for newly added channels or similar, if people keep experiencing problems.
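
A delay between metadata requests could be sketched roughly like this. The `fetch` callable stands in for whatever actually retrieves one video's metadata, and the names and the injectable `sleep` parameter are assumptions made so the sketch can be exercised without waiting.

```python
import time


def fetch_metadata_throttled(video_ids, fetch, delay_seconds=60, sleep=time.sleep):
    """Call `fetch` once per video ID, pausing between consecutive calls.

    `fetch` is a hypothetical callable taking a video ID and returning its
    metadata. No pause is added before the first request.
    """
    results = {}
    for position, video_id in enumerate(video_ids):
        if position > 0:
            sleep(delay_seconds)
        results[video_id] = fetch(video_id)
    return results
```

At 60 seconds per request, a newly added channel with 1000 videos would take roughly 16 to 17 hours to fetch all metadata, so the delay trades sync speed for fewer rate-limit errors.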

I'll add it to the future roadmap as a possible feature, as using the feeds would be a nicer way to keep channels updated with new content. It won't replace anything too significant internally and it's not that much work really: just use a different indexer once a source has been indexed at least once. It wouldn't require any massive internal reworking.
