brokkr / poca
A fast, multithreaded and highly customizable command line podcast client, written in Python 3
License: GNU General Public License v3.0
Currently you assign a fixed limit in MB to the subscription. Make it an optional setting (if not set, there is no limit) and add an additional optional setting limiting the number of entries to keep (e.g. I always want the latest newshour and the one from the day before).
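A rough sketch of how the selection could work with both settings optional (entry sizes assumed already gathered; see also the file-size issue further down):

    def limit_entries(entries, max_mb=None, max_number=None):
        # keep the newest entries that fit within the optional caps
        kept, total = [], 0
        for entry in entries:  # assumed sorted newest first
            if max_number is not None and len(kept) >= max_number:
                break
            if max_mb is not None:
                if total + entry['size'] > max_mb * 1024 * 1024:
                    break
                total += entry['size']
            kept.append(entry)
        return kept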
Skip audio tagging if the file is not an mp3 (or implement Ogg tagging).
xmlconf seems an antiquated way of doing things - just writing out a huge string - when we have lxml.objectify. Also, its style is a bit different from the output of poca-subscribe. Is there a better way of producing a default template than giant string-writing?
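A sketch of producing the default template with lxml instead of one giant string (the element names are illustrative, not the actual schema):

    from lxml import etree

    def default_template():
        root = etree.Element("poca")
        settings = etree.SubElement(root, "settings")
        etree.SubElement(settings, "base_dir").text = "/tmp/poca"
        etree.SubElement(root, "subscriptions")
        # an example subscription could be appended the same way
        return etree.tostring(root, pretty_print=True,
                              xml_declaration=True, encoding="utf-8")

This would also make the template's style match whatever poca-subscribe writes, since both would go through the same serializer.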
Set up a socket for receiving feed updates and fire off one process for each subscription. The feed processes report back to the socket. On that socket runs a single, serial downloader that processes the updates (little Wanted+Unwanted+Lacking etc. packages). The processing includes deletes, downloads, and reports to user. The downloader/main process simply deals with the updates in the order they appear on the socket, i.e. more responsive servers will get first in line.
The proposed distinction between multiple update processes and main process is identifiable in the current code as that between 'plans' and 'execution'.
Since the downloading will still be serial, multiprocessing won't accomplish much in terms of speed gains, but it should minimize 'lag' and waiting. We stay away from parallel downloads partly because each download would steal bandwidth from the others, partly because most updates won't see multiple downloads if your average user subscribes to, say, 10-20 podcasts and updates once an hour. Finally and most importantly, total multiprocessing invites far more chaos when things go wrong and would require a greater UI rethink.
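A minimal sketch of the proposed split, with a multiprocessing.Queue standing in for the socket (plan_subscription and execute are placeholders for the existing 'plans' and 'execution' code):

    import multiprocessing as mp

    def feed_worker(sub, queue):
        # one process per subscription: fetch feed, work out the plan
        queue.put(plan_subscription(sub))  # the Wanted/Unwanted/Lacking package

    def run(subs):
        queue = mp.Queue()
        workers = [mp.Process(target=feed_worker, args=(sub, queue))
                   for sub in subs]
        for worker in workers:
            worker.start()
        for _ in subs:
            execute(queue.get())  # serial: first responder gets served first
        for worker in workers:
            worker.join()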
The following download-failure entries in the file log
2017-02-27 13:10 RADIOAVISEN. Removed: radioavisen-2017-02-24-12-00-2.mp3
2017-02-27 13:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 14:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 15:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 16:10 RADIOAVISEN. Downloaded: radioavisen-2017-02-27-12-00-2.mp3
are not added to the buffer:
In [2]: fname = '/home/mads/.poca/db/.poca'
In [3]: with open(fname, 'rb') as f:
   ...:     jar = pickle.load(f)
In [4]: jar.buffer
Out[4]: []
It seems only failures on the feedparser side are added to the buffer. Is this how we want it to work?
When poca notices changes to max_no and max_mb, it resets the etags so that a full update is forced. This does not happen when filters are changed.
It should be comparatively easy to add support for mailing the changes to yourself, at least with a local mail server. See dispatches for inspiration.
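With a local mail server the standard library should cover it; a sketch:

    import smtplib
    from email.message import EmailMessage

    def mail_report(report_text, addr="user@localhost"):
        msg = EmailMessage()
        msg["Subject"] = "poca update"
        msg["From"] = addr
        msg["To"] = addr
        msg.set_content(report_text)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)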
Check to see if a tag exists; if not, create an empty tag.
Currently, file logging of user deletions is being handed UIDs and is logging UIDs. This is due to confusion over what sort of entity we're handing over to output. Standardizing on filenames or entries would help avoid this confusion.
Pro filenames:
Pro entries: entry['poca_filename'] is instantly recognisable - you know what that is. A plain filename could be anything.

Add a flag to run with a user-designated config directory.
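A sketch with argparse (the flag name is made up):

    import argparse

    parser = argparse.ArgumentParser(prog="poca")
    parser.add_argument("--config", default="~/.poca",
                        help="use this config directory instead of ~/.poca")
    args = parser.parse_args()
    config_dir = args.config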
Since all feed requests are made with the saved etag, a feed request made after changes to a subscription's max_mb or max_no attributes in the config file will return an empty feed, causing the program to skip the subscription.
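One way out of the stale-etag trap: store a fingerprint of the relevant settings alongside the etag and drop the etag whenever the fingerprint changes. A sketch (the jar attributes are illustrative; feedparser does accept an etag argument):

    import hashlib
    import feedparser

    def fresh_feed(sub, jar):
        # filters should probably go into the fingerprint too (see above)
        fingerprint = hashlib.md5(
            (str(sub.max_number) + str(sub.max_mb)).encode()).hexdigest()
        etag = jar.etag if getattr(jar, 'fingerprint', None) == fingerprint else None
        jar.fingerprint = fingerprint
        return feedparser.parse(sub.url, etag=etag)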
Currently only mp3 files are tagged. We should extend support to Ogg. (Test case: Linux Voice.)
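mutagen's generic File class abstracts over mp3 and Ogg, so a sketch of format-agnostic tagging (this would also cover the create-an-empty-tag item above):

    import mutagen

    def tag_file(path, overrides):
        audio = mutagen.File(path, easy=True)
        if audio is None:
            return  # format mutagen doesn't recognise: skip tagging
        if audio.tags is None:
            audio.add_tags()  # create an empty tag if none exists
        for key, value in overrides.items():
            audio[key] = value  # e.g. {"artist": "RADIOAVISEN"}
        audio.save()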
Save a 0.7 snapshot for testing the upgrade path. Will 0.8 correctly recognise and parse the dbs of 0.7? We haven't changed anything about history, but configs are saved...?
Maybe it could be an option to download via Tor?
Removal should also remove files (and possibly clear out any history?)
The cover download should check the file extension rather than assume jpg.
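A sketch of deriving the extension from the response instead (Content-Type first, then the URL path, jpg as last resort):

    import mimetypes
    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def cover_filename(url):
        with urlopen(url, timeout=30) as response:
            ext = mimetypes.guess_extension(
                response.headers.get_content_type())
        if not ext:
            ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
        return "cover" + ext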
When the amount kept was governed by file size, we needed a size for every file. As part of creating a combo instance, an expansion was done on all file entries, including adding information about file size. When this is not included in the feed, we resort to pinging each URL in turn to gather this information. For a long feed this can take several minutes.
This should only happen once, because the entryinfo.expand function is only run on entries not in the jar. However, it seems to be a recurring issue in some cases...?
Options:
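One option, sketched: only gather sizes when max_mb is actually set, and use HEAD requests rather than full GETs:

    from urllib.request import Request, urlopen

    def remote_size(url):
        # HEAD request; returns None if the server doesn't volunteer a size
        with urlopen(Request(url, method="HEAD"), timeout=30) as response:
            length = response.headers.get("Content-Length")
        return int(length) if length else None

If max_mb is made optional (see the first issue above), the whole expansion could be skipped when it isn't set.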
Add a podcast search to the as-yet-to-be-written poca-subscribe tool. I think I've bookmarked one with an API somewhere, no?
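The iTunes Search API is one candidate - no API key required, as far as I know; a sketch:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def search_podcasts(term):
        query = urlencode({"media": "podcast", "term": term})
        with urlopen("https://itunes.apple.com/search?" + query) as response:
            results = json.load(response)["results"]
        return [(hit["collectionName"], hit["feedUrl"])
                for hit in results if hit.get("feedUrl")]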
Terse output would mean only outputting actual changes. So the user would only see lines saying episode removed and episode downloaded. And probably the error ones as well. This could be useful for logging, especially as a prerequisite for email logging (issue #26)
Syntastic has a ton of (style?) complaints. Go through them and either dismiss or adjust.
Config checks that the db directory is writable but the results of all the checks we do in history.py are simply dropped and all is assumed well.
Most podcast MP3s come with an embedded image these days, but some seem to rely on iTunes magic with images inserted into the feed (usually iTunes-specific tags). Does feedparser report these? Can we access them? Download them as a fallback cover.jpg in the folder?
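feedparser does expose a channel-level image as feed.image (and, I believe, maps itunes:image onto it too); a fallback sketch:

    import feedparser

    def fallback_cover_url(feed_url):
        parsed = feedparser.parse(feed_url)
        return parsed.feed.get("image", {}).get("href")  # None if absent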
An error in some (previous?) version seems to have caused some files in Savage Love and TAC to 'drop out' of the db. These files are then invisible to poca and are never removed except by hand.
Solutions:
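One solution, sketched: compare the directory contents against the db and surface (or delete) the orphans. This assumes jar.dic is keyed on filenames, which may be off:

    import os

    def find_orphans(sub_dir, jar):
        on_disk = set(os.listdir(sub_dir))
        known = set(jar.dic)  # filenames poca knows about
        return sorted(on_disk - known)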
We never really probe for what sort of strings we are tossing around. More testing to make sure that we don't run into trouble with unicode/non-unicode strings in filenames, feeds or tags.
We do a select few checks on config settings, but not in any consistent way. E.g. if an incorrect date format is used in after_date, the program simply crashes with a ValueError.
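A sketch of catching that at config-load time instead:

    import time

    def valid_after_date(value):
        # after_date is documented as an XXXX-XX-XX string (see below)
        try:
            time.strptime(value, "%Y-%m-%d")
            return True
        except ValueError:
            return False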
There are actually a number of distinct jobs here:
- checking that required elements are present (settings and subscriptions in the global part, and title + url in each subscription)
- casting values, e.g. a string into an int (max_number) or an XXXX-XX-XX date string into a struct_time instance.
Currently a selection of these tasks are performed in between harvesting the XML and creating poca's own data-holding objects. Which begs three questions.
We have assigned 60s as the timeout for a single block of about 8 KB. However, the timeout seems to kick in if the entire download takes more than 60s, even when each individual block takes much less time than that. Need to investigate how to control those signals.
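For reference: the timeout passed to urlopen (or set with socket.setdefaulttimeout) applies to each blocking socket operation, not to the transfer as a whole, so a chunked read should reset the clock on every block. If the current code uses signal.alarm instead, that would explain the behaviour, since SIGALRM fires on total elapsed time. A sketch of the per-block variant:

    from urllib.request import urlopen

    def download(url, path, block_size=8192):
        # the 60s budget applies to each read(), not the whole download
        with urlopen(url, timeout=60) as response, open(path, "wb") as f:
            while True:
                block = response.read(block_size)
                if not block:
                    break
                f.write(block)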
Similar to other restrictions on combo.lst we could restrict it further by filtering based on
Some entries in the RFI feed are not being replaced despite being from November 2015. How they got in there is a mystery. More importantly: why aren't they being replaced by newer ones? They have clear-cut entries in both jar.lst and jar.dic - though they may not conform to 0.5 specs? Maybe 'valid' is not in entry?
In order to avoid getting summaries of file actions on the stream (in addition to the one-per-line +/-/%) we use logger.warn() but filter warnings out from the stream handler. This works but is utterly incomprehensible to anybody not in on it. Requires explanation or reworking.
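A slightly less cryptic version of the same trick would be an explicitly named filter on the stream handler; sketch:

    import logging

    def no_summaries(record):
        # summaries are logged as WARNING; keep them out of the stream
        return record.levelno != logging.WARNING

    stream_handler = logging.StreamHandler()
    stream_handler.addFilter(no_summaries)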
I would prefer to manually delete outdated episodes; however, I don't want to download entire feeds. Is it possible to limit the number of new downloads without automatically deleting old episodes?
Subscription settings should inherit global settings for the same options. Overrides should be possible on a per-subscription basis.
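A ChainMap would give the inheritance almost for free; sketch:

    from collections import ChainMap

    def effective_settings(sub_settings, global_settings):
        # per-subscription values shadow the global ones
        return ChainMap(sub_settings, global_settings)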
Man file makes references to google code and other outdated information.
Some podcasts leave out track numbering or play fast and loose with it, occasionally inserting 'special' shows that do not get a track number. This can be a problem for audio players.
A solution could be to allow the user to draw on variables for insertion into the metadata, specifically:
This raises two related questions. If we want it in file names, variables are the best option. If not, it might be best to contain it to a few select scenarios.
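If we go the variables route, the substitution itself is simple; a sketch (the variable names are made up for illustration):

    def fill_template(template, entry):
        fields = {
            "title": entry.get("title", ""),
            "date": entry.get("published", ""),
            "track_no": str(entry.get("poca_track_no", "")),
        }
        return template.format(**fields)  # e.g. "{date} {title}"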
If a feed has been removed from poca.xml, remove the dl folder and the history
If an entry has no valid enclosures, i.e. a pure text entry, the program crashes.
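The guard is small; a sketch:

    def has_enclosure(entry):
        # feedparser gives an empty list (or no key) when there are none
        return bool(entry.get("enclosures"))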
Working your way through: when a narrative podcast - e.g. Welcome to Night Vale - has a large back archive, you'll want to start at the beginning and work your way through. We need a setting that will give you the first ten episodes; then, when you send the signal, it will replace those with the next ten, and so forth.
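A sketch of the windowing, assuming entries sorted oldest first and an offset persisted in the db:

    def current_window(entries, offset, size=10):
        # on the user's signal, bump offset by size and re-run
        return entries[offset:offset + size]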
Many podcasts have either random UID filenames or just name the files inconsistently. This can cause problems with the order in which the files appear in a player.
We should have options for each podcast to rename files based on:
Instead of giving free rein, we could simply start by having a few simple prepackaged solutions for misbehaving podcasts.
We might also need sign scrubbing similar to derailleur?
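A sketch of one prepackaged solution - date plus scrubbed title (the scrubbing pattern is illustrative):

    import re

    def safe_filename(date_str, title, ext=".mp3"):
        stem = "%s %s" % (date_str, title)
        stem = re.sub(r"[^\w\s-]", "", stem)  # scrub awkward signs
        stem = re.sub(r"\s+", "_", stem.strip())
        return stem + ext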
Some podcasts, typically news, update more frequently than you might need them to. Limiting max_number doesn't do anything to combat that, as you may only ever have one episode, but it will constantly be a new one.
One way to deal with this is using the hour filter, which filters according to pubdate. However, some feeds either vary in the hour of publishing or simply disregard setting the hour on pubdate.
To deal with this we add a quota filter. This will simply instruct poca to filter the feed so that only X entries from any single day remain in the feed. So it will still rely on pubdate but to a lesser extent - hopefully.
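A sketch of the quota filter, keying on the date part of feedparser's published_parsed:

    from collections import defaultdict

    def quota_filter(entries, per_day):
        seen = defaultdict(int)
        kept = []
        for entry in entries:  # assumed sorted newest first
            day = entry["published_parsed"][:3]  # (year, month, day)
            if seen[day] < per_day:
                seen[day] += 1
                kept.append(entry)
        return kept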
If a download starts up without an internet connection, a socket.gaierror is generated, but we're unable to catch it. Instead we use a generic catch-all exception.
files.py, line 47:
except: return Outcome(False, "Unknown error")
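urllib wraps socket-level errors in URLError and stores the original in .reason, which is presumably why catching socket.gaierror directly never fires. A sketch of catching it properly (Outcome as in files.py):

    import socket
    import urllib.error
    import urllib.request

    def fetch(url):
        try:
            return urllib.request.urlopen(url, timeout=60)
        except urllib.error.URLError as error:
            if isinstance(error.reason, socket.gaierror):
                return Outcome(False, "No network connection")
            return Outcome(False, "Download error: %s" % error.reason)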
Some settings are fairly complex for most users. Some may edit them without knowing what they mean. It may therefore be necessary to load defaults and only override them with legitimate settings and settings combinations (utf8/2.4).
Feature: Add tags to subscriptions by way of tag attributes. Specifically:
<subscription category="news">...</subscription>
This ties mostly into the list command, which would be able to group subscriptions in the same category together.
<subscription state="inactive">...</subscription>
A way to temporarily opt out of a podcast without having to save it somewhere else. Should delete audio files but keep the db.

The readme could be updated after all these years.
Every time a subscription is given the once-over, we rebuild the metadata. Seems kinda pointless and a waste?
While the less verbose output is easier to interpret, it is still not immediately obvious when there are changes as opposed to when nothing new is in the pipes. One way to make the output quicker to eye-parse is by changing the output ("No changes", "1 file(s) to download", etc.) to signs indicating what's going on.
There shouldn't be an issue with the encoding seeing as we're running Python 3 and Bash (from which we would be cat/less-reading the log) shouldn't have an issue with it either. I believe.
It isn't customary in a CLI program but why not? It could also be an option in preferences:
<pictogram_output>yes</pictogram_output>
Suggestions:
Error: ⚠ (http://unicode-table.com/en/26A0/)
Download: ➕ (http://unicode-table.com/en/2795/)
Remove: ➖ (http://unicode-table.com/en/2796/)
Exit: ❌ (http://unicode-table.com/en/274C/)
Downloading: ⇵ (http://unicode-table.com/en/21C5/)
Failed download: ☇ (http://unicode-table.com/en/2607/)
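If the option goes in, the mapping itself is trivial; a sketch using the suggestions above:

    STATUS_SIGNS = {
        "error": "\u26a0",        # ⚠
        "download": "\u2795",     # ➕
        "remove": "\u2796",       # ➖
        "exit": "\u274c",         # ❌
        "downloading": "\u21c5",  # ⇵
        "failed": "\u2607",       # ☇
    }

    def prefix(status, use_pictograms):
        return STATUS_SIGNS[status] if use_pictograms else status.upper()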