brokkr / poca
A fast, multithreaded and highly customizable command line podcast client, written in Python 3
License: GNU General Public License v3.0
Currently you assign a fixed limit in MB to the subscription. Make it an optional setting (if not set, there is no limit) and add an additional optional setting limiting the number of entries to keep (e.g. I always want the latest newshour and the one from the day before).
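A rough sketch of how the selection could work with both settings optional (entry sizes assumed already gathered; see also the file-size issue further down):

    def limit_entries(entries, max_mb=None, max_number=None):
        # keep the newest entries that fit within the optional caps
        kept, total = [], 0
        for entry in entries:  # assumed sorted newest first
            if max_number is not None and len(kept) >= max_number:
                break
            if max_mb is not None:
                if total + entry['size'] > max_mb * 1024 * 1024:
                    break
                total += entry['size']
            kept.append(entry)
        return kept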
Skip audio tagging if the file is not an mp3 (or implement Ogg tagging).
xmlconf seems an antiquated way of doing things - just writing out a huge string - when we have lxml.objectify. Also, its style is a bit different from the output of poca-subscribe. Is there a better way of producing a default template than giant string-writing?
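A sketch of producing the default template with lxml instead of one giant string (the element names are illustrative, not the actual schema):

    from lxml import etree

    def default_template():
        root = etree.Element("poca")
        settings = etree.SubElement(root, "settings")
        etree.SubElement(settings, "base_dir").text = "/tmp/poca"
        etree.SubElement(root, "subscriptions")
        # an example subscription could be appended the same way
        return etree.tostring(root, pretty_print=True,
                              xml_declaration=True, encoding="utf-8")

This would also make the template's style match whatever poca-subscribe writes, since both would go through the same serializer.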
Set up a socket for receiving feed updates and fire off one process for each subscription. The feed processes report back to the socket. On that socket runs a single, serial downloader that processes the updates (little Wanted+Unwanted+Lacking etc. packages). The processing includes deletes, downloads, and reports to user. The downloader/main process simply deals with the updates in the order they appear on the socket, i.e. more responsive servers will get first in line.
The proposed distinction between multiple update processes and main process is identifiable in the current code as that between 'plans' and 'execution'.
Since the downloading will still be serial, multiprocessing won't accomplish much in terms of speed gains, but it should minimize 'lag' and waiting. We stay away from parallel downloads partly because each download would steal bandwidth from the others, partly because most updates won't see multiple downloads if your average user subscribes to, say, 10-20 podcasts and updates once an hour. Finally and most importantly, total multiprocessing invites far more chaos when things go wrong and would require a greater UI rethink.
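A minimal sketch of the proposed split, with a multiprocessing.Queue standing in for the socket (plan_subscription and execute are placeholders for the existing 'plans' and 'execution' code):

    import multiprocessing as mp

    def feed_worker(sub, queue):
        # one process per subscription: fetch feed, work out the plan
        queue.put(plan_subscription(sub))  # the Wanted/Unwanted/Lacking package

    def run(subs):
        queue = mp.Queue()
        workers = [mp.Process(target=feed_worker, args=(sub, queue))
                   for sub in subs]
        for worker in workers:
            worker.start()
        for _ in subs:
            execute(queue.get())  # serial: first responder gets served first
        for worker in workers:
            worker.join()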
The following download-failure entries in the file log
2017-02-27 13:10 RADIOAVISEN. Removed: radioavisen-2017-02-24-12-00-2.mp3
2017-02-27 13:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 14:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 15:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 16:10 RADIOAVISEN. Downloaded: radioavisen-2017-02-27-12-00-2.mp3
are not added to the buffer:
In [2]: fname = '/home/mads/.poca/db/.poca'
In [3]: with open(fname, 'rb') as f:
   ...:     jar = pickle.load(f)
In [4]: jar.buffer
Out[4]: []
It seems only failures on the feedparser side are added to the buffer. Is this how we want it to work?
When poca notices changes to max_no and max_mb, it resets the etags so that a full update is forced. This does not happen when filters are changed.
It should be comparatively easy to add support for mailing the changes to yourself, at least with a local mail server. See dispatches for inspiration.
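With a local mail server the standard library should cover it; a sketch:

    import smtplib
    from email.message import EmailMessage

    def mail_report(report_text, addr="user@localhost"):
        msg = EmailMessage()
        msg["Subject"] = "poca update"
        msg["From"] = addr
        msg["To"] = addr
        msg.set_content(report_text)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)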
Check to see if a tag exists; if not, create an empty tag.
Currently, file logging of user deletions is being handed UIDs and is logging UIDs. This is due to confusion over what sort of entity we're handing over to output. Standardizing on filenames or entries would help avoid this confusion.
Pro filenames:
Pro entries: entry['poca_filename'] is instantly recognisable - you know what that is. A plain filename could be anything.

Add a flag to run with a user-designated config directory.
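A sketch with argparse (the flag name is made up):

    import argparse

    parser = argparse.ArgumentParser(prog="poca")
    parser.add_argument("--config", default="~/.poca",
                        help="use this config directory instead of ~/.poca")
    args = parser.parse_args()
    config_dir = args.config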
Since all feed requests are made with the saved etag, a feed request made after changes to a subscription's max_mb or max_no attributes in the config file will return an empty feed, causing the program to skip the subscription.
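One way out of the stale-etag trap: store a fingerprint of the relevant settings alongside the etag and drop the etag whenever the fingerprint changes. A sketch (the jar attributes are illustrative; feedparser does accept an etag argument):

    import hashlib
    import feedparser

    def fresh_feed(sub, jar):
        # filters should probably go into the fingerprint too (see above)
        fingerprint = hashlib.md5(
            (str(sub.max_number) + str(sub.max_mb)).encode()).hexdigest()
        etag = jar.etag if getattr(jar, 'fingerprint', None) == fingerprint else None
        jar.fingerprint = fingerprint
        return feedparser.parse(sub.url, etag=etag)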
Currently only mp3 files are tagged. We should extend support to Ogg. (Test case: Linux Voice.)
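mutagen's generic File class abstracts over mp3 and Ogg, so a sketch of format-agnostic tagging (this would also cover the create-an-empty-tag item above):

    import mutagen

    def tag_file(path, overrides):
        audio = mutagen.File(path, easy=True)
        if audio is None:
            return  # format mutagen doesn't recognise: skip tagging
        if audio.tags is None:
            audio.add_tags()  # create an empty tag if none exists
        for key, value in overrides.items():
            audio[key] = value  # e.g. {"artist": "RADIOAVISEN"}
        audio.save()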
Save a 0.7 snapshot for testing the upgrade path. Will 0.8 correctly recognise and parse the dbs of 0.7? We haven't changed anything about history, but configs are saved...?
Maybe it could be an option to download via Tor?
Removal should also remove files (and possibly clear out any history?)
The cover download should check the file extension rather than assume jpg.
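A sketch of deriving the extension from the response instead (Content-Type first, then the URL path, jpg as last resort):

    import mimetypes
    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def cover_filename(url):
        with urlopen(url, timeout=30) as response:
            ext = mimetypes.guess_extension(
                response.headers.get_content_type())
        if not ext:
            ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
        return "cover" + ext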
When the amount kept was governed by file size, we needed a size for every file. As part of creating a combo instance, an expansion was done on all file entries, including adding information about file size. When this is not included in the feed, we resort to pinging each URL in turn to gather this information. For a long feed this can take several minutes.
This should only happen once, because the entryinfo.expand function is only run on entries not in the jar. However, it seems to be a recurring issue in some cases...?
Options:
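One option, sketched: only gather sizes when max_mb is actually set, and use HEAD requests rather than full GETs:

    from urllib.request import Request, urlopen

    def remote_size(url):
        # HEAD request; returns None if the server doesn't volunteer a size
        with urlopen(Request(url, method="HEAD"), timeout=30) as response:
            length = response.headers.get("Content-Length")
        return int(length) if length else None

If max_mb is made optional (see the first issue above), the whole expansion could be skipped when it isn't set.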
Add a podcast search to the as-yet-to-be-written poca-subscribe tool. I think I've bookmarked one with an API somewhere, no?
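The iTunes Search API is one candidate - no API key required, as far as I know; a sketch:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def search_podcasts(term):
        query = urlencode({"media": "podcast", "term": term})
        with urlopen("https://itunes.apple.com/search?" + query) as response:
            results = json.load(response)["results"]
        return [(hit["collectionName"], hit["feedUrl"])
                for hit in results if hit.get("feedUrl")]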
Terse output would mean only outputting actual changes. So the user would only see lines saying episode removed and episode downloaded. And probably the error ones as well. This could be useful for logging, especially as a prerequisite for email logging (issue #26)
Syntastic has a ton of (style?) complaints. Go through them and either dismiss or adjust.
Config checks that the db directory is writable but the results of all the checks we do in history.py are simply dropped and all is assumed well.
Most podcast MP3s come with an embedded image these days, but some seem to rely on iTunes magic with images inserted into the feed (usually iTunes-specific tags). Does feedparser report these? Can we access them? Download them as a fallback cover.jpg in the folder?
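feedparser does expose a channel-level image as feed.image (and, I believe, maps itunes:image onto it too); a fallback sketch:

    import feedparser

    def fallback_cover_url(feed_url):
        parsed = feedparser.parse(feed_url)
        return parsed.feed.get("image", {}).get("href")  # None if absent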
An error in some (previous?) version seems to have caused some files in Savage Love and TAC to 'drop out' of the db. These files are then invisible to poca and are never removed except by hand.
Solutions:
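One solution, sketched: compare the directory contents against the db and surface (or delete) the orphans. This assumes jar.dic is keyed on filenames, which may be off:

    import os

    def find_orphans(sub_dir, jar):
        on_disk = set(os.listdir(sub_dir))
        known = set(jar.dic)  # filenames poca knows about
        return sorted(on_disk - known)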
We never really probe for what sort of strings we are tossing around. More testing to make sure that we don't run into trouble with unicode/non-unicode strings in filenames, feeds or tags.
We do a select few checks on config settings, but not in any consistent way. E.g. if an incorrect date format is used in after_date, the program simply crashes with a ValueError.
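A sketch of catching that at config-load time instead:

    import time

    def valid_after_date(value):
        # after_date is documented as an XXXX-XX-XX string (see below)
        try:
            time.strptime(value, "%Y-%m-%d")
            return True
        except ValueError:
            return False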
There are actually a number of distinct jobs here:
- checking that required elements are present (settings and subscriptions in the global part, and title + url in each subscription)
- casting values, e.g. a string into an int (max_number) or an XXXX-XX-XX date string into a struct_time instance.
Currently a selection of these tasks are performed in between harvesting the XML and creating poca's own data-holding objects. Which begs three questions.
We have assigned 60s as the timeout for a single block of about 8 KB. However, the timeout seems to kick in if the entire download takes more than 60s, even when each individual block takes much less time than that. Need to investigate how to control those signals.
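For reference: the timeout passed to urlopen (or set with socket.setdefaulttimeout) applies to each blocking socket operation, not to the transfer as a whole, so a chunked read should reset the clock on every block. If the current code uses signal.alarm instead, that would explain the behaviour, since SIGALRM fires on total elapsed time. A sketch of the per-block variant:

    from urllib.request import urlopen

    def download(url, path, block_size=8192):
        # the 60s budget applies to each read(), not the whole download
        with urlopen(url, timeout=60) as response, open(path, "wb") as f:
            while True:
                block = response.read(block_size)
                if not block:
                    break
                f.write(block)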
Similar to other restrictions on combo.lst we could restrict it further by filtering based on
Some entries in the RFI feed are not being replaced despite being from November 2015. How they got in there is a mystery. More importantly: why aren't they being replaced by newer ones? They have clear-cut entries in both jar.lst and jar.dic - though they may not conform to 0.5 specs? Maybe 'valid' is not in entry?
In order to avoid getting summaries of file actions on the stream (in addition to the one-per-line +/-/%) we use logger.warn() but filter warnings out from the stream handler. This works but is utterly incomprehensible to anybody not in on it. Requires explanation or reworking.
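A slightly less cryptic version of the same trick would be an explicitly named filter on the stream handler; sketch:

    import logging

    def no_summaries(record):
        # summaries are logged as WARNING; keep them out of the stream
        return record.levelno != logging.WARNING

    stream_handler = logging.StreamHandler()
    stream_handler.addFilter(no_summaries)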
I would prefer to manually delete outdated episodes; however, I don't want to download entire feeds. Is it possible to limit the number of new downloads without automatically deleting old episodes?
Subscription settings should inherit global settings for the same options. Overrides should be possible on a per-subscription basis.
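A ChainMap would give the inheritance almost for free; sketch:

    from collections import ChainMap

    def effective_settings(sub_settings, global_settings):
        # per-subscription values shadow the global ones
        return ChainMap(sub_settings, global_settings)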
Man file makes references to google code and other outdated information.
Some podcasts leave out track numbering or play fast and loose with it, occasionally inserting 'special' shows that do not get a track number. This can be a problem for audio players.
A solution could be to allow the user to draw on variables for insertion into the metadata, specifically:
This raises two related questions. If we want it in file names, variables are the best option. If not, it might be best to contain it to a few select scenarios.
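If we go the variables route, the substitution itself is simple; a sketch (the variable names are made up for illustration):

    def fill_template(template, entry):
        fields = {
            "title": entry.get("title", ""),
            "date": entry.get("published", ""),
            "track_no": str(entry.get("poca_track_no", "")),
        }
        return template.format(**fields)  # e.g. "{date} {title}"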
If a feed has been removed from poca.xml, remove the dl folder and the history
If an entry has no valid enclosures, i.e. a pure text entry, the program crashes.
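The guard is small; a sketch:

    def has_enclosure(entry):
        # feedparser gives an empty list (or no key) when there are none
        return bool(entry.get("enclosures"))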
Working your way through: when a narrative podcast - e.g. Welcome to Night Vale - has a large back archive, you'll want to start at the beginning and work your way through. We need a setting that will give you the first ten episodes; then, when you send the signal, it will replace those with the next ten, and so forth.
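A sketch of the windowing, assuming entries sorted oldest first and an offset persisted in the db:

    def current_window(entries, offset, size=10):
        # on the user's signal, bump offset by size and re-run
        return entries[offset:offset + size]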
Many podcasts have either random UID filenames or just name the files inconsistently. This can cause problems with the order in which the files appear in a player.
We should have options for each podcast to rename files based on:
Instead of giving free rein, we could simply start by having a few simple prepackaged solutions for misbehaving podcasts.
We might also need sign scrubbing similar to derailleur?
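A sketch of one prepackaged solution - date plus scrubbed title (the scrubbing pattern is illustrative):

    import re

    def safe_filename(date_str, title, ext=".mp3"):
        stem = "%s %s" % (date_str, title)
        stem = re.sub(r"[^\w\s-]", "", stem)  # scrub awkward signs
        stem = re.sub(r"\s+", "_", stem.strip())
        return stem + ext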
Some podcasts, typically news, update more frequently than you might need them to. Limiting max_number doesn't do anything to combat that, as you may only ever have one episode, but it will constantly be a new one.
One way to deal with this is using the hour filter, which filters according to pubdate. However, some feeds either vary in the hour of publishing or simply disregard setting the hour on pubdate.
To deal with this we add a quota filter. This will simply instruct poca to filter the feed so that only X entries from any single day remain in the feed. So it will still rely on pubdate but to a lesser extent - hopefully.
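A sketch of the quota filter, keying on the date part of feedparser's published_parsed:

    from collections import defaultdict

    def quota_filter(entries, per_day):
        seen = defaultdict(int)
        kept = []
        for entry in entries:  # assumed sorted newest first
            day = entry["published_parsed"][:3]  # (year, month, day)
            if seen[day] < per_day:
                seen[day] += 1
                kept.append(entry)
        return kept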
If a download starts up without an internet connection, a socket.gaierror is generated, but we're unable to catch it. Instead we use a generic catch-all exception.
files.py, line 47:
except: return Outcome(False, "Unknown error")
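urllib wraps socket-level errors in URLError and stores the original in .reason, which is presumably why catching socket.gaierror directly never fires. A sketch of catching it properly (Outcome as in files.py):

    import socket
    import urllib.error
    import urllib.request

    def fetch(url):
        try:
            return urllib.request.urlopen(url, timeout=60)
        except urllib.error.URLError as error:
            if isinstance(error.reason, socket.gaierror):
                return Outcome(False, "No network connection")
            return Outcome(False, "Download error: %s" % error.reason)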
Some settings are fairly complex for most users. Some may edit them without knowing what they mean. It may therefore be necessary to load defaults and only override them with legitimate settings and settings combinations (utf8/2.4).
Feature: Add tags to subscriptions by way of tag attributes. Specifically:
<subscription category="news">...</subscription>
This ties mostly into the list command, which would be able to group subscriptions in the same category together.
<subscription state="inactive">...</subscription>
A way to temporarily opt out of a podcast without having to save it somewhere else. Should delete audio files but keep the db.

The readme could be updated after all these years.
Every time a subscription is given the once-over, we rebuild the metadata. Seems kinda pointless and a waste?
While the less verbose output is easier to interpret, it is still not immediately obvious when there are changes as opposed to when nothing new is in the pipes. One way to make the output quicker to eye-parse is by changing the output ("No changes", "1 file(s) to download", etc.) to signs indicating what's going on.
There shouldn't be an issue with the encoding seeing as we're running Python 3 and Bash (from which we would be cat/less-reading the log) shouldn't have an issue with it either. I believe.
It isn't customary in a CLI program but why not? It could also be an option in preferences:
<pictogram_output>yes</pictogram_output>
Suggestions:
Error: ⚠ (http://unicode-table.com/en/26A0/)
Download: ➕ (http://unicode-table.com/en/2795/)
Remove: ➖ (http://unicode-table.com/en/2796/)
Exit: ❌ (http://unicode-table.com/en/274C/)
Downloading: ⇵ (http://unicode-table.com/en/21C5/)
Failed download: ☇ (http://unicode-table.com/en/2607/)
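If the option goes in, the mapping itself is trivial; a sketch using the suggestions above:

    STATUS_SIGNS = {
        "error": "\u26a0",        # ⚠
        "download": "\u2795",     # ➕
        "remove": "\u2796",       # ➖
        "exit": "\u274c",         # ❌
        "downloading": "\u21c5",  # ⇵
        "failed": "\u2607",       # ☇
    }

    def prefix(status, use_pictograms):
        return STATUS_SIGNS[status] if use_pictograms else status.upper()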