ed-cooper / lecture-hoarder

Automated tool to download University of Manchester lecture podcasts

License: GNU General Public License v3.0

Python 100.00%

Lecture Hoarder

Automated tool to scrape the University of Manchester video portal and download all available lecture podcasts for your course.

Note: Requires a valid University of Manchester username and password.

Use at your own risk

Lecture Hoarder relies on an unstable web interface that is liable to change.

This program comes with ABSOLUTELY NO WARRANTY; for details see the license. The author accepts no liability for any loss of data caused by this program. Please remember to back your files up regularly.

Installation

Requires Python 3.6+

Clone the repository

git clone git@github.com:ed-cooper/lecture-hoarder.git

Go to the install directory

cd lecture-hoarder

Install the dependencies

pip3 install -r requirements.txt

Simple Usage

Inside your installation directory, run:

python3 lecturehoarder

Podcasts are downloaded to ~/Documents/Lectures.

Advanced Usage

Lecture Hoarder can be configured by placing a lecture-hoarder-settings.yaml file in your home directory - e.g. /home/john/lecture-hoarder-settings.yaml on Linux.

Configuration options include:

Changing the download directory
Excluding certain courses from being downloaded
Pre-specifying a username / password combination
And more

For information on configuration, please see the wiki page.

Useful Notes

Podcasts take a long time to download, so the first run may take a while to complete.

If you interrupt the program while downloading, you may find .partial files in the output directory. They are incomplete downloads and can safely be ignored/deleted.

The program will only download podcasts that you have not already downloaded, meaning that any subsequent runs (provided you don't change the download directory) will be much faster.
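
The two notes above (leftover .partial files, and skipping podcasts that already exist) can be illustrated with a minimal sketch. This is not necessarily how Lecture Hoarder implements it; save_if_missing and fetch_chunks are hypothetical names used only for illustration.

```python
import os
from typing import Callable, Iterable


def save_if_missing(fetch_chunks: Callable[[], Iterable[bytes]], final_path: str) -> bool:
    """Download to final_path unless it already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(final_path):
        return False  # already downloaded on a previous run

    partial_path = final_path + ".partial"
    with open(partial_path, "wb") as partial_file:
        for chunk in fetch_chunks():
            partial_file.write(chunk)

    # Reached only when every chunk was written. An interrupted run
    # leaves the .partial file behind rather than a truncated final file.
    os.replace(partial_path, final_path)
    return True
```

Because the final name only appears after a complete download, an existence check on the final path is enough to decide whether a podcast can be skipped.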

lecture-hoarder's Issues

Deprecate login_service_url and video_service_base_url settings

Now that web handling has been abstracted, putting properties specific to UomPodcastProvider into the Profile doesn't seem reasonable.

Additionally, the initial reason for them to be in the settings file (see #3) is no longer the case.

Instead, they should be moved to attributes in the UomPodcastProvider class, where they can still be changed by an overriding class, if necessary.
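
A rough sketch of what this could look like is given below. The URL values are placeholders, and podcast_page_url and StagingPodcastProvider are hypothetical names added only to show the override.

```python
class UomPodcastProvider:
    # Placeholder values: the real service URLs are not reproduced here.
    login_service_url = "https://login.example.invalid"
    video_service_base_url = "https://video.example.invalid"

    def podcast_page_url(self, path: str) -> str:
        """Hypothetical helper that builds a URL from the class attribute."""
        return self.video_service_base_url + path


class StagingPodcastProvider(UomPodcastProvider):
    """An overriding class only needs to redefine the attribute it changes."""

    video_service_base_url = "https://staging.example.invalid"
```

Reading the attribute through self means the subclass value is picked up without any change to the methods that use it.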

Validate every usage of BeautifulSoup in UomPodcastProvider

Every time the .find method or equivalent is used, we should validate that the HTML element was actually found and raise a PodcastProviderError otherwise.

Currently these errors are mostly not handled and will result in confusing random exceptions.
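
One possible shape for this check is a small wrapper, sketched below. It relies on BeautifulSoup's find returning None when nothing matches; find_or_raise is a hypothetical helper, and the base class of PodcastProviderError is assumed.

```python
class PodcastProviderError(Exception):
    """Raised when a scraped page does not have the expected structure."""


def find_or_raise(parent, description, *args, **kwargs):
    """Call parent.find(...) and fail loudly if nothing matches.

    parent is any object with a BeautifulSoup-style find method;
    description is a human-readable name used in the error message.
    """
    element = parent.find(*args, **kwargs)
    if element is None:
        raise PodcastProviderError(f"Could not find {description} in page")
    return element
```

A call such as find_or_raise(soup, "login form", "form") would then replace a bare soup.find("form"), so that a missing element is reported by name instead of surfacing later as an AttributeError on None.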

Abstract file storage

Similar concept to the abstraction of web requests (#21)

Allows the program to be tested without side effects - potentially useful for a dry run option
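
As a starting point for discussion, the abstraction might look something like the sketch below. FileStorage, DiskStorage and InMemoryStorage are hypothetical names, and the two methods shown are only a guess at the minimum the program needs.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict


class FileStorage(ABC):
    """Hypothetical interface for everything the program does with files."""

    @abstractmethod
    def exists(self, path: str) -> bool:
        """Return True if a file is already stored at path."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store data at path, replacing any existing file."""


class DiskStorage(FileStorage):
    """Real implementation backed by the local file system."""

    def exists(self, path: str) -> bool:
        return os.path.isfile(path)

    def write(self, path: str, data: bytes) -> None:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as output_file:
            output_file.write(data)


class InMemoryStorage(FileStorage):
    """Side-effect-free implementation for tests or a dry run."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def exists(self, path: str) -> bool:
        return path in self.files

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data
```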

Move source into separate /lecture-hoarder directory

The base directory for the repo has become polluted with various files.

Is it time to move the source code into its own directory?

YAML Config

Python is not an appropriate config format; YAML (or a similar format) should be used instead.

The config file should also be generated automatically. It is also advisable to place the config inside the user's home directory and restrict read permission to that user only.

Make settings file optional

The codebase now contains sensible default values for all settings.

The program should be able to run without any settings file.

Add proper command line option support

Initially, we should aim for the settings file to be specified with -s or --settings-file

Future options could include displaying the license, a dry run, manual override for the settings file
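
A minimal sketch using the standard argparse module is shown below. Only -s / --settings-file comes from this issue; the --dry-run flag name and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="lecturehoarder",
        description="Download University of Manchester lecture podcasts.",
    )
    parser.add_argument(
        "-s", "--settings-file",
        metavar="PATH",
        default=None,  # None means: fall back to the default location
        help="path to the settings file",
    )
    # Assumed name for the possible future dry run option.
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="show what would be downloaded without writing any files",
    )
    return parser
```

Calling build_parser().parse_args() in the entry point would then expose the values as args.settings_file and args.dry_run.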

Add contributing guidelines

Only download podcasts from the current year

Currently we download all available podcasts, but typically users only want podcasts from the current year.

We now extract the course series and use it for categorisation (see #19), which can be used to bootstrap the implementation of this.

This should be supported by a setting to allow all podcasts to be downloaded.

Broken on Windows

Add use at own risk warning

This project uses an unstable interface to the University's servers, which hasn't been formally approved.

I therefore feel there should be a disclaimer in the Readme, also displayed each time the program is run.

Login broken by switch to Duo 2FA

Recommend setup by venv

Packages have become outdated over time.

Update README to recommend venvs for setup, so that older packages can be isolated from the main python install.

Abstract web requests

All web request logic is currently handled by __main__.py

This clutters the file, making it hard to understand and maintain

A new interface should be created for handling web requests

In addition, it should be generic, so that dependency injection can be used to assist #3

Clipping for long podcast names

Long podcast names cause a single download to spread over multiple lines.

This leads to corruption when it comes to trying to overwrite the download status.

We should use the known terminal width to clip podcast names to a suitable length, and add an ellipsis to show that some text is hidden.
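
The clipping itself could be as simple as the sketch below. clip_name is a hypothetical helper; it counts characters with len, so it assumes every character occupies one terminal column.

```python
import shutil


def clip_name(name: str, max_width: int) -> str:
    """Shorten name to at most max_width characters, ending in an ellipsis."""
    if max_width < 1:
        return ""
    if len(name) <= max_width:
        return name
    # Reserve one column for the ellipsis that marks the hidden text.
    return name[: max_width - 1] + "…"


def available_width(reserved: int) -> int:
    """Columns left for the name after reserving space for status text."""
    return max(shutil.get_terminal_size().columns - reserved, 0)
```

The reserved argument stands for however many columns the progress or status text needs on the same line; that figure depends on the output format and is not specified here.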

Check file access permissions

Currently we assume we can read from folders, create new files, etc.

If this is not the case, an exception occurs with a traceback displayed to the user:

Traceback (most recent call last):
  File "run.py", line 199, in <module>
    os.makedirs(course_dir, exist_ok=True)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/media/edward/Mass Storage'

Although the problem is clearly identified in the error, we should aim to make the message more user friendly through proper checks
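
A possible form for such a check is sketched below. ensure_writable_dir is a hypothetical helper and the message wording is only a suggestion.

```python
import os


def ensure_writable_dir(path: str) -> None:
    """Create path if needed and exit with a readable message on failure."""
    try:
        os.makedirs(path, exist_ok=True)
    except PermissionError:
        # SystemExit with a string prints it to stderr and exits with status 1,
        # so the user sees one line instead of a traceback.
        raise SystemExit(
            f"Cannot create '{path}': permission denied. "
            "Check the download directory in your settings."
        )
    if not os.access(path, os.W_OK | os.X_OK):
        raise SystemExit(f"Cannot write to '{path}': permission denied.")
```

Running this once for the download directory before any downloads start would catch the case shown in the traceback above.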

Packaging

Releases should include a .deb file that installs the program into /bin or /usr/bin.

Check for duplicate but out of order podcasts

Occasionally, lecturers may add podcasts that came before ones already available, causing the order of podcasts to change.

Additionally, podcasts may be deleted.

Currently, detection of duplicates requires an exact name match, meaning that in the above situations we download all the subsequent podcasts again, causing multiple podcasts with the same number to appear.

Download automatic subtitles

Podcasts now have subtitle files with automatic captions available for download.

Supporting this would be useful.

Add course filtering

Allow users to choose which modules/courses they want to download.

Probably should be in the config, but maybe also a CLI argument to override it.

Realtime progress updates

It is possible to adjust the code to download the files in chunks:

Source: https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests

The current code also calculates the total download size from the content-length headers

If we can find a non-blocking way to wait for tasks to be completed, it is therefore possible to create a live progress update for downloads.

(Feedback can be reported via the queue array)
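
The idea can be sketched independently of the HTTP library, as below. With requests, the chunks would typically come from response.iter_content(chunk_size=...) on a request made with stream=True. The function names are hypothetical, and using a queue.Queue for feedback is an assumption about what "the queue array" refers to.

```python
import queue
from typing import BinaryIO, Iterable, Optional, Tuple


def write_with_progress(
    chunks: Iterable[bytes],
    total_size: int,
    out_file: BinaryIO,
    progress_queue: "queue.Queue[Tuple[int, int]]",
) -> int:
    """Write chunks to out_file, reporting (written, total) after each one."""
    written = 0
    for chunk in chunks:
        out_file.write(chunk)
        written += len(chunk)
        progress_queue.put((written, total_size))
    return written


def latest_progress(
    progress_queue: "queue.Queue[Tuple[int, int]]",
) -> Optional[Tuple[int, int]]:
    """Return the newest report without blocking, or None if there is none."""
    latest = None
    while True:
        try:
            latest = progress_queue.get_nowait()
        except queue.Empty:
            return latest
```

The download would run in a worker thread while the main thread calls latest_progress periodically to redraw the status line, which avoids blocking on any single download.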

Categorise lectures into years

The year for each lecture is given by the first numeric character in the course name

Having a migration handler would also be useful
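
The rule for finding the year can be written in a few lines, as in this sketch (course_year is a hypothetical name):

```python
from typing import Optional


def course_year(course_name: str) -> Optional[int]:
    """Return the first numeric character of the course name as the year."""
    for character in course_name:
        if character in "0123456789":
            return int(character)
    return None  # no digit found, so the year is unknown
```

For example, COMP10120 would be placed in year 1.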

Errors sometimes not being reported correctly

When testing error reporting, I found that simulating an error often led to unexpected results.

Example code: (line 104)

    # Check status code valid
    if True:  # get_video_service_podcast_page.status_code != 200:
        podcast["completion_time"] = time.time()
        # status_code is an int (as in requests), so convert it before concatenating
        podcast["error"] = "Could not get podcast webpage for " + podcast["name"] + \
                           " - Service responded with status code " + \
                           str(get_video_service_podcast_page.status_code)
        podcast["status"] = "error"
        return

All the real errors I have experienced so far have resulted in exceptions occurring, so I'm not too concerned about fixing this immediately.

In addition, any errors that occur can almost always be remedied by running the program again.

Video page format change

The DOM of the site has changed. Videos no longer download.

Change get_podcast_downloader return type

This return type is the only thing preventing the PodcastProvider interface from being entirely independent of the web.

The return type will need to support asynchronous downloading and contain the total download size (as given by int(http_download_response.headers['Content-Length'])).

This will probably require the creation of a new class for the return type.
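
One possible shape for that class is sketched below. PodcastDownload and its field names are hypothetical; the chunks are exposed as a lazy iterator so the caller can consume them from a worker thread.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class PodcastDownload:
    """Web-independent description of a download in progress."""

    total_size: int          # size in bytes, e.g. taken from Content-Length
    chunks: Iterator[bytes]  # lazily yields the file contents piece by piece


def download_from_bytes(data: bytes, chunk_size: int = 4) -> PodcastDownload:
    """Build a PodcastDownload from in-memory data, e.g. for tests."""
    pieces = (data[i:i + chunk_size] for i in range(0, len(data), chunk_size))
    return PodcastDownload(total_size=len(data), chunks=pieces)
```

A web-backed provider would fill the same two fields from the HTTP response, so callers never need to touch the response object itself.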

Runtime login

It is quite a risk having raw passwords on disk. On each run the program should ask you for your username and password.
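
A minimal sketch of prompting at runtime is given below, using the standard getpass module so the password is not echoed. prompt_credentials is a hypothetical name; the prompt functions are parameters only so that they can be replaced in tests.

```python
import getpass
from typing import Callable, Tuple


def prompt_credentials(
    ask_username: Callable[[str], str] = input,
    ask_password: Callable[[str], str] = getpass.getpass,
) -> Tuple[str, str]:
    """Ask for the username and password instead of reading them from disk."""
    username = ask_username("Username: ").strip()
    password = ask_password("Password: ")
    return username, password
```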

Abstract into model

Currently we have an undocumented dictionary format for podcast downloads, containing the following properties:

name
podcast_link
download_path
status
error
progress
total_size
completion_time

For future development, we should develop a dedicated class for podcasts, as well as making status an enum type.

We should also think about breaking up functionality - ideally run.py should only care about general data flow through the program, rather than implementation details such as output formatting, extraction of page data, etc.
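
As a starting point, the dedicated class and status enum could look like the sketch below. The field names come from the list above; the field types, the default values, and every status other than "error" are assumptions.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PodcastStatus(Enum):
    WAITING = "waiting"          # assumed
    DOWNLOADING = "downloading"  # assumed
    SUCCESS = "success"          # assumed
    ERROR = "error"              # the only value seen in the existing code


@dataclass
class Podcast:
    name: str
    podcast_link: str
    download_path: str
    status: PodcastStatus = PodcastStatus.WAITING
    error: Optional[str] = None
    progress: int = 0
    total_size: Optional[int] = None
    completion_time: Optional[float] = None

    def mark_error(self, message: str) -> None:
        """Record a failure in one place instead of setting three keys."""
        self.completion_time = time.time()
        self.error = message
        self.status = PodcastStatus.ERROR
```

With an enum, a misspelt status fails immediately with an AttributeError rather than silently creating a new string value.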

Add automated testing

Will allow us to track regressions.

Filter podcast names

Currently, we filter the names of courses to remove illegal characters - e.g. COMP10120 - First Year Team Project 2018/19 becomes COMP10120 - First Year Team Project 201819

The same also needs to happen for podcast names (which come from podcast_li.a.string)
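
A shared helper could serve both cases, as in the sketch below. sanitise_name is a hypothetical name, and the exact set of illegal characters is an assumption (the characters Windows rejects in file names); the existing course filter may use a different set.

```python
import re

# Assumed set of illegal characters: those reserved in Windows file names.
ILLEGAL_CHARACTERS = re.compile(r'[<>:"/\\|?*]')


def sanitise_name(name: str) -> str:
    """Remove characters that cannot appear in a file or directory name."""
    return ILLEGAL_CHARACTERS.sub("", name).strip()
```

Applied to the course name above, this gives COMP10120 - First Year Team Project 201819, matching the current behaviour for courses.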