lecture-hoarder's Introduction

Lecture Hoarder

Automated tool to scrape the University of Manchester video portal and download all available lecture podcasts for your course.

Note: Requires a valid University of Manchester username and password.

Use at your own risk

Lecture Hoarder relies on an unstable web interface that is liable to change.

This program comes with ABSOLUTELY NO WARRANTY; for details see the license. The author accepts no liability for any loss of data caused by this program. Please remember to back your files up regularly.

Installation

Requires Python 3.6+

  1. Clone the repository
git clone [email protected]:ed-cooper/lecture-hoarder.git
  2. Go to the install directory
cd lecture-hoarder
  3. Install the dependencies
pip3 install -r requirements.txt

Simple Usage

Inside your installation directory, run:

python3 lecturehoarder

Podcasts are downloaded to ~/Documents/Lectures.

Advanced Usage

Lecture Hoarder can be configured by placing a lecture-hoarder-settings.yaml file in your home directory - e.g. /home/john/lecture-hoarder-settings.yaml on Linux.

Configuration options include:

  • Changing the download directory
  • Excluding certain courses from being downloaded
  • Pre-specifying a username / password combination
  • And more

For information on configuration, please see the wiki page.

Useful Notes

Podcasts take a long time to download, so the first run may take a while to complete.

If you interrupt the program while downloading, you may find .partial files in the output directory. They are incomplete downloads and can safely be ignored/deleted.

The program will only download podcasts that you have not already downloaded, meaning that any subsequent runs (provided you don't change the download directory) will be much faster.
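
The two behaviours described above (incomplete downloads left as .partial files, and already-downloaded podcasts being skipped) can be illustrated with a minimal sketch. The function name `save_podcast` and the exact `.partial` naming are assumptions for illustration, not the project's actual code:

```python
import os
from pathlib import Path
from typing import Iterable


def save_podcast(chunks: Iterable[bytes], target: Path) -> bool:
    """Write chunks to target via a .partial file.

    Returns False if the target already exists and the download was skipped.
    (Hypothetical helper; the real program may structure this differently.)
    """
    if target.exists():
        return False  # downloaded on a previous run, nothing to do

    partial = target.with_name(target.name + ".partial")
    with open(partial, "wb") as f:
        for chunk in chunks:
            f.write(chunk)

    # Only rename once every chunk is written, so an interrupted run
    # leaves a .partial file rather than a truncated podcast.
    os.replace(partial, target)
    return True
```

Because the final name only appears after a complete download, a leftover .partial file never counts as "already downloaded" on the next run.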

lecture-hoarder's Issues

Deprecate login_service_url and video_service_base_url settings

Now that web handling has been abstracted, putting properties specific to UomPodcastProvider into the Profile doesn't seem reasonable.

Additionally, the initial reason for them to be in the settings file (see #3) is no longer the case.

Instead, they should be moved to attributes in the UomPodcastProvider class, where they can still be changed by an overriding class, if necessary.
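
A rough sketch of what this could look like. The URLs are placeholders (the real values are not given here) and `LocalPodcastProvider` is a hypothetical overriding class:

```python
class UomPodcastProvider:
    # Placeholder values; the real service URLs are not shown in this issue
    login_service_url = "https://login.example.invalid"
    video_service_base_url = "https://video.example.invalid"


class LocalPodcastProvider(UomPodcastProvider):
    # Hypothetical subclass pointing at a local test server
    video_service_base_url = "http://localhost:8000"
```

A subclass only overrides the attribute it needs and inherits the other.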

Validate every usage of BeautifulSoup in UomPodcastProvider

Every time the .find method or equivalent is used, we should validate that the HTML element was actually found and raise a PodcastProviderError otherwise.

Currently these errors are mostly not handled and will result in confusing random exceptions.
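
One possible shape for this check is a small wrapper around each lookup. `require_found` is a hypothetical helper, and the `PodcastProviderError` class below is a stand-in for the project's own definition:

```python
class PodcastProviderError(Exception):
    """Stand-in for the project's error type; the real definition may differ."""


def require_found(element, description: str):
    """Return element unchanged, or raise if the lookup returned None.

    BeautifulSoup's .find returns None when nothing matches, so this
    turns a later AttributeError into a clear provider error.
    """
    if element is None:
        raise PodcastProviderError(f"Could not find {description} in the page")
    return element
```

Usage would then look something like `form = require_found(soup.find("form"), "login form")`, where the tag name is only an example.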

Abstract file storage

Similar concept to the abstraction of web requests (#21)

Allows the program to be tested without side effects - potentially useful for a dry run option

YAML Config

Python is not an appropriate config format; YAML (or another format) should be used instead.

The config file should also be automatically generated. It is also advisable that the config is placed inside the user's home directory, with read permission restricted to that user only.
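
As a sketch of the permissions part, assuming a POSIX system: the file can be created with mode 0o600 so that only the owner can read it. `write_private_file` is a hypothetical helper name:

```python
import os
from pathlib import Path


def write_private_file(path: Path, contents: str) -> None:
    """Write contents to path, readable and writable by the owner only."""
    # Mode 0o600 applies when the file is newly created
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(contents)
    # Tighten permissions in case the file already existed with a wider mode
    os.chmod(path, 0o600)
```

On Windows, `os.chmod` only controls the read-only flag, so a different mechanism would be needed there.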

Make settings file optional

The codebase now contains sensible default values for all settings.

The program should be able to run without any settings file.

Add proper command line option support

Initially, we should aim for the settings file to be specified with -s or --settings-file.

Future options could include displaying the license, a dry run, and a manual override for the settings file.
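
A minimal sketch of the initial option using the standard library's argparse. The `prog` name and help text are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lecturehoarder")
    parser.add_argument(
        "-s", "--settings-file",
        metavar="PATH",
        default=None,  # None means "fall back to the default location"
        help="path to the YAML settings file",
    )
    return parser
```

argparse exposes the value as `args.settings_file`, and `--help` comes for free.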

Only download podcasts from the current year

Currently we download all available podcasts, but typically users only want podcasts from the current year.

We now extract the course series and use it for categorisation (see #19), which can be used to bootstrap the implementation for this.

This should be supported by a setting to allow all podcasts to be downloaded.

Add use at own risk warning

This project is using an unstable interface with their servers, which hasn't been formally approved.

I therefore feel there should be a disclaimer in the README, displayed each time the project is run.

Recommend setup by venv

Packages have become outdated over time.

Update README to recommend venvs for setup, so that older packages can be isolated from the main python install.

Abstract web requests

All web request logic is currently handled by __main__.py.

This clutters the file, making it hard to understand and maintain.

A new interface should be created for handling web requests.

In addition, it should be generic, so that dependency injection can be used to assist #3.

Clipping for long podcast names

Long podcast names cause a single download to spread over multiple lines.

This leads to corruption when it comes to trying to overwrite the download status.

We should use the known terminal width to clip podcast names to a suitable length, and add an ellipsis to show that some text is hidden.
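
A possible sketch of the clipping, using `shutil.get_terminal_size` for the width. The helper names and the `reserved` parameter (space kept for the status text) are assumptions:

```python
import shutil


def clip_name(name: str, max_width: int) -> str:
    """Clip name to at most max_width characters, ending in an ellipsis."""
    if max_width <= 0:
        return ""
    if len(name) <= max_width:
        return name
    if max_width <= 3:
        return "." * max_width  # no room for any of the name itself
    return name[: max_width - 3] + "..."


def clip_to_terminal(name: str, reserved: int = 0) -> str:
    """Clip name to the terminal width, leaving `reserved` columns free."""
    columns = shutil.get_terminal_size().columns
    return clip_name(name, columns - reserved)
```

For example, `clip_name("abcdefghij", 8)` gives `"abcde..."`, which is exactly 8 characters wide.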

Check file access permissions

Currently we assume we can read from folders, create new files, etc.

If this is not the case, an exception occurs with a traceback displayed to the user:

Traceback (most recent call last):
  File "run.py", line 199, in <module>
    os.makedirs(course_dir, exist_ok=True)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/media/edward/Mass Storage'

Although the problem is clearly identified in the error, we should aim to make the message more user-friendly through proper checks.
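
One way this might look is to attempt the directory creation up front and convert failures into a readable message. `prepare_directory` and the message wording are hypothetical:

```python
import os
from pathlib import Path
from typing import Optional


def prepare_directory(path: Path) -> Optional[str]:
    """Create path if needed; return a friendly error message, or None on success."""
    try:
        os.makedirs(path, exist_ok=True)
    except PermissionError:
        return f"Cannot create '{path}': permission denied. Check the download directory setting."
    except OSError as e:
        return f"Cannot create '{path}': {e.strerror}"

    # The directory exists; make sure we can actually write files into it
    if not os.access(path, os.W_OK | os.X_OK):
        return f"Cannot write to '{path}': permission denied."
    return None
```

The caller can print the returned message and exit, instead of showing a traceback.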

Packaging

Each release should produce a .deb file that installs the program into /bin or /usr/bin.

Check for duplicate but out of order podcasts

Occasionally, lecturers may add podcasts that came before ones already available, causing the order of podcasts to change.

Additionally, podcasts may be deleted.

Currently, detection of duplicates requires an exact name match, meaning that in the above situations we download all the subsequent podcasts again, causing multiple podcasts with the same number to appear.

Download automatic subtitles

Podcasts now have subtitle files with automatic captions available for download.

Supporting this would be useful.

Add course filtering

Allow users to choose which modules/courses they want to download.

Probably should be in the config, but maybe also a CLI argument to override it.

Categorise lectures into years

The year for each lecture is given by the first numeric character in the course name

Having a migration handler would also be useful
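
The year rule in this issue is simple enough to sketch directly. `course_year` is a hypothetical function name:

```python
from typing import Optional


def course_year(course_name: str) -> Optional[int]:
    """Return the first numeric character of the course name as the year."""
    for ch in course_name:
        if ch in "0123456789":
            return int(ch)
    return None  # no digit found, so the year is unknown
```

With the course name used elsewhere in these issues, `COMP10120 - First Year Team Project 2018/19`, the first digit is 1, so the lecture is categorised as year 1.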

Errors sometimes not being reported correctly

When testing error reporting, I found that simulating an error often led to unexpected results.

Example code: (line 104)

    # Check status code valid
    if True:  # get_video_service_podcast_page.status_code != 200:
        podcast["completion_time"] = time.time()
        podcast["error"] = "Could not get podcast webpage for " + podcast["name"] + \
                           " - Service responded with status code" + get_video_service_podcast_page.status_code
        podcast["status"] = "error"
        return

All the real errors I have experienced so far have resulted in exceptions occurring, so I'm not too concerned about fixing this immediately.

In addition, any errors that occur can almost always be remedied by running the program again.

Change get_podcast_downloader return type

This return type is the only thing preventing the PodcastProvider interface from being entirely independent of the web.

The return type will need to support asynchronous downloading and contain the total download size (as given by int(http_download_response.headers['Content-Length'])).

This will probably require the creation of a new class for the return type.

Runtime login

It is quite a risk having raw passwords on disk. On each run the program should ask you for your username and password.

Abstract into model

Currently we have an undocumented dictionary format for podcast downloads, containing the following properties:

  • name
  • podcast_link
  • download_path
  • status
  • error
  • progress
  • total_size
  • completion_time

For future development, we should develop a dedicated class for podcasts, as well as making status an enum type.
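
A sketch of what such a class could look like, keeping the property names listed above. Only the "error" status value appears in the existing code; the other enum members, the field types, and the defaults are assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PodcastStatus(Enum):
    # Only ERROR is known from the current dictionary format;
    # the remaining members are illustrative guesses.
    WAITING = "waiting"
    DOWNLOADING = "downloading"
    SUCCESS = "success"
    ERROR = "error"


@dataclass
class Podcast:
    name: str
    podcast_link: str
    download_path: str
    status: PodcastStatus = PodcastStatus.WAITING
    error: Optional[str] = None
    progress: int = 0                        # bytes downloaded so far (assumed unit)
    total_size: Optional[int] = None         # unknown until the download starts
    completion_time: Optional[float] = None  # time.time() value when finished
```

A dataclass documents the fields in one place, and a misspelt attribute in a constructor call fails immediately rather than silently adding a new dictionary key.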

We should also think about breaking up functionality - ideally run.py should only care about general data flow through the program, rather than implementation details such as output formatting, extraction of page data, etc.

Filter podcast names

Currently, we filter the names of courses to remove illegal characters - e.g. COMP10120 - First Year Team Project 2018/19 becomes COMP10120 - First Year Team Project 201819.

The same also needs to happen for podcast names (which come from podcast_li.a.string)
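
A shared helper could serve both course and podcast names. The exact set of illegal characters is an assumption (the characters reserved in Windows filenames, which include the `/` from the example above), and `sanitise_name` is a hypothetical name:

```python
import re

# Assumed set of illegal filename characters: < > : " / \ | ? *
_ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*]')


def sanitise_name(name: str) -> str:
    """Remove characters that are not safe to use in a file or folder name."""
    return _ILLEGAL_CHARS.sub("", name).strip()
```

Applied to the course name above, this reproduces the existing behaviour: `2018/19` becomes `201819`.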
