
edx-dl's Introduction


Description

edx-dl is a simple tool to download videos and lecture materials from Open edX-based sites. It requires a Python interpreter (>= 2.7) and very few other dependencies. It is platform independent, and should work fine under Unix (Linux, BSDs etc.), Windows or Mac OS X.

If you don't already have a Python interpreter installed, we strongly recommend installing Python >= 3.6, if possible, since it is better in general.

Installation (recommended)

To install edx-dl run:

pip install edx-dl

Manual Installation

To install all the dependencies please do:

pip install -r requirements.txt

youtube-dl

One of the most important dependencies of edx-dl is youtube-dl. The installation step listed above already pulls in the most recent version of youtube-dl for you.

Unfortunately, since many Open edX sites store their videos on YouTube and YouTube changes its layout from time to time, it may be necessary to upgrade your copy of youtube-dl. There are many ways to proceed here, but the simplest is to run:

pip install --upgrade youtube-dl

Quick Start

Once you have installed everything, let edx-dl discover the courses in which you are enrolled by issuing:

edx-dl -u [email protected] --list-courses

From there, choose the course you are interested in, copy its URL and use it in the following command:

edx-dl -u [email protected] COURSE_URL

replacing COURSE_URL with the URL that you just copied in the first step. It should look something like: https://courses.edx.org/courses/edX/DemoX.1/2014/info

Your downloaded videos will be placed in a new directory called Downloaded, inside your current directory, but you can also choose another destination with the -o argument.

To see all available options and a brief description of what they do, simply execute:

edx-dl --help

Important Note: To use sites other than <edx.org>, you have to specify the site along with the -x option. For example, -x stanford, if the course that you want to get is hosted on Stanford's site.

Docker container

You can run this application via Docker if you want. Just install Docker and run:

docker run --rm -it \
       -v "$(pwd)/edx/:/Downloaded" \
       strm/edx-dl -u <USER> -p <PASSWORD>

Reporting issues

Before reporting any issue please follow the steps below:

  1. Verify that you are running the latest version of all the programs (both edx-dl and youtube-dl). Use the following command if in doubt:

     pip install --upgrade edx-dl
    
  2. If you get an error like "YouTube said: Please sign in to view this video.", then we can't do much about it. You can try to pass your credentials to youtube-dl (see https://github.com/rg3/youtube-dl#authentication-options) with the use of edx-dl's option --youtube-dl-options. If it doesn't work, then you will have to tell edx-dl to ignore the download of that particular video with the option --ignore-errors.

  3. If the problem persists, feel free to open an issue in our bug tracker. Please fill in the issue template with as much information as possible.

Supported sites

These are the current supported sites:

This is the full list of sites powered by Open edX. Not all of them are supported at the moment. We welcome contributions that add support for them, sent as a pull request or via our issue tracker.

Authors

See the contributors to the project in the AUTHORS.md file. If you have contributed to the project, we would gladly credit you for your work. Just send us a note to be added to that list.


edx-dl's Issues

TypeError: decode() takes no keyword arguments

$ python edx-dl.py [email protected] 123456
Traceback (most recent call last):
File "edx-dl.py", line 39, in <module>
resp = json.loads(response.read().decode(encoding = 'utf-8'))
TypeError: decode() takes no keyword arguments

Environment:
Ubuntu 10.04
Python 2.6.5

Fix:
Change line 39 to:
resp = json.loads(response.read().decode('utf-8'))
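A minimal sketch of why this fix works, assuming the response body is a UTF-8 encoded JSON byte string. On Python 2.6, decode() rejects keyword arguments, while passing the encoding positionally works there and on later versions. The payload value below is made up for illustration and is not what edX actually returns.

```python
import json

# Hypothetical stand-in for response.read(); the real body comes from the login request.
payload = b'{"success": true}'

# Passing the encoding positionally works on Python 2.6 and later,
# whereas decode(encoding='utf-8') raises TypeError on Python 2.6.
resp = json.loads(payload.decode('utf-8'))
print(resp['success'])
```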

Week Subdirectories

Suggested enhancement:

Add the option to save downloads into folders by week. The main benefit is improved organization that matches the course schedule.

Not critical of course, but would be helpful.

Thanks for this great tool!
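One way the week-based layout suggested above could be sketched, assuming the downloader knows the index of the week each video belongs to. The helper names week_directory and ensure_directory are hypothetical and not part of edx-dl.

```python
import os


def week_directory(target_dir, course_name, week_index):
    """Return the directory for one week of a course (hypothetical helper)."""
    # Zero-padded so that "Week 02" sorts before "Week 10" in file managers.
    return os.path.join(target_dir, course_name, 'Week %02d' % week_index)


def ensure_directory(path):
    """Create the directory if it does not exist yet and return its path."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


print(week_directory('Downloaded', 'CS50x Introduction to Computer Science I', 3))
```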

Retry the failed videos a few times (Videos are downloaded randomly and some are skipped!)

First of all, thanks for such a great script; it literally saved my life.

I've been having a problem where videos are downloaded in random order, some are skipped, and the script finishes before all videos are downloaded, so I have to run the script several times (more than 20 by now, I guess) to download all the videos.

For reference, my internet connection speed is 512 kbps and I've been downloading the CS188.1x Artificial Intelligence and CHEM181x Food for Thought courses.

two different errors for the same course

Hi! :)

I saw your work on GitHub and I'm amazed by your programming
skills, so I'd like to ask for your help.

When I use edx-downloader from Git Bash, it gives me errors that I
couldn't solve:
http://i.imgur.com/fjUwvnI.jpg

When I used the normal Windows cmd, I got this:
http://i.imgur.com/YjJP4Cz.jpg

But I tried to download other courses from edX and it worked (to be
more specific, it worked for "2.03x Dynamics" and "ANTH_207x Intro to
Human Evolution").

So I think there's something wrong with 16.101x Intro to Aerodynamics,
something I don't know about. As far as I can see, cmd and Git Bash
give two different errors for the same course.

I hope you can help me :)

By the way, I don't have much programming experience :v

Encoding error

I was getting the following error when downloading the subtitles for some edx videos.

Traceback (most recent call last):
File "edx-dl.py", line 380, in <module>
main()
File "edx-dl.py", line 375, in main
open(os.path.join(os.getcwd(), subs_filename)+'.srt', 'w+').write(subs_string)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 868: ordinal not in range(128)

This error is caused by not encoding the downloaded subtitles before writing them to the file. I solved the issue by calling encode, so line 375 now looks like this:

open(os.path.join(os.getcwd(), subs_filename)+'.srt', 'w+').write(subs_string.encode('utf-8'))
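An alternative to encoding by hand, sketched under the assumption that subs_string is a text (unicode) string: open the file with an explicit encoding so the write does not depend on the platform default. The function name write_subtitles is hypothetical and not part of edx-dl.

```python
import io
import os


def write_subtitles(directory, subs_filename, subs_string):
    """Write subtitle text as UTF-8 regardless of the platform default encoding."""
    path = os.path.join(directory, subs_filename + '.srt')
    # io.open accepts an explicit encoding on both Python 2 and Python 3,
    # and the with-block closes the file even if the write fails.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(subs_string)
    return path
```

Called as write_subtitles(os.getcwd(), subs_filename, subs_string), this writes to the same location as the line above.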

Can't download subtitles

The failure happens after downloading the first video. Some commits ago I was able to download all subtitles in the course but one; now I can't download any of them.

Tested with 4d53e25

[youtube] Setting language
[youtube] YTGNNCmWqh0: Downloading webpage
[youtube] YTGNNCmWqh0: Downloading video info webpage
[youtube] YTGNNCmWqh0: Extracting video information
[download] Destination: Downloaded/8.01x Classical Mechanics/01-Walter Lewin 8.01x Intro Video.mp4
[download] 100% of 35.29MiB in 02:46.90KiB/s ETA 00:00
[info] Writing edX subtitles: Downloaded/8.01x Classical Mechanics/01-Walter Lewin 8.01x Intro Video.srt
Warning: edX subtitles (error:Not Found)
Traceback (most recent call last):
File "edx-dl.py", line 371, in <module>
main()
File "edx-dl.py", line 353, in main
'wb+').write(subs_string.encode('utf-8'))
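The log above shows a "Not Found" warning for the subtitles followed by a crash when they are written anyway. A minimal sketch of treating a missing subtitle file as "nothing to write" follows. It assumes Python 3's urllib and a hypothetical fetch_subtitles helper; it is not the code edx-dl uses.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_subtitles(url, opener=urlopen):
    """Return the subtitle payload as text, or None if the server has none."""
    try:
        return opener(url).read().decode('utf-8')
    except HTTPError as error:
        # A 404 here just means this video has no edX subtitles.
        print('Warning: edX subtitles (error: %s)' % error.code)
        return None
```

The caller would then write the .srt file only when the returned value is not None.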

404 error on launch

Just downloaded all the dependencies, changed into the directory, and ran the file:

richard$ python ./edx-dl.py
Traceback (most recent call last):
File "./edx-dl.py", line 38, in <module>
response = urllib2.urlopen(request)
File "/Applications/Canopy.app/appdata/canopy-1.0.0.1160.macosx-x86/Canopy.app/Contents/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/Applications/Canopy.app/appdata/canopy-1.0.0.1160.macosx-x86/Canopy.app/Contents/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/Applications/Canopy.app/appdata/canopy-1.0.0.1160.macosx-x86/Canopy.app/Contents/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/Applications/Canopy.app/appdata/canopy-1.0.0.1160.macosx-x86/Canopy.app/Contents/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/Applications/Canopy.app/appdata/canopy-1.0.0.1160.macosx-x86/Canopy.app/Contents/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/Applications/Canopy.app/appdata/canopy-1.0.0.1160.macosx-x86/Canopy.app/Contents/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

unicode characters in course names

Using Python 2.7, it fails when printing the list of available courses. A workaround is to add

sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

just before that loop.

Merge edx-downloader with coursera-dl or create mooc-dl

Hello. I'm probably not the first one to come up with this idea, but to me it seems to be a very logical next step. There are two distinct projects that do basically the same thing but on different MOOC platforms. Supporting both platforms sounds like a good code modularization challenge, and it would also be good programming style. coursera-dl looks more advanced in terms of different kinds of optimizations such as cookie management, caching and so on, so edx-downloader should merge into coursera-dl, not vice versa.

I would like to know what you think. Is this reasonable? If not, why? If so, what steps can be taken to start moving in this direction?

IndexError: list index out of range

$ python edx-dl.py
Username:
Password:

Traceback (most recent call last):
File "edx-dl.py", line 390, in <module>
main()
File "edx-dl.py", line 222, in main
data = soup.find_all('ul')[1]
IndexError: list index out of range

Failing when accessing archived course: BE101x Behavioural Economics in Action

This course is archived.

I'm receiving the following error when I select the course from the course list in edx-dl.py:

Traceback (most recent call last):
File "edx-dl.py", line 377, in <module>
main()
File "edx-dl.py", line 257, in main
w.ul.find_all('a')]) for w in WEEKS]
AttributeError: 'NoneType' object has no attribute 'string'

Invalid path for subtitles

Subtitles are being downloaded to the current user folder and not to the configured destination as videos are.

Implement storing user/passwords via netrc

Like I already do with coursera-dl, it would be super handy, as we are trying to support many sites, to store the user credentials in only one standard place (e.g., in ~/.netrc), as I really have a hard time remembering my own passwords.
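A minimal sketch of how the lookup could work with Python's standard netrc module. The helper name credentials_from_netrc and the host name used in the usage note are assumptions for illustration; passing netrc_path=None makes the module read ~/.netrc.

```python
import netrc


def credentials_from_netrc(host, netrc_path=None):
    """Return (username, password) for host, or None if there is no usable entry."""
    try:
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (IOError, netrc.NetrcParseError):
        # No netrc file, or a file that cannot be parsed.
        return None
    if entry is None:
        return None
    login, _account, password = entry
    return login, password
```

With a line such as "machine courses.edx.org login me@example.com password secret" in ~/.netrc, credentials_from_netrc('courses.edx.org') would return the login and password for that entry.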

Directory names

Hi,

On Windows, I cannot download the course
Stat2.1x Introduction to Statistics: Descriptive Statistics
because edx-downloader tries to create a directory name that is invalid (because it includes ':').

I submitted this issue some time ago, but it doesn't seem to have been addressed.
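One possible approach, sketched here as an assumption rather than the project's actual fix, is to replace the characters Windows forbids in file names before creating the directory. The helper name clean_dirname is hypothetical.

```python
import re

# Characters that Windows does not allow in file or directory names.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def clean_dirname(name, replacement='-'):
    """Return a directory name that is valid on Windows as well as on Unix."""
    cleaned = _INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end with a dot or a space.
    return cleaned.rstrip('. ')


print(clean_dirname('Stat2.1x Introduction to Statistics: Descriptive Statistics'))
```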

IndexError: list index out of range

Used Python 2.7.3, and the latest commit 2755642

Enter Course Number: 9
CS50x Introduction to Computer Science I has 12 weeks so far
1 - Download Week 0 videos
2 - Download Week 1 videos
3 - Download Week 2 videos
4 - Download Week 3 videos
5 - Download Week 4 videos
6 - Download Week 5 videos
7 - Download Week 6 videos
8 - Download Week 7 videos
9 - Download Week 8 videos
10 - Download Week 9 videos
11 - Download Week 10 videos
12 - Download Week 11 videos
13 - Download them all
Enter Your Choice: 13
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_0/week0w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_0/week0f/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_0/pset0/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_0/shorts0/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_1/week1m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_1/week1w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_1/pset1/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_1/section1/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_1/shorts1/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_2/week2m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_2/week2w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_2/pset2/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_2/section2/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_2/shorts2/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_3/week3m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_3/week3w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_3/pset3/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_3/section3/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_3/shorts3/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_4/week4m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_4/week4w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_4/section4/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_4/shorts4/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_5/week5f/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_5/pset4/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_5/section5/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_5/shorts5/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_6/week6m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_6/week6w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_6/pset5/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_6/section6/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_6/shorts6/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_7/week7m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_7/week7w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_7/pset6/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_7/section7/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_7/shorts7/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_8/week8w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_8/week8f/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_8/pset7/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_8/section8/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_8/shorts8/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_9/week9m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_9/week9w/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_9/section9/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_9/shorts9/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_10/week10m/'...
Processing 'https://courses.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_11/week11m/'...
Traceback (most recent call last):
File "edx-dl.py", line 191, in <module>
os.system('youtube-dl -F %s' % video_link[-1])
IndexError: list index out of range
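The traceback above indicates that video_link was an empty list when it was indexed with [-1]. A minimal sketch of guarding against that case follows; it assumes the list is empty because none of the processed pages yielded a video link, which the report does not confirm. The helper name last_video_link is hypothetical.

```python
def last_video_link(video_links):
    """Return the last collected link, or None when no link was found."""
    if not video_links:
        return None
    return video_links[-1]


link = last_video_link([])  # e.g. no video links were collected at all
if link is None:
    print('No video links found; nothing to list formats for.')
else:
    print('Would run: youtube-dl -F %s' % link)
```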

Access to 0 courses on edX

I have been downloading videos using edx-downloader without any problems. It used to show that I had access to 10 courses.

But, suddenly I started getting this message: "You can access 0 courses on edX"

Therefore, I can't download any videos now.

Crash after downloading some videos.

While trying to download "SPU27x Science Cooking From Haute Cuisine to Soft Matter Science" in 720x1280, the program crashed with the following traceback:
[download] 100% of 59.66MiB in 01:35.89KiB/s ETA 00:00
[download] ed-x subtitles: /home/xl0/work/mooc/edx/edx-downloader/../courses/SPU27x Science Cooking From Haute Cuisine to Soft Matter Science/113-HARSPU27T313-G010900_100.srt
[youtube] Setting language
[youtube] rwECGaU-VUg: Downloading video webpage
[youtube] rwECGaU-VUg: Downloading video info webpage
[youtube] rwECGaU-VUg: Extracting video information
Traceback (most recent call last):
File "./edx-dl.py", line 382, in <module>
main()
File "./edx-dl.py", line 373, in main
subs_filename = (match.group(1) or match.group(2)).decode('utf-8')[:-4]
AttributeError: 'NoneType' object has no attribute 'group'

The first time it downloaded 99 videos, so I thought it might be related to some field width, but on a second attempt it went up to 114, so that is probably not the case.

'charmap' codec can't encode character u'\u2013' in position 28: character maps to <undefined>

You can access 19 courses
1 - CS-184.1x Foundations of Computer Graphics -> Started
2 - CS.169.2x Software as a Service, Part 2 (rev Fall 2013) -> Started
3 - CS188.1x Artificial Intelligence -> Started
4 - CS169.1x Engineering Software as a Service -> Not yet
5 - CS169.2x Engineering Software as a Service, Part 2 -> Not yet
6 - AE1110x Introduction to Aeronautical Engineering -> Not yet
7 - BIO465x Neuronal Dynamics -> Started
8 - AMRx Autonomous Mobile Robots -> Not yet
9 - CS50x Introduction to Computer Science -> Started
10 - 16.101x Introduction to Aerodynamics -> Started
11 - 16.110x Flight Vehicle Aerodynamics -> Not yet
12 - 2.03x Dynamics -> Started
13 - 6.00.1x Introduction to Computer Science and Programming -> Started
14 - 6.002x Circuits and Electronics -> Started
15 - ELEC301x Discrete Time Signals and Systems -> Not yet
16 - 20220332X Principles of Electric Circuits: Part 1 -> Started
17 - 20220332_2x Principles of Electric Circuits: Part 2 -> Not yet
Traceback (most recent call last):
File "C:\Python27\edx-downloader\edx-dl.py", line 403, in <module>
main()
File "C:\Python27\edx-downloader\edx-dl.py", line 271, in main
print('%d - %s -> %s' % (c, course[0], course[2]))
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 28: character maps to <undefined>
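A minimal sketch of one way to avoid this crash, assuming it is acceptable to print a '?' in place of characters the console cannot show. The helper name console_safe is hypothetical; cp850 is the console code page named in the traceback above.

```python
def console_safe(text, encoding):
    """Replace characters the console encoding cannot represent with '?'."""
    return text.encode(encoding, 'replace').decode(encoding)


# cp850 cannot encode the en dash (U+2013), which is what the traceback reports.
print(console_safe(u'Part 1 \u2013 Basics', 'cp850'))
```

In the script, the course name would be passed through such a helper before it reaches print(), with the encoding taken from the console (for example sys.stdout.encoding when it is set).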

Check for errors when downloading videos

Sometimes I get errors when edx-dl downloads a video.

I think we should check for return code from youtube-dl and maybe restart the download, instead of just forgetting about that and going on...
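The idea above could be sketched as follows, assuming youtube-dl signals a failed download with a non-zero exit status. The helper name run_with_retries and the default of three attempts are illustrative choices, not existing edx-dl behavior.

```python
import subprocess


def run_with_retries(cmd, attempts=3, runner=subprocess.call):
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        returncode = runner(cmd)
        if returncode == 0:
            return True
        print('Attempt %d of %d failed with exit code %d'
              % (attempt, attempts, returncode))
    return False
```

A call such as run_with_retries(['youtube-dl', video_url]) would then retry a failed download instead of silently moving on; video_url stands for whatever URL the script has collected. The runner parameter exists only so the logic can be exercised without starting a real process.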

failure in uptodate Debian with python 2.7

Strangely, it stopped working for me a while ago with this error.

[download] Saving videos into: ./Downloaded/
Traceback (most recent call last):
File "edx-dl.py", line 387, in <module>
main()
File "edx-dl.py", line 348, in main
popen_youtube = Popen(cmd, stdout=PIPE, stderr=PIPE)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1259, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory

Error downloading videos that are not in the chosen format

Many courses (e.g. 8.01x Classical Mechanics, SPU27x Science & Cooking, BE101x Behavioural Economics) have videos that are in different formats. When youtube-dl finds those videos it breaks, and those videos aren't downloaded.

More Open edX Course Websites

More websites use the platform of edX now, such as XuetangX, a Chinese website.
Should we also support them? How can we integrate them into one downloader?
Any idea?

Not downloading videos and subtitles hosted on S3 Amazon (AWS)

Works fine with YouTube videos.
But it neither found nor downloaded videos and subtitles hosted on Amazon.

For now, I can right-click on the videos and download them manually.
But I'm more interested in the subtitles.

Is there a default URL for the subs that I can build from the video URL at hand?

TIA

edit: fixed some typos in my message.

error when attempting to download

I am having difficulty downloading course videos, which seems to have something to do with youtube-dl. After choosing the videos I want to download, the script spends several minutes processing the URLs to download, but then I get the following error message:

'youtube-dl' is not recognized as an internal or external command, operable program or batch file.

I'm not sure what exactly is the problem because I have correctly installed youtube-dl and have successfully downloaded a video from YouTube before I installed edx-dl.py . What am I doing wrong?
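The message quoted above is what the Windows shell prints when a program is not on the PATH. A minimal sketch of checking for that up front, assuming Python 3.3 or later (where shutil.which is available); the helper name is_on_path is hypothetical.

```python
import shutil


def is_on_path(program):
    """Return True if an executable called program can be found on PATH."""
    return shutil.which(program) is not None


if not is_on_path('youtube-dl'):
    print('youtube-dl was not found on PATH; install it or add its '
          'directory to PATH before running edx-dl.')
```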

URLOPEN failed.

edx-dl is not working for me. It's giving the following log. Thanks

c:\Python33\youtube-dl-master>python edx-dl.py ve****@gmail.com ****
Traceback (most recent call last):
File "edx-dl.py", line 99, in <module>
response = urlopen(request)
File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found


Video Quality Option

Recommend adding an option to download the best available video quality which is now apparently the default behavior for youtube-dl. I modified my local copy of the code to simply bypass the quality option by commenting out the "-f ..." option on the youtube-dl command line.

This should reduce the number of download errors reported.

Invalid os.path Results for AbsPath in cygwin

When using edx-dl under cygwin on windows, downloading subtitles while using an absolute destination path breaks, with a screwball path at what is currently line 385:

open(os.path.join(os.getcwd(), subs_filename),
'wb+').write(subs_string.encode('utf-8'))

The path it's trying to write to is a concatenation of the cwd and the absolute path target_dir/filename.srt - so something like this:

/c/Users/mitch/home/src/edx-downloader/q:/edu/class/videoname.srt

Basically, abspath is failing to recognize "q:/edu" as the beginning of an absolute path, and is returning the mess above.

If I had more time to debug, I would. There may be some interaction between cygwin/python/windows as well... My memory is that passing a cygwin unix-style path (e.g. /q/edu/class) as the target_dir broke some other part of the script, so I had to use the drive letter version.

Admittedly, this is an upstream bug, and also a bit of a corner case. But for anyone using windows & edx-dl, it's real. I just hacked edx-dl to work for my case, by deleting the call to os.path.join. A real fix would probably include testing subs_filename for leading drive letters.
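The drive-letter test suggested above could be sketched like this, assuming it is acceptable to treat a path as absolute when either the POSIX or the Windows rules say so. The helper name subtitle_path is hypothetical; ntpath and posixpath are standard-library modules that can be imported on any platform.

```python
import ntpath
import posixpath


def subtitle_path(cwd, subs_filename):
    """Join cwd and subs_filename unless subs_filename is already absolute."""
    # Under cygwin, os.path follows POSIX rules and does not see "q:/edu"
    # as absolute, so the Windows rules are checked explicitly as well.
    if posixpath.isabs(subs_filename) or ntpath.isabs(subs_filename):
        return subs_filename
    return posixpath.join(cwd, subs_filename)


print(subtitle_path('/c/Users/mitch/home/src/edx-downloader',
                    'q:/edu/class/videoname.srt'))
```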

IndexError: list index out of range

Hi guys,
First I would like to thank you for creating such a wonderful script;
it really saved me precious time.

I tried to download the "CS50x Introduction to Computer Science I" course content
and ran into an issue. Here is the stack trace:

You can access 13 courses on edX
1 - CS169.1x Software as a Service -> Started
2 - CS169.1x Software as a Service -> Started
3 - CS169.2x Software as a Service -> Started
4 - CS184.1x Foundations of Computer Graphics -> Started
5 - CS188.1x Artificial Intelligence -> Started
6 - CS188.1x Artificial Intelligence -> Started
7 - CS191x Quantum Mechanics and Quantum Computation -> Started
8 - CS50x Introduction to Computer Science I -> Started
9 - 6.00x Introduction to Computer Science and Programming -> Started
10 - 6.00x Introduction to Computer Science and Programming -> Started
11 - 8.02x Electricity and Magnetism -> Started
12 - UT.2.01x Ideas of the 20th Century -> Not yet
13 - UT.3.01x Age of Globalization -> Not yet
Enter Course Number: 8
CS50x Introduction to Computer Science I has 12 weeks so far
1 - Download Week 0 videos
2 - Download Week 1 videos
3 - Download Week 2 videos
4 - Download Week 3 videos
5 - Download Week 4 videos
6 - Download Week 5 videos
7 - Download Week 6 videos
8 - Download Week 7 videos
9 - Download Week 8 videos
10 - Download Week 9 videos
11 - Download Week 10 videos
12 - Download Week 11 videos
13 - Download them all
Enter Your Choice: 13
Processing 'https://www.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_0/week0w/'...
Processing 'https://www.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_0/week0f/'...
.....
.....
.....
.....
.....
Processing 'https://www.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_10/week10m/'...
Processing 'https://www.edx.org/courses/HarvardX/CS50x/2012/courseware/Week_11/week11m/'...
Traceback (most recent call last):
File "edx-dl.py", line 185, in <module>
os.system('youtube-dl -F %s' % video_link[-1])
IndexError: list index out of range

Thanks

cannot run python edx-dl.py

First, thanks for developing this project. It will be very useful to me. Unfortunately I am unable to run python edx-dl.py . I have correctly installed youtube-dl and BeautifulSoup4, but I'm not sure what to do with edx-dl.py . I get a syntax error message when I try to run it, and it seems to refer to something in line 4 of the file. I have virtually no prior experience with python, so I really have no idea what the trouble might be.

Syntax error

I've just redownloaded the current version to test a fixed issue, but this error appeared.

Download subtitles (y/n)? y
[download] Saving videos into: Downloaded
[youtube] Setting language
[youtube] YTGNNCmWqh0: Downloading webpage
[youtube] YTGNNCmWqh0: Downloading video info webpage
[youtube] YTGNNCmWqh0: Extracting video information
[download] Destination: Downloaded/8.01x Classical Mechanics/01-Walter Lewin 8.01x Intro Video.mp4
[download] 100% of 35.29MiB in 00:59.54KiB/s ETA 00:00
Warning: edX subtitles (error:Not Found)
[youtube] Setting language
[youtube] hHLFFaZiCbk: Downloading webpage
[youtube] hHLFFaZiCbk: Downloading video info webpage
[youtube] hHLFFaZiCbk: Extracting video information
[download] Destination: Downloaded/8.01x Classical Mechanics/02-MITx - Classical Mechanics (Physics 1) - 8.01x About Video.mp4
[download] 100% of 13.21MiB in 00:14.31KiB/s ETA 00:00
Traceback (most recent call last):
File "edx-dl.py", line 373, in <module>
main()
File "edx-dl.py", line 364, in main
subs_filename.append('.srt')
AttributeError: 'unicode' object has no attribute 'append'

I'm a C programmer and I don't know Python, but is there something like a Python compiler or Python syntax checker that can catch these kinds of errors? I suppose this is just a syntax error.
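For context, the error above is a runtime AttributeError rather than a syntax error: Python strings have no append() method, so the usual way to add a suffix is concatenation. A minimal sketch follows, with a hypothetical helper name; it is not the project's actual fix. A static checker such as pylint can sometimes flag this kind of mistake when it is able to infer the type.

```python
def add_srt_extension(subs_filename):
    """Return the file name with '.srt' added; strings have no append()."""
    return subs_filename + '.srt'


# The faulty line is syntactically valid, so compiling it succeeds;
# the AttributeError only appears when the line is actually executed.
compile("subs_filename.append('.srt')", '<example>', 'exec')

print(add_srt_extension(u'01-Walter Lewin 8.01x Intro Video'))
```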

video download problem with ER22x

I get the following error when trying to download ER22x materials:

Traceback (most recent call last):
File "edx-dl.py", line 185, in <module>
os.system('youtube-dl -F %s' % video_link[-1])
IndexError: list index out of range

System has python2.7.3

Broken

Maybe a page change?

Traceback (most recent call last):
File "edx-dl.py", line 147, in <module>
data = soup.section.section.div.div.nav
AttributeError: 'NoneType' object has no attribute 'div'

Certificate verify failed due to youtube-dl

When I run the script, it throws an error:

Tested with 07ddc8f

[youtube] Setting language
WARNING: unable to set language: urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:550)

[youtube] LvaTokhYnDw: Downloading webpage
ERROR: Unable to download webpage: urlopen error [SSL:CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:550)


edX added a new "News" HTML "article" element

edX added a new "News" HTML "article" element to the dashboard and now the program fails to see which are the courses.

In the file "edx-dl.py", that can be solved by changing line 54 from

for COURSE in COURSES:

to

for COURSE in COURSES[1:]:

And by the way, many thanks for the program!

HTTP Error 500: INTERNAL SERVER ERROR

Hi,

first and foremost, thanks for this very useful code!

I'm not sure this is a bug in the downloader; it looks to me more like a server bug/problem.
This morning (just after successfully downloading many videos from one of my courses) I tried to download more and got the 500 error. Complete output:

Traceback (most recent call last):
File "/home/davide/bin/edx-downloader/edx-dl.py", line 285, in <module>
main()
File "/home/davide/bin/edx-downloader/edx-dl.py", line 178, in main
response = urlopen(request)
File "/usr/lib64/python3.3/urllib/request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python3.3/urllib/request.py", line 479, in open
response = meth(req, response)
File "/usr/lib64/python3.3/urllib/request.py", line 591, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib64/python3.3/urllib/request.py", line 517, in error
return self._call_chain(*args)
File "/usr/lib64/python3.3/urllib/request.py", line 451, in _call_chain
result = func(*args)
File "/usr/lib64/python3.3/urllib/request.py", line 599, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

I also noticed that, from any browser, the login page freezes after I enter my email and password, but if I navigate to edX in another tab after a while, I appear to actually BE logged in.

Do you think it would be possible to exploit this to make the downloader work as well? Or do you have a way to notify edX of the problem?

Thank you again

Davide
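
If the 500 turns out to be transient, one option is to retry the request a few times before giving up. The following is only a minimal sketch under that assumption, written for Python 3 as in the traceback above; `open_with_retry` and its `opener`, `attempts` and `delay` parameters are hypothetical names, not part of edx-dl, and retrying will not help if the server keeps failing.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def open_with_retry(request, opener=urlopen, attempts=3, delay=2.0):
    """Call opener(request), retrying when the server answers with a 5xx code.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return opener(request)
        except HTTPError as error:
            # 4xx errors will not go away on retry; re-raise them at once.
            # Also give up after the last allowed attempt.
            if error.code < 500 or attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
```

In the script, the call `response = urlopen(request)` from the traceback would become `response = open_with_retry(request)`.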

error when launching edx-dl.py

Hello,
When I start edx-dl.py, I get the following error:
File "edx-dl.py", line 4

^
SyntaxError: invalid syntax
Do you know why? What can I do to solve this?
Thanks for your feedback

Error with encoding?

I got an encoding error before downloading starts.
The course link is https://www.edx.org/courses/MITx/6.00x/2013_Spring/ and the error message is as follows:

You can access 1 courses on edX
1 - 6.00x Introduction to Computer Science and Programming -> Started
Enter Course Number: 1
Traceback (most recent call last):
  File "edx-dl.py", line 146, in <module>
    soup = BeautifulSoup(courseware)
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/__init__.py", line 172, in __init__
    self._feed()
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/builder/_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:88786)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:98085)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97909)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:9071)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src/lxml/lxml.etree.c:94081)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 2: invalid continuation byte
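
The byte 0xe8 is not valid at that position in UTF-8 (it is "è" in Latin-1), which suggests the page is not pure UTF-8. One possible workaround is to decode the downloaded bytes before handing them to BeautifulSoup. This is a minimal sketch, assuming `courseware` holds raw bytes and that falling back to Latin-1 is acceptable; `decode_page` is a hypothetical helper, not part of edx-dl.

```python
def decode_page(raw, encodings=("utf-8", "latin-1")):
    """Return `raw` decoded with the first encoding that succeeds.

    Hypothetical helper; the list of encodings is an assumption.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if a custom `encodings` list fails entirely
    # (latin-1 itself accepts every byte).
    return raw.decode("utf-8", errors="replace")
```

With this, the failing line would read `soup = BeautifulSoup(decode_page(courseware))`.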

Stop committing things and decide the vision of the project

In much the same way that Guido van Rossum once asked of the Python developers, I would like to suggest that we place a moratorium on new commits and, first, decide and agree on the role of the project.

In particular, I would like to see many things addressed:

  • The current documentation sucks. Badly. And I'm not only referring to the English parts of it. I am not a native speaker, even though some people have sporadically told me that my written English is good enough.
  • The current code smells bad. In many ways:
    • It, perhaps, tries too hard to handhold the users, and doing good user interfaces is hard. Exceptionally hard. Especially without a graphical toolkit that would have already solved most of the problems. And while it does not look good, Python already has a dependency on Tk, which is satisfied by all Python installations, barring users that explicitly avoid tkinter. The non-interactive branch that I created was supposed to cut the silly, unsafe text UI which comes with the master branch, besides having more modularity (more about this in the next points).
    • On a more subjective note, the code clearly looks like the people writing it are amateur programmers. Well, I should not have said this, because I am also an amateur programmer, but each programming language has its own set of idioms and the current code follows none, differently from what I tried to accomplish on my non-interactive branch.
    • The current master branch, at least the last time I checked, didn't support sites other than edx.org. My branch supports edX-based sites in general and I have, personally, used it with Stanford's site (just for tests), 10gen/Mongodb.com (for real courses, where I completed 4 courses with certificates) and edx.org (just for tests, as I am mostly completing some coursera courses).
    • The current master branch has a lot of technical debt, which is something that I plainly acknowledge in my branch, with clear FIXMEs or XXXs. This makes it easy for other people to jump in and see clearly where the code needs improvement. Unfortunately, such visibility is hindered by the fact that the code is in a non-default branch and GitHub gives almost no visibility to it. I can't stress enough that technical debt is something that I try to avoid, even if I am swamped with it.
    • The current master branch has features that I don't have, but that's mostly because I thought that my branch would have received them after @iemejia joined the project. My original intention with the code was to use the time-tested practice of making a development branch, stopping development on the "stable" branch and, eventually, making the development branch the default branch. I guess that this was not communicated effectively by me. The rationale for this development model is made explicit in this post: http://nvie.com/posts/a-successful-git-branching-model/
    • The code currently lacks real testing. This impacts us. Badly. Especially when the sites that we are scraping change in some unpredictable ways. We need to add hooks to travis-ci and to coveralls, just like I do with coursera-dl.
    • I am not really sure that having an interactive way of doing things is all that appealing. Just to put things in perspective, in coursera-dl, where I have tried hard to make the community inclusive, we have 1333 stars and 425 forks, which is, in some way, a measure of the success of the project---when I joined, the project had way fewer followers. With youtube-dl, there are 3296 stars and 703 forks. This edx-downloader project has 58 stars and 63 forks. Neither coursera-dl nor youtube-dl has an interactive mode, yet they are successful projects.
    • Let me rephrase the point above, to avoid misinterpretations: I am not saying that having a text UI is detrimental to the project. On the contrary. But being functional and flexible is worth far more than a toy that doesn't fulfill the necessities of the users (say, supporting more sites, being reliable, tested) or doesn't make it easy for developers to add/fix features. And, yes, this last point includes adding a proper interactive mode. Again, if we are serious about usability and having an easy-to-use program, we should give serious thought to using either curses or a graphical interface with tkinter. Otherwise, what we have is a joke. And the user interface may be a very good learning exercise for those that have not yet programmed such things.
    • Coupled with the point above, I think that we should try hard to make the program work like a library/python module. This makes testing easier, coverage analysis easier, static analysis easier, integration with other tools easier and, in fact, many other things easier.
    • After working in a project where there is more than one person involved, I have reached the conclusion that it is very important that every committer knows about every change that other people make to the code. In a regular git setting, this could be accomplished via hooks that e-mail people the diffs being made, so that everybody can be up-to-date with the project. Apparently, with GitHub, the way to let other people know of the changes that someone is working on is to send pull requests. I would propose, therefore, that we don't commit directly to the project, but only through pull requests. Otherwise, we may get conflicts and people not knowing where the code stands.
  • This is a subjective point, but some programs, when invoked with no parameters, start with an interactive mode. This is, perhaps, appealing to people used to Windows. Other programs, when invoked with no parameters, just print information on how they should be used. This is, perhaps, the Unix mindset manifesting itself. The first approach doesn't seem to allow the specification of standard parameters (unless one adopts configuration files or batch files/scripts). This is annoying to some.

Well, I guess that I have more to say, but it is 4am here and I should really go to bed.

/cc: @rbrito

urllib2.HTTPError: HTTP Error 404: Not Found

When I try to log in, I get the following error:

Modern Computer@ModernComputer ~/edx-downloader
$ python edx-dl.py [email protected] password
Traceback (most recent call last):
  File "edx-dl.py", line 38, in <module>
    response = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

double check for video links

The script does not work with the renewed edX course "CS-169.1x Software as a Service" (which just started again).
But if I change this line (182):

splitter = re.compile(b'data-streams=(?:&#34;|").*1.0[0]*:')

to this:

splitter = re.compile(b'data-youtube-id-1-0=(?:&#34;|")')

it works perfectly!

I wish the script would try the first regexp and, if nothing is found, try the second one.
Thanks.
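
The fallback requested above could look roughly like this. It is a minimal sketch: the two patterns are the ones quoted in this report, but `SPLITTERS`, `split_on_first_match`, and the way the split result is used are assumptions, since the surrounding code of edx-dl.py is not shown here.

```python
import re

# Patterns quoted from the report above; trying the older one first is an assumption.
SPLITTERS = [
    re.compile(b'data-streams=(?:&#34;|").*1.0[0]*:'),
    re.compile(b'data-youtube-id-1-0=(?:&#34;|")'),
]


def split_on_first_match(page, splitters=SPLITTERS):
    """Split `page` with the first pattern that matches it.

    Hypothetical helper. Returns the pieces that follow each match
    (each piece then starts with a video id), or [] if no pattern matches.
    """
    for splitter in splitters:
        if splitter.search(page):
            # Drop the text that precedes the first match.
            return splitter.split(page)[1:]
    return []
```

For example, `split_on_first_match(b'<div data-youtube-id-1-0="abc123">')` falls through to the second pattern because the first one finds nothing.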

Just letting you know about this feature in edX

I wrote to edX asking for a button to download subtitles. Maybe they added it, or maybe it was made by the course maintainers. The course is "RiceX: PHYS102x Electricity & Magnetism".

