chillee / coursera-dl-all
License: MIT License
Traceback (most recent call last):
File "c:\0\Google\projects\coursera-dl-all\dl_all.py", line 299, in <module>
download_all_quizzes(session, quiz_info, i[1])
File "c:\0\Google\projects\coursera-dl-all\dl_all.py", line 189, in download_all_quizzes
download_quiz(session, quiz_obj, clean_filename(category_name))
File "c:\0\Google\projects\coursera-dl-all\dl_all.py", line 179, in download_quiz
session.find_elements_by_css_selector('#spark > form > p > input')[0].click()
IndexError: list index out of range
On Mac OS X, I tried two ways and got different errors:
python dl_all.py -u myemail -q -a -p mypassword --headless
resulted in
('https://class.coursera.org/molevol-003/', 'molevol-003')
Logging In....
Traceback (most recent call last):
File "dl_all.py", line 280, in <module>
error = login(session, class_url, args.u, args.p )
File "dl_all.py", line 52, in login
x.send_keys(email)
.....
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data
python dl_all.py -u myemail -q -a -p mypassword
resulted in:
('https://class.coursera.org/molevol-003/', 'molevol-003')
Traceback (most recent call last):
File "dl_all.py", line 278, in <module>
session = webdriver.Firefox()
....
ValueError: insecure string pickle
Any thoughts?
Using Safari browser with the Coursera-dl package installed on a Mac.
Login is possible and coursera-dl is able to make an HTTPS connection. Unsure why cookies are not found - they have not been blocked for Coursera.org.
When I run the code, it always gets stuck at the login step. The credentials are correct, yet it still hangs. I just want to download the quizzes for the neural net course.
Using -v -a -q, the only things I see in the output are folders with videos, slides, and subtitles (Ubuntu 14.04, Python 2.7).
On which operating system was the library tested?
Running headless gets stuck on login screen. PhantomJS installed.
The lecture videos can be found & are properly scraped, but the courses themselves resolve to a 404 error page. In other words, the deprecation process has already begun for some of these courses even though the videos remain intact.
Might need to add in specific checks for certain reported courses to only scrape videos if the course directory pages are down so the entire program doesn't crash. However, since the deprecation process is ongoing, might as well just handle the selenium web driver exceptions more elegantly so the entire program doesn't crash just due to one course (or even one page of a course). Maybe output all exceptions to a log file, and alert the user through the terminal before the script ends that there were errors logged in output file LOG_FILE_NAME.
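The logging idea above could be sketched roughly as follows. This is only an illustration: run_all, process_course, and the log file name are hypothetical and are not part of dl_all.py. Each course is processed inside its own try block, failures are written to the log with their traceback, and the user is told once at the end.

```python
import logging

LOG_FILE_NAME = "dl_all_errors.log"  # hypothetical file name


def run_all(courses, process_course, log_file=LOG_FILE_NAME):
    """Run process_course on every course; log failures instead of crashing.

    Returns the list of course slugs that failed.
    """
    logger = logging.getLogger("dl_all")
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    failed = []
    try:
        for slug in courses:
            try:
                process_course(slug)
            except Exception:
                # logger.exception records the full traceback in the log file.
                logger.exception("Course %s failed", slug)
                failed.append(slug)
    finally:
        logger.removeHandler(handler)
        handler.close()
    if failed:
        print("%d course(s) had errors; see %s" % (len(failed), log_file))
    return failed
```

Here process_course stands in for whatever the script does per course (login, quizzes, assignments), so one broken course page no longer stops the rest.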
Also, some course videos won't get parsed until you're fully enrolled (algs4partII-007). Since your script now supports automatic enrollment, videos should be handled after the quizzes and assignments. Running coursera-dl with the --clear-cache argument also helps when the script is re-run using different Coursera accounts.
Fully transitioning to Selenium requires custom capabilities for Firefox. Right now, the Marionette web driver isn't bundled with Firefox, and its executable must be added to PATH explicitly. Including a link to this might help new users install your script: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver
Since Firefox's capabilities need to be explicitly sent, using session.quit() is more reliable than session.close().
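One way to make sure quit() always runs, even when a course raises an exception halfway through, is a small context manager. This is a sketch with a hypothetical helper name (browser_session); make_driver is any callable that returns a WebDriver, for example webdriver.PhantomJS.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver from make_driver() and always call quit() afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the whole WebDriver session; close() only closes
        # the current window.
        driver.quit()
```

Usage would look like "with browser_session(webdriver.PhantomJS) as session:" followed by the login and download calls.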
Also, there's a race condition of some sort when quizzes are captured. Here's a stack traceback I received while parsing algs4partII-007:
Traceback (most recent call last):
File "dl_all.py", line 301, in <module>
download_all_quizzes(session, quiz_info, i[1])
File "dl_all.py", line 190, in download_all_quizzes
download_quiz(session, quiz_obj, category_name)
File "dl_all.py", line 183, in download_quiz
download_all_zips_on_page(session, path)
File "dl_all.py", line 104, in download_all_zips_on_page
url = i.get_attribute('href')
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 111, in get_attribute
resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 456, in _execute
return self._parent.execute(command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed.
Waiting on attributes is unsatisfactory, so you may need to rethink how to determine whether a link is fully loaded. Possible solutions:
https://blog.mozilla.org/webqa/2012/07/12/how-to-webdriverwait/
angular/protractor#610
http://stackoverflow.com/questions/5709204/random-element-is-no-longer-attached-to-the-dom-staleelementreferenceexception
https://media.readthedocs.org/pdf/marionette_client/latest/marionette_client.pdf (useful if Marionette-enabled Firefox is used)
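A common workaround for stale references is to look the elements up again and retry. The helper below is a generic sketch, not code from dl_all.py: retry_on and its parameters are made-up names, and the attempt count and delay are arbitrary. The important part is that the action passed in must locate the elements again on every attempt, rather than reuse old element objects.

```python
import time


def retry_on(exceptions, action, attempts=3, delay=0.5):
    """Call action() until it succeeds or the attempts are used up.

    exceptions: exception class (or tuple of classes) that triggers a retry.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Possible use with Selenium (assumes the same session object as the script):
# from selenium.common.exceptions import StaleElementReferenceException
# hrefs = retry_on(
#     StaleElementReferenceException,
#     lambda: [a.get_attribute('href')
#              for a in session.find_elements_by_css_selector('a')])
```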
Here are some of the revisions I made to handle two of the aforementioned issues:
# Include the following import
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

for i in reader:
    class_url, class_slug = get_class_url_info(i)
    print(class_url, class_slug)
    mkdir_safe(class_slug)
    os.chdir(class_slug)
    # session = dryscrape.Session()
    session=''
    if args.headless:
        session = webdriver.PhantomJS()
    else:
        firefox_capabilities = DesiredCapabilities.FIREFOX
        firefox_capabilities['marionette'] = True
        firefox_capabilities['binary'] = '/usr/bin/firefox' # binary path could be handled better to support multi-platform portability.
        session = webdriver.Firefox(capabilities=firefox_capabilities)
    print("Logging In....")
    error = login(session, class_url, args.u, args.p )
    if (error==-1):
        session.close()
        continue
    print("Logged in!")
    # if
    if not args.ns:
        download_sidebar_pages(session)
    if (args.q):
        # quiz_info = get_quiz_info(session)
        print("Downloading Quizzes....")
        quiz_links = get_quiz_types(session)
        for i in quiz_links:
            print("Downloading "+i[1])
            quiz_info = get_quiz_info(session, i[0], i[1])
            download_all_quizzes(session, quiz_info, i[1])
    # print(class_url)
    if (args.a):
        mkdir_safe("assignments")
        assign_info = get_assign_info(session)
        download_all_assignments(session, assign_info)
    session.quit()
    os.chdir('..')
    if (args.v):
        os.system('coursera-dl --clear-cache -u '+args.u+' -p '+args.p+' --path='+os.getcwd()+' '+class_slug)
Hi dude! After I downloaded the videos, the script always returns errors like this for every course I selected. (I tried different courses to check the bug; the common part is urllib.error.ContentTooShortError. I don't know what it is.)
My environment:
Logging In....
Logged in!
[('https://class.coursera.org/pgm-003/wiki/view?page=CourseSchedule', 'CourseSch
edule'), ('https://class.coursera.org/pgm-003/wiki/view?page=CourseInformation',
'CourseInformation'), ('https://class.coursera.org/pgm-003/wiki/view?page=Cours
eStaff', 'OurTeam'), ('https://class.coursera.org/pgm-003/wiki/view?page=CourseL
ogistics', 'CourseLogistics'), ('https://class.coursera.org/pgm-003/wiki/view?pa
ge=OctaveInstallation', 'OctaveInstallation'), ('https://class.coursera.org/pgm-
003/wiki/view?page=LectureSlides', 'LectureSlides'), ('https://class.coursera.or
g/pgm-003/questions', 'QuickQuestions15'), ('https://class.coursera.org/pgm-003/
class/index', 'Home'), ('https://class.coursera.org/pgm-003/assignment/index', '
ProgrammingAssignments'), ('https://class.coursera.org/pgm-003/forum/index', 'Di
scussionForums'), ('https://class.coursera.org/pgm-003/wiki/view?page=FAQList',
'FAQ')]
Traceback (most recent call last):
File "dl_all.py", line 290, in <module>
download_sidebar_pages(session)
File "dl_all.py", line 229, in download_sidebar_pages
download_all_zips_on_page(session, path)
File "dl_all.py", line 123, in download_all_zips_on_page
urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
File "D:\Anaconda3\lib\urllib\request.py", line 228, in urlretrieve
% (read, size), result)
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only
1430256 out of 1432748 bytes>
Environment:
Windows 7 64-bit
Python 3.5
Python dependencies look up to date.
python -m pip install --upgrade coursera-dl
Requirement already up-to-date: coursera-dl in c:\python35\lib\site-packages
Requirement already up-to-date: beautifulsoup4>=4.1.3 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: html5lib>=1.0b2 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: requests>=2.4.3 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: six>=1.5.0 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: urllib3>=1.10 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: pyasn1>=0.1.7 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: keyring>=4.0 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: pywin32-ctypes; sys_platform == "win32" in c:\python35\lib\site-packages (from keyring>=4.0->coursera-dl)
Authentication succeeds with some symbols in the password, such as !@#$%*
python dl_all.py -u username -p Easypassword1@#$%* -v -a -q --headless
https://class.coursera.org/androidpart1-014/ androidpart1-014
coursera_dl version 0.6.1
Downloading class: androidpart1-014
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): www.coursera.org
Logged in on coursera.org.
Found authentication cookies.
Authentication fails with ^ in the password.
python dl_all.py -u username -p Easypassword1^ -v -a -q --headless
https://class.coursera.org/androidpart1-014/ androidpart1-014
coursera_dl version 0.6.1
Downloading class: androidpart1-014
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): www.coursera.org
Could not authenticate: Cannot login on coursera.org.
Authentication with & in the password seems to ignore --headless option and tries to start Firefox which automatically updated itself to v47.
python dl_all.py -u username -p Easypassword1& -v -a -q --headless
https://class.coursera.org/androidpart1-014/ androidpart1-014
Traceback (most recent call last):
File "dl_all.py", line 317, in <module>
session = webdriver.Firefox()
File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 81, in __init__
self.binary, timeout)
File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\extension_connection.py", line 51, in __init__
self.binary.launch_browser(self.profile, timeout=timeout)
File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\firefox_binary.py", line 68, in launch_browser
self._wait_until_connectable(timeout=timeout)
File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\firefox_binary.py", line 98, in _wait_until_connectable
raise WebDriverException("The browser appears to have exited "
selenium.common.exceptions.WebDriverException: Message: The browser appears to have exited before we could connect. If you specified a log_file in the FirefoxBinary constructor, check it for details.
'-v' is not recognized as an internal or external command,
operable program or batch file.
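These symptoms are consistent with the Windows command shell interpreting the password itself: cmd.exe treats & as a command separator (hence the "'-v' is not recognized" message) and ^ as an escape character, so quoting the password on the command line is worth trying first. The same risk exists inside the script wherever the coursera-dl command is built as one string for os.system. Below is a sketch of passing the arguments as a list instead; the function names are hypothetical, and the flags are the ones the script already passes to coursera-dl.

```python
import subprocess


def coursera_dl_command(user, password, path, class_slug):
    """Build the coursera-dl call as an argument list (no shell parsing)."""
    return ["coursera-dl", "--clear-cache",
            "-u", user, "-p", password,
            "--path=" + path, class_slug]


def run_coursera_dl(user, password, path, class_slug):
    # With a list and the default shell=False, each item reaches the program
    # as one argument, so characters such as & or ^ are not interpreted.
    return subprocess.call(
        coursera_dl_command(user, password, path, class_slug))
```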
Thanks for the awesome code. Do we still have to download answer keys manually?
The homework and final exam pages use non-standardized URLs, yet require you to click on a button to continue just like they would if they used the provided front-ends. As such, the homeworks and the final exam aren't parsed.
Thank you for your code. However, I ran into a problem.
When I add the -v flag, the script runs. However, if I use only -q -a, I get this error:
('https://class.coursera.org/algs4partII-007/', 'algs4partII-007')
Traceback (most recent call last):
File "dl_all.py", line 282, in <module>
session = webdriver.Firefox()
File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 81, in __init__
self.binary, timeout)
File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/extension_connection.py", line 51, in __init__
self.binary.launch_browser(self.profile, timeout=timeout)
File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 67, in launch_browser
self._start_from_profile_path(self.profile.path)
File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 90, in _start_from_profile_path
env=self._firefox_env)
File "/Users/Doodle/anaconda/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/Users/Doodle/anaconda/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
Could you help me?
As I don't have much experience with Python, I must apologize in advance if my terminology doesn't quite match up with my implementation.
In the function download_all_zips_on_page(session, path='assignments'), the check for each entry in hw_strings searches the entire unicode string and reports success on the first match anywhere in it. This is a problem if the URL contains one of these substrings anywhere before the very end.
For instance, if the URL is "www.python.org", the check wrongly concludes that the URL is a file because it contains the substring ".py".
This might not cover all edge cases, but one way to parse the URL as intended, and avoid the resulting IOErrors, is to check only the very end of the URL:
hw_strings = [u'.zip', u'.py', u'.m', u'.pdf']
for i in links:
    url = i.get_attribute('href')
    if url is None:
        continue
    txt_file.write(url+'\n')
    # A trailing slash must be stripped before checking the extension.
    if url.endswith(u'/'):
        url = url[:-1]
    is_hw = False
    for j in hw_strings:
        # Compare only the end of the URL, not any substring inside it.
        if url.endswith(j):
            is_hw = True
            break
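A self-contained variant of the end-of-URL check, for trying the idea outside the script, might look like this. It is only a sketch: is_hw_url and HW_EXTENSIONS are hypothetical names, and looking only at the URL path (so that host names and query strings are ignored) and lowercasing it are additions that go beyond the snippet above.

```python
try:  # Python 3
    from urllib.parse import urlsplit
except ImportError:  # Python 2
    from urlparse import urlsplit

HW_EXTENSIONS = ('.zip', '.py', '.m', '.pdf')


def is_hw_url(url, extensions=HW_EXTENSIONS):
    """Return True if the URL's path ends with one of the given extensions."""
    if not url:
        return False
    # Look only at the path, so the host name and query string are ignored.
    path = urlsplit(url).path.rstrip('/')
    return path.lower().endswith(extensions)
```

With this check, "www.python.org" is rejected, while a made-up link such as "https://example.org/hw/hw1.zip?dl=1" is accepted.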
Hi, nice program!
Just one thing: I'm not able to choose which courses it downloads. Can I specify the courses I want, or is the selection random?
Thanks!!
I don't see any files downloaded for quizzes and assignments for comparch-003.
The screen output for the assignments download is below (username and password removed):
$ python coursera-dl-all/dl_all.py -u yyy -p xxx -a
https://class.coursera.org/comparch-003/ comparch-003
Logging In....
Logged in!
[]
$ ls coursera-downloads/comparch-003/assignments/
$
Env: Ubuntu, Python 2.7.10
venture-001 has sections with forward slashes in their names, such as "Grading / Notation", "Week 1 / Semaine 1", "Final Project / Project final", and so on.
The script seems to work okay for most of the sections, but fails on "Quizzes / Quizz". Here's the trace:
Downloading Quizzes...
Downloading Quizzes / Quizz
Traceback (most recent call last):
File "dl_all.py", line 335, in <module>
quiz_info = get_quiz_info(session, i[0], i[1])
File "dl_all.py", line 175, in get_quiz_info
render(session, os.getcwd()+'/'+category_name)
File "dl_all.py", line 44, in render
f = open(path+'.html', 'wb')
IOError: [Errno 2] No such file or directory: u'/home/username/Documents/coursera-downloads/venture-001/Quizzes / Quizz.html'
Should be an easy fix to replace the / with a dash or something.
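A minimal sketch of that fix follows. The name safe_name is hypothetical and this is not the script's existing clean_filename. It replaces the forward slash and, as an extra assumption, the other characters that Windows rejects in file names, which would also cover section names that contain a colon.

```python
import re

# Characters that are invalid in Windows file names; "/" is also the
# path separator on POSIX systems.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def safe_name(name, replacement='-'):
    """Replace characters that cannot appear in a file or directory name."""
    cleaned = _INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end in a space or a dot.
    return cleaned.strip().rstrip('. ')
```

For example, safe_name("Quizzes / Quizz") gives "Quizzes - Quizz", which open() can then use as a file name.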
The urllib.error module exists only in Python 3.
The Python 2.x equivalent of urllib.error.HTTPError is urllib2.HTTPError.
One solution is to check the version of Python used to run the script (sys.version) and choose the imports accordingly.
An example can be found here: http://stackoverflow.com/questions/1875259/importing-a-module-based-on-installed-python-version
You may also want to handle several exception types at once, since urllib.urlretrieve can also raise an OSError if a connection cannot be established.
For instance,
except (OSError, urllib.error.HTTPError, urllib.error.URLError) as e:
    print("Failed to download "+url)
    continue
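Instead of inspecting the version string, the imports can simply be tried in order. The sketch below shows that pattern together with the combined except clause. try_download and its retrieve parameter are hypothetical names added for illustration and testing; they are not part of dl_all.py.

```python
try:  # Python 3
    from urllib.error import HTTPError, URLError
    from urllib.request import urlretrieve
except ImportError:  # Python 2
    from urllib2 import HTTPError, URLError
    from urllib import urlretrieve


def try_download(url, dest, retrieve=urlretrieve):
    """Download url to dest; return False instead of raising on failure."""
    try:
        retrieve(url, dest)
        return True
    except (IOError, OSError, HTTPError, URLError) as e:
        print("Failed to download " + url + ": " + str(e))
        return False
```

In Python 3, ContentTooShortError is a subclass of URLError, so an incomplete retrieval is caught by the same clause.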
As I'm currently unable to test any commits you've made since last night because I'm visiting family, I was wondering if you could look into the following crash for hetero-004 with the "Home / Announcements" sidebar link and see if your new code properly handles it.
Here's the traceback from the previous night's codebase:
Traceback (most recent call last):
File "dl_all.py", line 249, in <module>
download_sidebar_pages(session)
File "dl_all.py", line 190, in download_sidebar_pages
render(session, os.getcwd()+'/'+i[1])
File "dl_all.py", line 33, in render
f = open(path+'.html', 'w')
IOError: [Errno 2] No such file or directory: u'/home/joshuawn/Desktop/coursera-dl-all/coursera-downloads/hetero-004/Home / Announcements.html'
Got an error. It looks like the program is trying to create a directory whose name contains an invalid character. Please see the output below.
('https://class.coursera.org/algs4partII-007/', 'algs4partII-007')
Logging In....
Logged in!
[(u'https://class.coursera.org/algs4partII-007/wiki/view?page=schedule', u'Schedule'), (u'https://class.coursera.org/algs4partII-007/wiki/ScheduleGoogleHangouts
', u'GoogleHangouts'), (u'https://class.coursera.org/algs4partII-007/class/index', u'Home'), (u'https://class.coursera.org/algs4partII-007/wiki/view?page=errata
', u'Errata'), (u'https://class.coursera.org/algs4partII-007/wiki/view?page=syllabus', u'Syllabus'), (u'https://class.coursera.org/algs4partII-007/assignment/in
dex', u'ProgrammingAssignments'), (u'https://class.coursera.org/algs4partII-007/forum/index', u'DiscussionForums')]
Downloading Quizzes....
Downloading Surveys
Downloading Job Interview Questions
Downloading Exercises
Programming Assignment 1: WordNet
Help Center
Programming Assignment 2: Seam Carving
Help Center
Programming Assignment 3: Baseball Elimination
Help Center
Programming Assignment 4: Boggle
Help Center
Programming Assignment 5: Burrows-Wheeler
Help Center
[u'Programming Assignment 1: WordNet', u'Programming Assignment 2: Seam Carving', u'Programming Assignment 3: Baseball Elimination', u'Programming Assignment 4:
Boggle', u'Programming Assignment 5: Burrows-Wheeler']
Traceback (most recent call last):
File "dl_all.py", line 304, in <module>
download_all_assignments(session, assign_info)
File "dl_all.py", line 213, in download_all_assignments
download_all_zips_on_page(session, 'assignments/'+i[1])
File "dl_all.py", line 99, in download_all_zips_on_page
os.makedirs(path)
File "C:\Python27\lib\os.py", line 157, in makedirs
mkdir(name, mode)
WindowsError: [Error 267] The directory name is invalid: u'assignments/Programming Assignment 1: WordNet'