
coursera-dl-all's People

Contributors

chillee


coursera-dl-all's Issues

Index out of Range error -- Platform: {Windows7, Python 2.7, PhantomJS} Course:{compilers-004}

Traceback (most recent call last):
  File "c:\0\Google\projects\coursera-dl-all\dl_all.py", line 299, in <module>
    download_all_quizzes(session, quiz_info, i[1])
  File "c:\0\Google\projects\coursera-dl-all\dl_all.py", line 189, in download_all_quizzes
    download_quiz(session, quiz_obj, clean_filename(category_name))
  File "c:\0\Google\projects\coursera-dl-all\dl_all.py", line 179, in download_quiz
    session.find_elements_by_css_selector('#spark > form > p > input')[0].click()
IndexError: list index out of range
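
The traceback suggests the selector matched no elements on that quiz page, so indexing `[0]` failed. A minimal sketch of a guard, assuming the button may legitimately be absent on some pages; `click_first` and `FakeButton` are hypothetical names for illustration and are not part of dl_all.py:

```python
def click_first(elements):
    """Click the first element if there is one; return True when a click happened."""
    if not elements:
        # Nothing matched the selector: report it instead of raising IndexError.
        return False
    elements[0].click()
    return True


class FakeButton(object):
    """Stand-in for a Selenium WebElement, used only to demonstrate the helper."""

    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True
```

In download_quiz, the call would then look something like `click_first(session.find_elements_by_css_selector('#spark > form > p > input'))`, with the caller deciding whether a missing button means "skip this quiz" or "already on the quiz page".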

different errors when attempting to run in 2 different ways

On Mac OS X, I ran the script in two ways and got a different error each time:

python dl_all.py -u myemail -q -a -p mypassword --headless

resulted in
('https://class.coursera.org/molevol-003/', 'molevol-003')
Logging In....
Traceback (most recent call last):
  File "dl_all.py", line 280, in <module>
    error = login(session, class_url, args.u, args.p )
  File "dl_all.py", line 52, in login
    x.send_keys(email)
.....
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data

python dl_all.py -u myemail -q -a -p mypassword
resulted in:
('https://class.coursera.org/molevol-003/', 'molevol-003')
Traceback (most recent call last):
  File "dl_all.py", line 278, in <module>
    session = webdriver.Firefox()
....
ValueError: insecure string pickle

any thoughts?

Could not authenticate; Did not find necessary cookies

Using Safari browser with the Coursera-dl package installed on a Mac.

Login is possible and coursera-dl is able to make an HTTPS connection. Unsure why cookies are not found - they have not been blocked for Coursera.org.

Stuck at logging in

When I run the code, it always gets stuck at the login step. The credentials are correct, yet it still hangs. I just want to download the quizzes for the neural networks course.

Are you able to download Assignments/Quizzes?

Even with -v -a -q, the only things I see in the output are folders with videos/slides/subtitles. This is on Ubuntu 14.04 with Python 2.7.

On which operating system was the library tested?

Running headless gets stuck on login screen. PhantomJS installed.

algo-004 and algo2-003 improperly resolve (and miscellaneous other issues)

The lecture videos can be found & are properly scraped, but the courses themselves resolve to a 404 error page. In other words, the deprecation process has already begun for some of these courses even though the videos remain intact.

Might need to add in specific checks for certain reported courses to only scrape videos if the course directory pages are down so the entire program doesn't crash. However, since the deprecation process is ongoing, might as well just handle the selenium web driver exceptions more elegantly so the entire program doesn't crash just due to one course (or even one page of a course). Maybe output all exceptions to a log file, and alert the user through the terminal before the script ends that there were errors logged in output file LOG_FILE_NAME.
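
The logging idea above can be sketched as follows. This is only an illustration under assumptions: `run_all`, `process_course`, and the log file name are hypothetical, and the real script would pass in whatever function handles one course.

```python
import logging

LOG_FILE_NAME = "dl_all_errors.log"  # hypothetical default name


def run_all(courses, process_course, log_file=LOG_FILE_NAME):
    """Process every course, logging failures instead of aborting the whole run."""
    logger = logging.getLogger("dl_all")
    handler = logging.FileHandler(log_file)
    logger.addHandler(handler)
    failed = []
    try:
        for slug in courses:
            try:
                process_course(slug)
            except Exception:
                # logger.exception records the message plus the full traceback.
                logger.exception("Failed to process %s", slug)
                failed.append(slug)
    finally:
        logger.removeHandler(handler)
        handler.close()
    if failed:
        # Alert the user in the terminal before the script ends.
        print("%d course(s) had errors; see %s" % (len(failed), log_file))
    return failed
```

Catching `Exception` per course is deliberately broad here, since the goal is to keep one broken course page from stopping the rest of the run; narrower Selenium exception types could be substituted.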

Also, some course videos won't get parsed until you're fully enrolled (algs4partII-007). Since your script now supports automatic enrollment, videos should be handled after the quizzes and assignments. Running coursera-dl with the --clear-cache argument also helps when the script is re-run using different Coursera accounts.

Fully transitioning to Selenium requires custom capabilities for Firefox. Right now, the Marionette web driver isn't automatically bundled with Firefox & explicit PATH permissions must be stated. Including a link to this might help new users install your script: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver

Since Firefox's capabilities need to be explicitly sent, using session.quit() is more reliable than session.close().

Also, there's a race condition of some sort when quizzes are captured. Here's a stack traceback I received while parsing algs4partII-007:

Traceback (most recent call last):
  File "dl_all.py", line 301, in <module>
    download_all_quizzes(session, quiz_info, i[1])
  File "dl_all.py", line 190, in download_all_quizzes
    download_quiz(session, quiz_obj, category_name)
  File "dl_all.py", line 183, in download_quiz
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 104, in download_all_zips_on_page
    url = i.get_attribute('href')
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 111, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 456, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed.

Waiting on attributes is unsatisfactory, so you may need to rethink how to determine whether a link is fully loaded. Possible solutions:
https://blog.mozilla.org/webqa/2012/07/12/how-to-webdriverwait/
angular/protractor#610
http://stackoverflow.com/questions/5709204/random-element-is-no-longer-attached-to-the-dom-staleelementreferenceexception
https://media.readthedocs.org/pdf/marionette_client/latest/marionette_client.pdf (useful if Marionette-enabled Firefox is used)
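
One common workaround for stale references, in the spirit of the links above, is to retry the whole "find the elements and read their attributes" step. A minimal, library-agnostic sketch; `retry_on` is a hypothetical helper, and with Selenium one would pass `selenium.common.exceptions.StaleElementReferenceException` as the exception type:

```python
import time


def retry_on(exceptions, action, attempts=3, delay=0.5):
    """Call action() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return action()
        except exceptions:
            if attempt == attempts - 1:
                raise
            # Give the page a moment to settle before looking the elements up again.
            time.sleep(delay)
```

For this to help, `action` has to re-query the page on every call (for example, find all links and collect their `href` values in one go), because a stale element object stays stale no matter how often it is retried.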

Here are some of the revisions I made to handle two of the aforementioned issues:

#include the following import
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

for i in reader:

    class_url, class_slug = get_class_url_info(i)
    print(class_url, class_slug)
    mkdir_safe(class_slug)
    os.chdir(class_slug)

    # session = dryscrape.Session()
    session=''
    if args.headless:
        session = webdriver.PhantomJS()
    else:
        firefox_capabilities = DesiredCapabilities.FIREFOX
        firefox_capabilities['marionette'] = True
        firefox_capabilities['binary'] = '/usr/bin/firefox'      # binary path could be handled better to support multi-platform portability.
        session = webdriver.Firefox(capabilities=firefox_capabilities)
    print("Logging In....")
    error = login(session, class_url, args.u, args.p )
    if (error==-1):
        session.quit()      # quit() rather than close(), for the reason given above
        continue
    print("Logged in!")
    # if

    if not args.ns:
        download_sidebar_pages(session)

    if (args.q):
        # quiz_info = get_quiz_info(session)
        print("Downloading Quizzes....")
        quiz_links = get_quiz_types(session)
        for i in quiz_links:
            print("Downloading "+i[1])
            quiz_info = get_quiz_info(session, i[0], i[1])
            download_all_quizzes(session, quiz_info, i[1])
    # print(class_url)
    if (args.a):
        mkdir_safe("assignments")
        assign_info = get_assign_info(session)
        download_all_assignments(session, assign_info)

    session.quit()

    os.chdir('..')
    if (args.v):
        os.system('coursera-dl --clear-cache -u '+args.u+' -p '+args.p+' --path='+os.getcwd()+' '+class_slug)
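
A possible refinement of the last line: building the coursera-dl command as an argument list avoids shell quoting problems when the username or password contains special characters. This is a sketch only; the helper names are hypothetical, and the flags are simply the ones used in the `os.system` call above.

```python
import os
import subprocess


def build_coursera_dl_command(username, password, class_slug, path=None):
    """Return the coursera-dl invocation as a list, one item per argument."""
    return [
        "coursera-dl", "--clear-cache",
        "-u", username,
        "-p", password,
        "--path=" + (path or os.getcwd()),
        class_slug,
    ]


def run_coursera_dl(username, password, class_slug, path=None):
    """Run coursera-dl without going through a shell; returns its exit code."""
    return subprocess.call(
        build_coursera_dl_command(username, password, class_slug, path))
```

Because no shell parses the list, characters such as `&`, `^`, or spaces in the password reach coursera-dl unchanged.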

ContentTooShortError

Hi! After the videos are downloaded, the script fails with the error below for every course I select. (I tried several courses to check; the common part is urllib.error.ContentTooShortError, and I don't know what it means.)

My environment:

  • Anaconda python 3.5
  • Firefox 46
Logging In....
Logged in!
[('https://class.coursera.org/pgm-003/wiki/view?page=CourseSchedule', 'CourseSchedule'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=CourseInformation', 'CourseInformation'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=CourseStaff', 'OurTeam'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=CourseLogistics', 'CourseLogistics'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=OctaveInstallation', 'OctaveInstallation'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=LectureSlides', 'LectureSlides'),
 ('https://class.coursera.org/pgm-003/questions', 'QuickQuestions15'),
 ('https://class.coursera.org/pgm-003/class/index', 'Home'),
 ('https://class.coursera.org/pgm-003/assignment/index', 'ProgrammingAssignments'),
 ('https://class.coursera.org/pgm-003/forum/index', 'DiscussionForums'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=FAQList', 'FAQ')]
Traceback (most recent call last):
  File "dl_all.py", line 290, in <module>
    download_sidebar_pages(session)
  File "dl_all.py", line 229, in download_sidebar_pages
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 123, in download_all_zips_on_page
    urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
  File "D:\Anaconda3\lib\urllib\request.py", line 228, in urlretrieve
    % (read, size), result)
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 1430256 out of 1432748 bytes>
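
ContentTooShortError is raised by `urllib.request.urlretrieve` when fewer bytes arrive than the server announced, which usually points to an interrupted transfer. One way to cope, sketched here under the assumption that simply retrying is acceptable; `download_with_retry` is a hypothetical helper, not code from dl_all.py:

```python
import os
import urllib.error
import urllib.request


def download_with_retry(url, dest, attempts=3,
                        retrieve=urllib.request.urlretrieve):
    """Download url to dest, retrying when the transfer ends early."""
    for attempt in range(1, attempts + 1):
        try:
            retrieve(url, dest)
            return True
        except urllib.error.ContentTooShortError:
            # Discard the partial file so a truncated download is never kept.
            if os.path.exists(dest):
                os.remove(dest)
            print("Incomplete download of %s (attempt %d of %d)"
                  % (url, attempt, attempts))
    return False
```

Returning False instead of raising lets the caller skip the file and continue with the remaining links on the page.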

Symbols in password causes authentication errors

Environment:
Windows 7 64-bit
Python 3.5

Python dependencies look up to date.

python -m pip install --upgrade coursera-dl
Requirement already up-to-date: coursera-dl in c:\python35\lib\site-packages
Requirement already up-to-date: beautifulsoup4>=4.1.3 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: html5lib>=1.0b2 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: requests>=2.4.3 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: six>=1.5.0 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: urllib3>=1.10 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: pyasn1>=0.1.7 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: keyring>=4.0 in c:\python35\lib\site-packages (from coursera-dl)
Requirement already up-to-date: pywin32-ctypes; sys_platform == "win32" in c:\python35\lib\site-packages (from keyring>=4.0->coursera-dl)

Authentication succeeds with some symbols in the password, such as !@#$%*

python dl_all.py -u username -p Easypassword1@#$%* -v -a -q --headless
https://class.coursera.org/androidpart1-014/ androidpart1-014
coursera_dl version 0.6.1
Downloading class: androidpart1-014
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): www.coursera.org
Logged in on coursera.org.
Found authentication cookies.

Authentication fails with ^ in the password.

python dl_all.py -u username -p Easypassword1^ -v -a -q --headless
https://class.coursera.org/androidpart1-014/ androidpart1-014
coursera_dl version 0.6.1
Downloading class: androidpart1-014
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): class.coursera.org
Starting new HTTPS connection (1): www.coursera.org
Could not authenticate: Cannot login on coursera.org.

With & in the password, the script seems to ignore the --headless option and tries to start Firefox, which had automatically updated itself to v47.

python dl_all.py -u username -p Easypassword1& -v -a -q --headless
https://class.coursera.org/androidpart1-014/ androidpart1-014
Traceback (most recent call last):
  File "dl_all.py", line 317, in <module>
    session = webdriver.Firefox()
  File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 81, in __init__
    self.binary, timeout)
  File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\extension_connection.py", line 51, in __init__
    self.binary.launch_browser(self.profile, timeout=timeout)
  File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\firefox_binary.py", line 68, in launch_browser
    self._wait_until_connectable(timeout=timeout)
  File "C:\Python35\lib\site-packages\selenium\webdriver\firefox\firefox_binary.py", line 98, in _wait_until_connectable
    raise WebDriverException("The browser appears to have exited "
selenium.common.exceptions.WebDriverException: Message: The browser appears to have exited before we could connect. If you specified a log_file in the FirefoxBinary constructor, check it for details.

'-v' is not recognized as an internal or external command,
operable program or batch file.
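
A likely explanation, offered as an assumption rather than a confirmed diagnosis: in the Windows command prompt, `&` separates commands and `^` is the escape character, so an unquoted password containing them is altered before Python ever sees it. That would be consistent with the "'-v' is not recognized" message above. Quoting the password is one workaround; another is to keep it off the command line altogether. A minimal sketch, where `parse_credentials` is a hypothetical helper and the `-u`/`-p` names mirror the script's options:

```python
import argparse
import getpass


def parse_credentials(argv, prompt=getpass.getpass):
    """Return (username, password), prompting when -p is not given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-u", required=True, help="Coursera e-mail")
    parser.add_argument("-p", default=None,
                        help="password; prompted for when omitted")
    # parse_known_args ignores the script's other flags (-v, -a, -q, ...).
    args, _unknown = parser.parse_known_args(argv)
    password = args.p if args.p is not None else prompt("Coursera password: ")
    return args.u, password
```

When the password is typed at the prompt, the shell never parses it, so symbols need no escaping.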

Programming1-002 edge case

The homework and final exam pages use non-standard URLs, yet they still require you to click a button to continue, just as they would if they used the provided front-ends. As a result, the homework assignments and the final exam aren't parsed.

Firefox Needed?

Thank you for your code. However, I ran into a problem.
When I add the -v flag, the script runs. If I only use -q -a, however, I get this error:

('https://class.coursera.org/algs4partII-007/', 'algs4partII-007')
Traceback (most recent call last):
  File "dl_all.py", line 282, in <module>
    session = webdriver.Firefox()
  File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 81, in __init__
    self.binary, timeout)
  File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/extension_connection.py", line 51, in __init__
    self.binary.launch_browser(self.profile, timeout=timeout)
  File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 67, in launch_browser
    self._start_from_profile_path(self.profile.path)
  File "/Users/Doodle/anaconda/lib/python2.7/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 90, in _start_from_profile_path
    env=self._firefox_env)
  File "/Users/Doodle/anaconda/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/Users/Doodle/anaconda/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception

Could you help me?

Check for extension in download_all_zips_on_page is naive

As I don't have much experience with Python, I must apologize in advance if my terminology doesn't quite match up with my implementation.

In the function download_all_zips_on_page(session, path='assignments'), the check for each entry in hw_strings searches the entire unicode string and reports success on the first match anywhere in it. This is a problem if the URL contains one of these substrings somewhere before the very end.

For instance, if the URL == "www.python.org", the check erroneously confirms that the URL is a file due to the included substring ".py".

This might not cover all edge cases, but one way of parsing the URL as intended without causing inevitable IOErrors is to only check the very end of the URL:

    hw_strings = (u'.zip', u'.py', u'.m', u'.pdf')

    for i in links:
        url = i.get_attribute('href')
        if url is None:
            continue
        txt_file.write(url+'\n')

        # A trailing slash must be stripped before the extension is checked.
        url = url.rstrip(u'/')

        # Match only at the very end of the URL, so "www.python.org"
        # is no longer mistaken for a ".py" file.
        is_hw = url.endswith(hw_strings)
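
A somewhat more robust variant of the same idea parses the URL first, so that a query string such as `?dl=1` does not hide the extension. This is a sketch for Python 3 only; `is_homework_url` and `HW_EXTENSIONS` are hypothetical names, and the extension list is the one from the snippet above.

```python
import os.path
from urllib.parse import urlparse

HW_EXTENSIONS = {".zip", ".py", ".m", ".pdf"}


def is_homework_url(url):
    """Return True when the path component of url ends in a known extension."""
    # Only the path is inspected: host name and query string are ignored.
    path = urlparse(url).path.rstrip("/")
    extension = os.path.splitext(path)[1].lower()
    return extension in HW_EXTENSIONS
```

With this check, `http://www.python.org` is rejected because its path is empty, while a link whose path ends in `hw1.zip` is accepted.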

Downloaded Courses

Hi, nice program!

Just one thing: I'm not able to choose which courses it downloads. Can I specify the courses I want, or is the selection random?

Thanks!!

No Quiz and Assignments for comparch-003

I don't see any files downloaded for quizzes or assignments for comparch-003.

The screen output for the assignments download is below (username and password removed):

$ python coursera-dl-all/dl_all.py -u yyy -p xxx -a
https://class.coursera.org/comparch-003/ comparch-003
Logging In....
Logged in!
[]

$ ls coursera-downloads/comparch-003/assignments/
$

Further file and directory name sanitization needed

Env: Ubuntu, Python 2.7.10

venture-001 has sections with forward slashes in their names, such as "Grading / Notation", "Week 1 / Semaine 1", "Final Project / Project final", and so on.

The script seems to work okay for most of the sections, but fails on "Quizzes / Quizz". Here's the trace:

Downloading Quizzes...
Downloading Quizzes / Quizz
Traceback (most recent call last):
  File "dl_all.py", line 335, in <module>
    quiz_info = get_quiz_info(session, i[0], i[1])
  File "dl_all.py", line 175, in get_quiz_info
    render(session, os.getcwd()+'/'+category_name)
  File "dl_all.py", line 44, in render
    f = open(path+'.html', 'wb')
IOError: [Errno 2] No such file or directory: u'/home/username/Documents/coursera-downloads/venture-001/Quizzes / Quizz.html'

Should be an easy fix to replace the / with a dash or something.
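
A minimal sketch of that fix, assuming a dash is an acceptable replacement. `sanitize_name` is a hypothetical helper (not the script's own clean_filename), and the character set also covers the characters Windows rejects in file names, such as the colon:

```python
import re

# Characters that are path separators or invalid in Windows file names.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_name(name, replacement="-"):
    """Return name with unsafe characters replaced and whitespace tidied."""
    cleaned = _INVALID_CHARS.sub(replacement, name)
    cleaned = re.sub(r"\s+", " ", cleaned)
    # Windows also dislikes names that end in a dot or a space.
    return cleaned.strip(" .")
```

Applied to the section title from the trace, `sanitize_name("Quizzes / Quizz")` gives `"Quizzes - Quizz"`, which can safely be used as a file or directory name.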

error module doesn't exist for urllib in Python 2.x

The error module is a new addition meant for Python 3.
An equivalent for urllib.error.HTTPError in Python 2.x would be to use urllib2.HTTPError
A solution would be to check the version of Python used to run the script (sys.version) and write conditionals to determine imports and modules.
An example can be found here: http://stackoverflow.com/questions/1875259/importing-a-module-based-on-installed-python-version

You may also want to handle multiple errors at once, since urllib.urlretrieve can also produce an OSError if a connection cannot be established.

For instance,

except (OSError, urllib.error.HTTPError, urllib.error.URLError) as e:
         print("Failed to download "+url)
         continue
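
Instead of inspecting sys.version, the imports themselves can be tried in order, which is a common way to support both Python lines. A sketch under that assumption; `try_download` is a hypothetical wrapper, and the `retrieve` parameter exists only so the function can be exercised without network access:

```python
try:  # Python 3
    from urllib.error import HTTPError, URLError
    from urllib.request import urlretrieve
except ImportError:  # Python 2
    from urllib2 import HTTPError, URLError
    from urllib import urlretrieve


def try_download(url, dest, retrieve=urlretrieve):
    """Download url to dest; return False instead of raising on failure."""
    try:
        retrieve(url, dest)
    except (HTTPError, URLError, IOError) as e:
        # IOError is an alias of OSError on Python 3, so this covers both.
        print("Failed to download %s: %s" % (url, e))
        return False
    return True
```

Inside the download loop, a False return would take the place of the bare `continue` shown above.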

Sidebar title with forward slash causes IOError

As I'm currently unable to test any commits you've made since last night because I'm visiting family, I was wondering if you could look into the following crash for hetero-004 with the "Home / Announcements" sidebar link and see if your new code properly handles it.

Here's the error code using the previous night's codebase:

Traceback (most recent call last):
  File "dl_all.py", line 249, in <module>
    download_sidebar_pages(session)
  File "dl_all.py", line 190, in download_sidebar_pages
    render(session, os.getcwd()+'/'+i[1])
  File "dl_all.py", line 33, in render
    f = open(path+'.html', 'w')
IOError: [Errno 2] No such file or directory: u'/home/joshuawn/Desktop/coursera-dl-all/coursera-downloads/hetero-004/Home / Announcements.html'

WindowsError: [Error 267] The directory name is invalid: u'assignments/Programming Assignment 1: WordNet'

I got an error. It looks like the program is trying to create a directory whose name contains an invalid character. Please see the output below.

('https://class.coursera.org/algs4partII-007/', 'algs4partII-007')
Logging In....
Logged in!
[(u'https://class.coursera.org/algs4partII-007/wiki/view?page=schedule', u'Schedule'),
 (u'https://class.coursera.org/algs4partII-007/wiki/ScheduleGoogleHangouts', u'GoogleHangouts'),
 (u'https://class.coursera.org/algs4partII-007/class/index', u'Home'),
 (u'https://class.coursera.org/algs4partII-007/wiki/view?page=errata', u'Errata'),
 (u'https://class.coursera.org/algs4partII-007/wiki/view?page=syllabus', u'Syllabus'),
 (u'https://class.coursera.org/algs4partII-007/assignment/index', u'ProgrammingAssignments'),
 (u'https://class.coursera.org/algs4partII-007/forum/index', u'DiscussionForums')]
Downloading Quizzes....
Downloading Surveys
Downloading Job Interview Questions
Downloading Exercises
Programming Assignment 1: WordNet
Help Center
Programming Assignment 2: Seam Carving
Help Center
Programming Assignment 3: Baseball Elimination
Help Center
Programming Assignment 4: Boggle
Help Center
Programming Assignment 5: Burrows-Wheeler
Help Center
[u'Programming Assignment 1: WordNet', u'Programming Assignment 2: Seam Carving', u'Programming Assignment 3: Baseball Elimination', u'Programming Assignment 4: Boggle', u'Programming Assignment 5: Burrows-Wheeler']
Traceback (most recent call last):
  File "dl_all.py", line 304, in <module>
    download_all_assignments(session, assign_info)
  File "dl_all.py", line 213, in download_all_assignments
    download_all_zips_on_page(session, 'assignments/'+i[1])
  File "dl_all.py", line 99, in download_all_zips_on_page
    os.makedirs(path)
  File "C:\Python27\lib\os.py", line 157, in makedirs
    mkdir(name, mode)
WindowsError: [Error 267] The directory name is invalid: u'assignments/Programming Assignment 1: WordNet'
