quickflix-crawler's Issues
Calculate the popularity of movies
Implement page rank algorithm
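As a starting point, here is a minimal sketch of the iterative PageRank computation over an in-memory link graph. The function name `pagerank`, the `links` dictionary shape, and the default damping factor of 0.85 are assumptions for illustration, not part of the existing code base.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of page -> rank.

    `links` (hypothetical input shape) maps each page to the list of
    pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The ranks always sum to 1, so they can be compared directly across pages or aggregated per domain.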
Need to check why the crawler cannot find movie review JSON at the following URLs
These URLs are all aggregated by Google; however, our crawler could not find the review JSON in any of them:
- http://www.metacritic.com/movie/the-magnificent-seven
- http://www.newyorker.com/goings-on-about-town/movies/the-magnificent-seven
- http://www.nytimes.com/2016/09/23/movies/magnificent-seven-review-denzel-washington.html?referrer=google_kp&_r=0
- http://www.avclub.com/review/magnificent-seven-gets-uninspired-remake-242722
Add a learning algorithm to do selective crawling for domains
Found a bunch of movie review URLs that could be parsed as the seed
Please help to add more entries in the following format:
'url1',
'url2',
Implement priority queue for crawling
Prioritise certain domains over others depending on whether they contain reviews or are otherwise useful (based on some metric)
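One possible shape for this, sketched with the standard-library `heapq` module. The class name `CrawlQueue` and the `domain_scores` mapping are hypothetical; the metric that produces the scores is still to be decided.

```python
import heapq
import itertools
from urllib.parse import urlparse


class CrawlQueue:
    """Priority queue of URLs, ordered by a per-domain score (higher first)."""

    def __init__(self, domain_scores, default_score=0.0):
        # domain_scores is a hypothetical mapping: domain -> usefulness score.
        self._domain_scores = domain_scores
        self._default_score = default_score
        self._heap = []
        # Tie-breaker so URLs with equal scores come out in insertion order.
        self._counter = itertools.count()

    def push(self, url):
        domain = urlparse(url).netloc.lower()
        score = self._domain_scores.get(domain, self._default_score)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Usage: push every discovered URL, then pop in a loop; URLs from higher-scored domains are visited before the rest.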
Catch exceptions in remote.py
fetch_html(): request timeout and bad URLs etc
parse_review(): error loading string into json
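A rough sketch of what the error handling could look like. The signatures below are assumptions (the real `fetch_html()` and `parse_review()` in remote.py may differ), and the standard-library `urllib` is used here only to keep the example dependency-free; the same pattern applies to whatever HTTP client remote.py actually uses.

```python
import json
import logging
import socket
import urllib.error
import urllib.request

log = logging.getLogger(__name__)


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, socket.timeout, ValueError) as exc:
        # URLError covers HTTP errors and unreachable hosts,
        # socket.timeout covers slow responses,
        # ValueError is raised for malformed URLs (e.g. missing scheme).
        log.warning("fetch_html failed for %r: %s", url, exc)
        return None


def parse_review(raw):
    """Return the review as a dict, or None if `raw` is not valid JSON."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        # JSONDecodeError: malformed string; TypeError: raw is not str/bytes.
        log.warning("parse_review could not load JSON: %s", exc)
        return None
    # Assumption: a review is expected to be a JSON object.
    return data if isinstance(data, dict) else None
```

Returning `None` instead of raising lets the caller skip a bad page and carry on with the next URL.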
Implement a function to crawl a movie's basic information by name
Pic
Description
Release date:
Director:
Actors:
Producers:
Implement recursive discovery of URLs
- Crawler pops a URL from the queue, visits it (or not), parses the review (if the page has one), and adds newly discovered URLs to the queue
- Request rate controlled by some time-delay
- Needs #3 to be done
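The loop described above could be sketched roughly as follows. It uses an in-memory queue and visited set, and takes the fetch and parse functions as parameters so it can be tried without network access; the name `crawl`, the `max_pages` cap, and the simple `href` regex are all assumptions for illustration.

```python
import re
import time
from collections import deque

# Simplistic link extractor (assumption): absolute http(s) links in href="...".
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def crawl(seeds, fetch, parse_review, delay=1.0, max_pages=100):
    """Visit pages breadth-first, collecting reviews and discovering new URLs.

    `fetch(url)` returns HTML text or None; `parse_review(html)` returns a
    review or None. Both are supplied by the caller.
    """
    queue = deque(seeds)
    visited = set()
    reviews = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already handled, do not visit twice
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue  # fetch failed, move on to the next URL
        review = parse_review(html)
        if review is not None:
            reviews.append(review)
        # Add newly discovered URLs to the back of the queue.
        for link in LINK_RE.findall(html):
            if link not in visited:
                queue.append(link)
        time.sleep(delay)  # crude request-rate control
    return visited, reviews
```

Although the issue title says "recursive", an explicit queue avoids deep call stacks and maps directly onto a persistent URL queue later on.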
Create db tables
The crawler needs access to:
- list of URLs that have already been visited
- queue of URLs that have yet to be visited
And of course:
- data tables for movie reviews
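A minimal schema sketch for the three needs listed above, using the standard-library `sqlite3` module. The table and column names are hypothetical, and SQLite itself is an assumption; the project may settle on a different database.

```python
import sqlite3

# Hypothetical schema: one table per requirement listed in the issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS visited_urls (
    url        TEXT PRIMARY KEY,
    visited_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS url_queue (
    id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves insertion order
    url TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS reviews (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    url   TEXT NOT NULL UNIQUE,             -- one stored review per source URL
    movie TEXT,
    body  TEXT
);
"""


def create_tables(conn):
    """Create the crawler tables if they do not exist yet."""
    conn.executescript(SCHEMA)
    conn.commit()
```

With the `UNIQUE` constraint on `reviews.url`, an `INSERT OR IGNORE` silently skips a review whose URL is already stored, which is one possible guard against duplicate rows.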
Integrate crawling with page rank algorithm
Add HTTPS support
Check whether the requests library supports it. Otherwise we need to implement it ourselves.
Implement multi-threading
Duplicate review found in database