Deep learning based Smart Web Crawler
We crawl the web all the time to get the data we need. However, classifying useful and useless data is a time-consuming and frustrating process. Our team wants to reduce that time by applying deep learning techniques to find user-customized useful patterns in crawled web data.
We are at the stage of planning out smart ways to solve this problem. Please contact us if you have any ideas.
- Crawl as usual
- Go through the HTML source code and get the most useful phrase.
- Crawl as usual
- Use a CNN to find the most likely region.
- Go through the HTML source code in the region detected in the previous step. Use an LSTM to get the most useful phrase.
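The two approaches above can be sketched roughly as follows. This is only an illustration of the pipeline shape, not the project's implementation: the names `RegionCollector`, `score_phrase`, `score_region`, `most_useful_phrase`, and `REGION_TAGS` are all hypothetical, and simple keyword-overlap heuristics stand in for the CNN (region detection) and the LSTM (phrase selection), which are not designed yet. It uses only the Python standard library.

```python
from html.parser import HTMLParser

# Assumption: these tags mark a candidate "region" of the page.
REGION_TAGS = {"div", "section", "article", "table", "ul"}


class RegionCollector(HTMLParser):
    """Group the text phrases of a page by outermost region tag."""

    def __init__(self):
        super().__init__()
        self.regions = []  # one list of phrases per region
        self._depth = 0    # nesting depth inside region tags

    def handle_starttag(self, tag, attrs):
        if tag in REGION_TAGS:
            if self._depth == 0:
                self.regions.append([])
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in REGION_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._depth > 0:
            self.regions[-1].append(text)


def score_phrase(phrase, keywords):
    """Stand-in for the LSTM: fraction of words that are user keywords."""
    words = [w.strip(".,:;!?") for w in phrase.lower().split()]
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)


def score_region(phrases, keywords):
    """Stand-in for the CNN: mean phrase score inside the region."""
    return sum(score_phrase(p, keywords) for p in phrases) / len(phrases)


def most_useful_phrase(html, keywords, use_regions=True):
    """Return the best phrase, optionally restricted to the best region."""
    collector = RegionCollector()
    collector.feed(html)
    collector.close()
    regions = [r for r in collector.regions if r]
    if not regions:
        return None
    if use_regions:
        # Second approach: pick the most likely region, then search inside it.
        candidates = max(regions, key=lambda r: score_region(r, keywords))
    else:
        # First approach: search every phrase on the page.
        candidates = [p for r in regions for p in r]
    return max(candidates, key=lambda p: score_phrase(p, keywords))


SAMPLE_HTML = """
<div id="nav">Home About Contact</div>
<div id="main"><h1>Widget price</h1><p>Price: 19.99 USD</p><p>In stock</p></div>
<div id="footer"><p>USD</p><p>Copyright 2017</p><p>Terms of use</p><p>Privacy</p></div>
"""
KEYWORDS = {"price", "usd"}

print(most_useful_phrase(SAMPLE_HTML, KEYWORDS, use_regions=False))  # USD
print(most_useful_phrase(SAMPLE_HTML, KEYWORDS, use_regions=True))   # Price: 19.99 USD
```

In this toy page, searching the whole document picks up a stray "USD" in the footer, while restricting the search to the highest-scoring region returns the actual price line. That is the motivation for adding a region-detection step before phrase selection.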
Note that this is not the final version.
- Run
python DeepCrawler --mode sample
in the project directory.
- The Sampler GUI app will show up.
4. Press 'Start Sampling'
In Development