scrape-finance-data-v2's People

Contributors

ngockhanh5110, vincetran96

scrape-finance-data-v2's Issues

Can't scrape anymore

Hi Vince,

It seems Vietstock has applied some new scraping blocker. The mass and industry scraping options are unusable. Scraping by ticker works, but it only fetches the first few pages of the financial statements. Here is the log of a mass scrape:

2021-11-02 15:28:11 [scrapy.extensions.telnet] INFO: Telnet Password: 60c4207748982772
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2021-11-02 15:28:11 [financeInfo] INFO: Reading start URLs from redis key 'financeInfo:start_urls' (batch size: 16, encoding: utf-8)
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'rotating_proxies.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_redis.pipelines.RedisPipeline']
2021-11-02 15:28:11 [scrapy.core.engine] INFO: Spider opened
2021-11-02 15:28:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-02 15:28:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-11-02 15:28:11 [celery.app.trace] INFO: Task celery_tasks.finance_task[cd208afb-3cd5-40f7-a938-248c7d98603c] succeeded in 0.11887021500001538s: None
2021-11-02 15:28:12 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:16 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:21 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:26 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:31 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:31 [financeInfo] INFO: corpAZ closed key: 1
2021-11-02 15:28:31 [financeInfo] INFO: corpAZ key financeInfo:corpAZtickers contains: set()
2021-11-02 15:28:31 [financeInfo] INFO: set()
2021-11-02 15:28:31 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy
2021-11-02 15:28:31 [scrapy.core.engine] INFO: Closing spider (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-11-02 15:28:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 20.031222,
 'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
 'finish_time': datetime.datetime(2021, 11, 2, 15, 28, 31, 938732),
 'log_count/INFO': 21,
 'log_count/WARNING': 1,
 'memusage/max': 65544192,
 'memusage/startup': 65331200,
 'start_time': datetime.datetime(2021, 11, 2, 15, 28, 11, 907510)}
2021-11-02 15:28:31 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
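The stats dump above can be read as a plain Python literal, which makes the symptom explicit: there is no 'response_received_count' key at all, so the spider never issued a single request before closing — consistent with the Redis start-URL queue being empty when the spider began idling. A minimal sketch (the dict below is copied verbatim from the log):

```python
# Sketch: the Scrapy stats dict copied from the log above.
# datetime is imported so the literal evaluates as-is.
import datetime

stats = {'elapsed_time_seconds': 20.031222,
         'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
         'finish_time': datetime.datetime(2021, 11, 2, 15, 28, 31, 938732),
         'log_count/INFO': 21,
         'log_count/WARNING': 1,
         'memusage/max': 65544192,
         'memusage/startup': 65331200,
         'start_time': datetime.datetime(2021, 11, 2, 15, 28, 11, 907510)}

# No 'response_received_count' (or any 'downloader/...') key means zero
# requests were sent: the spider opened, found nothing to crawl, and closed.
assert 'response_received_count' not in stats
```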

Missing tickers when mass scraping

There are more than 3,000 tickers in the file bizType_ind_tickers.csv. However, only 600-1,000 tickers are downloaded when I mass scrape (each run took about 5 hours on my internet connection). Many blue chips whose information is certainly available on VietStock, such as VIC and GAS, are missed.

I have tried several times over both WiFi and LAN connections. The number of tickers and JSON files varies from run to run: one run produced 657 tickers, another 1,025. None of the runs came close to the 3,129 tickers listed in bizType_ind_tickers.csv.

Every run stopped by itself and clearly stated that it had finished (see the terminal record at the end).

Below is a count summary of total versus downloaded tickers for one mass-scrape run:

       biztype_id  ind_id  tickers  tickers_downloaded
TOTAL                         3129                 657
0               1     100      136                 110
1               1     200       81                  30
2               1     300      171                  49
3               1     400      598                  49
4               1     500      903                  50
5               1     600      192                  51
6               1     700       67                  11
7               1     800      221                  50
8               1     900       84                  24
9               1    1000       22                   0
10              1    1100        3                   0
11              1    1200       66                  12
12              1    1300        5                   0
13              1    1400       66                   7
14              1    1500        4                   0
15              1    1600        5                   0
16              1    1700        7                   0
17              1    1800       43                   0
18              1    1900        6                   0
19              1    2000        4                   0
20              2    1000      105                 103
21              3    1000       75                  60
22              4     900        1                   0
23              4    1000       39                  24
24              4    1600        1                   0
25              5    1000       32                  19
26              6    1000        2                   1
27              6    2000        2                   1
28              7    1000        7                   6
29              8    1200      181                   0
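A summary like the one above can be reproduced by diffing the ticker list in bizType_ind_tickers.csv against the JSON files actually on disk. This is only a sketch: the 'ticker' column name and the `<TICKER>_*.json` file-naming scheme are assumptions, so adjust them to the repo's actual layout.

```python
# Sketch: find tickers listed in bizType_ind_tickers.csv that have no
# downloaded JSON file. Column name 'ticker' and the <TICKER>_*.json
# naming convention are assumptions about the repo's output layout.
import csv
import glob
import os

def missing_tickers(csv_path, data_dir):
    # Tickers we expect, read from the CSV's 'ticker' column
    with open(csv_path, newline="") as f:
        wanted = {row["ticker"].strip() for row in csv.DictReader(f)}
    # Tickers we actually got: take the prefix of each JSON filename
    have = {os.path.basename(p).split("_")[0]
            for p in glob.glob(os.path.join(data_dir, "*.json"))}
    return sorted(wanted - have)
```

Running this after each mass scrape would show immediately whether blue chips such as VIC or GAS were skipped.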

Below is the terminal record:

Scrape Finance Data - version 2

Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...

OK
rm: cannot remove './run/celery/*': No such file or directory
removed './run/scrapy/financeInfo.scrapy'
removed './logs/corporateAZExpress_log_verbose.log'
removed './logs/corporateAZOverview_log_verbose.log'
removed './logs/financeInfo_log_verbose.log'
Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
…
Scrapy is still running...
Scrapy has finished
Killing Celery workers, flushing Redis queues, deleting Celery run files...
OK
removed './run/celery/workercorpAZ.pid'
removed './run/celery/workerfinance.pid'
Exiting...

financeInfo's log file is incomplete

financeInfo's log file does not show download stats, etc.:

2021-09-24 20:21:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/status/403': 1,
 'elapsed_time_seconds': 1402.909312,
 'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
 'finish_time': datetime.datetime(2021, 9, 25, 1, 21, 29, 850110),
 'httpcompression/response_bytes': 162246573,
 'httpcompression/response_count': 2135,
 'log_count/INFO': 2910,
 'memusage/max': 61169664,
 'memusage/startup': 55721984,
 'response_received_count': 2136,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued/redis': 2135,
 'scheduler/enqueued/redis': 2135,
 'start_time': datetime.datetime(2021, 9, 25, 0, 58, 6, 940798)}
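Two things stand out in this dump: there is a `bans/status/403` entry (from the rotating_proxies ban detector) and a 403 on the robots.txt request, while the usual `item_scraped_count` and `downloader/...` keys are absent. A small sketch to flag such dumps automatically — the key names follow Scrapy's built-in stat conventions, but the check list itself is my own assumption, not part of this project:

```python
# Sketch: sanity-check a Scrapy stats dict for signs of blocking.
# Key names follow Scrapy's stat conventions; which checks matter
# is an assumption for illustration.
def diagnose(stats):
    problems = []
    if any(k.startswith("bans/") for k in stats):
        problems.append("ban detected by rotating_proxies")
    if stats.get("robotstxt/response_status_count/403"):
        problems.append("robots.txt request returned 403")
    if "item_scraped_count" not in stats:
        problems.append("no items scraped")
    return problems
```

Applied to the dump above, all three checks would fire, which suggests the "incomplete" log is really the crawl being cut short by bans rather than a logging problem.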

Endless loop of downloading

When I select the option to scrape by business ID and industry ID, the execution goes into an endless loop: the downloaded JSON files are continuously replaced by new ones with exactly the same names.

I haven't checked whether the issue also happens with mass scraping or the other options.
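One possible workaround while the loop is investigated: make the file-writing step idempotent, so a payload that is byte-for-byte identical to what is already on disk is skipped instead of rewritten. This is only a sketch of the idea — the function name and JSON layout are hypothetical, not the project's actual pipeline code:

```python
# Sketch: skip writing a scraped JSON payload when an identical file
# already exists, to break re-download loops. Naming is hypothetical.
import json
import os

def write_once(path, payload):
    """Write payload as JSON unless the same content is already on disk."""
    text = json.dumps(payload, sort_keys=True)
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == text:
                return False  # identical file already present; skip
    with open(path, "w") as f:
        f.write(text)
    return True
```

If the loop persists even with a guard like this, the duplicate requests are being re-enqueued upstream (e.g. in the Redis queue) rather than at the file-writing stage.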