vincetran96 / scrape-finance-data-v2
A standalone package to scrape financial data from listed Vietnamese companies via Vietstock
License: MIT License
Hi Vince,
It seems Vietstock has applied some new scraping blocker. The "mass" and "industry" scraping options are unusable. Scraping by ticker works, but it only scrapes the first few pages of the financial statements. Here are the logs of a mass scrape:
2021-11-02 15:28:11 [scrapy.extensions.telnet] INFO: Telnet Password: 60c4207748982772
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2021-11-02 15:28:11 [financeInfo] INFO: Reading start URLs from redis key 'financeInfo:start_urls' (batch size: 16, encoding: utf-8
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_redis.pipelines.RedisPipeline']
2021-11-02 15:28:11 [scrapy.core.engine] INFO: Spider opened
2021-11-02 15:28:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-02 15:28:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-11-02 15:28:11 [celery.app.trace] INFO: Task celery_tasks.finance_task[cd208afb-3cd5-40f7-a938-248c7d98603c] succeeded in 0.11887021500001538s: None
2021-11-02 15:28:12 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:16 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:21 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:26 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:31 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:31 [financeInfo] INFO: corpAZ closed key: 1
2021-11-02 15:28:31 [financeInfo] INFO: corpAZ key financeInfo:corpAZtickers contains: set()
2021-11-02 15:28:31 [financeInfo] INFO: set()
2021-11-02 15:28:31 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy
2021-11-02 15:28:31 [scrapy.core.engine] INFO: Closing spider (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-11-02 15:28:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 20.031222,
'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
'finish_time': datetime.datetime(2021, 11, 2, 15, 28, 31, 938732),
'log_count/INFO': 21,
'log_count/WARNING': 1,
'memusage/max': 65544192,
'memusage/startup': 65331200,
'start_time': datetime.datetime(2021, 11, 2, 15, 28, 11, 907510)}
2021-11-02 15:28:31 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
Why are there so many packages in requirements.txt? For example, I don't know what the cryptography==2.9.2 package is used for.
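Pinned entries like cryptography are usually transitive dependencies pulled in by the top-level packages rather than direct imports. A minimal stdlib sketch to find which installed distributions declare a given package as a requirement (the function name is mine, not part of this repo):

```python
import re
from importlib import metadata

def reverse_deps(target: str) -> list[str]:
    """List installed distributions that declare `target` as a requirement."""
    dependents = []
    for dist in metadata.distributions():
        for req in dist.requires or []:
            # Strip version specifiers / environment markers to get the bare name
            name = re.split(r"[ ;<>=!~\[(]", req, maxsplit=1)[0]
            if name.lower() == target.lower():
                dist_name = dist.metadata["Name"]
                if dist_name:
                    dependents.append(dist_name)
    return sorted(set(dependents))

print(reverse_deps("cryptography"))
```

Running this inside the project's virtualenv would show whether cryptography is needed by, say, the HTTP/TLS stack or can be dropped from requirements.txt.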
Tasks:
Vietstock has implemented __RequestVerificationToken to prevent cross-site request forgery attacks. Unless that issue is addressed, this application cannot download financials anymore.
More about __RequestVerificationToken:
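For background, ASP.NET's anti-forgery scheme embeds the token in a hidden form field (paired with a cookie), and the same value must be echoed back on POST requests. A minimal stdlib sketch of pulling the field out of a page body — the regex and the commented usage are my assumptions, not code from this repo:

```python
import re

# Hidden-input pattern for ASP.NET's anti-forgery field; assumes the
# name attribute appears before the value attribute in the markup.
TOKEN_RE = re.compile(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"')

def extract_token(html: str):
    """Return the anti-forgery token from a page body, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None

# Hypothetical usage with a requests.Session (URLs/fields are placeholders):
#   token = extract_token(session.get(page_url).text)  # GET also sets the cookie
#   session.post(api_url, data={"__RequestVerificationToken": token, ...})
```

The key detail is that the token and cookie come from the same session, so any fix would need the spider to fetch the page first and carry both into the API POSTs.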
There are more than 3,000 tickers in the file bizType_ind_tickers.csv. However, only 600-1,000 tickers are downloaded when I mass scrape (it takes 5 hours each time to scrape those tickers with my internet connection). It misses many blue chips whose information is certainly available on Vietstock, such as VIC and GAS.
I have tried several times over both WiFi and LAN connections. The number of tickers and JSON files varies from one run to another: one run produced 657 tickers, another 1,025. None of those runs came close to the total of 3,129 tickers in bizType_ind_tickers.csv.
The executions all stopped by themselves and stated clearly that they had finished (as shown in the terminal record below).
Following is a count summary of total tickers and downloaded tickers for one mass scrape:
| # | biztype_id | ind_id | ticker | ticker_download |
| --- | --- | --- | --- | --- |
| TOTAL | | | 3129 | 657 |
| 0 | 1 | 100 | 136 | 110 |
| 1 | 1 | 200 | 81 | 30 |
| 2 | 1 | 300 | 171 | 49 |
| 3 | 1 | 400 | 598 | 49 |
| 4 | 1 | 500 | 903 | 50 |
| 5 | 1 | 600 | 192 | 51 |
| 6 | 1 | 700 | 67 | 11 |
| 7 | 1 | 800 | 221 | 50 |
| 8 | 1 | 900 | 84 | 24 |
| 9 | 1 | 1000 | 22 | 0 |
| 10 | 1 | 1100 | 3 | 0 |
| 11 | 1 | 1200 | 66 | 12 |
| 12 | 1 | 1300 | 5 | 0 |
| 13 | 1 | 1400 | 66 | 7 |
| 14 | 1 | 1500 | 4 | 0 |
| 15 | 1 | 1600 | 5 | 0 |
| 16 | 1 | 1700 | 7 | 0 |
| 17 | 1 | 1800 | 43 | 0 |
| 18 | 1 | 1900 | 6 | 0 |
| 19 | 1 | 2000 | 4 | 0 |
| 20 | 2 | 1000 | 105 | 103 |
| 21 | 3 | 1000 | 75 | 60 |
| 22 | 4 | 900 | 1 | 0 |
| 23 | 4 | 1000 | 39 | 24 |
| 24 | 4 | 1600 | 1 | 0 |
| 25 | 5 | 1000 | 32 | 19 |
| 26 | 6 | 1000 | 2 | 1 |
| 27 | 6 | 2000 | 2 | 1 |
| 28 | 7 | 1000 | 7 | 6 |
| 29 | 8 | 1200 | 181 | 0 |
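To see exactly which tickers a run missed, the CSV can be diffed against the output directory. A stdlib sketch (the column name and the one-JSON-file-per-ticker naming scheme are my assumptions; adjust both to the repo's actual layout):

```python
import csv
from pathlib import Path

def missing_tickers(csv_path: str, json_dir: str, ticker_col: str = "ticker"):
    """Tickers present in the CSV but with no downloaded JSON file.

    Assumes each output file's name starts with the ticker symbol
    (e.g. 'VIC_q1.json'); change the stem parsing if the repo names
    files differently.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        wanted = {row[ticker_col].strip().upper() for row in csv.DictReader(f)}
    got = {p.stem.split("_")[0].upper() for p in Path(json_dir).glob("*.json")}
    return sorted(wanted - got)
```

Checking whether the missing set is stable across runs (vs. random, as the 657 / 1,025 counts suggest) would help tell a ban/throttling problem apart from a queueing bug.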
Following is the terminal record:
Scrape Finance Data - version 2
Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...
OK
rm: cannot remove './run/celery/*': No such file or directory
removed './run/scrapy/financeInfo.scrapy'
removed './logs/corporateAZExpress_log_verbose.log'
removed './logs/corporateAZOverview_log_verbose.log'
removed './logs/financeInfo_log_verbose.log'
Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
…
Scrapy is still running...
Scrapy has finished
Killing Celery workers, flushing Redis queues, deleting Celery run files...
OK
removed './run/celery/workercorpAZ.pid'
removed './run/celery/workerfinance.pid'
Exiting...
financeInfo's log file does not show download stats, etc.:
2021-09-24 20:21:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/status/403': 1,
'elapsed_time_seconds': 1402.909312,
'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
'finish_time': datetime.datetime(2021, 9, 25, 1, 21, 29, 850110),
'httpcompression/response_bytes': 162246573,
'httpcompression/response_count': 2135,
'log_count/INFO': 2910,
'memusage/max': 61169664,
'memusage/startup': 55721984,
'response_received_count': 2136,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/403': 1,
'scheduler/dequeued/redis': 2135,
'scheduler/enqueued/redis': 2135,
'start_time': datetime.datetime(2021, 9, 25, 0, 58, 6, 940798)}
Is there any way to download only data from HOSE or another exchange? For now it downloads everything, and that takes several days to finish.
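If the ticker list carried an exchange column, the mass scrape could be narrowed before any URLs are queued. A sketch, assuming (hypothetically) that bizType_ind_tickers.csv has 'ticker' and 'exchange' columns — the file's real header may differ:

```python
import csv

def tickers_for_exchange(csv_path: str, exchange: str = "HOSE") -> list[str]:
    """Keep only tickers listed on one exchange (column names are assumed)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["ticker"] for row in csv.DictReader(f)
                if row.get("exchange", "").upper() == exchange.upper()]
```

The filtered list could then be fed to whatever step seeds the financeInfo:start_urls Redis key, so only requests for that exchange are enqueued.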
When I select the option to scrape by business ID and industry ID, the execution goes into an endless loop. The downloaded JSON files are continuously replaced by new ones with exactly the same names.
I haven't checked whether the issue happens for mass scraping or the other options.