vincetran96 / scrape-finance-data-v2
A standalone package to scrape financial data from listed Vietnamese companies via Vietstock
License: MIT License
Hi Vince,
It seems Vietstock has applied some new scraping blocker. The "mass" and "industry" scraping options are unusable. Scraping by ticker works, but it only scrapes the first few pages of the financial statements. Here are the logs of a mass scrape:
2021-11-02 15:28:11 [scrapy.extensions.telnet] INFO: Telnet Password: 60c4207748982772
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2021-11-02 15:28:11 [financeInfo] INFO: Reading start URLs from redis key 'financeInfo:start_urls' (batch size: 16, encoding: utf-8
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-11-02 15:28:11 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_redis.pipelines.RedisPipeline']
2021-11-02 15:28:11 [scrapy.core.engine] INFO: Spider opened
2021-11-02 15:28:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-11-02 15:28:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-11-02 15:28:11 [celery.app.trace] INFO: Task celery_tasks.finance_task[cd208afb-3cd5-40f7-a938-248c7d98603c] succeeded in 0.11887021500001538s: None
2021-11-02 15:28:12 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:16 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:21 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:26 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:31 [financeInfo] INFO: === IDLING... ===
2021-11-02 15:28:31 [financeInfo] INFO: corpAZ closed key: 1
2021-11-02 15:28:31 [financeInfo] INFO: corpAZ key financeInfo:corpAZtickers contains: set()
2021-11-02 15:28:31 [financeInfo] INFO: set()
2021-11-02 15:28:31 [financeInfo] INFO: Deleted status file at run/scrapy/financeInfo.scrapy
2021-11-02 15:28:31 [scrapy.core.engine] INFO: Closing spider (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
2021-11-02 15:28:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 20.031222,
'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
'finish_time': datetime.datetime(2021, 11, 2, 15, 28, 31, 938732),
'log_count/INFO': 21,
'log_count/WARNING': 1,
'memusage/max': 65544192,
'memusage/startup': 65331200,
'start_time': datetime.datetime(2021, 11, 2, 15, 28, 11, 907510)}
2021-11-02 15:28:31 [scrapy.core.engine] INFO: Spider closed (CorpAZ is closed; CorpAZ queue is empty; Spider is idling)
Why are there so many packages in requirements.txt? For example, I don't know what the cryptography==2.9.2 package is used for.
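Pinned entries like cryptography are usually transitive dependencies pulled in by the top-level packages rather than direct imports. A minimal stdlib sketch to find which installed distributions declare a given package as a requirement (the function name is mine, not part of this repo):

```python
import re
from importlib import metadata

def reverse_deps(target: str) -> list[str]:
    """List installed distributions that declare `target` as a requirement."""
    dependents = []
    for dist in metadata.distributions():
        for req in dist.requires or []:
            # Strip version specifiers / environment markers to get the bare name
            name = re.split(r"[ ;<>=!~\[(]", req, maxsplit=1)[0]
            if name.lower() == target.lower():
                dist_name = dist.metadata["Name"]
                if dist_name:
                    dependents.append(dist_name)
    return sorted(set(dependents))

print(reverse_deps("cryptography"))
```

Running this inside the project's virtualenv would show whether cryptography is needed by, say, the HTTP/TLS stack or can be dropped from requirements.txt.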
Tasks:
Vietstock has implemented __RequestVerificationToken to prevent cross-site request forgery attacks. Unless that issue is addressed, this application cannot download financials anymore.
More about __RequestVerificationToken:
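For background, ASP.NET's anti-forgery scheme embeds the token in a hidden form field (paired with a cookie), and the same value must be echoed back on POST requests. A minimal stdlib sketch of pulling the field out of a page body — the regex and the commented usage are my assumptions, not code from this repo:

```python
import re

# Hidden-input pattern for ASP.NET's anti-forgery field; assumes the
# name attribute appears before the value attribute in the markup.
TOKEN_RE = re.compile(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"')

def extract_token(html: str):
    """Return the anti-forgery token from a page body, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None

# Hypothetical usage with a requests.Session (URLs/fields are placeholders):
#   token = extract_token(session.get(page_url).text)  # GET also sets the cookie
#   session.post(api_url, data={"__RequestVerificationToken": token, ...})
```

The key detail is that the token and cookie come from the same session, so any fix would need the spider to fetch the page first and carry both into the API POSTs.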
There are more than 3,000 tickers in the file bizType_ind_tickers.csv. However, only 600-1,000 tickers are downloaded when I mass scrape (it takes 5 hours each time to scrape those tickers with my internet connection). It misses many blue chips whose information is certainly available on Vietstock, such as VIC and GAS.
I have tried several times over both WiFi and LAN connections. The number of tickers and JSON files varies from one run to another: one run produced 657 tickers, another 1,025. None of those runs came close to the total of 3,129 tickers in bizType_ind_tickers.csv.
The executions all stopped by themselves and stated clearly that they had finished (as shown in the terminal record below).
Following is a count summary of total tickers and downloaded tickers for one mass scrape:
| # | biztype_id | ind_id | ticker | ticker_download |
| --- | --- | --- | --- | --- |
| TOTAL | | | 3129 | 657 |
| 0 | 1 | 100 | 136 | 110 |
| 1 | 1 | 200 | 81 | 30 |
| 2 | 1 | 300 | 171 | 49 |
| 3 | 1 | 400 | 598 | 49 |
| 4 | 1 | 500 | 903 | 50 |
| 5 | 1 | 600 | 192 | 51 |
| 6 | 1 | 700 | 67 | 11 |
| 7 | 1 | 800 | 221 | 50 |
| 8 | 1 | 900 | 84 | 24 |
| 9 | 1 | 1000 | 22 | 0 |
| 10 | 1 | 1100 | 3 | 0 |
| 11 | 1 | 1200 | 66 | 12 |
| 12 | 1 | 1300 | 5 | 0 |
| 13 | 1 | 1400 | 66 | 7 |
| 14 | 1 | 1500 | 4 | 0 |
| 15 | 1 | 1600 | 5 | 0 |
| 16 | 1 | 1700 | 7 | 0 |
| 17 | 1 | 1800 | 43 | 0 |
| 18 | 1 | 1900 | 6 | 0 |
| 19 | 1 | 2000 | 4 | 0 |
| 20 | 2 | 1000 | 105 | 103 |
| 21 | 3 | 1000 | 75 | 60 |
| 22 | 4 | 900 | 1 | 0 |
| 23 | 4 | 1000 | 39 | 24 |
| 24 | 4 | 1600 | 1 | 0 |
| 25 | 5 | 1000 | 32 | 19 |
| 26 | 6 | 1000 | 2 | 1 |
| 27 | 6 | 2000 | 2 | 1 |
| 28 | 7 | 1000 | 7 | 6 |
| 29 | 8 | 1200 | 181 | 0 |
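To see exactly which tickers a run missed, the CSV can be diffed against the output directory. A stdlib sketch (the column name and the one-JSON-file-per-ticker naming scheme are my assumptions; adjust both to the repo's actual layout):

```python
import csv
from pathlib import Path

def missing_tickers(csv_path: str, json_dir: str, ticker_col: str = "ticker"):
    """Tickers present in the CSV but with no downloaded JSON file.

    Assumes each output file's name starts with the ticker symbol
    (e.g. 'VIC_q1.json'); change the stem parsing if the repo names
    files differently.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        wanted = {row[ticker_col].strip().upper() for row in csv.DictReader(f)}
    got = {p.stem.split("_")[0].upper() for p in Path(json_dir).glob("*.json")}
    return sorted(wanted - got)
```

Checking whether the missing set is stable across runs (vs. random, as the 657 / 1,025 counts suggest) would help tell a ban/throttling problem apart from a queueing bug.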
Following is the terminal record:
Scrape Finance Data - version 2
Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...
OK
rm: cannot remove './run/celery/*': No such file or directory
removed './run/scrapy/financeInfo.scrapy'
removed './logs/corporateAZExpress_log_verbose.log'
removed './logs/corporateAZOverview_log_verbose.log'
removed './logs/financeInfo_log_verbose.log'
Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
…
Scrapy is still running...
Scrapy has finished
Killing Celery workers, flushing Redis queues, deleting Celery run files...
OK
removed './run/celery/workercorpAZ.pid'
removed './run/celery/workerfinance.pid'
Exiting...
financeInfo's log file does not show download stats, etc.:
2021-09-24 20:21:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/status/403': 1,
'elapsed_time_seconds': 1402.909312,
'finish_reason': 'CorpAZ is closed; CorpAZ queue is empty; Spider is idling',
'finish_time': datetime.datetime(2021, 9, 25, 1, 21, 29, 850110),
'httpcompression/response_bytes': 162246573,
'httpcompression/response_count': 2135,
'log_count/INFO': 2910,
'memusage/max': 61169664,
'memusage/startup': 55721984,
'response_received_count': 2136,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/403': 1,
'scheduler/dequeued/redis': 2135,
'scheduler/enqueued/redis': 2135,
'start_time': datetime.datetime(2021, 9, 25, 0, 58, 6, 940798)}
Is there any way to download only data from HOSE or another exchange? For now it downloads everything, and that takes several days to finish.
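If the ticker list carried an exchange column, the mass scrape could be narrowed before any URLs are queued. A sketch, assuming (hypothetically) that bizType_ind_tickers.csv has 'ticker' and 'exchange' columns — the file's real header may differ:

```python
import csv

def tickers_for_exchange(csv_path: str, exchange: str = "HOSE") -> list[str]:
    """Keep only tickers listed on one exchange (column names are assumed)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["ticker"] for row in csv.DictReader(f)
                if row.get("exchange", "").upper() == exchange.upper()]
```

The filtered list could then be fed to whatever step seeds the financeInfo:start_urls Redis key, so only requests for that exchange are enqueued.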
When I select the option to scrape by business ID and industry ID, the execution goes into an endless loop. The downloaded JSON files are continuously replaced by new ones with exactly the same names.
I haven't checked whether the issue happens for mass scraping or the other options.