alvarobartt / investpy
Financial Data Extraction from Investing.com with Python
Home Page: https://investpy.readthedocs.io/
License: MIT License
Hi Alvaro, thanks again for a great tool!
Was wondering if you could include other asset classes in the queries, namely bonds and perhaps even commodities if you have the time.
Ola Alvaro! Greetings from Brazil!
First of all, congratulations on investpy. I really appreciate it and hope to contribute as soon as I feel skilled enough, since I'm pretty new to programming and Python.
Could you check on this:
I'm having trouble getting Brazilian funds data. Nothing is retrieved by name or ISIN number. I called equities = investpy.get_equities_list() but could not find any funds.
Since retrieval functions for equities/stocks, funds and ETFs have already been developed, the addition of currency retrieval functions is proposed, as currencies are one of the main financial products offered by Investing.com.
The same functions already developed for equities would be adapted to currency data retrieval, in order to improve investpy's usage and coverage of financial markets data.
This should lead to the development of the following functions:
import investpy
investpy.get_currency_crosses()
investpy.get_currency_crosses_list()
investpy.get_currency_crosses_dict()
investpy.get_available_currencies()
investpy.get_currency_cross_recent_data()
investpy.get_currency_cross_historical_data()
investpy.get_currency_crosses_overview()
Since country names were standardized in issue #48, retrieving a listing of them returns their standard names instead of the names used by Investing.com. This leads to an error whenever a country passes the validity check but the URL generated from its name does not exist.
For example, to retrieve an overview of the main ETFs as follows:
investpy.get_etfs_overview(country='united states')
that function will check whether the introduced country name has ETFs or not, and in this case it will match. But since the name USA has been standardized to United States, the URL generated to retrieve the overview from is https://www.investing.com/etfs/united-states-etfs, which throws a 404 error even though the introduced country name was reported as "valid". So wherever a country name was standardized, it also needs to be de-standardized when retrieving data.
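A minimal sketch of the de-standardization step described above. The mapping and the helper name are hypothetical and not part of investpy; only the united states / usa pair is taken from the issue, and the resulting URL pattern is assumed from the failing URL quoted above.

```python
# Hypothetical mapping: standardized name -> name used in Investing.com URLs.
# Only the 'united states' / 'usa' pair comes from the issue text.
STANDARD_TO_INVESTING = {
    "united states": "usa",
}


def etfs_overview_url(country):
    """Build the ETF overview URL from a standardized country name."""
    standard = country.strip().lower()
    # Fall back to the standardized name when no override is defined.
    investing_name = STANDARD_TO_INVESTING.get(standard, standard)
    slug = investing_name.replace(" ", "-")
    return "https://www.investing.com/etfs/" + slug + "-etfs"


print(etfs_overview_url("united states"))
```

The validity check can keep using the standardized name; only the URL generation goes through the mapping.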
In addition to the existing functionalities, could it be possible to include the main indices retrieval (Ex. Dow 30).
Thank you for your help!
To extend the usage of the package, the retrieval of all the equities indexed in Investing.com is proposed. Currently investpy only supports Spanish equities, which limits its usage since there are a lot of markets available.
📄 Some tasks can be defined in order to develop this enhancement and integrate it with the package:
T01 - Research on which id fields are sent on the GET request to Investing.com
T02 - Retrieve a listing of all the available countries which have equities listed in Investing.com and define a function to wrap it.
T03 - Extend investpy.equities.retrieve_equities functionality and test its usage.
T04 - Check if there is any missing data in equities.csv due to connection errors, since the scraping process took too long and made a huge amount of consecutive requests to Investing.com. (Tip: define a function to wrap that check because it can happen again if equities.csv is re-generated.)
T05 - Once equities.csv is generated, add a country param to the equity retrieval functions, since this new filter is needed to avoid duplicated equity names in different countries. (Note: this task is pretty similar to the one done for ETFs when country filtering was added.)
T06 - Generate the docs and update investpy usage.
Additional tasks will be added to the previous listing if needed in order to complete this issue.
Some stocks share the same name, but they will never share the same symbol since it is a unique identifier for each of them. So instead of using the name as the input parameter for stock data retrieval functions, replacing it with the symbol is proposed.
As can be seen in the example below, in Finland there are 2 stocks with the same name (and the same happens in the rest of the countries, since they all have duplicated names); in this case, the only difference the user can know is the symbol.
{'country': 'finland', 'name': 'Kesko', 'full_name': 'Kesko Oyj', 'tag': 'kesko', 'isin': 'FI0009000202', 'id': '575', 'class': 'Helsinki Retail', 'currency': 'EUR', 'symbol': 'KESKOB'},
{'country': 'finland', 'name': 'Kesko', 'full_name': 'Kesko Oyj', 'tag': 'kesko-a', 'isin': 'FI0009007900', 'id': '26416', 'class': 'Helsinki Retail', 'currency': 'EUR', 'symbol': 'KESKOA'},
So, currently just for stocks, the stock data should be searched via symbol instead of via name. Note that the country param is mandatory anyway.
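To illustrate why the symbol is the safer key, here is a small sketch using the two Kesko records quoted above (trimmed to a few fields). The find_by_symbol helper is hypothetical and only shows the lookup idea, not how investpy implements it.

```python
# The two records are copied from the issue; unused fields are omitted.
STOCKS = [
    {"country": "finland", "name": "Kesko", "isin": "FI0009000202", "symbol": "KESKOB"},
    {"country": "finland", "name": "Kesko", "isin": "FI0009007900", "symbol": "KESKOA"},
]


def find_by_symbol(stocks, symbol, country):
    """Return the single record matching symbol and country (hypothetical helper)."""
    wanted_symbol = symbol.strip().lower()
    wanted_country = country.strip().lower()
    matches = [
        s for s in stocks
        if s["symbol"].lower() == wanted_symbol and s["country"] == wanted_country
    ]
    if len(matches) != 1:
        raise ValueError("expected exactly one match, found %d" % len(matches))
    return matches[0]


# Searching by name is ambiguous: both records match.
by_name = [s for s in STOCKS if s["name"] == "Kesko"]
print(len(by_name))

# Searching by symbol and country identifies one record.
print(find_by_symbol(STOCKS, "KESKOA", "finland")["isin"])
```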
Which function can I use to find the VIX index from CBOE? Also, which country does the VIX belong to? Should I use USA, america or united states?
Since only the equity/stock features of investpy are properly documented and explained in the docs, at https://investpy.readthedocs.io/equities.html, the addition of the funds documentation is proposed.
The funds documentation will include information and usage, to make investpy easy to use for newcomers, as the usage samples found in the README.md file are not enough.
The current date format for the retrieved information is dd/mm/yyyy, which stands for day/month/year.
Since that format is the one used in Spain, more date formats should be supported, such as mm/dd/yy, m/d/yy, etc. The separator can also be either - or /, or no separator at all.
This argument will be optional, since the default date format will remain dd/mm/yyyy, as currently established (until the 0.8.7 release).
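One way the optional format argument could work is sketched below with the standard library's datetime.strptime. The parse_date helper and its date_format parameter are hypothetical names for illustration; the issue does not specify how formats would be passed.

```python
from datetime import datetime

DEFAULT_FORMAT = "%d/%m/%Y"  # dd/mm/yyyy, the current default


def parse_date(value, date_format=DEFAULT_FORMAT):
    """Parse a date string; date_format is a hypothetical optional argument."""
    return datetime.strptime(value, date_format)


print(parse_date("13/08/2019"))             # default dd/mm/yyyy
print(parse_date("08-13-19", "%m-%d-%y"))   # mm-dd-yy with '-' separator
print(parse_date("8/13/19", "%m/%d/%y"))    # m/d/yy, no zero padding
print(parse_date("20190813", "%Y%m%d"))     # no separator
```

Keeping dd/mm/yyyy as the default value means existing calls keep working unchanged.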
Currently the Equity, Fund and ETF historical data retrieval functions return a JSON object whenever the as_json param is True, which looks like:
obj = {
'name': 'bbva',
'historical': [ ... ]
}
So instead of returning the value introduced via the equity, fund or etf param, the original value as indexed by investpy should be returned.
I believe get_historical_data() has a bug.
I did a test comparing the same equity (Exxon Mobil) using get_recent_data() and get_historical_data(), both for the same time frames, and got different values.
Then I tested another equity (DOW) with get_recent_data() and was getting the exact same values as for the get_recent_data("Exxon Mobil").
When running investpy.bonds.get_bonds("hong kong") an empty dataframe with the columns country, name, full_name is returned. This also happens for the following countries:
['czech republic', 'hong kong', 'new zealand', 'south africa', 'south korea', 'sri lanka']
I noticed that all of them have a space in the name, is this a problem with the package?
As the information is directly retrieved from a CSV file, a filtering system is proposed in order to retrieve just the selected columns from the get_equities(), get_funds() and get_etfs() functions in __init__.py
e.g. if you want to retrieve just the Spanish stocks: get_equities(country='spain')
As Investing offers a listing of world funds, adding its retrieval to investpy is proposed. The retrieval process will be pretty similar to the one already done for ETFs. Developing this function to fill the funds file implies adding a country column to the CSV file and, therefore, a country parameter to the fund retrieval functions.
This should be added in the 0.9 release 🚀 along with its documentation, tests and usage samples.
Add an optional parameter to the equity, fund and etf retrieval functions to allow the user to store the retrieved content directly in a database instead of in a pandas.DataFrame.
As the most used databases are MongoDB and MySQL, the tests are going to be developed on pymongo and pymysql, respectively.
Even though the information used by investpy is retrieved only from Investing.com, the webpage uses different names or notations for the same country, such as usa or united states. So defining a standard where every country is always named the same way is proposed.
This improvement involves determining which country names are going to be used and then renaming them wherever they are used.
It has been spotted that when installing investpy, the data contained in resources/ is not placed inside the Python package directory; it is placed in the root directory instead.
The described issue is presented in the following image:
So a way to include that data inside the Python package directory (.../site-packages/investpy/...) should be developed. The package works properly anyway, but this should be fixed since it is not the pythonic way of doing it.
In order to reduce param length on data retrieval functions, the asc and desc values should be allowed for the order param. This is just a minor fix that accepts multiple values for the parameter in order to reduce error rates.
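A possible shape for this fix is an alias table that maps every accepted spelling to one canonical value. The normalize_order helper is hypothetical, and the long forms 'ascending' and 'descending' are assumed to be the existing accepted values.

```python
# Every accepted spelling maps to one canonical value (assumed long forms).
ORDER_ALIASES = {
    "ascending": "ascending",
    "asc": "ascending",
    "descending": "descending",
    "desc": "descending",
}


def normalize_order(order):
    """Return the canonical order value or raise ValueError (hypothetical helper)."""
    key = order.strip().lower()
    if key not in ORDER_ALIASES:
        raise ValueError("order must be one of: " + ", ".join(sorted(ORDER_ALIASES)))
    return ORDER_ALIASES[key]


print(normalize_order("asc"))
print(normalize_order("Descending"))
```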
Both investpy.get_equities_list(country) and investpy.get_funds_list(country) are expected to return a listing containing all the names of the available equities/funds as indexed in Investing.com, which will later be used by the data retrieval functions. The country param can either be None or a country name, to retrieve all the available equities/funds from the whole world or from the specified country, respectively.
Instead of returning a list, those functions return a pandas.DataFrame as presented below, which needs to be fixed.
Add tests for ETFs, since they were removed in the last release of world ETFs, in order to bring code coverage back up to 90% as it was in the previous release.
debug argument on functions to test, as the purpose of that argument is to speed up tests made via pytest.
debug argument for printing debug messages in case the user wants to check what is happening inside the function's API call to Investing.
Implementing the debug behaviour, which shows or hides print output when the param is True or False respectively, will be done as described on Stack Overflow by @brigand (credits to him):
As realised in a previous issue (#42), some information is missing in Investing.com listings, which leads to problems when it comes to data retrieval since existing financial products appear as missing or not found when they are not.
This is a relevant and hard issue to solve since only the main financial products seem to be listed in investpy, which implies researching how Investing.com indexes data.
This can be easily checked when applying filters to every financial product retrieval, e.g. some results are not shown when retrieving data from all the available global indices, which is shown if not all the filters are applied. As realised in #42, the index with symbol SX86P does not appear listed in Investing.com when all the filters are applied, but it does appear whenever just the primarySectors filter is applied.
Note that a previous issue was reported (#31), where some funds were not listed in investpy but did appear in the Investing.com search engine. That issue was labeled as wont fix, but now it needs to be fixed since a lot of information appears to be missing due to Investing.com information indexing errors.
Add a function to retrieve stock dividends, since they are one of the key factors when it comes to creating stock portfolios. As investpy_portfolio has been created as an investpy module, the addition of dividends is requested.
Dividends can be found on the webpage of every stock, such as the one referenced below, whose URL is basically the URL of the stock plus -dividends. The table of dividends can be scraped by sending the same request that Investing.com sends, in order to get the pieces of the table and build a dividends pandas.DataFrame.
Investing.com Stock Dividends Reference: https://www.investing.com/equities/bp-prudhoe-bay-royalty-trust-dividends
Since the historical data values retrieved from Investing.com are specified in different currencies depending on the equity's country, the addition of a column containing the currency to the pandas.DataFrame is proposed.
As the currency is already stored in the .csv files, it does not need to be retrieved; it just needs to be added to the pandas.DataFrame and, therefore, to the investpy.Data model.
import investpy
df = investpy.get_recent_data(equity='bbva',
country='spain')
A sample resulting pandas.DataFrame from the previous piece of code will look like:
Open High Low Close Volume Currency
Date
2019-08-13 4.263 4.395 4.230 4.353 27250000 EUR
2019-08-14 4.322 4.325 4.215 4.244 36890000 EUR
2019-08-15 4.281 4.298 4.187 4.234 21340000 EUR
2019-08-16 4.234 4.375 4.208 4.365 46080000 EUR
2019-08-19 4.396 4.425 4.269 4.269 18950000 EUR
The crawling process needs to be made more efficient and scalable, for the package's future release as a PyPI package.
👍 The code needs to be clear, easily INTEGRABLE and IMPLEMENTABLE! 👍
Note: create a test branch in order to test it on every computer and every available Python version (2.7, 3.6 and 3.7).
In addition to the existing functionalities, could it be possible to include the raw materials retrieval (Ex. Brent oil).
Thank you for your help!
Since fund information values can sometimes be None, Investing.com sets them as "N/A", a str that means Not Applicable. This means that the value is missing or does not exist, so casting that str to an int raises an error, as happens in the piece of code shown below:
The highlighted text is the error raised. In order to fix this, one-line if-else statements are proposed, so that if the value is not an int, there will be no type cast.
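The one-line if-else guard could look like the sketch below. The helper name is hypothetical, and returning None for missing values is an assumption; the issue only states that the cast should be skipped.

```python
def to_int_or_none(value):
    """Cast to int unless the value is missing or Investing.com's 'N/A' marker.

    Returning None for missing values is an assumption made for this sketch.
    """
    return int(value) if value is not None and value != "N/A" else None


print(to_int_or_none("42"))
print(to_int_or_none("N/A"))
print(to_int_or_none(None))
```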
Investing provides daily, weekly and monthly timeframe data. I believe that currently the daily timeframe is hardcoded. For example, in the get_recent_data function in the __init__.py file, the parameters are set as follows:
params = { "curr_id": id_, "smlID": str(randint(1000000, 99999999)), "header": header, "interval_sec": "Daily", "sort_col": "date", "sort_ord": "DESC", "action": "historical_data" }
We could possibly make the interval_sec an optional parameter, where Daily is the default.
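A sketch of how the interval could become an optional parameter with Daily as the default. The build_params helper is hypothetical, and the "Weekly" and "Monthly" strings are assumed to be the values the endpoint accepts; only "Daily" appears in the code quoted above.

```python
from random import randint

# "Daily" comes from the quoted params; the other two values are assumptions.
VALID_INTERVALS = ("Daily", "Weekly", "Monthly")


def build_params(id_, header, interval="Daily"):
    """Build the request params with a configurable interval (hypothetical helper)."""
    if interval not in VALID_INTERVALS:
        raise ValueError("interval must be one of: " + ", ".join(VALID_INTERVALS))
    return {
        "curr_id": id_,
        "smlID": str(randint(1000000, 99999999)),
        "header": header,
        "interval_sec": interval,
        "sort_col": "date",
        "sort_ord": "DESC",
        "action": "historical_data",
    }


params = build_params("446", "example header", interval="Weekly")
print(params["interval_sec"])
```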
.travis.yml needs to be modified and updated in order to run coverage.py on all test files, so that the code coverage is fully updated for error functions.
More details can be found at: https://stackoverflow.com/questions/56337918/run-coverage-on-tests-directory-via-travis-ci
The time comparison between equity and etf historical data retrieval showed that the elapsed time for etf data retrieval was much smaller than the one for equity retrieval.
This was because the etf retrieval process sent a request to the Investing inner API, https://es.investing.com/instruments/HistoricalDataAjax, using the etf symbol already stored in the etfs.csv file, which together with the etf id formed the parameters of the request.
The params of the request look like:
params = {
"curr_id": id_,
"smlID": str(randint(1000000, 99999999)),
"header": header,
"interval_sec": "Daily",
"sort_col": "date",
"sort_ord": "DESC",
"action": "historical_data"
}
Where the header value is a str formed by "Datos históricos " + symbol.
This led to the conclusion that storing the symbol of every equity and fund was required in order to send the request to HistoricalDataAjax directly, instead of scraping the equity or fund webpage each time to retrieve the symbol and then sending the request to HistoricalDataAjax.
Data retrieval time comparison is presented in the following image:
In the previous image, it can be seen that the time elapsed for equity historical data retrieval is twice the time elapsed for etf historical data retrieval. So the fix will make equity historical data retrieval take half the time (a 2x speed improvement).
This is needed to fit the requirements of the Final Degree Projects named "Recommender system of banking products" and "Robo-Advisor Application" from the University of Salamanca (USAL).
So in the next few days I will be developing an ETF historical data scraper to retrieve the data that some students are going to use in their projects, as their main Data Extraction Tool.
As explained in the trendet package on commit alvarobartt/trendet@ef4e5fa, the requirements were not properly listed in the setup.py file.
The error is due to a missing comma (,) between the packages in the requirements; as a result, installing the investpy 0.8.9 release is not going to pre-install all the dependencies, which will lead to errors when using it.
The following change is proposed in setup.py:
...
install_requires=['Unidecode>=1.1.1',
'pandas>=0.25.1',
'lxml>=4.4.1',
'setuptools>=41.2.0',
'requests>=2.22.0'],
...
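The reason a missing comma goes unnoticed is that Python silently concatenates adjacent string literals. The snippet below shows the effect; which comma was actually missing is not stated in the issue, so the position used here is only an assumption for illustration.

```python
# Adjacent string literals are joined into one string by Python,
# so a missing comma merges two requirements without any syntax error.
broken = ['Unidecode>=1.1.1'   # <- missing comma (position assumed)
          'pandas>=0.25.1',
          'lxml>=4.4.1']

fixed = ['Unidecode>=1.1.1',
         'pandas>=0.25.1',
         'lxml>=4.4.1']

print(len(broken), broken[0])
print(len(fixed), fixed[0])
```

The merged entry is not a valid requirement for either package, which is why the dependencies are not installed.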
Since the company profile in Spanish retrieved from Bolsa de Madrid, as described in https://investpy.readthedocs.io/main_api.html#investpy.get_equity_company_profile, has escaped characters remaining from the HTML retrieval, they need to be removed in order to keep just the plain text.
The described issue can be observed here:
The highlighted parts of the company profile description in Spanish should be removed with a regular expression (re) before returning it to the user.
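A rough sketch of such a cleanup with the standard library follows. Since the screenshot is not available, which characters were actually left over is unknown; this assumes HTML entities, leftover tags and whitespace control characters. The clean_profile name and the sample text are made up for illustration.

```python
import html
import re


def clean_profile(text):
    """Strip HTML leftovers from a profile string (hypothetical helper)."""
    text = html.unescape(text)              # decode entities such as &nbsp;
    text = re.sub(r"<[^>]+>", " ", text)    # drop any leftover tags
    return re.sub(r"\s+", " ", text).strip()  # collapse \r, \n, \t and spaces


sample = "Texto&nbsp;de <b>ejemplo</b>\r\n"  # made-up sample, not real profile text
print(clean_profile(sample))
```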
Since having to retrieve the pandas.DataFrame of a financial product in order to get data from it is a technical step that some investpy users may not know how to perform, the addition of search functions by field is proposed.
An example of this issue arises whenever a user wants to search for the equity 'BBVA' but only knows or has the ISIN code of that equity, for which the current solution looks like:
import investpy
df = investpy.get_equities()
df.head()
index | country | name | full_name | tag | isin | id | currency | symbol |
---|---|---|---|---|---|---|---|---|
0 | argentina | Tenaris | Tenaris | tenaris?cid=13302 | LU0156801721 | 13302 | ARS | TS |
1 | argentina | PETROBRAS ON | Petroleo Brasileiro - Petrobras | petrobras-on?cid=13303 | BRPETRACNOR9 | 13303 | ARS | APBR |
2 | argentina | GP Fin Galicia | Grupo Financiero Galicia B | gp-fin-galicia | ARP495251018 | 13304 | ARS | GGAL |
3 | argentina | Ternium Argentina | Ternium Argentina Sociedad Anónima | siderar | ARSIDE010029 | 13305 | ARS | TXAR |
4 | argentina | Pampa Energía | Pampa Energía S.A. | pampa-energia | ARP432631215 | 13306 | ARS | PAMP |
isin = 'ES0113211835'
name = df.loc[(df['isin'] == isin).idxmax(), 'name']
country = df.loc[(df['isin'] == isin).idxmax(), 'country']
data = investpy.get_recent_data(equity=name,
country=country)
So the previous block of code retrieves both the name and the country which match the known ISIN code, in order to use them to retrieve the pandas.DataFrame of historical data.
Additionally, the proposal is to develop a function that gets all the information from a known equity field in order to use it for data retrieval. The available fields to search by are: name, full_name or isin.
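The proposed search function could be sketched as below. It works on plain dictionaries rather than a pandas.DataFrame to stay self-contained; search_equities and its by/value parameters are hypothetical names, and the two records are taken from the Spanish equities listing shown elsewhere on this page.

```python
# Two records from the Spanish equities listing, trimmed to the searchable fields.
EQUITIES = [
    {"country": "spain", "name": "ACS",
     "full_name": "Actividades de Construcción y Servicios S.A.",
     "isin": "ES0167050915"},
    {"country": "spain", "name": "BBVA",
     "full_name": "Banco Bilbao Vizcaya Argentaria S.A.",
     "isin": "ES0113211835"},
]

SEARCHABLE_FIELDS = ("name", "full_name", "isin")


def search_equities(records, by, value):
    """Return all records whose `by` field contains `value` (hypothetical helper)."""
    if by not in SEARCHABLE_FIELDS:
        raise ValueError("by must be one of: " + ", ".join(SEARCHABLE_FIELDS))
    needle = value.strip().lower()
    return [r for r in records if needle in r[by].lower()]


match = search_equities(EQUITIES, by="isin", value="ES0113211835")[0]
print(match["name"], match["country"])
```

The returned name and country can then be passed to the data retrieval functions, replacing the manual DataFrame lookup.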
Hi Alvaro,
I am able to get historical data on one stock per each request. But I can not get historical data for multiple stocks in a single request.
Here below the code I used:
tickers = "aapl, goog"
df = investpy.get_stock_historical_data(stock=tickers,
country='united states',
from_date='01/01/2010',
to_date='01/01/2019')
Here below the error:
Traceback (most recent call last):
File "C:\Users\tommaso\miniconda3\envs\3.7\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 4, in
to_date='01/01/2019')
File "C:\Users\tommaso\miniconda3\envs\3.7\lib\site-packages\investpy\__init__.py", line 522, in get_stock_historical_data
raise RuntimeError("ERR#0018: stock " + stock + " not found, check if it is correct.")
RuntimeError: ERR#0018: stock aapl, goog not found, check if it is correct.
Regards
t
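The traceback shows the whole string "aapl, goog" being treated as one stock name, so a workaround is to make one request per ticker. The sketch below takes the fetching function as an argument so it runs without network access; fetch_many and fake_fetch are hypothetical, and in real use the callable would wrap a call such as investpy.get_stock_historical_data for a single stock.

```python
def fetch_many(tickers, fetch_one):
    """Call fetch_one once per ticker; collect results and errors by ticker."""
    results, errors = {}, {}
    for ticker in (t.strip() for t in tickers.split(",")):
        if not ticker:
            continue
        try:
            results[ticker] = fetch_one(ticker)
        except RuntimeError as exc:
            # Keep going so one unknown ticker does not lose the others.
            errors[ticker] = str(exc)
    return results, errors


def fake_fetch(ticker):
    """Stand-in for a single-stock retrieval call (no network access)."""
    if ticker not in {"aapl", "goog"}:
        raise RuntimeError("ERR#0018: stock " + ticker + " not found, check if it is correct.")
    return "data for " + ticker


results, errors = fetch_many("aapl, goog, xxxx", fake_fetch)
print(sorted(results))
print(errors)
```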
Since explicit for loops over DataFrame rows are slow in Python, instead of using loops to check and access the name of an equity, fund or etf when it comes to data retrieval, access the dataframe values via:
df.loc[df['name'] == introduced_name]
Further details can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html for pandas 0.25.1 release.
Improve the usage of the logging interface whenever the debug parameter is True, to let the user know what is happening inside the package and to be able to detect any kind of problem in it.
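One way to tie a debug flag to the standard logging module is sketched below. The logger name and both functions are hypothetical and only illustrate the idea of switching the log level from the parameter.

```python
import logging

logger = logging.getLogger("investpy_sketch")  # hypothetical logger name


def configure_debug(debug):
    """Show debug messages only when the debug parameter is True."""
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)


def retrieve(name, debug=False):
    """Stand-in retrieval function that logs what it is doing."""
    configure_debug(debug)
    logger.debug("retrieving data for %s", name)
    return name.lower()


logging.basicConfig(format="%(levelname)s %(message)s")
retrieve("BBVA", debug=True)   # emits a DEBUG line
retrieve("BBVA", debug=False)  # stays silent
```

Compared with print statements, this leaves the choice of output destination and format to the user's own logging configuration.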
To clarify investpy usage, some arguments need to be renamed as they are not clear enough: start and end ❌ should be renamed to from and until ✔️, respectively. As they specify dates, the proper way to refer to them is via from_date and until_date, which set a time window for data retrieval.
Hi Alvaro,
Would it be possible for you to incorporate an additional functionality into the existing code? I'm interested in scraping data from the Stoxx 600 and Dow Jones sectors and subsectors since these are the ones I use in my trading and backtests.
Tickers of the Stoxx 600 subsectors are as follows:
['SXEP','SX4P','SXPP','SXOP','SXNP','SXAP','SX3P','SXQP','SXDP','SXRP','SXMP','SXTP','SKXP','SX6P','SX7P','SXIP','SX86P','SXFP','SX8P']
Tickers of some of the Dow Jones sectors/supersectors are:
['DJUSEN','DJUSCH','DJUSBS','DJUSCN','DJUSIG','DJUSAP','DJUSFB','DJUSNG','DJUSHC','DJUSRT','DJUSME','DJUSCG','DJUSTL','DJUSUT','DJUSBK','DJUSIR','DJUSRE','DJUSFI','DJUSTC']
Thanks in advance,
Gerardo
Since PR #41 has been merged into the master branch, before launching a new release the committed changes need to be properly checked to determine whether the current version follows investpy standards and whether it fits the current needs or requirements.
Also, both tests/ and docs/ need to be properly checked and updated.
As already done for equities as described in issue #28, the same proposal is made for both funds and etfs, since the currency in which their historical values are displayed is very relevant when it comes to analysing historical financial data.
So the currency of each fund and etf needs to be retrieved and added to the information retrieval functions, which are investpy.funds.retrieve_info(tag) and investpy.etfs.retrieve_info(tag); these functions create a dictionary containing the retrieved information values.
Also, the currency value of each fund and etf needs to be included in the resulting pandas.DataFrame or json object when retrieving historical data from any fund or etf.
Running the function get_index_historical_data reports a missing CSV file containing the database of the indices, as follows: index_countries.csv
Hi Alvaro,
With the latest update, the function get_index_historical_data() works perfectly with sectors and subsectors located almost everywhere, but for some reason it has problems retrieving data from the following "countries":
Any clue what the reason could be? See the function and error below, and also a screenshot.
index = investpy.get_index_historical_data(index='STOXX Europe 600 Utilities',country='euro zone',from_date='01/01/1934',to_date='01/10/2019')
RuntimeError: ERR#0034: country euro zone not found, check if it is correct.
Cheers,
Gerardo
Replace the code documentation docstrings with Google docstrings as described in https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
This will help the further generation of Sphinx documentation (expected in the 0.9 release).
When retrieving the equities, funds and etfs pandas.DataFrame, note that it can be filtered by country, and that the index is not properly updated, since it keeps the same indexing as in the original pandas.DataFrame. This should be fixed, as the index needs to be reset; an example of the described issue is presented below:
For example: when retrieving a pandas.DataFrame of the Spanish equities, it keeps the same indexing as it originally had instead of resetting it (which should be done via the .reset_index() pandas function)
df = investpy.get_equities(country='spain')
df.head(5)
index | country | name | full_name | tag | isin | id | currency | symbol |
---|---|---|---|---|---|---|---|---|
9289 | spain | ACS | Actividades de Construcción y Servicios S.A. | acs-cons-y-serv | ES0167050915 | 442 | EUR | ACS |
9290 | spain | Abengoa | Abengoa S.A. | abengoa | ES0105200416 | 443 | EUR | ABG |
9291 | spain | Atresmedia | Atresmedia Corp. de Medios de Com. S.A. | atresmedia | ES0109427734 | 444 | EUR | A3M |
9292 | spain | Acerinox | Acerinox S.A. | acerinox | ES0132105018 | 445 | EUR | ACX |
9293 | spain | BBVA | Banco Bilbao Vizcaya Argentaria S.A. | bbva | ES0113211835 | 446 | EUR | BBVA |
❗️ The index should run from 0 to len(df) - 1 instead of starting from 9289 in this case, since that was the original index from the initial equities pandas.DataFrame before the country filtering.
Since the scraping process that retrieves the list of all the ETFs from every country and their overview takes too much time, the function investpy.get_etfs_overview() needs to be improved so that just the overview of the main ETFs from every country is displayed.
The intention of this function is just to get an overview of the main ETFs, so the information from all ETFs is not needed (as described in #50, Investing.com has an indexing error which does not display more than 1,000 results per page, so the overview of all ETFs only covers the first 1,000 ETFs displayed in Investing.com). Note that this is just an additional functionality included in investpy.
The following yields a dataframe with only 137 days of data:
df = investpy.get_etf_historical_data(etf='Vanguard FTSE Emerging Markets UCITS USD Inc', from_date='01/01/2000', to_date='12/08/2019', as_json=False, order='ascending', debug=False)
However, there should be multiple years of data (2013-present):
https://www.investing.com/etfs/vanguard-ftse---emerging-markets-historical-data?cid=962110
Add an argument to the data retrieval functions in order to improve the travis-ci tests, so that the file does not have to be generated and the lines that write the file are simply not executed.
So the debug_mode argument needs to be added to equities.retrieve_equities(), funds.retrieve_funds() and etfs.retrieve_etfs().