dinghino / stocks-historical-data

Fetcher & Parser for stocks historical data

License: GNU General Public License v3.0

Python 100.00%

stocks-historical-data's Introduction

🐒 stonks-o-fetcher

A simple modular tool to fetch and parse data related to the stock market.

Getting started

For the moment the only source is this repository, so to get the program you have to clone it locally.

Requirements

Python >3.6

The program is tested only on Linux (WSL 1 and Debian). It should also work on Windows, but that is untested.

Installation

After cloning the repository, enter the project root and run:

pip install .

If your default pip does not belong to Python 3, use:

python3 -m pip install .

This will make the program available on your system with the command

stonks-cli

First steps

On first startup you'll have to set up your settings, especially the output path.

There is some validation for fields, so if something is missing you'll see it.

You can use the default ~ to point the path to your home folder, so you can set the path to something like ~/stonks/ or whatever you like.

If you define a filename in your path, meaning that it ends with either .csv or .txt, that filename is used to write the data; otherwise the filename is generated automatically from the settings.
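The rule above can be sketched in a few lines. This is only an illustration of the described behaviour, not the project's actual code: `resolve_output` and `generated_name` are hypothetical names.

```python
from pathlib import Path

# Suffixes that mark the configured path as an explicit file name.
FILE_SUFFIXES = {".csv", ".txt"}


def resolve_output(path_setting: str, generated_name: str) -> Path:
    """Return the file to write to (hypothetical helper).

    If the configured path ends in .csv or .txt it is used as-is,
    otherwise the automatically generated name is appended to it.
    """
    path = Path(path_setting).expanduser()  # expands a leading ~
    if path.suffix.lower() in FILE_SUFFIXES:
        return path
    return path / generated_name


print(resolve_output("/tmp/stonks/", "finra_2021.csv"))
print(resolve_output("/tmp/stonks/custom.txt", "finra_2021.csv"))
```

With a directory path the generated name is used; with an explicit file name the same file is targeted on every run, which is why it gets overwritten.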

⚠️ File checking

There is no check for existing files yet, and that's on purpose, so if you specify a custom file name it will be overwritten at every execution.

It is recommended not to specify a filename and to let the program do its thing. It is also strongly suggested to change the path to something familiar.

Controls

The important things are explained in the program itself, and are mostly out of my control due to dependencies, but:

Menu navigation: arrow keys and vim bindings
Confirm a value: Enter
Multiselect (when available): Space - Enter will also add the currently highlighted entry
Exit from an input field with no default: type a space, then Enter
With default values you can press Enter to confirm them.

⚠️ Saving your settings

Exiting the application with ESC will NOT save your settings. You have to use the main menu option to do so.

CLI and automation

As of version 0.6.0 the only way to use this program is through the interactive CLI menus. I'm planning to add the option to launch it with arguments, so that all the required parameters can be passed on the command line and different settings files can be used to automate runs with multiple configurations.

Contributing

Proper documentation will come later, but here's the gist if you want to contribute new features.

Components

The project is meant to be easily expandable and flexible. There are two main types of components to consider:

Source components
- Fetchers
- Parsers
Writer components

The names are pretty self-explanatory.

The whole system is already set up to be almost completely automated. Each source handler lives in its own module (folder). The module, through its __init__.py, has to export some values:

Fetcher - your fetcher class, inheriting from FetcherBase
Parser - your parser class, inheriting from ParserBase
source - string. Unique value identifying the source handled; it can be anything
friendly_name - string. The text that appears in the CLI
description - string. A brief description of the source. Appears in the CLI.

Writers are similar, but instead of Fetcher and Parser and source they must have:

Writer - your writer class, inheriting from WriterBase
output_type - string. Unique identifier for the class.

The rest of the attributes remain the same.

ℹ️ You can look at the existing modules inside stonks/components to better understand

There's a manager component that is already set up to import all the valid modules from the components/handlers and components/writers folders, so when your module is ready it should work. Loading is done in the cli module, so that the app is actually empty by itself.

For a module to be valid it has to have the required classes and at least the source/output_type
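As a rough illustration of that validity rule, the check below looks for the required exports on a module-like object. It is a sketch under assumptions, not the manager's real code: `missing_exports`, the two tuples and the fake classes are hypothetical, and `SimpleNamespace` stands in for an imported module.

```python
from types import SimpleNamespace

# Exports named in the text above; the optional ones are not checked here.
SOURCE_REQUIRED = ("Fetcher", "Parser", "source")
WRITER_REQUIRED = ("Writer", "output_type")


def missing_exports(module, required):
    """Return the required names that the module does not export."""
    return [name for name in required if getattr(module, name, None) is None]


class FakeFetcher:  # stand-in for a FetcherBase subclass
    pass


class FakeParser:  # stand-in for a ParserBase subclass
    pass


good = SimpleNamespace(
    Fetcher=FakeFetcher,
    Parser=FakeParser,
    source="my_source",
    friendly_name="My source",
    description="Example source",
)
bad = SimpleNamespace(Fetcher=FakeFetcher, source="my_source")

print(missing_exports(good, SOURCE_REQUIRED))  # nothing missing
print(missing_exports(bad, SOURCE_REQUIRED))   # the Parser export is missing
```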

Custom formatting

If you take a look at the descriptions of the existing components you'll notice some strange formatting.

The CLI has a custom formatter - because I like colored crayons - to ease highlighting important words. Instead of the standard string.format that replaces the values, here we wrap the words in {} to specify formatting.

#  {word:color}
#  {word:style}
#  {word:color|style}
text = 'This {word:blue} is blue!'
# > this word is blue! - with `word` in blue.

Formatting is done through termcolor, so valid values are the ones in their documentation.
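To make the {word:color|style} syntax concrete, here is a minimal, self-contained sketch of such a formatter. It is not the project's implementation: it uses a regular expression and raw ANSI escape codes instead of termcolor so that it runs without dependencies, and `render`, `COLORS` and `STYLES` are hypothetical names covering only a few values.

```python
import re

# A small subset of ANSI codes; termcolor supports more colors and attributes.
COLORS = {"red": 31, "green": 32, "yellow": 33, "blue": 34,
          "magenta": 35, "cyan": 36, "white": 37}
STYLES = {"bold": 1, "underline": 4}

# Matches {word:spec}, where spec is "color", "style" or "color|style".
BLOCK = re.compile(r"\{([^{}:]+):([^{}]+)\}")


def render(text: str) -> str:
    """Replace every {word:spec} block with the ANSI-formatted word."""

    def replace(match):
        word, spec = match.group(1), match.group(2)
        codes = []
        for token in spec.split("|"):
            code = COLORS.get(token) or STYLES.get(token)
            if code is None:
                return match.group(0)  # unknown token: leave the block as-is
            codes.append(str(code))
        return "\033[" + ";".join(codes) + "m" + word + "\033[0m"

    return BLOCK.sub(replace, text)


print(render("This {word:blue} is blue!"))
print(render("This is {important:red|bold}."))
```

Text outside the braces is never parsed in this sketch; only the content of each block is inspected.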

As before, check existing modules to better understand.

⚠️ String content: for the moment there are a few issues with the default implementation of string.format, which catches various characters, specifically the . and the : that is used as our delimiter. For now, inserting these characters in a block to format will cause problems.

As a rule of thumb, if you write your description and when testing the cli the page doesn't load, it means that there's probably something wrong with the text there.

Testing

Testing is done with pytest and coverage.

You can start a full run with

coverage run -m pytest -v && coverage report -m

Or use whatever integration you like - I'm using vscode and its integrations.

There is a utils file with a bunch of functions and a decorator class, used mainly as a container for the functions. Most of the tests require at least one decorator if they are not testing for failures.

Pull Requests

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. I'm trying to follow git flow to some degree, so please open PRs against the develop branch.

⚠️ Please make sure to update tests as appropriate.

License

MIT

stocks-historical-data's Issues

Add delimiter for csv output

Some software is picky about how it wants the data. For now the delimiter used is |, but some programs want a comma (,).

Add the option to specify a custom delimiter

Add setting
Add cli control
Add the ability to pass the delimiter to the writer class
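A minimal sketch of what passing the delimiter down to the writer could look like, using the standard csv module. The function name `write_rows` and the sample rows are made up for illustration; the default of | mirrors the current behaviour described above.

```python
import csv
import io


def write_rows(rows, stream, delimiter="|"):
    """Write rows to a text stream using the requested delimiter."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


rows = [["date", "symbol", "volume"], ["2021-01-04", "GME", 1000]]

pipe_output = io.StringIO()
write_rows(rows, pipe_output)  # current default

comma_output = io.StringIO()
write_rows(rows, comma_output, delimiter=",")  # user-selected delimiter

print(pipe_output.getvalue())
print(comma_output.getvalue())
```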

Add source(s) for historical quotes

This relates to #9 but it's more generic. Also #61 needs to be done first or concurrently, probably. Also #65.

The general idea would be to have some source set up to get historical quotes, meaning open, close, high, low prices and - if available on the target source - daily volume.

Some sources that come to mind are:

NASDAQ - again, see #9 and the related API pricing problems
Yahoo - API key required, free tier with 500 requests per month, so not many, but they could be enough, especially if we can pull more symbols with one request
Interactive Brokers - still to be explored. They have a REST API available, but I haven't understood yet whether it is free, behind a paywall, or requires an active IB account

I think, even though the NASDAQ module has had an initial mock set up for some time, the best bet is to start with Yahoo data, as it has a free tier and would allow us to properly set up and test the REST integration.

specify output path/template in stonks run

Give the user the option to specify the output path and/or the filename template when running the run command

-o | --output for the output path
-p | --pattern for the filename pattern generator
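The two options could be wired up roughly as follows. This sketch uses argparse from the standard library purely to stay dependency-free (another issue proposes click); the `stonks` program name, the `run` subcommand layout and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a `run` command accepting -o and -p."""
    parser = argparse.ArgumentParser(prog="stonks")
    commands = parser.add_subparsers(dest="command", required=True)

    run = commands.add_parser("run", help="fetch, parse and write the data")
    run.add_argument("-o", "--output", help="output path for the data")
    run.add_argument("-p", "--pattern", help="filename pattern for the generator")
    return parser


# Simulated command line: stonks run -o ~/stonks/ -p "{source}_{tickers}"
args = build_parser().parse_args(["run", "-o", "~/stonks/", "-p", "{source}_{tickers}"])
print(args.command, args.output, args.pattern)
```

Both options default to None when omitted, so the caller can fall back to the values stored in the settings.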

add nasdaq historical data for quotes

Data is available from https://www.nasdaq.com/market-activity/stocks/gme/historical officially, BUT:

the file name itself doesn't look consistent to me with the data available
date ranges cannot be predetermined (as far as I can see)

See this comment for a working update

custom output filename template

Some things are already set up for this to work, but the gist is to let the user specify a format using Python formatting, like your {variable}, so that they know in advance what the file name will be.

options for this could/should be:

start date - with optional formatting
end date - with optional formatting
source
tickers

The template should be set per writer when possible, but for now a global template would be OK.
The template should be validated when added (i.e. for missing parentheses or unknown variables requested).
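One possible way to do that validation is to let `string.Formatter` parse the template and compare the field names against an allow-list. The sketch below is an assumption about how it might work; the variable names in `ALLOWED` and the function `validate_template` are hypothetical.

```python
from string import Formatter

# Hypothetical variable names for the options listed above.
ALLOWED = {"start_date", "end_date", "source", "tickers"}


def validate_template(template: str) -> list:
    """Return a list of problems found in the template (empty if valid)."""
    try:
        # parse() yields (literal_text, field_name, format_spec, conversion).
        fields = [name for _, name, _, _ in Formatter().parse(template)
                  if name is not None]
    except ValueError as exc:  # raised for unbalanced braces
        return [f"malformed template: {exc}"]
    return [f"unknown variable: {name!r}" for name in fields if name not in ALLOWED]


print(validate_template("{source}_{start_date:%Y%m%d}_{end_date:%Y%m%d}.csv"))
print(validate_template("{source}_{exchange}.csv"))
print(validate_template("{source_{tickers}.csv"))
```

The format spec after the colon is what allows the optional date formatting mentioned for the start and end dates.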

setup.py

properly setup setup.py to be able to install everything and eventually publish it

Issue when parsing SEC data

  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "main", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./main.py", line 7, in <module>
    cli()
  File "./cli/cli.py", line 13, in start
    entry.run(settings)
  File "./cli/entry.py", line 18, in run
    utils.run_menu(get_menu(), settings)
  File "./cli/utils.py", line 21, in run_menu
    menu_exit = handle_choice(menu_items, choice, settings)
  File "./cli/utils.py", line 10, in handle_choice
    return menu_items[choice][1](settings)
  File "./cli/entry.py", line 35, in handle_run_scraper
    scraper.run()
  File "./scraper/stocks.py", line 52, in run
    self.parser.parse(response)
  File "./scraper/components/parsers/base_parser.py", line 63, in parse
    for row in reader:
  File "/usr/lib/python3.6/codecs.py", line 1041, in iterdecode
    output = decoder.decode(input)
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 35: invalid start byte

The error occurs when trying to parse one of the SEC files with a long list of tickers and a wide date range.
It doesn't seem to be related specifically to the settings of the run, but it might be.

Settings

Current Options
Start   2019-05-01
End     2020-01-01
Type    Aggregate File
Path    ./data/output/
Tickers A, AAP, ABBV, ABMD, ABT, ACN, ADBE, AEE, AES, AFL, AKAM, ALB, ALGN, ALK, ALL, ALLE, ALXN, AMCR, AMD, AMZN, AOS, APD, ARE, ATVI, GOOG, GOOGL, LNT, MMM, MO
Sources SEC FTDs

The issue seems to be in the csv.reader built in the SecFtdParser class and returned for processing: when the loop starts on the iterator, it crashes.

It doesn't crash on the first extraction though, when we try to extract the header.
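The traceback shows a byte (0xbb) that is not valid UTF-8 somewhere after the header, which would explain why the header is read fine and the loop fails later. One possible workaround, sketched here under assumptions, is to decode each line as UTF-8 and fall back to Latin-1 when that fails. The sample bytes, the column names and the `|` delimiter are made up for illustration, and whether Latin-1 is the right fallback for the real files is not confirmed.

```python
import csv


def decode_line(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 for stray bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")


def read_rows(raw_lines, delimiter="|"):
    """Build the csv rows from an iterable of raw byte lines."""
    return list(csv.reader((decode_line(line) for line in raw_lines),
                           delimiter=delimiter))


# Illustrative data only: the second line contains the offending 0xbb byte.
sample = [
    b"DATE|SYMBOL|DESCRIPTION\n",
    b"20190502|ABC|SOME \xbb NAME\n",
]
rows = read_rows(sample)
print(rows)
```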

Short volume files layout

http://regsho.finra.org/DailyShortSaleVolumeFileLayout.pdf

Kind Regards,
Kenny G

Output aggregation of more sources

The idea would be to find a way to output more sources on one single individual file (either by ticker or all together as it is now).

The basic idea is that as we add more and more sources, a user may want all the data retrieved from 2, 3 or more sources in one single file to hand over for processing (Excel, DB, whatever).

This could be done creating a new class of components to preprocess all the parsed data before passing it to the writer, so that it could bulk merge the various columns together properly, leaving the source and writers modules ignorant of the new feature.

We would have to do a small refactor of the App component to change the current main loop, and we may run into issues when columns have the same name, so that is something to think about.

Customized settings for modules

The idea is to expand the settings object and give each component/module its own set of settings, while keeping some as global.
The reason is that, for example, if we implement a MySQL writer there is no need for an output path, but there is a need to configure the MySQL connection.

Same goes with sources: some require keys ( see #65 ) that can either be stored in a separate file or in the settings.

This would require some refactor of the setting class, maybe creating some base class for basic functionalities and a way to pass around the settings data to each component

Reorganize components and remove validation on constants

In order to actually allow custom components to be added we need to remove the
validation performed on the scraper.settings.constants.

Those can be kept for internal reference, for now at least, but it would be
better to rethink on how that works, maybe even reorganizing the code.

The validation itself is technically not needed anymore, I think, since every
component passes through the manager to be handled, so the names can actually be
anything now.

It would also be a good idea to actually define naming and import schemes to follow.

Also, to improve code navigation and import we could change the packages structure
to something like

├── components/
│   │   # should import, like it does now, `fetchers` and `writers`,
│   │   # at least for now and include `constants`
│   ├── __init__.py
│   ├── component_base.py
│   ├── base_fetcher.py
│   ├── base_parser.py
│   ├── base_parser.py
│   │
│   │   # one folder for each source to handle
│   ├── <source_name>
│   │   │   # imports classes and constants, especially the source_name
│   │   ├── __init__.py
│   │   ├── <source_name>_fetcher.py
│   │   ├── <source_name>_parser.py
│   │   │   # local constants and a SOURCE_NAME for matching
│   │   └── constants.py
│   │
│   │   # one folder for each output type/writer
│   ├── <output_type>/
│   │   ├── __init__.py
│   │   ├──<writer_name>.py
│   │   │   # local constants and OUTPUT_TYPE for matching
│   │   └──constants.py

This could allow anybody to create a new set of handlers for a source/output type,
like for example

├── <my_source>/
│   ├── __init__.py
│   ├── <my_source>_fetcher.py
│   ├── <my_source>_parser.py
│   └── constants.py

# classes creation
from scraper.components import Fetcher  # base class

class MyFetcher(Fetcher):
  pass

from scraper.components import manager
from my_source import MyParser, MyFetcher, SOURCE_NAME

# Add them to the manager so that both the app and the cli can use them
manager.register_handler(SOURCE_NAME, MyFetcher, MyParser)

Since the interface will be the same regardless and matching is now done through the manager, working this way makes it simpler to set up new sets of components and organize them.

Unit testing & Major Refactor

As per title. Needed because yes.

I'm planning on using pytest but we'll see.

To simulate responses for requests we could use the responses library; that should do the trick.

Improve SEC Data fetching

Since the SEC FTDs data is split in first and second half of each month, it could be nice to have the fetcher try and retrieve only what's needed.

Describe the solution you'd like
Check the day of the start date, if above half month skip the first file. Do the same for the end date but skipping second file if below half month.
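The proposed check can be sketched as a function that lists which half-month files overlap a date range. The split on day 15, the "a"/"b" labels for the first and second half, and the name `halves_needed` are assumptions made for this example.

```python
from datetime import date


def halves_needed(start: date, end: date, split_day: int = 15):
    """Return (year, month, half) for each file overlapping [start, end].

    "a" is the first half of the month, "b" the second half.
    """
    needed = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first_month = (year, month) == (start.year, start.month)
        last_month = (year, month) == (end.year, end.month)
        for half in ("a", "b"):
            # Start date past mid-month: the first file holds nothing useful.
            if half == "a" and first_month and start.day > split_day:
                continue
            # End date before mid-month: the second file holds nothing useful.
            if half == "b" and last_month and end.day <= split_day:
                continue
            needed.append((year, month, half))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return needed


print(halves_needed(date(2019, 5, 20), date(2019, 7, 10)))
```

For the range above, the first half of May and the second half of July are skipped, so four files are requested instead of six.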

Ability to continue outputs

The idea would be to have the app load an existing output file, use

a mix of file name and header row to detect the source and optionally the output type
the filename and/or the date column - if present - to find the newest date in the file
the ticker/symbol column or the filename to detect the relevant data

Then set up the settings automatically with the proper values and, using the last date as start_date and the current date as end_date, fetch the new data, then write to the same file, appending instead of replacing data.

This, as a first thought, could require

A new Writer to handle appending to the existing file instead of overwriting
New options on all the parsers to actually have a way to detect if they are the correct one
Eventually a new FilenameGenerator, or some addition to the existing one, to remove the dates from the file (as they would clash with manual identification of the file)

As an extra, this may require disregarding the different parsing methods for the various output types, since they may clash with the parsing (i.e. custom file names without symbols in the name and inside would be impossible to use).

Ability to save settings to a different path

This comes in tandem with #67, since I've been asked for the ability to launch with arguments, specifying a settings file, and to save/load the settings to/from a custom path.

Give the user the ability to save the settings on a different path
Give the user the ability to load the settings from a file
- arguments
- from cli, typing the path
- from cli, through a navigator
Give the user the ability to set a default settings path.

The app itself should already be capable of most of the work since the instance can be created providing a path to use for the settings file, already has a default one (albeit static for now) and both the to_file and from_file methods accept a path to use.

Also we already have proper handling for missing or wrongly formatted settings file.

refactor CLI to allow saving the settings to another path
refactor CLI to change a default settings path
refactor Settings to be able to have a safe fallback and handle a default path
(optional?) store the settings path on a simple file in the project folder so that the app knows where to look for it.

Internal error handling and logging

Currently there's some kind of validation of settings in place but it's not how it should be ( #70 ). Also there is no proper logging, with levels and everything, to debug what's happening.

The idea as it comes to mind is to have something like the manager (or even inside the manager!) to register error messages with some additional info, to be used both as a level and as some kind of traceback.

Errors could have a basic shape of { level: <logging.level>, message: <error-message>} so that they can be outputted with the logging module or better yet loguru which looks awesome and easy to use and wouldn't require much fuss.
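A small sketch of that shape using only the standard logging module. `ErrorRecord` and `ErrorRegistry` are hypothetical names; the same records could be handed to loguru instead.

```python
import logging
from dataclasses import dataclass


@dataclass
class ErrorRecord:
    level: int    # one of the logging levels, e.g. logging.WARNING
    message: str


class ErrorRegistry:
    """Collects errors so they can be shown or logged later."""

    def __init__(self):
        self.records = []

    def register(self, level: int, message: str) -> None:
        self.records.append(ErrorRecord(level, message))

    def flush(self, logger: logging.Logger) -> None:
        """Send every stored record to the logger, then clear the list."""
        for record in self.records:
            logger.log(record.level, record.message)
        self.records.clear()


logging.basicConfig(level=logging.INFO)
registry = ErrorRegistry()
registry.register(logging.WARNING, "output path is not set")
registry.register(logging.ERROR, "no source selected")
pending = len(registry.records)
registry.flush(logging.getLogger("stonks"))
```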

Improve manager & component handling

The role of the manager is to handle and group together components that work with the same source, meaning Fetchers and Parsers, in order to better find and select them, as well as allowing expandability of the whole project with custom sources.

As of now the whole manager is just a singleton in the form of the module itself. It works just fine for now, but I don't like it very much. I need to find a better way to:

Handle the manager itself, maybe including it into App as simple object, so we can do app.manager.register(...) when setting up an App object
Improve source discrimination for components. I usually set a static property in the class with the name of what it's used for, but I'm not sure at the moment. This would be something like FinraFetcher.__source = SOURCES.FINRA_SHORTS so that when registering the component you just add it and the manager places it in the right place.
This may then cause the issue of forgetting to register a component and having future issues

Also, as a sidenote, the Writer components still have a static selection method. They should also be put in some sort of writer manager.
The idea here would be to be able to write different formats and file types like - to say some - directly into a database, local or remote, on a spreadsheet without the need to import a csv and so on.
Same principle applies though: they need a discriminator/unique identifier to handle them.

This also could be used to automatically generate menus for the CLI app, without the need to setup everything manually. #24 for that.

Update readme for develop branch

A bunch of things changed, for contributors, developers wanting to implement this as a library and even for end users.
Update the readme before develop->master

launch with arguments

implement argparse or something similar to launch directly bypassing the cli. Useful for automated systems and to just relaunch quickly with options

Implement core command structure with click
create commands to launch with arguments
add option to launch the CLI app
Should allow specifying a custom path for a settings file, to allow for separate configurations for eventual automations

secrets for api keys and handling

Relates to #64, #61 and #9.

Most REST API sources are behind authentication and/or pay wall, so they will be available to users only if they have the proper key to use them.

We would need:

some way to store the keys. These could be either in a .env file or even in the settings file since it's local and not committed and they would be stored in clear anyway
settings refactor to handle the keys
a mapping on the various source handlers for the key, meaning having a new attribute to map the handlers to a key name
- manager handling refactor to store them and have the required_key or something available throughout the app
A way to easily handle the keys through the app's cli
dynamically enable/disable the handlers if a key exists or not

Update

Stuff on the develop branch changed since I opened this issue.

After some thoughts the api keys will be stored in at least one separate file if not one for each key and kept internal to the application. This will allow to switch settings file and/or give it away to other people while retaining the keys.

The file format for the keys could be either a simple key=value file or a json

The handling itself will be probably done through the Settings object so that they're available through the whole app and can be easily modified with the interactive cli

Fetchers should have a new method to tell the app if they need an api key or not to work.

Since each source has a unique string identifier accessible through the static method is_for, we could use that value to store it into settings.keys[<source>_API_KEY] -> str

The keys dict should be dynamically created after manager initialization (so when all components are ready to be used) and set to default None; when the file with the data is loaded settings will loop through all the items in the file and set the key.
Having the API key set to None will allow the CLI to know that there should be one, so we can give feedback on missing keys.

Internal process

initialize the manager with the components
Initialize the app/settings (load settings file/create empty settings)
Either Settings - or delegated to App - should loop all sources in the manager and access the Fetcher classes, looking for the new requires_api_key() -> bool method/property
If requires_api_key is true add the Fetcher::is_for value to settings.keys dict
Settings should load the file with the keys and dump the content in the dict

When a fetcher requires a key to be put into the request it should then go to settings.keys[self.is_for()] to retrieve it, throwing a new dedicated Exception if missing.
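The process described above could look roughly like this. It is a sketch under assumptions, not existing code: the stand-in `FetcherBase`, the example fetchers, `build_keys`, `get_key` and `MissingApiKeyError` are all hypothetical, and the keys dict is indexed by the plain is_for value.

```python
class MissingApiKeyError(Exception):
    """Raised when a fetcher needs a key that has not been configured."""


class FetcherBase:  # stand-in for the project's base class
    @staticmethod
    def is_for() -> str:
        raise NotImplementedError

    @staticmethod
    def requires_api_key() -> bool:
        return False


class FinraFetcher(FetcherBase):
    @staticmethod
    def is_for() -> str:
        return "finra_shorts"


class YahooFetcher(FetcherBase):
    @staticmethod
    def is_for() -> str:
        return "yahoo"

    @staticmethod
    def requires_api_key() -> bool:
        return True


def build_keys(fetchers, stored):
    """Create the keys dict: None for every required key, then load stored values."""
    keys = {f.is_for(): None for f in fetchers if f.requires_api_key()}
    for name, value in stored.items():
        if name in keys:  # this sketch ignores keys for unknown sources
            keys[name] = value
    return keys


def get_key(keys, fetcher) -> str:
    """Return the key for a fetcher or raise the dedicated exception."""
    value = keys.get(fetcher.is_for())
    if value is None:
        raise MissingApiKeyError(f"missing API key for {fetcher.is_for()!r}")
    return value


empty = build_keys([FinraFetcher, YahooFetcher], stored={})
loaded = build_keys([FinraFetcher, YahooFetcher], stored={"yahoo": "not-a-real-key"})
print(empty)   # the yahoo entry exists but is None, so the CLI can flag it
print(get_key(loaded, YahooFetcher))
```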

Cli handling

At least until we can set specific settings for each component, the cli should have a new main menu item to manage api keys.
here we should list - maybe in place of the current settings - all the required/available keys by name and give the user the options to set/update or remove the key.

The list should quickly show if a key is present or missing too so it's easy to know.

Test issue

test issue for the workflow

dynamic descriptions on options in cli

modular components are functioning
cli correctly parses the registered modules

the description obviously is not updated. I was thinking about using the preview command of the menu creator which - while it adds the not needed preview box, might just do the trick perfectly.

Working on implementing the preview function to read the description from the module and add custom formatting to allow highlighting and other things, which may be nice.

What about a readme?

I hate when projects don't tell exactly what they do and/or how they do it.

This needs to be resolved asap.

REST API

Some sources - most of them actually - work through a REST API, some even through websockets. The next sources I'd like to see are for quotes from NASDAQ and/or Yahoo and maybe the Interactive Brokers API.

Currently the only way to try and work with those is to manually create the request by building the URL with the data, but since we're using requests it could be nice to actually do it properly.

I see two options:

refactor the current FetcherBase to handle all kinds of requests, adding overridable methods to get request header and body
add another layer of inheritance, adding a RestFetcher(FetcherBase) that adds those methods as abstract methods, forcing subclasses to implement them.
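The second option could be sketched as below. The `FetcherBase` here is only a stand-in with a single assumed method, the method names `get_headers`, `get_body` and `build_request` are hypothetical, and the URL is a placeholder rather than a real endpoint.

```python
from abc import ABC, abstractmethod


class FetcherBase(ABC):  # stand-in for the project's base class
    @abstractmethod
    def make_url(self, **params) -> str:
        """Return the URL to request."""


class RestFetcher(FetcherBase):
    """Extra layer that forces subclasses to describe headers and body."""

    @abstractmethod
    def get_headers(self) -> dict:
        """Return the request headers (e.g. authentication)."""

    @abstractmethod
    def get_body(self, **params) -> dict:
        """Return the request payload."""

    def build_request(self, **params) -> dict:
        """Collect everything needed to perform the request."""
        return {
            "url": self.make_url(**params),
            "headers": self.get_headers(),
            "json": self.get_body(**params),
        }


class ExampleFetcher(RestFetcher):
    def make_url(self, **params) -> str:
        return "https://api.example.com/quotes"  # placeholder URL

    def get_headers(self) -> dict:
        return {"Accept": "application/json"}

    def get_body(self, **params) -> dict:
        return {"symbols": params.get("symbols", [])}


request = ExampleFetcher().build_request(symbols=["GME"])
print(request)
```

Because the methods are abstract, a subclass that forgets one of them cannot be instantiated, which is the "forcing subclasses to implement them" part of the proposal.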

Both have their merits, so if nobody can point me in a direction I'll see what's what when the time comes.

No validation on whole settings

We are currently missing a method to actually validate that the settings are there.

On the CLI this is currently done through cli.utils.validate_settings, but it should be done directly by the Settings object (ideally in App before starting the actual loop), and the errors should be available somewhere to give feedback.

This is also needed to complete #67 properly, providing the option to change the dates

Improve CLI development

As far as I can tell, on the user side the CLI works just fine. It can be improved and automated on the development side. See #23 for some thoughts.

List of available sources

Here is an updated (as much as possible) list of available data that can be scraped through the app.

A note about NASDAQ DATA

I've already started working on it, before finding out about the pay wall.
They have A BUNCH of data, like all the things but due to the paywall for now
ALL THINGS RELATED TO NASDAQ are on halt

Already available

List of sources already implemented. Feel free to open issues for bugs or changes you want to see implemented.

Historical short volume
- FINRA
  - direct file access (csv)
  - no authentication needed
  - reported daily
Historical fail to deliver
- SEC
  - direct file access (zip file with csv in .txt format)
  - no authentication needed
  - reported twice a month, reports contain daily data

Working on

List of approved suggestions that are work in progress

Historical quotes
- NASDAQ
  - REST API
  - requires authentication
  - behind pay wall if not requested through browser
- YAHOO
  - REST API
  - requires authentication with optional paywall
  - has free tier with 500 requests/month

Suggested

List of noted suggestions

Rejected

List of suggestions rejected with some reason

Documentation

It may be time to start writing up some proper documentation, at least for the major parts of the package and some example on how to implement new stuff.