10-K-scraper

This script downloads 10-K filings (.htm) through the SEC EDGAR API, scrapes specific sections from them, and saves the text to .txt files. You are welcome to modify the script.

Platform & Dependency:

Python 3.6 and standard libraries
BeautifulSoup

Introduction:

TenKDownloader(CIK, start_time, end_time) returns a TenKDownloader object. CIK can be a single string or a list of strings. See https://www.sec.gov/Archives/edgar/cik-lookup-data.txt to look up the CIKs of the companies you are interested in. In some cases a ticker symbol also works for this argument. start_time and end_time are strings in %Y%m%d format (for example, '20150101').
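
The argument formats described above can be checked before constructing a downloader. The following is only a minimal sketch: the helper names parse_edgar_date and normalize_ciks are hypothetical and are not part of TenKDownloader, which may handle its arguments differently.

```python
from datetime import datetime


def parse_edgar_date(value):
    """Parse a date string in %Y%m%d format, e.g. '20150101'.

    Raises ValueError if the string does not match the format.
    """
    return datetime.strptime(value, "%Y%m%d").date()


def normalize_ciks(cik):
    """Accept one CIK string or a list of CIK strings; always return a list."""
    if isinstance(cik, str):
        return [cik]
    return list(cik)


# Hypothetical usage with the values from this README's example.
start = parse_edgar_date("20150101")
end = parse_edgar_date("20181101")
ciks = normalize_ciks("6281")
```

Validating the dates up front gives a clear ValueError for input such as '2015-01-01' instead of a failure later during the download.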

Attributes:

Method:
- download(path='./data') downloads the corresponding 10-K filings to ./data/<CIK>/<date>.htm. It uses BeautifulSoup to scrape the web page and retrieve the .htm file.

Data:
- all_url is a Python dictionary (key: CIK; value: list of (date, filing url) tuples).
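
To make the shape of all_url concrete, here is a small sketch that walks a dictionary of that form and builds the local path each filing would be saved to. The dictionary contents and the local_paths helper are hypothetical (the URLs are placeholders, not real filing URLs); only the key/value layout and the ./data/<CIK>/<date>.htm path scheme come from the description above.

```python
# Hypothetical data in the documented shape:
# key: CIK, value: list of (date, filing url) tuples.
all_url = {
    "6281": [("20171122", "https://example.com/filing-a.htm")],
    "6769": [("20180223", "https://example.com/filing-b.htm")],
}


def local_paths(all_url, path="./data"):
    """Return the path <path>/<CIK>/<date>.htm for every filing in all_url."""
    paths = []
    for cik, filings in all_url.items():
        for date, _url in filings:
            paths.append(f"{path}/{cik}/{date}.htm")
    return paths


paths = local_paths(all_url)
```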

TenKScraper(section, next_section) returns a TenKScraper object. section is a heading such as 'item 1', and next_section is the heading where scraping stops. For example, to scrape section 'item 2', create TenKScraper('item 2', 'item 3').

Attributes:

Method:
- scrape(htm_file, txt_file) scrapes the section text, writes it to txt_file, and also returns it as a string. The implementation is based on the work at http://community.mis.temple.edu/zuyinzheng/pythonworkshop/ and uses regular expressions to recognize bold tags. You can customize the patterns (p1 to p13 in the code). Note that the output directory must already exist; the .txt file itself does not need to exist.
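
The general idea of pulling out the text between two item headings with a regular expression can be sketched as follows. This is a simplified illustration, not the p1-p13 patterns used in the script: the function extract_section and the sample HTML are hypothetical, and the sketch strips all tags instead of matching bold tags.

```python
import html
import re


def extract_section(raw_html, section, next_section):
    """Return the text between a `section` heading and the next `next_section` heading."""
    # Crudely drop tags, decode entities, and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text)

    # Non-greedy match from one heading to the other; \b keeps 'Item 1'
    # from matching inside 'Item 1A'.
    pattern = re.compile(
        re.escape(section) + r"\b(.*?)" + re.escape(next_section) + r"\b",
        re.IGNORECASE | re.DOTALL,
    )
    matches = pattern.findall(text)
    if not matches:
        return ""
    # A table of contents often repeats the headings, so keep the longest
    # candidate on the assumption that it is the section body.
    return max(matches, key=len).strip()


# Hypothetical filing fragment: a table-of-contents line, then the section.
sample = (
    "<p>Item 1A. Risk Factors 12 Item 1B. Unresolved Staff Comments 20</p>"
    "<p><b>Item 1A.</b> Risk Factors</p>"
    "<p>Our business faces competition and regulatory risk.</p>"
    "<p><b>Item 1B.</b> Unresolved Staff Comments</p>"
)
section_text = extract_section(sample, "Item 1A", "Item 1B")
```

Taking the longest match is a heuristic; real filings vary widely in markup, which is presumably why the script carries several alternative patterns.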

Example

from TenK import TenKDownloader, TenKScraper

company_CIK = ['6281', '6769']
downloader = TenKDownloader(company_CIK, '20150101', '20181101')
downloader.download()

scraper = TenKScraper('Item 1A', 'Item 1B')  # scrape text starting at Item 1A and stopping at Item 1B
scraper2 = TenKScraper('Item 7', 'Item 8')
scraper.scrape('./data/6281/20171122.htm', './data/txt/test.txt') # make sure ./data/txt exists
scraper2.scrape('./data/6769/20180223.htm', './data/txt/test2.txt')
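
Because scrape() expects the output directory to exist, it can be created beforehand. The sketch below shows one way to do that with the standard library; ensure_parent_dir is a hypothetical helper, not part of TenK, and the demo uses a temporary directory so it does not touch the working directory.

```python
import os
import tempfile


def ensure_parent_dir(txt_file):
    """Create the directory that will contain txt_file if it is missing."""
    parent = os.path.dirname(txt_file)
    if parent:
        # exist_ok=True makes this a no-op when the directory is already there.
        os.makedirs(parent, exist_ok=True)
    return parent


# Demo in a throwaway location; in practice the path would be something
# like './data/txt/test.txt'.
with tempfile.TemporaryDirectory() as base:
    txt_file = os.path.join(base, "data", "txt", "test.txt")
    out_dir = ensure_parent_dir(txt_file)
    created = os.path.isdir(out_dir)
```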
