10-K-scraper

This script downloads 10-K filings (.htm) through the SEC EDGAR API, scrapes specific sections from them, and saves the text to .txt files. You are welcome to modify the script.

Platform & Dependencies:

  • Python 3.6 and standard libraries
  • BeautifulSoup

Introduction:

  • TenKDownloader(CIK, start_time, end_time) returns a TenKDownloader object. CIK can be a single string or a list of strings. See https://www.sec.gov/Archives/edgar/cik-lookup-data.txt to look up the CIKs of the companies you are interested in. In some cases a ticker symbol can be passed instead. start_time and end_time are strings in %Y%m%d format.

    Attributes:

    Method:

    • download(path='./data') downloads the corresponding 10-K filings to ./data/<CIK>/<date>.htm. It uses BeautifulSoup to scrape the web page and retrieve each .htm file.

    Data

    • all_url is a Python dictionary (key: CIK; value: list of (date, filing url) tuples).
  • TenKScraper(section, next_section) returns a TenKScraper object. section is the heading where scraping starts, such as 'item 1', and next_section is the heading where it stops. For example, to scrape section 'item 2', create TenKScraper('item 2', 'item 3').

    Attributes:

    Method:

    • scrape(htm_file, txt_file) scrapes the section, writes the text to txt_file, and also returns the text as a string. The implementation is based on the work at http://community.mis.temple.edu/zuyinzheng/pythonworkshop/ and uses regular expressions to recognize the tags around section headings. You can customize the patterns, which are named p1-p13 in the code. Note that the output directory must already exist, but the .txt file itself does not need to exist beforehand.
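
The idea behind scrape can be sketched with a single regular expression. This is only an illustration under simplifying assumptions, not the script's actual implementation: the function name extract_section is hypothetical, tags are stripped before matching, and one generic pattern stands in for the p1-p13 patterns mentioned above.

```python
import re
from html import unescape


def extract_section(html_text, section, next_section):
    """Return the text between a `section` heading and the next `next_section` heading."""
    # Replace tags with spaces, decode entities, and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", html_text)
    text = unescape(text).replace("\xa0", " ")
    text = re.sub(r"\s+", " ", text)

    def heading(name):
        # Allow any run of whitespace between the words of a heading.
        return r"\s+".join(re.escape(part) for part in name.split())

    pattern = re.compile(
        heading(section) + r"\b(.*?)" + heading(next_section) + r"\b",
        re.IGNORECASE | re.DOTALL,
    )
    matches = pattern.findall(text)
    if not matches:
        return ""
    # The table of contents often repeats the headings, so keep the longest span.
    return max(matches, key=len).lstrip(" .:").rstrip()
```

Keeping the longest match is a heuristic for skipping the table of contents; real filings vary enough in markup that several patterns are usually needed, which is presumably why the script carries more than one.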

Example

from TenK import TenKDownloader, TenKScraper

company_CIK = ['6281', '6769']
downloader = TenKDownloader(company_CIK, '20150101', '20181101')
downloader.download()

scraper = TenKScraper('Item 1A', 'Item 1B')  # scrape text starting at Item 1A and stopping at Item 1B
scraper2 = TenKScraper('Item 7', 'Item 8')
scraper.scrape('./data/6281/20171122.htm', './data/txt/test.txt') # make sure ./data/txt exists
scraper2.scrape('./data/6769/20180223.htm', './data/txt/test2.txt')
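
For batch processing, the (date, filing url) entries in all_url can be mapped to the local paths that download writes. The sketch below assumes the ./data/<CIK>/<date>.htm layout described above and that dates are %Y%m%d strings; the helper name planned_jobs, the output file naming, and the sample dictionary with its URL are hypothetical.

```python
import os


def planned_jobs(all_url, data_dir="./data", txt_dir="./data/txt"):
    """Yield (htm_path, txt_path) pairs for every filing listed in all_url."""
    # scrape() needs the output directory to exist, so create it up front.
    os.makedirs(txt_dir, exist_ok=True)
    for cik, filings in all_url.items():
        for date, _url in filings:
            htm_path = os.path.join(data_dir, cik, date + ".htm")
            txt_path = os.path.join(txt_dir, "{}_{}.txt".format(cik, date))
            yield htm_path, txt_path


# Hypothetical data in the shape described for all_url.
sample = {"6281": [("20171122", "https://example.invalid/filing.htm")]}
```

Each pair produced by planned_jobs(downloader.all_url) could then be passed to scraper.scrape(htm_path, txt_path).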
