
edgarWebR's People

Contributors

mwaldstein


edgarWebR's Issues

Parse all HTML with `option = "HUGE"`

Because of #11, if default HTML parsing fails, parsing is retried with the HUGE option passed to read_html.

This option should be used all the time for consistency, but most of the existing parsing tests started failing when it was applied universally, hence the conditional that applies it only when needed. We need to better understand the parsing differences and do more testing to gauge the impact of making the option universal.
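
A minimal sketch of the conditional retry described above (the wrapper name is made up, not the package's actual implementation; xml2's read_html accepts parser options including "HUGE"):

    library(xml2)

    # Hypothetical wrapper: try the default parser first, and only fall back
    # to the HUGE option when libxml2 rejects the document (e.g. excessive depth).
    read_filing_html <- function(url) {
      tryCatch(
        read_html(url),
        error = function(e) {
          read_html(url, options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
        }
      )
    }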

Documentation: Document how 'Fast Search' works and is implemented in edgarWebR

The EDGAR website allows searching either by company name (regular search) or by CIK/ticker symbol (Fast Search):
https://www.sec.gov/edgar/searchedgar/companysearch.html

I would love to have company_search() extended with a CIK = TRUE/FALSE parameter and something like the following logic:

if (CIK) {
  if (file_number) {
    paste0("&CIK=&filenum=", x)
  } else {
    paste0("&CIK=", URLencode(x, reserved = TRUE))
  }
} else {
  if (file_number) {
    paste0("&company=&filenum=", x)
  } else {
    paste0("&company=", URLencode(x, reserved = TRUE))
  }
}

I was not able to test it, so I don't know:

  • whether the API would actually accept the CIK field
  • how CIK and filenum interact, i.e. which combinations are compatible

Again, would love to have this implemented!

Incomplete return of parse_filing

Hello
I have found many URLs for which parse_filing does not return part.name and item.name.

Below is a minimal example to reproduce the problem

library(edgarWebR)
filing_doc = "https://www.sec.gov/Archives/edgar/data/1560385/000155837019010666/lmca-20190930x10q1a7cb4.htm"
doc <- parse_filing(filing_doc, include.raw = TRUE)
doc$part.name
doc$item.name

Thank you
snvv

company_filings and filing_details now giving "xml2::url_absolute error"

Hi, I love your package. This worked on Feb 18th but doesn't now. I'm not sure whether this is an xml2 problem; it looks like some updates to that package were released subsequently.

filing_list <- company_filings(
  as.character('AAPL'),
  ownership = FALSE,
  type = '10-K',
  before = "2020207",
  count = 40,
  page = 1)

Error in xml2::url_absolute(res[[ref]], xml2::xml_url(doc)) :
Base URL must be length 1

Any thoughts appreciated.

Parse Filings fails if the section is included by reference

Example:
parse_filing('https://www.sec.gov/Archives/edgar/data/1048911/000119312515252494/d48165d10k.htm')

In this case, Item 1A. Risk Factors is incorporated by reference; it appears on pages 81-86.
The section labelled "Risk Factors" only identifies where to look for the actual text.

SEC regulations will almost surely require more filings that incorporate sections by reference (contact me for details if an explanation is needed), so the issue will become more acute in the near future.

balthasars' workaround appears to work

balthasars' workaround appears to work great! Thanks for the help. I'm not too familiar with APIs, so it took a while to figure out what was going wrong. This might be helpful to someone who isn't too familiar with API keys:

   install.packages("usethis")
   library(usethis)
   usethis::edit_r_environ()

   # A .Renviron window will open. Add the following line to .Renviron and don't forget to save:
   EDGARWEBR_USER_AGENT = "XXXX"
   # Run the rest of balthasars' code to access EDGAR

Vignette info

Hi, and thanks for the package! Looks very useful. Just a quick editorial nit -- the vignette on CRAN still says this:

How to Download

edgarWebR is not yet of CRAN, as the API hasn’t stabilized yet. In the meantime, you can get a copy from github by using devtools:

Unable to find documentation to set User Agent

When retrieving filings, I receive an error for being an Undeclared Automated Tool. For example, when using

latest_filings()

I receive this error:

No encoding supplied: defaulting to UTF-8.
Error in check_result(res) : 
  EDGAR request blocked from Undeclared Automated Tool.
Please visit https://www.sec.gov/developer for best practices.
See https://mwaldstein.github.io/edgarWebR/index.html#ethical-use--fair-access for your responsibilities
Consider also setting the environment variable 'EDGARWEBR_USER_AGENT

I found the following information in the README:

Because of abusive use of this library, the SEC is likely to block its use “as is” without setting a custom ‘User Agent’ identifier. Details for setting a custom agent are below.

However, no details were given below. Could anyone help me set the user agent?
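
For what it's worth, here is a minimal sketch based on the environment variable named in the error message above (the exact agent string format the SEC expects is an assumption):

    # Set the user agent for the current R session; edgarWebR reads the
    # EDGARWEBR_USER_AGENT environment variable mentioned in the error.
    Sys.setenv(EDGARWEBR_USER_AGENT = "Your Name your.email@example.com")

    latest_filings()  # should no longer be blocked as an undeclared automated tool

    # To make it permanent, put the same assignment (without Sys.setenv) in your
    # .Renviron file, e.g. opened via usethis::edit_r_environ():
    # EDGARWEBR_USER_AGENT = "Your Name your.email@example.com"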

Could I get the DEF 14A link using company_filings function?

I'm using edgarWebR and it works well. Thank you very much!
It's really beneficial for me!!

But I want to know how to get the original DEF 14A document URL.

I tried to get URLs using the company_filings function below, but I can only get the master index URL.

db_def <- company_filings(db_cik$Ticker[i], type = "DEF 14A", count = 1)

  accession_number act file_number filing_date accepted_date href type
1                   34   001-07463  2022-12-13    2022-12-13 https://www.sec.gov/Archives/edgar/data/52988/000119312522303804/0001193125-22-303804-index.htm DEF 14A
2                   34   001-07463  2021-12-10    2021-12-10 https://www.sec.gov/Archives/edgar/data/52988/000119312521354013/0001193125-21-354013-index.htm DEF 14A

I have searched all over, including GitHub, StackOverflow, and lots of tech blogs, but I can't find a way to do what I want.
T-T
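
One possible approach, sketched under the assumption that filing_documents() accepts the index href returned by company_filings() and returns per-document hrefs and types:

    library(edgarWebR)

    db_def <- company_filings(db_cik$Ticker[i], type = "DEF 14A", count = 1)

    # List the documents inside the filing index, then keep the DEF 14A itself.
    docs <- filing_documents(db_def$href[1])
    def14a_url <- docs$href[docs$type == "DEF 14A"][1]
    def14a_url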

company_filings function sporadically locates company filings

Hello,

I'm running into an issue with the company_filings function in the edgarWebR package. Specifically, the browse_edgar subfunction sporadically throws an error when trying to find a company's filings:

Error in browse_edgar(x, ownership = ownership, type = type, before = before, :
Could not find company: XXXXXXX

...where XXXXXX stands for a company's CIK.

Sometimes the function works and returns the desired results, but most of the time it fails with the error message above. I'm currently running the function in a loop over several CIKs, and each time a different CIK causes the function to error out.

I'm running R 3.6.2 in RStudio 1.2.1335.

Any help you can provide is greatly appreciated.

Thank you,

-Mike
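
If the intermittent failures are transient blocks on EDGAR's side rather than genuinely unknown CIKs (an assumption, not a confirmed diagnosis), a throttled retry around the call may help. A hypothetical sketch:

    library(edgarWebR)

    # Hypothetical helper: retry company_filings() a few times with a pause
    # between attempts before giving up on a CIK.
    filings_with_retry <- function(cik, tries = 3, pause = 2, ...) {
      for (attempt in seq_len(tries)) {
        result <- tryCatch(company_filings(cik, ...), error = function(e) NULL)
        if (!is.null(result)) return(result)
        Sys.sleep(pause)  # back off before the next attempt
      }
      warning("Could not retrieve filings for CIK ", cik)
      NULL
    }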

Some words are not parsed correctly due to the <BR> tag

In the parse_filing function, some words are still joined together after parsing, such as "Weightedaverageexerciseprice", which means they cannot be recognized as proper words.
In the original HTML document, the text is written as Weighted<BR>average<BR>exercise<BR>price, which indicates that parse_filing does not handle the <BR> tag properly (the tag is dropped without inserting any whitespace).

Please fix this. Many thanks!
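
One possible workaround, not part of edgarWebR itself: replace <BR> tags with spaces in the raw HTML before parsing, so adjacent words stay separated. The URL below is a placeholder, and a custom user agent may still be required, as discussed in the other issues:

    library(xml2)

    filing_url <- "https://www.sec.gov/Archives/edgar/data/.../form10k.htm"  # placeholder

    # Fetch the raw HTML, turn <br>/<BR> tags into spaces, then parse as usual.
    raw_html <- paste(readLines(filing_url, warn = FALSE), collapse = "\n")
    raw_html <- gsub("<br\\s*/?>", " ", raw_html, ignore.case = TRUE)
    doc <- read_html(raw_html)

    # xml_text(doc) now yields "Weighted average exercise price" instead of
    # "Weightedaverageexerciseprice".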

Unable to use full_text

I get this error every time I try to run the full_text function:

Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: searchwww.sec.gov

Does anyone have a similar issue?

Thanks!

Make parse_filing function support html-wrapped text filings

Hi Micah,
I detected another issue. In the parse_filing function, I understand it splits the content mainly based on parent nodes; however, it cannot parse the child nodes, so the item and part cannot be recognized correctly. The ideal solution would be to parse all nodes (including child nodes) to make the parse function as permissive as possible; otherwise we could miss quite a lot of information.

Here is the example:
https://www.sec.gov/Archives/edgar/data/1424844/000092290708000774/form10k_122308.htm

thanks in advance!

Regards
Derek

Unable to reach the SEC endpoint

I have code that extracts the URL of Exhibit 21 from the URL of a 10-K filing. However, I am now experiencing problems with the parse_submission function.

I receive this error message:
Error in charToText(x) :
Unable to reach the SEC endpoint (https://www.sec.gov/Archives/edgar/data/835011/000117184312000904/0001171843-12-000904.txt)

I am using the function in a loop. The problem occurs on different links each time I rerun the code. If I run the failing call manually, it works. If I rerun the loop enough times I eventually get output, but never for all 3,370 observations in my data set.

Any help would be useful, thanks.
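
A hedged sketch of one way to cope with this in a loop (all names here are hypothetical): record which URLs fail, then re-run only the failures with a pause, until everything succeeds or a pass limit is reached.

    library(edgarWebR)

    # urls: character vector of submission URLs (e.g. the 3,370 .txt links).
    fetch_all <- function(urls, max_passes = 5, pause = 1) {
      results <- vector("list", length(urls))
      pending <- seq_along(urls)
      for (pass in seq_len(max_passes)) {
        for (i in pending) {
          results[[i]] <- tryCatch(parse_submission(urls[i]), error = function(e) NULL)
          Sys.sleep(pause)  # be polite to the SEC endpoint between requests
        }
        pending <- which(vapply(results, is.null, logical(1)))
        if (length(pending) == 0) break
      }
      results
    }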

Excessive depth in document

Hi,

When I use parse_filing on the URLs below, I get the following error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Excessive depth in document: 256 use XML_PARSE_HUGE option [1]

Here are a few sample URLs:
https://www.sec.gov/Archives/edgar/data/1065648/000106564809000009/form_10k.htm
https://www.sec.gov/Archives/edgar/data/1010247/000101024709000005/form10k.htm
https://www.sec.gov/Archives/edgar/data/861459/000086145909000013/form10-q.htm

Again, thanks very much for contributing this package! It's fantastic.

Best regards

Warning on vector inputs

edgarWebR functions are not vectorized, which causes unexpected and unclear errors.

To fix this, functions should warn on multiple inputs.
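
A minimal sketch of the kind of guard this would add (the helper name is made up, not the package's actual code):

    # Hypothetical guard: warn and truncate when a caller passes a vector where
    # a single value is expected.
    check_scalar_input <- function(x, arg = deparse(substitute(x))) {
      if (length(x) > 1) {
        warning(arg, " has length ", length(x),
                "; edgarWebR functions are not vectorized, so only the first element is used.")
        x <- x[1]
      }
      x
    }

    # e.g. inside company_filings(): x <- check_scalar_input(x)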
