manusimidt / py-xbrl Goto Github PK

View Code? Open in Web Editor NEW

95.0 12.0 34.0 388 KB

Python-based parser for parsing XBRL and iXBRL files

Home Page: https://py-xbrl.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 98.03% HTML 1.97%

xbrl xbrl-parser edgar ixbrl python

py-xbrl's Introduction

About Me:

📚 I am currently doing my Masters in Artificial Intelligence at JKU in Linz, Austria.
🤖 There I have chosen the elective track "Mechatronics - Robotics and Autonomous Systems" because I am fascinated by robotics and drones.
🎓 Before my master's degree, I did a bachelor's degree in business information technology.

Connect:

🌍 https://manusimidt.dev
✉ [email protected]

Recent Projects:

📊 PY-XBRL (Open Source Python XBRL/iXBRL Parser)
🤖 Delta Robot

Socials:

Languages

Frameworks & Technologies

GitHub Stats

py-xbrl's People

Contributors

Stargazers

Watchers

py-xbrl's Issues

Doesn't seem to work with local XSD files

I have the following files:

$ ls data/TSLA/10-k/20201231/
tsla-10k_20201231_htm.xml tsla-20201231_cal.xml     tsla-20201231_lab.xml
tsla-20201231.xsd         tsla-20201231_def.xml     tsla-20201231_pre.xml

But when I try to load one, it fails:

from xbrl_parser.instance import parse_xbrl, parse_xbrl_url, XbrlInstance
from xbrl_parser.cache import HttpCache
import logging
logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache/')

# parse from path
instance_path = './data/TSLA/10-k/20201231/tsla-10k_20201231_htm.xml'
inst1 = parse_xbrl(instance_path, None, './data/TSLA/10-k/20201231')

Traceback (most recent call last):
  File "./test.py", line 10, in <module>
    inst1 = parse_xbrl(instance_path, cache, './data/TSLA/10-k/20201231')
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/instance.py", line 281, in parse_xbrl
    taxonomy: TaxonomySchema = parse_taxonomy(cache, schema_url)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/taxonomy.py", line 202, in parse_taxonomy
    schema_path: str = cache.cache_file(schema_url)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/cache.py", line 75, in cache_file
    query_response = requests.get(file_url)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/sessions.py", line 456, in prepare_request
    p.prepare(
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/requests/models.py", line 390, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL './data/TSLA/10-k/20201231/tsla-20201231.xsd': No schema supplied. Perhaps you meant http://./data/TSLA/10-k/20201231/tsla-20201231.xsd?

If I leave out instance_url I get this error instead:

Traceback (most recent call last):
  File "./test.py", line 10, in <module>
    inst1 = parse_xbrl(instance_path, cache)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/instance.py", line 276, in parse_xbrl
    schema_url = resolve_uri(instance_url, schema_uri)
  File "/Users/jamie/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xbrl_parser/helper/uri_resolver.py", line 23, in resolve_uri
    if '.' in dir_uri.split('/')[-1]:
AttributeError: 'NoneType' object has no attribute 'split'

It seems that if you use the HTTP cacher, it tries to load everything from there and raises a fatal error if files aren't found. My understanding was that it'd be optional.

Label Linkbase instance should have get_label() function

Would be handy if the label linkbase would return an array of possible labels given the ID of a concept.

Parsing of Presentation Linkbase for SEC submissions

I am also having problems getting information from the the presentation linkbase. In my case I am getting the information from: microsoft 10k-2020 instance document and the object instance.taxonomy.pre_linkbases does not contain the same information as the linkbase document. It is missing all the locators and definitionArcs. I have spent a few hours looking into the code but I can't find where the error is.

Originally posted by @Pablompg in #20 (comment)

Map common namespaces to schema urls

Some submissions have now schema url defined for common taxonomies.
Maybe add a dictionary that maps from namespace to schema_url for some common taxonomies.
This map can then be used, it the company has not defined the schema_url in the taxonomy extension.

parse_ixbrl should add encoding argument

If the encoding of instance is utf8 but the locale encoding is cp950, then it crahsed. Just adding encoding option to specified the encoding of the instance file for parse_ixbrl function as following:

def parse_ixbrl(instance_path: str, cache: HttpCache, instance_url: str or None = None
               ,encoding='utf8'
               ) -> XbrlInstance:

    instance_file = open(instance_path, "r", encoding=encoding)

Currency extraction

Any chance we could extract the reportedCurrency e.g: "USD", it's in the high level xbrli header.

<xbrli:measure>iso4217:USD</xbrli:measure>

in https://www.sec.gov/ix?doc=/Archives/edgar/data/0000320193/000032019321000056/aapl-20210327.htm

It's also in xml:

<xbrli:unit id="USD">
<xbrli:measure>iso4217:USD</xbrli:measure>
</xbrli:unit>

https://www.sec.gov/Archives/edgar/data/1403570/000149315221006582/qtmm-20190630.xml

Submissions from CompSci

When trying to parse this submission: https://www.sec.gov/Archives/edgar/data/747540/000121390021011934/sprs-20201130.xml the library failed with the error:

Traceback (most recent call last):
File "/home/pablo/.pyenv/versions/3.7.10/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/pablo/.pyenv/versions/3.7.10/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/pablo/Desktop/repos/data/etls/providers/raw_sec1_extract_fundamentals/src/raw_sec1_extract_fundamentals/main.py", line 23, in
download_files(config, os.getenv("DOWNLOAD_FILES", "download.files"))
File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/data_components/config/logging.py", line 229, in wrapped_method
return method(*args, **kwargs)
File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/data_components/config/logging.py", line 278, in wrapped_method
ans = method(*args, **kwargs)
File "/home/pablo/Desktop/repos/data/etls/providers/raw_sec1_extract_fundamentals/src/raw_sec1_extract_fundamentals/main.py", line 15, in download_files
SecDownload(config, config_key).run()
File "/home/pablo/Desktop/repos/data/etls/providers/raw_sec1_extract_fundamentals/src/raw_sec1_extract_fundamentals/sec1_task.py", line 64, in run
xbrlParser.parse_instance(url)
File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 604, in parse_instance
return parse_xbrl_url(url, self.cache)
File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 256, in parse_xbrl_url
return parse_xbrl(instance_path, cache, instance_url)
File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 288, in parse_xbrl
context_dir = _parse_context_elements(root.findall('xbrli:context', NAME_SPACES), root.attrib['ns_map'], taxonomy, cache)
File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 540, in _parse_context_elements
member_concept: Concept = member_tax.concepts[member_tax.name_id_map[member_concept_name]]
AttributeError: 'NoneType' object has no attribute 'concepts'

I will have a deeper look at the error but it seems that this submission does not contain an id for each fact. Will have to analyse it and see if it can be parsed or not.

prefix 'ix' not found in prefix map

This is the full console log

`Traceback (most recent call last):

File "/home/samar/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 5, in
XbrlInstance = parse_ixbrl(ixbrl_path,cache)

File "/home/samar/anaconda3/lib/python3.8/site-packages/xbrl_parser/instance.py", line 389, in parse_ixbrl
xbrl_resources: ET.Element = root.find('.//ix:resources', ns_map)

File "/home/samar/anaconda3/lib/python3.8/xml/etree/ElementTree.py", line 649, in find
return self._root.find(path, namespaces)

File "/home/samar/anaconda3/lib/python3.8/xml/etree/ElementPath.py", line 389, in find
return next(iterfind(elem, path, namespaces), None)

File "/home/samar/anaconda3/lib/python3.8/xml/etree/ElementPath.py", line 368, in iterfind
selector.append(ops[token[0]](next, token))

File "/home/samar/anaconda3/lib/python3.8/xml/etree/ElementPath.py", line 184, in prepare_descendant
token = next()

File "/home/samar/anaconda3/lib/python3.8/xml/etree/ElementPath.py", line 86, in xpath_tokenizer
raise SyntaxError("prefix %r not found in prefix map" % prefix) from None

File "", line unknown
SyntaxError: prefix 'ix' not found in prefix map
`

The taxonomy was imported sucessfully.

Double ixbrl fillings

Filling has two ixbrl entries, but only the secondary ixbrl carries data:
index:
https://www.sec.gov/Archives/edgar/data/0000944745/000156459021013168/0001564590-21-013168-index.htm

If I try to parse the main one:
https://www.sec.gov/Archives/edgar/data/944745/000156459021013168/civb-20201231.htm

It says:

    inst = XbrlParser(cache).parse_instance(url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 653, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 404, in parse_ixbrl
    if xbrl_resources is None: raise InstanceParseException('Could not find xbrl resources in file')
xbrl.InstanceParseException: Could not find xbrl resources in file

As pointed out the SEC extracted xml already merged the 2 files together, but unfortunately it's not in the SEC zip file.

Shouldn't the lib find the secondary ixbrl, when I provide the main ixbrl?
This is the only reference to secondary file in the main one:

<p style="margin-bottom:8pt;margin-top:0pt;margin-left:0pt;;text-indent:0pt;;font-size:9.5pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;"><a href="civb-ex131_7.htm">
<span style="text-decoration:none;">Statement regarding earnings per share</span>
</a>

Another one:
https://www.sec.gov/Archives/edgar/data/0000021076/000002107621000016/0000021076-21-000016-index.htm

The taxonomy with namespace http://fasb.org/us-gaap/2020-01-31 could not be found

Bug description

The taxonomy with namespace http://fasb.org/us-gaap/2020-01-31 could not be found. Please check if it is imported in the schema file.

Steps to reproduce the behavior

from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser
cache = HttpCache('./cache')
cache.set_headers({'From': '[email protected]', 'User-Agent': 'Tool/Version (Website)'})
xbrlParser = XbrlParser(cache)
url = 'https://www.sec.gov/Archives/edgar/data/1822027/000121390021030040/tekk-20201231.xml'
xbrlParser.parse_instance(url)

Error Trace

Traceback (most recent call last):
  File "/home/pablo/.pyenv/versions/3.7.10/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/pablo/.pyenv/versions/3.7.10/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/pablo/Desktop/repos/data/etls/providers/raw_sec1_extract_fundamentals/src/raw_sec1_extract_fundamentals/__main__.py", line 23, in <module>
    download_files(config, os.getenv("DOWNLOAD_FILES", "download.files"))
  File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/data_components/config/logging.py", line 229, in wrapped_method
    return method(*args, **kwargs)
  File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/data_components/config/logging.py", line 278, in wrapped_method
    ans = method(*args, **kwargs)
  File "/home/pablo/Desktop/repos/data/etls/providers/raw_sec1_extract_fundamentals/src/raw_sec1_extract_fundamentals/__main__.py", line 15, in download_files
    SecDownload(config, config_key).run()
  File "/home/pablo/Desktop/repos/data/etls/providers/raw_sec1_extract_fundamentals/src/raw_sec1_extract_fundamentals/sec1_task.py", line 50, in run
    xbrlParser.parse_instance(url)
  File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 589, in parse_instance
    return parse_xbrl_url(url, self.cache)
  File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 256, in parse_xbrl_url
    return parse_xbrl(instance_path, cache, instance_url)
  File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 288, in parse_xbrl
    context_dir = _parse_context_elements(root.findall('xbrli:context', NAME_SPACES), root.attrib['ns_map'], taxonomy)
  File "/home/pablo/.local/share/virtualenvs/raw_sec1_extract_fundamentals-70AQCo8K/lib/python3.7/site-packages/xbrl/instance.py", line 532, in _parse_context_elements
    if dimension_tax is None: raise TaxonomyNotFound(ns_map[dimension_prefix])
xbrl.TaxonomyNotFound: The taxonomy with namespace http://fasb.org/us-gaap/2020-01-31 could not be found. Please check if it is imported in the schema file

Fails to parse xml

Trying to parse:
https://www.sec.gov/Archives/edgar/data/104169/000010416914000019/wmt-20140131_cal.xml

gives:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'attrib'

for
inst = XbrlParser(cache).parse_instance(url)

trace:
inst = XbrlParser(cache).parse_instance(url)
File "C:\python36\lib\site-packages\xbrl\instance.py", line 652, in parse_instance
return parse_xbrl_url(url, self.cache)
File "C:\python36\lib\site-packages\xbrl\instance.py", line 277, in parse_xbrl_url
return parse_xbrl(instance_path, cache, instance_url)
File "C:\python36\lib\site-packages\xbrl\instance.py", line 293, in parse_xbrl
schema_uri: str = schema_ref.attrib[XLINK_NS + 'href']

parsing uk submission "KeyError: 'bus'"

Hello,
i have encountered this little problem parsing uk submissions
inst = parse_ixbrl(file_path, cache) File "/Users/lafiraed/Documents/finance-pipelines/compagniesHouse/uk_company/xbrl/instance.py", line 407, in parse_ixbrl context_dir = _parse_context_elements(xbrl_resources.findall('xbrli:context', NAME_SPACES), ns_map, taxonomy, cache) File "/Users/lafiraed/Documents/finance-pipelines/compagniesHouse/uk_company/xbrl/instance.py", line 549, in _parse_context_elements dimension_tax = taxonomy.get_taxonomy(ns_map[dimension_prefix]) KeyError: 'bus'
here is the submission file https://drive.google.com/file/d/1Mncf4rW9Dl8nghIjbP28nkcZEiqxQzBV/view?usp=sharing

Parsing of Presentation Linkbase for UK submissions

Hello,
is possible to map the extract facts to these subreport :
cashFlow,
balanceSheet,
profitAndLoss
Best

Rounding error

Parsing
https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/aapl-20210327.htm

for
fact.concept.name == 'UnrecognizedTaxBenefits'

results in
fact.value == 16899999999.999998

source seems to have it correctly:

"us-gaap:UnrecognizedTaxBenefits" scale=3D"9" id=3D"id3VybDovL2RvY3MudjEvZG=
9jOmRhZDhkZWU5YWJlYTQ1NDM4YTBlMDI0ZmZiODE1ZDFhL3NlYzpkYWQ4ZGVlOWFiZWE0NTQzO=
GEwZTAyNGZmYjgxNWQxYV80OS9mcmFnOjAwOWNkNTU0YjAyNzQ4MjI5NmU2MjliY2MyNDkwMDQ3=
L3RleHRyZWdpb246MDA5Y2Q1NTRiMDI3NDgyMjk2ZTYyOWJjYzI0OTAwNDdfMTE3_2dc3aa0b-3=
02e-4999-9819-1ab85e1929c2">16.9</ix:nonfraction> Billion

Add support for iXBRL fact-format ixt-sec:numwordsen

As defined in https://www.sec.gov/info/edgar/edgarfm-vol2-v50.pdf

<ix:nonFraction 
  unitRef="U_xbrlipure" 
  id="F_000636" 
  name="us-gaap:StockholdersEquityNoteStockSplitConversionRatio1" 
  contextRef="C_0001318605_20200810_20200810" 
  decimals="0" 
  format="ixt-sec:numwordsen">
    five
</ix:nonFraction>

2 name space errors

For these two:
https://www.sec.gov/Archives/edgar/data/0001686850/000121390021042028/f10q0621_motusgihold.htm
https://www.sec.gov/Archives/edgar/data/0001553643/000121390021017556/f10k2020_relmadatherapeutic.htm

I get:

Traceback (most recent call last):
    inst = XbrlParser(cache).parse_instance(url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 653, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 406, in parse_ixbrl
    context_dir = _parse_context_elements(xbrl_resources.findall('xbrli:context', NAME_SPACES), ns_map, taxonomy, cache)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 574, in _parse_context_elements
    member_tax = _load_common_taxonomy(cache, ns_map[member_prefix], taxonomy)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 630, in _load_common_taxonomy
    if tax is None: raise TaxonomyNotFound(namespace)
xbrl.TaxonomyNotFound: The taxonomy with namespace http://xbrl.sec.gov/stpr/2021 could not be found. Please check if it is imported in the schema file

Maybe it's missing www?

Is it a filling or a lib issue ?

Cheers

How to generate a class relationship diagram

Actually, it's not a bug report.

I just wonder what tool do you use to generate this image:

Parser crashes for certain ixbrl submissions (could not parse date string)

Describe the bug
A clear and concise description of what the bug is:

Error when parsing date in ixbrl submission

To Reproduce
Steps to reproduce the behavior:

parse 'https://www.sec.gov/Archives/edgar/data/1585521/000158552121000048/zm-20210131.htm'

cache: HttpCache = HttpCache('./../cache')
instance_url = 'https://www.sec.gov/Archives/edgar/data/1585521/000158552121000048/zm-20210131.htm'
inst: XbrlInstance = parse_ixbrl_url(instance_url, cache)
print(inst)

Expected behavior
A clear and concise description of what you expected to happen:

Screenshots
If applicable, add screenshots to help explain your problem:

Is there a way to show the complete tags?

Awesome library. I have a question btw if you don't mind?

Is there a way to show the complete tags for each facts?

ex.

us-gaap:AssetsCurrent
dei:SomeTagIntheDocument
ticker:SomeTagInTheDocument

I'm looking to separate the standard and non-standar tag in the SEC filings

Make code and naming more standard

I believe the way of using the library is not standard when downloading it from pypi.

Just a few non-standard issues about the library:

When importing the library from pypi it is called py-xbrl. However the github project is called xbrl_parser. To download the library you have to run pip install py-xbrl but in the imports you have to write from xbrl_parser.cache import HttpCache. The standard is to have the same naming on both the library and the project.
When using the library you should not be able to import methods. The standard way of doing it is to import a class and then call the methods within the class. I think the standard option would be to have XbrlInstance url parameter in the class constructor of XbrlInstance . Then the constructor would call the different methods to parse the xbrl/ixbrl instance that the url references. I understand this would require some refactoring of the code as you would need to call the xbrl_parse methods with a reference to the XbrlInstance object instead of the urls.

This is how they do it in another library which is more standard:

What are your thoughts about it?

Is there a way to retrieve a fact's ID?

If you look at this tag:

<us-gaap:Assets contextRef="FI2015Q4" decimals="-3" id="Fact-7214827CB0865D3EDB8BC10FF27FAF5E" unitRef="usd">377284000</us-gaap:Assets>

I would like to access the id attribute in order to link together elements with footnoteArc. I don't see anything exposed in AbstractFact or NumericFact, but maybe I'm missing something.

Missing fact from ixbrl

For
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001365135/000155837021005716/wu-20210331x10q.htm

The lib doesn't extract DocumentPeriodEndDate at all, it's there on the web view.
First time I have seen this problem. Even if it's nested I still got back something. I don't think I'm filtering it out.

It only founds these

 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalYearFocus: 2021 2021
 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalPeriodFocus: Q1 Q1
 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalYearFocus: 2021 2021
 2021-01-01 to 2021-03-31 0 dimension | DocumentFiscalPeriodFocus: Q1 Q1

In source:

<span class=3D"html-attribute-name">name</span>=3D"<spa=
n class=3D"html-attribute-value">dei:DocumentPeriodEndDate</span>" <span cl=
ass=3D"html-attribute-name">id</span>=3D"<span class=3D"html-attribute-valu=
e">Narr_VydiiUz0MUOCCrLq-p-Mpw</span>"&gt;</span><span class=3D"html-tag">&=
lt;b <span class=3D"html-attribute-name">style</span>=3D"<span class=3D"htm=
l-attribute-value">font-weight:bold;</span>"&gt;</span>March 31, 2021<span =
class=3D"html-tag">&lt;/b&gt;</span>

Does it work for you?

New 2022 taxonomies

Hi, I've got the batch of missing taxonomies for 2022: (where it says .html just use the .xml version)

Fill: 2022-06-25 Q1 TRNS usr: https://www.sec.gov/Archives/edgar/data/0000099302/000120677422001900/transcat4090061-10q_htm.xml
  error The 4omy with namespace http://fasb.org/srt/2022 could not be found. Please check if it is imported in the schema file
Fill: 2022-06-30 Q2 KNDI usr: https://www.sec.gov/ix?doc=/Archives/edgar/data/0001316517/000121390022045201/f10q0622_kanditech.htm
  error The 4omy with namespace http://fasb.org/us-gaap/2022 could not be found. Please check if it is imported in the schema file
Fill: 2022-06-30 Q2 NUVR usr: https://www.sec.gov/ix?doc=/Archives/edgar/data/0000071557/000151316222000106/nuvr-20220630.htm
  error The taxonomy with namespace http://fasb.org/srt/2022 could not be found. Please check if it is imported in the schema file
Fill: 2022-05-31 Q1 NXTP usr: https://www.sec.gov/ix?doc=/Archives/edgar/data/0001372183/000121390022039436/f10q0522_nextplay.htm
  error The taxonomy with namespace http://fasb.org/us-gaap/2022 could not be found. Please check if it is imported in the schema file
Fill: 2022-04-30 Q3 RFL usr: https://www.sec.gov/ix?doc=/Archives/edgar/data/0001713863/000121390022032831/f10q0422_rafaelholdings.htm
  error The taxonomy with namespace http://xbrl.sec.gov/dei/2021q4 could not be found. Please check if it is imported in the schema file
Fill: 2022-06-30 Q1 GDST usr: https://www.sec.gov/ix?doc=/Archives/edgar/data/0001858007/000121390022047219/f10q0622_goldenstone.htm
  error The taxonomy with namespace http://xbrl.sec.gov/dei/2022 could not be found. Please check if it is imported in the schema file

Bug: instance.json('my-file.json')

The XbrlInstance.json(self, file_path: str = None, override_fact_ids: bool = True) function does not respect the file_path argument. Specifically, if the file_path is specified it writes the json object to a hardcoded filepath, data.json instead of the file_path specified.

if file_path:
    with open('data.json', 'w') as f:
        return json.dump(json_dict, f)
else:
    return json.dumps(json_dict)

KeyError: 'Unit_sqft'

doing https://www.sec.gov/Archives/edgar/data/0000740664/000114420419013512/rfil-20190131.xml

gives:

Traceback (most recent call last):
  File "small_test.py", line 11, in <module>
    inst = XbrlParser(cache).parse_instance(url)
  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 642, in parse_instance
    return parse_xbrl_url(url, self.cache)
  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 277, in parse_xbrl_url
    return parse_xbrl(instance_path, cache, instance_url)
  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 339, in parse_xbrl
    unit: AbstractUnit = unit_dir[fact_elem.attrib['unitRef']]
KeyError: 'Unit_sqft '

Slow parsing on some fillings

MSFT fillings parse very slowly, e.g. parsing only one of them takes 11secs @ 100% CPU.

ixbrl in html seems like a valid xml, cannot we just cut it out, parse it, and never use regexp?
There are 2120074 regexp calls, looks like every tag is searched this way.
Downloading the same file and parsing it with bs4 only takes 4secs: (3s if lxml mode used)
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

python3 -m cProfile -s tottime xbrl_small_test.py > prof.txt

from xbrl.cache import HttpCache
from xbrl.instance import XbrlInstance, XbrlParser

dir = 'cache'
cache = HttpCache(dir)
# !Replace the dummy header with your information! SEC EDGAR require you to disclose information about your bot! (https://www.sec.gov/privacy.htm#security)
cache.set_headers({'From': '[email protected]', 'User-Agent': 'revenue extactor v1.0'})
cache.set_connection_params(delay=1000/9.9, retries=5, backoff_factor=0.8, logs=True)

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
# same as zip:  https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/0001564590-21-002316-xbrl.zip

inst = XbrlParser(cache).parse_instance(url)

Profiling result

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2120074    5.464    0.000    5.464    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slowest part 5.5seconds
  1060027    1.244    0.000    8.874    0.000 uri_helper.py:58(compare_uri)
  2120054    0.861    0.000    7.029    0.000 re.py:214(findall)
531164/2886    0.810    0.000    9.684    0.003 taxonomy.py:170(get_taxonomy)
  2120160    0.703    0.000    0.728    0.000 re.py:286(_compile)
  2160290    0.622    0.000    0.622    0.000 {method 'split' of 'str' objects}
       31    0.193    0.006    0.193    0.006 {method '_parse_whole' of 'xml.etree.ElementTree.XMLParser' objects}
        1    0.139    0.139    0.323    0.323 xml_parser.py:9(parse_file)
      316    0.136    0.000    0.136    0.000 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
     25/1    0.127    0.005    2.553    2.553 taxonomy.py:219(parse_taxonomy)

The call stack to get to the bottleneck:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   10.646   10.646 xbrl_small_test.py:2(<module>)  <-- entry
        1    0.000    0.000   10.318   10.318 instance.py:644(parse_instance)
        1    0.024    0.024   10.318   10.318 instance.py:351(parse_ixbrl_url)
        1    0.016    0.016   10.293   10.293 instance.py:366(parse_ixbrl)
531164/2886    0.799    0.000    9.478    0.003 taxonomy.py:170(get_taxonomy)
  1060027    1.215    0.000    8.679    0.000 uri_helper.py:58(compare_uri)
  2120054    0.847    0.000    6.893    0.000 re.py:214(findall)
  2120074    5.345    0.000    5.345    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slow part

DEI Taxonomy not found

Yes the DEI Taxonomy is not imported in the Taxonomy Extension by this particular filer.
But it is a standard taxonomy that should be covered by

https://github.com/manusimidt/xbrl_parser/blob/9a5e0232c568226f9b9251908ea482803d39c29f/xbrl_parser/taxonomy.py#L147-L156

Traceback (most recent call last):
  File "E:/Programming/python/xbrl_parser/workdir/test_parser.py", line 11, in <module>
    inst: XbrlInstance = parse_xbrl_url(instance_url, cache)
  File "E:\Programming\python\xbrl_parser\xbrl_parser\instance.py", line 256, in parse_xbrl_url
    return parse_xbrl(instance_path, cache, instance_url)
  File "E:\Programming\python\xbrl_parser\xbrl_parser\instance.py", line 282, in parse_xbrl
    context_dir = _parse_context_elements(root.findall('xbrli:context', NAME_SPACES), root.attrib['ns_map'], taxonomy)
  File "E:\Programming\python\xbrl_parser\xbrl_parser\instance.py", line 519, in _parse_context_elements
    if dimension_tax is None: raise TaxonomyNotFound(ns_map[dimension_prefix])
xbrl_parser.TaxonomyNotFound: The taxonomy with namespace http://xbrl.sec.gov/dei/2011-01-31 could not be found. Please check if it is imported in the schema file

undefined entity exception

When parsing:
https://www.sec.gov/Archives/edgar/data/0001764013/000121390019023173/f10q0919_healthsciencesacq.htm

I get exception:
undefined entity: line 15, column 172

via
inst = XbrlParser(cache).parse_instance(url)

py-xbrl==2.0.7

Make delay in HTTP Cache more intelligent

Currently the HTTPCache waits the specified time, regardless of whether another request follows.
https://github.com/manusimidt/xbrl_parser/blob/5a1b3882e8b65f7815bbc330c9c010169c195951/xbrl_parser/cache.py#L77

It should have at least an internal clock that counts the number of millisecs between two requests.

Taxonomy xsd:annotation/xsd:appinfo/link:roleType links aren't populated

Describe the bug
A clear and concise description of what the bug is:

First off, great module! Thanks a ton for putting this out there. I was getting xbrl brain-damage until I came across this repo.

Following logical expression doesn't evaluate as expected:
https://github.com/manusimidt/xbrl_parser/blob/e72c683166d41de1a5eaca87e52971aa5dda7df7/xbrl_parser/taxonomy.py#L179
This is because bool(elr_definition) returns false even if Element.find doesn't return None. I assume because len(elr_definition) evaluates to 0 (or maybe __bool__).

To Reproduce
Steps to reproduce the behavior:

Just observe that parsed instance will have an empty taxonomy.link_roles. Tested with:
http://www.xbrlsite.com/US-GAAP/BasicExample/2010-09-30/abc-20101231.xml

Expected behavior
A clear and concise description of what you expected to happen:

taxonomy.link_roles should be populated. Can change
not elr_definition or not elr_definition.text
to
elr_definition == None or not elr_definition.text

Screenshots
If applicable, add screenshots to help explain your problem:

"Explicit Member"s missing

Given filling:
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001751143/000121390021027034/f10q0321_atlastechnical.htm
or xml:
https://www.sec.gov/Archives/edgar/data/1751143/000121390021027034/f10q0321_atlastechnical_htm.xml

I would like to extract number of shares, now there are ClassA and ClassB ones under the same tag name:
dei:EntityCommonStockSharesOutstanding

But they are have unique names under Explicit Member in web view:

us-gaap:CommonClassAMember
us-gaap:CommonClassBMember

which the lib doesn't see.

In xml I only see the common tag, so they might inherit the unique ones:
<dei:EntityCommonStockSharesOutstanding contextRef="c2" decimals="INF" unitRef="shares">4284023</dei:EntityCommonStockSharesOutstanding>

This might be another nested case.

Failed namespace-uri parsing

Found a few rare cases, parsing these zipped fillings gave "namespace couldn't be found" errors.
It worked great on 2000 other symbols.

See SEC response to this at the end. They seem to suggest that the lib doesn't follow 301 perm.redirection, but I don't think that's the case because code says:
_session.get(url, headers=headers, allow_redirects=True)
Or the response is in a special html header. (location)

symbol LIVX
   https://www.sec.gov/Archives/edgar/data/0001491419/000121390021036869/0001213900-21-036869-xbrl.zip
   LIVX The taxonomy with namespace https://protect2.fireeye.com/v1/url?k=ae6c5932-f1f761c4-ae6cbd84-8681010e5614-13f54919d528bfc9&q=1&e=be10b465-8bbe-4d86-b910-e99fa6de80f7&u=http%3A%2F%2Ffasb.org%2Fus-gaap%2F2021-01-31 could not be found.
 
symbol REX
   https://www.sec.gov/Archives/edgar/data/0000744187/000093041321001146/0000930413-21-001146-xbrl.zip
   REX The taxonomy with namespace https://protect2.fireeye.com/v1/url?k=dee7a9e9-817c911f-dee74d5f-8681010e5614-259af4ed9b5caf8c&q=1&e=be10b465-8bbe-4d86-b910-e99fa6de80f7&u=http%3A%2F%2Ffasb.org%2Fus-gaap%2F2021-01-31 could not be found.
 
symbol MILE
   https://www.sec.gov/Archives/edgar/data/0001819035/000121390021027739/0001213900-21-027739-xbrl.zip
   MILE The taxonomy with namespace http://xbrl.sec.gov/stpr/2018-01-31 could not be found.
 
symbol MOTS
   https://www.sec.gov/Archives/edgar/data/0001686850/000121390021026081/0001213900-21-026081-xbrl.zip
   MOTS The taxonomy with namespace http://xbrl.sec.gov/stpr/2018-01-31 could not be found.

SEC's respons:

We assume you are aware that the namespace-uri is not the URL, but rather, that the namesace-uri designates a location (URL). So, we suspect that your software is trying to retrieve the taxonomy file http://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd. Please note that URL will return a 301 response header:

HTTP/1.1 301 Moved Permanently
Date: Mon, 19 Jul 2021 19:52:38 GMT
Server: AkamaiGHost
Location: https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd
Connection: Keep-Alive
Content-Length: 0

Not all software products will interpret this response correctly (although we haven't seen this particular problem in a couple of years).

xml parsing errors

Lib throws exception on parsing some (new) ixbrl fillings. (list below)
Not sure how SEC tolerates these and what they store in their xml.

lxml==4.6.3
py-xbrl==2.0.2

inst = xbrlParser.parse_instance(url)

Traceback (most recent call last):
  File "parse_sec.py", line 391, in <module>
    resultDict = parse_xml(url, price)
  File "parse_sec.py", line 356, in parse_xml
    flatDict = _get_raw_data(url)
  File "parse_sec.py", line 61, in _get_raw_data
    inst = xbrlParser.parse_instance(url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 626, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\python36\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1221, in iterator
    yield from pullparser.read_events()
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1296, in read_events
    raise event
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1268, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: mismatched tag: line 15, column 172

e.g. all of these:

"no" vs 0 represantation

In
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001733257/000156459021027625/fnch-10q_20210331.htm

for InventoryNet the lib extracts "no", but the website says Fact: 0

So I think there is a special case for interpreting "no" as 0 in the following xml context:

source:
<ix:nonFraction unitRef="U_iso4217USD" id="F_000402" name="us-gaap:InventoryNet" contextRef="C_0001733257_srtCounterpartyNameAxis_fnchOpenBiomeMember_us-gaapTypeOfArrangementAxis_fnchQualitySystemAndSupplyAgreementMember_20210331" decimals="INF" format="ixt-sec:numwordsen" scale="6"><ix:nonFraction unitRef="U_iso4217USD" id="F_000403" name="us-gaap:InventoryNet" contextRef="C_0001733257_srtCounterpartyNameAxis_fnchOpenBiomeMember_us-gaapTypeOfArrangementAxis_fnchQualitySystemAndSupplyAgreementMember_20201231" decimals="INF" format="ixt-sec:numwordsen" scale="6">no</ix:nonFraction>

Support nested Facts

In
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001439725/000156459020056735/bdsx-10q_20200930.htm

for
DocumentPeriodEndDate

returns
'September 30,'

ctx

but web view shows:
September 30, 2020

Works on others.

(not critical, only using it for sanity checking)

Parsing <TEXT> fails

Parsing of
https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm

causes exception in XbrlParser(cache).parse_instance(url)
Saying: not well-formed (invalid token): line 7, column 2 Thus most likely also other fillings from the same company.

SEC's response:

Please look at the contents of the link. You will see that like every other one of the millions of HTML documents on the EDGAR site, the first six lines are document metadata in SGML, that a browser ignores. They look like this:

<DOCUMENT>
<TYPE>10-Q
<SEQUENCE>1
<FILENAME>mtcr-10q_20200930.htm
<DESCRIPTION>10-Q
<TEXT>
 Programs can start parsing after the <TEXT> line and also ignore the last two lines
 </TEXT>
</DOCUMENT>

trace

  File "C:\python36\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\python36\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1221, in iterator
    yield from pullparser.read_events()
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1296, in read_events
    raise event
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1268, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 2

Add support for the ixt-sec transformations.

As defined in:
https://www.sec.gov/info/edgar/specifications/edgarfm-vol2-v51_d.pdf

Sign missing

In
https://www.sec.gov/ix?doc=/Archives/edgar/data/0000320193/000032019321000010/aapl-20201226.htm

for
IncreaseDecreaseInAccountsReceivable

The html displays: (10,945), but the lib returns positive number, weirdly Sign=Positive is stated in xbrl, but the minus sign is needed for change_in_working_capital calculation.

Not well-formed (invalid token) error for ixblr.

Guy, hi

Great project! Learnt a lot about xbrl from discussions and blog posts.

Remote xblr example worked for me, remote ixblr didn't (i think it's cause url provided is regular htm, not ixblr), but even with ixblr document it does not work cause it can't create cache file.

I think issue is ?= in path name.
I tested on windows and termux (linux on android)
In both cases there are enough permissions, and xblr example works.

Details:
1, command executed

# inline xblr

import logging
from xbrl.cache import HttpCache
from xbrl.instance import XbrlInstance, XbrlParser

logging.basicConfig(level=logging.INFO)
cache: HttpCache = HttpCache('./cache')

cache.set_headers({'From': '[email protected]', 'User-Agent': 'Tool/Version (Website)'})
xbrlParser = XbrlParser(cache)

ixbrl_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'
inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)

2, windows error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\site-packages\xbrl\instance.py", line 653, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\site-packages\xbrl\instance.py", line 362, in parse_ixbrl_url
    instance_path: str = cache.cache_file(instance_url)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\site-packages\xbrl\cache.py", line 83, in cache_file
    os.makedirs(file_dir_path)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 2 more times]
  File "C:\Users\Jimmu\AppData\Local\Programs\Python\Python38\lib\os.py", line 223, in makedirs
    mkdir(name, mode)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: './cache/www.sec.gov/ix?doc='

path was created till ix?doc= (basically cache/www.sec.gov)

3, termux error

PermissionError                           Traceback (most recent call last)
Input In [29], in <module>
     11 xbrlParser = XbrlParser(cache)
     13 ixbrl_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'
---> 14 inst: XbrlInstance = xbrlParser.parse_instance(ixbrl_url)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:653, in XbrlParser.parse_instance(self, url)
    651 if url.split('.')[-1] == 'xml' or url.split('.')[-1] == 'xbrl':
    652     return parse_xbrl_url(url, self.cache)
--> 653 return parse_ixbrl_url(url, self.cache)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/instance.py:362, in parse_ixbrl_url(instance_url, cache)
    351 def parse_ixbrl_url(instance_url: str, cache: HttpCache) -> XbrlInstance:
    352     """
    353     Parses a inline XBRL (iXBRL) instance file.
    354     :param cache: HttpCache instance
   (...)
    360     :return:
361     """
--> 362     instance_path: str = cache.cache_file(instance_url)
    363     return parse_ixbrl(instance_path, cache, instance_url)
File /data/data/com.termux/files/usr/lib/python3.10/site-packages/xbrl/cache.py:94, in HttpCache.cache_file(self, file_url)
     90     else:
     91         raise Exception(
     92             "Could not download file from {}. Error code: {}".format(file_url, query_response.status_code))
---> 94 with open(file_path, "wb+") as file:
     95     file.write(query_response.content)
     96     file.close()
PermissionError: [Errno 1] Operation not permitted: './cache/www.sec.gov/ix?doc=/Archives/edgar/data/1671933/000156459021006726/ttd-10k_20201231.htm'

path was created till ttd (basically file creation failed)

Example fails

How to use parse_instance? Please update example.

xbrlParser = XbrlParser(cache)
xbrl_url = 'https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/msft-20170630.xml'
inst: XbrlInstance = XbrlParser.parse_instance(xbrl_url)

File "C:\tmp\SECgov\parse_sec.py", line 18, in
inst: XbrlInstance = XbrlParser.parse_instance(xbrl_url)
TypeError: parse_instance() missing 1 required positional argument: 'url'

xbrlParser = XbrlParser(cache)
xbrl_url = 'https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/msft-20170630.xml'
inst: XbrlInstance = XbrlParser.parse_instance(url=xbrl_url)

File "C:\tmp\SECgov\parse_sec.py", line 18, in
inst: XbrlInstance = XbrlParser.parse_instance(url=xbrl_url)
TypeError: parse_instance() missing 1 required positional argument: 'self'

unresolved schemas

Got some more unresolved schemas.

As I understand these are not real URIs, so what't the official way to resolve them?
There must be a way to look these up instead of hard coding them. Let me ask SEC.

 parsing cache/www.sec.gov/Archives/edgar/data/0000779544/000093041317004111/arkr-20170930.xml  error The taxonomy with namespace http://xbrl.sec.gov/stpr/2011-01-31 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0001213660/000121390021043138/f10q0621_bimiinter_htm.xml  error The taxonomy with namespace http://xbrl.sec.gov/currency/2021 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0000721693/000121390021043080/f10q0621_chinarecycling_htm.xml  error The taxonomy with namespace http://xbrl.sec.gov/currency/2021 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0001066923/000121390021021762/ftft-20201231.xml  error The taxonomy with namespace http://xbrl.sec.gov/currency/2020-01-31 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0000754811/000118518516005657/grow-20160930.xml  error The taxonomy with namespace http://xbrl.sec.gov/country/2016-01-31 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0001316517/000121390021040913/f10q0621_kanditech_htm.xml  error The taxonomy with namespace http://xbrl.sec.gov/naics/2021 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0001464790/000121390021039630/f10q0621_brileyfin_htm.xml  error The taxonomy with namespace http://xbrl.sec.gov/currency/2021 could not be found. Please check if it is imported in the schema file
 parsing cache/www.sec.gov/Archives/edgar/data/0001422892/000121390021050437/f10k2021_sinoglobalship_htm.xml  error The taxonomy with namespace http://xbrl.sec.gov/currency/2021 could not be found. Please check if it is imported in the schema file

Any way to get textual labels?

Is there any way to retrieve the textual label corresponding to a fact when it is in a table as follows?

In the above table, under Net Sales, Products, Services and total net sales each correspond with tag : us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax.

I was able to parse out the member labels to retrieve tags like us-gaap:ServiceMember.

What I would really like to do is to find the relavent table labels of Products, Services, and Total set sales. As far as I can tell these are only in the html portion of xbrl docs.

Is there already a way to access these labels, if not how might I go about it?

Found a possible parsing error:

https://www.sec.gov/ix?doc=/Archives/edgar/data/0000320193/000032019320000096/aapl-20200926.htm

CommonStockParOrStatedValuePerShare should have 2 instances, but the lib only gives back 09/28/2019, even when not filtering for len(fact.context.segments)

Common Stock, Par or Stated Value Per Share
As of 09/26/2020
0.00001

Common Stock, Par or Stated Value Per Share
As of 09/28/2019
0.00001

Same problem for

CommonStockSharesAuthorized (missing As of 09/26/2020)
EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate (missing 12 months ending 09/28/2019)

Originally posted by @mrx23dot in #36 (comment)

Add support for Datetime in context duration.

parsing cache/www.sec.gov/Archives/edgar/data/0000752642/000149315218003093/umh-20171231.xml
parsing cache/www.sec.gov/Archives/edgar/data/0000888491/000114420418026912/ohi-20180331.xml
give:
error unconverted data remains: T00:00:00

maybe a str.split('T')[0] could help.

  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 642, in parse_instance
    return parse_xbrl_url(url, self.cache)
  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 277, in parse_xbrl_url
    return parse_xbrl(instance_path, cache, instance_url)
  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 309, in parse_xbrl
    context_dir = _parse_context_elements(root.findall('xbrli:context', NAME_SPACES), root.attrib['ns_map'], taxonomy, cache)
  File "C:\tmp\py-xbrl_orig\xbrl\instance.py", line 541, in _parse_context_elements
    datetime.strptime(start_date.text.strip(), '%Y-%m-%d').date(),
  File "C:\Python37\lib\_strptime.py", line 577, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "C:\Python37\lib\_strptime.py", line 362, in _strptime
    data_string[found.end():])
  ValueError: unconverted data remains: T00:00:00

Date parsing fails

https://www.sec.gov/Archives/edgar/data/0001580905/000119312519281533/d802337d10q.htm

ixbrl gives:

Traceback (most recent call last):
  File "xbrl_small_test.py", line 14, in <module>
    inst = XbrlParser(cache).parse_instance(url)
  File "py-xbrl_my\xbrl\instance.py", line 653, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "py-xbrl_my\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "py-xbrl_my\xbrl\instance.py", line 426, in parse_ixbrl
    fact_value: str or float = _extract_ixbrl_value(fact_elem)
  File "py-xbrl_my\xbrl\instance.py", line 503, in _extract_ixbrl_value
    parsed_date = strptime(fact_elem.text, '%B %d')
  File "C:\python36\lib\_strptime.py", line 559, in _strptime_time
    tt = _strptime(data_string, format)[0]
  File "C:\python36\lib\_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'December-31' does not match format '%B %d'

index
https://www.sec.gov/Archives/edgar/data/0001580905/000119312519281533/0001193125-19-281533-index.htm

Be nicer to submissions that do not follow the XBRL standard 100%

Implement some functionality that allows also for parsing XBRL reports that are violating the XBRL standart. Maybe just issue a warning and continue with parsing instead of crashing completely.

(from discussion:)
Hey,

the concepts are defined in the different taxonomy schemas imported by the instance document.

For example:
The first submission you provided failed at the concept:
"in-ca:WhetherApprovalTakenFromBoardForMaterialContractsorArrangementsorTransactionsWithRelatedParty"
which is prefixed by xmlns "in-ca". This xml namespace refers to the taxonomy with namespace "http://www.icai.org/xbrl/taxonomy/2016-03-31/in-ca".
This is linked to the schema file located at https://www.mca.gov.in/XBRL/2016/07/26/Taxonomy/CnI/IN-CA/in-ca-2016-03-31.xsd.
There you can check that the above mentioned concept is really not defined.

=> Thus the creator of this filing incorrectly used this non-existing concept which is why py-xbrl crashes.

The problematic line is the following:

py-xbrl/xbrl/instance.py

Line 336 in 7be61f7

concept: Concept = tax.concepts[tax.name_id_map[concept_name]]

Here I just expect the tax.name_id_map to have the given concept (which it also should according to the XBRL standard).

There where several discussions bevore about "How to treat incorrect XBRL". Because many users of py-xbrl just wan't to get data out of the reports and do not care if the report could be parsed 100%.

I plan to implement a functionality which would allow you to parse submissions that are incorrect (and maybe just issue a warning).
But I am not able to work on py-xbrl until Mid July (due to university stuff).

So in the mean time i would suggest to just but a "try-catch" block around the line where it's failing.
Like the following (untested):

# get the concept object from the taxonomy
tax = taxonomy.get_taxonomy(taxonomy_ns)
if tax is None: tax = _load_common_taxonomy(cache, taxonomy_ns, taxonomy)

try:
    concept: Concept = tax.concepts[tax.name_id_map[concept_name]]
    context: AbstractContext = context_dir[fact_elem.attrib['contextRef'].strip()]
except ValueError:
    print(f"All facts with concept {concept_name} will be ignored, due to invalid concept definition")
    continue

Originally posted by @manusimidt in #83 (reply in thread)

Date parsing fails

Parsing the following 2 URLs give date parsing exceptions.

Are they violating the standard or the lib should be able to handle them?
(although it would be risky guessing the date)

CODI time data 'Dec 31' does not match format '%B %d'
url = 'https://www.sec.gov/Archives/edgar/data/0001345126/000134512621000014/codi-20210331.htm'

MFA time data 'Dec 31' does not match format '%B %d'
url = 'https://www.sec.gov/Archives/edgar/data/0001055160/000105516021000007/mfa-20210331.htm'

Traceback (most recent call last):
  File "C:\tmp\small_test.py", line 12, in <module>
    inst = XbrlParser(cache).parse_instance(url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 626, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 424, in parse_ixbrl
    fact_value: str or float = _extract_ixbrl_value(fact_elem)
  File "C:\python36\lib\site-packages\xbrl\instance.py", line 495, in _extract_ixbrl_value
    parsed_date = strptime(fact_elem.text, '%B %d')
  File "C:\python36\lib\_strptime.py", line 559, in _strptime_time
    tt = _strptime(data_string, format)[0]
  File "C:\python36\lib\_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'Dec 31' does not match format '%B %d'

Check if given cache path makes sense

Describe the bug
A clear and concise description of what the bug is:

Not really a bug, but if the user forgets to add the / at the end of the cache path, the HttpCache will store the files in i.e: ./cachewww.sec.gov.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen:

Automatically append the / at the end of the cache path

Screenshots
If applicable, add screenshots to help explain your problem:

cache: HttpCache = HttpCache('./cache')
# should have the same effect as:
cache: HttpCache = HttpCache('./cache/')

Parser interprets .xbrl files as iXBRL

Thx, that works if the file ends on .xml. The XBRL ends on .xbrl by default in The Netherlands, so the class automatically switches to the ixbrl processing variant. Unfortunately I get the same error, but by renaming it works.

Originally posted by @tedjansen in #11 (comment)

zip download fails

I'm trying to speed up the download, and read that the lib supports zip download, but looks like it's not complete yet.
The provided zip file is a valid one.
py-xbrl==2.0.4

from xbrl.cache import HttpCache
from xbrl.instance import XbrlParser, XbrlInstance

cache: HttpCache = HttpCache('./cache')
cache.set_headers({'From': '[email protected]', 'User-Agent': 'zipper 1'})
xbrlParser = XbrlParser(cache)

url = 'http://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/aapl-20210327.htm' # ok
url = 'https://www.sec.gov/Archives/edgar/data/0000320193/000032019321000056/0000320193-21-000056-xbrl.zip' #nok
inst = XbrlParser(cache).parse_instance(url) 

for fact in inst.facts:
  print(fact.concept.name)

gives

Traceback (most recent call last):
  File "C:\Users\Downloads\4\zip_test.py", line 10, in <module>
    inst = XbrlParser(cache).parse_instance(url) # here to be able free up
  File "C:\Python37\lib\site-packages\xbrl\instance.py", line 626, in parse_instance
    return parse_ixbrl_url(url, self.cache)
  File "C:\Python37\lib\site-packages\xbrl\instance.py", line 363, in parse_ixbrl_url
    return parse_ixbrl(instance_path, cache, instance_url)
  File "C:\Python37\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\Python37\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1297, in read_events
    raise event
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1269, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

manusimidt / py-xbrl Goto Github PK

py-xbrl's Introduction

About Me:

Connect:

Recent Projects:

Socials:

Languages

Frameworks & Technologies

GitHub Stats

py-xbrl's People

Contributors

Stargazers

Watchers

Forkers

py-xbrl's Issues

Recommend Projects

Recommend Topics

Recommend Org