willtrnr / pyxlsb

Excel 2007+ Binary Workbook (xlsb) reader for Python

License: GNU Lesser General Public License v3.0

Python 100.00%

excel python xlsb

pyxlsb's Issues

convert to a pandas df

I'm trying to write a function to convert a sheet into a pandas dataframe. Is there a way to do this?
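
A minimal sketch of one possible approach, assuming each row is a list of cell objects that expose their value as `.v` (as in the `Cell(r=..., c=..., v=...)` output quoted in other issues on this page) and that the first row holds the column headers. The `Cell` namedtuple and `rows_to_table` are hypothetical stand-ins so the snippet runs without an .xlsb file.

```python
from collections import namedtuple

# Stand-in for the cell objects yielded by sheet.rows(); only `.v` is used.
Cell = namedtuple("Cell", ["r", "c", "v"])


def rows_to_table(rows):
    """Split an iterable of cell rows into (header, data) lists of plain values."""
    values = ([cell.v for cell in row] for row in rows)
    header = next(values, [])   # first row is assumed to be the header
    return header, list(values)


sample = [
    [Cell(0, 0, "Site"), Cell(0, 1, "Name")],
    [Cell(1, 0, 1.0), Cell(1, 1, "Alpha")],
    [Cell(2, 0, 2.0), Cell(2, 1, "Beta")],
]
header, data = rows_to_table(sample)
print(header)  # ['Site', 'Name']
print(data)    # [[1.0, 'Alpha'], [2.0, 'Beta']]
```

With pandas installed, `pd.DataFrame(data, columns=header)` would turn the result into a dataframe, and with a real workbook `rows` would come from `sheet.rows()`. Another issue on this page reads a file directly with `pd.read_excel(..., engine="pyxlsb")`, which may be simpler where the installed pandas supports it.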

Create pip release of v1.1.0

The latest version, v1.0.4, requires pandoc in the environment to run setup.py. This dependency seems to have been removed at the HEAD of master, where setup.py has version 1.1.0.

Would you mind creating a pip release? I'm running into an issue where a remote blackbox environment runs setup.py and breaks on this dependency.

Unable to open .xlsb workbook (zipfile.BadZipFile: File is not a zip file)

I did pretty much the same thing as in "Usage", but the same error pops up on every single file.
I even tried random files that are not .xlsb, and the exact same error appears.
pyxlsb==1.0.6

Traceback (most recent call last):
  File "...", line 50, in <module>
    main()
  File "...", line 45, in main
    with open_workbook(source_path) as wb:
  File "C:\Users\...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyxlsb\__init__.py", line 10, in open_workbook
    zf = ZipFile(name, 'r')
  File "C:\Users\...\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1108, in __init__
    self._RealGetContents()
  File "C:\Users\...\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1175, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
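
The traceback shows that the file is opened with `ZipFile`, so anything that is not a ZIP container fails at this point. A small pre-check with the standard library can tell the two cases apart before the workbook is opened. This is only a sketch; `is_zip_container` and the file names are hypothetical, and a positive result does not prove that the file is a valid workbook.

```python
import os
import tempfile
import zipfile


def is_zip_container(path):
    """Return True if `path` can be opened as a ZIP archive."""
    return zipfile.is_zipfile(path)


# Demo: a plain-text file fails the check, a small ZIP archive passes it.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "not_a_workbook.xlsb")
    with open(text_path, "w") as f:
        f.write("just text")

    zip_path = os.path.join(tmp, "container.xlsb")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("xl/workbook.bin", b"")

    results = (is_zip_container(text_path), is_zip_container(zip_path))

print(results)  # (False, True)
```

If the check returns False for a file that Excel opens normally, the file may be stored in a different container than its extension suggests, for example a renamed file in another format.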

How can I read the binary workbook after opening it?

Build from source fails, if pandoc not present

Lines 8-12 of setup.py should be wrapped in a try/except that falls back to a default README text.

Add unit tests

Regressions are very likely with this; I have to look into unit testing in Python.

Problem to get first sheet

Hi, there is a hardcoded problem when getting the first sheet. Most developers would expect to get the first sheet with index 0, but the following check prevents this:

if idx < 0 or idx > len(self.sheets):
    raise IndexError('sheet index out of range')

This is at lines 51-52 of workbook.py. I found it when I tried to get the first sheet with the get_sheet function of the Workbook object.

Traceback (most recent call last):
  File "/Users/****/virtualenvs/rollout/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-ff99215dc569>", line 1, in <module>
    workbook.get_sheet(0)
  File "/Users/****/virtualenvs/rollout/lib/python2.7/site-packages/pyxlsb/workbook.py", line 52, in get_sheet
    raise IndexError('sheet index out of range')
IndexError: sheet index out of range

When I then tried the natural index (1) to get the first sheet, the file did not return the first sheet; it returned the second.

I think simply allowing index 0 in the idx variable fixes the problem. I will try to make a pull request for it.

Are you still maintaining this repo? Let me know.

Thanks
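
To illustrate the reporter's proposal, here is a sketch of a consistently zero-based lookup. It is not the library's actual code: `get_by_index` is a hypothetical helper operating on a plain list. Note that a check of the form `idx > len(sheets)` still lets `idx == len(sheets)` through, so a zero-based variant needs `>=`.

```python
def get_by_index(sheets, idx):
    """Return sheets[idx] using a zero-based index with an explicit range check."""
    if idx < 0 or idx >= len(sheets):
        raise IndexError('sheet index out of range')
    return sheets[idx]


sheets = ['First', 'Second']
print(get_by_index(sheets, 0))  # First
print(get_by_index(sheets, 1))  # Second
```

Other reports on this page call `get_sheet(1)` to read the first sheet, so whichever convention the installed version uses, it is worth checking before relying on an index.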

Expose the Number Format for the cells

Allow me to see the Number Format for each cell, so that I can properly format the cell's contents, convert the cell to the proper type, and/or pass this along when converting the sheet to xlsx or other formats.

use numba to speed up conversion?

I am converting an XLSB file of about 150 MB. It takes more than 20 minutes to complete, which is too long for me. How can I speed it up? I tried numba, but it did not work, probably because my file mixes text and numbers. Is pyxlsb known to work with numba when reading Excel rows?

What I am after is a fast way to read an XLSB file into a pandas dataframe.

Here is my current code.

from numba import jit
from pyxlsb import open_workbook as open_xlsb
import pandas as pd

#@jit(nopython=True, parallel=True)        
def xlsb2array(xlsb, sheetnum = 2):
    csvArr = []
    with open_xlsb(xlsb) as wb:
        # Read the sheet to array 
        with wb.get_sheet(sheetnum) as sheet:
            for row in sheet.rows(sparse = True):
                vals = [item.v for item in row]
                csvArr.append(vals)
    return csvArr
df = pd.DataFrame(xlsb2array(myxlsb))

Fails to read correct sheet when there is a sheet inside excel containing chart.

If an xlsb file contains a chart sheet, reading the subsequent sheets does not work correctly.

Passing the real sheet name as sheet_name works correctly, while passing an index reads only the first few rows of the sheet!

import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        async with client.stream('GET', tfile) as r:
            fname = r.headers.get('content-disposition').split('=')[-1]
            async with await trio.open_file(fname, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)

        df = await trio.to_thread.run_sync(partial(pd.read_excel, fname, sheet_name='Master Data', engine="pyxlsb"))
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')

Reading the sheet with sheet_name='Master Data' works fine, but sheet_name=1 reads only about 1.5k rows!

Wrong parse formula doubles values

When loading an xlsb file that contains formulas, the read_double function returns incorrect data.
Can you fix it?
For example, the value in Excel is 0,645383342, but in code the bytes are b'h1\x12\xec\xc6\xa7\xe4?', which decode to 0.6454805956614509.
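
The quoted bytes can be checked independently of pyxlsb. The sketch below decodes them with the standard library as a little-endian IEEE 754 double; it reproduces the number from the report, which suggests that decoding these particular eight bytes is not where the value shown in Excel gets lost.

```python
import struct

raw = b'h1\x12\xec\xc6\xa7\xe4?'       # the 8 bytes quoted in the report
(value,) = struct.unpack('<d', raw)    # little-endian IEEE 754 double
print(value)                           # roughly 0.64548, as reported
```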

'utf-16-le' codec can't decode bytes in position 0-1: illegal encoding

One xlsx file can't be read; it raises the exception in the title:

'utf-16-le' codec can't decode bytes in position 0-1: illegal encoding

Searching online, this seems similar to an issue in xlrd:

https://github.com/python-excel/xlrd/issues/126

Or is there a way to override the encoding?

Encoding Help

Hi...
I'm having a problem with encoding and I'm not too familiar with Python, so I'm hoping you have a second to help me out.

I'm using your xlsb to csv example verbatim on a workbook that has lots of foreign (Hungarian) characters. When it hits the first special character, I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 1: ordinal not in range(128)

I expect that the c.v in this line: writer.writerow([c.v for c in row]) needs to be encoded before it is written to the file, but I don't know how to implement the encoding. I tried some things like c.v.encode('utf-8') but, of course, that didn't work. Could you point me in the right direction?

Thanks!
-Alec
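
The error message (`u'\xc1'`, 'ascii' codec) is typical of Python 2. A sketch of how this usually looks under Python 3, where the csv module works with text and the encoding is chosen when the output file is opened rather than per cell. The sample rows are invented for illustration, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Invented sample values containing Hungarian characters.
rows = [["Város", "Érték"], ["Ártánd", 1.5]]

# In-memory stand-in for open(path, "w", encoding="utf-8", newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)

# Encoding happens once, for the whole output, not cell by cell.
encoded = buffer.getvalue().encode("utf-8")
print(encoded.decode("utf-8"))
```

With a real file, passing `encoding="utf-8"` and `newline=""` to `open` has the same effect.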

String columns are read as None

Hi, I am attempting to read a file (attached to this ticket). The integer column (titled Site) is read correctly, but the fields containing string values are all read as None.

I was not able to find a cause for this in the source code. This file is delivered to me on a regular basis, and we have no control over the format in which it is provided, so requesting an export as xlsx or csv is not really an option.

As GitHub does not support uploading xlsb files, I compressed it into a zip, so the zip is not the direct content of the file.

file.zip

Reproducing:

wb = open_workbook('Netwerk modernisatie week 49.xlsb')
sheet = wb.get_sheet(1)
print(next(sheet.rows()))
[Cell(r=0, c=0, v=None), Cell(r=0, c=1, v=None), Cell(r=0, c=2, v=None), Cell(r=0, c=3, v=None), Cell(r=0, c=4, v=None)]

The first row of the file contains strings. The second row also contains some integer values, which are read correctly.

What am I doing wrong?

Add version with the package version

In pandas-dev/pandas#29836, pandas is adding support for reading xlsb files, using pyxlsb as an engine. One difficulty we're having is that we can't check what version of pyxlsb is imported.

PEP 396 recommends a top-level __version__: https://www.python.org/dev/peps/pep-0396/. Could you add that to the package?
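
A sketch of how a caller might read such an attribute defensively, so that older releases without `__version__` do not raise. `module_version` is a hypothetical helper, and the two demo modules are created in memory purely for illustration.

```python
import types


def module_version(module, default=None):
    """Return module.__version__ if present, else `default`."""
    return getattr(module, "__version__", default)


# In-memory demo modules: one with a version attribute, one without.
with_version = types.ModuleType("demo_with_version")
with_version.__version__ = "1.0.8"
without_version = types.ModuleType("demo_without_version")

print(module_version(with_version))     # 1.0.8
print(module_version(without_version))  # None
```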

Retain date (format) values when reading from xlsb

Retain date (format) values when reading from xlsb instead of having numeric values:

Original cell 2-Jan-17 ~ After 42737
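
For reference, a serial number like 42737 can be turned back into a date by hand. This is a sketch assuming the workbook uses Excel's 1900 date system and that the date falls after February 1900; `serial_to_datetime` is a hypothetical helper, not part of pyxlsb.

```python
from datetime import datetime, timedelta

# Day 0 of the 1900 date system, chosen so serials after February 1900 line up.
EXCEL_EPOCH = datetime(1899, 12, 30)


def serial_to_datetime(serial):
    """Convert an Excel serial number (days, with a fractional time part)."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(serial_to_datetime(42737))    # 2017-01-02 00:00:00
print(serial_to_datetime(42737.5))  # 2017-01-02 12:00:00
```

The conversion still requires knowing which cells are dates, which is what this request is about. Later pyxlsb READMEs describe a `convert_date` helper for the conversion itself; whether it is available depends on the installed version.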

chinese garbled

Hi,
When I use pyxlsb to read content that includes Chinese, it always returns garbled Chinese text, and I can't fix it by using encodings such as gbk, utf8, or gb18030. What can I do?

Distinguish between worksheets and chart sheets

It would be useful to have a way of checking which sheets are worksheets and which are chart sheets. Currently (1.0.8), both kinds of sheets appear in sheet_names and the only way to identify a chart sheet is by trying to get_sheet, which raises an "error: unpack requires a buffer of 1 bytes." I tried to check the XLSB standard to see how difficult this would be, but I confess that I quit when I saw there were 1110 pages.

Thanks for providing this module!

Some negative floats are read incorrectly

Hi,

first, thanks a lot for the library!
I found a (quite serious) bug: if I try to load a file containing the numbers "-20222.7" or "-11459.53", they are incorrectly read as "10717195.54" and "10725958.71" respectively. I made a short demo Excel file:

bug.zip
Content of "bug.xlsb" is:

TEXT
-20'222.80
-20'222.70
-4'718.63
-11'459.53
-4'044.54

If I execute the following code:

import pyxlsb
wb = pyxlsb.open_workbook("bug.xlsb")
sheet = wb.get_sheet(1)
rows = sheet.rows()
for row in rows: print (row)

Output:

python bug.py
[Cell(r=0, c=0, v='TEXT')]
[Cell(r=1, c=0, v=-20222.8)]
[Cell(r=2, c=0, v=10717195.54)]
[Cell(r=3, c=0, v=-4718.63)]
[Cell(r=4, c=0, v=10725958.71)]
[Cell(r=5, c=0, v=-4044.54)]

I'm using 64-bit python 3 on Windows 10 ("Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32")

I have looked quickly at the source code, but I have no idea what causes it. It seems to be very well reproducible, however.

Greets,
Samuel
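
The two wrong values follow a suggestive pattern: each equals (2^30 + value × 100) / 100. That is what would result if the cells were stored in the compact RK encoding described in the MS-XLSB specification (a 30-bit payload plus flag bits for "integer" and "divide by 100") and the sign of the payload were ignored. The sketch below only demonstrates that arithmetic; it is an inference from the numbers, not a confirmed diagnosis of pyxlsb, and both function names are hypothetical.

```python
import struct


def encode_rk_int(n, div100=False):
    """Pack integer n into a 32-bit RK value: bit 0 = divide by 100, bit 1 = integer."""
    return ((n & 0x3FFFFFFF) << 2) | 0x02 | (0x01 if div100 else 0x00)


def decode_rk(raw, signed=True):
    """Decode a 32-bit RK value; signed=False mimics ignoring the sign bit."""
    if raw & 0x02:                          # 30-bit integer payload
        value = raw >> 2
        if signed and value & 0x20000000:   # sign bit of the 30-bit field
            value -= 0x40000000
    else:                                   # top 30 bits of an IEEE 754 double
        packed = struct.pack('<I', raw & 0xFFFFFFFC)
        (value,) = struct.unpack('<d', b'\x00\x00\x00\x00' + packed)
    return value / 100 if raw & 0x01 else value


for scaled in (-2022270, -1145953):         # -20222.70 and -11459.53 times 100
    raw = encode_rk_int(scaled, div100=True)
    print(decode_rk(raw), decode_rk(raw, signed=False))
# -20222.7 10717195.54
# -11459.53 10725958.71
```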

can the worksheet be converted into csv without reading row by row?

Hi, I work with xlsb files that have a large number of rows, so I wonder if there is a way to avoid reading the data row by row, as it takes a long time.

Pulling Named Range or ListObject (e.g. Table) Values Into DataFrame or Nested List

Goal

Read the values from an Excel named range into a nested list like [[1,2,3],[10,20,30]] or Pandas dataframe
Read the values from an Excel table into a nested list or Pandas dataframe

Description
It would be extremely useful to add a method that pulls the contents of an entire named range or table (e.g. a ListObject in VBA) directly into a nested list (a matrix such as [[1,2,3], [10,20,30]]) or a pandas dataframe.

I see reading data from XLSB files as one of the primary use cases of this library, and introducing what I am proposing would make this package much easier to use. What I am describing can be done easily with XLWings, but that package is specific to Windows and requires Excel to be installed. This pretty much rules out XLWings for server-side applications or for deployments on Linux.

Suggested API
Assume wsht and wbk have been initialized to a worksheet and workbook respectively.

Passing DataFrameQ: Optional[bool] = True to any of the functions forces a dataframe to be returned. By default, the functions should return nested lists.

I see the need for the following three methods:

From a worksheet named range (should work with both the name of the range and a standard address)
Call: wsht.GetRangeValues(rangeName: str, DataFrameQ: Optional[bool] = False)
Returned value: [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

From a workbook named range (should work with both the name of the range and a standard address)
Call: wbk.GetRangeValues(rangeName: str, DataFrameQ: Optional[bool] = False)
Returned value: [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

From a table (need only work with table names)
Call: wsht.GetTableValues(tableName: str, DataFrameQ: Optional[bool] = False)
Returned value: [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

Sincerely,

Pablo

open_workbook throws KeyError(None) when workbook contains sheet with rId being None

When a workbook contains a sheet whose rId is None, open_workbook throws KeyError(None).

Related code:

pyxlsb/pyxlsb/workbook.py

Lines 43 to 47 in 7dca940

for item in reader:
    if item[0] == biff12.SHEET:
        self._sheets.append((item[1].name, rels[item[1].rId]))
    elif item[0] == biff12.SHEETS_END:
        break

And some debug output:

for item in reader:
  print(item)

(387, 'workbook')
(399, 'sheets')
(412, sheet(sheetId=28, rId=None, name=None))
(412, sheet(sheetId=2, rId='rId1', name='*'))
(412, sheet(sheetId=13, rId='rId2', name='*'))
(412, sheet(sheetId=15, rId='rId3', name='*'))
(412, sheet(sheetId=16, rId='rId4', name='*'))
(412, sheet(sheetId=23, rId='rId5', name='*'))
(412, sheet(sheetId=21, rId='rId6', name='*'))
(412, sheet(sheetId=22, rId='rId7', name='*'))
(412, sheet(sheetId=18, rId='rId8', name='*'))
(412, sheet(sheetId=25, rId='rId9', name='*'))
(412, sheet(sheetId=7, rId='rId10', name='*'))
(412, sheet(sheetId=24, rId='rId11', name='*'))
(412, sheet(sheetId=27, rId='rId12', name='*'))
(412, sheet(sheetId=11, rId='rId13', name='*'))
(412, sheet(sheetId=5, rId='rId14', name='*'))
(400, '/sheets')

(I've masked all sheet names with values as '*'.)

For now, I don't know how to recreate such a *.xlsb file from scratch, and can't share the file I have due to confidentiality reasons. Hopefully the debug output is enough for anyone interested to identify the problem.
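
One possible defensive variant of the loop is to skip sheet records that have no relationship id instead of looking them up. This is a sketch with stand-in data, not a patch against the library: the record ids 412 and 400 are taken from the debug output above, while `collect_sheets`, the `Sheet` namedtuple, the sheet name and the relationship target are hypothetical.

```python
from collections import namedtuple

# Record ids as they appear in the debug output above.
SHEET, SHEETS_END = 412, 400
Sheet = namedtuple("Sheet", ["sheetId", "rId", "name"])


def collect_sheets(reader, rels):
    """Collect (name, target) pairs, skipping sheets without a usable rId."""
    sheets = []
    for rec_id, rec in reader:
        if rec_id == SHEET:
            if rec.rId is None or rec.rId not in rels:
                continue  # nothing to resolve; avoids KeyError(None)
            sheets.append((rec.name, rels[rec.rId]))
        elif rec_id == SHEETS_END:
            break
    return sheets


records = [
    (387, "workbook"),
    (399, "sheets"),
    (SHEET, Sheet(28, None, None)),
    (SHEET, Sheet(2, "rId1", "First")),
    (SHEETS_END, "/sheets"),
]
rels = {"rId1": "worksheets/sheet1.bin"}  # hypothetical relationship target
print(collect_sheets(records, rels))      # [('First', 'worksheets/sheet1.bin')]
```

Whether silently skipping such a sheet is the right behaviour depends on what the record with rId=None represents in the file, which the report does not establish.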

How to read only visible sheets

I have a use case where I want to read the content only if the sheet is visible; I shouldn't read the hidden values.
Is there a parameter to toggle this?

Python reports unclosed files

While running our project with -W default per the recommendation of Python 3.9, we saw a few warnings about unclosed files which we traced to pyxlsb. Here's a simple demo using 2010_InsPubActs.xlsb which is an XLSB that I found online but I think any XLSB would work.

$ python 
Python 3.8.8 (v3.8.8:024d8058b0, Feb 19 2021, 08:48:17)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyxlsb
>>> pyxlsb.__version__
'1.0.8'
>>> workbook = pyxlsb.open_workbook('2010_InsPubActs.xlsb')
>>> [workbook.get_sheet(sheet_name) for sheet_name in workbook.sheets]
[<pyxlsb.worksheet.Worksheet object at 0x7fd53802be80>]
>>> workbook.close()
>>> ^D
sys:1: ResourceWarning: unclosed file <_io.BufferedRandom name=6>
ResourceWarning: Enable tracemalloc to get the object allocation traceback

I noticed that there's one or two TemporaryFile instances being passed to the call to Worksheet on line 84; maybe they're not getting closed?

Thanks for pyxlsb!

how to differentiate float, date, time and datetime cells?

Thanks for this great library, which fills a gap in the world of Python + Excel.

My question is: how can I tell date and time fields apart from floats? With the xls and xlsx readers for Python, I was able to tell which cells hold date, time, and float values.

Thanks!

wb = pyxlsb.open_workbook(xlsb_file)
sheet = wb.get_sheet(sheet_name)
print(sheet.dimension)
for row in sheet.rows():
    do something

OUTPUT

dimension(r=0, c=1, h=3, w=2)
File "...packages\pyxlsb\worksheet.py", line 71, in rows
    row[item[1].c] = Cell._make([row_num, item[1].c, item[1].v])
IndexError: list assignment index out of range

I'm using 64-bit python 3 on Windows 10.
pyxlsb v1.0.2
Do you have the same issue?
Thank you

Improve performance with large worksheets

This will serve as an umbrella issue for performance improvement.

Currently there is a bit of copying which could potentially be avoided with BIFF record reading and there's also the possibility of using a C extension (or Cython).

