
pyxlsb's Introduction

OS:ArchLinux WM:i3 IDE:NeoVim

pyxlsb's People

Contributors

ftaebi, sourcery-ai[bot], twoertwein, willtrnr

pyxlsb's Issues

Create pip release of v1.1.0

The latest version, v1.0.4, requires pandoc in the environment to run setup.py. This dependency seems to have been removed at the HEAD of master, where setup.py has version 1.1.0.

Would you mind creating a pip release? I'm running into an issue where a remote blackbox environment runs setup.py and breaks on this dependency.

Unable to open .xlsb workbook (zipfile.BadZipFile: File is not a zip file)

I did pretty much the same thing as in "Usage", but the same error pops up on every single file.
I even tried random files that are not .xlsb, and the exact same error appears.
pyxlsb==1.0.6

Traceback (most recent call last):
  File "...", line 50, in <module>
    main()
  File "...", line 45, in main
    with open_workbook(source_path) as wb:
  File "C:\Users\...\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyxlsb\__init__.py", line 10, in open_workbook
    zf = ZipFile(name, 'r')
  File "C:\Users\...\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1108, in __init__
    self._RealGetContents()
  File "C:\Users\...\AppData\Local\Programs\Python\Python36-32\lib\zipfile.py", line 1175, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
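
The traceback shows open_workbook passing the path straight to zipfile.ZipFile, so any file that is not a ZIP container fails in exactly this way. A minimal pre-check sketch using only the standard library follows. The helper name looks_like_xlsb is hypothetical, and the expectation that a valid workbook contains an xl/workbook.bin member is an assumption, not something stated in this thread.

```python
import zipfile


def looks_like_xlsb(path):
    """Return True if `path` is a ZIP archive containing xl/workbook.bin.

    Assumption: a usable .xlsb workbook is a ZIP container with that member.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path, "r") as zf:
        return "xl/workbook.bin" in zf.namelist()
```

Calling looks_like_xlsb(source_path) before open_workbook would turn the BadZipFile traceback into an explicit check, which can help tell a damaged or mislabeled download apart from a library problem.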

Add unit tests

Regressions are very likely with this; I have to look into unit testing in Python.

Problem to get first sheet

Hi, there is a hardcoded problem with getting the first sheet. Most developers would expect to get the first sheet with index 0, but there is code that prevents this.

if idx < 0 or idx > len(self.sheets): raise IndexError('sheet index out of range')

This is at lines 51-52 of workbook.py. I found it when I tried to get the first sheet with the get_sheet function of the Workbook object.

Traceback (most recent call last):
  File "/Users/****/virtualenvs/rollout/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-ff99215dc569>", line 1, in <module>
    workbook.get_sheet(0)
  File "/Users/****/virtualenvs/rollout/lib/python2.7/site-packages/pyxlsb/workbook.py", line 52, in get_sheet
    raise IndexError('sheet index out of range')
IndexError: sheet index out of range

But when I then tried to get the first sheet using the natural index (1), I found that it did not return the first sheet; it returned the second.

I think just allowing index 0 in the idx variable fixes the problem. I will try to make a pull request for it.

Are you still maintaining this repo? Let me know.

Thanks
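
For illustration, here is a minimal sketch of a 1-based lookup that rejects 0 and maps index 1 to the first sheet. It assumes the library intends 1-based indexing, which the IndexError for index 0 suggests but this report does not confirm. The function name sheet_name_for_index and the sample sheet names are hypothetical.

```python
def sheet_name_for_index(sheet_names, idx):
    """Map a 1-based sheet index to its name.

    Assumption: indices run from 1 to len(sheet_names) inclusive.
    """
    if idx < 1 or idx > len(sheet_names):
        raise IndexError('sheet index out of range')
    # Subtract 1 so that idx == 1 selects the first entry of the list.
    return sheet_names[idx - 1]


names = ['Summary', 'Data']  # hypothetical sheet names
first = sheet_name_for_index(names, 1)
last = sheet_name_for_index(names, 2)
```

If a lookup written this way forgets the "- 1", index 1 returns the second sheet, which matches the behavior described above.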

Expose the Number Format for the cells

Allow me to see the Number Format for each cell, so that I can properly format the cell's contents, convert the cell to the proper type, and/or pass this along when converting the sheet to xlsx or other formats.

use numba to speed up conversion?

I am converting an XLSB file of about 150 MB. It takes more than 20 minutes to complete, which is too long for me. How do I speed it up? I tried numba, but it did not work, probably because my file mixes text and numbers. Is pyxlsb known to work with numba while reading Excel rows?

What I am after is a fast way to read an XLSB file into a Pandas dataframe.

Here is my current code.

from numba import jit
from pyxlsb import open_workbook as open_xlsb
import pandas as pd

#@jit(nopython=True, parallel=True)        
def xlsb2array(xlsb, sheetnum = 2):
    csvArr = []
    with open_xlsb(xlsb) as wb:
        # Read the sheet to array 
        with wb.get_sheet(sheetnum) as sheet:
            for row in sheet.rows(sparse = True):
                vals = [item.v for item in row]
                csvArr.append(vals)
    return csvArr
df = pd.DataFrame(xlsb2array(myxlsb))

sheet_name works correctly with the real sheet name while providing index is just reading first few rows of the sheet!

import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        async with client.stream('GET', tfile) as r:
            fname = r.headers.get('content-disposition').split('=')[-1]
            async with await trio.open_file(fname, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)

        df = await trio.to_thread.run_sync(partial(pd.read_excel, fname, sheet_name='Master Data', engine="pyxlsb"))
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')

Reading the sheet using sheet_name='Master Data' works fine, but sheet_name=1 reads only about 1.5k rows!

Wrong parse formula doubles values

When loading an xlsb file with formulas, incorrect data is returned by the read_double function.
Can you fix it?
For example, the value in Excel is 0,645383342, but in code we get the bytes b'h1\x12\xec\xc6\xa7\xe4?', which decode to 0.6454805956614509.
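
The decoding step can be checked outside the library. The sketch below interprets the eight bytes quoted above as a little-endian IEEE 754 double with the standard struct module. It only shows what those bytes represent; it does not determine whether the library read the right bytes for that cell.

```python
import struct

# The eight bytes quoted in the report.
raw = b'h1\x12\xec\xc6\xa7\xe4?'

# '<d' means: little-endian, 8-byte IEEE 754 double.
(value,) = struct.unpack('<d', raw)
print(value)
```

If this prints the same number the library returned, the double conversion itself is consistent, and the mismatch with the value shown in Excel would have to come from which bytes were read for the cell.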

Encoding Help

Hi...
I'm having a problem with encoding, and I'm not too familiar with Python, so I'm hoping you have a second to help me out.

I'm using your xlsb to csv example verbatim on a workbook that has lots of foreign (Hungarian) characters. When it hits the first special character, I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 1: ordinal not in range(128)

I expect that the c.v in this line: writer.writerow([c.v for c in row]) needs to be encoded before it is written to the file, but I don't know how to implement the encoding. I tried some things like c.v.encode('utf-8') but, of course, it didn't work. Could you point me in the right direction?

Thanks!
-Alec
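
A minimal sketch of writing non-ASCII values to CSV follows, assuming Python 3, where the csv module accepts text and the encoding is chosen when the file is opened. The u'\xc1' in the error message suggests the report comes from Python 2, where this approach does not apply as written. The function name write_rows_utf8 and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows):
    """Write rows of text values to `path` as UTF-8 encoded CSV."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)


# Hypothetical sample data containing Hungarian characters.
rows = [['Név', 'Város'], ['Ádám', 'Győr']]
out_path = os.path.join(tempfile.mkdtemp(), 'out.csv')
write_rows_utf8(out_path, rows)
```

With the encoding set on the file object, the values from c.v can be passed to writerow unchanged.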

String columns are read as None

Hi, I am attempting to read a file (also attached to this ticket). The integer column (titled Site) is read correctly, but the fields containing string values are all read as None.

I was not able to find a cause for this in the source code. This file is delivered to me on a regular basis, and we have no control over the format the file is provided in. So requesting to export it as xlsx or csv is not really an option.

As GitHub does not support uploading xlsb files, I compressed it into a zip. So the zip is not the direct content of the file.

file.zip

Reproducing:

wb = open_workbook('Netwerk modernisatie week 49.xlsb')
sheet = wb.get_sheet(1)
print(next(sheet.rows()))
[Cell(r=0, c=0, v=None), Cell(r=0, c=1, v=None), Cell(r=0, c=2, v=None), Cell(r=0, c=3, v=None), Cell(r=0, c=4, v=None)]

The first row of the file contains strings. The second row also contains some integer values, which are read correctly.

What am I doing wrong?

chinese garbled

Hi,
when I use pyxlsb to read something that includes Chinese, it always returns garbled Chinese text. I can't fix it by decoding with gbk, utf8, or gb18030. What can I do?

Distinguish between worksheets and chart sheets

It would be useful to have a way of checking which sheets are worksheets and which are chart sheets. Currently (1.0.8), both kinds of sheets appear in sheet_names and the only way to identify a chart sheet is by trying to get_sheet, which raises an "error: unpack requires a buffer of 1 bytes." I tried to check the XLSB standard to see how difficult this would be, but I confess that I quit when I saw there were 1110 pages.

Thanks for providing this module!

Some negative floats are read incorrectly

Hi,

first, thanks a lot for the library!
I found a (quite serious) bug: If I try to load a file containing the numbers "-20222.7" or "-11459.53", they are incorrectly read as "10717195.54" and "10725958.71" respectively. I made a short demo excel:

bug.zip
Content of "bug.xlsb" is:

TEXT
-20'222.80
-20'222.70
-4'718.63
-11'459.53
-4'044.54

If I execute the following code:

import pyxlsb
wb = pyxlsb.open_workbook("bug.xlsb")
sheet = wb.get_sheet(1)
rows = sheet.rows()
for row in rows: print (row)

Output:

python bug.py
[Cell(r=0, c=0, v='TEXT')]
[Cell(r=1, c=0, v=-20222.8)]
[Cell(r=2, c=0, v=10717195.54)]
[Cell(r=3, c=0, v=-4718.63)]
[Cell(r=4, c=0, v=10725958.71)]
[Cell(r=5, c=0, v=-4044.54)]

I'm using 64-bit python 3 on Windows 10 ("Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32")

I have looked quickly at the source code, but I have no idea what causes it. It seems to be very reproducible, however.

Greets,
Samuel
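
One observation on the numbers above: 10717195.54 equals (2^30 - 2022270) / 100 and 10725958.71 equals (2^30 - 1145953) / 100. That is what results when a negative 30-bit integer, stored scaled by 100, is read as unsigned. Excel's binary formats store some numbers in a compact 4-byte RK form: bit 0 means "divide by 100", bit 1 means "integer", and the upper 30 bits hold either a signed integer or the top bits of a double. The sketch below decodes that form with the standard struct module. It is not pyxlsb's code, and the idea that this explains the bug is an assumption consistent with the reported values, not a confirmed diagnosis. The function names are hypothetical.

```python
import struct


def decode_rk(raw):
    """Decode a 4-byte little-endian RK number into a float."""
    (value,) = struct.unpack('<i', raw)  # signed 32-bit read
    div_by_100 = bool(value & 0x01)
    is_integer = bool(value & 0x02)
    if is_integer:
        # Arithmetic shift on a signed value keeps the sign.
        number = float(value >> 2)
    else:
        # The upper 30 bits are the high bits of an IEEE 754 double.
        bits = (value & 0xFFFFFFFC) << 32
        (number,) = struct.unpack('<d', struct.pack('<Q', bits))
    return number / 100 if div_by_100 else number


def decode_rk_unsigned(raw):
    """Integer path with the sign lost (the suspected mistake)."""
    (value,) = struct.unpack('<I', raw)  # unsigned 32-bit read
    return (value >> 2) / 100


# -20222.70 stored as the integer -2022270 with both flag bits set.
raw = struct.pack('<i', (-2022270 << 2) | 0x03)
print(decode_rk(raw))
print(decode_rk_unsigned(raw))
```

The first call returns the expected negative value, while the second returns the large positive value from the report.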

Pulling Named Range or ListObject (e.g. Table) Values Into DataFrame or Nested List

Goal

  1. Read the values from an Excel named range into a nested list like [[1,2,3],[10,20,30]] or Pandas dataframe
  2. Read the values from an Excel table into a nested list or Pandas dataframe

Description
It would be extremely useful to add a method to pull the contents of an entire named range or table (e.g. a ListObject in VBA) directly into a nested list (a matrix such as [[1,2,3], [10,20,30]]) or a Pandas dataframe.

I see reading data from XLSB files as one of the primary use cases of this library, and introducing what I am proposing would make using this package much easier. What I am describing can be done easily using XLWings, but that package is specific to Windows and requires Excel to be installed. This pretty much kills XLWings as an option for server-side applications or for deployments on Linux.

Suggested API
Assume wsht and wbk have been initialized to a worksheet and workbook respectively.

Passing DataFrameQ: Optional[bool] = True to any of the functions forces a dataframe to be returned. By default, the functions should return nested lists.

I see the need for the following three methods:

  1. From a Worksheet Named Range - Should work with both the name of the ranges or a standard address
    Call: wsht.GetRangeValues(rangeName: str, DataFrameQ: Optional[bool] = False)
    Returned Value : [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

  2. From a Workbook Named Range - Should work with both the name of the ranges or a standard address
    Call: wbk.GetRangeValues(rangeName: str, DataFrameQ: Optional[bool] = False)
    Returned Value : [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]

  3. From a Table - Need only work with table names
    Call: wsht.GetTableValues(tableName: str, DataFrameQ: Optional[bool] = False)
    Returned Value : [[val_1_1, val_1_2, ...], [val_2_1, val_2_2, ...], ...]
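
As a rough sketch of the "standard address" case only, the code below collects the values of an A1-style range into a nested list from rows of (r, c, v) cells. Everything here is hypothetical: the Cell tuple is a stand-in shaped like the cells printed elsewhere in these issues, the function names are invented, and resolving named ranges or tables is not covered.

```python
import re
from collections import namedtuple

# Stand-in for a cell with zero-based row, column, and value.
Cell = namedtuple('Cell', ['r', 'c', 'v'])


def col_to_index(letters):
    """Convert column letters ('A', 'AB') to a zero-based index."""
    idx = 0
    for ch in letters.upper():
        idx = idx * 26 + (ord(ch) - ord('A') + 1)
    return idx - 1


def parse_a1_range(address):
    """Parse 'A1:C2' into zero-based (r1, c1, r2, c2)."""
    m = re.fullmatch(r'([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)', address)
    if m is None:
        raise ValueError('unsupported range address: %r' % (address,))
    c1, r1, c2, r2 = m.groups()
    return int(r1) - 1, col_to_index(c1), int(r2) - 1, col_to_index(c2)


def range_values(rows, address):
    """Return the values inside `address` as a nested list."""
    r1, c1, r2, c2 = parse_a1_range(address)
    out = [[None] * (c2 - c1 + 1) for _ in range(r2 - r1 + 1)]
    for row in rows:
        for cell in row:
            if r1 <= cell.r <= r2 and c1 <= cell.c <= c2:
                out[cell.r - r1][cell.c - c1] = cell.v
    return out


rows = [
    [Cell(0, 0, 1), Cell(0, 1, 2), Cell(0, 2, 3)],
    [Cell(1, 0, 10), Cell(1, 1, 20), Cell(1, 2, 30)],
]
matrix = range_values(rows, 'A1:C2')
```

A nested list in this shape can be handed to a dataframe constructor by the caller, which keeps the dataframe dependency optional.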

Sincerely,

Pablo

open_workbook throws KeyError(None) when workbook contains sheet with rId being None

When a workbook contains sheet with rId being None, open_workbook throws KeyError(None).

Related code:

for item in reader:
    if item[0] == biff12.SHEET:
        self._sheets.append((item[1].name, rels[item[1].rId]))
    elif item[0] == biff12.SHEETS_END:
        break

And some debug output:

for item in reader:
  print(item)

(387, 'workbook')
(399, 'sheets')
(412, sheet(sheetId=28, rId=None, name=None))
(412, sheet(sheetId=2, rId='rId1', name='*'))
(412, sheet(sheetId=13, rId='rId2', name='*'))
(412, sheet(sheetId=15, rId='rId3', name='*'))
(412, sheet(sheetId=16, rId='rId4', name='*'))
(412, sheet(sheetId=23, rId='rId5', name='*'))
(412, sheet(sheetId=21, rId='rId6', name='*'))
(412, sheet(sheetId=22, rId='rId7', name='*'))
(412, sheet(sheetId=18, rId='rId8', name='*'))
(412, sheet(sheetId=25, rId='rId9', name='*'))
(412, sheet(sheetId=7, rId='rId10', name='*'))
(412, sheet(sheetId=24, rId='rId11', name='*'))
(412, sheet(sheetId=27, rId='rId12', name='*'))
(412, sheet(sheetId=11, rId='rId13', name='*'))
(412, sheet(sheetId=5, rId='rId14', name='*'))
(400, '/sheets')

(I've masked all sheet names with values as '*'.)

For now, I don't know how to recreate such a *.xlsb file from scratch, and can't share the file I have due to confidentiality reasons. Hopefully the debug output is enough for anyone interested to identify the problem.
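
One possible workaround, sketched under assumptions: skip sheet records whose rId is missing or has no matching relationship instead of indexing rels directly. The SheetRecord tuple below is a stand-in shaped like the debug output above, and collect_sheets, the sample names, and the sample targets are hypothetical. Whether silently skipping such records is the right behavior for the library is an open question.

```python
from collections import namedtuple

# Stand-in mirroring the fields shown in the debug output.
SheetRecord = namedtuple('SheetRecord', ['sheetId', 'rId', 'name'])


def collect_sheets(records, rels):
    """Return (name, target) pairs, skipping records without a usable rId."""
    sheets = []
    for rec in records:
        if rec.rId is None or rec.rId not in rels:
            # No relationship to resolve, so there is no part to open.
            continue
        sheets.append((rec.name, rels[rec.rId]))
    return sheets


records = [
    SheetRecord(sheetId=28, rId=None, name=None),
    SheetRecord(sheetId=2, rId='rId1', name='First'),
    SheetRecord(sheetId=13, rId='rId2', name='Second'),
]
rels = {'rId1': 'worksheets/sheet1.bin', 'rId2': 'worksheets/sheet2.bin'}
sheets = collect_sheets(records, rels)
```

With the first record skipped, the remaining sheets resolve normally and no KeyError(None) is raised.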

How to read only visible sheets

I have a use case where I want to read the content only if the sheet is visible; I shouldn't read hidden values.
Is there a parameter to toggle this?

Python reports unclosed files

While running our project with -W default per the recommendation of Python 3.9, we saw a few warnings about unclosed files which we traced to pyxlsb. Here's a simple demo using 2010_InsPubActs.xlsb which is an XLSB that I found online but I think any XLSB would work.

$ python 
Python 3.8.8 (v3.8.8:024d8058b0, Feb 19 2021, 08:48:17)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyxlsb
>>> pyxlsb.__version__
'1.0.8'
>>> workbook = pyxlsb.open_workbook('2010_InsPubActs.xlsb')
>>> [workbook.get_sheet(sheet_name) for sheet_name in workbook.sheets]
[<pyxlsb.worksheet.Worksheet object at 0x7fd53802be80>]
>>> workbook.close()
>>> ^D
sys:1: ResourceWarning: unclosed file <_io.BufferedRandom name=6>
ResourceWarning: Enable tracemalloc to get the object allocation traceback

I noticed that there's one or two TemporaryFile instances being passed to the call to Worksheet on line 84; maybe they're not getting closed?

Thanks for pyxlsb!
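
Until the cause is confirmed, one defensive pattern on the caller's side is to register every object that owns a file with contextlib.ExitStack, so that each one is closed even if an exception occurs. The sketch below demonstrates the pattern with plain temporary files as stand-ins. Whether applying it to the worksheet objects silences the warning in 1.0.8 is an assumption and has not been tested here.

```python
import tempfile
from contextlib import ExitStack

with ExitStack() as stack:
    # Each temporary file stands in for a resource that must be closed.
    files = [stack.enter_context(tempfile.TemporaryFile()) for _ in range(2)]
    for f in files:
        f.write(b'data')

# Leaving the with-block closes every registered resource.
print(all(f.closed for f in files))
```

For objects that offer close() but are not context managers, stack.enter_context(contextlib.closing(obj)) achieves the same effect.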

how to differentiate float, date, time and datetime cells?

Thanks for this great library, which fills a gap in the world of Python + Excel.

My question is: how do I find the date and time fields among the floats? With the xls and xlsx readers for Python, I was able to tell which cells hold date, time, and float values.

Thanks!
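
For context, Excel stores dates and times as serial numbers (days since an epoch, with the time of day as the fractional part), so a date cell arrives as a plain float. Telling dates apart from ordinary numbers needs the cell's number format; the sketch below covers only the conversion once a value is known to be a date. It assumes the 1900 date system and serials of 61 or more, which avoids Excel's fictitious 29 February 1900. The names EXCEL_EPOCH and serial_to_datetime are hypothetical.

```python
from datetime import datetime, timedelta

# Epoch chosen so that serial 61 maps to 1900-03-01 (1900 date system).
EXCEL_EPOCH = datetime(1899, 12, 30)


def serial_to_datetime(serial):
    """Convert an Excel serial number (>= 61) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


noon = serial_to_datetime(43831.5)
print(noon.isoformat())
```

A value with a zero fractional part is a pure date, and a value below 1 is a pure time of day, which is one heuristic callers sometimes use when number formats are unavailable.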

Unable to get data when the first column is empty

Hi,
thanks for the library.
It works well when the first column is not empty.

INPUT

wb = pyxlsb.open_workbook(xlsb_file)
sheet = wb.get_sheet(sheet_name)
print(sheet.dimension)
for row in sheet.rows():
    pass  # do something with the row

OUTPUT

dimension(r=0, c=1, h=3, w=2)
File "...packages\pyxlsb\worksheet.py", line 71, in rows
    row[item[1].c] = Cell._make([row_num, item[1].c, item[1].v])
IndexError: list assignment index out of range

I'm using 64-bit python 3 on Windows 10.
pyxlsb v1.0.2
Do you have the same issue?
Thank you
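
Reading the output above, the dimension starts at column 1 with a width of 2, so the cells sit at columns 1 and 2. The assignment in the traceback indexes the row by absolute column, which overflows if the row list holds only w entries. That reading is an assumption based on the traceback alone. The sketch below builds a dense row sized by c + w instead; Dimension, Cell, and dense_row are hypothetical stand-ins.

```python
from collections import namedtuple

Dimension = namedtuple('Dimension', ['r', 'c', 'h', 'w'])
Cell = namedtuple('Cell', ['r', 'c', 'v'])


def dense_row(row_num, items, dim):
    """Build a dense row that can be indexed by absolute column.

    `items` is an iterable of (column, value) pairs.
    """
    # Size by first column plus width so the last column index fits.
    width = dim.c + dim.w
    row = [Cell(row_num, col, None) for col in range(width)]
    for col, value in items:
        row[col] = Cell(row_num, col, value)
    return row


dim = Dimension(r=0, c=1, h=3, w=2)
row = dense_row(0, [(1, 'a'), (2, 'b')], dim)
```

An alternative with the same effect is to keep the row at w entries and index it with col - dim.c, which drops the leading empty columns from the result.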

Improve performance with large worksheets

This will serve as an umbrella issue for performance improvement.

Currently there is a bit of copying which could potentially be avoided with BIFF record reading and there's also the possibility of using a C extension (or Cython).
