tafia / calamine Goto Github PK

View Code? Open in Web Editor NEW

1.7K 1.7K 158.0 5.78 MB

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets

License: MIT License

Rust 100.00%

deserializer excel opendocument-spreadsheet parser rust serde vba

calamine's People

Contributors

Stargazers

Watchers

Forkers

wall-clock pfernie wadavada grissiom ramn pgkos hsivonen kenqyu erickt happy-ferret sunguo alexnpavel kraai laumann p-tuka nabero zetok gzsombor kklibo mattsse pluripotency dskkato novanova1 shamiao coden-labs fstfwd sflanaga isgasho fg51 phiresky cookedsashimi jamesmcm rustyforks syed-ghufran-hassan bonusrk hiraokatakuya onlyone0001 wamich birktj songlinshu armersong progmasys nopey kkirsanov dodomorandi robyoung eqicreate jfcha evilpie daesnorey jayyucn lovingmelody striezel-stash rylwin icodein huntertparks rafagd tradesage sergeilem linecode markcatley rvanheusden andy128k willglynn pombredanne osmos-io k-fujino-kohei msdrigg dhbradshaw visig9 gy0801151351 yutiansut cxh19940504 lynsei api-ops killthemule smuel-adm vbirds mjvl hbina jqnatividad cyroalves saberboy signis-bin cnpryer datalearns dusdanig prettywood letuanhai luausdalmolin econaxis kov xlwings stadelmanma maxverevkin linux-kern tngo04 mayhemheroes denismcolin roblabla

calamine's Issues

ODS partial richtext string issue

Richtext in ODS cells works as long as the richtext part of the string is at the beginning. If it starts inside the string, the cell value is cut where it starts.

See the following pull request for a related testcase #122.

---- partial_richtext_ods stdout ----
thread 'partial_richtext_ods' panicked at 'assertion failed: (left == right)
left: String("a"),
right: String("abc"): Mismatch at position (0, 0)', tests/test.rs:272:5

Sorry, I had to create another ODS file as it seems editing those in the repository broke other tests.

Time is recognized as Float.

time.xlsx

time.xlsx is composed of one cell "00:15:00".
This cell recognized as Float(0.0104166666666667).
I expect as String("00:15:00").
Is this the intended behavior?
The test code is below.

#[test]
fn time() {
    setup();

    let path = format!("{}/tests/time.xlsx", env!("CARGO_MANIFEST_DIR"));
    let mut excel: Xlsx<_> = open_workbook(&path).unwrap();

    let range = excel.worksheet_range("Sheet1").unwrap().unwrap();
    range_eq!(range, [[String("00:15:00".to_string())]]);
}

Range::rows: Do not crash when range is empty

Because of chunks, we cannot actually iterate over empty ranges

Migrate to failure

Also type all string errors into enum variants.

Bug in sheet name parsing

See failing test: https://github.com/ramn/calamine/tree/failing_test_for_sheet_name_parsing

creating nested ranges.

how can i make a range within a range . I have a complicated xlsx file that contains a lot of header and meta data. surrounding that actual data. I was wondering how I can create a nested range to focus on the actual data.

currently I am simply using the worksheet_range function. which creates a range for the whole sheet.

Provide a lazy iterator over the rows

Today the entire range is first saved in memory.
Depending on the need we could provide for both xlsx and xlsm a Row or even a Cell lazy iterator ... and collect them into a Range if needed.

Find a new name

As per reddit thread the name is really not good.

@BrunoQC do you have any idea for the new name?
I'll settle tomorrow (the sooner the better but I need some time to decide).

Some research on the excel verb:
http://www.merriam-webster.com/thesaurus/excel

Some other candidates

xlrs (eXceLRuSt)
excellent / excel-lent (weird meaning in french as it is not necessarily slow)
exceller (there is a wikipedia page about a "exceller" horse, I like the image https://en.wikipedia.org/wiki/Exceller)
xlsr: variant of 1. but starting with "xls" like excel files
excel-reader

I tend to like 1. the most

Proposed by reddit users:

Excellerator (killercup)

bug when some rows has empty cells at the end

I have empty cells on some rows and it causes problems since office considers every row being the same length.

I wanted to fix it but I'm not sure how you would prefer it. Office could have a notion of rows in Range.inner. I'm not sure if we could add empty DataTypes in Range.inner since we would need to know the longest row's length and we would need to make 2 pass on the XML.

I can work on it if you lack the time.

I'll link a test case in which I also test for empty lines. I can split them in 2 test cases and I can also add them to issues.xlsx if you prefer.

<sheetData>
    <row r="1" spans="1:4" s="4" customFormat="1">
        <c r="A1" s="1" t="s">
            <v>0</v>
        </c>
        <c r="B1" s="2" t="s">
            <v>5</v>
        </c>
        <c r="C1" s="3" t="s">
            <v>10</v>
        </c>
        <c r="D1" s="4" t="s">
            <v>15</v>
        </c>
    </row>
    <row r="2" spans="1:4" s="4" customFormat="1">
        <c r="A2" s="1" t="s">
            <v>1</v>
        </c>
        <c r="B2" s="2" t="s">
            <v>6</v>
        </c>
        <c r="C2" s="3" t="s">
            <v>11</v>
        </c>
        <c r="D2" s="4" t="s">
            <v>16</v>
        </c>
    </row>
    <row r="3" spans="1:4" s="4" customFormat="1">
        <c r="A3" s="1" t="s">
            <v>2</v>
        </c>
        <c r="B3" s="2" t="s">
            <v>7</v>
        </c>
        <c r="C3" s="3" t="s">
            <v>12</v>
        </c>
    </row>
    <row r="4" spans="1:4" s="4" customFormat="1">
        <c r="A4" s="1" t="s">
            <v>3</v>
        </c>
        <c r="B4" s="2" t="s">
            <v>8</v>
        </c>
        <c r="C4" s="3" t="s">
            <v>13</v>
        </c>
        <c r="D4" s="4" t="s">
            <v>17</v>
        </c>
    </row>
    <row r="5" spans="1:4" s="4" customFormat="1">
        <c r="A5" s="1" t="s">
            <v>4</v>
        </c>
        <c r="B5" s="2" t="s">
            <v>9</v>
        </c>
        <c r="C5" s="3" t="s">
            <v>14</v>
        </c>
    </row>
</sheetData>

Feature/API request: Open sheets from memory buffers

At the moment, the open API seems to be the only way to open a new workbook. However, it would be good to have also a way to open sheets directly from a memory buffer, without accessing the hard disk.

Imagine that an user has uploaded a smallish XLSX file to your web service, and the backend is written with Rust. You don't want to first save the file to disk (an operation that might have a worst case latency of tens of milliseconds), and then load it there to open it up – instead, you want to directly open the file from a &[u8].

Performance/speed

Hi! I really love the work that you did on this tool, thank you!

I've been looking for a high-performance xlsx parser that works well on large files

I tried to parse a 270,000-row, 13-column, 21MB file and convert it to a csv, using both calamine (the sample code you provided in examples/) and the fastest xlsx library I know of, https://github.com/dilshod/xlsx2csv in python.

xlsx2csv took 1m42s while Calamine converted this in 3m39s.

I was wondering if speed and performance is a goal of this library, and if you have any ideas for speeding things up?

Extracting dates from Excel

In my excel sheet, I've got a column with dates. Unfortunately, calamine::DateType has no variant for dates and when I handle a date as a string I get for 29.08.2012 -> 41181.

It would be nice to have built-in date conversion

read_shared_strings is broken where there's RichTextRun cells

I have a big (14M) excel file with about 20 sheets. I only need to read one sheet and I can validate everything but on the last line some of the strings doesn't match. I have the same row count.

I'll check my Go implementation if I have the same problem and if I did something special.

calamine::xls::parse_dimensions() panics for specific workbook on debug build

fn main() {
    use calamine::{open_workbook, Xls};
    let mut path = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"));
    path.push("car_test.xls");
    let _excel: Xls<_> = open_workbook(&path).unwrap();
}

Run this code on debug mode with car_test.xls from car_test.zip panics here:

thread 'main' panicked at 'attempt to subtract with overflow', C:\--snip--\calamine\src\xls.rs:450:32
stack backtrace:
  11: calamine::xls::parse_dimensions
             at C:\--snip--\calamine\src\xls.rs:450

Runing this code on release mode doesn't panic, because release builds doesn't check overflow. But I'm pretty sure potential bad things will come if u32 overflow happens.

One minute, I'll provide a pull request very soon.

Add rustfmt checks

Expose typed data for dates and times

Currently they're just strings it would be nice if we could add support for exposing the type of these cells.

Random Read

I have a a xlsx file that has a lot of meta data. and i am trying to read specific cells using the get_value function but i seem to be having a little trouble. Is there an example that uses the range.get_value function.

Better Vba management

Using the api is not nice atm because of the Cow.
Maybe doing all the work at once (expecting the vba code to be pretty low anyway) should probably be a better solution.
And then directly call get_references and get_module_names and get_module_(raw) from Excel.

Make office run on stable rust

The only blocking point I think is the impl trait

get_value returns Empty for non-empty cells

If I open this XLS and try to read cell 6,2 (C7) of the sheet "Boys", it returns Some(Empty) when in Excel it contains the number 9. It doesn't seem to be a formula either.

Code to read it:

let mut workbook: Xls<_> = open_workbook("names.xls").unwrap();
let range = workbook.worksheet_range("Boys").unwrap().unwrap();
println!("{:?}", range.get_value((6,2)));

Prints Some(Empty).

Add examples

Can be straightforward but helps a lot when starting using a lib.

Ideas:
Conversion

Excel to csv
Excel to Json

Bulk check for excel errors

error in cells
errors in broken references

Usability of finding a header row and skipping rows

I have to read excel sheets with headers where the header row is not the first row.
I currently try to solve this by first using range.rows().enumerate().find(...) to get a row that contains the header names I expect and then using the index of that row to get a subrange of my original range.

I introduced a bug by using the index I got from that process as the new start and therefore started at e.g. row 3 instead of 26. I fixed that by adding the found header index to the current range start.
I guess this is kind of hard to use :/

Serde serialization and deserialization?

Hello!

Have you considered adding support for serde support for serialization and deserialization, at least for the DataType type? It seems like it would be pretty straightforward. Are there any downsides to doing this that I'm not thinking of?

lifetime issue when itering over the `sheet_names()`

Here is the code:

    let mut xl = Excel::open(&sce).unwrap();
    for stn in xl.sheet_names().unwrap() {
        let range = xl.worksheet_range(stn).unwrap();
        write_range(&mut dest, range).unwrap();
    }

I want to iter over the sheets with sheet names. But when I compile it, I got lifetime issue:

error[E0499]: cannot borrow `xl` as mutable more than once at a time
  --> src/main.rs:28:21
   |
27 |     for stn in xl.sheet_names().unwrap() {
   |                -- first mutable borrow occurs here
28 |         let range = xl.worksheet_range(stn).unwrap();
   |                     ^^ second mutable borrow occurs here
29 |         write_range(&mut dest, range).unwrap();
30 |     }
   |     - first borrow ends here

error: aborting due to previous error

Is there any way to fix it easily?

getting the name of the author

Hello I was wondering if there was a way to find the name of the author of a sheet in the same way the file command:
file.xls: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1252, Author: AUTHOR_NAME, Last Saved By: AUTHOR_NAME, Last Printed: Wed Apr 10 07:55:25 2019, Create Time/Date: Wed Apr 10 07:57:54 2019, Last Saved Time/Date: Wed Apr 10 07:57:54 2019, Security: 0

I don't find anything in the documentation :/

Make clippy happy

linked to #8

Reader with data types casting on all columns or specific columns in range

Hi
First of all, thank you so much for making this library! I have been trying to get an xlsb read into pandas dataframe in python and your library has been a godsend.

My issue is that my .xlsb file has some strings that look like numbers: 00000000231, 00000000249, 00000000746. However the reader reads those values as Integers 231, 249, 746.

Is there anyway to force the entire range to be read as a string vs each data type? Sorry I am new to Rust programming so I couldn't find anything on the readthedocs site under the following links:

pub fn read_xlsb(path: &str)->Result<Vec<Vec<String>>, Error>{
    let mut excel: Xlsb<_> = open_workbook(path).expect("Incorrect path");
    if let Some(Ok(r)) = excel.worksheet_range("Default"){
        let width = r.get_size().1-1;
        let mut buffered = Vec::new();
        let rows = r.rows().take(10);
        for row in rows{
            let mut string = Vec::new();
            for value in row{
                match *value{
                    DataType::Empty=>string.push(String::from(" ")),
                    DataType::String(ref s)=>string.push(String::from(format!("{}",s))),
                    DataType::Int(ref i)=>string.push(String::from(format!("{}", i))),
                    DataType::Float(ref f)=>string.push(String::from(format!("'{}',", f))),
                    DataType::Bool(ref b)=>string.push(String::from(format!("'{}',",b))),
                    DataType::Error(ref e)=>string.push(String::from(format!("'{:?}',", e)))
                }
            }
            buffered.push(string);
        }
        Ok(buffered)
    }
    else{
        return Err(Error::new(ErrorKind::NotFound, "File NOT FOUND"));
    }
    
}

Does calamine support the XML 2003 format?

It's a non-zipped XML.

Error types don't implement std::eror::Error

Please add the respective implementations.

Documentation enhancement: How do I query a range of cells?

The examples I found all use the whole of sheet1 as the range. I have an Excel file that has a header row, followed by data rows grouped into sets of 4 columns each.

For external reasons I can't use export to CSV, or change the spreadsheet in any way.

I'd like to be able to find an example in the documentation describing how to convert the string b"A2:D91" to a Range that can be used within the Serde Deserialization example, instead of getting the range via workbook.worksheet_range("Sheet1")

Loading sheets with different locale depends on sheet name

As an intro into Rust, I set a task for myself to convert some Excel files into JSON. My Excel files only have one sheet, they're basically csvs with numbers and strings,

I run into the issue that I need to enter the sheet name to open it i.e. workbook.worksheet_range("Sheet1"). When I tried running the code above I got an error "Sheet \'Sheet1\' does not exist"'. I opened the Excel to confirm it had the sheet, and realized the name was "Tabelle1" as it's created with the German locale. I realized that the actual user defined sheet name has to be used for workbook.worksheet_range() if I understand the code in read_sheets_names() correctly.
I was hoping to find out more and dug through the calamine code to figure out if there is another option. I didn't find any other way to reference the sheets. This feels like a good point for an enhancement request. The good thing is that workbook.worksheet_range("Tabelle1") opens the file correctly and I get the data.

After unzipping the excel file, I believe a solution might be something like:
The opened workbook.xml has a sheetId property and the actual sheet xml name still stays sheet1.xml in my case. I'm not sure if there are differences between the format revisions. I only tested on Excel2010. So I'm guessing that the path or the sheetId can be used for another reference. For instance when fetching sheets by name in read_sheets_names() we add the sheetId of the sheet to the resulting sheets object. Then a new function could take advantage of this and get the sheet by passing in it's number, i.e. "1".

That's my suggestion, I'd submit a pull request, as I said this is my first attempt at using rust, so I'll leave it at the suggestion level.
Thanks for the work on the wrapper, it's way over my head, but feels good to read the code and try and understand it 😃. Happy new year!

xlsm: rows not splitted correctly

| 1 | a |
| 2 | b |
| 3 | c |

output

0,0: Int(1)
0,1: String("a")
1,0: String("b")
1,1: Int(3)
row: [Int(1), String("a"), Int(2)]
row: [String("b"), Int(3), String("c")]

code

extern crate office;
use office::Excel;

fn main() {
    let mut excel = Excel::open("test.xlsm").unwrap();
    let r = excel.worksheet_range("Feuil1").unwrap();
    println!("0,0: {:?}", r.get_value(0, 0));
    println!("0,1: {:?}", r.get_value(0, 1));
    println!("1,0: {:?}", r.get_value(1, 0));
    println!("1,1: {:?}", r.get_value(1, 1));

    for row in r.rows() {
        println!("row: {:?}", row);
    }   
}

can't parse a float if the excel's format is numeric without precision

I have a '0.5' value. In Excel I see '1'. The cells' format seems to be set to numeric with 0 precision digits.

Is it possible write excel files with this library?

Or some maybe somework to be done to do it?

Deserialize cells with numbers as strings

I am writing some software that deserializes a excel file as a Vec<HashMap<String, Value>> where Value is my own internal value type. I do this using the RangeDeserializer, with the headers setting set to true. However I am running into some problems when my header contains numbers, is deserialized as such, and I get an error about expecting string and getting a number. I guess a simple fix would to use my internal value type which can be deserialized as both a floating point and a string and then manually converting it to a string, however if the header is parsed as a boolean I would have problem there as well. I don't know if this is considered a bug or not, but is there any workaround other than using Range::rows() and manually implementing the deserializing.

xls: Manage Continue records

Continue records are not managed yet
https://msdn.microsoft.com/en-us/library/dd949081(v=office.12).aspx

Use a streaming iterator when parsing xlsx

Today the entire file is saved in memory. We can probably have a BufReader<ZipFile> wrapper like what is done in xlsb to be a bit more efficient.

error "not v node" when the cell is a formula

formula: CONCAT("a", "b")

Clean the project and publish to crates.io

Out of my mind:

license
better readme (a status about xlsb and xls management out of vba) (#20)
documentation (will then probably rely on doc.rs
more tests
travis and appveyor badges
clippy (#18)
rustfmt (#21)
use stable toolchain per default (#15 #19)

Read xls (CFB) cells

This is probably the only missing part before covering all excel files.

XLSX with passwords

It seems that no Rust crate at the moment supports encrypted spreadsheets but is there any plans? It seems that password support is by no means a trivial feature, since there are many different schemes to encrypt an XLSX file. I don't even know if some are more common than others: http://poi.apache.org/encryption.html

I think that the Apache POI library is the gold standard standard for supporting stuff: https://github.com/apache/poi

As one step towards a better direction, it would be nice if it could recognize and report the error correctly, as it does with the XLS format.

Don't clone on returning the sheet names

The sheet_names method clones the vector of the sheet names. I get that this isn't likely to be on the hot path, but cloning and allocating should usually left to the user when possible. (Nothing wrong with convenience methods though). As an alternative, it would be a good practice to provide an iterator over the sheet names: sheet_names_iter, with Item=&str. This avoids both allocation and cloning, leaving the maximum flexibility in the hands of the user.

Note in README that `xls` is not supported

ref https://github.com/tafia/office/blob/755fee75cd09ad26257c80530a16c92d704a0e9e/examples/excel_to_csv.rs#L20

Feature request: original order preservation for Ods::sheet_names()

Great library, it's saved me countless hours of work. I'm using it to read ods spreadsheets, and I could use a list of sheet names in the original order that they appeared in the file.

Bug(s) in indexing worksheets

#113 contains code aimed at debugging indexing bugs.

let range = workbook.worksheet_range(name_worksheet.as_str());
if range.is_empty() {
    continue
}
let (row_max, col_max) = (row_max_usize as u32, col_max_usize as u32);
trace!("{}x{} matrix", row_max, col_max);
for index_col in 0..col_max {
    for index_row in 0..row_max {
        // panics
        let cell = range.get_value((index_row, index_col));
    }
}

OpenOffce Support

I hope you support .ods :)

Support SPSS files

This seems like a file format that fits very well with the possible applications for Calamine.

http://justsolve.archiveteam.org/wiki/SAV

`get_value` returns incorrect value when max. index is overflown

Consider Calamine 0.14.0, and an XLSX spreadsheet with one worksheet covered by range.
Given range.end() == (m_max, n_max) where m_max >= 0 and n_max >= 0,
let a = range.get_value((m, n)); where n == n_max + 1,
let b = range.get_value((m, n)); where n == 0,
then a == b.

This is incorrect. None should be returned when the index of any dimension is overflown wrt. to the maximum index.

xlsm: no sheet

There seems to be no sheet if I only have one line in a xlsm file.

| 1 | a |

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Msg("Sheet \'Feuil1\' does not exist"), (None, None))', ../src/libcore/result.rs:799

extern crate office;
use office::Excel;

fn main() {
    let mut excel = Excel::open("test2.xlsm").unwrap();
    let r = excel.worksheet_range("Feuil1").unwrap();
    println!("0,0: {:?}", r.get_value(0, 0));
    println!("0,1: {:?}", r.get_value(0, 1));

    for row in r.rows() {
        println!("row: {:?}", row);
    }
}

Improve readme

Adds several missing parts

Examples
- ~~Example for reading cells~~
- ~~Example for reading vba, and checking references~~
Better description of the library
Benchmarks?
Current limitations
- ~~cannot read xlsb cells yet~~
- cannot read xls cells yet
- inconsistent vba string parsing (missing a proper MBSC encoding)
- plus many missing functions
- no writing, only read