tafia / calamine Goto Github PK
View Code? Open in Web Editor NEWA pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
License: MIT License
A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
License: MIT License
Richtext in ODS cells works as long as the richtext part of the string is at the beginning. If it starts inside the string, the cell value is cut where it starts.
See the following pull request for a related testcase #122.
---- partial_richtext_ods stdout ----
thread 'partial_richtext_ods' panicked at 'assertion failed: (left == right)
left: String("a")
,
right: String("abc")
: Mismatch at position (0, 0)', tests/test.rs:272:5
Sorry, I had to create another ODS file as it seems editing those in the repository broke other tests.
time.xlsx is composed of one cell "00:15:00".
This cell recognized as Float(0.0104166666666667).
I expect as String("00:15:00").
Is this the intended behavior?
The test code is below.
#[test]
fn time() {
setup();
let path = format!("{}/tests/time.xlsx", env!("CARGO_MANIFEST_DIR"));
let mut excel: Xlsx<_> = open_workbook(&path).unwrap();
let range = excel.worksheet_range("Sheet1").unwrap().unwrap();
range_eq!(range, [[String("00:15:00".to_string())]]);
}
Because of chunks
, we cannot actually iterate over empty ranges
Also type all string errors into enum variants.
how can i make a range within a range . I have a complicated xlsx file that contains a lot of header and meta data. surrounding that actual data. I was wondering how I can create a nested range to focus on the actual data.
currently I am simply using the worksheet_range function. which creates a range for the whole sheet.
Today the entire range is first saved in memory.
Depending on the need we could provide for both xlsx and xlsm a Row
or even a Cell
lazy iterator ... and collect them into a Range if needed.
As per reddit thread the name is really not good.
@BrunoQC do you have any idea for the new name?
I'll settle tomorrow (the sooner the better but I need some time to decide).
Some research on the excel verb:
http://www.merriam-webster.com/thesaurus/excel
Some other candidates
I tend to like 1. the most
Proposed by reddit users:
I have empty cells on some rows and it causes problems since office considers every row being the same length.
I wanted to fix it but I'm not sure how you would prefer it. Office could have a notion of rows in Range.inner. I'm not sure if we could add empty DataTypes in Range.inner since we would need to know the longest row's length and we would need to make 2 pass on the XML.
I can work on it if you lack the time.
I'll link a test case in which I also test for empty lines. I can split them in 2 test cases and I can also add them to issues.xlsx if you prefer.
<sheetData>
<row r="1" spans="1:4" s="4" customFormat="1">
<c r="A1" s="1" t="s">
<v>0</v>
</c>
<c r="B1" s="2" t="s">
<v>5</v>
</c>
<c r="C1" s="3" t="s">
<v>10</v>
</c>
<c r="D1" s="4" t="s">
<v>15</v>
</c>
</row>
<row r="2" spans="1:4" s="4" customFormat="1">
<c r="A2" s="1" t="s">
<v>1</v>
</c>
<c r="B2" s="2" t="s">
<v>6</v>
</c>
<c r="C2" s="3" t="s">
<v>11</v>
</c>
<c r="D2" s="4" t="s">
<v>16</v>
</c>
</row>
<row r="3" spans="1:4" s="4" customFormat="1">
<c r="A3" s="1" t="s">
<v>2</v>
</c>
<c r="B3" s="2" t="s">
<v>7</v>
</c>
<c r="C3" s="3" t="s">
<v>12</v>
</c>
</row>
<row r="4" spans="1:4" s="4" customFormat="1">
<c r="A4" s="1" t="s">
<v>3</v>
</c>
<c r="B4" s="2" t="s">
<v>8</v>
</c>
<c r="C4" s="3" t="s">
<v>13</v>
</c>
<c r="D4" s="4" t="s">
<v>17</v>
</c>
</row>
<row r="5" spans="1:4" s="4" customFormat="1">
<c r="A5" s="1" t="s">
<v>4</v>
</c>
<c r="B5" s="2" t="s">
<v>9</v>
</c>
<c r="C5" s="3" t="s">
<v>14</v>
</c>
</row>
</sheetData>
At the moment, the open
API seems to be the only way to open a new workbook. However, it would be good to have also a way to open sheets directly from a memory buffer, without accessing the hard disk.
Imagine that an user has uploaded a smallish XLSX file to your web service, and the backend is written with Rust. You don't want to first save the file to disk (an operation that might have a worst case latency of tens of milliseconds), and then load it there to open it up โ instead, you want to directly open the file from a &[u8]
.
Hi! I really love the work that you did on this tool, thank you!
I've been looking for a high-performance xlsx parser that works well on large files
I tried to parse a 270,000-row, 13-column, 21MB file and convert it to a csv, using both calamine (the sample code you provided in examples/) and the fastest xlsx library I know of, https://github.com/dilshod/xlsx2csv in python.
xlsx2csv took 1m42s while Calamine converted this in 3m39s.
I was wondering if speed and performance is a goal of this library, and if you have any ideas for speeding things up?
In my excel sheet, I've got a column with dates. Unfortunately, calamine::DateType
has no variant for dates and when I handle a date as a string I get for 29.08.2012
-> 41181
.
It would be nice to have built-in date conversion
I have a big (14M) excel file with about 20 sheets. I only need to read one sheet and I can validate everything but on the last line some of the strings doesn't match. I have the same row count.
I'll check my Go implementation if I have the same problem and if I did something special.
fn main() {
use calamine::{open_workbook, Xls};
let mut path = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"));
path.push("car_test.xls");
let _excel: Xls<_> = open_workbook(&path).unwrap();
}
Run this code on debug mode with car_test.xls
from car_test.zip panics here:
thread 'main' panicked at 'attempt to subtract with overflow', C:\--snip--\calamine\src\xls.rs:450:32
stack backtrace:
11: calamine::xls::parse_dimensions
at C:\--snip--\calamine\src\xls.rs:450
Runing this code on release mode doesn't panic, because release builds doesn't check overflow. But I'm pretty sure potential bad things will come if u32
overflow happens.
One minute, I'll provide a pull request very soon.
Currently they're just strings it would be nice if we could add support for exposing the type of these cells.
I have a a xlsx file that has a lot of meta data. and i am trying to read specific cells using the get_value function but i seem to be having a little trouble. Is there an example that uses the range.get_value function.
Using the api is not nice atm because of the Cow
.
Maybe doing all the work at once (expecting the vba code to be pretty low anyway) should probably be a better solution.
And then directly call get_references
and get_module_names
and get_module_(raw)
from Excel
.
The only blocking point I think is the impl trait
If I open this XLS and try to read cell 6,2 (C7) of the sheet "Boys", it returns Some(Empty)
when in Excel it contains the number 9. It doesn't seem to be a formula either.
Code to read it:
let mut workbook: Xls<_> = open_workbook("names.xls").unwrap();
let range = workbook.worksheet_range("Boys").unwrap().unwrap();
println!("{:?}", range.get_value((6,2)));
Prints Some(Empty)
.
Can be straightforward but helps a lot when starting using a lib.
Ideas:
Conversion
Bulk check for excel errors
I have to read excel sheets with headers where the header row is not the first row.
I currently try to solve this by first using range.rows().enumerate().find(...)
to get a row that contains the header names I expect and then using the index of that row to get a subrange of my original range.
I introduced a bug by using the index I got from that process as the new start and therefore started at e.g. row 3 instead of 26. I fixed that by adding the found header index to the current range start.
I guess this is kind of hard to use :/
Hello!
Have you considered adding support for serde support for serialization and deserialization, at least for the DataType
type? It seems like it would be pretty straightforward. Are there any downsides to doing this that I'm not thinking of?
Here is the code:
let mut xl = Excel::open(&sce).unwrap();
for stn in xl.sheet_names().unwrap() {
let range = xl.worksheet_range(stn).unwrap();
write_range(&mut dest, range).unwrap();
}
I want to iter over the sheets with sheet names. But when I compile it, I got lifetime issue:
error[E0499]: cannot borrow `xl` as mutable more than once at a time
--> src/main.rs:28:21
|
27 | for stn in xl.sheet_names().unwrap() {
| -- first mutable borrow occurs here
28 | let range = xl.worksheet_range(stn).unwrap();
| ^^ second mutable borrow occurs here
29 | write_range(&mut dest, range).unwrap();
30 | }
| - first borrow ends here
error: aborting due to previous error
Is there any way to fix it easily?
Hello I was wondering if there was a way to find the name of the author of a sheet in the same way the file command:
file.xls: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1252, Author: AUTHOR_NAME, Last Saved By: AUTHOR_NAME, Last Printed: Wed Apr 10 07:55:25 2019, Create Time/Date: Wed Apr 10 07:57:54 2019, Last Saved Time/Date: Wed Apr 10 07:57:54 2019, Security: 0
I don't find anything in the documentation :/
linked to #8
Hi
First of all, thank you so much for making this library! I have been trying to get an xlsb read into pandas dataframe in python and your library has been a godsend.
My issue is that my .xlsb file has some strings that look like numbers: 00000000231, 00000000249, 00000000746. However the reader reads those values as Integers 231, 249, 746.
Is there anyway to force the entire range to be read as a string vs each data type? Sorry I am new to Rust programming so I couldn't find anything on the readthedocs site under the following links:
pub fn read_xlsb(path: &str)->Result<Vec<Vec<String>>, Error>{
let mut excel: Xlsb<_> = open_workbook(path).expect("Incorrect path");
if let Some(Ok(r)) = excel.worksheet_range("Default"){
let width = r.get_size().1-1;
let mut buffered = Vec::new();
let rows = r.rows().take(10);
for row in rows{
let mut string = Vec::new();
for value in row{
match *value{
DataType::Empty=>string.push(String::from(" ")),
DataType::String(ref s)=>string.push(String::from(format!("{}",s))),
DataType::Int(ref i)=>string.push(String::from(format!("{}", i))),
DataType::Float(ref f)=>string.push(String::from(format!("'{}',", f))),
DataType::Bool(ref b)=>string.push(String::from(format!("'{}',",b))),
DataType::Error(ref e)=>string.push(String::from(format!("'{:?}',", e)))
}
}
buffered.push(string);
}
Ok(buffered)
}
else{
return Err(Error::new(ErrorKind::NotFound, "File NOT FOUND"));
}
}
It's a non-zipped XML.
Please add the respective implementations.
The examples I found all use the whole of sheet1 as the range. I have an Excel file that has a header row, followed by data rows grouped into sets of 4 columns each.
For external reasons I can't use export to CSV, or change the spreadsheet in any way.
I'd like to be able to find an example in the documentation describing how to convert the string b"A2:D91" to a Range that can be used within the Serde Deserialization example, instead of getting the range via workbook.worksheet_range("Sheet1")
As an intro into Rust, I set a task for myself to convert some Excel files into JSON. My Excel files only have one sheet, they're basically csvs with numbers and strings,
I run into the issue that I need to enter the sheet name to open it i.e. workbook.worksheet_range("Sheet1")
. When I tried running the code above I got an error "Sheet \'Sheet1\' does not exist"'
. I opened the Excel to confirm it had the sheet, and realized the name was "Tabelle1"
as it's created with the German locale. I realized that the actual user defined sheet name has to be used for workbook.worksheet_range()
if I understand the code in read_sheets_names()
correctly.
I was hoping to find out more and dug through the calamine code to figure out if there is another option. I didn't find any other way to reference the sheets. This feels like a good point for an enhancement request. The good thing is that workbook.worksheet_range("Tabelle1")
opens the file correctly and I get the data.
After unzipping the excel file, I believe a solution might be something like:
The opened workbook.xml
has a sheetId
property and the actual sheet xml name still stays sheet1.xml
in my case. I'm not sure if there are differences between the format revisions. I only tested on Excel2010. So I'm guessing that the path or the sheetId
can be used for another reference. For instance when fetching sheets by name in read_sheets_names()
we add the sheetId
of the sheet to the resulting sheets
object. Then a new function could take advantage of this and get the sheet by passing in it's number, i.e. "1".
That's my suggestion, I'd submit a pull request, as I said this is my first attempt at using rust, so I'll leave it at the suggestion level.
Thanks for the work on the wrapper, it's way over my head, but feels good to read the code and try and understand it ๐. Happy new year!
| 1 | a |
| 2 | b |
| 3 | c |
output
0,0: Int(1)
0,1: String("a")
1,0: String("b")
1,1: Int(3)
row: [Int(1), String("a"), Int(2)]
row: [String("b"), Int(3), String("c")]
code
extern crate office;
use office::Excel;
fn main() {
let mut excel = Excel::open("test.xlsm").unwrap();
let r = excel.worksheet_range("Feuil1").unwrap();
println!("0,0: {:?}", r.get_value(0, 0));
println!("0,1: {:?}", r.get_value(0, 1));
println!("1,0: {:?}", r.get_value(1, 0));
println!("1,1: {:?}", r.get_value(1, 1));
for row in r.rows() {
println!("row: {:?}", row);
}
}
I have a '0.5' value. In Excel I see '1'. The cells' format seems to be set to numeric with 0 precision digits.
Or some maybe somework to be done to do it?
I am writing some software that deserializes a excel file as a Vec<HashMap<String, Value>>
where Value is my own internal value type. I do this using the RangeDeserializer
, with the headers setting set to true
. However I am running into some problems when my header contains numbers, is deserialized as such, and I get an error about expecting string and getting a number. I guess a simple fix would to use my internal value type which can be deserialized as both a floating point and a string and then manually converting it to a string, however if the header is parsed as a boolean I would have problem there as well. I don't know if this is considered a bug or not, but is there any workaround other than using Range::rows()
and manually implementing the deserializing.
Continue records are not managed yet
https://msdn.microsoft.com/en-us/library/dd949081(v=office.12).aspx
Today the entire file is saved in memory. We can probably have a BufReader<ZipFile>
wrapper like what is done in xlsb to be a bit more efficient.
formula: CONCAT("a", "b")
This is probably the only missing part before covering all excel files.
It seems that no Rust crate at the moment supports encrypted spreadsheets but is there any plans? It seems that password support is by no means a trivial feature, since there are many different schemes to encrypt an XLSX file. I don't even know if some are more common than others: http://poi.apache.org/encryption.html
I think that the Apache POI library is the gold standard standard for supporting stuff: https://github.com/apache/poi
As one step towards a better direction, it would be nice if it could recognize and report the error correctly, as it does with the XLS format.
The sheet_names
method clones the vector of the sheet names. I get that this isn't likely to be on the hot path, but cloning and allocating should usually left to the user when possible. (Nothing wrong with convenience methods though). As an alternative, it would be a good practice to provide an iterator over the sheet names: sheet_names_iter
, with Item=&str
. This avoids both allocation and cloning, leaving the maximum flexibility in the hands of the user.
Great library, it's saved me countless hours of work. I'm using it to read ods spreadsheets, and I could use a list of sheet names in the original order that they appeared in the file.
#113 contains code aimed at debugging indexing bugs.
let range = workbook.worksheet_range(name_worksheet.as_str());
if range.is_empty() {
continue
}
let (row_max, col_max) = (row_max_usize as u32, col_max_usize as u32);
trace!("{}x{} matrix", row_max, col_max);
for index_col in 0..col_max {
for index_row in 0..row_max {
// panics
let cell = range.get_value((index_row, index_col));
}
}
I hope you support .ods
:)
This seems like a file format that fits very well with the possible applications for Calamine.
Consider Calamine 0.14.0, and an XLSX spreadsheet with one worksheet covered by range
.
Given range.end() == (m_max, n_max)
where m_max >= 0
and n_max >= 0
,
let a = range.get_value((m, n));
where n == n_max + 1
,
let b = range.get_value((m, n));
where n == 0
,
then a == b
.
This is incorrect. None
should be returned when the index of any dimension is overflown wrt. to the maximum index.
There seems to be no sheet if I only have one line in a xlsm file.
| 1 | a |
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Msg("Sheet \'Feuil1\' does not exist"), (None, None))', ../src/libcore/result.rs:799
extern crate office;
use office::Excel;
fn main() {
let mut excel = Excel::open("test2.xlsm").unwrap();
let r = excel.worksheet_range("Feuil1").unwrap();
println!("0,0: {:?}", r.get_value(0, 0));
println!("0,1: {:?}", r.get_value(0, 1));
for row in r.rows() {
println!("row: {:?}", row);
}
}
Adds several missing parts
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.