jsonkenl / xlsxir Goto Github PK

Xlsx parser for the Elixir language.

License: MIT License

Elixir 100.00%

elixir elixir-lang excel xlsx parser libreoffice

xlsxir's Introduction

Xlsxir

Xlsxir is an Elixir library that parses .xlsx files using Simple API for XML (SAX) parsing via the Erlsom Erlang library, extracts the data to an Erlang Term Storage (ETS) process and provides various functions for accessing the data. Please submit any issues found and they will be addressed ASAP.

Installation

You can add Xlsxir as a dependency to your Elixir project via the Hex package manager by adding the following to your mix.exs file:

def deps do
  [ {:xlsxir, "~> 1.6.4"} ]
end

Or, you can directly reference the GitHub repo:

def deps do
  [ {:xlsxir, github: "jsonkenl/xlsxir"} ]
end

Then start an OTP application:

defp application do
  [applications: [:xlsxir]]
end

Basic Usage

Xlsxir.extract/3 is deprecated, please use Xlsxir.multi_extract/1-5 going forward.

Xlsxir parses a .xlsx file located at a given path and extracts the data to an ETS process via the Xlsxir.multi_extract/1-5, Xlsxir.peek/3-4 and Xlsxir.stream_list/2-3 functions:

Xlsxir.multi_extract(path, index \\ nil, timer \\ false, excel \\ nil, options \\ [])
Xlsxir.peek(path, index, rows, options \\ [])
Xlsxir.stream_list(path, index, options \\ [])

The peek/3-4 functions return only the given number of rows from the worksheet at a given index. The multi_extract/1-5 functions allow multiple worksheets to be parsed by creating a separate ETS process for each worksheet and returning a unique table identifier for each. This option will parse all worksheets by default (when index == nil), returning a list of tuple results.

Argument descriptions:

path the path of the file to be parsed in string format
index is the position of the worksheet you wish to parse (zero-based index)
timer is a boolean flag that controls an extraction timer that returns time elapsed when set to true. Defalut value is false.
rows is an integer representing the number of rows to be extracted from the given worksheet.
options - see function documentation for option detail.

Upon successful completion, the extraction process returns:

for multi_extract/3:
- [{:ok, table_1_id}, ...] with timer set to false
- {:ok, table_id} when given a specific worksheet index
- [{:ok, table_1_id, time_elapsed}, ...] with timer set to true
- {:ok, table_id, time_elapsed} when given a specific worksheet index
for peek/3: {:ok, table_id}

Unsucessful parsing of a specific worksheet returns {:error, reason}.

The extracted worksheet data can be accessed using any of the following functions:

Xlsxir.get_list(table_id)
Xlsxir.get_map(table_id)
Xlsxir.get_mda(table_id)
Xlsxir.get_cell(table_id, cell_ref)
Xlsxir.get_row(table_id, row_num)
Xlsxir.get_col(table_id, col_ltr)
Xlsxir.get_info(table_id, num_type)

Xlsxir.get_list/1 returns entire worksheet in a list of row lists (i.e. [[row 1 values], ...])
Xlsxir.get_map/1 returns entire worksheet in a map of cell names and values (i.e. %{"A1" => value, ...})
Xlsxir.get_mda/1 returns entire worksheet in an indexed map which can be accessed like a multi-dimensional array (i.e. some_var[0][0] for cell "A1")
Xlsxir.get_cell/2 returns value of specified cell (i.e. "A1" returns value contained in cell A1)
Xlsxir.get_row/2 returns values of specified row (i.e. 1 returns the first row of data)
Xlsxir.get_col/2 returns values of specified column (i.e. "A" returns the first column of data)
Xlsxir.get_info/1 and Xlsxir.get_multi_info/2 return count data for num_type specified (i.e. :rows, :cols, :cells, :all)

Once the table data is no longer needed, run the following function to delete the ETS process and free memory:

Xlsxir.close(table_id)

Refer to Xlsxir documentation for more detailed examples.

Considerations

Cell references are formatted as a string (i.e. "A1"). Strings will be returned as type string, resulting values for functions from within the worksheet are returned as type string, integer or float depending on the type of the resulting value, data formatted as a number in the worksheet will be returned as type integer or float, date formatted values will be returned in Erlang :calendar.date() type format (i.e. {year, month, day}), and datetime values will be returned as an Elixir naive datetime. Xlsxir does not currently support dates prior to 1/1/1900.

Contributing

Contributions are encouraged. Feel free to fork the repo, add your code along with appropriate tests and documentation (ensuring all existing tests continue to pass) and submit a pull request.

Bug Reporting

Please report any bugs or request future enhancements via the Issues page.

Acknowledgements

I'd like to thank the following people who were a big help in the development of this library:

Paulo Almeida (@pma) helped with testing and has provided several great ideas for development.
Benjamin Tan's (@benjamintanweihao) article on SAX parsing with Elrsom was invaluable.
Daniel Berkompas' (@danielberkompas) article Multidimensional Arrays in Elixir inspired Xlsxir.get_mda/0.
Alex Kovalevych's (@alexkovalevych) contributions greatly improved the parsing performance of Xlsxir.

xlsxir's People

Contributors

Stargazers

Watchers

xlsxir's Issues

Enum.filter_map/3 is deprecated

warning: Enum.filter_map/3 is deprecated. Use Enum.filter/2 + Enum.map/2 or for comprehensions instead
  lib/xlsxir/unzip.ex:73

inlineStr type not parsed

cell value of inlineStr is not parsed... adding:

['inlineStr', _, s] -> List.to_string(s)

to the format_cell_value function makes it work, but I'm not sure if that's the correct solution.

Trouble parsing a date field

I have a xlsx file, which has a field with the content "01.05.2018". xlsxir parses this into {2014, 4, 30}. While changing the date from May 1st into Apr 30th could be explained as some timezone adjustment (note that this is Berlin, which is both german - i.e. uses DD.MM.YYYY - and UTC+1/UTC+2) the year change cannot be explained that way.

Any ideas how I could fix this? Is this a bug with xlsxir? I can easily provide the XLSX file if that helps.

MatchError when Parsing XLSX

Hi!

Thanks for the great library. We're using this in production and noticed that we're having an error when trying to parse XLSX files whose columns may be missing some style information (for example a column with no values). The error seems to go away when we directly delete that that column.

This is the error we get:

(MatchError) no match of right hand side value: []

(xlsxir) lib/xlsxir/parse_style.ex:69: Xlsxir.ParseStyle.sax_event_handler/2
(erlsom) xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:1408: :erlsom_sax_utf8.wrapCallback/2
(erlsom) xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:936: :erlsom_sax_utf8.parseContentLT/2
(erlsom) xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:196: :erlsom_sax_utf8.parse/2
(xlsxir) lib/xlsxir/sax_parser.ex:61: Xlsxir.SaxParser.parse/3
(xlsxir) lib/xlsxir/xlsx_file.ex:225: Xlsxir.XlsxFile.parse_styles_to_ets/1
(xlsxir) lib/xlsxir/xlsx_file.ex:65: Xlsxir.XlsxFile.initialize/2
(xlsxir) lib/xlsxir.ex:88: Xlsxir.extract/4

I've attached the spreadsheet, missing_styles.xlsx, where this occurs.

As a heads up, I have a fix for this, but I thought it would better to log an issue and open a PR which fixes this.

Sheet name

Is there a way to get the sheet name when extracting data?

Conditional formatting of cells breaks parsing

Thanks for the repo. New to elixir but found your code interesting to try and learn from.

Found that the Worksheet parser currently breaks if cells contain any conditional formatting elements. In the process of debugging more but need to get to sleep for today.

Symptoms:
ParseWorksheet fails during the :characters handler as the state is true and not a map with a value that contains the contents of the conditional formatting rule (ex. A1<>'Staffing (2)'!A1) which was parsed from the sheet @

<extLst>
    <ext uri="{78C0D931-6437-407d-A8EE-F0AAD7539E65}" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main">
      <x14:conditionalFormattings>
        <x14:conditionalFormatting xmlns:xm="http://schemas.microsoft.com/office/excel/2006/main">
          <x14:cfRule id="{A977E7D9-018E-4592-92B6-8A1AF77B8E98}" priority="1" type="expression">
            <xm:f>A1&lt;&gt;'Staffing (2)'!A1</xm:f>
            <x14:dxf>
              <fill>
                <patternFill>
                  <bgColor rgb="FF99FF99"/>
                </patternFill>
              </fill>
            </x14:dxf>
          </x14:cfRule>
          <xm:sqref>A1:XFD1048576</xm:sqref>
        </x14:conditionalFormatting>
      </x14:conditionalFormattings>
    </ext>
  </extLst>

Expected:

System parses correctly

Steps To Reproduce:

Add conditional formatting to any cell (did this with the test worksheet present in the repo but need to create it's own test tomorrow)

Stacktrace

** (BadMapError) expected a map, got: true
     stacktrace:
       (xlsxir) lib/xlsxir/parse_worksheet.ex:54: Xlsxir.ParseWorksheet.sax_event_handler/2
       /Users/justinn/Development/elixir/xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:1408: :erlsom_sax_utf8.wrapCallback/2
       /Users/justinn/Development/elixir/xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:984: :erlsom_sax_utf8.parseTextNoIgnore/3
       /Users/justinn/Development/elixir/xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:963: :erlsom_sax_utf8.parseContent/2
       /Users/justinn/Development/elixir/xlsxir/deps/erlsom/src/erlsom_sax_utf8.erl:196: :erlsom_sax_utf8.parse/2
       (xlsxir) lib/xlsxir/sax_parser.ex:59: Xlsxir.SaxParser.parse/2
       (xlsxir) lib/xlsxir.ex:46: Xlsxir.extract/3

Xlsxir reads the wrong sheet name from workbook created with Elixlsx

I'm not sure if the bug is with Elixlsx or with Xlsxir, but the file works when opened in Excel.

bug.xlsx

iex(19)> %Elixlsx.Workbook{sheets: [%Elixlsx.Sheet{name: "Sheet1", rows: [["A"]]}, %Elixlsx.Sheet{name: "Sheet2", rows: [["A"], ["B"]]}]} |> Elixlsx.write_to("bug.xlsx")
{:ok, 'bug.xlsx'}
iex(20)> [{:ok, sheet1}, {:ok, sheet2}] = Xlsxir.multi_extract("bug.xlsx")
[
  ok: #Reference<0.1684016233.731512836.50328>,
  ok: #Reference<0.1684016233.731512836.50330>
]
iex(21)> Xlsxir.get_info(sheet1)
[rows: 1, cols: 1, cells: 1, name: nil]
iex(22)> Xlsxir.get_info(sheet2)
[rows: 2, cols: 1, cells: 2, name: "Sheet1"]

Order of rows does not match the source file

At least in the .xlsx I used, the line number is represented as a string in the ets table, so Enum.sort will not work as expected: https://github.com/kennellroxco/xlsxir/blob/master/lib/xlsxir.ex#L125.

I handled this by not using get_lines and accessing the ets table directly. I need the row number to display to the user on any validation error, but I don't necessarily need to process the rows in the order of the source file.

I'm also thinking that the use case of iterating the rows could be implemented considerably more efficiently if it was a special case, avoiding a lot of intermediate representations and transformations.

Improve parsing worksheets

I think we can improve parsing worksheets by not parsing them entirely in the memory, but instead parse line by line by request (using Stream).

Here is the idea:

Parse sharedString.xml the same way we do right now.
Create custom Stream for each worksheet.
When the next row is requested - read next line in the worksheet#{i}.xml and add it to the sax parser. Return parsed line.

Of course, if the entire worksheet is a single line (which i think not happens often), we can't do that.

not adding xlsxir to application function

Hello, I'm having the doubt if isn't necessary to add xlsxir to application function into mix.exs?
As in application is supposed to be added all the stuff you need in production..

Extract cell range from ets table i.e "A3:F10", sheet names and named ranges

Not sure if this can be achieved without first grabbing the entire ets table and filtering it.

Would be nice to just pass a string like "A10:G1000" and only get back a list/map/mda of those cols*rows out of ets.

Also workbook.xml contains some useful information when parsing workbooks such as sheet names, and definedNames which correspond to named ranges in excel.

Let me know if you are interested in adding this functionality to your library, I will try to help!

Bug: get_list/1 skips leading empty cells

Table: "B1" = 2, "D1" = 4

Xlsxir.get_list(table_id) incorrectly returns [[2, nil, 4]]

Expected to return [[nil, 2, nil, 4]]

Read currency format

Hi there!,

I have some problems reading columns that have been formatted as currency because values returned look like erl dates.

For example in this file currency.xlsx columns have Canadian dollar format but what I get for 0.00 CAD is {0, 0, 0}, and 45,052.00 CAD is {2023, 5, 6} and I dont know the reason. Is it possible to get the plain text value? I see in the documentation that available formats are strings, numbers and dates so I don't know how other formats are parsed.

(ArithmeticError) bad argument in arithmetic expression

@AlexKovalevych Since upgrade to Elixir 1.5.0 and move to Phoenix 1.3 I'm getting this error when calling Xlsxir.multi_extract/2. Having a bit of trouble figuring out why, any clue to why this might be happening?

** (exit) an exception was raised:
** (ArithmeticError) bad argument in arithmetic expression
(xlsxir) lib/xlsxir/state_manager.ex:20: anonymous fn/2 in Xlsxir.TableId.assign_id/1
(elixir) lib/agent/server.ex:23: Agent.Server.handle_call/3
(stdlib) gen_server.erl:636: :gen_server.try_handle_call/4
(stdlib) gen_server.erl:665: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

Parse fails when more than 2 attributes for a cell

ETS process vs data structure

(I'm new to Excel / Erlang / Elixir parsing so please bear with me if this doesn't make too much sense.)

In order to avoid having to sandwich my code between Xlsxir.multi_extract and Xlsxir.close everywhere, I'm planning to make a helper function that extracts a file, converts everything into some data structure (probably a 2 dimensional array), then closes the ETS process. Then nothing else will have to care about the ETS process anymore.

2 questions:

Would this function be a nice to have in xlsxir?
Why would you not use such a function? Is it because the typical use case involves more random access than batch processing?

Stream loading

Hi,

Thank you for this module, this is working fine!

However, when I want to load very big files (more than 200K lines), on my staging server I have a crash due to low memory. Would you consider adding functions to parse the xlsx file with streams so that one can load data as he needs it?

In my use case, the final user can upload xlsx files to import contacts, so a stream version of the extract would be great to improve my performances and avoid crashes for very big files :)

A way to unzip to memory instead of fs?

Currently if i process multiple excel files in parallel they will be extracted to the same directory at the disk which makes it impossible to parse files. Probably we should consider extract to memory or allow to set extract directory.

Validation of xlsx file based on content_type (or someother attribute) than file_path

Usecase

When using Phoenix and integrating parsing of xlsx files, there could be a possibility of passing Plug.Upload file_path for parsing but it bails out due to xlsx file_extension.

Even though the file is a valid xlsx, it fails 😢

Bypassing the file_path check, correctly parses and returns valid data

I don't have much of backgroud in xlsx parsing stuff but curious if we can shift to more robust than file_extension as I can change the extension of any file and make it pass through validation check and a valid file gets blocked due to it.

Let me know if I can be of any help in fixing it. I will need some guidance though 😉

Parsed TIMEVALUE cell showing one second too early

Hi,

Using xlsir on sample attached xlsx file, timevalue are parsed but showing one second too early :

iex(1)> Mix.install([
...(1)>   {:xlsxir, github: "jsonkenl/xlsxir"}
...(1)> ])
* Getting xlsxir (https://github.com/jsonkenl/xlsxir.git)
remote: Enumerating objects: 2261, done.        
[...]
iex(2)> {:ok, table_id} = Xlsxir.multi_extract("/home/jfburdet/Downloads/time_issue.xlsx", 0)
{:ok, #Reference<0.4013215777.4275175433.253817>}
iex(3)> table_id |> Xlsxir.get_map()
%{"A1" => "This should read 9h30mn00sec", "A2" => {9, 29, 59}}

Opening the file in excel shows correct time (9h30mn00s) ...

Did I miss something or is it an actual bug ?

time_issue.xlsx

Collectable protocol is deprecated for non-empty lists

(Elixir 1.8)

I have difficulty reading elixir errors like this. I think the warning is referring to the xlsxir code? If not i can make a report somewhere else. happy to help work on this if you accept pull requests..

warning: the Collectable protocol is deprecated for non-empty lists. The behaviour of things like Enum.into/2 or "for" comprehensions with an :into option is incorrect when collecting into non-empty lists. If you're collecting into a non-empty keyword list, consider using Keyword.merge/2 instead. If you're collecting into a non-empty list, consider concatenating the two lists with the ++ operator.
  (elixir) lib/collectable.ex:83: Collectable.List.into/1
  (elixir) lib/enum.ex:1227: Enum.into_protocol/2
  (xlsxir) lib/xlsxir/parse_worksheet.ex:110: Xlsxir.ParseWorksheet.sax_event_handler/4
  (erlsom) c:/Users/XXXX/Projects/XXXXXX/deps/erlsom/src/erlsom_sax_utf8.erl:1408: :erlsom_sax_utf8.wrapCallback/2
  (erlsom) c:/Users/XXXX/Projects/XXXXXX/deps/erlsom/src/erlsom_sax_utf8.erl:907: :erlsom_sax_utf8.parseContentLT/2

Plans for Excel Export?

I need a tool to export into Excel file format. Do you have any plans to extend the project by an export feature?

Bad match on sheet r:id value "R1"

While trying to parse a workbook I received the following error:

** (MatchError) no match of right hand side value: "R1"
    (xlsxir 1.6.4) lib/xlsxir/parse_workbook.ex:27: anonymous fn/2 in Xlsxir.ParseWorkbook.sax_event_handler/2
    (elixir 1.11.4) lib/enum.ex:2193: Enum."-reduce/3-lists^foldl/2-0-"/3
    (xlsxir 1.6.4) lib/xlsxir/parse_workbook.ex:17: Xlsxir.ParseWorkbook.sax_event_handler/2
    (erlsom 1.5.0) /Users/bp/carebridge/deus_ex_machina/deps/erlsom/src/erlsom_sax_utf8.erl:1408: :erlsom_sax_utf8.wrapCallback/2
    (erlsom 1.5.0) /Users/bp/carebridge/deus_ex_machina/deps/erlsom/src/erlsom_sax_utf8.erl:926: :erlsom_sax_utf8.parseContentLT/2
    (erlsom 1.5.0) /Users/bp/carebridge/deus_ex_machina/deps/erlsom/src/erlsom_sax_utf8.erl:196: :erlsom_sax_utf8.parse/2
    (xlsxir 1.6.4) lib/xlsxir/sax_parser.ex:61: Xlsxir.SaxParser.parse/3
    (xlsxir 1.6.4) lib/xlsxir/xlsx_file.ex:236: Xlsxir.XlsxFile.parse_workbook_to_ets/1
    (xlsxir 1.6.4) lib/xlsxir.ex:233: Xlsxir.multi_extract/5

Digging through the actual xml I noticed the sheets attributer:id="R1"

<?xml version="1.0" encoding="utf-8"?>
<workbook xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
	xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
	<sheets>
		<sheet name="LATEST_XWALK" sheetId="1" r:id="R1" />
	</sheets>
</workbook>

Then it appears the parser is expecting the r:id to always start with "rId"
Changing this function to accommodate "R1" did the trick

  def sax_event_handler({:startElement, _, 'sheet', _, xml_attrs}, state) do
    sheet =
      Enum.reduce(xml_attrs, %{name: nil, sheet_id: nil, rid: nil}, fn attr, sheet ->
        case attr do
          {:attribute, 'name', _, _, name} ->
            %{sheet | name: name |> to_string}

          {:attribute, 'sheetId', _, _, sheet_id} ->
            {sheet_id, _} = sheet_id |> to_string |> Integer.parse()
            %{sheet | sheet_id: sheet_id}

          {:attribute, 'id', _, _, rid} ->
            case rid |> to_string do
              "rId" <> rid ->
                {rid, _} = Integer.parse(rid)
                %{sheet | rid: rid}

              "R" <> rid ->
                {rid, _} = Integer.parse(rid)
                %{sheet | rid: rid}
              end

          _ ->
            sheet
        end
      end)

    %__MODULE__{state | sheets: [sheet | state.sheets]}
  end

I'm happy to clean up the above and submit a PR but I'm not 100% sure the r:id="R1" is valid? I received this file from a client, when creating a new file I see the expected r:id="rId1".

Parsing time failed

** (FunctionClauseError) no function clause matching in Xlsxir.ConvertTime.from_float/1
    lib/xlsxir/convert_time.ex:31: Xlsxir.ConvertTime.from_float(42705.0002662037)
    lib/xlsxir/parse_worksheet.ex:59: Xlsxir.ParseWorksheet.sax_event_handler/2
    src/erlsom_sax_utf8.erl:1408: :erlsom_sax_utf8.wrapCallback/2
    src/erlsom_sax_utf8.erl:907: :erlsom_sax_utf8.parseContentLT/2
    src/erlsom_sax_utf8.erl:196: :erlsom_sax_utf8.parse/2
    lib/xlsxir/sax_parser.ex:59: Xlsxir.SaxParser.parse/2
    lib/xlsxir.ex:137: Xlsxir.do_multi_extract/3
    lib/xlsxir.ex:109: anonymous fn/4 in Xlsxir.multi_extract/3

Export to XLSX

Hi there,

I'm looking around for an Elixir library that can export data to Excel format. You have this listed in your development goals:

Export functionality to .xlsx file type with formatting options

I'm just wondering how actively that's being worked? Do you know of any libraries in the Elixir ecosystem at all that can do this right now?

Need to update docs and tests before publishing v 0.2 to Hex

Xlsxir.Format.format_cell_value results in matcherror

Commits since March 12th 2019 not included if using hex-version

After some hours debugging the error fixed in Fix MatchError when no styles exist for cell , I became aware that the code in the Github repo is different than the one received if using {:xlsxir, "~> 1.6"}in mix.exs. When looking at Xlsxir at hex.pm, it seems like the latest commit included is the one from March 12th 2019.

Is this intended, or is it a mistake, e.g. by having forgotten to increment the version number?

Parse function doesn't handle errors in Excel worksheet

Need to add parse functionality for any errors contained within an Excel workbook.

Concurrent access fails.

I am using Xlsxir from a phoenix application which means there will be the potential for multiple users to try and process their spreadsheets simultaneously.

But due to the global unzip dir (I think) and ets tables, concurrent access fails. This is evident when I run my tests. If they run async, I get failures. If I serialise them - they all pass.

The solution I came up with was to put a GenServer facade in front of Xlsxir. Calls to the facade are serialised via the GenServer's mailbox so xlsx processing only happens on one file at a time.

This is not ideal but it solves the initial problem. It may lead to a performance bottleneck in the (for me, far) future.

I was wondering whether it would be useful for this technique to be added to the documentation?

Here's an example of the GenServer:

defmodule Excel do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, :ok, name: :excel)
  end

  ## Public interface

  def from_xlsx(path) do
    GenServer.call(:excel, {:from_xlsx, path})
  end

  ## Server callbacks
  def handle_call({:from_xlsx, path}, _from, state) do
    {:reply, do_work(Xlsxir.multi_extract(path)), state}
  end 

  defp do_work({:error, _} = err), do: err
  defp do_work(tables) do
    res = process_tables(tables)
    tables |> Enum.each(fn {:ok, id} -> Xlsxir.close(id) end)

    res
  end

  ...

# Usage:
  # start the GenServer
  Excel.start_link

  # and in your app, use it:
  res = Excel.from_xlsx("path/to/file.xlsx")

That's my first GenServer - and I know it's a little odd because it doesn't actually care about state at all. I'm more than happy to hear feedback on whether the above is useful, not, or whether it is a "start" for something that may be useful.

Note: with the change to use the above pattern, my tests run successfully when running async.

Fail to read the formula result

Test file: import-sheet-1.xlsx

iex(36)> path = "./import-sheet-1.xlsx"
iex(37)> [{:ok, s1}|_] = Xlsxir.multi_extract(path)
[
  ok: #Reference<0.1526854273.1649541121.217868>,
  ok: #Reference<0.1526854273.1649541121.217870>
]
iex(38)> Xlsxir.get_list(s1)
[
  [nil, nil, nil],
  [nil, nil,
   "IFERROR(INDEX(INDIRECT(INDIRECT(ADDRESS(3,COLUMN()))&\"[uniqueKey]\"),MATCH(INDIRECT(ADDRESS(ROW(),COLUMN(),1,TRUE,\"object-input\")),INDIRECT(INDIRECT(ADDRESS(3,COLUMN()))&\"[name]\"),0)),IF(ISBLANK(INDIRECT(ADDRESS(ROW(),COLUMN(),1,TRUE,\"object-input\"))), \"\", INDIRECT(ADDRESS(ROW(),COLUMN(),1,TRUE,\"object-input\"))))"]
]

It's expected to return the formula result which should be nil or empty string, but it returns the formula string instead.

When the 1.6.5 will come out?

Hey @jsonkenl thanks for this amazing library.

Do you have any deadline to launch the 1.6.5 release, including the: #112 #111 and #108 ?

Documentation for Xlsxir.multi_extract/4

@AlexKovalevych could you please update the documentation for the excel parameter in Xlsxir.multi_extract/4 in the xlsxir.ex file when you get a chance? Thanks!

Unable to catch/rescue exception from Xlsxir.stream_list

I have an excel file with some bad data that crashes Xlsxir.stream_list in an uncatchable way.
Example: run mix test on https://github.com/benschenker/brokelixir

It fails with:

10:20:17.303 [error] Process #PID<0.195.0> raised an exception
** (RuntimeError) Unknown cell attribute
    (xlsxir) lib/xlsxir/parse_worksheet.ex:73: anonymous fn/2 in Xlsxir.ParseWorksheet.sax_event_handler/4
    (elixir) lib/enum.ex:1314: Enum."-map/2-lists^map/1-0-"/2
    (xlsxir) lib/xlsxir/parse_worksheet.ex:68: Xlsxir.ParseWorksheet.sax_event_handler/4
    (erlsom)

even though the call is wrapped in a try/rescue block, I have also tried catching :exit to no avail.

I'd love to put up a pr to fix this but after poking around a bit it feels like I may need a bit of guidance. Any assistance is appreciated.

Phonetic guides are added to column values

Excel has a feature called phonetic guides for Japanese characters. When kanji are entered in a column, Excel adds phonetic guides as katakana. In Excel, these guides can be hidden or shown.

When opening an xlsx file like that with LibreOffice, Apple Numbers or Google Sheets, you can only see the actual column content (without the guides). However, when parsing the data with xlsxir, the phonetic guides are appended to the column content. For example, instead of 國學院大学, a column will contain 國學院大学ダイガク.

While googling for a solution, I found the same issue description in other tools: SDL, ACS, MicroStrategy.

According to this post,

For storing each unique text from a cell, Excel uses something called a "shared string table" and the content of each cell is the index of the text from that table. When we implemented the filter we erroneously thought that every "shared string" item contains only the text of the cell and some formatting belonging to that text. However, after this post, we found out that the phonetic translations are also found there.

Load custom formats

Sample from xl/styles.xml:

<numFmts count="4"><numFmt numFmtId="164" formatCode="GENERAL"/><numFmt numFmtId="165" formatCode="#,##0"/><numFmt numFmtId="166" formatCode="#,##0.00"/><numFmt numFmtId="167" formatCode="YYYY/MM/DD"/></numFmts><fonts count="4">

Current implementation raises here: https://github.com/kennellroxco/xlsxir/blob/master/lib/xlsxir/parse.ex#L73

** (RuntimeError) Unknown style type: 164.
    (xlsxir) lib/xlsxir/parse.ex:73: anonymous fn/1 in Xlsxir.Parse.num_style/1
    (elixir) lib/enum.ex:1088: Enum."-map/2-lists^map/1-0-"/2
    (xlsxir) lib/xlsxir.ex:44: Xlsxir.extract/3

Compilation warnings on Elixir 1.11

Compiling the library on Elixir 1.11 throws the following warning:

Compiling 14 files (.ex)
warning: Supervisor.Spec.worker/2 is deprecated. Use the new child specifications outlined in the Supervisor module instead
  lib/xlsxir.ex:10: Xlsxir.start/2

Performance issue with large files

Xlsxir.Parse.worksheet/2 seems to be the bottleneck.

Parse special characters

Hi, I don't read special characters like ampersand; i.e. "http://domain.com?param1=aaa&param2=two

Complex shared strings throw out indexes

Problem

I have a workbook that has some colored text. In my app it parses successfully but the field values are incorrect. It is correct in numbers, libre office and using Ruby's "Simple Xlsx Reader" gem.

It comes down to the parsing of the shared strings. In the shared strings I have:

  <si>
    <r>
      <t>Red writing</t>
    </r>
    <r>
      <rPr>
        <sz val="10"/>
        <rFont val="Arial"/>
        <family val="2"/>
      </rPr>
      <t>: Indicates no perament walker, but these Walkers are helping out this week</t>
    </r>
    <phoneticPr fontId="0" type="noConversion"/>
  </si>

In Xlsxir this is turned into two strings "Red writing" followed by ": Indicates no perament walker, but these Walkers are helping out this week". In other parses it is a single string "Red writing: Indicates no perament walker, but these Walkers are helping out this week".

This throws the indexes out by one and causes the wrong values to be used in any cells that use shared strings that follow the above value.

Current Status

I am looking to reduce the problem spreadsheet to the smallest failing example. I will look into getting a test case written and will try and find a fix. I might need some help on sax parser stuff.

Error in ParseWorksheet.format_cell_value/1

I'm getting an error in ParseWorksheet.format_cell_value/1 (https://github.com/kennellroxco/xlsxir/blob/master/lib/xlsxir/parse_worksheet.ex#L99) where list has the value ['str', 'd', s].

The problematic cell seems to be one with a text formula:

<c r="F2" s="12" t="str">
        <f>CONCATENATE(...edited...)</f>
        <v>Mensalidade de Maio de 2016</v>
      </c>

As a quick fix, I added an extra ['str', _, s] -> List.to_string(s) to the case statement. I'm wondering if you might have a better insight on the issue and a more elegant solution.

I can get more relevant info from this file if you can point me in the right direction.

Read cached, evaluated formula values?

Currently xlsxir reads out the formula string directly. Is there a way to read the evaluated value that MS Excel cached in the file?

I know openpyxl has a data_only flag that does that. Is it in your plan to support this feature?

Thanks!

Parsed Row/Column values leaks into subsequent rows and corrupt parsed data

Observation but might not be the actual cause

When using merged cells with partially colored text, Row values wrongly leaked into subsequent rows and data from parsing is corrupted.

Sample Text (is Bold, has Green in green color and Blue in Blue color)

Detailed Info, Green are mandatory fields, Blue are optional ones

Here Green are mandatory fields, Blue are optional ones are wrongly parsed as next row values and corrupts values for next row and further

Sample Xlsx with parsing problem

sample_with_color_and_merged_cells.xlsx

IEX Log and parsed data

iex(1)> file_path = "/home/pikender/sample_with_color_and_merged_cells.xlsx"
"/home/pikender/sample_with_color_and_merged_cells.xlsx"
iex(2)> Xlsxir.extract(file_path, 0, true)                                                             
{:ok, [0, 0, 0, 123158]}
iex(3)> Xlsxir.get_list
[["Profile Info", nil, nil, nil], ["Full Name", "First Name", "Last Name", nil],
 [nil, nil, nil, nil],
 ["Detailed Info, ", nil, nil, nil, nil, nil, nil, nil, nil],
 ["Green", " are mandatory fields, ", "Blue", " are optional ones", "Name",
  "Phone Number", "Country", "City", "State"], ["Address 1"]]
iex(50)> Xlsxir.close   
:ok

Expected

Even, if doesn't match below parsing but altleast should prevent corruption of other rows data and parse properly

[
  ["Profile Info", nil, nil, nil],
  ["Full Name", "First Name", "Last Name", nil],
  [nil, nil, nil, nil],
  ["Detailed Info, Green are mandatory fields, Blue are optional ones", nil, nil, nil, nil, nil, nil, nil, nil],
  ["Name", "Phone Number", "Country", "City", "State", "Address 1", "Address 2", "Zip Code", "Landmark"],
  ["Pikender"]
]

Actual

[
  ["Profile Info", nil, nil, nil],
  ["Full Name", "First Name", "Last Name", nil],
  [nil, nil, nil, nil],
  ["Detailed Info, ", nil, nil, nil, nil, nil, nil, nil, nil],
  ["Green", " are mandatory fields, ", "Blue", " are optional ones", "Name", "Phone Number", "Country", "City", "State"],
  ["Address 1"]
]

Let me know if any more info needed

Improve performance

Parsing a XLSX with about 40k rows and 7 columns takes 77.38 s.

For comparison, using Apache POI with JInterface for interop takes about 7.7 s.

Creating this as a reminder to profile the code and see if it is possible to improve the performance.

Edit: The current implementation loads the entire worksheet XML into memory and builds at least two intermediate lists to produce the formatted output. We can improve this by using xmerl_sax_parser and in one pass produce the output list. An alternative that would add an external dependency and some C code but could be faster is https://github.com/processone/fast_xml/

Edit 2: The zip file is opened, loaded and decompressed several times (1. validate, 2. extract shared_strings, 3. extract styles, n. for each worksheet). To improve performance we should load it and decompress only once. We can start a GenServer or Agent to keep the state for a single .xlsx to be processed and stop it after finishing. If all binaries are only accessed by this process, I think the garbage collector will just free all the memory, including the large binaries, once the process dies.

docx format support

Any plans to add docx parsing support?

Ignore empty cells

Hi. I need to parse a sheet that contains empty cells, but I can't ignore them, I need to obtain some empty char or nil option. Is there some way to have this behavior? I saw that you added the functionality of ignoring empty cells in the last version.

argument error (stdlib) :ets.new

Hi I'm trying to read a xlsx but always get the same error
** (ArgumentError) argument error
(stdlib) :ets.new(:styles, [:set, :protected, :named_table])
lib/xlsxir/state_manager.ex:199: Xlsxir.Style.new/0
lib/xlsxir/sax_parser.ex:52: Xlsxir.SaxParser.parse/3
(elixir) lib/enum.ex:645: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:645: Enum.each/2
lib/xlsxir.ex:179: Xlsxir.do_multi_extract/3
with extract and peek get the same error can you suggest a solution?

version 1.6.3 not found

** (Mix) No matching version for xlsxir ~> 1.6.3 (from: mix.exs) in registry

The latest version is: 1.6.2

Error when running extract a 2nd time

Running Xlsxir.extract/3 a second time raises. ets seems to complain that the table already exists.

iex(1)> path = "large_file.xlsx"
"large_file.xlsx"
iex(2)> Xlsxir.extract(path, 0)    
:ok
iex(3)> sheet = Xlsxir.extract(path, 0)
** (ArgumentError) argument error
    (stdlib) :ets.new(:worksheet, [:set, :protected, :named_table])
    (xlsxir) lib/xlsxir/state_manager.ex:11: Xlsxir.Worksheet.new/0
    (xlsxir) lib/xlsxir/sax_parser.ex:48: Xlsxir.SaxParser.parse/2
    (xlsxir) lib/xlsxir.ex:46: Xlsxir.extract/3

Specify date type instead of default erlang date tuple

Thanks for great tools to handle excel format in elixir.

Is there any configuration to make the parser to generate elixir data type instead erlang date tuple?
The reason is when persist the data as map to database, it will report error protocol Jason.Encoder not implemented for {2011, 12, 30} of type Tuple.

So I need to convert the erlang date tuple myself.
Is there some configuration for xlsxir to do this ?or I need to handle the rows returned by Xlsxir.get_list(tid)

Thanks