Giter Club home page Giter Club logo

Comments (40)

hadley avatar hadley commented on May 18, 2024

Shouldn't be too hard to translate (e.g.) B5:X17 to right number of rows and cols to skip/read

from readxl.

hadley avatar hadley commented on May 18, 2024

@zentree would you prefer a specification like this?

from readxl.

jkeirstead avatar jkeirstead commented on May 18, 2024

I'd second this, both the functionality of being able to pull out a rectangular selection and the ability to specify that selection with Excel-like references. I'm currently using XLConnect's readWorksheet(..., startCol=, endCol=, startRow=, endRow=) syntax but it's a bit long-winded.

from readxl.

smbache avatar smbache commented on May 18, 2024

Specifying the number of rows should not be a must, as you typically don't know the number of rows,
so "A1:Z32" is typically not something you know within R, but I guess as long e.g. "B:Z" would work it's fine.

from readxl.

gshotwell avatar gshotwell commented on May 18, 2024

I would go with the "A1:Z31" syntax. Typically an excel user will be referencing the data cells anyway in the sheet so they'll be thinking in those terms. You might also want to have it skip the "$" notation in excel, so that "A$1:$B$32" returns the same cell range as A1:B32, just because the user might be copying the reference from an excel sheet. .

from readxl.

zentree avatar zentree commented on May 18, 2024

Rectangular specification would be a nice way to deal with number of rows too. From a user point of view I have to deal with two cases:

  1. I'm processing a large number of files—large enough so I can't check every one of them—which would have a default format. These are often machine generated.
  2. I'm reading a small number of files created by hand, where I need to extract a variable rectangular array.

I see a notation like A1:Z32 dealing mostly with case 2.

from readxl.

DavoOZ avatar DavoOZ commented on May 18, 2024

Excellent new package; so is readr.
I can see readxl replacing my old workhorse XLConnect very soon (no more hassles with java updates). Also much easier for new users.

I'd use two extra options a lot:

  1. header=TRUE/FALSE
  2. startrow=n

from readxl.

jennybc avatar jennybc commented on May 18, 2024

I'm doing exactly this over in googlesheets, which consumes (and writes) data from (to) Google Spreadsheets.

For the targeted data consumption discussed here, the Sheets API forces you to use a cell-by-cell approach (the incredibly slow "cell feed"). In this case, my function lets user specify the data rectangle via min row, max row, min col, max col. There are convenience wrappers to get one or more rows, one or more columns, or a region specified like B3:G17 or R3C2:R17C7 (the other standard positioning notation). Then that output can be processed by one of two functions for reshaping and/or transformation, where there is an argument for header = TRUE/FALSE.(Hmmm... I should probably write even more wrappers to package that sequence of actions.)

https://github.com/jennybc/googlesheets/blob/master/R/consume-data.R

https://github.com/jennybc/googlesheets/blob/master/README.md#convenience-wrappers-and-post-processing-the-data

I'm really interested in this thread because it would be great to keep the two interfaces as similar as possible. I even volunteer to help!

from readxl.

hadley avatar hadley commented on May 18, 2024

@jennybc I'd love to incorporate your cell specification code in readxl.

I think bundling all the options into one argument plus helper functions would be ideal. Something like this:

read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15))
read_excel(..., range = excel_range(c(2, NA), c(1, NA))

from readxl.

jennybc avatar jennybc commented on May 18, 2024

OK I will isolate and upgrade cell specification over in googlesheets and get back to you. The helper functions are nice idea.

  • Would range = + rc() and excel_range() be the only way to restrict cell consumption or … would you allow cell specification via, e.g., row = 3 or col = "B:D"? Or perhaps range = rows(3) and range = cols("B:D") or range = cols(B:D)? FWIW I am not prepared to deal with any requests for non-contiguous rows or cells, so my inclination is to disallow that.
  • I accept a single cell anywhere I accept a range, i.e. range = "D14" is just as valid as range = "D14:G20". I assume that's OK.
  • Ultimately I need my cell limits as a named list, to pass as a query. So that's where all roads must lead re: cell specification in googlesheets. You probably have more control/responsibility here in readxl.

from readxl.

wdkrnls avatar wdkrnls commented on May 18, 2024

The "B7:Z18" syntax sounds great! Just please make sure it can handle more than 26 columns.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

@wdkrnls What is the max number of columns in Excel? In Google sheets, it's only 300. It is helpful to know this upper bound for the function that translates columns in "ABDC" positioning notation to actual column numbers.

from readxl.

smbache avatar smbache commented on May 18, 2024

According to this source excel 2010 is limited to 16384 columns.

https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

from readxl.

wdkrnls avatar wdkrnls commented on May 18, 2024

My excel sheets regularly have hundreds of columns by the time I get them. I don't have exact statistics, but my data come from arrays of micro reactors hooked up to online chromatography and have between 500-1000 columns covering concentrations of myriads of detectable chemicals.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

@hadley readxl currently has no dependencies other than Rcpp. Would you want that to be true of the range specifying stuff as well, i.e. it's part of the package philosophy?

from readxl.

hadley avatar hadley commented on May 18, 2024

I'd prefer to keep dependencies minimal, but it seems like this code needs to be shared between readxl and spreadr

from readxl.

jennybc avatar jennybc commented on May 18, 2024

I'll try to do it that way then. And look wistfully at stringr and %>%.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

@hadley

In a branch, I isolated the the cell specification functions and refactored them to have no dependencies:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R

It's basically 3 pairs of functions:

  • letter_to_num() <--> num_to_letter() for converting column IDs, e.g., ABD <--> 732
  • A1_to_RC() <--> RC_to_A1() for converting between positioning notation, e.g, AB10 <--> R10C28
  • convert_range_to_limit_list() <--> convert_limit_list_to_range() for, e.g., A1:C4 <--> list(min-row= 1,min-col= 1,max-row= 4,max-col= 3)

I wanted to check back in before going further. Hopefully this get things rolling, to settle on user-facing bits. I don't export any of the above.

Tests are here:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/tests/testthat/test-cell-specification.R

from readxl.

wdkrnls avatar wdkrnls commented on May 18, 2024

Would I be able to use excel_range to select all columns to the right?

e.g. excel_range(c(15, NA), c(NA, NA))

This feature would be very useful for reading and cleaning the output of one of the data sources I have.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

@wdkrnls Yes that's my interpretation of the proposal and what the NAs would mean. However, I would interpret your example as requesting to read from row 15 on, not column. I'd assume this:

excel_range(c(min-row, max-row), c(min-col, max-col))

where NA for a max means no limit and NA for a min is equivalent to specifying the value as 1.

from readxl.

wdkrnls avatar wdkrnls commented on May 18, 2024

Fair enough, I was imagining specifying the corners of the box as c(column, row) pairs analogous to the excel syntax, but your interpretation sounds great to me.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

@hadley Let me specify where I need your guidance.

Ways to specify cells. From most Excel-ish and least programmatically useful to least Excel-ish and most programmatically useful:

  • A1:E4
  • R1C1: R4C5
  • named vector or list, e.g. c(min_row = 1, max_row = 4, min_col = 1, max_col = 5)

Internally, my current manipulations always move things down that list above. But the way I read your proposed helper functions:

read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15))
read_excel(..., range = excel_range(c(2, NA), c(1, NA))

… they seem to imply transformations up the list, i.e. towards the A1:E4 style of cell specification. Which feels weird because internally we'll just reverse that and go all the way to named vector or list.

Should I just get over it and proceed anyway? I realize there is no concern about speed or anything.

Is it overkill to have an S3 class for a cell range that holds the specifications in all the formats, so it's always easy to print or use the one best for the task at hand?

Do you approve of row or column-based helpers:

read_excel(..., range = rows(1:5))
read_excel(..., range = rows(2:*))
read_excel(..., range = columns(3:9))
read_excel(..., range = columns("A:G"))

from readxl.

hadley avatar hadley commented on May 18, 2024

Altogether I think it would look something like this:

excel_range <- function(rows, col = cols) {
  stopifnot(is.numeric(rows), length(rows) == 2)
  stopifnot(is.numeric(cols), length(cols) == 2)

  structure(list(rows = rows, cols = cols), class = "excel_range")
}

rc <- function(x) {
  ...
  excel_range(rows, cols)
}

as.excel_range <- function(x) UseMethod("as.excel_range")
as.excel_range.excel_range <- function(x) x
as.excel_range.character <- function(x) rc(x)

and then read_excel would call as.excel_range() on its input. Does that make sense?

Row and column based helpers would be fine (although I'd worry a little bit about giving them such short names - generally better to preserve short names for widely applicable tools)

from readxl.

jennybc avatar jennybc commented on May 18, 2024

Perfect. I will push it forward and report back.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

I wrote the parts to process cell specification by the user.

This is still in a branch for me, but there I am using these functions myself. FWIW all is well re: tests and travis.

All the cell specification stuff that could be in common is in this file and there are no package dependencies:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R

Tests are here:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/tests/testthat/test-cell-specification.R

I renamed the class from excel_range to cell_limits, to make less Excel-specific and more descriptive (?). I also found it easier to ditch the rc() helper and just detect whether the range is in A1 or R1C1 notation.

Haven't added any row or column helpers yet.

This should put you in a position to modify read_excel() to accept calls like this now:

read_excel(..., range = "D12:F15")
read_excel(..., range = "R1C12:R6C15")
read_excel(..., range = cell_limits(c(1, 6), c(1, 15))
read_excel(..., range = cell_limits(c(2, NA), c(1, NA))

@wdkrnls BTW your original interpretation of the format of the limits is correct. It will be row min + max in one vector and col min + max in the other. sorry I misremembered this … I stand by my original response

from readxl.

jennybc avatar jennybc commented on May 18, 2024

I should probably implement @gshotwell's suggestion to ignore $s. Yes?

from readxl.

hadley avatar hadley commented on May 18, 2024

Looks good. I agree you should just ignore $.

Next step is to figure out where to put this? Maybe we should have a small cell tools package?

from readxl.

jennybc avatar jennybc commented on May 18, 2024

I can put this in a little package. Call it … sheetcells, cellranges, ???

I'll ignore $ and experiment w/ row and column helpers. If it's to be a package for general cell helpers, I'll also pull in a function to take an anchor cell + some input (1d or 2d) and return the range of cells that would be affected by an edit. Which brings us full circle, i.e. back to @smbache's original post in this issue!

from readxl.

hadley avatar hadley commented on May 18, 2024

Cell ranges sounds good to me

from readxl.

wdkrnls avatar wdkrnls commented on May 18, 2024

Really excited to try this out!

from readxl.

jennybc avatar jennybc commented on May 18, 2024

I've put this stuff in a package, cellranger. I'll open an issue there, @hadley, with a few questions and comments.

from readxl.

smbache avatar smbache commented on May 18, 2024

You forgot to remove the last 'e' ;) it's convention!

from readxl.

jennybc avatar jennybc commented on May 18, 2024

The helper package is on CRAN now:

http://cran.r-project.org/web/packages/cellranger/index.html

from readxl.

kcandrews avatar kcandrews commented on May 18, 2024

Thanks so much for this! I'm really looking forward to never typing xlcFreeMemory() again.

from readxl.

eibanez avatar eibanez commented on May 18, 2024

Could this be extended to support the following?

read_excel(..., range = "Sheet1!D12:F15")

from readxl.

kcandrews avatar kcandrews commented on May 18, 2024

What's the advantage of that versus the already existing sheet= argument to read_excel?

from readxl.

eibanez avatar eibanez commented on May 18, 2024

It mirrors the representation that Excel uses for ranges, e.g., for named ranges (#79). Not a big deal, but wanted to throw it out there.

from readxl.

jennybc avatar jennybc commented on May 18, 2024

It's reasonable to think about expanding the notion of a cell_limits object to include the hosting (work)sheet or tab, in addition to the row and column limits. Both Excel and Google Sheets share that same high-level structure and we already must specify worksheet for all reads and writes.

I'm not exactly eager to rework the handling of worksheets … but it's worth considering.

from readxl.

bbolker avatar bbolker commented on May 18, 2024

Bump: is there a way to access this functionality yet? Sorry if it's been implemented and I missed it, but I claim it's not obvious ...

My use case is just to be able to select a limited number of rows. The spreadsheet has header rows interspersed with data (sigh): if I follow the advice to read the whole file (possibly skipping initial rows) and discard the stuff I don't need, then I have the hassle of converting columns back from character to numeric ...

from readxl.

jennybc avatar jennybc commented on May 18, 2024

@bbolker Not yet, but this package is now a main focus of mine. Once I'm done with basic triage of the issues/pull requests, this feature will be a high priority. Closing the longest-running issue will be a real pleasure.

from readxl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.