Comments (40)
Shouldn't be too hard to translate (e.g.) B5:X17
to right number of rows and cols to skip/read
from readxl.
@zentree would you prefer a specification like this?
from readxl.
I'd second this, both the functionality of being able to pull out a rectangular selection and the ability to specify that selection with Excel-like references. I'm currently using XLConnect's readWorksheet(..., startCol=, endCol=, startRow=, endRow=)
syntax but it's a bit long-winded.
from readxl.
Specifying the number of rows should not be a must, as you typically don't know the number of rows,
so "A1:Z32" is typically not something you know within R, but I guess as long e.g. "B:Z" would work it's fine.
from readxl.
I would go with the "A1:Z31" syntax. Typically an excel user will be referencing the data cells anyway in the sheet so they'll be thinking in those terms. You might also want to have it skip the "$" notation in excel, so that "A$1:$B$32" returns the same cell range as A1:B32, just because the user might be copying the reference from an excel sheet. .
from readxl.
Rectangular specification would be a nice way to deal with number of rows too. From a user point of view I have to deal with two cases:
- I'm processing a large number of files—large enough so I can't check every one of them—which would have a default format. These are often machine generated.
- I'm reading a small number of files created by hand, where I need to extract a variable rectangular array.
I see a notation like A1:Z32 dealing mostly with case 2.
from readxl.
Excellent new package; so is readr.
I can see readxl replacing my old workhorse XLConnect very soon (no more hassles with java updates). Also much easier for new users.
I'd use two extra options a lot:
- header=TRUE/FALSE
- startrow=n
from readxl.
I'm doing exactly this over in googlesheets, which consumes (and writes) data from (to) Google Spreadsheets.
For the targeted data consumption discussed here, the Sheets API forces you to use a cell-by-cell approach (the incredibly slow "cell feed"). In this case, my function lets user specify the data rectangle via min row, max row, min col, max col. There are convenience wrappers to get one or more rows, one or more columns, or a region specified like B3:G17 or R3C2:R17C7 (the other standard positioning notation). Then that output can be processed by one of two functions for reshaping and/or transformation, where there is an argument for header = TRUE/FALSE.(Hmmm... I should probably write even more wrappers to package that sequence of actions.)
https://github.com/jennybc/googlesheets/blob/master/R/consume-data.R
I'm really interested in this thread because it would be great to keep the two interfaces as similar as possible. I even volunteer to help!
from readxl.
@jennybc I'd love to incorporate your cell specification code in readxl.
I think bundling all the options into one argument plus helper functions would be ideal. Something like this:
read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15))
read_excel(..., range = excel_range(c(2, NA), c(1, NA))
from readxl.
OK I will isolate and upgrade cell specification over in googlesheets and get back to you. The helper functions are nice idea.
- Would
range =
+rc()
andexcel_range()
be the only way to restrict cell consumption or … would you allow cell specification via, e.g.,row = 3
orcol = "B:D"
? Or perhapsrange = rows(3)
andrange = cols("B:D")
orrange = cols(B:D)
? FWIW I am not prepared to deal with any requests for non-contiguous rows or cells, so my inclination is to disallow that. - I accept a single cell anywhere I accept a range, i.e.
range = "D14"
is just as valid asrange = "D14:G20"
. I assume that's OK. - Ultimately I need my cell limits as a named list, to pass as a query. So that's where all roads must lead re: cell specification in googlesheets. You probably have more control/responsibility here in readxl.
from readxl.
The "B7:Z18" syntax sounds great! Just please make sure it can handle more than 26 columns.
from readxl.
@wdkrnls What is the max number of columns in Excel? In Google sheets, it's only 300. It is helpful to know this upper bound for the function that translates columns in "ABDC" positioning notation to actual column numbers.
from readxl.
According to this source excel 2010 is limited to 16384 columns.
from readxl.
My excel sheets regularly have hundreds of columns by the time I get them. I don't have exact statistics, but my data come from arrays of micro reactors hooked up to online chromatography and have between 500-1000 columns covering concentrations of myriads of detectable chemicals.
from readxl.
@hadley readxl currently has no dependencies other than Rcpp. Would you want that to be true of the range specifying stuff as well, i.e. it's part of the package philosophy?
from readxl.
I'd prefer to keep dependencies minimal, but it seems like this code needs to be shared between readxl and spreadr
from readxl.
I'll try to do it that way then. And look wistfully at stringr and %>%
.
from readxl.
In a branch, I isolated the the cell specification functions and refactored them to have no dependencies:
https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R
It's basically 3 pairs of functions:
letter_to_num()
<-->num_to_letter()
for converting column IDs, e.g.,ABD
<--> 732A1_to_RC()
<-->RC_to_A1()
for converting between positioning notation, e.g,AB10
<-->R10C28
convert_range_to_limit_list()
<-->convert_limit_list_to_range()
for, e.g.,A1:C4
<-->list(
min-row= 1,
min-col= 1,
max-row= 4,
max-col= 3)
I wanted to check back in before going further. Hopefully this get things rolling, to settle on user-facing bits. I don't export any of the above.
Tests are here:
from readxl.
Would I be able to use excel_range to select all columns to the right?
e.g. excel_range(c(15, NA), c(NA, NA))
This feature would be very useful for reading and cleaning the output of one of the data sources I have.
from readxl.
@wdkrnls Yes that's my interpretation of the proposal and what the NAs would mean. However, I would interpret your example as requesting to read from row 15 on, not column. I'd assume this:
excel_range(c(min-row, max-row), c(min-col, max-col))
where NA for a max means no limit and NA for a min is equivalent to specifying the value as 1.
from readxl.
Fair enough, I was imagining specifying the corners of the box as c(column, row) pairs analogous to the excel syntax, but your interpretation sounds great to me.
from readxl.
@hadley Let me specify where I need your guidance.
Ways to specify cells. From most Excel-ish and least programmatically useful to least Excel-ish and most programmatically useful:
- A1:E4
- R1C1: R4C5
- named vector or list, e.g. c(min_row = 1, max_row = 4, min_col = 1, max_col = 5)
Internally, my current manipulations always move things down that list above. But the way I read your proposed helper functions:
read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15))
read_excel(..., range = excel_range(c(2, NA), c(1, NA))
… they seem to imply transformations up the list, i.e. towards the A1:E4
style of cell specification. Which feels weird because internally we'll just reverse that and go all the way to named vector or list.
Should I just get over it and proceed anyway? I realize there is no concern about speed or anything.
Is it overkill to have an S3 class for a cell range that holds the specifications in all the formats, so it's always easy to print or use the one best for the task at hand?
Do you approve of row or column-based helpers:
read_excel(..., range = rows(1:5))
read_excel(..., range = rows(2:*))
read_excel(..., range = columns(3:9))
read_excel(..., range = columns("A:G"))
from readxl.
Altogether I think it would look something like this:
excel_range <- function(rows, col = cols) {
stopifnot(is.numeric(rows), length(rows) == 2)
stopifnot(is.numeric(cols), length(cols) == 2)
structure(list(rows = rows, cols = cols), class = "excel_range")
}
rc <- function(x) {
...
excel_range(rows, cols)
}
as.excel_range <- function(x) UseMethod("as.excel_range")
as.excel_range.excel_range <- function(x) x
as.excel_range.character <- function(x) rc(x)
and then read_excel
would call as.excel_range()
on its input. Does that make sense?
Row and column based helpers would be fine (although I'd worry a little bit about giving them such short names - generally better to preserve short names for widely applicable tools)
from readxl.
Perfect. I will push it forward and report back.
from readxl.
I wrote the parts to process cell specification by the user.
This is still in a branch for me, but there I am using these functions myself. FWIW all is well re: tests and travis.
All the cell specification stuff that could be in common is in this file and there are no package dependencies:
https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R
Tests are here:
I renamed the class from excel_range
to cell_limits
, to make less Excel-specific and more descriptive (?). I also found it easier to ditch the rc()
helper and just detect whether the range is in A1
or R1C1
notation.
Haven't added any row or column helpers yet.
This should put you in a position to modify read_excel()
to accept calls like this now:
read_excel(..., range = "D12:F15")
read_excel(..., range = "R1C12:R6C15")
read_excel(..., range = cell_limits(c(1, 6), c(1, 15))
read_excel(..., range = cell_limits(c(2, NA), c(1, NA))
@wdkrnls BTW your original interpretation of the format of the limits is correct. It will be row min + max in one vector and col min + max in the other. sorry I misremembered this … I stand by my original response
from readxl.
I should probably implement @gshotwell's suggestion to ignore $
s. Yes?
from readxl.
Looks good. I agree you should just ignore $
.
Next step is to figure out where to put this? Maybe we should have a small cell tools package?
from readxl.
I can put this in a little package. Call it … sheetcells
, cellranges
, ???
I'll ignore $
and experiment w/ row and column helpers. If it's to be a package for general cell helpers, I'll also pull in a function to take an anchor cell + some input (1d or 2d) and return the range of cells that would be affected by an edit. Which brings us full circle, i.e. back to @smbache's original post in this issue!
from readxl.
Cell ranges sounds good to me
from readxl.
Really excited to try this out!
from readxl.
I've put this stuff in a package, cellranger
. I'll open an issue there, @hadley, with a few questions and comments.
from readxl.
You forgot to remove the last 'e' ;) it's convention!
from readxl.
The helper package is on CRAN now:
http://cran.r-project.org/web/packages/cellranger/index.html
from readxl.
Thanks so much for this! I'm really looking forward to never typing xlcFreeMemory()
again.
from readxl.
Could this be extended to support the following?
read_excel(..., range = "Sheet1!D12:F15")
from readxl.
What's the advantage of that versus the already existing sheet=
argument to read_excel
?
from readxl.
It mirrors the representation that Excel uses for ranges, e.g., for named ranges (#79). Not a big deal, but wanted to throw it out there.
from readxl.
It's reasonable to think about expanding the notion of a cell_limits
object to include the hosting (work)sheet or tab, in addition to the row and column limits. Both Excel and Google Sheets share that same high-level structure and we already must specify worksheet for all reads and writes.
I'm not exactly eager to rework the handling of worksheets … but it's worth considering.
from readxl.
Bump: is there a way to access this functionality yet? Sorry if it's been implemented and I missed it, but I claim it's not obvious ...
My use case is just to be able to select a limited number of rows. The spreadsheet has header rows interspersed with data (sigh): if I follow the advice to read the whole file (possibly skipping initial rows) and discard the stuff I don't need, then I have the hassle of converting columns back from character to numeric ...
from readxl.
@bbolker Not yet, but this package is now a main focus of mine. Once I'm done with basic triage of the issues/pull requests, this feature will be a high priority. Closing the longest-running issue will be a real pleasure.
from readxl.
Related Issues (20)
- Compilation failure in Alpine Linux in Apptainer HOT 3
- subsetting columns when reading an Excel file HOT 4
- read_excel() converts date to number HOT 3
- Build error on 10.6: error: 'gid_t' has not been declared HOT 2
- zip path is too long HOT 5
- Release readxl 1.4.2 HOT 2
- Upkeep for readxl
- Update errors to rlang 1.0.0
- read_excel unable to open file if the file is .xls but it was saved as .xml HOT 1
- Problems reading files with Windows path containing special characters
- FR: Check Excel sheets and consistency with readr
- Access is denied warning on valid paths that get read successfully
- installing readxl package to R 4.1.3 running in anaconda navigator HOT 1
- Can read_excel() allow reading of open Excel workbooks/sheets? HOT 3
- Scientific counting display issues
- Release readxl 1.4.3 HOT 1
- Force col_types for vector of known columns, default guess for other columns HOT 1
- readxl::read_excel fails with `Error: vector`
- col_types = 'text_as_shown' ? HOT 1
- Multible unaddressed libxls vulnerabilities HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from readxl.