Comments (10)
That sounds good. I will take a look at doing it that way.
from spark-excel.
Another option would be to add a virtual rowNumber column to the exported DataFrame and implement PrunedFilteredScan in addition to PrunedScan in ExcelRelation.
Then you could do something like
val preview = existingExcelDf.where(col("rowNumber") >= 5 && col("rowNumber") <= 1005)
What I'm not sure about, though, is whether you have to handle all filters (meaning Spark will not do any post-filtering) when you implement PrunedFilteredScan.
from spark-excel.
Ok, actually reading the documentation link I posted would have answered my question 😜
Spark will still do post-filtering, so it's OK to return false positives.
What do you think of implementing PrunedFilteredScan then and using that to do the start and end row logic?
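To make the idea concrete, here is a rough sketch of how the filters that Spark passes to PrunedFilteredScan.buildScan(requiredColumns, filters) could be turned into a start and end row. RowRange and fromFilters are hypothetical names, not part of spark-excel, and the rowNumber column is the virtual column proposed above. Because Spark re-applies the filters afterwards, the sketch only needs to return a superset of the matching rows, so anything it does not understand is simply ignored.

```scala
import org.apache.spark.sql.sources._

// Hypothetical helper (not part of spark-excel): derives an inclusive
// [start, end] row range from the filters handed to buildScan.
object RowRange {
  def fromFilters(filters: Array[Filter], column: String = "rowNumber"): (Long, Long) = {
    // Only integral literals are interpreted; everything else is left to Spark.
    def asLong(v: Any): Option[Long] = v match {
      case i: Int  => Some(i.toLong)
      case l: Long => Some(l)
      case _       => None
    }

    filters.foldLeft((0L, Long.MaxValue)) {
      // Strict comparisons are widened to inclusive bounds. That may return
      // an extra row, which Spark's post-filtering removes again.
      case ((lo, hi), GreaterThanOrEqual(`column`, v)) =>
        asLong(v).fold((lo, hi))(n => (math.max(lo, n), hi))
      case ((lo, hi), GreaterThan(`column`, v)) =>
        asLong(v).fold((lo, hi))(n => (math.max(lo, n), hi))
      case ((lo, hi), LessThanOrEqual(`column`, v)) =>
        asLong(v).fold((lo, hi))(n => (lo, math.min(hi, n)))
      case ((lo, hi), LessThan(`column`, v)) =>
        asLong(v).fold((lo, hi))(n => (lo, math.min(hi, n)))
      case ((lo, hi), EqualTo(`column`, v)) =>
        asLong(v).fold((lo, hi))(n => (math.max(lo, n), math.min(hi, n)))
      // Filters on other columns or of other kinds do not narrow the range.
      case (range, _) => range
    }
  }
}
```

For the preview query above, the two pushed-down filters on rowNumber would give (5, 1005), and the reader could then stop parsing the sheet after row 1005 instead of loading the whole file.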
from spark-excel.
Is anyone working on it? My project would benefit from such a feature a lot.
from spark-excel.
Just iterating over all issues again. Having this implemented should also fix #59.
@davidw76 @jaceklaskowski are you still interested in this?
from spark-excel.
Yes, for us it's more important to be able to ingest the first n rows, but for others I guess skipping n rows is more useful. So it would be good to be able to specify a range.
from spark-excel.
What I'm wondering is if we should handle the "skipping n rows because the header row is in line n+1" separately (see #65)?
Then the proposed solution with the artificial row_number column would still work and would allow additional features like
- returning the first 100 and the last 100 rows (the end of the file might be interesting, too)
- getting a column that uniquely identifies each row for free
OTOH, if all the majority of users need is a start and an end row then it would be overkill...
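As an illustration of the first bullet, this is roughly what "first n and last n rows" could look like on the user side, assuming the DataFrame already carries such a column. HeadAndTail and firstAndLast are made-up names, and the sketch assumes the row numbers are 0-based and contiguous; neither is something spark-excel guarantees.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max}

// Hypothetical helper: keeps the first n and the last n rows of a DataFrame
// that has a numeric row-number column (assumed 0-based and contiguous).
object HeadAndTail {
  def firstAndLast(df: DataFrame, n: Long, rowCol: String = "row_number"): DataFrame =
    // max(...) yields null for an empty DataFrame, hence the Option wrapper.
    Option(df.agg(max(col(rowCol))).head().get(0)) match {
      case Some(last: Number) =>
        df.where(col(rowCol) < n || col(rowCol) > last.longValue - n)
      case _ =>
        df // nothing to trim
    }
}
```

Note that this needs one extra pass over the data to find the highest row number, which is part of why a plain start/end row option may be the simpler choice for most users.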
from spark-excel.
I just released 0.11.0. Please have a look at the corresponding CHANGELOG and the README about the new dataAddress option which supersedes sheetName, startColumn, endColumn and skipFirstRows and should hopefully cover this use-case as well.
from spark-excel.
@nightscape - I do not see skipFirstRows as one of the ExcelOptions. Is there a way to skip the initial rows of the Excel file before reading the header columns? Please let me know, as I need this urgently.
from spark-excel.
Check out dataAddress
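For example, a read that starts at the fifth row might look like the sketch below. The sheet name Sheet1, the start cell A5 and the wrapper names are placeholders; the header option was called useHeader in older releases such as 0.11.x, so check the README of the version you are on.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative wrapper; SkipRowsExample and readFromRow5 are made-up names.
object SkipRowsExample {
  def readFromRow5(spark: SparkSession, path: String): DataFrame =
    spark.read
      .format("com.crealytics.spark.excel")
      // Start cell of the data: rows 1 to 4 are skipped and row 5 is
      // treated as the header row.
      .option("dataAddress", "'Sheet1'!A5")
      // Option name is version dependent (useHeader in older releases).
      .option("header", "true")
      .load(path)
}
```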
from spark-excel.