Feature: Read multiple worksheets (of same structure) at once #938

christianknoepfle · 2025-02-27T12:13:23Z

Am I using the newest version of the library?

I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

As of now I need to define the exact sheetName or provide a sheet index number to read data (via dataAddress).

Expected Behavior

The idea is to provide a regex as dataAddress, e.g. "sheet_[0-9]+"; all sheets whose names match the regex are then read and returned as a single DataFrame. All such sheets must share the same data schema / structure. To mark the sheet name as a regex, we could surround it with e.g. "[...]" (since [ is not a valid sheet name character).

As an alternative, one could implement a function to query all the sheet names, find the matching ones, and then call the DataFrameReader for each sheet individually.
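A minimal sketch of that alternative, kept free of Spark so it stays self-contained. `PerSheetRead`, `readMatching` and `readSheet` are hypothetical names, not part of spark-excel; in practice `readSheet` would wrap one reader call per sheet, and the concatenation would be a union of the resulting DataFrames.

```scala
object PerSheetRead {
  // Hypothetical helper: keep the sheet names that fully match the regex,
  // read each one with the caller-supplied function, and concatenate the rows.
  def readMatching[A](sheetNames: Seq[String], sheetRegex: String)(
      readSheet: String => Seq[A]
  ): Seq[A] =
    sheetNames
      .filter(_.matches(sheetRegex)) // String.matches requires a full match
      .flatMap(readSheet)            // rows come out in workbook order
}
```

The sheet names would have to be listed up front, which is the extra step this approach needs compared with a regex inside dataAddress.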

@nightscape, would that be a feature that fits spark-excel? If so, any thoughts about the implementation?

Best regards

Christian

Steps To Reproduce

No response

Environment

Anything else?

No response


nightscape · 2025-02-27T12:58:05Z

Hi Christian, there were multiple requests regarding reading multiple sheets from an Excel file.
I think the Regex idea is nice. Is / allowed in sheet names? That is the notation for regexes in many languages.
Regarding the implementation:
In the best case we add the "round-trip", i.e. we allow writing to multiple sheets as well.
That way we can reuse the IntegrationTest which basically tests that you can read what you can write.

christianknoepfle · 2025-02-27T13:13:49Z

/ is a good idea, it is not an allowed char, so that fits.
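To make the notation concrete, here is a rough sketch of how the sheet part of a dataAddress could be interpreted. Everything in it (`SheetSelection`, `SheetSelector`, `parse`, `select`) is hypothetical and not existing spark-excel API, and it assumes the cell-range part of the address has already been split off.

```scala
import scala.util.matching.Regex

object SheetSelection {
  // Hypothetical model: the sheet part either names one sheet or holds a regex.
  sealed trait SheetSelector
  final case class Exact(name: String) extends SheetSelector
  final case class Matching(pattern: Regex) extends SheetSelector

  // "/.../" marks a regex; this is unambiguous because "/" cannot occur in a
  // sheet name. Anything else is taken as a literal sheet name.
  def parse(sheetPart: String): SheetSelector =
    if (sheetPart.length > 2 && sheetPart.startsWith("/") && sheetPart.endsWith("/"))
      Matching(sheetPart.substring(1, sheetPart.length - 1).r)
    else
      Exact(sheetPart)

  // Filter in workbook order so the combined rows come out predictably.
  def select(selector: SheetSelector, sheetNames: Seq[String]): Seq[String] =
    selector match {
      case Exact(name)       => sheetNames.filter(_ == name)
      case Matching(pattern) => sheetNames.filter(n => pattern.pattern.matcher(n).matches())
    }
}
```

With this, `SheetSelection.select(SheetSelection.parse("/sheet_[0-9]+/"), names)` keeps only names such as `sheet_1` or `sheet_22`, while a plain `summary` still selects exactly one sheet.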

christianknoepfle · 2025-02-27T13:27:24Z

From an implementation perspective, I am wondering if this is the place to start:

spark-excel/src/main/scala/dev/mauch/spark/excel/v2/DataLocator.scala

Line 82 in c272104

override def readFrom(workbook: Workbook): Iterator[Vector[Cell]] = {

"Just" loop over all matching sheets and return a single iterator as before. The only caveat I see is that on the 2nd and later sheets we need to skip the header line.

Your thoughts on this? Or would you see a better approach?
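The loop and the header caveat could look roughly like this. It is only a sketch: rows are modelled as `Vector[String]` instead of `Vector[Cell]` so that it runs without POI, and `MultiSheetRows`, `readAll` and `hasHeader` are hypothetical names rather than anything in `DataLocator`.

```scala
object MultiSheetRows {
  // Stand-in for Vector[Cell]; the real readFrom works on POI cells.
  type Row = Vector[String]

  // Concatenate the rows of all sheets lazily into one iterator. When the
  // sheets carry a header line, keep it from the first sheet only and drop
  // it from the 2nd and later sheets.
  def readAll(sheets: Seq[Iterator[Row]], hasHeader: Boolean): Iterator[Row] =
    sheets.iterator.zipWithIndex.flatMap { case (rows, index) =>
      if (hasHeader && index > 0) rows.drop(1) else rows
    }
}
```

A single sheet is simply the one-element case, so the existing behaviour would be preserved when only one sheet matches.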

nightscape · 2025-02-28T14:51:27Z

That looks like the correct entrypoint 👍
Before adding additional functionality, I would do a refactoring for using a Seq[Sheet] where a Sheet is currently used and just providing a single Seq(sheet) in the beginning. This should be doable while refactoring along green (i.e. letting the tests run frequently, maybe with mill -w ...test).

Then it would be a good time to think about testing:
In the best case, we introduce an additional "symmetric" feature for limiting the number of rows per sheet and continuing writing in the next sheet. For the sheet names one could e.g. use RgxGen to generate Strings adhering to a regex.
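A sketch of the "symmetric" write side, under the assumption that sheets are simply numbered. `SheetSplitter`, `split`, `maxRowsPerSheet` and the `sheet_` prefix are hypothetical; names are generated with a counter here instead of RgxGen, chosen so that they match a read regex such as `sheet_[0-9]+`.

```scala
object SheetSplitter {
  // Split rows into chunks of at most maxRowsPerSheet and give each chunk a
  // numbered sheet name: sheet_1, sheet_2, ...
  def split[A](
      rows: Seq[A],
      maxRowsPerSheet: Int,
      prefix: String = "sheet_"
  ): Seq[(String, Seq[A])] = {
    require(maxRowsPerSheet > 0, "maxRowsPerSheet must be positive")
    rows
      .grouped(maxRowsPerSheet)
      .zipWithIndex
      .map { case (chunk, i) => (s"$prefix${i + 1}", chunk) }
      .toList
  }
}
```

Concatenating the chunks in order gives back the original rows, which is the round-trip property the IntegrationTest idea relies on.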
