Feature: Read multiple worksheets (of same structure) at once #938

Open
2 tasks done
christianknoepfle opened this issue Feb 27, 2025 · 4 comments


@christianknoepfle
Contributor

Am I using the newest version of the library?

  • I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

As of now I need to define the exact sheetName or provide a sheet index number to read data (via dataAddress).

Expected Behavior

The idea is to provide a regex as dataAddress, e.g. "sheet_[0-9]+"; all sheets whose names match the regex are then read and returned as a single DataFrame. All such sheets must share the same schema / structure. To mark the sheet name as a regex, we could surround it with e.g. "[...]" (since [ is not a valid sheet name character).

As an alternative, one could implement a function that queries all the sheet names, finds the matching ones, and then calls the DataFrameReader for each sheet individually.

@nightscape, would that be a feature that fits spark-excel? If so, any thoughts about the implementation?

Best regards

Christian

Steps To Reproduce

No response

Environment

Anything else?

No response

@nightscape
Owner

Hi Christian, there were multiple requests regarding reading multiple sheets from an Excel file.
I think the Regex idea is nice. Is / allowed in sheet names? That is the notation for regexes in many languages.
Regarding the implementation:
In the best case we add the "round-trip", i.e. we allow writing to multiple sheets as well.
That way we can reuse the IntegrationTest which basically tests that you can read what you can write.

@christianknoepfle
Contributor Author

/ is a good idea, it is not an allowed char, so that fits.
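To make the /regex/ notation concrete, here is a minimal sketch of how the sheet part of a dataAddress could be interpreted. `SheetSelector` and `selectSheets` are hypothetical names, not part of spark-excel, and the sketch assumes the cell-range part of the address has already been stripped off:

```scala
import scala.util.matching.Regex

object SheetSelector {
  // Hypothetical helper, not part of spark-excel.
  // A sheet address wrapped in slashes ("/.../") is treated as a regex,
  // anything else as a literal sheet name.
  def selectSheets(sheetAddress: String, sheetNames: Seq[String]): Seq[String] = {
    val isRegex =
      sheetAddress.length > 2 && sheetAddress.startsWith("/") && sheetAddress.endsWith("/")
    if (isRegex) {
      val regex: Regex = sheetAddress.substring(1, sheetAddress.length - 1).r
      // Require the whole sheet name to match, not just a substring of it
      sheetNames.filter(name => regex.pattern.matcher(name).matches())
    } else {
      sheetNames.filter(_ == sheetAddress)
    }
  }
}
```

Matching the full name (rather than searching for a substring) is a design choice in this sketch: with it, "/sheet_[0-9]+/" would not pick up a sheet called "sheet_10_old".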

@christianknoepfle
Contributor Author

christianknoepfle commented Feb 27, 2025

From an implementation perspective, I am wondering if this is the place to start:

override def readFrom(workbook: Workbook): Iterator[Vector[Cell]] = {

"Just" loop over all matching sheets and return a single iterator as before. The only caveat I see is that from the 2nd sheet onwards we need to skip the header line.
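A rough sketch of that loop, using plain strings as a stand-in for POI cells so it runs without any dependencies. `MultiSheetRows`, `readAll` and the `hasHeader` flag are hypothetical names for illustration, not existing spark-excel code:

```scala
object MultiSheetRows {
  // Stand-in for one row; the real code would work with Vector[Cell] from POI.
  type Row = Vector[String]

  // Lazily concatenates the rows of all sheets into a single iterator.
  // When hasHeader is true, the first row of every sheet after the first
  // is dropped, so the header line appears only once in the result.
  def readAll(sheets: Seq[Iterator[Row]], hasHeader: Boolean): Iterator[Row] =
    sheets.iterator.zipWithIndex.flatMap { case (rows, idx) =>
      if (hasHeader && idx > 0) rows.drop(1) else rows
    }
}
```

Since everything stays an Iterator, later sheets would only be consumed once the earlier ones are exhausted.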

Your thoughts on this? Or would you see a better approach?

@nightscape
Owner

That looks like the correct entrypoint 👍
Before adding additional functionality, I would first refactor the code to use a Seq[Sheet] wherever a single Sheet is currently used, initially just passing a single Seq(sheet). This should be doable while refactoring along green (i.e. letting the tests run frequently, maybe with mill -w ...test).

Then it would be a good time to think about testing:
In the best case, we introduce an additional "symmetric" feature for limiting the number of rows per sheet and continuing writing in the next sheet. For the sheet names one could e.g. use RgxGen to generate Strings adhering to a regex.
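As a sketch of what the "symmetric" write side could look like, the following splits rows into per-sheet chunks and repeats the header on each sheet. `SheetSplitter`, `split`, `maxRowsPerSheet` and the default `sheet_N` naming are assumptions for illustration; fixed names are used here instead of RgxGen to keep the example dependency-free:

```scala
object SheetSplitter {
  // Stand-in for one row; the real writer would produce POI cells.
  type Row = Vector[String]

  // Splits the data rows into chunks of at most maxRowsPerSheet rows and
  // prepends the header to every chunk. Returns (sheet name, rows) pairs.
  def split(
      header: Row,
      rows: Seq[Row],
      maxRowsPerSheet: Int,
      sheetName: Int => String = i => s"sheet_$i"
  ): Seq[(String, Seq[Row])] = {
    require(maxRowsPerSheet > 0, "maxRowsPerSheet must be positive")
    rows
      .grouped(maxRowsPerSheet)
      .zipWithIndex
      .map { case (chunk, idx) => sheetName(idx + 1) -> (header +: chunk) }
      .toVector
  }
}
```

The default names match the regex sheet_[0-9]+ from the original proposal, so a round-trip test could write with this kind of splitting and read everything back through a regex dataAddress.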
