
[BUG] Spark Excel not reading whole columns and is only reading specific data address ranges #930

bitbythecron opened this issue Feb 6, 2025 · 6 comments

bitbythecron commented Feb 6, 2025

Am I using the newest version of the library?

  • I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have a Java app that uses the Spark Excel library to read an Excel file into a Dataset<Row>. When I use the following configuration:

String filePath = "file:///Users/myuser/example-data.xlsx";
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", "'ExampleData'!A2:D7")
        .load(filePath);

This works beautifully and my Dataset<Row> is populated without any issues whatsoever. But the minute I tell it to read the whole columns A through D, it reads an empty Dataset<Row>:

// dataset will be empty
.option("dataAddress", "'ExampleData'!A:D")

This also happens if I set the sheetName and dataAddress separately:

// dataset will be empty
.option("sheetName", "ExampleData")
.option("dataAddress", "A:D")

And it also happens when, instead of providing the sheetName, I provide a sheetIndex:

// dataset will be empty; and I have experimented by setting it to 0 as well
// in case it is a 0-based index
.option("sheetIndex", 1)
.option("dataAddress", "A:D")

My question: is this expected behavior of the Spark Excel library, or is it a bug I have discovered, or am I not using the Options API correctly here?

Expected Behavior

As explained above, I would have expected all three option configurations to work, but only the first one does.

Steps To Reproduce

Code is provided above. I am pulling in the following Gradle libraries:

    implementation("org.apache.spark:spark-core_2.12:3.5.3")
    implementation("org.apache.spark:spark-sql_2.12:3.5.3")
    implementation("com.crealytics:spark-excel_2.12:3.5.1_0.20.4")
    implementation("com.databricks:spark-xml_2.12:0.18.0")

I am using a Java application (not Scala).

Environment

- Spark version: `3.5.3` (Scala 2.12)
- Spark-Excel version: `0.20.4` (built for Spark 3.5.1, Scala 2.12)
- OS: Mac Sequoia 15.3
- Cluster environment

Anything else?

No response

nightscape (Owner) commented Feb 7, 2025

Hi @bitbythecron, since you are familiar with Java:
Can you try what happens with the same file if you read it with Apache POI, manually create an AreaReference from your dataAddress, and query it for the min and max row and column numbers?
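A minimal sketch of that check, assuming POI's AreaReference parsing accepts the same address strings (sheet name and range reused from the examples above):

import org.apache.poi.ss.SpreadsheetVersion;
import org.apache.poi.ss.util.AreaReference;
import org.apache.poi.ss.util.CellReference;

// Parse the same dataAddress string that spark-excel receives.
AreaReference area = new AreaReference("'ExampleData'!A:D", SpreadsheetVersion.EXCEL2007);
CellReference first = area.getFirstCell();
CellReference last = area.getLastCell();
// POI row/column indices are 0-based.
System.out.println("min row=" + first.getRow() + ", min col=" + first.getCol());
System.out.println("max row=" + last.getRow() + ", max col=" + last.getCol());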

bitbythecron (Author) commented:

Hi, I will try this suggestion. Please give me until tomorrow and I will post the results.

bitbythecron (Author) commented:

Hi @nightscape, apologies that it took a few more days than expected to get this to you.

Please see this gist. You would need to modify the filePath and dataAddress to replicate it on your end, but I can tell you: this seems to be a limitation with Apache POI, not spark-excel.

When I use this line:

String dataAddress = "'Price Sheet'!A2:D7";

The program reads the file and correctly reads all the data between A2 and D7.

But when I change that to just read all data in the columns:

String dataAddress = "'Price Sheet'!A:D";

It doesn't throw an error, but it comes back empty.

With Apache POI being as mature as it is, I'm leaning towards this being expected behavior, though it's disappointing.

Can you think of any way around this limitation? In production, my app will be given Excel files without knowing the exact addresses/areas/bounds of the data to read.

This feels like a pretty huge limitation of POI to be honest! I mean, how many times are you going to know (ahead of time) the exact rows in the Excel file you are processing?!

Thanks for any input/advice here!

nightscape (Owner) commented:

Have you tried using a very high end address? I'm not 100% sure, but it might be that this just works.
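For example, something along these lines (1,048,576 being the maximum row count of an .xlsx sheet, so the range covers everything):

// Bound the range at the .xlsx row limit instead of the open-ended A:D.
.option("dataAddress", "'ExampleData'!A2:D1048576")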

bitbythecron (Author) commented Feb 11, 2025

Well, I guess that would work for most use cases, but I'm more apt to lean towards something like this (sketched below):

  1. Use POI directly to read the row count from the specific columns I'm interested in
  2. Convert this row count into an actual data address range. For instance, if the row count is 10 and I know I want to read columns A through D and ignore the header, that would translate to "A2:D11"
  3. Then use spark-excel with the reader options that will work

I hate to read the same file twice, but I think that's what I may be stuck with if you can't think of anything else. No worries either way, and thanks for the quick responses so far!
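A sketch of that two-pass approach, assuming a SparkSession named spark and reusing the file path and sheet name from the issue description (the readSheet wrapper is hypothetical):

import java.io.File;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper implementing steps 1-3 above.
Dataset<Row> readSheet(SparkSession spark) throws IOException {
    String dataAddress;
    // Step 1: POI reports the last populated row, 0-based.
    try (Workbook workbook = WorkbookFactory.create(new File("/Users/myuser/example-data.xlsx"))) {
        Sheet sheet = workbook.getSheet("ExampleData");
        // Step 2: convert to a concrete 1-based Excel range, starting below the header row.
        dataAddress = "'ExampleData'!A2:D" + (sheet.getLastRowNum() + 1);
    }
    // Step 3: hand the concrete range to spark-excel, mirroring the working config above.
    return spark.read()
            .format("com.crealytics.spark.excel")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("dataAddress", dataAddress)
            .load("file:///Users/myuser/example-data.xlsx");
}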

nightscape (Owner) commented:

If you want, you could create a PR that adds a method for reading the data address range from the WorkbookReader, as is already done for getting the sheetNames.
That wouldn't help performance-wise, but you wouldn't have to use POI directly.
It might also be that the data address is stored as metadata, so hopefully POI doesn't have to read the entire sheet to find out the maximum address.
