-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WASM-based parquet #2103
WASM-based parquet #2103
Conversation
modules/parquet/package.json
Outdated
@@ -42,6 +42,7 @@ | |||
"@loaders.gl/compression": "3.2.0-alpha.1", | |||
"@loaders.gl/loader-utils": "3.2.0-alpha.1", | |||
"@loaders.gl/schema": "3.2.0-alpha.1", | |||
"apache-arrow": "^7.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss... arrow 7 is a breaking change. I have a PR that updates to arrow 7 but have been holding it off for loaders.gl 4.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or make apache-arrow a peer dependency? The only functions used are arrow.tableFromIPC
(for reading) and arrow.tableToIPC
(for writing). Arrow 7 isn't necessary; I just did yarn add apache-arrow
and it chose the latest version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that those functions have changed name compared to pre v7 releases. v7 was a major refactor that massively improves tree-shaking of arrow but there are a bunch of breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I dropped the dev dependencies version down to ^4
and had to reimplement both tableToIPC
and tableFromIPC
using the lower level RecordBatchFileWriter
and RecordBatchStreamReader
.
loaders.gl/modules/parquet/src/lib/encode-parquet-wasm.ts
Lines 26 to 28 in 26317f7
export function tableToIPC(table: Table): Uint8Array { | |
return RecordBatchFileWriter.writeAll(table).toUint8Array(true); | |
} |
loaders.gl/modules/parquet/src/lib/parse-parquet-wasm.ts
Lines 21 to 28 in 26317f7
function tableFromIPC(input: ArrayBuffer): Table { | |
const reader = RecordBatchStreamReader.from(input); | |
const recordBatches: RecordBatch[] = []; | |
for (const recordBatch of reader) { | |
recordBatches.push(recordBatch); | |
} | |
return new Table(recordBatches); | |
} |
modules/parquet/test/data/files.js
Outdated
{supportedJs: false, supportedWasm: true, title: 'data_index_bloom_encoding_stats', path: 'good/data_index_bloom_encoding_stats.parquet'}, | ||
{supportedJs: false, supportedWasm: false, title: 'delta_binary_packed', path: 'good/delta_binary_packed.parquet'}, | ||
{supportedJs: false, supportedWasm: false, title: 'delta_byte_array', path: 'good/delta_byte_array.parquet'}, | ||
{supportedJs: false, supportedWasm: false, title: 'delta_encoding_optional_column', path: 'good/delta_encoding_optional_column.parquet'}, | ||
{supportedJs: false, supportedWasm: false, title: 'delta_encoding_required_column', path: 'good/delta_encoding_required_column.parquet'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are new files from https://github.com/apache/parquet-testing that didn't previously exist here.
* @param type Whether to serialize the Table as a file or a stream. | ||
*/ | ||
export function tableToIPC(table: Table): Uint8Array { | ||
return RecordBatchFileWriter.writeAll(table).toUint8Array(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is writeAll()
still present in v7 though? Those kind of static methods were a big part of what was removed in v7...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case yes. The above is the definition of tableToIPC
in v7:
https://github.com/apache/arrow/blob/e90472e35b40f58b17d408438bb8de1641bfe6ef/js/src/ipc/serialization.ts#L24-L51
I think the main thing that needs to be fixed in this PR is how to bundle the wasm in a way that works out of the box for node and browser. I tried to use a dynamic
Previously, it was in an import/require ESM/CJS mess. Node was complaining about the use of I'm not too experienced with these bundling issues so help would be greatly appreciated 🙂 |
@@ -34,6 +34,7 @@ export function makeStream<ArrayBuffer>( | |||
controller.close(); | |||
} else { | |||
// TODO - ignores controller.desiredSize | |||
// @ts-ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should change this into a @ts-expect-error
. I got an error about this originally but maybe it's now fixed on master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very excited for this PR! Added a few comments, although I must confess I have no experience bundling WASM modules so I may be totally off-base
import {WASM_SUPPORTED_FILES} from './data/files'; | ||
|
||
const PARQUET_DIR = '@loaders.gl/parquet/test/data/apache'; | ||
const wasmUrl = 'http://localhost:5000/node_modules/parquet-wasm/arrow1/esm_bg.wasm'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole flow of referencing an external wasm file and using the localhost:5000 override seems close to the flow we have for webworkers. Would it make sense to unify the two? I.e. create a small web worker script that includes the WASM and implements the parse interface? I know this would mean losing the non-worker loader, but how much use is this if need to dynamically load a WASM blob from an external URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, I think both WASM and web workers need to be instantiated asynchronously, right? I don't know much about packaging web workers. In the geopackage loader (which relies on the WASM sql.js
) we have a similar process where the WASM is loaded from an external source that the user can configure:
loaders.gl/modules/geopackage/src/lib/parse-geopackage.ts
Lines 120 to 131 in 7acfc31
async function loadDatabase(arrayBuffer: ArrayBuffer, sqlJsCDN: string | null): Promise<Database> { | |
// In Node, `locateFile` must not be passed | |
let SQL: SqlJsStatic; | |
if (sqlJsCDN) { | |
SQL = await initSqlJs({ | |
locateFile: (file) => `${sqlJsCDN}${file}` | |
}); | |
} else { | |
SQL = await initSqlJs(); | |
} | |
return new SQL.Database(new Uint8Array(arrayBuffer)); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was more thinking in terms of packaging. A number of loaders have a build-worker
target in package.json
which then automatically creates a worker file on the npm and provides a way to use a local version of the worker in unit tests. The idea was that you could create a similar build-worker
step for parquet-wasm which would abstract away all the WASM loading into that worker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixpalmer In principle that is a good idea. Are you good at wasm bundling / loading? I have looked into this briefly in the past but overall this seemed like a messy corner of javascript and I didn't find any approach that I got excited about. If someone can demonstrate a convincing solution that seems reasonably simple, generic, portable and future proof I'd be happy to consider it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I have no experience and share the view that it seems to be a messy part of JavaScript
Thanks. Should be finally green now. |
No description provided.