
[Data]: Categorizer fails with non-uniform distributions #50792

Open
gero90 opened this issue Feb 21, 2025 · 1 comment
Labels
bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

gero90 commented Feb 21, 2025

What happened + What you expected to happen

When using ray.data.preprocessors.Categorizer on a non-uniformly distributed column, fit() fails with:
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order.

If the column is uniformly distributed, the Categorizer works fine. It also works fine if we fit on a small random sample of the dataset.

Versions / Dependencies

python==3.12.8

ray==2.41.0
pandas==2.2.3
numpy==1.26.4
pyarrow==18.1.0

Reproduction script

from ray.data.preprocessors import Categorizer
import pandas as pd
import numpy as np
import ray

# Create a pandas DataFrame: one column with 12 unique strings, heavily skewed
# (the probability of string_i is proportional to 2**i)
strings = [f"string_{i}" for i in range(12)]
probabilities = np.array([2.0**i for i in range(12)])
data = np.random.choice(strings, size=10_000, p=probabilities / probabilities.sum())
df = pd.DataFrame({"my_cat_col": data})
df.head()
#   my_cat_col
# 0  string_11
# 1  string_10
# 2  string_10
# 3   string_8
# 4  string_11

df["my_cat_col"].value_counts(normalize=True)
# my_cat_col
# string_11    0.5015
# string_10    0.2521
# string_9     0.1247
# ...

# convert to ray dataset, categorize column
ray.init()
ds = ray.data.from_pandas(df)
cat = Categorizer(columns=["my_cat_col"])
cat.fit(ds)
# ERROR: pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order

# This works: same fit() on a smaller random sample
cat.fit(ds.random_sample(0.1))
# Categorizer(columns=['my_cat_col'], dtypes={})
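
# This also works as a possible workaround (a sketch, assuming the full set of
# categories is known up front): passing an explicit pd.CategoricalDtype means
# fit() has nothing left to compute for this column, so the value-count
# aggregation that fails above is skipped entirely.
cat = Categorizer(
    columns=["my_cat_col"],
    dtypes={"my_cat_col": pd.CategoricalDtype(strings)},
)
cat.fit(ds)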

# This works: uniform distribution
data = np.random.choice(strings, size=10_000)
df = pd.DataFrame({"my_cat_col": data})
df["my_cat_col"].value_counts()
ds = ray.data.from_pandas(df)
cat = Categorizer(columns=["my_cat_col"])
cat.fit(ds)
# Categorizer(columns=['my_cat_col'], dtypes={})

Issue Severity

Medium: It is a significant difficulty but I can work around it.

gero90 added the bug and triage labels on Feb 21, 2025
gero90 (Author) commented Feb 21, 2025

Traceback:

---------------------------------------------------------------------------
SystemException                           Traceback (most recent call last)
SystemException: 

The above exception was the direct cause of the following exception:

RayTaskError(ArrowTypeError)              Traceback (most recent call last)
File Untitled-1:3
      1 ds = ray.data.from_pandas(df)
      2 cat = Categorizer(columns=["my_cat_col"])
----> 3 cat.fit(ds)

File /.venv/lib/python3.12/site-packages/ray/data/preprocessor.py:117, in Preprocessor.fit(self, ds)
    107 if fit_status in (
    108     Preprocessor.FitStatus.FITTED,
    109     Preprocessor.FitStatus.PARTIALLY_FITTED,
    110 ):
    111     warnings.warn(
    112         "`fit` has already been called on the preprocessor (or at least one "
    113         "contained preprocessors if this is a chain). "
    114         "All previously fitted state will be overwritten!"
    115     )
--> 117 fitted_ds = self._fit(ds)
    118 self._fitted = True
    119 return fitted_ds

File /.venv/lib/python3.12/site-packages/ray/data/preprocessors/encoder.py:533, in Categorizer._fit(self, dataset)
    529 columns_to_get = [
    530     column for column in self.columns if column not in set(self.dtypes)
    531 ]
    532 if columns_to_get:
--> 533     unique_indices = _get_unique_value_indices(
    534         dataset, columns_to_get, drop_na_values=True, key_format="{0}"
    535     )
    536     unique_indices = {
    537         column: pd.CategoricalDtype(values_indices.keys())
    538         for column, values_indices in unique_indices.items()
    539     }
    540 else:

File /.venv/lib/python3.12/site-packages/ray/data/preprocessors/encoder.py:608, in _get_unique_value_indices(dataset, columns, drop_na_values, key_format, max_categories, encode_lists)
    606 value_counts = dataset.map_batches(get_pd_value_counts, batch_format="pandas")
    607 final_counters = {col: Counter() for col in columns}
--> 608 for batch in value_counts.iter_batches(batch_size=None):
    609     for col, counters in batch.items():
    610         for counter in counters:

File /.venv/lib/python3.12/site-packages/ray/data/iterator.py:154, in DataIterator.iter_batches.<locals>._create_iterator()
    145 time_start = time.perf_counter()
    146 # Iterate through the dataset from the start each time
    147 # _iterator_gen is called.
    148 # This allows multiple iterations of the dataset without
    149 # needing to explicitly call `iter_batches()` multiple times.
    150 (
    151     ref_bundles_iterator,
    152     stats,
    153     blocks_owned_by_consumer,
--> 154 ) = self._to_ref_bundle_iterator()
    156 iterator = iter(
    157     iter_batches(
    158         ref_bundles_iterator,
   (...)
    169     )
    170 )
    172 dataset_tag = self._get_dataset_tag()

File /.venv/lib/python3.12/site-packages/ray/data/_internal/iterator/iterator_impl.py:28, in DataIteratorImpl._to_ref_bundle_iterator(self)
     24 def _to_ref_bundle_iterator(
     25     self,
     26 ) -> Tuple[Iterator[RefBundle], Optional[DatasetStats], bool]:
     27     ds = self._base_dataset
---> 28     ref_bundles_iterator, stats, executor = ds._plan.execute_to_iterator()
     29     ds._current_executor = executor
     30     return ref_bundles_iterator, stats, False

File /.venv/lib/python3.12/site-packages/ray/data/exceptions.py:89, in omit_traceback_stdout.<locals>.handle_trace(*args, **kwargs)
     87     raise e.with_traceback(None)
     88 else:
---> 89     raise e.with_traceback(None) from SystemException()

RayTaskError(ArrowTypeError): ray::MapBatches(get_pd_value_counts)() (pid=93539, ip=172.25.53.54)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 398, in __call__
    yield output_buffer.next()
          ^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/output_buffer.py", line 73, in next
    block_to_yield = self._buffer.build()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/delegating_block_builder.py", line 68, in build
    return self._builder.build()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/table_block.py", line 133, in build
    return self._concat_tables(tables)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 149, in _concat_tables
    return transform_pyarrow.concat(tables)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 333, in concat
    table = pyarrow.concat_tables(blocks, promote_options="default")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<string_10: int64, string_11: int64, string_2: int64, string_3: int64, string_4: int64, string_5: int64, string_6: int64, string_7: int64, string_8: int64, string_9: int64> output fields: struct<string_10: int64, string_11: int64, string_2: int64, string_3: int64, string_4: int64, string_5: int64, string_6: int64, string_7: int64, string_8: int64, string_9: int64, string_1: int64, string_0: int64>
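
From the traceback, the per-block value-count structs produced by get_pd_value_counts appear to end up with different fields: blocks that happen to contain no rows for the rare categories (string_0 and string_1 in the error above) yield a struct with only 10 fields, and pyarrow then refuses to promote it to the 12-field struct while concatenating. A minimal pyarrow-only sketch of that kind of mismatch (illustrative only, not verified against 18.1.0; the column contents below are made up):

import pyarrow as pa

# One block saw every category, the other block never saw the rare ones,
# so its struct column has fewer fields.
full = pa.table({"my_cat_col": [{"string_11": 5000, "string_0": 1}]})
partial = pa.table({"my_cat_col": [{"string_11": 4999}]})

# The same call Ray Data makes in transform_pyarrow.concat; expected to raise
# something like "ArrowTypeError: struct fields don't match or are in the wrong order".
pa.concat_tables([full, partial], promote_options="default")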

jcotant1 added the data label on Feb 21, 2025