
[Data]: Categorizer fails with non-uniform distributions #50792

Open
gero90 opened this issue Feb 21, 2025 · 1 comment
Labels
bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

gero90 commented Feb 21, 2025

What happened + What you expected to happen

When using ray.data.preprocessors.Categorizer on a non-uniformly distributed column, fit() fails with:
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order.

If the column is uniformly distributed, the Categorizer works fine. It also works fine if we fit on a small random sample of the dataset.

Versions / Dependencies

python==3.12.8

ray==2.41.0
pandas==2.2.3
numpy==1.26.4
pyarrow==18.1.0

Reproduction script

from ray.data.preprocessors import Categorizer
import pandas as pd
import numpy as np
import ray

# Create a pandas DataFrame: one column with 12 unique strings, heavily skewed
# (the probability of string_i is proportional to 2**i)
strings = [f"string_{i}" for i in range(12)]
probabilities = np.array([2.0**i for i in range(12)])
data = np.random.choice(strings, size=10_000, p=probabilities / probabilities.sum())
df = pd.DataFrame({"my_cat_col": data})
df.head()
#   my_cat_col
# 0  string_11
# 1  string_10
# 2  string_10
# 3   string_8
# 4  string_11

df["my_cat_col"].value_counts(normalize=True)
# my_cat_col
# string_11    0.5015
# string_10    0.2521
# string_9     0.1247
# ...

# convert to ray dataset, categorize column
ray.init()
ds = ray.data.from_pandas(df)
cat = Categorizer(columns=["my_cat_col"])
cat.fit(ds)
# ERROR: pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order

# This works: same fit() on a smaller random sample
cat.fit(ds.random_sample(0.1))
# Categorizer(columns=['my_cat_col'], dtypes={})
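
# This also works as a possible workaround (a sketch, assuming the full set of
# categories is known up front): passing an explicit pd.CategoricalDtype means
# fit() has nothing left to compute for this column, so the value-count
# aggregation that fails above is skipped entirely.
cat = Categorizer(
    columns=["my_cat_col"],
    dtypes={"my_cat_col": pd.CategoricalDtype(strings)},
)
cat.fit(ds)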

# This works: uniform distribution
data = np.random.choice(strings, size=10_000)
df = pd.DataFrame({"my_cat_col": data})
df["my_cat_col"].value_counts()
ds = ray.data.from_pandas(df)
cat = Categorizer(columns=["my_cat_col"])
cat.fit(ds)
# Categorizer(columns=['my_cat_col'], dtypes={})

Issue Severity

Medium: It is a significant difficulty but I can work around it.

gero90 added the bug and triage labels on Feb 21, 2025
gero90 (Author) commented Feb 21, 2025

Traceback:

---------------------------------------------------------------------------
SystemException                           Traceback (most recent call last)
SystemException: 

The above exception was the direct cause of the following exception:

RayTaskError(ArrowTypeError)              Traceback (most recent call last)
File Untitled-1:3
      1 ds = ray.data.from_pandas(df)
      2 cat = Categorizer(columns=["my_cat_col"])
----> 3 cat.fit(ds)

File /.venv/lib/python3.12/site-packages/ray/data/preprocessor.py:117, in Preprocessor.fit(self, ds)
    107 if fit_status in (
    108     Preprocessor.FitStatus.FITTED,
    109     Preprocessor.FitStatus.PARTIALLY_FITTED,
    110 ):
    111     warnings.warn(
    112         "`fit` has already been called on the preprocessor (or at least one "
    113         "contained preprocessors if this is a chain). "
    114         "All previously fitted state will be overwritten!"
    115     )
--> 117 fitted_ds = self._fit(ds)
    118 self._fitted = True
    119 return fitted_ds

File /.venv/lib/python3.12/site-packages/ray/data/preprocessors/encoder.py:533, in Categorizer._fit(self, dataset)
    529 columns_to_get = [
    530     column for column in self.columns if column not in set(self.dtypes)
    531 ]
    532 if columns_to_get:
--> 533     unique_indices = _get_unique_value_indices(
    534         dataset, columns_to_get, drop_na_values=True, key_format="{0}"
    535     )
    536     unique_indices = {
    537         column: pd.CategoricalDtype(values_indices.keys())
    538         for column, values_indices in unique_indices.items()
    539     }
    540 else:

File /.venv/lib/python3.12/site-packages/ray/data/preprocessors/encoder.py:608, in _get_unique_value_indices(dataset, columns, drop_na_values, key_format, max_categories, encode_lists)
    606 value_counts = dataset.map_batches(get_pd_value_counts, batch_format="pandas")
    607 final_counters = {col: Counter() for col in columns}
--> 608 for batch in value_counts.iter_batches(batch_size=None):
    609     for col, counters in batch.items():
    610         for counter in counters:

File /.venv/lib/python3.12/site-packages/ray/data/iterator.py:154, in DataIterator.iter_batches.<locals>._create_iterator()
    145 time_start = time.perf_counter()
    146 # Iterate through the dataset from the start each time
    147 # _iterator_gen is called.
    148 # This allows multiple iterations of the dataset without
    149 # needing to explicitly call `iter_batches()` multiple times.
    150 (
    151     ref_bundles_iterator,
    152     stats,
    153     blocks_owned_by_consumer,
--> 154 ) = self._to_ref_bundle_iterator()
    156 iterator = iter(
    157     iter_batches(
    158         ref_bundles_iterator,
   (...)
    169     )
    170 )
    172 dataset_tag = self._get_dataset_tag()

File /.venv/lib/python3.12/site-packages/ray/data/_internal/iterator/iterator_impl.py:28, in DataIteratorImpl._to_ref_bundle_iterator(self)
     24 def _to_ref_bundle_iterator(
     25     self,
     26 ) -> Tuple[Iterator[RefBundle], Optional[DatasetStats], bool]:
     27     ds = self._base_dataset
---> 28     ref_bundles_iterator, stats, executor = ds._plan.execute_to_iterator()
     29     ds._current_executor = executor
     30     return ref_bundles_iterator, stats, False

File /.venv/lib/python3.12/site-packages/ray/data/exceptions.py:89, in omit_traceback_stdout.<locals>.handle_trace(*args, **kwargs)
     87     raise e.with_traceback(None)
     88 else:
---> 89     raise e.with_traceback(None) from SystemException()

RayTaskError(ArrowTypeError): ray::MapBatches(get_pd_value_counts)() (pid=93539, ip=172.25.53.54)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 398, in __call__
    yield output_buffer.next()
          ^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/output_buffer.py", line 73, in next
    block_to_yield = self._buffer.build()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/delegating_block_builder.py", line 68, in build
    return self._builder.build()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/table_block.py", line 133, in build
    return self._concat_tables(tables)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 149, in _concat_tables
    return transform_pyarrow.concat(tables)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.12/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 333, in concat
    table = pyarrow.concat_tables(blocks, promote_options="default")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<string_10: int64, string_11: int64, string_2: int64, string_3: int64, string_4: int64, string_5: int64, string_6: int64, string_7: int64, string_8: int64, string_9: int64> output fields: struct<string_10: int64, string_11: int64, string_2: int64, string_3: int64, string_4: int64, string_5: int64, string_6: int64, string_7: int64, string_8: int64, string_9: int64, string_1: int64, string_0: int64>
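
From the traceback, the per-block value-count structs produced by get_pd_value_counts appear to end up with different fields: blocks that happen to contain no rows for the rare categories (string_0 and string_1 in the error above) yield a struct with only 10 fields, and pyarrow then refuses to promote it to the 12-field struct while concatenating. A minimal pyarrow-only sketch of that kind of mismatch (illustrative only, not verified against 18.1.0; the column contents below are made up):

import pyarrow as pa

# One block saw every category, the other block never saw the rare ones,
# so its struct column has fewer fields.
full = pa.table({"my_cat_col": [{"string_11": 5000, "string_0": 1}]})
partial = pa.table({"my_cat_col": [{"string_11": 4999}]})

# The same call Ray Data makes in transform_pyarrow.concat; expected to raise
# something like "ArrowTypeError: struct fields don't match or are in the wrong order".
pa.concat_tables([full, partial], promote_options="default")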

jcotant1 added the data label on Feb 21, 2025