feat: add binary string operations (length and concatenation) #3646

Merged: 38 commits, Jan 16, 2025
Commits
8aa6c7c
Added binary string length operation
f4t4nt Jan 7, 2025
2ca5a9a
Added binary string concat operation
f4t4nt Jan 7, 2025
3e0dae6
fix: update binary concat doctest output formatting
f4t4nt Jan 7, 2025
fa09d9b
style: fix docstring r-prefix and use Self in Rust code
f4t4nt Jan 7, 2025
b2a962a
fix: update binary concat doctest output to match actual formatting
f4t4nt Jan 7, 2025
cf73ba2
style: apply ruff formatting
f4t4nt Jan 7, 2025
1431261
style: apply ruff formatting to test file
f4t4nt Jan 7, 2025
5539f56
feat(binary): Support broadcasting in binary concat
f4t4nt Jan 7, 2025
f409ed5
(feat) minor binary string styling and test issues
f4t4nt Jan 7, 2025
1d8a9ce
feat(binary): Implement binary_substr
f4t4nt Jan 8, 2025
26958f0
test(binary): Add limit testing for binary_substr to identify edge cases
f4t4nt Jan 8, 2025
d6b65b9
feat(binary): Implement binary_substr with edge case handling
f4t4nt Jan 8, 2025
91a30fd
refactor(binary): Split binary tests into separate files and update e…
f4t4nt Jan 8, 2025
7142501
fix: Update docstring formatting and fix type annotation in binary_su…
f4t4nt Jan 8, 2025
3bfb388
fix(binary): Polish binary_substr implementation
f4t4nt Jan 8, 2025
75fc296
fix(binary): Further polish binary_substr implementation
f4t4nt Jan 8, 2025
3000144
test(binary): Add special character tests for binary_substr
f4t4nt Jan 8, 2025
a878746
fix: respect concatenation order in binary operations
f4t4nt Jan 8, 2025
1c26a63
fix: use simple_python_wrapper macro for binary substr
f4t4nt Jan 8, 2025
03160ea
test: add error tests for binary operations
f4t4nt Jan 8, 2025
67e44d0
test: make error message regex more flexible
f4t4nt Jan 8, 2025
5b1ebc2
refactor: remove unreachable error handling in binary ops
f4t4nt Jan 8, 2025
67d2666
chore: final polishes for git checks
f4t4nt Jan 9, 2025
d75130b
refactor: rename binary substr to slice for consistency
f4t4nt Jan 13, 2025
7b8fc92
fix: update error message patterns in binary array tests and simplify…
f4t4nt Jan 13, 2025
4a695b8
feat(binary): support null values in binary concat
f4t4nt Jan 13, 2025
82660c8
feat(binary): implement fixed size binary operations
f4t4nt Jan 14, 2025
e5a291c
fix(binary): use Self type and simplify iteration in fixed size binar…
f4t4nt Jan 14, 2025
9915afe
refactor: remove large tests and fix docstring
f4t4nt Jan 14, 2025
12928bc
refactor: use direct DataType comparisons instead of string comparisons
f4t4nt Jan 14, 2025
75793f7
refactor: generalized slice iter error messages
f4t4nt Jan 14, 2025
c887532
refactor: use cast() instead of into_binary() for type conversion
f4t4nt Jan 15, 2025
5dfefca
refactor: move binary_slice from series ops to slice function
f4t4nt Jan 15, 2025
1cf0ea6
refactor(binary): standardize iterators and fix substr behavior with …
f4t4nt Jan 16, 2025
4e4b23f
refactor(binary): simplify slice by always using binary implementation
f4t4nt Jan 16, 2025
ca7aead
test: remove utf8 tests as they will be fixed in a subsequent PR
f4t4nt Jan 16, 2025
22e7784
fix: binary slice edge cases and tests and remove redundant validity …
f4t4nt Jan 16, 2025
a973300
fix: restore utf8 test files (test_length.py and test_substr.py)
f4t4nt Jan 16, 2025
7 changes: 7 additions & 0 deletions daft/daft/__init__.pyi
@@ -1200,6 +1200,13 @@ def utf8_normalize(
    expr: PyExpr, remove_punct: bool, lowercase: bool, nfd_unicode: bool, white_space: bool
) -> PyExpr: ...

# ---
# expr.binary namespace
# ---
def binary_length(expr: PyExpr) -> PyExpr: ...
def binary_concat(left: PyExpr, right: PyExpr) -> PyExpr: ...
def binary_slice(expr: PyExpr, start: PyExpr, length: PyExpr | None = None) -> PyExpr: ...

class PyCatalog:
    @staticmethod
    def new() -> PyCatalog: ...
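
The `binary_length` and `binary_concat` stubs above only expose `PyExpr` handles, so their row-level behaviour is not visible from the signatures. As a rough mental model only, the sketch below mimics them in plain Python over lists of `bytes | None`. The helper names `length_rows` and `concat_rows` are hypothetical and not part of Daft. Null propagation and length-1 broadcasting are assumptions inferred from the commit messages ("support null values in binary concat", "Support broadcasting in binary concat"), not a description of the Rust implementation.

```python
from __future__ import annotations

from typing import Optional, Sequence

Row = Optional[bytes]


def length_rows(values: Sequence[Row]) -> list[Optional[int]]:
    """Byte length of each row; a null row stays null (assumed behaviour)."""
    return [None if v is None else len(v) for v in values]


def concat_rows(left: Sequence[Row], right: Sequence[Row]) -> list[Row]:
    """Row-wise left + right, preserving operand order.

    A side of length 1 is broadcast against the other side (assumed
    behaviour), and a null on either side yields a null result (assumed).
    """
    n = max(len(left), len(right))
    if len(left) not in (1, n) or len(right) not in (1, n):
        raise ValueError("lengths must match, or one side must have length 1")
    out: list[Row] = []
    for i in range(n):
        lhs = left[0] if len(left) == 1 else left[i]
        rhs = right[0] if len(right) == 1 else right[i]
        out.append(None if lhs is None or rhs is None else lhs + rhs)
    return out
```

For example, `concat_rows([b"a", b"b"], [b"!"])` broadcasts the single right-hand value across both rows in this model.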
106 changes: 106 additions & 0 deletions daft/expressions/expressions.py
@@ -273,6 +273,15 @@ def json(self) -> ExpressionJsonNamespace:
        """Access methods that work on columns of json."""
        return ExpressionJsonNamespace.from_expression(self)

    @property
    def binary(self) -> ExpressionBinaryNamespace:
        """Access binary string operations for this expression.

        Returns:
            ExpressionBinaryNamespace: A namespace containing binary string operations
        """
        return ExpressionBinaryNamespace.from_expression(self)

    @staticmethod
    def _from_pyexpr(pyexpr: _PyExpr) -> Expression:
        expr = Expression.__new__(Expression)
@@ -3554,3 +3563,100 @@ class ExpressionEmbeddingNamespace(ExpressionNamespace):
    def cosine_distance(self, other: Expression) -> Expression:
        """Compute the cosine distance between two embeddings."""
        return Expression._from_pyexpr(native.cosine_distance(self._expr, other._expr))


class ExpressionBinaryNamespace(ExpressionNamespace):
    def length(self) -> Expression:
        """Retrieves the length for a binary string column.

        Example:
            >>> import daft
            >>> df = daft.from_pydict({"x": [b"foo", b"bar", b"baz"]})
            >>> df = df.select(df["x"].binary.length())
            >>> df.show()
            ╭────────╮
            │ x      │
            │ ---    │
            │ UInt64 │
            ╞════════╡
            │ 3      │
            ├╌╌╌╌╌╌╌╌┤
            │ 3      │
            ├╌╌╌╌╌╌╌╌┤
            │ 3      │
            ╰────────╯
            <BLANKLINE>
            (Showing first 3 of 3 rows)

        Returns:
            Expression: a UInt64 expression with the length of each binary string in bytes
        """
        return Expression._from_pyexpr(native.binary_length(self._expr))

    def concat(self, other: Expression) -> Expression:
        r"""Concatenates two binary strings.

        Example:
            >>> import daft
            >>> df = daft.from_pydict(
            ...     {"a": [b"Hello", b"\\xff\\xfe", b"", b"World"], "b": [b" World", b"\\x00", b"empty", b"!"]}
            ... )
            >>> df = df.select(df["a"].binary.concat(df["b"]))
            >>> df.show()
            ╭────────────────────╮
            │ a                  │
            │ ---                │
            │ Binary             │
            ╞════════════════════╡
            │ b"Hello World"     │
            ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
            │ b"\\xff\\xfe\\x00" │
            ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
            │ b"empty"           │
            ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
            │ b"World!"          │
            ╰────────────────────╯
            <BLANKLINE>
            (Showing first 4 of 4 rows)

        Args:
            other: The binary string to concatenate with, can be either an Expression or a bytes literal

        Returns:
            Expression: A binary expression containing the concatenated strings
        """
        other_expr = Expression._to_expression(other)
        return Expression._from_pyexpr(native.binary_concat(self._expr, other_expr._expr))

    def slice(self, start: Expression | int, length: Expression | int | None = None) -> Expression:
        r"""Returns a slice of each binary string.

        Example:
            >>> import daft
            >>> df = daft.from_pydict({"x": [b"Hello World", b"\xff\xfe\x00", b"empty"]})
            >>> df = df.select(df["x"].binary.slice(1, 3))
            >>> df.show()
            ╭─────────────╮
            │ x           │
            │ ---         │
            │ Binary      │
            ╞═════════════╡
            │ b"ell"      │
            ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
            │ b"\xfe\x00" │
            ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
            │ b"mpt"      │
            ╰─────────────╯
            <BLANKLINE>
            (Showing first 3 of 3 rows)

        Args:
            start: The starting position (0-based) of the slice, in bytes.
            length: The length of the slice in bytes. If None, returns all bytes from start to the end.

        Returns:
            A new expression representing the slice.
        """
        start_expr = Expression._to_expression(start)
        length_expr = Expression._to_expression(length)
        return Expression._from_pyexpr(native.binary_slice(self._expr, start_expr._expr, length_expr._expr))
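
The `slice` docstring above describes a 0-based `start` plus an optional `length`. A minimal pure-Python sketch of that per-row rule may make the doctest output easier to check by hand. The helper `slice_bytes` is hypothetical and not part of Daft. The handling of nulls, of a `start` past the end of the value, and of negative arguments is an assumption of this sketch, not confirmed behaviour of `Expression.binary.slice`.

```python
from __future__ import annotations

from typing import Optional


def slice_bytes(value: Optional[bytes], start: int, length: Optional[int] = None) -> Optional[bytes]:
    """Return `length` bytes of `value` starting at byte offset `start`.

    Assumptions of this sketch: a null value stays null, negative
    arguments are rejected, and a slice that runs past the end is
    truncated rather than treated as an error.
    """
    if value is None:
        return None
    if start < 0 or (length is not None and length < 0):
        raise ValueError("start and length must be non-negative in this sketch")
    end = len(value) if length is None else start + length
    # Python slicing clamps `end` to len(value), which gives the truncation.
    return value[start:end]


# The three rows from the docstring example, sliced with start=1, length=3.
rows = [b"Hello World", b"\xff\xfe\x00", b"empty"]
sliced = [slice_bytes(row, 1, 3) for row in rows]
```

The second row has only two bytes after offset 1, which is why the docstring shows `b"\xfe\x00"` rather than a three-byte result.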
16 changes: 16 additions & 0 deletions docs/source/api_docs/expressions.rst
@@ -169,6 +169,22 @@ The following methods are available under the ``expr.str`` attribute.
    Expression.str.tokenize_decode
    Expression.str.count_matches

.. _api-binary-expression-operations:

Binary
######

The following methods are available under the ``expr.binary`` attribute.

.. autosummary::
    :nosignatures:
    :toctree: doc_gen/expression_methods
    :template: autosummary/accessor_method.rst

    Expression.binary.concat
    Expression.binary.length
    Expression.binary.slice

.. _api-float-expression-operations:

Floats
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -55,7 +55,7 @@ check-hidden = true
ignore-words-list = "crate,arithmetics,ser"
# Feel free to un-skip examples, and experimental, you will just need to
# work through many typos (--write-changes and --interactive will help)
-skip = "tests/series/*,target,.git,.venv,venv,data,*.csv,*.csv.*,*.html,*.json,*.jsonl,*.pdf,*.txt,*.ipynb,*.tiktoken,*.sql"
+skip = "tests/series/*,target,.git,.venv,venv,data,*.csv,*.csv.*,*.html,*.json,*.jsonl,*.pdf,*.txt,*.ipynb,*.tiktoken,*.sql,tests/table/utf8/*,tests/table/binary/*"

[tool.maturin]
# "python" tells pyo3 we want to build an extension module (skips linking against libpython.so)