Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] [SQL] Enable SQL query to run on callers scoped variables #2864

Merged
merged 2 commits into from
Sep 23, 2024

Conversation

amitschang
Copy link
Contributor

This builds a catalog based on the python globals and locals visible to the caller at the point where the sql query function is called, in the case where a catalog is not supplied. Otherwise, the catalog is final and must contain necessary tables.

resolves #2740

@github-actions github-actions bot added the enhancement New feature or request label Sep 20, 2024
@amitschang
Copy link
Contributor Author

A couple notes/thoughts:

  • This looks for variables specifically in the caller scope (of daft.sql), another possibility would be to walk the stack all the way back - but I wondered if this might cause confusion in cases where a module wraps the sql function but has a conflicting name in its scope.
  • Another way to get the python tables would be in the rust planner and walk the stack looking for specific named dataframes (since by this time it knows the table) using py03, but this would be more complex and I'm unsure that is necessary.
  • Not sure how you feel about the type: ignores. I think the only reasonable error is attribute access here and we'd see it when interpreter doesn't supply frames, so the current code is much less verbose ignoring the type issues (related to Optional[Frame]) and letting the exception handle things.

Thanks!

Copy link

codspeed-hq bot commented Sep 20, 2024

CodSpeed Performance Report

Merging #2864 will not alter performance

Comparing amitschang:gh-2740-sql-ctx-global (8b22e76) with main (b0f31e3)

Summary

✅ 17 untouched benchmarks

daft/sql/sql.py Outdated
# some interpreters might not implement currentframe; all reasonable
# errors above should be AttributeError
raise DaftCoreException("Cannot get caller environment, please provide a catalog") from exc
catalog = SQLCatalog({k: v for k, v in caller_vars.items() if isinstance(v, DataFrame)})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably do this in a followup PR, but it'd be nice to also support anything that can be zero copy converted into a DataFrame, such as:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I've never heard of the dataframe interchange protocol before! This seems interesting, but I do agree its a separate change since I think it would require some addition in SQLCatalog to get the logical plan from the zero copy table. Perhaps an issue for that would be good

Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor change, but otherwise looks good.

Comment on lines +57 to +75
try:
# Caller is back from func, analytics, annotation
caller_frame = inspect.currentframe().f_back.f_back.f_back # type: ignore
caller_vars = {**caller_frame.f_globals, **caller_frame.f_locals} # type: ignore
except AttributeError as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of making it use either the scoped variables, or the global context, we should be able to mix and match. I'd prefer having a register_globals parameter: sql(query, catalog, register_globals=False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. I will say that I think register_globals we might want to default to True. Because for example in an interactive or jupyter session one might get the table and then write some processing functionality within functions and it would be likely expected/desired that they are visible there. This might reduce friction for a big use case, and for library methods that need more control the option is there.

Thoughts on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes sorry for the confusion, Having register_globals default to True is what I intended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think I might have misunderstood that as a signature. Do you mean by default enable supplying a catalog and adding caller-visible variables to it? register_globals=False would mean don't activate this new code, essentially?

Copy link
Contributor

@universalmind303 universalmind303 Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to clarify, i meant sql(query, catalog, register_globals=False) as what the user would invoke if they did not want to register globals, the signature should be

def sql(sql: str, catalog: Optional[SQLCatalog] = None, register_globals: bool = True) -> DataFrame:

and the usage should be

  • sql(query) -> valid
  • sql(query, catalog) -> valid
  • sql(query, catalog, register_globals=False) -> valid
  • sql(query, register_globals=False) -> type error

once again, sorry for the confusion.

register_globals=False would mean don't activate this new code, essentially?

yes that's correct. Which means this should always be invalid

daft.sql(query, register_globals=False)


Not required, but we could go the extra step and @overload the method to prevent incorrect usage at the type level

I think this would give us the desired results

from typing import overload, Optional

@overload
def sql(sql: str) -> DataFrame: ...

@overload
def sql(sql: str, catalog: SQLCatalog, register_globals: bool = True) -> DataFrame: ...

@overload
def sql(sql: str, catalog: SQLCatalog = None, register_globals: bool = True) -> DataFrame: ...

def sql(sql: str, catalog: Optional[SQLCatalog] = None, register_globals: bool = True) -> DataFrame:
    # Implementation here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the clarification - now I see it better. I had thought of it as if the user supplies a catalog they could populate it as needed and wouldn't need the scoped vars - but this way may be more flexible in case a catalog is passed around - though at this point a rule would have to be imposed on precedence (presumably the catalog is higher than the vars).

For this, I'll have to add some features to SQLCatalog to add tables and get registered names. I'll look into that as soon as I can, but might be a second 😄

@universalmind303 universalmind303 changed the title [FEAT] [SQL] Eanable SQL query to run on callers scoped variables [FEAT] [SQL] Enable SQL query to run on callers scoped variables Sep 20, 2024
@amitschang amitschang force-pushed the gh-2740-sql-ctx-global branch from ec2c6eb to 8b22e76 Compare September 23, 2024 15:21
@amitschang
Copy link
Contributor Author

@universalmind303, I've made updates which I think capture the desired behavior. Since I don't think it is desired to have a side-effect of modifying the supplied catalog, this makes a new catalog copying from given, and also I've made catalog take precedence in name conflicts.

Also, I'm not sure about the overloads, I think the two added capture the desired cases since the default values are not consumed in checking - but I wondered if this could generate false negatives in the case a None value variable is used for catalog.

@universalmind303 universalmind303 self-assigned this Sep 23, 2024
@universalmind303 universalmind303 merged commit fd42281 into Eventual-Inc:main Sep 23, 2024
36 checks passed
@universalmind303
Copy link
Contributor

great work @amitschang. Let's see how this works out as is. We can adjust the precedence or typings later on if others run into issues.

@amitschang
Copy link
Contributor Author

@universalmind303, super - and thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SQL] Allow for running of SQL on the Python global() namespace
2 participants