-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] [SQL] Enable SQL query to run on callers scoped variables #2864
[FEAT] [SQL] Enable SQL query to run on callers scoped variables #2864
Conversation
A couple notes/thoughts:
Thanks! |
CodSpeed Performance ReportMerging #2864 will not alter performanceComparing Summary
|
daft/sql/sql.py
Outdated
# some interpreters might not implement currentframe; all reasonable | ||
# errors above should be AttributeError | ||
raise DaftCoreException("Cannot get caller environment, please provide a catalog") from exc | ||
catalog = SQLCatalog({k: v for k, v in caller_vars.items() if isinstance(v, DataFrame)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can probably do this in a followup PR, but it'd be nice to also support anything that can be zero copy converted into a DataFrame
, such as:
- anything implementing dataframe interchange protocol
- pyarrow tables
- pyarrow recordbatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I've never heard of the dataframe interchange protocol before! This seems interesting, but I do agree its a separate change since I think it would require some addition in SQLCatalog to get the logical plan from the zero copy table. Perhaps an issue for that would be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor change, but otherwise looks good.
try: | ||
# Caller is back from func, analytics, annotation | ||
caller_frame = inspect.currentframe().f_back.f_back.f_back # type: ignore | ||
caller_vars = {**caller_frame.f_globals, **caller_frame.f_locals} # type: ignore | ||
except AttributeError as exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of making it use either the scoped variables, or the global context, we should be able to mix and match. I'd prefer having a register_globals
parameter: sql(query, catalog, register_globals=False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, makes sense. I will say that I think register_globals
we might want to default to True
. Because for example in an interactive or jupyter session one might get the table and then write some processing functionality within functions and it would be likely expected/desired that they are visible there. This might reduce friction for a big use case, and for library methods that need more control the option is there.
Thoughts on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes sorry for the confusion, Having register_globals
default to True
is what I intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think I might have misunderstood that as a signature. Do you mean by default enable supplying a catalog and adding caller-visible variables to it? register_globals=False
would mean don't activate this new code, essentially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to clarify, i meant sql(query, catalog, register_globals=False)
as what the user would invoke if they did not want to register globals, the signature should be
def sql(sql: str, catalog: Optional[SQLCatalog] = None, register_globals: bool = True) -> DataFrame:
and the usage should be
sql(query)
-> validsql(query, catalog)
-> validsql(query, catalog, register_globals=False)
-> validsql(query, register_globals=False)
-> type error
once again, sorry for the confusion.
register_globals=False would mean don't activate this new code, essentially?
yes that's correct. Which means this should always be invalid
daft.sql(query, register_globals=False)
Not required, but we could go the extra step and @overload
the method to prevent incorrect usage at the type level
I think this would give us the desired results
from typing import overload, Optional
@overload
def sql(sql: str) -> DataFrame: ...
@overload
def sql(sql: str, catalog: SQLCatalog, register_globals: bool = True) -> DataFrame: ...
@overload
def sql(sql: str, catalog: SQLCatalog = None, register_globals: bool = True) -> DataFrame: ...
def sql(sql: str, catalog: Optional[SQLCatalog] = None, register_globals: bool = True) -> DataFrame:
# Implementation here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the clarification - now I see it better. I had thought of it as if the user supplies a catalog they could populate it as needed and wouldn't need the scoped vars - but this way may be more flexible in case a catalog is passed around - though at this point a rule would have to be imposed on precedence (presumably the catalog is higher than the vars).
For this, I'll have to add some features to SQLCatalog
to add tables and get registered names. I'll look into that as soon as I can, but might be a second 😄
ec2c6eb
to
8b22e76
Compare
@universalmind303, I've made updates which I think capture the desired behavior. Since I don't think it is desired to have a side-effect of modifying the supplied catalog, this makes a new catalog copying from given, and also I've made catalog take precedence in name conflicts. Also, I'm not sure about the overloads, I think the two added capture the desired cases since the default values are not consumed in checking - but I wondered if this could generate false negatives in the case a None value variable is used for catalog. |
great work @amitschang. Let's see how this works out as is. We can adjust the precedence or typings later on if others run into issues. |
@universalmind303, super - and thanks for the feedback! |
This builds a catalog based on the python globals and locals visible to the caller at the point where the
sql
query function is called, in the case where a catalog is not supplied. Otherwise, the catalog is final and must contain necessary tables.resolves #2740