Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State_agg with time ranges in specific states #619

Open
davidkohn88 opened this issue Nov 16, 2022 · 1 comment
Open

State_agg with time ranges in specific states #619

davidkohn88 opened this issue Nov 16, 2022 · 1 comment
Labels
feature-request And area of analysis that could be made easier

Comments

@davidkohn88
Copy link
Contributor

Right now, when we do state_agg, we limit what we can calculate, namely, we only calculate the duration in a given state, while this is nice, for many types of, say, machine metrics, you want to know the actual periods that you were in a state, with all duplicate values of the state reporting removed and only the state change events captured. So you could then easily return all the periods where you were in a given state.

So, we'd want a couple new accessors: state_periods(state_agg, 'state') -> setof state, tstzrange and the related interpolated accessor.
I wonder also if it's useful to have one that can fetch all periods, so something like state_timeline(state_agg) -> setof state, tstzrange

one open question is whether we want to use tstzranges or if we want to use start and end times, I can't remember if/how we've made that choice in the past.

While there are some memory concerns with the current implementation of state_agg and specifically around using text or integers, I think we could support both, I think if we want to limit memory usage, we could just set a limit on the total number of states we can track, probably something like 100 or 1000 or the like, and error if we have more than that. This function is meant for tracking a small number of distinct states, not for tracking a whole lot of long tail data. It's also relatively easy to write a query that specifies the allowed states to pass in to something like this, so if you're running into problems with messy data you can always do something like CASE WHEN state IN ('list', 'of', 'allowed', 'states') THEN state ELSE 'UNKNOWN_STATE' or the like.

I wonder if we want to / can explicitly support ENUM types as well? use them in their int forms

Another question is whether we want to try to deal with the fact that if states are oscillating quickly you can end up with a lot of data in one of these aggregates. Perhaps we could add a flag for the minimum time in a state to track? And could keep the duration counters as well as the larger ranges and some utility functions for figuring out if you're getting lots of oscillations? Or maybe we just wait for it to become a problem and don't bother with it for now. I do wonder whether it's worth it to store the duration anyway to make the calculations faster?

@davidkohn88 davidkohn88 added the feature-request And area of analysis that could be made easier label Nov 16, 2022
@syvb syvb self-assigned this Nov 16, 2022
@syvb
Copy link
Member

syvb commented Nov 16, 2022

Supporting ENUM types would cause problems when an enum is the return value. For state_agg() /duration_in()/interpolated_duration_in() we would just need to alter them to take an ANYELEMENT (or we could create one function for TEXT inputs and one for ANYENUM inputs).

But into_values(), which currently has a type signature of into_values(agg StateAgg) RETURNS (TEXT, BIGINT), wouldn't as simple to add support to since you can't return an ANYELEMENT without an ANYELEMENT input. We'd have to use the ugly hack where you pass a NULL of the type you want, like into_values(agg, NULL::myenum).

@syvb syvb mentioned this issue Nov 28, 2022
bors bot added a commit that referenced this issue Dec 6, 2022
636: Add timeline_agg r=Smittyvb a=Smittyvb

Adds a `timeline_agg` aggregate that can be used in the same ways as `state_agg` but also allows getting the entire state timeline with `state_timeline`.

Note that this PR returns start times and end times instead of `tstzrange`s. pgx 0.6.0 will add support for `tstzrange`s so when that release comes out I'll update the relevant functions to return ranges.

This partially addresses #619 (adding `rollup` and supporting integer values will be separate PRs to make them easier to review).

Co-authored-by: Smitty <smitty@timescale.com>
Co-authored-by: Smittyvb <smitty@timescale.com>
@syvb syvb removed their assignment Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request And area of analysis that could be made easier
Projects
None yet
Development

No branches or pull requests

2 participants