Approximate seeking mode #427

Closed
scotts opened this issue Dec 9, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

Contributor

scotts commented Dec 9, 2024

🚀 The feature

TorchCodec's public VideoDecoder should have an approximate seek mode. Users should be able to specify they want the mode when they instantiate the decoder.

Motivation, pitch

The primary motivation is performance. Currently, TorchCodec always performs exact seeks. We accomplish exact seeks by first scanning the entire video file and building our own frame table internally. This means we're not susceptible to bad header metadata, but it adds an upfront linear cost to all decoding. That hurts performance both for large files and when the decoding pattern is sequential from the start.
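To make the trade-off concrete, here is a minimal sketch in plain Python. It is not TorchCodec code: `HeaderMetadata`, `build_frame_table`, `exact_index_for_time` and `approximate_index_for_time` are hypothetical names, and the approximate path assumes a constant frame rate and trustworthy header values.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class HeaderMetadata:
    # Hypothetical container-header values; in real files they may be wrong.
    num_frames: int
    average_fps: float


def build_frame_table(frame_pts_seconds):
    """Exact mode: visit every frame once (linear cost) and keep sorted pts."""
    return sorted(frame_pts_seconds)


def exact_index_for_time(frame_table, seconds):
    """Index of the frame displayed at `seconds`, found by binary search."""
    return max(bisect_right(frame_table, seconds) - 1, 0)


def approximate_index_for_time(metadata, seconds):
    """Approximate mode: no scan; trust the header and assume constant fps."""
    index = int(seconds * metadata.average_fps)
    return min(max(index, 0), metadata.num_frames - 1)


# A variable-frame-rate stream: the two modes can disagree.
table = build_frame_table([0.0, 0.1, 0.5, 0.6])
header = HeaderMetadata(num_frames=4, average_fps=5.0)
exact = exact_index_for_time(table, 0.45)
approx = approximate_index_for_time(header, 0.45)
```

In this toy stream the exact lookup lands on the frame that is actually on screen, while the approximate lookup lands one frame later. That is the accuracy being traded for skipping the scan.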

This is a high priority feature, as it should help to address some current performance issues users are seeing.

@scotts scotts added the enhancement New feature or request label Dec 9, 2024

tchaton commented Dec 11, 2024

This would be great. Quite eager to try it in Litdata: https://github.com/Lightning-AI/litdata

Contributor Author

scotts commented Dec 13, 2024

Sharing some thoughts on how we should go about implementing this, based on some code diving. First, I think we want to implement this mode in the C++ layer, not just in Python. I think it's too fundamental a concept to try to make it work outside the C++ VideoDecoder class.

In the VideoDecoder class, we have already marked all of the places where there will be behavioral differences between exact and approximate seeking behavior: it's every place we have a validateScannedAllStreams() call. For example, this one in getFrameAtIndexInternal(). This call appears in all of the following member functions:

  • getFrameAtIndexInternal(): Handling approximate mode seems easy here; we just need to do some math.
  • getFramesAtIndices(): Ditto, we just need to do some math.
  • getFramesPlayedByTimestamps(): We may need an algorithm change to support approximate seeking. The current implementation maps the pts values to indices and calls getFramesAtIndices(). I'm not sure if that will be valid in approximate mode.
  • getFramesInRange(): We should be able to just do some math.
  • getFramesPlayedByTimestampInRange(): I think we'll need an algorithm change as, again, we turn the pts values into indices by doing binary searches in the frame indices to find the lower and upper bounds of the range.
  • getPtsSecondsForFrame(): We could just do some math, but I'm not sure if it would make sense. This member function exists only for testing, to make sure that we don't lose precision when we go from seconds as doubles to pts values in ints and back. This is not part of our public API, so it's okay if we force this to always be in exact mode.
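
As a rough illustration of what "some math" could look like, the sketch below converts between frame indices and timestamps using only header-style metadata. It is an assumption-laden sketch, not the approach taken in the C++ VideoDecoder: the function names and parameters are hypothetical, and it assumes a constant frame rate, with frame `i` displayed during `[i / fps, (i + 1) / fps)`.

```python
import math


def index_to_seconds(index, average_fps, begin_stream_seconds=0.0):
    """Estimated presentation time of frame `index` (hypothetical helper)."""
    return begin_stream_seconds + index / average_fps


def seconds_range_to_indices(start_seconds, stop_seconds, average_fps,
                             num_frames, begin_stream_seconds=0.0):
    """Estimated indices of the frames played in [start_seconds, stop_seconds).

    Replaces the binary searches over a scanned frame table with arithmetic.
    """
    if start_seconds > stop_seconds:
        raise ValueError("start_seconds must be <= stop_seconds")
    if start_seconds == stop_seconds:
        return range(0)  # an empty half-open interval plays no frames
    first = math.floor((start_seconds - begin_stream_seconds) * average_fps)
    stop = math.ceil((stop_seconds - begin_stream_seconds) * average_fps)
    # Clamp to what the header claims exists.
    first = min(max(first, 0), num_frames)
    stop = min(max(stop, first), num_frames)
    return range(first, stop)


frames = seconds_range_to_indices(1.1, 2.1, average_fps=4.0, num_frames=40)
when = index_to_seconds(6, average_fps=4.0)
```

Floating-point rounding at frame boundaries and inaccurate header values can both shift these results by a frame, which is one reason the timestamp-based functions may need more care than the index-based ones.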

Contributor Author

scotts commented Dec 20, 2024

Work in progress PR #440. It takes the approach mentioned above, and also exposes it at the Python layer. It still has some bugs, but I think this proves out the general approach.

Contributor Author

scotts commented Dec 21, 2024

Update: PR #440 passes all tests and is showing the expected performance. See current benchmark numbers on the PR.

Contributor Author

scotts commented Jan 22, 2025

Implemented in #440.
