Phi3.5V Server API Error: Forward step expected a PagedAttention input metadata. #756

Closed
ytnvj2 opened this issue Sep 6, 2024 · 6 comments
Labels
bug Something isn't working regression Incorrect behavior or performance reduction introduced


ytnvj2 commented Sep 6, 2024

Hi,
I am running the Phi-3.5 vision model with the command below on an Apple M2 MacBook:
'cargo run --release --features metal -- --port 1234 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v'

Everything loads fine, but when I query I get this error:

mistralrs_core::engine: prompt step - Model failed with error: Msg("Forward step expected a PagedAttention input metadata. This was not provided, please ensure that the scheduler config is correctly configured for PagedAttention.")

On the other hand, if I load the model using the Python API, everything works fine, but I am not sure how to enable ISQ in Python.
```python
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)
```
Any idea what might be causing this and how to fix this?

@ytnvj2 ytnvj2 added the bug Something isn't working label Sep 6, 2024
@JCRPaquin

Same thing happens with Phi-3.5-MoE.

For Phi-3.5-MoE: It looks like the default scheduler path can reach the NormalPipeline, which requires(?) paged attention metadata. Likely something similar is happening for the vision model.
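To make the suspected mismatch concrete, here is a minimal sketch of a scheduler that supplies no PagedAttention metadata feeding a pipeline that insists on it. Everything in it (`PagedAttentionMeta`, `SchedulerKind`, `SketchPipeline`, `metadata_for`) is a hypothetical stand-in for illustration, not the actual mistral.rs types, and the real code path may well differ.

```rust
/// Hypothetical stand-in for the metadata a PagedAttention scheduler would supply.
#[derive(Debug, Clone, PartialEq)]
pub struct PagedAttentionMeta {
    pub block_size: usize,
}

/// Hypothetical scheduler kinds: only `PagedAttention` produces metadata.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SchedulerKind {
    Default,
    PagedAttention,
}

/// Hypothetical pipeline. `requires_paged_meta` models a pipeline that was
/// configured for PagedAttention and cannot run without the metadata.
pub struct SketchPipeline {
    pub requires_paged_meta: bool,
}

impl SketchPipeline {
    /// Returns which attention path was taken, or an error when the pipeline
    /// expects metadata that the scheduler never provided.
    pub fn forward(&self, meta: Option<&PagedAttentionMeta>) -> Result<&'static str, String> {
        match (self.requires_paged_meta, meta) {
            (true, None) => {
                Err("Forward step expected a PagedAttention input metadata.".to_string())
            }
            (true, Some(m)) if m.block_size == 0 => Err("block_size must be non-zero".to_string()),
            (true, Some(_)) => Ok("paged"),
            // A tolerant pipeline ignores the metadata and uses the regular path.
            (false, _) => Ok("eager"),
        }
    }
}

/// The default scheduler yields no metadata; the paged one does.
/// The block size of 32 is an arbitrary illustrative value.
pub fn metadata_for(kind: SchedulerKind) -> Option<PagedAttentionMeta> {
    match kind {
        SchedulerKind::Default => None,
        SchedulerKind::PagedAttention => Some(PagedAttentionMeta { block_size: 32 }),
    }
}
```

In this sketch, pairing `SchedulerKind::Default` with a pipeline that has `requires_paged_meta: true` reproduces the shape of the reported error, while a pipeline with `requires_paged_meta: false` runs either way.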


JCRPaquin commented Sep 6, 2024

Ah, so the quantized pipelines can handle the lack of paged attention metadata. I'm not clear on why the Python API is routing through a different part of the code, though.

@ytnvj2 can you retry with a GGUF version of the model via the CLI? Using a GGUF might wholly disable vision. The GGUF loading panics in an unrelated area.


JCRPaquin commented Sep 6, 2024

Potential confounder: I might be misreading things, but it looks like Paged Attention is disabled for non-CUDA targets (including Metal).

#[cfg(all(feature = "cuda", target_family = "unix"))]
pub const fn paged_attn_supported() -> bool {
    true
}
#[cfg(not(all(feature = "cuda", target_family = "unix")))]
pub const fn paged_attn_supported() -> bool {
    false
}
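If that reading is right, one way to keep the scheduler and the pipeline in agreement is to derive both from the same support check. The sketch below repeats the `cfg`-gated function quoted above and adds a hypothetical `SchedulerChoice` enum and `choose_scheduler` helper; neither exists under those names in mistral.rs as far as I know, and this is not necessarily how the actual fix works.

```rust
/// Same shape as the quoted snippet: true only on CUDA + Unix builds.
#[cfg(all(feature = "cuda", target_family = "unix"))]
pub const fn paged_attn_supported() -> bool {
    true
}
#[cfg(not(all(feature = "cuda", target_family = "unix")))]
pub const fn paged_attn_supported() -> bool {
    false
}

/// Hypothetical scheduler choice, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchedulerChoice {
    Default,
    PagedAttention,
}

/// Hypothetical helper: select PagedAttention only when it is both requested
/// and supported on this build; otherwise fall back to the default scheduler.
pub fn choose_scheduler(requested_paged: bool, supported: bool) -> SchedulerChoice {
    if requested_paged && supported {
        SchedulerChoice::PagedAttention
    } else {
        SchedulerChoice::Default
    }
}

/// Hypothetical helper: the pipeline should expect metadata only when the
/// chosen scheduler will actually provide it.
pub fn pipeline_expects_paged_meta(choice: SchedulerChoice) -> bool {
    choice == SchedulerChoice::PagedAttention
}
```

Calling `choose_scheduler(true, paged_attn_supported())` on a Metal build would then yield `SchedulerChoice::Default`, and a pipeline configured through `pipeline_expects_paged_meta` would not expect the metadata either.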

@EricLBuehler EricLBuehler added the regression Incorrect behavior or performance reduction introduced label Sep 7, 2024
@EricLBuehler
Owner

Hi @JCRPaquin @ytnvj2, thank you for the details and all the investigation! Indeed, you are correct: the issue arises when there is a discrepancy there. This bug was a regression from #753, and I just merged #759, which should fix it. Can you please confirm that this works?

@JCRPaquin

@EricLBuehler thanks for the quick response! I'll try the fix in a few hours.


ytnvj2 commented Sep 7, 2024

@EricLBuehler I tried out the fix and it works for me now. Thank you for the quick fix, I appreciate it.

@ytnvj2 ytnvj2 closed this as completed Sep 7, 2024