
Request for Python Asyncio Support #61

Open
BobbyL2k opened this issue Nov 26, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@BobbyL2k

I would like to request support for Python’s asyncio in this library. This feature would be particularly beneficial for Python services, which often rely on asynchronous programming for efficient and scalable operations.

Some providers, such as OpenAI, already offer native async support (e.g., from openai import AsyncOpenAI), making it straightforward to wrap these APIs. Others, like AWS, have community-supported async wrappers, such as aioboto3. For providers without async support, an interim solution using a synchronous wrapper could be implemented while awaiting a proper asyncio implementation.
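As a rough sketch of that interim idea (illustrative only; `sync_chat_completion` is a hypothetical stand-in, not any provider's actual API), a blocking call can be wrapped with `asyncio.to_thread`:

```python
import asyncio
import time

def sync_chat_completion(prompt: str) -> str:
    # Stand-in for a provider's blocking chat-completion call.
    time.sleep(1)
    return f"response to: {prompt}"

async def async_chat_completion(prompt: str) -> str:
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a
    # worker thread, keeping the event loop responsive in the meantime.
    return await asyncio.to_thread(sync_chat_completion, prompt)

async def main() -> None:
    # The two wrapped calls overlap instead of running back to back.
    print(await asyncio.gather(
        async_chat_completion("hello"),
        async_chat_completion("world"),
    ))

asyncio.run(main())
```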

Asyncio support would greatly enhance the usability of this library. Thank you for considering this enhancement.

@sarthakforwet

sarthakforwet commented Nov 26, 2024

Can you please assign this issue to me?

@soulcarus

I refactored the code to use a thread pool instead of asyncio.

Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task.

With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%.
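The pattern is roughly the sketch below (not the PR code itself; `query_model` is a hypothetical stand-in for a blocking inference call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def query_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a blocking, network-bound inference call.
    time.sleep(1)
    return f"{model}: response to {prompt!r}"

def query_all(models: list[str], prompt: str) -> dict[str, str]:
    # One worker thread per model, so total wall time approaches the
    # slowest single call rather than the sum of all calls.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(query_model, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}

print(query_all(["model-a", "model-b", "model-c"], "hello"))
```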

For more details, you can check the full implementation here: Pull Request #64.

@soulcarus

soulcarus commented Nov 26, 2024

Oh, I also wrote a full document (1 page) in which I explain why I THINK it is better to use threads rather than asyncio in this case:

https://docs.google.com/document/d/17kESXXEUkA0gwc6qksFnZ2i5IjCP3Nk7-CH6sjnsgIE/edit?usp=sharing

@oraclesystem

Looks good and right, congrats.

I would approve this PR, but only with more changes and detail in it.

@soulcarus

Thanks! I also think it's the better approach.

About the PR: yes, I also said it in the description. If the contributors think it's OK, I'll make it more self-explanatory and finish the feature.

Thanks for the feedback ;)

@chiyiliao

When there are 1,000–10,000 requests at the same time, which one will perform better, the thread architecture or the asyncio architecture?

@soulcarus

Handling 10,000 simultaneous requests can indeed approach the scale of a DDoS for some infrastructures, depending on their capacity and setup. However, if your system can handle this volume without triggering any limits, asyncio would likely be the better choice in terms of efficiency and scalability for managing such high concurrency.

My proposal for a thread-based solution was designed with smaller-scale scenarios in mind as an initial improvement. For example, if you are working with 30 models, this approach can process responses in approximately 3 seconds on average instead of waiting for each model to return sequentially, which would take around 90 seconds.

While an asynchronous client implementation has already been developed by someone else—providing a great solution for large-scale use cases—I opted for a threading approach to achieve significant performance gains with minimal effort and complexity. For smaller workloads or as a stepping stone toward further optimization, threads strike a practical balance between simplicity and performance.
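For illustration (a sketch only, not code from any PR here), asyncio copes with that volume because coroutines are far cheaper than OS threads, and a semaphore can cap how many requests are actually in flight:

```python
import asyncio

async def fetch(i: int, limit: asyncio.Semaphore) -> str:
    # The semaphore caps in-flight requests, so 10,000 pending tasks
    # don't all hit the provider (or exhaust sockets) at once.
    async with limit:
        await asyncio.sleep(0.1)  # stand-in for an awaited API call
        return f"result {i}"

async def main() -> None:
    limit = asyncio.Semaphore(100)  # at most 100 concurrent requests
    results = await asyncio.gather(*(fetch(i, limit) for i in range(10_000)))
    print(len(results))

asyncio.run(main())
```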

If you look into the code, you can see that all the test cases still pass, because I added almost nothing to the code; I just threaded the processing.

@soulcarus

soulcarus commented Nov 26, 2024

Take a look at this: I made 49 requests, and they all returned within 8 seconds. Here's the kicker: I'm in Brazil, where we don't have any OpenAI API servers nearby. Despite this, the solution scales efficiently within this range, handling 49 expensive requests simultaneously without any noticeable bottlenecks.

CODE: [screenshot attached]

OUTPUT: [screenshot attached]

video: https://drive.google.com/file/d/17wbfVsZnvVPSKumtsj63qS7srTSYLL82/view?usp=sharing

@BobbyL2k
Author

BobbyL2k commented Nov 26, 2024

I refactored the code to use a thread pool instead of asyncio.

Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task.

With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%.

For more details, you can check the full implementation here: Pull Request #64.

Oh, I also wrote a full document (1 page) in which I explain why I THINK it is better to use threads rather than asyncio in this case:

https://docs.google.com/document/d/17kESXXEUkA0gwc6qksFnZ2i5IjCP3Nk7-CH6sjnsgIE/edit?usp=sharing

Thank you for taking the time to address this issue and for providing changes in Pull Request #64, which proposes dispatching multiple requests in parallel using ThreadPoolExecutor. While this approach offers a way to parallelize tasks, it doesn’t align with the needs of library users requiring an asynchronous interface.

The purpose of an asynchronous interface is to enable seamless integration with other asynchronously executing code, especially for I/O-bound operations. For example, in scenarios where multiple consumer requests hit a backend and each requires a call to a chat completion API, users often do not have a batch of requests to parallelize. Instead, they rely on the non-blocking nature of async operations to manage such tasks efficiently. This is a fundamental use case that the current solution in Pull Request #64 does NOT address.
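As a minimal sketch of that use case (`async_chat_completion` below is a hypothetical coroutine, not this library's actual API): each consumer request awaits its own API call, and the event loop interleaves them with no batch in sight:

```python
import asyncio

async def async_chat_completion(prompt: str) -> str:
    # Hypothetical async client call; a real one would await an HTTP request.
    await asyncio.sleep(1)
    return f"response to: {prompt}"

async def handle_request(user_prompt: str) -> str:
    # While this request awaits the API, the event loop is free to serve
    # other consumers; there is no batch of calls to hand to a thread pool.
    return await async_chat_completion(user_prompt)

async def main() -> None:
    # Requests arrive independently, yet overlap on a single thread.
    print(await asyncio.gather(handle_request("a"), handle_request("b")))

asyncio.run(main())
```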

I would also like to respond to points made in your accompanying document:

“However, for tasks that block the thread (like network calls to AI APIs), the performance gain with asyncio is limited because the asynchronous execution model is not as effective in these cases.”

This statement is incorrect. Network calls are inherently I/O-bound and benefit significantly from asyncio's non-blocking model. In contrast, the current synchronous implementation blocks the calling thread during I/O, which stalls an async program or service and undermines its responsiveness.

“Asyncio: Working with asyncio requires a deeper understanding of the event loop, asynchronous task creation, and exception handling. If not configured correctly, using asyncio can introduce additional complexity and hard-to-debug errors, especially in applications requiring true parallelism.”

While true, this complexity is why the library itself should handle the implementation of async interfaces.

“Threads can execute multiple tasks simultaneously.”

This is generally accurate for many languages, but Python's Global Interpreter Lock (GIL) imposes significant limitations: it prevents multiple threads from executing Python bytecode concurrently, which reduces the effectiveness of threads for CPU-bound tasks.
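A small self-contained demonstration of the distinction (timings are indicative and machine-dependent): with four threads, sleeping tasks overlap because sleeping, like waiting on a socket, releases the GIL, while pure-Python computation does not:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(_: int) -> int:
    # Pure-Python bytecode: the GIL lets only one thread run it at a time.
    return sum(i * i for i in range(5_000_000))

def io_bound(_: int) -> None:
    # Sleeping releases the GIL, just as waiting on a socket does.
    time.sleep(1)

for name, fn in [("cpu-bound", cpu_bound), ("io-bound", io_bound)]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(fn, range(4)))
    # Expect roughly 4x one call's time for cpu-bound, ~1x for io-bound.
    print(name, round(time.perf_counter() - start, 2), "s")
```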

In conclusion, while the use of ThreadPoolExecutor may improve performance in certain contexts, it is not an appropriate solution for this issue. An asynchronous implementation is required to serve the needs of library users writing asyncio-based Python.

@soulcarus

Now I understand your point, and I absolutely agree. It is indeed possible, and I’m willing to adapt the implementation accordingly.

When I said:

“However, for tasks that block the thread (like network calls to AI APIs), the performance gain with asyncio is limited because the asynchronous execution model is not as effective in these cases.”

My intention was to advocate for hosting and using locally saved models, which aligns with my area of expertise in the market—leveraging local computational power. However, I realize now that my wording may have caused some misunderstanding. I sincerely apologize for this and will take the opportunity to rewrite and clarify my thoughts in the morning (it's currently 4 a.m. here).

A truly asynchronous design allows handling multiple consumer requests without blocking the main flow, while using threads might introduce bottlenecks or overhead in high-concurrency situations. But, as I said, I was envisioning a solution for the smaller-scale user to extract the most performance (and it is faster now).

Regarding Pull Request #62, I believe it already addresses the intended purpose. That said, I plan to refine my PR further and make it more suitable for the broader needs of the library.

@samarism

samarism commented Nov 26, 2024

I refactored the code to use a thread pool instead of asyncio.

Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task.

With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%.

For more details, you can check the full implementation here: Pull Request #64.

But this is just inefficient usage of resources. You end up using more threads, which will lead to cost inflation as the scale increases. I created #62 while short on time; it can be improved upon to make the design more maintainable and more usable through proper design patterns. Let me know your thoughts.

@soulcarus

Yes, gentlemen, you are all correct, but there’s one key detail to consider.

In my specific field of application, I felt the need to use threads because there was no existing method that allowed me to achieve my goals effectively. With this piece of code, I was able to implement it into an application with some modifications.

Our objectives were different. While I was focused on using threads to parallelize model inferences, others here were exploring ways to make the process fully asynchronous.

Both approaches are valid and excel in their respective use cases. As I mentioned earlier:

"Using threads might introduce bottlenecks or overhead in high-concurrency scenarios."

"My thread-based solution was designed with smaller-scale scenarios in mind as an initial improvement."

"While someone else has developed an asynchronous client implementation—an excellent solution for large-scale use cases—I chose a threading approach to achieve significant performance gains with minimal effort and complexity."

In conclusion, both the thread-based approach and the asynchronous solution have their merits, each catering to different purposes. My choice to use threads was driven by simplicity and a focus on smaller-scale scenarios, where performance gains could be achieved quickly and with minimal complexity. Conversely, the asynchronous implementation offers a robust solution for large-scale demands.

Ultimately, the best approach depends on the context and the specific needs of each application. What matters most is recognizing that both strategies bring valuable contributions to solving distinct challenges effectively. As I said before:

"My intention was to advocate for hosting and using locally saved models, which aligns with my area of expertise in the market—leveraging local computational power. However, I realize now that my wording may have caused some misunderstanding. I sincerely apologize for this and will take the opportunity to rewrite and clarify my thoughts in the morning"

@BobbyL2k
Author

Yes, gentlemen, you are all correct, but there’s one key detail to consider.

In my specific field of application, I felt the need to use threads because there was no existing method that allowed me to achieve my goals effectively. With this piece of code, I was able to implement it into an application with some modifications.

Our objectives were different. While I was focused on using threads to parallelize model inferences, others here were exploring ways to make the process fully asynchronous.

Both approaches are valid and excel in their respective use cases. As I mentioned earlier:

"Using threads might introduce bottlenecks or overhead in high-concurrency scenarios."

"My thread-based solution was designed with smaller-scale scenarios in mind as an initial improvement."

"While someone else has developed an asynchronous client implementation—an excellent solution for large-scale use cases—I chose a threading approach to achieve significant performance gains with minimal effort and complexity."

In conclusion, both the thread-based approach and the asynchronous solution have their merits, each catering to different purposes. My choice to use threads was driven by simplicity and a focus on smaller-scale scenarios, where performance gains could be achieved quickly and with minimal complexity. Conversely, the asynchronous implementation offers a robust solution for large-scale demands.

Ultimately, the best approach depends on the context and the specific needs of each application. What matters most is recognizing that both strategies bring valuable contributions to solving distinct challenges effectively. As I said before:

"My intention was to advocate for hosting and using locally saved models, which aligns with my area of expertise in the market—leveraging local computational power. However, I realize now that my wording may have caused some misunderstanding. I sincerely apologize for this and will take the opportunity to rewrite and clarify my thoughts in the morning"

I believe we have different issues. If you really believe adding threads to the library is an improvement, just create another issue; let this issue (#61) be about async support.

I refactored the code to use a thread pool instead of asyncio.
Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task.
With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%.
For more details, you can check the full implementation here: Pull Request #64.

But this is just inefficient usage of resources. You end up using more threads, which will lead to cost inflation as the scale increases. I created #62 while short on time; it can be improved upon to make the design more maintainable and more usable through proper design patterns. Let me know your thoughts.

While I haven't tested your PR #62, just from skimming through the code it looks like it should address this issue. I'll be sure to check your fork when I have time. Thanks for your contribution.

@kapulkin

kapulkin commented Feb 6, 2025

I made a PR with async support implementation for several providers: OpenAI, Anthropic, Mistral, Fireworks.

Check it here, please: #185
