Request for Python Asyncio Support #61
Comments
Can you please assign this issue to me?
I refactored the code to use a thread pool instead of asyncio. Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task. With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%. For more details, you can check the full implementation here: Pull Request #64.
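For illustration, here is a minimal sketch of the thread-pool idea (this is not the actual code from Pull Request #64; the `client` object, the `query_model` helper, and the OpenAI-style call signature are assumptions):

```python
# A minimal sketch of the thread-pool approach (not the code from PR #64;
# `client`, `query_model`, and the OpenAI-style call are assumptions).
from concurrent.futures import ThreadPoolExecutor

def query_model(client, model, messages):
    # One blocking inference call for a single model.
    return client.chat.completions.create(model=model, messages=messages)

def query_models_in_parallel(client, models, messages, max_workers=8):
    # Submit one task per model; collect results in the original order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(query_model, client, m, messages) for m in models]
        return [f.result() for f in futures]
```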
Oh, I also wrote a short document (1 page) in which I explain why I THINK it is better to use threads rather than asyncio in this case: https://docs.google.com/document/d/17kESXXEUkA0gwc6qksFnZ2i5IjCP3Nk7-CH6sjnsgIE/edit?usp=sharing
Looks good and right, congrats. I would approve this PR, but only with more changes and detail on it.
Thanks! I also think it's the better approach. About the PR: yes, I also said it in the description; thanks for the feedback ;)
When there are 1,000 to 10,000 requests at the same time, which one will perform better: the thread architecture or the asyncio architecture?
Handling 10,000 simultaneous requests can indeed approach the scale of a DDoS for some infrastructures, depending on their capacity and setup. However, if your system can handle this volume without triggering any limits, asyncio would likely be the better choice in terms of efficiency and scalability for managing such high concurrency. My proposal for a thread-based solution was designed with smaller-scale scenarios in mind as an initial improvement. For example, if you are working with 30 models, this approach can process responses in approximately 3 seconds on average, instead of waiting for each model to return sequentially, which would take around 90 seconds. While an asynchronous client implementation has already been developed by someone else, providing a great solution for large-scale use cases, I opted for a threading approach to achieve significant performance gains with minimal effort and complexity. For smaller workloads, or as a stepping stone toward further optimization, threads strike a practical balance between simplicity and performance. If you look at the code, you can see that all the existing test cases still pass, because I added almost nothing to the code; I just threaded the processing.
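For comparison, here is a minimal sketch of how the asyncio side of this trade-off typically looks, assuming the `openai` package's `AsyncOpenAI` client; the model name and the cap of 100 in-flight requests are placeholders:

```python
# A minimal sketch of semaphore-bounded asyncio concurrency, assuming
# openai>=1.0's AsyncOpenAI; the model name and the limit are placeholders.
import asyncio
from openai import AsyncOpenAI

async def run_many(prompts, model="gpt-4o-mini", limit=100):
    client = AsyncOpenAI()
    sem = asyncio.Semaphore(limit)  # cap the number of in-flight requests

    async def one(prompt):
        async with sem:
            resp = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

    return await asyncio.gather(*(one(p) for p in prompts))

# Usage: asyncio.run(run_many(["hello"] * 1000))
```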
Take a look at this: I made 49 requests, and they all returned within 8 seconds. Here's the kicker: I'm in Brazil, where we don't have any OpenAI API servers nearby. Despite this, the solution scales efficiently within this range, handling 49 expensive requests simultaneously without any noticeable bottlenecks. Video: https://drive.google.com/file/d/17wbfVsZnvVPSKumtsj63qS7srTSYLL82/view?usp=sharing
Thank you for taking the time to address this issue and for providing changes in Pull Request #64, which proposes dispatching multiple requests in parallel using threads. The purpose of an asynchronous interface is to enable seamless integration with other asynchronously executing code, especially for I/O-bound operations. For example, in scenarios where multiple consumer requests hit a backend and each requires a call to a chat completion API, users often do not have a batch of requests to parallelize. Instead, they rely on the non-blocking nature of async operations to manage such tasks efficiently. This is a fundamental use case that the current solution in Pull Request #64 does NOT address. I would also like to respond to points made in your accompanying document:
This statement is incorrect. Network calls are inherently I/O-bound and benefit significantly from asyncio's non-blocking model. In contrast, the current synchronous implementation blocks the calling thread during I/O operations, causing the async program or service to stall and undermining its responsiveness.
While true, this complexity is why the library itself should handle the implementation of async interfaces.
This is generally accurate for many languages, but Python's Global Interpreter Lock (GIL) imposes significant limitations: the GIL prevents multiple threads from executing Python bytecode concurrently, which reduces the effectiveness of threads for CPU-bound tasks. In conclusion, while the use of threads in Pull Request #64 does speed up batches of requests, it does not provide the non-blocking async interface that this issue is asking for.
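To make the blocking concern concrete, here is a small self-contained sketch (not taken from either PR) in which a synchronous call inside a coroutine stalls every other task on the event loop; `blocking_completion` is a stand-in for a synchronous chat-completion call:

```python
# A minimal demonstration of the failure mode: a blocking call inside a
# coroutine freezes every other task scheduled on the same event loop.
import asyncio
import time

def blocking_completion(prompt: str) -> str:
    time.sleep(2)  # simulates a synchronous network round trip
    return f"answer to {prompt!r}"

async def heartbeat():
    for _ in range(6):
        print("tick", time.strftime("%X"))
        await asyncio.sleep(0.5)

async def bad_handler():
    # The event loop is stuck here for 2 s, so heartbeat() stops ticking.
    return blocking_completion("hello")

async def main():
    await asyncio.gather(heartbeat(), bad_handler())

asyncio.run(main())
```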
Now I understand your point, and I absolutely agree. It is indeed possible, and I’m willing to adapt the implementation accordingly. When I said:
My intention was to advocate for hosting and using locally saved models, which aligns with my area of expertise in the market: leveraging local computational power. However, I realize now that my wording may have caused some misunderstanding. I sincerely apologize for this and will take the opportunity to rewrite and clarify my thoughts in the morning (it's currently 4 a.m. here). A truly asynchronous design allows handling multiple consumer requests without blocking the main flow, while using threads might introduce bottlenecks or overhead in high-concurrency situations. But, as I said, I was envisioning an approach for the lower-range user to extract the most performance (and it is faster now). Regarding Pull Request #62, I believe it already addresses the intended purpose. That said, I plan to refine my PR further and make it more suitable for the broader needs of the library.
But this is just an inefficient use of resources. You end up using more threads, which will lead to cost inflation as the scale increases. I created #62 while short on time, and it can be improved to make the design more maintainable and usable with proper design patterns. Let me know your thoughts.
Yes, gentlemen, you are all correct, but there’s one key detail to consider. In my specific field of application, I felt the need to use threads because there was no existing method that allowed me to achieve my goals effectively. With this piece of code, I was able to implement it into an application with some modifications. Our objectives were different. While I was focused on using threads to parallelize model inferences, others here were exploring ways to make the process fully asynchronous. Both approaches are valid and excel in their respective use cases. As I mentioned earlier:
In conclusion, both the thread-based approach and the asynchronous solution have their merits, each catering to different purposes. My choice to use threads was driven by simplicity and a focus on smaller-scale scenarios, where performance gains could be achieved quickly and with minimal complexity. Conversely, the asynchronous implementation offers a robust solution for large-scale demands. Ultimately, the best approach depends on the context and the specific needs of each application. What matters most is recognizing that both strategies bring valuable contributions to solving distinct challenges effectively, as I said before:
I believe we have different issues. If you really believe adding threads to the library is an improvement, just create another issue. Let this issue (#61) be about async support.
While I haven't tested your PR #62, just from skimming through the code it looks like it should address this issue. I'll be sure to check your fork when I have time. Thanks for your contribution.
I made a PR with an async support implementation for several providers: OpenAI, Anthropic, Mistral, and Fireworks. Please check it here: #185
I would like to request support for Python’s asyncio in this library. This feature would be particularly beneficial for Python services, which often rely on asynchronous programming for efficient and scalable operations.
Some providers, such as OpenAI, already offer native async support (e.g., `from openai import AsyncOpenAI`), making it straightforward to wrap these APIs. Others, like AWS, have community-supported async wrappers, such as `aioboto3`. For providers without async support, an interim solution using a synchronous wrapper could be implemented while awaiting a proper asyncio implementation. Asyncio support would greatly enhance the usability of this library. Thank you for considering this enhancement.
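As a rough illustration of that interim idea, here is a minimal sketch of a synchronous provider wrapped behind an async interface; the `SyncProvider` and `AsyncShim` names are hypothetical and not part of this library:

```python
# A minimal sketch of the "interim synchronous wrapper" idea, assuming a
# hypothetical provider SDK that only exposes a blocking complete() method.
import asyncio

class SyncProvider:
    def complete(self, prompt: str) -> str:
        # Placeholder for a provider SDK call that blocks on the network.
        return f"echo: {prompt}"

class AsyncShim:
    """Exposes an async interface by running the blocking call in a worker thread."""

    def __init__(self, provider: SyncProvider):
        self._provider = provider

    async def complete(self, prompt: str) -> str:
        # Keeps the event loop free until the provider ships native async support.
        return await asyncio.to_thread(self._provider.complete, prompt)

# Usage: result = await AsyncShim(SyncProvider()).complete("hi")
```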