My 2024 with vLLM: A Tsinghua Scholar’s Journey

The seeds of my journey with vLLM were sown five years ago during a summer fellowship at UC Berkeley’s RISELab under the guidance of Professor Michael Jordan. It was there, amidst the hum of academic pursuit, that I encountered a newly enrolled PhD student involved in a project that would ultimately redirect the course of my life.

The advent of ChatGPT in late 2022 served as a catalyst for change in my research focus. At the time, I was deeply immersed in developing an optimization method to accelerate the fine-tuning efficiency of Conv-BN modules. My exploration led me to the `torch.fx` module and `torch.compile` within PyTorch, and through this, I connected with Jason Ansel of the PyTorch compiler team.

By the end of 2023, I had expressed my growing interest in machine learning systems research to Professor Jordan, who kindly referred me to Ion Stoica. This connection reignited my affiliation with UC Berkeley and marked the beginning of my chapter with vLLM.

In March 2024, I dove into the vLLM project, bringing with me lessons from PyTorch’s open-source governance practices. To get up to speed quickly, I subscribed to all of vLLM’s GitHub notifications and reviewed each new issue and pull request daily.

My first task, upgrading the PyTorch version, proved to be a formidable challenge. It took weeks of relentless debugging before I could rectify the unusual memory consumption issues.

We faced a series of trials as feedback rolled in about vLLM’s underwhelming performance on the H100 GPUs, a resource we lacked entirely. A seemingly benign contribution from a community member once sent our performance into a nosedive. However, it was the passion and support of our community, and the generous backing from NVIDIA, AWS, Google Cloud, and others that saw us through our resource constraints.

In our quest to support the LLaMA 3.1 405B model, we developed a multi-machine distributed inference capability. Our solution was so effective that eight out of ten official Meta partners chose vLLM.

As our community expanded, the need for restructuring became a topic of discussion. We introduced new features like a ZMQ-based API server and multi-step scheduling to vLLM. A major refactor is in the works, with a focus on performance optimization.

In my spare time, I explored the integration of `torch.compile`. It was during the addition of support for the Command-R model that I uncovered a flaw in `torch.compile`, prompting the PyTorch team to prioritize vLLM’s integration with it.
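To make the idea of "integrating `torch.compile`" concrete, here is a minimal, hypothetical sketch of compiling a small model component; it is not vLLM's actual integration code, and the function name is invented for illustration. The `backend="eager"` option is used so the example runs without a full compiler toolchain:

```python
import torch

def swiglu_like(x: torch.Tensor) -> torch.Tensor:
    # A simple elementwise activation that a compiler could fuse
    # into a single kernel (illustrative stand-in for a model layer).
    return x * torch.sigmoid(1.702 * x)

# torch.compile wraps the function; the first call triggers tracing
# and compilation, subsequent calls reuse the compiled artifact.
compiled_fn = torch.compile(swiglu_like, backend="eager")

x = torch.randn(8)
out = compiled_fn(x)

# The compiled function must match eager-mode results.
assert torch.allclose(out, swiglu_like(x))
```

Flaws like the one mentioned above typically surface exactly here: a model whose graph the compiler mishandles produces outputs that diverge from eager execution, which is why numerical parity checks matter during integration.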

Since its open-source debut in June 2023, vLLM has found extensive production deployment across various domains. I firmly believe that vLLM is poised to become the “Linux” of the intelligent era.

In an era where Moore’s Law is faltering, hardware affinity is crucial for algorithmic success. As algorithm researchers, we must learn to understand hardware and design algorithms that are sympathetic to existing hardware capabilities.

Looking ahead, the popularity of vLLM is undeniably tied to the rise of large models. Amidst the bubble, there is a sense that AI stands at the precipice of the next “Internet-scale miracle.”
