Optimizing with TensorRT-LLM
One of the key challenges with larger models is managing latency and cost. To address this, we have integrated NVIDIA's TensorRT-LLM into our workflow to optimize our SLM inference performance.
One of the products where we leverage TensorRT-LLM is 'Deep search'. Deep search uses SLMs at runtime to provide the best possible web results to Bing users.
This experience involves several steps, including understanding the user's query intent and ensuring the relevance and quality of web results. Because the SLM must execute multiple steps, each of which takes time, it is crucial to deliver value to users as quickly as possible. However, our product is built on the foundation of providing the best results, and we will not compromise quality for speed. This is where TensorRT-LLM comes into play, reducing model inference time and, consequently, the end-to-end latency of the experience without sacrificing result quality.
TensorRT-LLM is a powerful optimization toolkit that helps us reduce the latency and cost of hosting and running large models on NVIDIA A100 GPUs.
Before optimization, our original Transformer model had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second per instance. Each batch consists of 20 queries. After integrating TensorRT-LLM, we achieved a 95th percentile latency reduction to 3.03 seconds per batch and increased throughput to 6.6 queries per second per instance. This optimization not only enhances the user experience by delivering quicker search results but also reduces the operational costs of running these large models by 57%.
TensorRT-LLM improves model performance (latency in seconds per batch):

| | vLLM (v0.2.1) FP16 (P50/P95) | TensorRT-LLM (v0.9.0) INT8 SmoothQuant (P50/P95) |
| --- | --- | --- |
| Labeling | 2.96/4.76 | 1.99/3.03 |
The SmoothQuant technique was introduced in https://arxiv.org/abs/2211.10438. It is a method for running inference with INT8 for both activations and weights while maintaining the accuracy of the network on downstream tasks.
As explained in the paper, the model's weights must be preprocessed before they can be quantized this way, and TensorRT-LLM includes scripts that prepare the model to run with the SmoothQuant method. A sketch of the core idea follows.
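To make the preprocessing step concrete, here is a minimal NumPy sketch of SmoothQuant's per-channel scale migration as described in the paper. The function names and the simple symmetric quantizer are illustrative only; they are not TensorRT-LLM's actual scripts or APIs.

```python
import numpy as np

def smoothquant_prepare(x_calib: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Migrate activation outliers into the weights (SmoothQuant, arXiv:2211.10438).

    x_calib: calibration activations for a linear layer, shape (num_tokens, in_features)
    w:       the layer's weight matrix, shape (in_features, out_features)
    alpha:   migration strength; 0.5 balances activation and weight ranges
    """
    # Per-input-channel absolute maxima of activations and weights.
    act_max = np.abs(x_calib).max(axis=0)   # shape: (in_features,)
    w_max = np.abs(w).max(axis=1)           # shape: (in_features,)

    # Smoothing scale per channel: s_j = act_max_j**alpha / w_max_j**(1 - alpha).
    s = (act_max ** alpha) / np.clip(w_max ** (1.0 - alpha), 1e-5, None)
    s = np.clip(s, 1e-5, None)

    # Fold the scale out of the activations and into the weights:
    # (X / s) @ (diag(s) @ W) == X @ W, so the layer's output is unchanged,
    # but both tensors now have a flatter dynamic range and quantize well to INT8.
    w_smoothed = w * s[:, None]
    return s, w_smoothed

def quantize_int8(t: np.ndarray):
    """Simplified symmetric per-tensor INT8 quantization."""
    scale = max(np.abs(t).max() / 127.0, 1e-8)
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale
```

At inference time the activations are divided by `s` (a scaling that can be folded into the preceding layer), the smoothed weights are quantized offline, and the matrix multiplications run in INT8.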
Benefits for Users
The transition to SLM models and the integration of TensorRT-LLM bring several benefits to our users:
- Faster Search Results: With optimized inference, users can enjoy quicker response times, making their search experience more seamless and efficient.
- Improved Accuracy: The enhanced capabilities of SLM models allow us to deliver more accurate and contextualized search results, helping users find the information they need more effectively.
- Cost Efficiency: By reducing the cost of hosting and running large models, we can continue to invest in further innovations and improvements, ensuring that Bing remains at the forefront of search technology.
As we continue to innovate and refine our search technology, we remain committed to providing the best possible experience for our users. The transition to LLM and SLM models and the integration of TensorRT-LLM are just the beginning. We are excited about the future possibilities and look forward to sharing more advancements with you.
Stay tuned for more updates as we continue to push the boundaries of what's possible with search technology.