RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing

by: Isaac Ong*, Amjad Almahairi*, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica, Jul 01, 2024

LLMs have demonstrated remarkable capabilities across a range of tasks, but their costs and capabilities vary widely, as seen from the plot of performance against cost in Figure 1. Very broadly, more capable models tend to be more expensive than less capable models. This leads to a dilemma when deploying LLMs in the real world: routing all queries to the largest, most capable model yields the highest-quality responses but can be expensive, while routing queries to smaller models saves costs but may result in lower-quality responses.

Figure 1: Plot of performance against cost of various LLMs. Performance is measured by Elo on Chatbot Arena, and cost is per million tokens, assuming a 1:1 input/output ratio. By routing between two models, we ideally achieve a better performance-to-cost ratio than either model achieves alone.

LLM routing offers a solution to this, where each query is first processed by a system that decides which LLM to route it to. Ideally, all queries that can be handled by weaker models should be routed to these models, with all other queries routed to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities when routing.

To tackle this, we present RouteLLM, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrated that they can significantly reduce costs without compromising quality, with cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K as compared to using only GPT-4, while still achieving 95% of GPT-4’s performance. We also publicly release all our code and datasets, including a new open-source framework for serving and evaluating LLM routers.

Routing Setup

In our routing setup, we focus on the case where there are two models: a stronger, more expensive model, and a weaker but cheaper model. Given this setup, our objective is to minimize costs while achieving high quality by routing between both models.

Figure 2: Random router performance on MT Bench

This is best understood through Figure 2, which shows the performance of a router that randomly routes between the two models on MT Bench. Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of calls made to GPT-4 (which is representative of the cost incurred, since the cost of a Mixtral call is negligible).
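As a rough illustration of why the random router serves as a baseline, the sketch below simulates one. It assumes each model can be summarized by a single average score, so the expected score of a random router is a straight line between the weak and strong scores as the share of strong-model calls grows. The scores 9.0 and 8.0 and all function names are hypothetical placeholders, not MT Bench results or part of the RouteLLM code.

```python
import random


def expected_random_score(strong_fraction: float, strong_score: float, weak_score: float) -> float:
    """Expected average score when a fraction of queries goes to the strong model."""
    return weak_score + strong_fraction * (strong_score - weak_score)


def simulate_random_router(
    strong_fraction: float,
    strong_score: float,
    weak_score: float,
    n_queries: int = 10_000,
    seed: int = 0,
) -> tuple[float, float]:
    """Route each query at random; return (average score, observed strong-call share)."""
    rng = random.Random(seed)
    total = 0.0
    strong_calls = 0
    for _ in range(n_queries):
        if rng.random() < strong_fraction:
            total += strong_score  # query answered by the strong model
            strong_calls += 1
        else:
            total += weak_score  # query answered by the weak model
    return total / n_queries, strong_calls / n_queries


if __name__ == "__main__":
    # Hypothetical average scores, chosen only to make the example concrete.
    avg, share = simulate_random_router(0.5, strong_score=9.0, weak_score=8.0)
    print(f"simulated score {avg:.3f} at {share:.1%} strong calls")
    print(f"expected score  {expected_random_score(0.5, 9.0, 8.0):.3f}")
```

A learned router is useful to the extent that its curve sits above this straight line, reaching a given score with fewer strong-model calls.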

We use preference data for training our routers, building upon previous works (1,2). Each data point consists of a prompt and a comparison between the response quality of two models on that prompt: a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize public data from Chatbot Arena. We also investigate data augmentation techniques to further improve performance using both golden-label datasets and an LLM judge.
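To make the shape of a preference data point concrete, here is a minimal sketch of one possible representation. The class and field names are assumptions made for illustration and are not the schema of the Chatbot Arena dataset or of the RouteLLM code.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Result of comparing two responses to the same prompt."""
    MODEL_A_WINS = "model_a"
    MODEL_B_WINS = "model_b"
    TIE = "tie"


@dataclass(frozen=True)
class PreferenceRecord:
    """One comparison: a prompt, the two models compared, and which response was preferred."""
    prompt: str
    model_a: str
    model_b: str
    outcome: Outcome


def win_rate(records: list[PreferenceRecord], model: str) -> float:
    """Share of comparisons involving `model` that it wins, counting a tie as half a win."""
    score = 0.0
    seen = 0
    for record in records:
        if model not in (record.model_a, record.model_b):
            continue  # this comparison does not involve the model
        seen += 1
        if record.outcome is Outcome.TIE:
            score += 0.5
        elif record.outcome is Outcome.MODEL_A_WINS and record.model_a == model:
            score += 1.0
        elif record.outcome is Outcome.MODEL_B_WINS and record.model_b == model:
            score += 1.0
    if seen == 0:
        raise ValueError(f"no comparisons found for {model!r}")
    return score / seen
```

An aggregate like `win_rate` ignores the prompt entirely; a router has to go further and learn how the outcome depends on the prompt.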

We trained four routers using a mix of Chatbot Arena data and data augmentation:

A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity
A matrix factorization model that learns a scoring function for how well a model can answer a prompt
A BERT classifier that predicts which model can provide a better response
A causal LLM classifier that also predicts which model can provide a better response
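The routers listed above each produce some estimate of which model will answer better. One common way to turn such an estimate into a routing decision is to compare it against a threshold; the post does not spell out this step, so the sketch below is an assumption about how it might look rather than the RouteLLM implementation. `toy_predictor` is a deliberately naive, hypothetical stand-in for a trained router.

```python
from typing import Callable, Iterable


def route(prompt: str, predict_strong_win: Callable[[str], float], threshold: float) -> str:
    """Send the prompt to the strong model when its predicted win probability clears the threshold."""
    return "strong" if predict_strong_win(prompt) >= threshold else "weak"


def strong_call_fraction(
    prompts: Iterable[str],
    predict_strong_win: Callable[[str], float],
    threshold: float,
) -> float:
    """Share of prompts that would be routed to the strong model at this threshold."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    strong = sum(route(p, predict_strong_win, threshold) == "strong" for p in prompt_list)
    return strong / len(prompt_list)


def toy_predictor(prompt: str) -> float:
    """Hypothetical placeholder: treats longer prompts as harder. Not a real router."""
    return min(1.0, len(prompt.split()) / 20)


if __name__ == "__main__":
    prompts = ["hi", "explain " + "step " * 11]
    for threshold in (0.0, 0.5, 1.0):
        share = strong_call_fraction(prompts, toy_predictor, threshold)
        print(f"threshold {threshold:.1f}: {share:.0%} of calls go to the strong model")
```

Raising the threshold sends fewer queries to the strong model, so sweeping it traces out a curve of quality against the share of strong-model calls.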

Results

We evaluated these routers on three popular benchmarks: MT Bench, MMLU, and GSM8K, presenting results for MT Bench and MMLU below. For evaluation, we route between GPT-4 Turbo as our strong model and Mixtral 8x7B as our weak model. We use the random router from before as our baseline.

Figure 3: Router performance on MT Bench for routers (left) trained only on Arena data and (right) trained on Arena data augmented using an LLM judge.

Figure 3 displays the performance of our routers on MT Bench. For routers trained only on the Arena dataset, we observe strong performance for both matrix factorization and SW ranking. Notably, matrix factorization achieves 95% of GPT-4 performance while sending only 26% of calls to GPT-4, which is approximately 48% cheaper than the random baseline.
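The "cheaper than the random baseline" figures can be read as a ratio of GPT-4 call shares, given that the cost of a Mixtral call is treated as negligible. The sketch below shows that arithmetic. The 50% baseline share is an assumption back-computed from the 26% and 48% figures in the post; the post itself does not state the baseline's call share.

```python
def relative_savings(router_fraction: float, baseline_fraction: float) -> float:
    """Cost reduction of a router versus a baseline, counting only strong-model calls as cost."""
    if not 0.0 < baseline_fraction <= 1.0:
        raise ValueError("baseline_fraction must be in (0, 1]")
    return 1.0 - router_fraction / baseline_fraction


def implied_baseline(router_fraction: float, savings: float) -> float:
    """Baseline strong-call share implied by a router's share and its reported savings."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return router_fraction / (1.0 - savings)


if __name__ == "__main__":
    # 0.50 is an assumed baseline share, not a number taken from the post.
    print(f"savings: {relative_savings(0.26, 0.50):.0%}")
    print(f"implied baseline share: {implied_baseline(0.26, 0.48):.0%}")
```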

Augmenting the Arena data using an LLM judge leads to significant improvements across all routers. When trained on this augmented dataset, matrix factorization is again the best-performing router: the share of GPT-4 calls required to achieve 95% of GPT-4 performance is roughly halved again, to 14% of total calls, which is 75% cheaper than the random baseline.

Figure 4: Router performance on MMLU for routers (left) trained only on Arena data and (right) trained on Arena data augmented using golden-label data from the MMLU validation split.

Conversely, on MMLU in Figure 4, all routers perform poorly, at a near-random level, when trained only on the Arena dataset, which we attribute to most MMLU questions being out-of-distribution. However, augmenting the training dataset using golden-label data from the MMLU validation split leads to significant performance improvements across all routers, with our best-performing causal LLM router now requiring only 54% of calls to go to GPT-4 to achieve 95% of GPT-4 performance, 14% cheaper than the random baseline. Importantly, this augmented dataset of approximately 1,500 samples represents less than 2% of the overall training data, demonstrating the effectiveness of data augmentation even when the number of samples is small.

RouteLLM vs Commercial Offerings

Figure 6: Comparison of our router against existing routing systems on MT Bench (left) using gpt-4-turbo-2024-04-09 and llama-2-70b-chat and (right) using gpt-4-turbo-2024-04-09 and mixtral-8x7b-instruct-v0.1.

In Figure 6, we also compare our best-performing routers on MT Bench against Martian and Unify AI, two commercial LLM routing products. We use the latest GPT-4 Turbo as the strong model and either Llama 2 70B or Mixtral 8x7B as the weak model, based on the methodology detailed here. Our routers demonstrate very strong results, achieving the same performance as these commercial routers while being over 40% cheaper.

Generalizing to Other Models

The evaluations above route between GPT-4 and Mixtral. To demonstrate the generalizability of our framework, we also present MT Bench results when routing between a different model pair: Claude 3 Opus and Llama 3 8B. Importantly, we use the same routers without any retraining, and responses from Claude 3 Opus and Llama 3 8B are not present in our training data.

Figure 7: Router performance on MT Bench when routing between Claude 3 Opus and Llama 3 8B.

Even when the model pair is replaced, we observe strong results across all routers on MT Bench in Figure 7, with performance comparable to our original model pair. This suggests that our routers have learned some common characteristics of problems that can distinguish between strong and weak models, which generalize to new model pairs without additional training.

Conclusion

These results demonstrate the ability of our routers to achieve significant cost savings while maintaining high-quality responses. They also highlight the effectiveness of data augmentation in improving routing performance using only a small amount of data, offering a scalable path towards improving routing performance for real-world use cases.

Based on this research, we have created an open-source framework for serving and evaluating routers on GitHub. We are also releasing all our routers and datasets on HuggingFace for public use.

We are excited to see what you build on top of this! Please let us know if you face any issues or have any suggestions. For the full details, please refer to our arXiv paper.

Acknowledgements

We are grateful to Tyler Griggs for his valuable feedback on this post.

Citations

@misc{ong2024routellmlearningroutellms,
      title={RouteLLM: Learning to Route LLMs with Preference Data},
      author={Isaac Ong and Amjad Almahairi and Vincent Wu and Wei-Lin Chiang and Tianhao Wu and Joseph E. Gonzalez and M Waleed Kadous and Ion Stoica},
      year={2024},
      eprint={2406.18665},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.18665},
}

@misc{chiang2024chatbot,
    title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
    author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
    year={2024},
    eprint={2403.04132},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
