Smart Routing
What is smart routing?
AI Smart Routing enhances the efficiency of machine learning tasks by assigning each user prompt to the most suitable open-source model, ensuring contextually accurate responses. This approach is also economically beneficial: in benchmarks on MT-Bench and AlpacaEval, it achieved cost reductions of roughly 50% and up to 90%, respectively, compared to conventional single-model approaches. This system exemplifies the strategic advantage of combining diverse AI models into cost-effective, scalable solutions.
Why is smart routing useful?
Models: be-base-v1.0 and be-pro-v1.0
Routing and controlling the flow of information is a core component of optimizing intelligence. Our previous research shows that neurons within the brain are selective to certain stimuli. For example, the fusiform face area (FFA) is known by neuroscientists to activate selectively when people see faces rather than non-face objects. Other specialized areas within the temporal cortex are selective for visual scenes and buildings (parahippocampal place area, PPA), for body parts (extrastriate body area, EBA), and for reading words (visual word form area, VWFA). This suggests a "Mixture of Experts" architecture in which a dedicated cortical area is recruited for each kind of task.
Similarly, we see this theme emerge in large language models. While the exact details of GPT-4 are unknown, it is widely rumored that its architecture is a Mixture of Experts (MoE) consisting of eight models, each with 220B parameters. Other LLMs, including the Mixtral of Experts model and the new Gemini 1.5 model, use an MoE architecture to process data more efficiently. However, while these architectures focus on routing data internally within a single model, we have observed a significant difference in capabilities between open-source and proprietary models, as seen in our evaluation of dozens of models.
We observe that certain open-source models and techniques achieve state-of-the-art performance on specific tasks at a significantly lower cost. This opens LLMs to quality-cost arbitrage opportunities that can save the end user on AI costs. We achieve this by selecting and routing between proprietary APIs and open-source LLMs, in combination with prompt engineering, re-ranking, caching, and blending. On MT-Bench, our mixture of experts scores 8.93, nearly equivalent to GPT-4 (8.99), at nearly half the cost. On AlpacaEval, we achieve a score of 93.54, nearly on par with GPT-4 (95.28), at a discount of nearly 90%.
How are we doing the routing?
All the details about routing are open source and can be seen here. We provide a detailed Jupyter notebook that illustrates the entire process of building a fast prompt classifier for routing. We use DistilBERT, a smaller language representation model designed for efficient operation and training under computational constraints. DistilBERT is less costly to pre-train and well suited for on-device computation, as demonstrated through experiments and comparative studies. We quantize the model using Optimum, enabling it to run extremely fast on a CPU router. Each classifier takes 5-8 ms to run, so an ensemble of 8 prompt classifiers takes about 50 ms in total. Thus, each endpoint can route about 20 requests per second.
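The latency and throughput figures above follow from simple arithmetic, which can be sketched as below. The serial-execution assumption is ours; in practice the classifiers could also run in parallel, which would only improve the bound.

```python
# Back-of-the-envelope throughput for the CPU router described above.
# Per-classifier latency (5-8 ms) and ensemble size (8) come from the text;
# running the classifiers serially is our illustrative assumption.

def ensemble_latency_ms(per_classifier_ms: float, n_classifiers: int) -> float:
    """Total latency if the quantized classifiers run one after another."""
    return per_classifier_ms * n_classifiers

def max_requests_per_second(latency_ms: float) -> float:
    """Upper bound on sequential routing throughput for one endpoint."""
    return 1000.0 / latency_ms

latency = ensemble_latency_ms(per_classifier_ms=6.25, n_classifiers=8)
print(latency)                           # 50.0 ms for the full ensemble
print(max_requests_per_second(latency))  # 20.0 requests per second
```

With 5 ms classifiers the same math gives 40 ms and 25 requests per second, so the figures in the text sit comfortably inside the measured 5-8 ms range.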
Comparison to other routing methods
The most popular alternative to routing is via embedding similarity. For example, to route a programming question, one might define the target classes as ["coding", "not coding"]. Each string is transformed into an embedding and compared against a prompt query such as "write a bubble sort in Python". Given the computed pair-wise cosine similarity between the query and each class, we can label the prompt as a coding question and route it to a coding-specific model. Embedding-based approaches do not scale well to larger numbers of classes, nor can they capture non-semantic classes (such as whether the response is likely to be more or less than 200 tokens). However, they are adaptable and comparably fast, and thus provide an excellent alternative to trained fast classifiers.
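The embedding-similarity approach can be sketched in a few lines. The 3-dimensional "embeddings" below are toy vectors chosen purely for illustration; a real router would embed the class labels and the prompt with the same embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route(prompt_vec, class_vecs):
    """Return the class label whose embedding is most similar to the prompt's."""
    return max(class_vecs, key=lambda label: cosine_similarity(prompt_vec, class_vecs[label]))

# Toy embeddings standing in for real model outputs.
class_vecs = {
    "coding":     [0.9, 0.1, 0.0],
    "not coding": [0.1, 0.9, 0.2],
}
prompt_vec = [0.8, 0.2, 0.1]  # toy embedding for "write a bubble sort in Python"

print(route(prompt_vec, class_vecs))  # coding
```

The scaling problem noted above is visible even here: each new class adds another pair-wise comparison per request, and purely semantic distance cannot express a class like "response longer than 200 tokens".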
How can we benefit from smart routing?
You can use our open-source code to create your own smart routing algorithms. Alternatively, we offer access to our smart routing API for easy integration into your applications. Simply replace your OpenAI base URL with "https://api.blockentropy.ai/v1" and enter your API key. You can choose the be-base-v1.0 model, which uses only a mixture of open-source models, or be-pro-v1.0, which routes complex queries involving coding and reasoning to GPT-4 Turbo. Note that to use the be-pro model, you must provide your OpenAI API key. Rest assured, we do not store any of your keys in plaintext; only you can decrypt the key using your private BE API key, and we can see neither your BE key nor your OpenAI key.
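A minimal sketch of the drop-in integration, using only the standard library. The base URL and model name come from the text; the /chat/completions path and Bearer-token header follow the usual OpenAI-compatible convention, and the API key is a placeholder. We only build the request here; actually sending it is left commented out.

```python
import json

BASE_URL = "https://api.blockentropy.ai/v1"
API_KEY = "YOUR_BE_API_KEY"  # placeholder, not a real key

def build_chat_request(model: str, prompt: str):
    """Assemble an OpenAI-compatible chat-completion request (url, headers, body)."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, headers, body

url, headers, body = build_chat_request("be-base-v1.0", "write a bubble sort in Python")
print(url)  # https://api.blockentropy.ai/v1/chat/completions

# To send the request (network call, so commented out here):
# import urllib.request
# req = urllib.request.Request(url, data=body, headers=headers)
# print(urllib.request.urlopen(req).read())
```

If you already use an OpenAI client library, the same switch is typically just pointing its base URL at the endpoint above and swapping the model name for be-base-v1.0 or be-pro-v1.0.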
As proprietary models improve, we will update our routing models to include the latest capabilities of proprietary LLMs. Likewise, as open source improves, we will update our routing models to arbitrage the price of the latest open-source capabilities.