LLM-Blender: A Simple Ensemble Learning Framework for LLMs

Bill Yuchen Lin, PhD
Published in AI2 Blog · 4 min read · Jul 19, 2023


With the rise of open-source large language models (LLMs) such as Alpaca, Vicuna, and Falcon, we are witnessing boundary-pushing possibilities in this realm. Certain models demonstrate superior overall performances on leaderboards like AlpacaEval and Chatbot Arena. However, is it reasonable to stick with one top-performing LLM for all user inputs? The answer may not be as straightforward as one might think.

Left: The AlpacaEval leaderboard that ranks LLMs with their overall performance. Right: A pie chart for the percentage of data where each LLM ranks first among all models.

Contrary to popular belief, our analysis tells a different story: the optimal LLM can vary significantly across inputs. Take Vicuna as an example: it emerges as the frontrunner among the eleven models in the pie chart based on average performance over all data, yet it is the best model on only 21.22% of examples. This is not entirely surprising, since these LLMs are trained on different data with different architectures and hyper-parameters. It is therefore natural to harness the strengths of multiple proficient LLMs and combine them in an ensemble, with the aim of producing consistently superior outputs that improve the user experience.
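The pie-chart analysis above amounts to an "oracle" tally: for each example, find which model scores best, then count how often each model wins. A minimal sketch of that computation, using made-up model names and scores (not the paper's actual data):

```python
from collections import Counter

def top_model_shares(scores_per_example):
    """For each example, find the model with the highest score,
    then report the fraction of examples each model wins."""
    wins = Counter(max(scores, key=scores.get) for scores in scores_per_example)
    total = len(scores_per_example)
    return {model: count / total for model, count in wins.items()}

# Toy per-example scores for three models (illustrative numbers only).
examples = [
    {"vicuna": 0.9, "alpaca": 0.7, "oa": 0.8},
    {"vicuna": 0.6, "alpaca": 0.8, "oa": 0.7},
    {"vicuna": 0.5, "alpaca": 0.6, "oa": 0.9},
]
print(top_model_shares(examples))
```

On the real MixInstruct data, the same tally is what shows Vicuna winning only about a fifth of the examples despite having the best average.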

An image of the paper title and research team. Title — LLM-Blender: Ensembling LLMs with Pairwise Ranking & Generative Fusion. Researchers Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. This was an ACL 2023 conference paper.

In this blog post, we are delighted to introduce our recent paper, published at ACL 2023. In it, we propose LLM-Blender, a simple two-stage ensemble learning framework that first performs pairwise comparison of candidate outputs to rank them, and then merges the top K to produce the final output. Alongside it, we introduce a dataset, MixInstruct, for evaluation. LLM-Blender outperforms both individual LLMs and baseline ensemble methods across a variety of metrics, representing a significant step toward efficient and effective ensemble learning with multiple proficient LLMs.

LLM-Blender: Ensembling by Pairwise Ranking & Generative Fusion

An illustration of the LLM-Blender framework.
The overview of the LLM-Blender framework.

LLM-Blender operates with two central modules, PairRanker and GenFuser, designed for the ranking and fusing stages, respectively.

During the ranking stage, we present a specific input x to N different LLMs and compile their outputs as candidates (y_1, …, y_N). A pairwise ranking module, PairRanker, is then employed to analyze and rank these candidates.

More concretely, we concatenate the input x with each pair of candidates (y_i and y_j) to form an input sequence, which is fed to a cross-attention text encoder such as RoBERTa that learns to determine the superior candidate for the given input x. Given the resulting comparison matrix, we aggregate the results into a final ranking of the candidates.
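The aggregation step can be sketched in a few lines. Here `compare(i, j)` stands in for the learned cross-encoder's win probability that candidate i beats candidate j; each candidate's aggregate score is its mean win probability against all others (one of several reasonable aggregation schemes, shown as a minimal sketch rather than the paper's exact formula):

```python
def rank_candidates(candidates, compare):
    """Rank candidates by aggregating pairwise comparisons.

    compare(i, j) returns a score in [0, 1]: the probability that
    candidate i beats candidate j for the given input. A candidate's
    aggregate score is its mean win probability against every other
    candidate; higher is better.
    """
    n = len(candidates)
    scores = []
    for i in range(n):
        wins = sum(compare(i, j) for j in range(n) if j != i)
        scores.append(wins / (n - 1))
    # Sort candidate indices by aggregate score, best first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]

answers = ["short", "a longer answer", "medium one"]
# Toy comparator: prefer the longer answer (a stand-in for PairRanker's
# RoBERTa-based cross-encoder, which judges quality, not length).
longer = lambda i, j: float(len(answers[i]) > len(answers[j]))
print(rank_candidates(answers, longer))
```

In LLM-Blender proper, the comparator is the trained PairRanker model rather than a hand-written heuristic.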

In the fusion stage, we select the top K candidates (say, K = 3) and feed them, together with the input, into a seq2seq LM such as T5 to train a generative fusion model. The fusion model then produces the final output y.
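Concretely, the fuser's source sequence packs the instruction and the top-K ranked candidates into one string for the seq2seq encoder. A minimal sketch, with an illustrative separator format (not necessarily the paper's exact template):

```python
def build_fuser_input(x, top_k):
    """Concatenate the instruction with the top-K ranked candidates
    into a single source sequence for a seq2seq fusion model.
    The "Instruction:"/"Candidate n:" template and </s> separators
    are illustrative choices, not the paper's exact format.
    """
    parts = [f"Instruction: {x}"]
    for rank, cand in enumerate(top_k, start=1):
        parts.append(f"Candidate {rank}: {cand}")
    return " </s> ".join(parts)

print(build_fuser_input("Name a primary color.",
                        ["Red.", "Blue is one.", "Yellow."]))
```

The fusion model is then trained to map this packed sequence to the reference response, so it can draw on complementary strengths of the top candidates rather than copying any single one.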

Evaluation

MixInstruct

To facilitate comprehensive evaluation, we curated a dataset called MixInstruct. It comprises 110k examples drawn from popular sources such as Alpaca and ShareGPT, each paired with responses from 11 widely used open-source LLMs, including Vicuna, alongside reference responses for evaluation. MixInstruct thus bridges the gap between multiple open-source LLMs on a common set of instructions.

Metrics

We implement PairRanker and GenFuser with DeBERTa-v3-large (400M) and Flan-T5-XL (3B), respectively. To ensure a comprehensive evaluation, we use a collection of automatic metrics, including BERTScore, BLEURT, and BARTScore. In addition, we integrate a GPT-based pairwise-comparison evaluation to grade responses and rank the methods.

Results

The empirical results on MixInstruct. Please refer to the paper for the full table & more details.

Our empirical results show a consistent performance gap between the top two individual models, OpenAssistant (OA) and Vicuna (Vic), and our PairRanker and full LLM-Blender methods across all four metrics. Notably, Blender's outputs beat or match those of these two competitors on 68% and 76% of test instances, respectively. These results indicate that Blender is a robust and compelling framework for ensemble learning.

Conclusion

To advance the use and growth of open-source LLMs, we introduce LLM-Blender, a post-hoc ensemble learning method that ranks and merges outputs from multiple LLMs. By dynamically ensembling LLMs, we aim to reduce the biases, errors, and uncertainties found in individual models, producing outputs better aligned with human preferences.

In addition, we present a new benchmark, MixInstruct, designed for training and evaluating LLM-ensembling methods on instruction-following tasks. We are also excited to open-source the LLM-Blender toolkit on GitHub, linked below, to make our approach easy for others to adopt and to support the development of more sophisticated AI systems with greater robustness, generalization, and accuracy across a broad spectrum of tasks.

Links

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
