
Quality-Diversity through AI Feedback: Background and Related Work


Too Long; Didn't Read

Discover the synergy of large language models (LMs) and quality-diversity search algorithms in Quality-Diversity through AI Feedback (QDAIF). Learn about the Evolution through Large Models (ELM) framework and how QDAIF revolutionizes creative domains by automating the generation and evaluation of diverse, high-quality text. Explore the role of AI feedback in advancing self-refinement and optimizing text generation across various iterations.

Authors:

(1) Herbie Bradley, CarperAI, CAML Lab, University of Cambridge & EleutherAI;

(2) Andrew Dai, Aleph Alpha;

(3) Hannah Teufel, Aleph Alpha;

(4) Jenny Zhang, Department of Computer Science, University of British Columbia & Vector Institute;

(5) Koen Oostermeijer, Aleph Alpha;

(6) Marco Bellagente, Stability AI;

(7) Jeff Clune, Department of Computer Science, University of British Columbia, Vector Institute & Canada CIFAR AI Chair;

(8) Kenneth Stanley, Maven;

(9) Grégory Schott, Aleph Alpha;

(10) Joel Lehman, Stochastic Labs.


2 BACKGROUND & RELATED WORK

2.1 EVOLUTION THROUGH LARGE MODELS

Advancements in language models have enabled new kinds of powerful search algorithms that apply LMs as search operators, e.g. to create variation or to evaluate solutions. While other search algorithms could also be used, this paper creates a QDAIF algorithm by building on Evolution through Large Models (ELM) (Lehman et al., 2022), a framework for evolutionary search over code or text that uses LMs to generate intelligent variation (for example, through specialized language models trained on code diffs (Bradley et al., 2023b), or through simple few-shot prompting (Meyerson et al., 2023; Chen et al., 2023)). Most QDAIF results in this paper generate new search candidates through Language Model Crossover (LMX) (Meyerson et al., 2023), a recent and general few-shot prompting approach that can evolve, e.g., mathematical expressions, sentences, Python programs, and prompts for text-to-image models by leveraging the in-context learning capabilities of LMs (Brown et al., 2020). The approach is simple: a few existing search candidates are concatenated into a prompt, predisposing the LM to generate new, similar candidates. In this way, LMX creates intelligent variation without requiring any specially trained models. Our experimental implementation builds on OpenELM (Bradley et al., 2023a), a versatile open-source Python library designed for research into LM-based evolutionary algorithms.
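
To make the LMX mechanism concrete, here is a minimal sketch of a single variation step. The `generate` helper and the "Story:" prompt format are illustrative assumptions, not the paper's exact interface or prompt templates (OpenELM provides its own model wrappers and prompt builders).

```python
import random

def lmx_variation(population, generate, num_parents=3):
    """Produce one new candidate by few-shot prompting an LM with existing candidates."""
    parents = random.sample(population, k=min(num_parents, len(population)))
    # Concatenating a few parent candidates predisposes the LM to continue
    # with a new, similar candidate (in-context learning).
    prompt = "\n".join(f"Story: {p}" for p in parents) + "\nStory:"
    completion = generate(prompt, max_new_tokens=200, temperature=0.9)
    # Keep only the first generated candidate, truncating at the next "Story:" marker if present.
    return completion.split("Story:")[0].strip()
```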

2.2 QUALITY DIVERSITY ALGORITHMS

Traditional optimization algorithms aim to discover a single high-quality solution, which, while appropriate for many situations, can fail to illuminate the full range of possible high-quality solutions. For creative and design problems in particular, a user may want to choose what they deem most appropriate from a diverse set of such candidates. In contrast, Quality Diversity (QD) algorithms aim to optimize not for a single optimal solution, but for a diverse set of high-quality solutions (Lehman & Stanley, 2011b; Mouret & Clune, 2015; Pugh et al., 2016; Fontaine & Nikolaidis, 2021). QD algorithms can thus provide a richer landscape of solutions, enabling adaptability and flexibility in addressing multifaceted challenges (Cully et al., 2015). In addition to a quality measure (i.e. an objective function), QD requires a diversity measure that characterizes the desired axes of variation so the search can be encouraged to explore them. For instance, Lehman et al. (2022) evolved Python programs to design varied locomoting robots, where the diversity dimensions are the robot's height, width, and mass.
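
For intuition, the following sketch shows the core archive update of a MAP-Elites-style QD algorithm (one common QD formulation): candidates are binned by their diversity descriptor, and each bin keeps only the highest-quality candidate seen so far. The bin width and descriptor here are illustrative choices, not the settings used in this paper.

```python
def archive_insert(archive, candidate, quality, descriptor, bin_width=0.1):
    """Keep the highest-quality candidate found so far in each region of diversity space."""
    # Discretize the diversity descriptor (e.g. a robot's height/width/mass,
    # or the tone of a story) into an archive cell.
    cell = tuple(int(d // bin_width) for d in descriptor)
    incumbent = archive.get(cell)
    # Replace the cell's elite only if the new candidate is of higher quality.
    if incumbent is None or quality > incumbent[0]:
        archive[cell] = (quality, candidate)
    return archive
```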


A significant limitation of existing QD algorithms lies in their reliance on low-level quality and diversity measures (Mouret & Clune, 2015). This requirement complicates applying QD algorithms to complex and creative domains, such as the creative writing domains explored in this paper. Intuitively, such measures (e.g. sensor readings (Cully et al., 2015) or engineered features (Manning, 2009)) lack the subtlety and depth needed to capture the complexities of human creativity and intuition, e.g. the nuances, moods, or cultural references that resonate in human experience. Interestingly, having been trained on vast amounts of human-generated data, LMs can begin to emulate such nuanced human judgments (cf. Section 2.3). Thus, by employing an LM to evaluate both quality and diversity, QDAIF significantly simplifies applying QD and enlarges the range of domains it can address.


Feedback from learned ML models has been used in prior work to reduce the need for hand-crafted heuristics or expensive ground-truth evaluations. In model-based QD, learned feedback is supplied by surrogate models. Gaier et al. (2017) introduced the use of surrogates (via a Gaussian process) to predict fitness (quality). Subsequently, Keller et al. (2020) introduced a learned model to predict both fitness and behavior characteristics (diversity), an approach that has since become standard (Lim et al., 2021; 2022; Zhang et al., 2022; Bhatt et al., 2022). Surrogate models require domain-specific training data to update their predictions within a limited domain, whereas AI feedback leverages off-the-shelf instruction-tuned LMs (Chung et al., 2022; Ouyang et al., 2022) to automate expensive human feedback for a variety of evaluation tasks. More recently, Fontaine & Nikolaidis (2021) used CLIP embeddings (Radford et al., 2021) as both quality and diversity measures to navigate the search space of StyleGAN (Karras et al., 2019), producing a range of faces with a desired characteristic (e.g. "A person with red hair"). We show that using pre-trained surrogate models is more prone to reward hacking in the natural-language case (Skalse et al., 2022; Lehman et al., 2019) (cf. Appendix A.2). Hence, QDAIF capitalizes on the strengths of general-purpose LMs for evaluating generated solutions.

2.3 AI FEEDBACK

Recent months have seen a surge in research that leverages LMs to provide feedback on the training, evaluation, or problem-solving capabilities of other LMs (Bai et al., 2022; Perez et al., 2022; Shinn et al., 2023; Wang et al., 2023a; Colas et al., 2023; Zhang et al., 2023; Lee et al., 2023). Bai et al. (2022) show that LM-generated critiques and refinements are instrumental in enhancing performance on metrics like helpfulness and harmlessness. One particularly promising direction for AI feedback is self-refinement, where LMs evaluate and score their own generations and then iteratively improve their output (Bai et al., 2022; Madaan et al., 2023). Self-refinement has demonstrated significant improvement in output quality as gauged by human evaluators (Madaan et al., 2023), underscoring the generation-discrimination gap (Saunders et al., 2022, p.12): it is often easier for a model to evaluate the quality of a piece of text than to generate text of the same quality. Complementary to single-objective optimization with self-refinement, QDAIF uses AI feedback to assess diversity in addition to quality, facilitating more varied and improved text generation over multiple iterations of refinement through evolution.
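
As a rough illustration of AI feedback for both quality and diversity, the sketch below asks an instruction-tuned LM to score a candidate's quality and to place it along a diversity axis (here, tone). The prompts and the `generate` helper are hypothetical; the actual QDAIF evaluation prompts and scoring details may differ (e.g. by reading answer probabilities rather than free-form text).

```python
def ai_feedback(candidate, generate):
    """Ask an instruction-tuned LM for a quality score and a diversity attribute."""
    quality_prompt = (
        f"Story: {candidate}\n"
        "On a scale of 1 to 10, how well-written is this story? Answer with a single number: "
    )
    diversity_prompt = (
        f"Story: {candidate}\n"
        "Is the tone of this story happy or sad? Answer with one word: "
    )
    quality = float(generate(quality_prompt, max_new_tokens=3).strip())
    tone = generate(diversity_prompt, max_new_tokens=3).strip().lower()
    # The quality score drives selection; the tone places the candidate in an archive cell.
    return quality, tone
```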



This paper is available on arxiv under CC 4.0 license.