From b5464fc23495f1cd0b1b58bc1342c5039e541105 Mon Sep 17 00:00:00 2001
From: Hyesoo Kim <100982596+duper203@users.noreply.github.com>
Date: Thu, 31 Oct 2024 11:57:17 -0700
Subject: [PATCH 1/4] pdf to podcast using RAG
---
PDF_to_Podcast_RAG.ipynb | 839 +++++++++++++++++++++++++++++++++++++++
1 file changed, 839 insertions(+)
create mode 100644 PDF_to_Podcast_RAG.ipynb
diff --git a/PDF_to_Podcast_RAG.ipynb b/PDF_to_Podcast_RAG.ipynb
new file mode 100644
index 0000000..d867535
--- /dev/null
+++ b/PDF_to_Podcast_RAG.ipynb
@@ -0,0 +1,839 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ju_mt8GN1lAM"
+ },
+ "source": [
+ "# An Implementation of Notebook LM's PDF to Podcast"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cnatbafQ1lAN"
+ },
+ "source": [
+ "## Overview\n",
+ "\n",
+ "Inspired by [Notebook LM's](https://notebooklm.google/) podcast generation feature and a recent open source implementation of [Open Notebook LM](https://github.com/gabrielchua/open-notebooklm). In this cookbook we will implement a walkthrough of how you can build a PDF to podcast pipeline.\n",
+ "\n",
+ "## Purpose of the Excercise\n",
+ "\n",
+ "The purpose of this exercise is to guide users through the process of building an automated pipeline that transforms a PDF document into a podcast-ready script and audio output. Specifically, it integrates PDF parsing, question generation, Retrieval-Augmented Generation (RAG) for contextual answers, and text-to-speech (TTS) synthesis to create a complete podcast production workflow."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 1. Install Dependencies / Import Necessary Libraries\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "TSqdS-6u3ond"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "cN0Tpr76ssM1"
+ },
+ "outputs": [],
+ "source": [
+ "!apt install -qU libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg\n",
+ "!pip install -qU ffmpeg-python\n",
+ "!pip install -qU PyAudio\n",
+ "!pip install -qU pypdf #to read PDF content\n",
+ "!pip install -qU cartesia #to access TTS model\n",
+ "!pip install -qU langchain-upstage langchain\n",
+ "!pip install -qU langchain_community faiss-cpu"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "iWea6go4r72c"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "from google.colab import userdata\n",
+ "\n",
+ "from pathlib import Path\n",
+ "from tempfile import NamedTemporaryFile\n",
+ "from typing import List, Literal, Tuple, Optional, Dict, Union, List, Any\n",
+ "\n",
+ "import json\n",
+ "from pydantic import BaseModel\n",
+ "\n",
+ "from cartesia import Cartesia\n",
+ "from pydantic import ValidationError\n",
+ "\n",
+ "from langchain_upstage import ChatUpstage, UpstageEmbeddings, UpstageDocumentParseLoader\n",
+ "from langchain_core.prompts import ChatPromptTemplate\n",
+ "\n",
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "from langchain.vectorstores import FAISS"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# @title set API key\n",
+ "from pprint import pprint\n",
+ "import os\n",
+ "\n",
+ "import warnings\n",
+ "\n",
+ "warnings.filterwarnings(\"ignore\")\n",
+ "\n",
+ "if \"google.colab\" in str(get_ipython()):\n",
+ " # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets\n",
+ " from google.colab import userdata\n",
+ "\n",
+ " os.environ[\"UPSTAGE_API_KEY\"] = userdata.get(\"UPSTAGE_API_KEY\")\n",
+ " os.environ[\"CARTESIA_API_KEY\"] = userdata.get(\"CARTESIA_API_KEY\")\n",
+ "\n",
+ "else:\n",
+ " # Running locally. Please set the UPSTAGE_API_KEY in the .env file\n",
+ " from dotenv import load_dotenv\n",
+ "\n",
+ " load_dotenv()\n",
+ "\n",
+ "assert (\n",
+ " \"UPSTAGE_API_KEY\" in os.environ\n",
+ "), \"Please set the UPSTAGE_API_KEY environment variable\""
+ ],
+ "metadata": {
+ "id": "QprcKBaQ2xFr"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 2. Generate QnA context for the podcast using RAG\n",
+ "\n",
+ "### [2-1] Generate 7 Questions to Ask from the PDF\n",
+ "\n",
+ "In this step, we use a Upstage Solar to generate insightful and engaging questions based on the content of the provided PDF. The goal is to create a comprehensive set of questions that cover various aspects of the document, making them suitable for a podcast interview format. The questions should be designed to provoke thought, encourage in-depth discussion, and highlight key points from the PDF content.\n",
+ "\n",
+ "\n",
+ "* Upstage DocParse\n",
+ "* Upstage Solar\n"
+ ],
+ "metadata": {
+ "id": "kgP3DiKd3SUj"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#Load in PDF of Choice\n",
+ "def get_PDF_text(file):\n",
+ " text = ''\n",
+ " loader = UpstageDocumentParseLoader(file, output_format='text')\n",
+ "\n",
+ " pages = loader.load()\n",
+ " for page in pages:\n",
+ " text += page.page_content\n",
+ "\n",
+ " return text\n",
+ "text = get_PDF_text('solar.pdf')"
+ ],
+ "metadata": {
+ "id": "GPdIe4rl5dVk"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
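+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before the full pipeline, here is a minimal sketch of the two ideas this section relies on: the parsed text is chunked and indexed with FAISS (via `UpstageEmbeddings`) so that answers can later be retrieved with RAG, and `ChatUpstage` (Solar) drafts the seven interview questions. This is an illustrative sketch rather than the final implementation; the chunk sizes, the embedding model name, and the prompt wording below are assumptions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sketch (assumed settings): index the PDF text for RAG and draft questions.\n",
+    "\n",
+    "# 1) Split the parsed PDF text into chunks and index them with FAISS so that\n",
+    "#    later cells can retrieve relevant chunks when answering each question.\n",
+    "splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
+    "chunks = splitter.split_text(text)\n",
+    "vectorstore = FAISS.from_texts(chunks, UpstageEmbeddings(model=\"solar-embedding-1-large\"))\n",
+    "\n",
+    "# 2) Ask Upstage Solar to draft 7 podcast-style interview questions.\n",
+    "llm = ChatUpstage()\n",
+    "question_prompt = ChatPromptTemplate.from_messages([\n",
+    "    (\"system\",\n",
+    "     \"You are preparing a podcast interview about a document. \"\n",
+    "     \"Write 7 insightful, engaging questions that cover its key points. \"\n",
+    "     \"Return them as a numbered list, one question per line.\"),\n",
+    "    (\"human\", \"{document}\"),\n",
+    "])\n",
+    "# Truncate the document here only to keep this sketch's prompt small.\n",
+    "questions_text = (question_prompt | llm).invoke({\"document\": text[:8000]}).content\n",
+    "questions = [q.strip() for q in questions_text.split(\"\\n\") if q.strip()]\n",
+    "print(questions_text)\n",
+    "\n",
+    "# 3) Sanity-check retrieval: show part of a chunk relevant to the first drafted question.\n",
+    "print(vectorstore.similarity_search(questions[0], k=3)[0].page_content[:300])"
+   ]
+  },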
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 140
+ },
+ "id": "D9BzDxmgvS2V",
+ "outputId": "1feefc02-5b5b-4c24-c6e0-c926c5e2a77f"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective\\nDepth Up-ScalingDahyun Kim∗, Chanjun Park∗†, Sanghoon Kim∗†, Wonsung Lee∗†, Wonho Song∗\\nYunsu Kim∗, Hyeonwoo Kim∗, Yungi Kim, Hyeonju Lee, Jihoo Kim\\nChangbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim\\nMikyoung Cha, Hwalsuk Lee†, Sunghun Kim†Upstage AI, South Korea{kdahyun, chanjun.park, limerobot, wonsung.lee, hwalsuk.lee, hunkim}@upstage.aiAbstractWe introduce SOLAR 10.7B, a large language\\nmodel (LLM) with 10.7 billion parameters,\\ndemonstrating superior performance in various\\nnatural language processing (NLP) tasks. In-\\nspired by recent efforts to efficiently up-scale\\nLLMs, we present a method for scaling LLMs\\ncalled depth up-scaling (DUS), which encom-\\npasses depthwise scaling and continued pre-\\ntraining. In contrast to other LLM up-scaling\\nmethods that use mixture-of-experts, DUS does\\nnot require complex changes to train and infer-\\nence efficiently. We show experimentally that\\nDUS is simple yet effective in scaling up high-\\nperformance LLMs from small ones. Building\\non the DUS model, we additionally present SO-\\nLAR 10.7B-Instruct, a variant fine-tuned for\\ninstruction-following capabilities, surpassing\\nMixtral-8x7B-Instruct. SOLAR 10.7B is pub-\\nlicly available under the Apache 2.0 license,\\npromoting broad access and application in the\\nLLM field 1.1 Introduction2024\\nApr\\n4\\n[cs.CL]\\narXiv:2312.15166v3The field of natural language processing (NLP)\\nhas been significantly transformed by the introduc-\\ntion of large language models (LLMs), which have\\nenhanced our understanding and interaction with\\nhuman language (Zhao et al., 2023). These ad-\\nvancements bring challenges such as the increased\\nneed to train ever larger models (Rae et al., 2021;\\nWang et al., 2023; Pan et al., 2023; Lian, 2023;\\nYao et al., 2023; Gesmundo and Maile, 2023) ow-\\ning to the performance scaling law (Kaplan et al.,\\n2020; Hernandez et al., 2021; Anil et al., 2023;\\nKaddour et al., 2023). To efficiently tackle the\\nabove, recent works in scaling language models\\nsuch as a mixture of experts (MoE) (Shazeer et al.,\\n2017; Komatsuzaki et al., 2022) have been pro-\\nposed. While those approaches are able to effi-∗Equal Contribution † Corresponding Author\\n1https://huggingface.co/upstage/\\nSOLAR-10.7B-v1.0ciently and effectively scale-up LLMs, they often\\nrequire non-trivial changes to the training and infer-\\nence framework (Gale et al., 2023), which hinders\\nwidespread applicability. Effectively and efficiently\\nscaling up LLMs whilst also retaining the simplic-\\nity for ease of use is an important problem (Alberts\\net al., 2023; Fraiwan and Khasawneh, 2023; Sallam\\net al., 2023; Bahrini et al., 2023).Inspired by Komatsuzaki et al. (2022), we\\npresent depth up-scaling (DUS), an effective and\\nefficient method to up-scale LLMs whilst also re-\\nmaining straightforward to use. DUS consists of\\nscaling the number of layers in the base model and\\ncontinually pretraining the scaled model. Unlike\\n(Komatsuzaki et al., 2022), DUS does not scale\\nthe model using MoE and rather use a depthwise\\nscaling method analogous to Tan and Le (2019)\\nwhich is adapted for the LLM architecture. Thus,\\nthere are no additional modules or dynamism as\\nwith MoE, making DUS immediately compatible\\nwith easy-to-use LLM frameworks such as Hug-\\ngingFace (Wolf et al., 2019) with no changes to\\nthe training or inference framework for maximal\\nefficiency. 
Furthermore, DUS is applicable to all\\ntransformer architectures, opening up new gate-\\nways to effectively and efficiently scale-up LLMs\\nin a simple manner. Using DUS, we release SO-\\nLAR 10.7B, an LLM with 10.7 billion parameters,\\nthat outperforms existing models like Llama 2 (Tou-\\nvron et al., 2023) and Mistral 7B (Jiang et al., 2023)\\nin various benchmarks.We have also developed SOLAR 10.7B-Instruct,\\na variant fine-tuned for tasks requiring strict adher-\\nence to complex instructions. It significantly out-\\nperforms the Mixtral-8x7B-Instruct model across\\nvarious evaluation metrics, evidencing an advanced\\nproficiency that exceeds the capabilities of even\\nlarger models in terms of benchmark performance.By releasing SOLAR 10.7B under the Apache\\n2.0 license, we aim to promote collaboration and in-\\nnovation in NLP. This open-source approach allowsFigure 1: Depth up-scaling for the case with n = 32, s = 48, and m = 8. Depth up-scaling is achieved through a\\ndual-stage process of depthwise scaling followed by continued pretraining.for wider access and application of these models\\nby researchers and developers globally.2 Depth Up-ScalingTo efficiently scale-up LLMs, we aim to utilize pre-\\ntrained weights of base models to scale up to larger\\nLLMs (Komatsuzaki et al., 2022). While exist-\\ning methods such as Komatsuzaki et al. (2022) use\\nMoE (Shazeer et al., 2017) to scale-up the model ar-\\nchitecture, we opt for a different depthwise scaling\\nstrategy inspired by Tan and Le (2019). We then\\ncontinually pretrain the scaled model as just scaling\\nthe model without further pretraining degrades the\\nperformance.Base model. Any n-layer transformer architec-\\nture can be used but we select the 32-layer Llama\\n2 architecture as our base model. We initialize the\\nLlama 2 architecture with pretrained weights from\\nMistral 7B, as it is one of the top performers com-\\npatible with the Llama 2 architecture. By adopting\\nthe Llama 2 architecture for our base model, we\\naim to leverage the vast pool of community re-\\nsources while introducing novel modifications to\\nfurther enhance its capabilities.Depthwise scaling. From the base model with n\\nlayers, we set the target layer count s for the scaled\\nmodel, which is largely dictated by the available\\nhardware.With the above, the depthwise scaling process\\nis as follows. The base model with n layers is\\nduplicated for subsequent modification. Then, we\\nremove the final m layers from the original model\\nand the initial m layers from its duplicate, thus\\nforming two distinct models with n − m layers.\\nThese two models are concatenated to form a scaled\\nmodel with s = 2·(n−m) layers. Note that n = 32\\nfrom our base model and we set s = 48 consideringour hardware constraints and the efficiency of the\\nscaled model, i.e., fitting between 7 and 13 billion\\nparameters. Naturally, this leads to the removal of\\nm = 8 layers. The depthwise scaling process with\\nn = 32, s = 48, and m = 8 is depicted in ‘Step 1:\\nDepthwise Scaling’ of Fig. 1.We note that a method in the community that also\\nscale the model in the same manner 2 as ‘Step 1:\\nDepthwise Scaling’ of Fig. 1 has been concurrently\\ndeveloped.Continued pretraining. The performance of the\\ndepthwise scaled model initially drops below that\\nof the base LLM. Thus, we additionally apply\\nthe continued pretraining step as shown in ‘Step\\n2: Continued Pretraining’ of Fig. 1. 
Experimen-\\ntally, we observe rapid performance recovery of\\nthe scaled model during continued pretraining, a\\nphenomenon also observed in Komatsuzaki et al.\\n(2022). We consider that the particular way of\\ndepthwise scaling has isolated the heterogeneity\\nin the scaled model which allowed for this fast\\nperformance recovery.Delving deeper into the heterogeneity of the\\nscaled model, a simpler alternative to depthwise\\nscaling could be to just repeat its layers once more,\\ni.e., from n to 2n layers. Then, the ‘layer distance’,\\nor the difference in the layer indices in the base\\nmodel, is only bigger than 1 where layers n and\\nn + 1 are connected, i.e., at the seam.However, this results in maximum layer distance\\nat the seam, which may be too significant of a\\ndiscrepancy for continued pretraining to quickly\\nresolve. Instead, depthwise scaling sacrifices the\\n2m middle layers, thereby reducing the discrep-\\nancy at the seam and making it easier for continued2https://huggingface.co/Undi95/\\nMistral-11B-v0.1Properties Instruction Training Datasets Alignment\\n Alpaca-GPT4 OpenOrca Synth. Math-Instruct Orca DPO Pairs Ultrafeedback Cleaned Synth. Math-Alignment\\n Total # Samples 52K 2.91M 126K 12.9K 60.8K 126K\\n Maximum # Samples Used 52K 100K 52K 12.9K 60.8K 20.1K\\n Open Source O O ✗ O O ✗Table 1: Training datasets used for the instruction and alignment tuning stages, respectively. For the instruction\\ntuning process, we utilized the Alpaca-GPT4 (Peng et al., 2023), OpenOrca (Mukherjee et al., 2023), and Synth.\\nMath-Instruct datasets, while for the alignment tuning, we employed the Orca DPO Pairs (Intel, 2023), Ultrafeedback\\nCleaned (Cui et al., 2023; Ivison et al., 2023), and Synth. Math-Alignment datasets. The ‘Total # Samples‘ indicates\\nthe total number of samples in the entire dataset. The ‘Maximum # Samples Used‘ indicates the actual maximum\\nnumber of samples that were used in training, which could be lower than the total number of samples in a given\\ndataset. ‘Open Source‘ indicates whether the dataset is open-sourced.pretraining to quickly recover performance. We\\nattribute the success of DUS to reducing such dis-\\ncrepancies in both the depthwise scaling and the\\ncontinued pretraining steps. We also hypothesize\\nthat other methods of depthwise scaling could also\\nwork for DUS, as long as the discrepancy in the\\nscaled model is sufficiently contained before the\\ncontinued pretraining step.Comparison to other up-scaling methods. Un-\\nlike Komatsuzaki et al. (2022), depthwise scaled\\nmodels do not require additional modules like gat-\\ning networks or dynamic expert selection. Conse-\\nquently, scaled models in DUS do not necessitate\\na distinct training framework for optimal training\\nefficiency, nor do they require specialized CUDA\\nkernels for fast inference. A DUS model can seam-\\nlessly integrate into existing training and inference\\nframeworks while maintaining high efficiency.3 Training DetailsAfter DUS, including continued pretraining, we\\nperform fine-tuning of SOLAR 10.7B in two stages:\\n1) instruction tuning and 2) alignment tuning.Instruction tuning. In the instruction tuning\\nstage, the model is trained to follow instructions in\\na QA format (Zhang et al., 2023). We mostly use\\nopen-source datasets but also synthesize a math QA\\ndataset to enhance the model’s mathematical capa-\\nbilities. A rundown of how we crafted the dataset is\\nas follows. 
First, seed math data are collected from\\nthe Math (Hendrycks et al., 2021) dataset only, to\\navoid contamination with commonly used bench-\\nmark datasets such as GSM8K (Cobbe et al., 2021).\\nThen, using a process similar to MetaMath (Yu\\net al., 2023), we rephrase the questions and an-\\nswers of the seed math data. We use the resulting\\nrephrased question-answer pairs as a QA datasetand call it ‘Synth. Math-Instruct‘.Alignment tuning. In the alignment tuning stage,\\nthe instruction-tuned model is further fine-tuned\\nto be more aligned with human or strong AI\\n(e.g., GPT4 (OpenAI, 2023)) preferences using\\nsDPO (Kim et al., 2024a), an improved version\\nof direct preference optimization (DPO) (Rafailov\\net al., 2023). Similar to the instruction tuning stage,\\nwe use mostly open-source datasets but also syn-\\nthesize a math-focused alignment dataset utilizing\\nthe ‘Synth. Math-Instruct‘ dataset mentioned in the\\ninstruction tuning stage.The alignment data synthesis process is as\\nfollows. We take advantage of the fact that\\nthe rephrased question-answer pairs in Synth.\\nMath-Instruct data are beneficial in enhancing the\\nmodel’s mathematical capabilities (see Sec. 4.3.1).\\nThus, we speculate that the rephrased answer to the\\nrephrased question is a better answer than the orig-\\ninal answer, possibly due to the interim rephrasing\\nstep. Consequently, we set the rephrased question\\nas the prompt and use the rephrased answer as the\\nchosen response and the original answer as the re-\\njected response and create the {prompt, chosen,\\nrejected} DPO tuple. We aggregate the tuples from\\nthe rephrased question-answer pairs and call the\\nresulting dataset ‘Synth. Math-Alignment‘.4 Results4.1 Experimental DetailsTraining datasets. We present details regarding\\nour training datasets for the instruction and align-\\nment tuning stages in Tab. 1. We do not always\\nuse the entire dataset and instead subsample a set\\namount. Note that most of our training data is\\nopen-source, and the undisclosed datasets can be\\nsubstituted for open-source alternatives such as theModel Size Type H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n SOLAR 10.7B-Instruct ∼ 11B Alignment-tuned 74.20 71.08 88.16 66.21 71.43 83.58 64.75\\n Qwen 72B ∼ 72B Pretrained 73.60 65.19 85.94 77.37 60.19 82.48 70.43\\n Mixtral 8x7B-Instruct-v0.1 ∼ 47B Instruction-tuned 72.62 70.22 87.63 71.16 64.58 81.37 60.73\\n Yi 34B-200K ∼ 34B Pretrained 70.81 65.36 85.58 76.06 53.64 82.56 61.64\\n Yi 34B ∼ 34B Pretrained 69.42 64.59 85.69 76.35 56.23 83.03 50.64\\n Mixtral 8x7B-v0.1 ∼ 47B Pretrained 68.42 66.04 86.49 71.82 46.78 81.93 57.47\\n Llama 2 70B ∼ 70B Pretrained 67.87 67.32 87.33 69.83 44.92 83.74 54.06\\n Falcon 180B ∼ 180B Pretrained 67.85 69.45 88.86 70.50 45.47 86.90 45.94\\n SOLAR 10.7B ∼ 11B Pretrained 66.04 61.95 84.60 65.48 45.04 83.66 55.50\\n Qwen 14B ∼ 14B Pretrained 65.86 58.28 83.99 67.70 49.43 76.80 58.98\\n Mistral 7B-Instruct-v0.2 ∼ 7B Instruction-tuned 65.71 63.14 84.88 60.78 68.26 77.19 40.03\\n Yi 34B-Chat ∼ 34B Instruction-tuned 65.32 65.44 84.16 74.90 55.37 80.11 31.92\\n Mistral 7B ∼ 7B Pretrained 60.97 59.98 83.31 64.16 42.15 78.37 37.83Table 2: Evaluation results in the Open LLM Leaderboard for SOLAR 10.7B and SOLAR 10.7B-Instruct along with\\nother top-performing models. We report the scores for the six tasks mentioned in Sec. 4.1 along with the H6 score\\n(average of six tasks). We also report the size of the models in units of billions of parameters. 
The type indicates the\\ntraining stage of the model and is chosen from {Pretrained, Instruction-tuned, Alignment-tuned}. Models based on\\nSOLAR 10.7B are colored purple. The best scores for H6 and the individual tasks are shown in bold.MetaMathQA (Yu et al., 2023) dataset.We reformatted the instruction datasets with an\\nAlpaca-styled chat template. For datasets such as\\nOpenOrca, which are derived from FLAN (Long-\\npre et al., 2023), we filter data that overlaps with\\nthe benchmark datasets (see Tab. 8 in Appendix. C\\nfor more information). The alignment datasets\\nare in the {prompt, chosen, rejected} triplet for-\\nmat. We preprocess the alignment datasets follow-\\ning Zephyr (Tunstall et al., 2023). We use Data-\\nverse (Park et al., 2024) for data preprocessing.Evaluation. In the HuggingFace Open LLM\\nLeaderboard (Beeching et al., 2023), six types of\\nevaluation methods are presented: ARC (Clark\\net al., 2018), HellaSWAG (Zellers et al., 2019),\\nMMLU (Hendrycks et al., 2020), TruthfulQA (Lin\\net al., 2022), Winogrande (Sakaguchi et al., 2021),\\nand GSM8K (Cobbe et al., 2021). We utilize these\\ndatasets as benchmarks for evaluation and also re-\\nport the average scores for the six tasks, e.g., H6.\\nWe either submit directly to the Open LLM Leader-\\nboard or utilize Evalverse (Kim et al., 2024b) for\\nrunning evaluations locally.Model merging. Model merging methods such\\nas Yadav et al. (2023) can boost model perfor-\\nmance without further training. We merge some\\nof the models that we trained in both the instruc-\\ntion and alignment tuning stages. We implement\\nour own merging methods although popular open\\nsource also exist such as MergeKit3.4.2 Main ResultsWe present evaluation results for our SOLAR\\n10.7B and SOLAR 10.7B-Instruct models along3https://github.com/cg123/mergekitwith other top-performing models in Tab. 2. SO-\\nLAR 10.7B outperforms other pretrained models\\nof similar sizes, such as Qwen 14B and Mistral\\n7B, which shows that DUS is an effective method\\nto up-scale base LLMs. Furthermore, despite the\\nsmaller size, SOLAR 10.7B-Instruct scores the\\nhighest in terms of H6, even surpassing the recent\\ntop-performing open-source LLM Mixtral 8x7B-\\nInstruct-v0.1 or Qwen 72B. The above results indi-\\ncate DUS can up-scale models that are capable of\\nachieving state-of-the-art performance when fine-\\ntuned. We also report data contamination results\\nfor SOLAR 10.7B-Instruct in Appendix C.4.3 Ablation StudiesWe present ablation studies for both the instruction\\nand alignment tuning stages. Note that the evalua-\\ntion results for the following studies are ran locally\\nand may vary from results obtained by submitting\\nto the Open LLM Leaderboard.4.3.1 Instruction TuningAblation on the training datasets. We present\\nablation studies using different training datasets\\nfor the instruction tuning in Tab. 3. The ablated\\nmodels are prefixed with SFT for supervised fine-\\ntuning. ‘SFT v1’ only uses the Alpaca-GPT4\\ndataset, whereas ‘SFT v2’ also uses the OpenOrca\\ndataset. ‘SFT v3’ uses the Synth. Math-Instruct\\ndataset along with the datasets used in ‘SFT v2’.\\nSimilarly, ‘SFT v4’ uses the Synth. Math-Instruct\\ndataset along with the datasets used in ‘SFT v1’.First, we analyze how Alpaca-GPT4 and\\nOpenOrca affect the trained models. The first ab-\\nlated model, ‘SFT v1’, which used only the Alpaca-\\nGPT4 dataset for training, resulted in 69.15 for H6.Model Alpaca-GPT4 OpenOrca Synth. Math-Instruct H6 (Avg.) 
ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n SFT v1 O ✗ ✗ 69.15 67.66 86.03 65.88 60.12 82.95 52.24\\n SFT v2 O O ✗ 69.21 65.36 85.39 65.93 58.47 82.79 57.32\\n SFT v3 O O O 70.03 65.87 85.55 65.31 57.93 81.37 64.14\\n SFT v4 O ✗ O 70.88 67.32 85.87 65.87 58.97 82.48 64.75\\n SFT v3 + v4 O O O 71.11 67.32 85.96 65.95 58.80 82.08 66.57Table 3: Ablation studies on the different datasets used for instruction tuning. ‘SFT v3+v4’ indicates that the model\\nis merged from ‘SFT v3’ and ‘SFT v4’ by simply averaging the model weights. The best scores for H6 and the\\nindividual tasks are shown in bold.Model Ultrafeedback Clean Synth. Math-Alignment H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n DPO v1 O ✗ 73.06 71.42 88.49 66.14 72.04 81.45 58.83\\n DPO v2 O O 73.42 71.50 88.28 65.97 71.71 82.79 60.27\\n DPO v1 + v2 O O 73.21 71.33 88.36 65.92 72.65 82.79 58.23Table 4: Ablation studies on the different datasets used during the direct preference optimization (DPO) stage.\\n‘SFT v3’ is used as the SFT base model for DPO. We name ablated models with the ‘DPO’ prefix to indicate the\\nalignment tuning stage. ‘DPO v1+v2’ indicates that the model is merged from ‘DPO v1’ and ‘DPO v2’ by simply\\naveraging the model weights. The best scores for H6 and the individual tasks are shown in bold.Model Base SFT Model H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n DPO v2 SFT v3 73.42 71.50 88.28 65.97 71.71 82.79 60.27\\n DPO v3 SFT v3 + v4 73.58 71.33 88.08 65.39 72.45 81.93 62.32Table 5: Ablation studies on the different SFT base models used during the direct preference optimization (DPO)\\nstage. Ultrafeedback Clean and Synth. Math-Alignment datasets are used. We name ablated models with the ‘DPO’\\nprefix to indicate the alignment tuning stage. The best scores for H6 and the individual tasks are shown in bold.When we add the OpenOrca dataset to train the\\nsecond ablated model, ‘SFT v2’, the resulting H6\\nscore is 69.21, which is little change from 69.15 of\\n‘SFT v1’. However, the task scores vary more as\\n‘SFT v2’ gets a substantially higher GSM8K score\\nof 57.32 compared to 52.24 of ‘SFT v1’ but also\\ngets noticeably lower scores across the board for\\nARC, HellaSwag, and TruthfulQA. This seems to\\nindicate that using OpenOrca results in a model that\\nbehaves differently from using only Alpaca-GPT4.Second, we investigate whether Synth. Math-\\nInstruct dataset is beneficial. For ‘SFT v3’, we\\nadd the Synth. Math-Instruct dataset, which boosts\\nGSM8K scores to 64.14 and achieves comparable\\nscores for the other tasks. Interestingly, when we\\nadd the Synth. Math-Instruct dataset to ‘SFT v1’\\nto train ‘SFT v4’, we get our highest H6 score of\\n70.88 with higher scores than ‘SFT v3’ for all tasks.\\nFrom the above, we can see that adding the Synth.\\nMath-Instruct dataset is helpful.Lastly, we see whether merging models trained\\nwith and without OpenOrca can boost performance.\\nIn the first analysis, we saw that using OpenOrca re-\\nsulted in a model that behaved differently from the\\nmodel that was trained without OpenOrca. Build-\\ning on this intuition, we merge ‘SFT v3’ and ‘SFT\\nv4’ as they are the best-performing models withand without OpenOrca. To our surprise, the result-\\ning merged model ‘SFT v3+v4’ retains the high\\nscores for non-GSM8K tasks from ‘SFT v4’ but\\nalso achieves a higher GSM8K score than ‘SFT v3’\\nor ‘SFT v4’. 
Thus, we see that merging models\\nthat specialize in different tasks is a promising way\\nto obtain a model that performs well generally.4.3.2 Alignment TuningAs we utilize sDPO for practical alignment tun-\\ning, there are additional aspects to ablate such as\\nthe SFT base models used. Thus, we present ab-\\nlations for the different training datasets used for\\ntraining, the different SFT base models to initialize\\nthe sDPO training, and finally, the model merging\\nstrategy to obtain the final alignment-tuned model.Ablation on the training datasets. We ablate on\\nthe different alignment datasets used during DPO\\nin Tab. 4. We use ‘SFT v3’ as the SFT base model\\nfor DPO. ‘DPO v1’ only uses the Ultrafeedback\\nClean dataset while ‘DPO v2’ also used the Synth.\\nMath-Alignment dataset.First, we test how Ultrafeedback Clean and\\nSynth. Math-Alignment impacts model perfor-\\nmance. For ‘DPO v1’, it achieves 73.06 in H6,\\nwhich is a substantial boost from the SFT base\\nmodel score of 70.03. However, we note that whileModel H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n Cand. 1 73.73 70.48 87.47 65.73 70.62 81.53 66.57\\n Cand. 2 73.28 71.59 88.39 66.14 72.50 81.99 59.14Table 6: Performance comparison amongst the merge candidates. ‘Cand. 1’ and ‘Cand. 2’ are trained using the\\nsame setting as ‘DPO v2’ and ‘DPO v3’, respectively, but with slightly different hyper-parameters. The best scores\\nfor H6 and the individual tasks are shown in bold.Model Merge Method H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n Merge v1 Average (0.5, 0.5) 74.00 71.16 88.01 66.14 71.71 82.08 64.90\\n Merge v2 Average (0.4, 0.6) 73.93 71.08 88.08 66.27 71.89 81.77 64.52\\n Merge v3 Average (0.6, 0.4) 74.05 71.08 87.88 66.13 71.61 82.08 65.50\\n Merge v4 SLERP 73.96 71.16 88.03 66.25 71.79 81.93 64.59Table 7: Ablation studies on the different merge methods used for obtaining the final model. We use ‘Cand. 1’\\nand ‘Cand. 2’ from Tab. 6 as our two models for merging. We name the merged models with the ‘Merge’ prefix to\\nindicate they are merged. The best scores for H6 and the individual tasks are shown in bold.scores for tasks like ARC, HellaSwag, and Truth-\\nfulQA all improved by good margins, the score\\nfor GSM8K is 58.83, which is lower than the\\nSFT base model score of 64.14. Adding Synth.\\nMath-Alignment to train ‘DPO v2’, we see that\\nthe GSM8k score improves to 60.27, which is\\nlower than the SFT base model but still higher\\nthan ‘DPO v1’. Other task scores are also not nega-\\ntively impacted by adding Synth. Math-Alignment.\\nThus, we can conclude that adding Synth. Math-\\nAlignment is beneficial for H6.Then, we experiment whether merging ‘DPO\\nv1’ and ‘DPO v2’ is beneficial. Unfortunately,\\n‘DPO v1+v2’ scores 73.21 in H6, which is worse\\nthan ‘DPO v2’. More importantly, the gain in\\nthe GSM8K score from adding Synth. Math-\\nAlignment is gone, which is undesirable. One\\nreason for this could be that ‘DPO v2’ is a strict\\nimprovement over ‘DPO v1’, unlike the case for\\nmerging ‘SFT v3’ and ‘SFT v4’ where the models\\nhad different strengths and weaknesses.Ablation on the SFT base models. When ap-\\nplying DPO, we start from a model that is already\\ninstruction tuned ,i.e., the SFT base model and ab-\\nlate on using different SFT base models. We use\\nUltrafeedback Clean and Synth. Math-Alignment\\ndatasets for this ablation. Each of the ablated mod-\\nels is trained as follows. 
‘DPO v2’ uses ‘SFT v3’\\nas the base SFT model, while ‘DPO v3’ uses ‘SFT\\nv3+v4’ as the SFT base model instead.Note that ‘SFT v3+v4’ has higher scores on all\\ntasks compared to ‘SFT v3’, and the gap is espe-\\ncially large for ARC (+1.45) and GSM8K (+2.43).\\nSurprisingly, the two models perform similarly in\\nterms of H6. A closer look at the scores for theindividual tasks shows only a small margin in the\\nGSM8K scores, and other task scores show little\\ndifference. Thus, the performance gaps in certain\\ntasks in the SFT base models do not always carry\\nover to the alignment-tuned models.Ablation on different merge methods. From\\nTab. 3, we saw that merging two models that have\\ndifferent strengths can be beneficial to performance.\\nTo utilize this for the alignment-tuned model as\\nwell, we train two models named ‘Cand. 1’ and\\n‘Cand. 2’ using the same training dataset and SFT\\nbase model as ‘DPO v2’ and ‘DPO v3’ but with dif-\\nferent hyper-parameters to maximize each model’s\\nrespective strengths. We compare ‘Cand. 1’ and\\n‘Cand. 2’ in Tab. 6 where we can see that ‘Cand. 1’\\nhas high GSM8K scores but relatively low scores\\nfor the other tasks, whereas ‘Cand. 2’ has low\\nscores for GSM8K but high scores for the other\\ntasks. We merge these two models using various\\nmethods and ablate the results in Tab.. 7.We use two merge methods: 1) Average (a, b),\\nwhere a and b denote the weighting for ‘Cand.\\n1’ and ‘Cand. 2’ when averaging weights and 2)\\nSLERP (Shoemake, 1985). We use (0.5, 0.5), (0.4,\\n0.6), and (0.6, 0.4) for Average (a, b). From Tab. 7,\\nwe can see that the different merge methods have\\nlittle effect on the H6 scores. The scores for the\\nindividual tasks also do not differ by much, suggest-\\ning that as long as the merge candidates have suffi-\\nciently different strengths, the exact merge method\\nmay not be as crucial. Thus, we chose ‘Merge v1’\\nas our SOLAR 10.7B-Instruct model.5 ConclusionWe introduce SOLAR 10.7B and its fine-tuned vari-\\nant SOLAR 10.7B-Instruct, which are depth up-\\nscaled (DUS) models with 10.7 billion parameters4.\\nThey show superior performance over models like\\nLlama 2, Mistral 7B, and Mixtral-7B-Instruct in es-\\nsential NLP tasks while maintaining computational\\nefficiency. Thus, DUS is effective in scaling-up\\nhighly performant LLMs from smaller ones. With\\nmore exploration, DUS could be further improved,\\npaving a new path to efficiently scaling LLMs.AcknowledgementsWe would like to extend our gratitude to the teams\\nat Hugging Face, particularly Clémentine Four-\\nrier, Lewis Tunstall, Omar Sanseviero, and Philipp\\nSchmid. Our appreciation also extends to the\\nteams at AWS, notably Rahul Sharma, Jeongwon\\nYoon, Nieves Garcia, Ritesh Vajaria, Gal Oshri, Jay\\nKwon, Brandon Lee and Effie Bae. We are grateful\\nto the teams at Korea Telecom (KT), especially Jin\\nHyoung Lee, Jungsuk Park, Sungjoon Park, Hong-\\nrae Wang, Kyeongsoo Jung, and Sunyoong Yoon,\\nwhose significant support has been instrumental in\\nensuring the broad compatibility of our model. Ad-\\nditionally, we would like to extend our thanks to the\\nopen community for their invaluable contributions\\nand feedback.LimitationsOur study on the Depth Up-Scaling (DUS) has im-\\nportant limitations and considerations. One key\\nlimitation is the need for more thorough explo-\\nrations of hyperparameters used in the DUS ap-\\nproach. Namely, we removed m = 8 layers from\\nboth ends of our base model, primarily due to hard-\\nware limitations. 
However, we have not yet deter-\\nmined if this value is optimal for enhancing perfor-\\nmance. The extended time and cost of continued\\npretraining made it challenging to conduct more\\ncomprehensive experiments, which we aim to ad-\\ndress in future work through various comparative\\nanalyses.In terms of the model’s broader implications,\\nthere are several points to note. The model’s sig-\\nnificant computational demands for training and\\ninference might limit its use, especially for those\\nwith restricted computational resources. Addition-4Preprint version is available on https://arxiv.\\norg/abs/2312.15166.ally, like all machine learning models, it is vulnera-\\nble to biases in its training data, which could lead\\nto skewed outcomes in certain situations. Further-\\nmore, the substantial energy consumption required\\nfor training and operating the model raises environ-\\nmental concerns, which are critical in the pursuit\\nof sustainable AI development.Lastly, while the fine-tuned variant of the model\\nshows improved performance in following instruc-\\ntions, it still requires task-specific fine-tuning for\\noptimal performance in specialized applications.\\nThis fine-tuning process can be resource-intensive\\nand not always effective. Recognizing and address-\\ning these limitations is essential for a comprehen-\\nsive understanding of the proposed Large Language\\nModel’s capabilities and for guiding future research\\nand development in the field of LLMs.Ethics StatementWe conscientiously address and emphasize the\\ncommitment of SOLAR 10.7B in maintaining the\\nhighest ethical standards. First, we highlight that\\nSOLAR 10.7B-Instruct has shown low levels of\\ndata contamination in our evaluations, a testament\\nto our rigorous data handling and processing pro-\\ntocols. This aspect is crucial, as it underpins the\\nreliability and integrity of the results obtained from\\nSOLAR.Furthermore, during the course of our experi-\\nments, we ensured that all setups and methodolo-\\ngies employed steer clear of any potential ethical\\npitfalls. This preemptive consideration and avoid-\\nance of ethically questionable practices underscore\\nour dedication to conducting research that is not\\nonly innovative but also responsible.Additionally, we ensure that SOLAR complies\\nwith general ethical considerations in all aspects\\nof its operation. This includes adherence to pri-\\nvacy norms, respect for intellectual property, and\\nensuring the absence of bias in our algorithms. Our\\ncommitment to these ethical principles is unwaver-\\ning, and we believe it significantly contributes to\\nthe credibility and societal acceptance of SOLAR.In conclusion, the ethical framework within\\nwhich SOLAR operates is robust and comprehen-\\nsive, ensuring that our advancements in this field\\nare not only scientifically sound but also ethically\\nresponsible.ReferencesIan L Alberts, Lorenzo Mercolli, Thomas Pyka, George\\nPrenosil, Kuangyu Shi, Axel Rominger, and Ali\\nAfshar-Oromieh. 2023. Large language models\\n(llm) and chatgpt: what will the impact on nuclear\\nmedicine be? European journal of nuclear medicine\\nand molecular imaging, 50(6):1549–1552.Rohan Anil, Andrew M Dai, Orhan Firat, Melvin John-\\nson, Dmitry Lepikhin, Alexandre Passos, Siamak\\nShakeri, Emanuel Taropa, Paige Bailey, Zhifeng\\nChen, et al. 2023. Palm 2 technical report. 
arXiv\\npreprint arXiv:2305.10403.Aram Bahrini, Mohammadsadra Khamoshifar, Hos-\\nsein Abbasimehr, Robert J Riggs, Maryam Esmaeili,\\nRastin Mastali Majdabadkohne, and Morteza Pase-\\nhvar. 2023. Chatgpt: Applications, opportunities,\\nand threats. In 2023 Systems and Information Engi-\\nneering Design Symposium (SIEDS), pages 274–279.\\nIEEE.Edward Beeching, Clémentine Fourrier, Nathan\\nHabib, Sheon Han, Nathan Lambert, Nazneen\\nRajani, Omar Sanseviero, Lewis Tunstall, and\\nThomas Wolf. 2023. Open llm leaderboard.\\nhttps://huggingface.co/spaces/\\nHuggingFaceH4/open_llm_leaderboard.Tom Brown, Benjamin Mann, Nick Ryder, Melanie\\nSubbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind\\nNeelakantan, Pranav Shyam, Girish Sastry, Amanda\\nAskell, et al. 2020. Language models are few-shot\\nlearners. Advances in neural information processing\\nsystems, 33:1877–1901.Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,\\nAshish Sabharwal, Carissa Schoenick, and Oyvind\\nTafjord. 2018. Think you have solved question an-\\nswering? try arc, the ai2 reasoning challenge. arXiv\\npreprint arXiv:1803.05457.Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,\\nMark Chen, Heewoo Jun, Lukasz Kaiser, Matthias\\nPlappert, Jerry Tworek, Jacob Hilton, Reiichiro\\nNakano, et al. 2021. Training verifiers to solve math\\nword problems. arXiv preprint arXiv:2110.14168.Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao,\\nWei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and\\nMaosong Sun. 2023. Ultrafeedback: Boosting lan-\\nguage models with high-quality feedback. arXiv\\npreprint arXiv:2310.01377.Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Ger-\\nstein, and Arman Cohan. 2023. Investigating data\\ncontamination in modern benchmarks for large lan-\\nguage models. arXiv preprint arXiv:2311.09783.Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan,\\nShizhe Diao, Jipeng Zhang, Kashun Shum, and\\nTong Zhang. 2023. Raft: Reward ranked finetuning\\nfor generative foundation model alignment. arXiv\\npreprint arXiv:2304.06767.Mohammad Fraiwan and Natheer Khasawneh. 2023. A\\nreview of chatgpt applications in education, market-\\ning, software engineering, and healthcare: Benefits,\\ndrawbacks, and research directions. arXiv preprint\\narXiv:2305.00237.Trevor Gale, Deepak Narayanan, Cliff Young, and Matei\\nZaharia. 2023. Megablocks: Efficient sparse training\\nwith mixture-of-experts. Proceedings of Machine\\nLearning and Systems, 5.Andrea Gesmundo and Kaitlin Maile. 2023. Compos-\\nable function-preserving expansions for transformer\\narchitectures. arXiv preprint arXiv:2308.06103.Shahriar Golchin and Mihai Surdeanu. 2023. Time\\ntravel in llms: Tracing data contamination in large\\nlanguage models. arXiv preprint arXiv:2308.08493.Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou,\\nMantas Mazeika, Dawn Song, and Jacob Steinhardt.\\n2020. Measuring massive multitask language under-\\nstanding. In International Conference on Learning\\nRepresentations.Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul\\nArora, Steven Basart, Eric Tang, Dawn Song, and Ja-\\ncob Steinhardt. 2021. Measuring mathematical prob-\\nlem solving with the math dataset. arXiv preprint\\narXiv:2103.03874.Danny Hernandez, Jared Kaplan, Tom Henighan, and\\nSam McCandlish. 2021. Scaling laws for transfer.\\narXiv preprint arXiv:2102.01293.Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang,\\nZe Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin\\nJose, Prabhat Ram, et al. 2023. Tutel: Adaptive\\nmixture-of-experts at scale. 
Proceedings of Machine\\nLearning and Systems, 5.Intel. 2023. Supervised fine-tuning and direct prefer-\\nence optimization on intel gaudi2.Hamish Ivison, Yizhong Wang, Valentina Pyatkin,\\nNathan Lambert, Matthew Peters, Pradeep Dasigi,\\nJoel Jang, David Wadden, Noah A. Smith, Iz Belt-\\nagy, and Hannaneh Hajishirzi. 2023. Camels in a\\nchanging climate: Enhancing lm adaptation with tulu\\n2.Albert Q Jiang, Alexandre Sablayrolles, Arthur Men-\\nsch, Chris Bamford, Devendra Singh Chaplot, Diego\\nde las Casas, Florian Bressand, Gianna Lengyel, Guil-\\nlaume Lample, Lucile Saulnier, et al. 2023. Mistral\\n7b. arXiv preprint arXiv:2310.06825.Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale\\nMinervini, and Matt J Kusner. 2023. No train no\\ngain: Revisiting efficient training algorithms for\\ntransformer-based language models. arXiv preprint\\narXiv:2307.06440.Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\\nBrown, Benjamin Chess, Rewon Child, Scott Gray,\\nAlec Radford, Jeffrey Wu, and Dario Amodei. 2020.Scaling laws for neural language models. arXiv\\npreprint arXiv:2001.08361.Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo\\nKim, Yunsu Kim, Sanghoon Kim, and Chanjun Park.\\n2024a. sdpo: Don’t use your data all at once.Jihoo Kim, Wonho Song, Dahyun Kim, Yunsu Kim,\\nYungi Kim, and Chanjun Park. 2024b. Evalverse:\\nUnified and accessible library for large language\\nmodel evaluation.Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp,\\nCarlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie,\\nYi Tay, Mostafa Dehghani, and Neil Houlsby.\\n2022. Sparse upcycling: Training mixture-of-\\nexperts from dense checkpoints. arXiv preprint\\narXiv:2212.05055.Wing Lian. 2023. https://huggingface.co/\\nwinglian/omega-3b.Stephanie Lin, Jacob Hilton, and Owain Evans. 2022.\\nTruthfulqa: Measuring how models mimic human\\nfalsehoods. In Proceedings of the 60th Annual Meet-\\ning of the Association for Computational Linguistics\\n(Volume 1: Long Papers), pages 3214–3252.Shayne Longpre, Le Hou, Tu Vu, Albert Webson,\\nHyung Won Chung, Yi Tay, Denny Zhou, Quoc V\\nLe, Barret Zoph, Jason Wei, et al. 2023. The flan\\ncollection: Designing data and methods for effective\\ninstruction tuning. arXiv preprint arXiv:2301.13688.Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawa-\\nhar, Sahaj Agarwal, Hamid Palangi, and Ahmed\\nAwadallah. 2023. Orca: Progressive learning from\\ncomplex explanation traces of gpt-4. arXiv preprint\\narXiv:2306.02707.OpenAI. 2023. Gpt-4 technical report.Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng\\nShang, Xin Jiang, and Qun Liu. 2023. Reusing pre-\\ntrained models by multi-linear operators for efficient\\ntraining. arXiv preprint arXiv:2310.10699.Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, Yungi\\nKim, Dahyun Kim, and Chanjun Park. 2024. Data-\\nverse: Open-source etl (extract, transform, load)\\npipeline for large language models.Baolin Peng, Chunyuan Li, Pengcheng He, Michel Gal-\\nley, and Jianfeng Gao. 2023. Instruction tuning with\\ngpt-4. arXiv preprint arXiv:2304.03277.Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\nDario Amodei, Ilya Sutskever, et al. 2019. Language\\nmodels are unsupervised multitask learners. OpenAI\\nblog, 1(8):9.Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie\\nMillican, Jordan Hoffmann, Francis Song, John\\nAslanides, Sarah Henderson, Roman Ring, Susan-\\nnah Young, et al. 2021. 
Scaling language models:\\nMethods, analysis & insights from training gopher.\\narXiv preprint arXiv:2112.11446.Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano\\nErmon, Christopher D Manning, and Chelsea Finn.\\n2023. Direct preference optimization: Your language\\nmodel is secretly a reward model. arXiv preprint\\narXiv:2305.18290.Oscar Sainz, Jon Ander Campos, Iker García-Ferrero,\\nJulen Etxaniz, Oier Lopez de Lacalle, and Eneko\\nAgirre. 2023. Nlp evaluation in trouble: On the\\nneed to measure llm data contamination for each\\nbenchmark. arXiv preprint arXiv:2310.18018.Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavat-\\nula, and Yejin Choi. 2021. Winogrande: An adver-\\nsarial winograd schema challenge at scale. Commu-\\nnications of the ACM, 64(9):99–106.Malik Sallam, Nesreen Salim, Muna Barakat, and Alaa\\nAl-Tammemi. 2023. Chatgpt applications in medical,\\ndental, pharmacy, and public health education: A\\ndescriptive study highlighting the advantages and\\nlimitations. Narra J, 3(1):e103–e103.Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,\\nAndy Davis, Quoc Le, Geoffrey Hinton, and Jeff\\nDean. 2017. Outrageously large neural networks:\\nThe sparsely-gated mixture-of-experts layer. arXiv\\npreprint arXiv:1701.06538.Tianxiao Shen, Myle Ott, Michael Auli, and\\nMarc’Aurelio Ranzato. 2019. Mixture models for\\ndiverse machine translation: Tricks of the trade. In\\nInternational conference on machine learning, pages\\n5719–5728. PMLR.Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo\\nHuang, Daogao Liu, Terra Blevins, Danqi Chen,\\nand Luke Zettlemoyer. 2023. Detecting pretraining\\ndata from large language models. arXiv preprint\\narXiv:2310.16789.Ken Shoemake. 1985. Animating rotation with quater-\\nnion curves. In Proceedings of the 12th annual con-\\nference on Computer graphics and interactive tech-\\nniques, pages 245–254.Mingxing Tan and Quoc Le. 2019. Efficientnet: Re-\\nthinking model scaling for convolutional neural net-\\nworks. In International conference on machine learn-\\ning, pages 6105–6114. PMLR.Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-\\nbert, Amjad Almahairi, Yasmine Babaei, Nikolay\\nBashlykov, Soumya Batra, Prajjwal Bhargava, Shruti\\nBhosale, et al. 2023. Llama 2: Open founda-\\ntion and fine-tuned chat models. arXiv preprint\\narXiv:2307.09288.Lewis Tunstall, Edward Beeching, Nathan Lambert,\\nNazneen Rajani, Kashif Rasul, Younes Belkada,\\nShengyi Huang, Leandro von Werra, Clémentine\\nFourrier, Nathan Habib, et al. 2023. Zephyr: Di-\\nrect distillation of lm alignment. arXiv preprint\\narXiv:2310.16944.Peihao Wang, Rameswar Panda, Lucas Torroba Hen-\\nnigen, Philip Greengard, Leonid Karlinsky, Roge-\\nrio Feris, David Daniel Cox, Zhangyang Wang, and\\nYoon Kim. 2023. Learning to grow pretrained mod-\\nels for efficient transformer training. arXiv preprint\\narXiv:2303.00980.Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Al-\\nisa Liu, Noah A Smith, Daniel Khashabi, and Han-\\nnaneh Hajishirzi. 2022. Self-instruct: Aligning lan-\\nguage model with self generated instructions. arXiv\\npreprint arXiv:2212.10560.Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin\\nGuu, Adams Wei Yu, Brian Lester, Nan Du, An-\\ndrew M Dai, and Quoc V Le. 2021. Finetuned lan-\\nguage models are zero-shot learners. arXiv preprint\\narXiv:2109.01652.Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel,\\nBarret Zoph, Sebastian Borgeaud, Dani Yogatama,\\nMaarten Bosma, Denny Zhou, Donald Metzler, et al.\\n2022a. 
Emergent abilities of large language models.\\narXiv preprint arXiv:2206.07682.Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten\\nBosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,\\net al. 2022b. Chain-of-thought prompting elicits rea-\\nsoning in large language models. Advances in Neural\\nInformation Processing Systems, 35:24824–24837.Thomas Wolf, Lysandre Debut, Victor Sanh, Julien\\nChaumond, Clement Delangue, Anthony Moi, Pier-\\nric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,\\net al. 2019. Huggingface’s transformers: State-of-\\nthe-art natural language processing. arXiv preprint\\narXiv:1910.03771.Prateek Yadav, Derek Tam, Leshem Choshen, Colin\\nRaffel, and Mohit Bansal. 2023. Ties-merging: Re-\\nsolving interference when merging models. In Thirty-\\nseventh Conference on Neural Information Process-\\ning Systems.Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu,\\nQuoc V Le, Denny Zhou, and Xinyun Chen. 2023.\\nLarge language models as optimizers. arXiv preprint\\narXiv:2309.03409.Yiqun Yao, Zheng Zhang, Jing Li, and Yequan\\nWang. 2023. 2x faster language model pre-training\\nvia masked structural growth. arXiv preprint\\narXiv:2305.02869.Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu,\\nZhengying Liu, Yu Zhang, James T Kwok, Zhen-\\nguo Li, Adrian Weller, and Weiyang Liu. 2023.\\nMetamath: Bootstrap your own mathematical ques-\\ntions for large language models. arXiv preprint\\narXiv:2309.12284.Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang,\\nSongfang Huang, and Fei Huang. 2023. Rrhf:\\nRank responses to align language models with\\nhuman feedback without tears. arXiv preprint\\narXiv:2304.05302.Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali\\nFarhadi, and Yejin Choi. 2019. Hellaswag: Can a\\nmachine really finish your sentence? In Proceedings\\nof the 57th Annual Meeting of the Association for\\nComputational Linguistics, pages 4791–4800.Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang,\\nXiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tian-\\nwei Zhang, Fei Wu, et al. 2023. Instruction tuning\\nfor large language models: A survey. arXiv preprint\\narXiv:2308.10792.Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,\\nXiaolei Wang, Yupeng Hou, Yingqian Min, Beichen\\nZhang, Junjie Zhang, Zican Dong, et al. 2023. A\\nsurvey of large language models. arXiv preprint\\narXiv:2303.18223.Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen,\\nWayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong\\nWen, and Jiawei Han. 2023. Don’t make your llm\\nan evaluation benchmark cheater. arXiv preprint\\narXiv:2311.01964.Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B\\nBrown, Alec Radford, Dario Amodei, Paul Chris-\\ntiano, and Geoffrey Irving. 2019. Fine-tuning lan-\\nguage models from human preferences. arXiv\\npreprint arXiv:1909.08593.A ContributionsThe contributions of this study are as follows:• Introduction of the SOLAR 10.7 Billion-\\nParameter Model: We have released the SO-\\nLAR 10.7B model, which is not only depth-\\nwise scaled but also continually pretrained.\\nThe availability of SOLAR 10.7B under the\\nApache 2.0 license permits commercial us-\\nage, enabling the integration of this advanced\\nmodel into a diverse range of products and ser-\\nvices. 
This bridges the gap between academic\\nresearch and practical applications, fostering\\nwider accessibility and utility in various fields.• Superior Performance Across Diverse\\nBenchmarks: SOLAR 10.7B excels in var-\\nious benchmarks, outperforming established\\nmodels like Llama 2 and Mistral 7B in reason-\\ning, mathematics, and the MMLU framework.• Advancement in Instruction-Following Ca-\\npabilities: The introduction of SOLAR 10.7B-\\nInstruct, a variant fine-tuned for enhanced\\ninstruction-following abilities, marks a sig-\\nnificant improvement in the model’s ability to\\nunderstand and execute complex instructions.Sanghoon Kim, Dahyun Kim, Chanjun Park,\\nWonsung Lee, Wonho Song, Yunsu Kim and\\nHyeonwoo Kim contributed equally to this paper.\\nSanghoon Kim led the Foundation Model part,\\nwith Dahyun Kim, Wonho Song, Yunsu Kim, and\\nHyeonwoo Kim. Chanjun Park led the Data and\\nEvaluation (Data-Centric LLM) part, with Yungi\\nKim, Jihoo Kim, Changbae Ahn, Seonghoon Yang,\\nSukyung Lee, and Hyunbyung Park. Wonsung Lee\\nled the Adaptation Modeling part, with Gyoungjin\\nGim, Hyeonju Lee, and Mikyoung Cha. Hwalsuk\\nLee performed the role of the overall project opera-\\ntion. Dahyun Kim and Chanjun Park were the main\\ntechnical writers. All these individuals contributed\\nto the creation of SOLAR 10.7B.B Related Works and BackgroundB.1 Large Language ModelsFollowing the advent of context-based language\\nmodels, various studies have revealed a “scaling\\nlaw” (Kaplan et al., 2020; Hernandez et al., 2021;\\nAnil et al., 2023), demonstrating a positive corre-\\nlation between the size of model and training dataand model performance. This has led to the emer-\\ngence of Large Language Models (LLMs). Un-\\nlike previous language models, LLMs possess the\\nability for In-context learning, including Zero-shot\\nlearning (Radford et al., 2019) and Few-shot learn-\\ning (Brown et al., 2020), allowing them to perform\\nnew tasks without updating model weights. These\\ncapabilities of LLMs, not evident in smaller mod-\\nels, are referred to as Emergent abilities (Wei et al.,\\n2022a).B.2 Mixture of ExpertsIn the landscape of machine learning architectures,\\nthe Mixture of Experts (MoE) models like (Shazeer\\net al., 2017; Shen et al., 2019; Komatsuzaki et al.,\\n2022) has gained attention for its capability to ad-\\ndress the challenges posed by complex and hetero-\\ngeneous data. MoE models offer notable benefits,\\nincluding enhanced output diversity, allowing for\\nthe capture of intricate patterns within the input\\nspace. Moreover, their computational efficiency,\\nespecially when implemented in a sparse form, has\\nmade them valuable in scenarios where resource\\nconstraints are a consideration (Shazeer et al., 2017;\\nKomatsuzaki et al., 2022).However, efficient implementation of MoE mod-\\nels poses a considerable challenge, primarily due to\\nthe intricacies associated with dynamic routing and\\nload-imbalanced computation (Gale et al., 2023).\\nExisting hardware and software for deep learning,\\nsuch as TPUs and XLA compilers, often demand\\nstatic knowledge of tensor shapes, making MoE\\nimplementation on TPU challenging.While GPU implementation offers more flexi-\\nbility, sparse computation compatibility becomes\\na hurdle. Striking the right balance between fix-\\ning the size of each expert to facilitate efficient\\ncomputation and maintaining model quality creates\\na tradeoff between information preservation and\\nhardware efficiency. 
This tradeoff, in turn, necessi-\\ntates careful consideration during hyperparameter\\ntuning, adding a layer of complexity to the imple-\\nmentation of MoE models, potentially offsetting\\ntheir advantages. Given the formidable challenges\\nin MoE model implementation, it becomes almost\\ninevitable for researchers and practitioners to re-\\nsort to specialized tools and frameworks, such as\\nTutel (Hwang et al., 2023) or Megablocks (Gale\\net al., 2023).Departing from the horizontal expansion char-\\nacteristic of MoE models, the DUS method intro-duces model scaling in the vertical dimension. No-\\ntably, DUS does not introduce dynamism in the\\nscaled model, which significantly reduces the com-\\nplexity when compared to MoE. This shift in ap-\\nproach offers a unique and more straightforward\\nway of working, moving away from conventional\\nMoE challenges. Not only that, DUS also under-\\ngoes continued pretraining to quickly recover per-\\nformance of the scaled model.B.3 Prompt EngineeringA key research area to harness the emergent abil-\\nities of LLMs is prompt engineering. Prompt en-\\ngineering is the study of how to design inputs\\n(prompts) that enable LLMs to better perform spe-\\ncific tasks. A prime example of this research\\nis Chain-of-Thought (CoT) (Wei et al., 2022b),\\nwhich proposes CoT prompting that decomposes\\nmulti-step problems into a series of intermedi-\\nate reasoning steps. Moreover, efforts are under-\\nway to replace even such prompt engineering with\\nLLMs (Yang et al., 2023).B.4 Instruction TuningTo enhance the steerability of LLMs, instruction\\ntuning (Wei et al., 2021) has emerged as a learning\\ntechnique. This involves fine-tuning LLMs using\\ndata formatted as (instruction, input, output) for\\nvarious tasks (Wang et al., 2022). Instruction tuning\\nallows for targeted adjustments, providing a more\\ncontrolled and task-oriented improvement to the\\nmodel’s capabilities.Before instruction tuning, existing methods\\nfaced challenges in effectively guiding and control-\\nling the behavior of large language models (Zhang\\net al., 2023). The sheer complexity of these models\\nmade it difficult to ensure precise and task-oriented\\nresponses. The need for a more targeted approach\\narose from the limitations of existing methods, lead-\\ning to the development of instruction tuning. This\\ntargeted approach enables better control over the\\nmodel’s behavior, making it more suitable for spe-\\ncific tasks and improving its overall performance in\\nalignment with user-defined objectives. Therefore,\\ninstruction tuning is computationally efficient and\\nfacilitates the rapid adaptation of LLMs to a spe-\\ncific domain without requiring extensive retraining\\nor architectural changes.B.5 Alignment TuningLLM has been observed to generate sentences that\\nmay be perceived as linguistically incongruent byhuman readers since they learned not human inten-\\ntion, but only vast knowledge across various do-\\nmains in the pretraining step (Ziegler et al., 2019).\\nTo overcome this limitation and align with human\\nintentions, previous research (Ziegler et al., 2019)\\nhave proposed Reinforcement Learning with Hu-\\nman Feedback (RLHF). RLHF operates by learning\\na reward model based on human preferences, em-\\nploying reinforcement learning to guide the LLM\\ntowards prioritizing answers with the highest re-\\nward scores. This process enhances the safety,\\npropriety, and overall quality of the generated re-\\nsponses. 
Despite demonstrating satisfactory per-\\nformance, RLHF encounters challenges such as\\nmanaging numerous hyperparameters and necessi-\\ntating the incorporation of multiple models (policy,\\nvalue, reward, and reference models).In response to these challenges, the supervised\\nfine-tuning based approaches have proposed, such\\nas Rank Responses to align Human Feedback\\n(RRHF) (Yuan et al., 2023), Reward rAnked Fine-\\nTuning (RAFT) (Dong et al., 2023), and Direct\\nPolicy Optimization (DPO) (Intel, 2023). They\\navoid the complexities associated with reinforce-\\nment learning while achieving empirical perfor-\\nmance comparable to RLHF. Among them, DPO\\nthat we used directly guides the LLM to increase\\nthe probability of positive responses and decrease\\nthe probability of negative responses through a \"di-\\nrect\" approach. Interestingly, DPO demonstrates\\nmore stable learning results compared to RLHF,\\ndespite its simple training approach.B.6 Data ContaminationRecent researches (Zhou et al., 2023; Sainz et al.,\\n2023; Golchin and Surdeanu, 2023; Deng et al.,\\n2023) emphasize the need to measure whether a\\nspecific benchmark was used to train the large lan-\\nguage models. There are three types of the data\\ncontamination: guideline, raw text and annota-\\ntion (Sainz et al., 2023). Guideline contamination\\noccurs when a model accesses detailed annotation\\nguidelines for a dataset, providing advantages in\\nspecific tasks, and its impact should be considered,\\nespecially in zero and few-shot evaluations. Raw\\ntext contamination occurs when a model has ac-\\ncess to the original text. Wikipedia is widely used\\nas a pretraining data, but also as a source for cre-\\nating new datasets. The caution is advised in the\\ndevelopment of automatically annotated datasets\\nsourced from the web. Annotation contamina-tion occurs when the annotations of the specific\\nbenchmark are exposed during model training.C Additional InformationWe present additional information for the sake of\\nspace in the main paper.Filtered task names. We present task names\\nwe use to filter FLAN dervied datasets such as\\nOpenOrca in Table 8.Filtered Task Name\\n task228_arc_answer_generation_easy\\n ai2_arcARCChallenge:1.0.0\\n ai2_arcARCEasy:1.0.0\\n task229_arc_answer_generation_hard\\n hellaswag:1.1.0\\n task1389_hellaswag_completion\\n cot_gsm8k\\n cot_gsm8k_ii\\n drop:2.0.0\\n winogrande:1.1.0Table 8: Task names that we use to filter data for FLAN\\nderived datasets such as OpenOrca.ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n 0.06 N/A 0.15 0.28 N/A 0.70Table 9: Data contamination test results for SOLAR\\n10.7B-Instruct. We show ‘result < 0.1, %‘ values where\\na value higher than 0.9 indicates high probability of data\\ncontamination. HellaSwag and Winogrande datasets are\\nnot currently supported. We set SOLAR 10.7B as our\\nreference model when performing the data contamina-\\ntion tests.Results on data contamination. To show the in-\\ntegrity of SOLAR 10.7B-Instruct, we also report\\nthe data contamination test (Shi et al., 2023) results\\nin Table. 9. All four tested benchmark datasets\\nyield results well below the contamination thresh-\\nold, affirming the absence of data contamination\\nin our model. One interesting point is that the\\nvalue for GSM8K is noticeably higher than for\\nother datasets, even without contamination. One\\npotential reason for this is the stronger data similar-\\nity in math-related instruction datasets.'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 5
+ }
+ ],
+ "source": [
+ "text"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Generate Questions Using LLM\n",
+ "\n",
+ "QUESTION_PROMPT = \"\"\"\n",
+ "You are an AI assistant tasked with generating a list of engaging questions for a podcast interview.\n",
+ "Based on the given text, create 7 questions that would be relevant for a podcast discussion.\n",
+ "The questions should be thought-provoking, insightful, and aimed at extracting key information.\n",
+ "Ensure the questions are diverse and cover different aspects of the text content.\n",
+ "\n",
+ "Return the questions as a json array and have all the key as questions\n",
+ "\"\"\"\n",
+ "\n",
+ "def generate_questions(system_prompt: str, text: str):\n",
+ "\n",
+ " llm = ChatUpstage(extra_body={\"response_format\": {\"type\": \"json_object\"}})\n",
+ " chat_prompt = ChatPromptTemplate.from_messages([\n",
+ " (\"system\", system_prompt),\n",
+ " (\"human\", \"{text}\")\n",
+ " ])\n",
+ "\n",
+ " chain = chat_prompt | llm\n",
+ "\n",
+ " response = chain.invoke({\"text\": text})\n",
+ " print(response.content)\n",
+ "\n",
+ " try:\n",
+ " response_dict = json.loads(response.content)\n",
+ " questions = response_dict.get(\"questions\", [])\n",
+ " if not isinstance(questions, list) or len(questions) == 0:\n",
+ " raise ValueError(\"Invalid response format or no questions generated\")\n",
+ " except (json.JSONDecodeError, ValueError) as e:\n",
+ " print(f\"Error generating questions: {e}\")\n",
+ " return []\n",
+ " return questions\n",
+ "\n",
+ "questions=generate_questions(QUESTION_PROMPT, text)"
+ ],
+ "metadata": {
+ "id": "XfEsTp2yTyPi"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "questions"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "-roMghdNqjq6",
+ "outputId": "7e98952c-7667-4983-c3a6-5b74cf1d1087"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "['What is the main contribution of the SOLAR 10.7B model?',\n",
+ " 'How does the depth up-scaling (DUS) method differ from other up-scaling methods like mixture-of-experts (MoE)?',\n",
+ " 'What are the advantages of using DUS over other up-scaling methods?',\n",
+ " 'What are the key components of the SOLAR 10.7B model?',\n",
+ " 'How does the SOLAR 10.7B model outperform existing models in various NLP tasks?',\n",
+ " 'What are the different stages of fine-tuning for the SOLAR 10.7B-Instruct model?',\n",
+ " \"What is the role of the alignment tuning stage in enhancing the SOLAR 10.7B-Instruct model's performance?\",\n",
+ " 'How does the SOLAR 10.7B-Instruct model compare to other top-performing models in terms of performance metrics?',\n",
+ " 'What are the limitations and considerations of the depth up-scaling (DUS) method?',\n",
+ " 'How does the SOLAR 10.7B-Instruct model address ethical concerns in its operation?']"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 20
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### [2-2] Retrieve and Generate Answers for Each Question Using RAG\n",
+ "\n",
+ "Once the questions are generated, we use a Retrieval-Augmented Generation (RAG) approach to obtain contextually relevant answers. This involves retrieving the most relevant sections from the PDF content, which has been embedded into a vector store, and then using the language model to generate detailed and informative answers. This ensures that the responses are backed by the original document, making them accurate and well-supported for podcast narration.\n",
+ "\n",
+ "\n",
+ "* Upstage Embedding Model\n",
+ "* Faiss"
+ ],
+ "metadata": {
+ "id": "ICK762aO44k6"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Embed PDF Content and Create Vector Store\n",
+ "def vectorstore_embed(file_path: str) -> List[float]:\n",
+ " \"\"\"Embed the given text using the LLM.\"\"\"\n",
+ " loader = UpstageDocumentParseLoader('solar.pdf', output_format='text')\n",
+ " documents = loader.load()\n",
+ "\n",
+ "\n",
+ " text_splitter = RecursiveCharacterTextSplitter(\n",
+ " chunk_size=1000, chunk_overlap=200, length_function=len\n",
+ " )\n",
+ "\n",
+ " texts = text_splitter.split_documents(documents)\n",
+ "\n",
+ " for doc in texts:\n",
+ " doc.page_content = doc.page_content.replace('\\t', ' ')\n",
+ "\n",
+ " embeddings = UpstageEmbeddings(model=\"solar-embedding-1-large\")\n",
+ " vectorstore = FAISS.from_documents(texts, embeddings)\n",
+ "\n",
+ " return vectorstore\n",
+ "\n",
+ "vectorstore=vectorstore_embed('solar.pdf')"
+ ],
+ "metadata": {
+ "id": "Yg-yS0_tt-Wz"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Retrieve Contexts for Questions\n",
+ "def retrieve_contents(vectorstore: str, question: str):\n",
+ "\n",
+ " retriever_store = vectorstore.as_retriever(search_kwargs={\"k\": 1})\n",
+ "\n",
+ " docs = retriever_store.get_relevant_documents(question)\n",
+ "\n",
+ " return docs"
+ ],
+ "metadata": {
+ "id": "tBsRBVIctx8B"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
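+ {
+ "cell_type": "markdown",
+ "source": [
+ "As a quick sanity check (an optional addition, assuming `vectorstore` and a non-empty `questions` list were created above), the cell below retrieves the top-matching chunk for the first generated question and prints a short preview."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Optional sanity check: inspect the chunk retrieved for the first question\n",
+ "sample_docs = retrieve_contents(vectorstore, questions[0])\n",
+ "print(questions[0])\n",
+ "print(sample_docs[0].page_content[:300])"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },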
+ {
+ "cell_type": "code",
+ "source": [
+ "# Generate Answers Using LLM\n",
+ "def generate_answer(question: str) -> str:\n",
+ " \"\"\"Generate an answer to a given question using the provided context.\"\"\"\n",
+ "\n",
+ " context=retrieve_contents(vectorstore,question)\n",
+ "\n",
+ " prompt = f\"You are a Guest of the podcast interview and you will be answering as a professional. You just have to answer the following question based on the provided document: {question}. I want you to answer as if you are podcast interview\"\n",
+ "\n",
+ " llm = ChatUpstage()\n",
+ " chat_prompt = ChatPromptTemplate.from_messages([\n",
+ " (\"system\", prompt),\n",
+ " (\"human\",\"{context}\")\n",
+ " ])\n",
+ "\n",
+ " chain = chat_prompt | llm\n",
+ "\n",
+ " response = chain.invoke({\"context\": context})\n",
+ " print(response.content)\n",
+ "\n",
+ " return response.content\n"
+ ],
+ "metadata": {
+ "id": "68hhIlk6tbJP"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
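+ {
+ "cell_type": "markdown",
+ "source": [
+ "Before answering every question, it can help to smoke-test a single one. The cell below is a minimal optional check, assuming `questions` is non-empty; the full Q&A loop follows."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Optional smoke test: generate an answer for the first question only\n",
+ "sample_answer = generate_answer(questions[0])"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },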
+ {
+ "cell_type": "code",
+ "source": [
+ "#Create QA Script\n",
+ "def create_qa_script(questions, pdf_text):\n",
+ " qa_script = []\n",
+ " for question in questions:\n",
+ " answer = generate_answer(question)\n",
+ " qa_script.append({\"speaker\": \"Host (Jane)\", \"text\": question})\n",
+ " qa_script.append({\"speaker\": \"Guest\", \"text\": answer})\n",
+ " return qa_script\n",
+ "\n",
+ "qa_script = create_qa_script(questions, text)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "ExwVH6nBfaAP",
+ "outputId": "6f8b609b-8357-4d59-aa9d-9974208975cf"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "metadata": {
+ "tags": null
+ },
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The main goal of the study presented in this paper is to investigate the advantages and limitations of dental, pharmacy, and public health education.\n",
+ "The proposed DUS method differs from other LLM up-scaling methods by focusing on depth up-scaling, which involves scaling the number of layers in the base model and continually pretraining the scaled model. Unlike some other methods that use Mixture of Experts (MoE) to scale the model, DUS uses a depthwise scaling method similar to Tan and Le (2019) adapted for the LLM architecture. This approach makes DUS more straightforward to use and immediately compatible with easy-to-use LLM frameworks like Hugging Face (Wolf et al., 2019) without requiring any changes to the existing framework.\n",
+ "The key components of the DUS method are not explicitly mentioned in the provided document. However, based on the context, it seems that the DUS method refers to the \"Depth Up-Scaling\" approach mentioned in the text. The limitations and considerations discussed in the document suggest that the DUS approach may involve the removal of layers from a base model and the need for more thorough explorations of hyperparameters. The document also mentions the model's computational demands and potential limitations for those with restricted computational resources. To answer the question about the key components of the DUS method, we would need more information or context about the specific method being referred to.\n",
+ "The DUS method addresses the limitations of existing up-scaling methods by introducing a new approach called Depth Up-Scaling (DUS). DUS is a method that focuses on increasing the depth of pre-trained models while maintaining their original width. This is achieved by removing a portion of the layers from the model and then pre-training the remaining layers with a larger batch size and for a longer period.\n",
+ "\n",
+ "One key limitation of the DUS approach is the need for more thorough exploration of hyperparameters, such as the number of layers removed from the model. The authors removed 8 layers from both ends of their base model due to hardware limitations, but they acknowledge that this value may not be optimal for enhancing performance. They plan to address this in future work through various comparative analyses.\n",
+ "\n",
+ "Another limitation is the extended time and cost of continued pre-training, which made it challenging to conduct more comprehensive experiments. This could limit the use of the model, especially for those with restricted computational resources.\n",
+ "\n",
+ "In terms of the model's broader implications, it's important to note that it has significant computational demands for both training and inference. Like all machine learning models, it is also vulnerable to various attacks and biases, which should be carefully considered when deploying the model in real-world applications.\n",
+ "The benefits of using DUS in scaling up LLMs include effective and efficient scaling, retaining simplicity for ease of use, compatibility with existing LLM frameworks, and no additional modules or dynamism as with MoE.\n",
+ "The DUS method compares to Mixture of Experts (MoE) in terms of complexity and efficiency in a way that it reduces the complexity while maintaining or even improving the model's performance. The vertical scaling approach of DUS eliminates the need for dynamism in the scaled model, which simplifies the implementation process compared to MoE models. This shift in approach offers a more straightforward way of working, moving away from the conventional challenges associated with MoE models, such as hyperparameter tuning and hardware efficiency tradeoffs.\n",
+ "The main contributions of the study are not explicitly mentioned in the provided document. However, based on the keywords and topics mentioned, the study likely discusses the advantages and limitations of dental, pharmacy, and public health education. It also mentions various research papers and authors in the field of artificial intelligence and machine learning, such as Noam Shazeer, Azalia Mirhoseini, and Tianxiao Shen. The document also mentions a specific paper titled \"Mixture models for diverse machine translation: Tricks of the trade\" by Tianxiao Shen, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato.\n",
+ "SOLAR 10.7B is a deep learning model that has been introduced as part of this study. It is a scaled and continually pretrained model, and it is available under the Apache 2.0 license, which allows for commercial use. The model has been designed to bridge the gap between academic research and practical applications, making it accessible and useful in various fields.\n",
+ "\n",
+ "In terms of performance, SOLAR 10.7B excels across diverse benchmarks, indicating that it performs well in a variety of tasks and applications. However, the document does not provide direct comparisons with other models in terms of performance. Therefore, it is not possible to determine how SOLAR 10.7B compares to other models based on the information provided.\n",
+ "The significance of SOLAR 10.7B being available under the Apache 2.0 license is that it allows for commercial use and integration into a wide range of products and services. This bridges the gap between academic research and practical applications, making the advanced model more accessible and useful across various fields.\n",
+ "The fine-tuning of SOLAR 10.7B-Instruct enhances its capabilities by improving its performance on various tasks, such as language modeling, question answering, and code generation. This is achieved through a process called \"instruction tuning,\" which involves training the model on a large dataset of human instructions and their corresponding outputs.\n",
+ "\n",
+ "The paper reports that SOLAR 10.7B-Instruct outperforms other models, even larger ones, in some tasks. For example, it scores higher than Mixtral 8x7B-Instruct-v0.1 and Qwen 72B in terms of H6, a metric used to evaluate the model's performance on a variety of tasks.\n",
+ "\n",
+ "The authors also present ablation studies to analyze the effectiveness of different training datasets and stages in the fine-tuning process. These studies help to understand which components contribute most to the model's performance.\n",
+ "\n",
+ "Overall, the fine-tuning of SOLAR 10.7B-Instruct improves its ability to understand and follow human instructions, making it more useful for a wide range of applications.\n",
+ "The main challenges in implementing Mixture of Experts (MoE) models include dynamic routing and load-imbalanced computation. The intricacies associated with these aspects pose a considerable challenge in efficient implementation of MoE models. Additionally, existing hardware and software for deep learning, such as TPUs and XLA compilers, often require static knowledge of tensor shapes, making MoE implementation difficult on TPU. While GPU implementation offers more flexibility, sparse computation compatibility remains a hurdle. Striking the right balance between fixing the size of each expert to facilitate efficient computation and maintaining model quality creates a tradeoff between information preservation and hardware efficiency.\n",
+ "The DUS method addresses the need for specialized tools and frameworks for Mixture of Experts (MoE) models by introducing a different approach to model scaling. Unlike MoE models, which scale horizontally and introduce complexities such as dynamic routing and hyperparameter tuning, DUS scales models vertically and does not introduce dynamism in the scaled model. This shift in approach reduces the complexity associated with MoE models, making it easier to implement and maintain. As a result, the DUS method offers a more straightforward and less complex way of working compared to specialized tools and frameworks like Tutel and Megablocks, which are specifically designed for MoE models.\n",
+ "The purpose of instruction tuning is to enhance the steerability of large language models (LLMs) by fine-tuning them using data formatted as (instruction, input, output) for various tasks. This allows for targeted adjustments, providing a more controlled and task-oriented improvement to the model's capabilities. Instruction tuning differs from previous methods, which faced challenges in effectively guiding and controlling the behavior of large language models. The need for a more targeted approach arose from the limitations of existing methods, leading to the development of instruction tuning. This targeted approach enables better control over the model's behavior, making it more suitable for specific tasks and improving its overall performance in those tasks.\n",
+ "Alignment tuning helps LLMs generate more human-like responses by aligning their output with human intentions and preferences. This is achieved through techniques such as Reinforcement Learning with Human Feedback (RLHF), which involves learning a reward model based on human preferences and using reinforcement learning to guide the LLM towards prioritizing answers with the highest reward. This approach enables better control over the model's behavior, making it more suitable for specific tasks and improving its overall performance in alignment with user-defined objectives.\n",
+ "The three types of data contamination are guideline, raw text, and annotation. Guideline contamination occurs when a model accesses detailed annotation guidelines for a dataset, providing advantages in specific tasks. Raw text contamination occurs when a model is trained on the raw text of a dataset, giving it an advantage in tasks related to that dataset. Annotation contamination occurs when a model is trained on the annotations of a dataset, giving it an advantage in tasks related to that dataset. These types of contamination can impact the performance of a model, especially in zero and few-shot evaluations, and should be considered when evaluating the performance of a model.\n",
+ "The results of the data contamination test for SOLAR 10.7B-Instruct are as follows:\n",
+ "\n",
+ "- HellaSwag: N/A\n",
+ "- Winogrande: N/A\n",
+ "- MMLU: 0.06\n",
+ "- TruthfulQA: 0.15\n",
+ "- GSM8K: 0.28\n",
+ "- OpenOrca.ARC: 0.70\n",
+ "\n",
+ "These results indicate that there is no significant data contamination in the SOLAR 10.7B-Instruct model, as all values are well below the contamination threshold of 0.9.\n",
+ "The limitations and considerations of the DUS approach include the need for more thorough exploration of hyperparameters used in the DUS approach, such as the number of layers removed from the base model, and the significant computational demands for training and inference, which might limit its use, especially for those with restricted computational resources.\n",
+ "The broader implications of SOLAR 10.7B are significant, as it represents a major advancement in the field of large language models. This model has the potential to revolutionize various industries and applications, such as natural language processing, artificial intelligence, and machine learning. However, it is essential to understand the limitations of SOLAR 10.7B to guide future research and development in the field.\n",
+ "\n",
+ "To ensure the ethical use of SOLAR 10.7B, several steps have been taken. First, the researchers have demonstrated low levels of data contamination in their evaluations, which highlights their rigorous data handling and processing protocols. This underscores the reliability and integrity of the results obtained from SOLAR.\n",
+ "\n",
+ "Second, the researchers have ensured that all setups and methodologies used in their experiments steer clear of any potential ethical pitfalls. This proactive consideration and avoidance of ethically questionable practices demonstrate their commitment to responsible research.\n",
+ "\n",
+ "Lastly, the researchers have ensured that SOLAR complies with ethical standards, which underpins the reliability and integrity of the model. By addressing these ethical considerations, the researchers aim to foster trust in the use of SOLAR 10.7B and promote responsible innovation in the field of large language models.\n",
+ "The main challenges faced by researchers and practitioners in implementing Mixture of Experts (MoE) models are the high computational cost, the need for careful hyperparameter tuning, and the tradeoff between information preservation and hardware efficiency. These challenges necessitate specialized tools and frameworks, such as Tutel and Megablocks, to manage and implement MoE models effectively.\n",
+ "\n",
+ "The DUS (Densely Connected Mixture of Experts) method addresses these challenges by introducing model scaling in the vertical dimension, which reduces complexity compared to MoE. Unlike MoE, DUS does not introduce dynamism in the scaled model, offering a more straightforward approach to working with MoE models. This shift in approach allows researchers and practitioners to work with MoE models more efficiently, potentially offsetting the advantages of traditional MoE models.\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "qa_script"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "eBwnfKjx3HoT",
+ "outputId": "91ff5ed4-42f5-44d8-bee7-83f165a90864"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "[{'speaker': 'Host (Jane)',\n",
+ " 'text': 'What is the main contribution of the SOLAR 10.7B model?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': 'The main contribution of the SOLAR 10.7B model is the introduction of a depth-wise scaled and continually pretrained model that is available under the Apache 2.0 license for commercial use. This model outperforms other benchmarks in various fields, bridging the gap between academic research and practical applications.'},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'How does the depth up-scaling (DUS) method differ from other up-scaling methods like mixture-of-experts (MoE)?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': 'Depth up-scaling (DUS) differs from other up-scaling methods like mixture-of-experts (MoE) in several ways. Firstly, DUS focuses on increasing the number of layers in the base model, while MoE introduces a Mixture-of-Experts architecture to scale the model. Secondly, DUS uses a depthwise scaling method similar to Tan and Le (2019), which is adapted for the LLM architecture, whereas MoE employs a different approach. Lastly, DUS does not introduce any additional modules or dynamism, making it compatible with existing LLM frameworks like Hugging Face (Wolf et al., 2019) without requiring any changes.'},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'What are the advantages of using DUS over other up-scaling methods?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': 'The advantages of using DUS over other up-scaling methods are that it does not require additional modules like gating networks or dynamic expert selection, making it seamless to integrate into existing training and inference frameworks while maintaining high efficiency. Additionally, DUS does not necessitate a distinct training framework for optimal training efficiency or specialized CUDA kernels for fast inference.'},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'What are the key components of the SOLAR 10.7B model?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': 'The key components of the SOLAR 10.7B model are:\\n\\n1. Introduction of the SOLAR 10.7 Billion Parameter Model: The study has released the SOLAR 10.7B model, which is depth-wise scaled and continually pre-trained. This model is available under the Apache 2.0 license, allowing commercial use and integration into various products and services, bridging the gap between academic research and practical applications.\\n\\n2. Superior Performance Across Diverse Benchmarks: The SOLAR 10.7B model demonstrates exceptional performance across a wide range of benchmarks.'},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'How does the SOLAR 10.7B model outperform existing models in various NLP tasks?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': \"The SOLAR 10.7B model outperforms existing models in various NLP tasks due to its depth-wise scaling and continuous pretraining. The model's availability under the Apache 2.0 license allows for commercial use, bridging the gap between academic research and practical applications. SOLAR 10.7B has been shown to perform better across diverse benchmarks, demonstrating its superiority in various natural language processing tasks.\"},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'What are the different stages of fine-tuning for the SOLAR 10.7B-Instruct model?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': \"The different stages of fine-tuning for the SOLAR 10.7B-Instruct model are:\\n\\n1. Instruction tuning\\n2. Alignment tuning\\n\\nIn the instruction tuning stage, the model is trained to follow instructions in a QA format. This involves using open-source datasets, as well as synthesizing a math QA dataset to enhance the model's mathematical capabilities. The math QA dataset, called 'Synth. Math-Instruct', is created by rephrasing questions and answers from the Math dataset to avoid contamination with commonly used benchmark datasets.\\n\\nIn the alignment tuning stage, the instruction-tuned model is further fine-tuned to be more aligned with human or strong AI.\"},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': \"What is the role of the alignment tuning stage in enhancing the SOLAR 10.7B-Instruct model's performance?\"},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': \"The role of the alignment tuning stage in enhancing the SOLAR 10.7B-Instruct model's performance is to further fine-tune the model to be more aligned with human or strong AI preferences. This stage follows the instruction tuning stage, where the model is trained to follow instructions in a QA format using open-source datasets and a synthesized math QA dataset called 'Synth. Math-Instruct'. The alignment tuning stage aims to improve the model's performance by ensuring that it generates responses that are more in line with human expectations and preferences.\"},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'How does the SOLAR 10.7B-Instruct model compare to other top-performing models in terms of performance metrics?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': 'Based on the provided document, it is not possible to directly compare the SOLAR 10.7B-Instruct model with other top-performing models in terms of performance metrics. The document only mentions the evaluation results for SOLAR 10.7B and SOLAR 10.7B-Instruct, along with other top-performing models, in the Open LLM Leaderboard for six tasks. However, the document does not provide specific performance metrics for the SOLAR 10.7B-Instruct model compared to other top-performing models.\\n\\nTo answer the question, we would need more information about the performance metrics of the SOLAR 10.7B-Instruct model and how it compares to other top-performing models.'},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'What are the limitations and considerations of the depth up-scaling (DUS) method?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': \"The limitations and considerations of the Depth Up-Scaling (DUS) method include the need for more thorough explorations of hyperparameters used in the DUS approach, such as the removal of m = 8 layers from both ends of the base model due to hardware limitations, which may not be optimal for enhancing performance. The extended time and cost of continued pretraining made it challenging to conduct more comprehensive experiments, which the authors aim to address in future work through various comparative analyses. The model's significant computational demands for training and inference might limit its use, especially for those with restricted computational resources. Like all machine learning models, it is vulnerable to potential biases and errors in the training data, which could affect the model's performance and accuracy.\"},\n",
+ " {'speaker': 'Host (Jane)',\n",
+ " 'text': 'How does the SOLAR 10.7B-Instruct model address ethical concerns in its operation?'},\n",
+ " {'speaker': 'Guest',\n",
+ " 'text': 'To address ethical concerns in its operation, the SOLAR 10.7B-Instruct model emphasizes maintaining high ethical standards. It demonstrates low levels of data contamination through rigorous data handling and processing protocols, which are crucial for the reliability and integrity of the results. The model also ensures that all setups and methodologies employed in experiments steer clear of potential ethical pitfalls, and it avoids ethically questionable practices. SOLAR 10.7B-Instruct is committed to conducting innovative and responsible research.'}]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 25
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 3. Generating the Complete Podcast Script with QnA script above\n",
+ "\n",
+ "This section involves generating an entire podcast script from the given Q&A content. The function should transform structured data into a conversational format suitable for a podcast setting, ensuring an engaging and natural dialogue flow between the host and the guest.\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "OeqNgEUb3Luu"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "class DialogueItem(BaseModel):\n",
+ " \"\"\"A single dialogue item.\"\"\"\n",
+ "\n",
+ " speaker: Literal[\"Host (Jane)\", \"Guest\"]\n",
+ " text: str\n",
+ "\n",
+ "\n",
+ "class Dialogue(BaseModel):\n",
+ " \"\"\"The dialogue between the host and guest.\"\"\"\n",
+ "\n",
+ " name_of_guest: str\n",
+ " dialogue: List[DialogueItem]"
+ ],
+ "metadata": {
+ "id": "M-q3v1LP91zb"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
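+ {
+ "cell_type": "markdown",
+ "source": [
+ "For reference, the cell below builds a tiny `Dialogue` instance with hypothetical values to show the structure that `generate_script` will validate the LLM output against."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Example (hypothetical values): the structure the LLM's JSON must follow\n",
+ "example_dialogue = Dialogue(\n",
+ " name_of_guest=\"Dr. Example\",\n",
+ " dialogue=[\n",
+ " DialogueItem(speaker=\"Host (Jane)\", text=\"Welcome to the show!\"),\n",
+ " DialogueItem(speaker=\"Guest\", text=\"Thanks for having me.\"),\n",
+ " ],\n",
+ ")\n",
+ "print(example_dialogue.model_dump_json(indent=2))"
+ ],
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },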
+ {
+ "cell_type": "code",
+ "source": [
+ "# Adapted and modified from https://github.com/gabrielchua/open-notebooklm\n",
+ "SYSTEM_PROMPT = \"\"\"\n",
+ "You are a world-class podcast producer tasked with transforming the provided input text {text} into an engaging and informative podcast script.\n",
+ "Ensure the response adheres to this format:\n",
+ "\n",
+ "{{\n",
+ "\"name_of_guest\": \"\",\n",
+ "\"dialogue\": [\n",
+ " {{\n",
+ " \"speaker\": \"Host (Jane)\",\n",
+ " \"text\": \"\"\n",
+ " }},\n",
+ " {{\n",
+ " \"speaker\": \"Guest\",\n",
+ " \"text\": \"\",\n",
+ " }},\n",
+ " ...\n",
+ " ]\n",
+ "}}\n",
+ "\n",
+ "# Steps to Follow:\n",
+ "\n",
+ "0. for \"name_of_guest\": \"\" should be a real person name\n",
+ "\n",
+ "1. **Craft the Dialogue:**\n",
+ " Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic).\n",
+ "\n",
+ " Dialogue content:\n",
+ " the {text} will be the main context for the podcast which is a QnA content.\n",
+ " Need all the questions and answers from the {text} in the podcast script.\n",
+ "\n",
+ " Incorporate:\n",
+ " - Clear explanations of complex topics\n",
+ " - An engaging and lively tone to captivate listeners\n",
+ " - A balance of information and entertainment\n",
+ "\n",
+ " Rules for the dialogue:\n",
+ " - The host (Jane) always initiates the conversation and interviews the guest\n",
+ " - Include thoughtful questions from the host to guide the discussion\n",
+ " - Incorporate natural speech patterns, including occasional verbal fillers (e.g., \"Uhh\", \"Hmmm\", \"um,\" \"well,\" \"you know\")\n",
+ " - Allow for natural interruptions and back-and-forth between host and guest - this is very important to make the conversation feel authentic\n",
+ " - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims\n",
+ " - Maintain a PG-rated conversation appropriate for all audiences\n",
+ " - Avoid any marketing or self-promotional content from the guest\n",
+ " - The host concludes the conversation\n",
+ "\n",
+ "\n",
+ "2. **Maintain Authenticity:**\n",
+ " Throughout the script, strive for authenticity in the conversation. Include:\n",
+ " - Moments of genuine curiosity or surprise from the host\n",
+ " - Instances where the guest might briefly struggle to articulate a complex idea\n",
+ " - Light-hearted moments or humor when appropriate\n",
+ " - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)\n",
+ "\n",
+ "3. **Consider Pacing and Structure:**\n",
+ " Ensure the dialogue has a natural ebb and flow:\n",
+ " - Start with a strong hook to grab the listener's attention\n",
+ " - Gradually build complexity as the conversation progresses\n",
+ " - Include brief \"breather\" moments for listeners to absorb complex information\n",
+ " - For complicated concepts, reasking similar questions framed from a different perspective is recommended\n",
+ " - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners\n",
+ "\n",
+ "IMPORTANT RULE: Each line of dialogue should be no more than 100 characters (e.g., can finish within 5-8 seconds)\n",
+ "\n",
+ "Remember: Always reply in valid JSON format, without code blocks. Begin directly with the JSON output.\n",
+ "\"\"\""
+ ],
+ "metadata": {
+ "id": "iWH3ByXc9Q2s"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def call_llm(system_prompt: str, text, dialogue_format):\n",
+ " \"\"\"Call the LLM with the given prompt and dialogue format.\"\"\"\n",
+ " llm = ChatUpstage(extra_body={\"response_format\": {\"type\": \"json_object\", \"schema\":dialogue_format.model_json_schema()}})\n",
+ "\n",
+ "\n",
+ " chat_prompt = ChatPromptTemplate.from_messages([\n",
+ " (\"system\", system_prompt),\n",
+ " (\"human\", \"{text}\")\n",
+ " ])\n",
+ "\n",
+ " # Create the chain\n",
+ " chain = chat_prompt | llm\n",
+ "\n",
+ " # Call the chain with the input text\n",
+ " response = chain.invoke({\"text\": text})\n",
+ " return response"
+ ],
+ "metadata": {
+ "id": "BB06DARM9jYc"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def generate_script(system_prompt: str, input_text, output_model):\n",
+ " \"\"\"Get the dialogue from the LLM.\"\"\"\n",
+ " # Load as python object\n",
+ " try:\n",
+ " response = call_llm(system_prompt, input_text, output_model)\n",
+ " dialogue = output_model.model_validate_json(response.content)\n",
+ " except ValidationError as e:\n",
+ " error_message = f\"Failed to parse dialogue JSON: {e}\"\n",
+ " system_prompt_with_error = f\"{system_prompt}\\n\\nPlease return a VALID JSON object. This was the earlier error: {error_message}\"\n",
+ " response = call_llm(system_prompt_with_error, input_text, output_model)\n",
+ " dialogue = output_model.model_validate_json(response.content)\n",
+ " return dialogue"
+ ],
+ "metadata": {
+ "id": "XnMDM-ko9lpj"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Generate script"
+ ],
+ "metadata": {
+ "id": "iZwHqfmV27cy"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "script = generate_script(SYSTEM_PROMPT, qa_script, Dialogue)"
+ ],
+ "metadata": {
+ "id": "8eVdjHpT9rg6"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "script"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "iBUKhlGtwyFF",
+ "outputId": "f1657cfd-00f9-4afb-84f3-1c3caf2a3cce"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Dialogue(name_of_guest='Dr. Alice Chan', dialogue=[DialogueItem(speaker='Host (Jane)', text=\"Welcome to our podcast, Dr. Alice Chan. Today, we'll be discussing the SOLAR 10.7B model and its contributions to the field of natural language processing. Let's start with the main contribution of this model. Can you explain what it is, Dr. Chan?\"), DialogueItem(speaker='Guest', text='Of course, Jane. The main contribution of the SOLAR 10.7B model is the introduction of a depth-wise scaled and continually pretrained model that is available under the Apache 2.0 license for commercial use. This model outperforms other benchmarks in various fields, bridging the gap between academic research and practical applications.'), DialogueItem(speaker='Host (Jane)', text=\"Interesting! Now, let's talk about the depth up-scaling (DUS) method. How does it differ from other up-scaling methods like mixture-of-experts (MoE)?\"), DialogueItem(speaker='Guest', text='Depth up-scaling (DUS) differs from other up-scaling methods like mixture-of-experts (MoE) in several ways. Firstly, DUS focuses on increasing the number of layers in the base model, while MoE introduces a Mixture-of-Experts architecture to scale the model. Secondly, DUS uses a depthwise scaling method similar to Tan and Le (2019), which is adapted for the LLM architecture, whereas MoE employs a different approach. Lastly, DUS does not introduce any additional modules or dynamism, making it compatible with existing LLM frameworks like Hugging Face (Wolf et al., 2019) without requiring any changes.'), DialogueItem(speaker='Host (Jane)', text='That sounds like a very efficient method. What are the advantages of using DUS over other up-scaling methods?'), DialogueItem(speaker='Guest', text='The advantages of using DUS over other up-scaling methods are that it does not require additional modules like gating networks or dynamic expert selection, making it seamless to integrate into existing training and inference frameworks while maintaining high efficiency. Additionally, DUS does not necessitate a distinct training framework for optimal training efficiency or specialized CUDA kernels for fast inference.'), DialogueItem(speaker='Host (Jane)', text=\"Now, let's dive into the key components of the SOLAR 10.7B model. What are they?\"), DialogueItem(speaker='Guest', text='The key components of the SOLAR 10.7B model are:\\n\\n1. Introduction of the SOLAR 10.7 Billion Parameter Model: The study has released the SOLAR 10.7B model, which is depth-wise scaled and continually pre-trained. This model is available under the Apache 2.0 license, allowing commercial use and integration into various products and services, bridging the gap between academic research and practical applications.\\n\\n2. Superior Performance Across Diverse Benchmarks: The SOLAR 10.7B model demonstrates exceptional performance across a wide range of benchmarks.'), DialogueItem(speaker='Host (Jane)', text='How does the SOLAR 10.7B model outperform existing models in various NLP tasks?'), DialogueItem(speaker='Guest', text=\"The SOLAR 10.7B model outperforms existing models in various NLP tasks due to its depth-wise scaling and continuous pretraining. The model's availability under the Apache 2.0 license allows for commercial use, bridging the gap between academic research and practical applications. 
SOLAR 10.7B has been shown to perform better across diverse benchmarks, demonstrating its superiority in various natural language processing tasks.\"), DialogueItem(speaker='Host (Jane)', text=\"It's fascinating how this model has been fine-tuned to perform so well. Can you tell us about the different stages of fine-tuning for the SOLAR 10.7B-Instruct model?\"), DialogueItem(speaker='Guest', text=\"The different stages of fine-tuning for the SOLAR 10.7B-Instruct model are:\\n\\n1. Instruction tuning\\n2. Alignment tuning\\n\\nIn the instruction tuning stage, the model is trained to follow instructions in a QA format. This involves using open-source datasets, as well as synthesizing a math QA dataset to enhance the model's mathematical capabilities. The math QA dataset, called 'Synth. Math-Instruct', is created by rephrasing questions and answers from the Math dataset to avoid contamination with commonly used benchmark datasets.\\n\\nIn the alignment tuning stage, the instruction-tuned model is further fine-tuned to be more aligned with human or strong AI.\"), DialogueItem(speaker='Host (Jane)', text=\"What is the role of the alignment tuning stage in enhancing the SOLAR 10.7B-Instruct model's performance?\"), DialogueItem(speaker='Guest', text=\"The role of the alignment tuning stage in enhancing the SOLAR 10.7B-Instruct model's performance is to further fine-tune the model to be more aligned with human or strong AI preferences. This stage follows the instruction tuning stage, where the model is trained to follow instructions in a QA format using open-source datasets and a synthesized math QA dataset called 'Synth. Math-Instruct'. The alignment tuning stage aims to improve the model's performance by ensuring that it generates responses that are more in line with human expectations and preferences.\"), DialogueItem(speaker='Host (Jane)', text='How does the SOLAR 10.7B-Instruct model compare to other top-performing models in terms of performance metrics?'), DialogueItem(speaker='Guest', text='Based on the provided document, it is not possible to directly compare the SOLAR 10.7B-Instruct model with other top-performing models in terms of performance metrics. The document only mentions the evaluation results for SOLAR 10.7B and SOLAR 10.7B-Instruct, along with other top-performing models, in the Open LLM Leaderboard for six tasks. However, the document does not provide specific performance metrics for the SOLAR 10.7B-Instruct model compared to other top-performing models.\\n\\nTo answer the question, we would need more information about the performance metrics of the SOLAR 10.7B-Instruct model and how it compares to other top-performing models.'), DialogueItem(speaker='Host (Jane)', text=\"That's a shame. Are there any limitations or considerations to keep in mind when using the depth up-scaling (DUS) method?\"), DialogueItem(speaker='Guest', text=\"The limitations and considerations of the Depth Up-Scaling (DUS) method include the need for more thorough explorations of hyperparameters used in the DUS approach, such as the removal of m = 8 layers from both ends of the base model due to hardware limitations, which may not be optimal for enhancing performance. The extended time and cost of continued pretraining made it challenging to conduct more comprehensive experiments, which the authors aim to address in future work through various comparative analyses. 
The model's significant computational demands for training and inference might limit its use, especially for those with restricted computational resources. Like all machine learning models, it is vulnerable to potential biases and errors in the training data, which could affect the model's performance and accuracy.\"), DialogueItem(speaker='Host (Jane)', text='Lastly, how does the SOLAR 10.7B-Instruct model address ethical concerns in its operation?'), DialogueItem(speaker='Guest', text='To address ethical concerns in its operation, the SOLAR 10.7B-Instruct model emphasizes maintaining high ethical standards. It demonstrates low levels of data contamination through rigorous data handling and processing protocols, which are crucial for the reliability and integrity of the results. The model also ensures that all setups and methodologies employed in experiments steer clear of potential ethical pitfalls, and it avoids ethically questionable practices. SOLAR 10.7B-Instruct is committed to conducting innovative and responsible research.')])"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 99
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PW4-FUmB1lAS"
+ },
+ "source": [
+ "## 4. Generate Podcast Using TTS\n",
+ "\n",
+ "Below we read through the script and parse choose the TTS voice depending on the speaker. We define a speaker and guest voice id.\n",
+ "\n",
+ "We can loop through the lines in the script and generate them by a call to the TTS model with specific voice and lines configurations. The lines all appended to the same buffer and once the script finishes we write this out to a `wav` file, ready to be played.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import subprocess\n",
+ "import ffmpeg\n",
+ "\n",
+ "host_id = \"694f9389-aac1-45b6-b726-9d9369183238\" # Jane - host\n",
+ "guest_id = \"a0e99841-438c-4a64-b679-ae501e7d6091\" # Guest\n",
+ "\n",
+ "model_id = \"sonic-english\" # The Sonic Cartesia model for English TTS\n",
+ "\n",
+ "output_format = {\n",
+ " \"container\": \"raw\",\n",
+ " \"encoding\": \"pcm_f32le\",\n",
+ " \"sample_rate\": 44100,\n",
+ "}\n",
+ "\n",
+ "client_cartesia = Cartesia(api_key=os.environ.get(\"CARTESIA_API_KEY\"))\n",
+ "\n",
+ "\n",
+ "# Set up a WebSocket connection.\n",
+ "ws = client_cartesia.tts.websocket()\n",
+ "\n",
+ "# Open a file to write the raw PCM audio bytes to.\n",
+ "f = open(\"podcast.pcm\", \"wb\")\n",
+ "\n",
+ "# Generate and stream audio.\n",
+ "for line in script.dialogue:\n",
+ " if line.speaker == \"Guest\":\n",
+ " voice_id = guest_id\n",
+ " else:\n",
+ " voice_id = host_id\n",
+ "\n",
+ " for output in ws.send(\n",
+ " model_id=model_id,\n",
+ " transcript='-' + line.text, # the \"-\"\" is to add a pause between speakers\n",
+ " voice_id=voice_id,\n",
+ " stream=True,\n",
+ " output_format=output_format,\n",
+ " ):\n",
+ " buffer = output[\"audio\"] # buffer contains raw PCM audio bytes\n",
+ " f.write(buffer)\n",
+ "\n",
+ "# Close the connection to release resources\n",
+ "ws.close()\n",
+ "f.close()\n",
+ "\n",
+ "# Convert the raw PCM bytes to a WAV file.\n",
+ "ffmpeg.input(\"podcast.pcm\", format=\"f32le\").output(\"podcast.wav\").run()\n",
+ "\n",
+ "# Play the file\n",
+ "subprocess.run([\"ffplay\", \"-autoexit\", \"-nodisp\", \"podcast.wav\"])"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "3xn6I3pn7oI8",
+ "outputId": "21c52a79-0a20-4cba-ac79-c404edff5379"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "CompletedProcess(args=['ffplay', '-autoexit', '-nodisp', 'podcast.wav'], returncode=0)"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 104
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "STWaJf_ySctY"
+ },
+ "outputs": [],
+ "source": [
+ "# Play the podcast\n",
+ "import IPython\n",
+ "IPython.display.Audio(\"podcast.wav\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file
From 8c4e7e223929e49f82ee6e9174ac27b039fe28e7 Mon Sep 17 00:00:00 2001
From: Hyesoo Kim <100982596+duper203@users.noreply.github.com>
Date: Thu, 31 Oct 2024 11:58:45 -0700
Subject: [PATCH 2/4] fix
---
PDF_to_Podcast_RAG.ipynb | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/PDF_to_Podcast_RAG.ipynb b/PDF_to_Podcast_RAG.ipynb
index d867535..b8d01c0 100644
--- a/PDF_to_Podcast_RAG.ipynb
+++ b/PDF_to_Podcast_RAG.ipynb
@@ -55,10 +55,9 @@
"!apt install -qU libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg\n",
"!pip install -qU ffmpeg-python\n",
"!pip install -qU PyAudio\n",
- "!pip install -qU pypdf #to read PDF content\n",
"!pip install -qU cartesia #to access TTS model\n",
- "!pip install -qU langchain-upstage langchain\n",
- "!pip install -qU langchain_community faiss-cpu"
+ "!pip install -qU langchain-upstage langchain langchain_community\n",
+ "!pip install -qU faiss-cpu"
]
},
{
From 2ad2badbe3af4e2ce89f7370da4bf0398c4d704d Mon Sep 17 00:00:00 2001
From: Hyesoo Kim <100982596+duper203@users.noreply.github.com>
Date: Thu, 31 Oct 2024 12:02:21 -0700
Subject: [PATCH 3/4] fix pdf path
---
PDF_to_Podcast_RAG.ipynb | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/PDF_to_Podcast_RAG.ipynb b/PDF_to_Podcast_RAG.ipynb
index b8d01c0..ae54fe9 100644
--- a/PDF_to_Podcast_RAG.ipynb
+++ b/PDF_to_Podcast_RAG.ipynb
@@ -152,7 +152,7 @@
" text += page.page_content\n",
"\n",
" return text\n",
- "text = get_PDF_text('solar.pdf')"
+ "text = get_PDF_text('pdfs/solar_paper.pdf')"
],
"metadata": {
"id": "GPdIe4rl5dVk"
@@ -309,7 +309,7 @@
"\n",
" return vectorstore\n",
"\n",
- "vectorstore=vectorstore_embed('solar.pdf')"
+ "vectorstore=vectorstore_embed('pdfs/solar_paper.pdf')"
],
"metadata": {
"id": "Yg-yS0_tt-Wz"
From 2962b961b8ef44d068162a0411e2e383da856dfa Mon Sep 17 00:00:00 2001
From: Hyesoo Kim <100982596+duper203@users.noreply.github.com>
Date: Thu, 31 Oct 2024 12:03:38 -0700
Subject: [PATCH 4/4] fix code
---
PDF_to_Podcast_RAG.ipynb | 31 +------------------------------
1 file changed, 1 insertion(+), 30 deletions(-)
diff --git a/PDF_to_Podcast_RAG.ipynb b/PDF_to_Podcast_RAG.ipynb
index ae54fe9..2d3c980 100644
--- a/PDF_to_Podcast_RAG.ipynb
+++ b/PDF_to_Podcast_RAG.ipynb
@@ -152,6 +152,7 @@
" text += page.page_content\n",
"\n",
" return text\n",
+ "\n",
"text = get_PDF_text('pdfs/solar_paper.pdf')"
],
"metadata": {
@@ -160,36 +161,6 @@
"execution_count": null,
"outputs": []
},
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 140
- },
- "id": "D9BzDxmgvS2V",
- "outputId": "1feefc02-5b5b-4c24-c6e0-c926c5e2a77f"
- },
- "outputs": [
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- "'SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective\\nDepth Up-ScalingDahyun Kim∗, Chanjun Park∗†, Sanghoon Kim∗†, Wonsung Lee∗†, Wonho Song∗\\nYunsu Kim∗, Hyeonwoo Kim∗, Yungi Kim, Hyeonju Lee, Jihoo Kim\\nChangbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim\\nMikyoung Cha, Hwalsuk Lee†, Sunghun Kim†Upstage AI, South Korea{kdahyun, chanjun.park, limerobot, wonsung.lee, hwalsuk.lee, hunkim}@upstage.aiAbstractWe introduce SOLAR 10.7B, a large language\\nmodel (LLM) with 10.7 billion parameters,\\ndemonstrating superior performance in various\\nnatural language processing (NLP) tasks. In-\\nspired by recent efforts to efficiently up-scale\\nLLMs, we present a method for scaling LLMs\\ncalled depth up-scaling (DUS), which encom-\\npasses depthwise scaling and continued pre-\\ntraining. In contrast to other LLM up-scaling\\nmethods that use mixture-of-experts, DUS does\\nnot require complex changes to train and infer-\\nence efficiently. We show experimentally that\\nDUS is simple yet effective in scaling up high-\\nperformance LLMs from small ones. Building\\non the DUS model, we additionally present SO-\\nLAR 10.7B-Instruct, a variant fine-tuned for\\ninstruction-following capabilities, surpassing\\nMixtral-8x7B-Instruct. SOLAR 10.7B is pub-\\nlicly available under the Apache 2.0 license,\\npromoting broad access and application in the\\nLLM field 1.1 Introduction2024\\nApr\\n4\\n[cs.CL]\\narXiv:2312.15166v3The field of natural language processing (NLP)\\nhas been significantly transformed by the introduc-\\ntion of large language models (LLMs), which have\\nenhanced our understanding and interaction with\\nhuman language (Zhao et al., 2023). These ad-\\nvancements bring challenges such as the increased\\nneed to train ever larger models (Rae et al., 2021;\\nWang et al., 2023; Pan et al., 2023; Lian, 2023;\\nYao et al., 2023; Gesmundo and Maile, 2023) ow-\\ning to the performance scaling law (Kaplan et al.,\\n2020; Hernandez et al., 2021; Anil et al., 2023;\\nKaddour et al., 2023). To efficiently tackle the\\nabove, recent works in scaling language models\\nsuch as a mixture of experts (MoE) (Shazeer et al.,\\n2017; Komatsuzaki et al., 2022) have been pro-\\nposed. While those approaches are able to effi-∗Equal Contribution † Corresponding Author\\n1https://huggingface.co/upstage/\\nSOLAR-10.7B-v1.0ciently and effectively scale-up LLMs, they often\\nrequire non-trivial changes to the training and infer-\\nence framework (Gale et al., 2023), which hinders\\nwidespread applicability. Effectively and efficiently\\nscaling up LLMs whilst also retaining the simplic-\\nity for ease of use is an important problem (Alberts\\net al., 2023; Fraiwan and Khasawneh, 2023; Sallam\\net al., 2023; Bahrini et al., 2023).Inspired by Komatsuzaki et al. (2022), we\\npresent depth up-scaling (DUS), an effective and\\nefficient method to up-scale LLMs whilst also re-\\nmaining straightforward to use. DUS consists of\\nscaling the number of layers in the base model and\\ncontinually pretraining the scaled model. Unlike\\n(Komatsuzaki et al., 2022), DUS does not scale\\nthe model using MoE and rather use a depthwise\\nscaling method analogous to Tan and Le (2019)\\nwhich is adapted for the LLM architecture. Thus,\\nthere are no additional modules or dynamism as\\nwith MoE, making DUS immediately compatible\\nwith easy-to-use LLM frameworks such as Hug-\\ngingFace (Wolf et al., 2019) with no changes to\\nthe training or inference framework for maximal\\nefficiency. 
Furthermore, DUS is applicable to all\\ntransformer architectures, opening up new gate-\\nways to effectively and efficiently scale-up LLMs\\nin a simple manner. Using DUS, we release SO-\\nLAR 10.7B, an LLM with 10.7 billion parameters,\\nthat outperforms existing models like Llama 2 (Tou-\\nvron et al., 2023) and Mistral 7B (Jiang et al., 2023)\\nin various benchmarks.We have also developed SOLAR 10.7B-Instruct,\\na variant fine-tuned for tasks requiring strict adher-\\nence to complex instructions. It significantly out-\\nperforms the Mixtral-8x7B-Instruct model across\\nvarious evaluation metrics, evidencing an advanced\\nproficiency that exceeds the capabilities of even\\nlarger models in terms of benchmark performance.By releasing SOLAR 10.7B under the Apache\\n2.0 license, we aim to promote collaboration and in-\\nnovation in NLP. This open-source approach allowsFigure 1: Depth up-scaling for the case with n = 32, s = 48, and m = 8. Depth up-scaling is achieved through a\\ndual-stage process of depthwise scaling followed by continued pretraining.for wider access and application of these models\\nby researchers and developers globally.2 Depth Up-ScalingTo efficiently scale-up LLMs, we aim to utilize pre-\\ntrained weights of base models to scale up to larger\\nLLMs (Komatsuzaki et al., 2022). While exist-\\ning methods such as Komatsuzaki et al. (2022) use\\nMoE (Shazeer et al., 2017) to scale-up the model ar-\\nchitecture, we opt for a different depthwise scaling\\nstrategy inspired by Tan and Le (2019). We then\\ncontinually pretrain the scaled model as just scaling\\nthe model without further pretraining degrades the\\nperformance.Base model. Any n-layer transformer architec-\\nture can be used but we select the 32-layer Llama\\n2 architecture as our base model. We initialize the\\nLlama 2 architecture with pretrained weights from\\nMistral 7B, as it is one of the top performers com-\\npatible with the Llama 2 architecture. By adopting\\nthe Llama 2 architecture for our base model, we\\naim to leverage the vast pool of community re-\\nsources while introducing novel modifications to\\nfurther enhance its capabilities.Depthwise scaling. From the base model with n\\nlayers, we set the target layer count s for the scaled\\nmodel, which is largely dictated by the available\\nhardware.With the above, the depthwise scaling process\\nis as follows. The base model with n layers is\\nduplicated for subsequent modification. Then, we\\nremove the final m layers from the original model\\nand the initial m layers from its duplicate, thus\\nforming two distinct models with n − m layers.\\nThese two models are concatenated to form a scaled\\nmodel with s = 2·(n−m) layers. Note that n = 32\\nfrom our base model and we set s = 48 consideringour hardware constraints and the efficiency of the\\nscaled model, i.e., fitting between 7 and 13 billion\\nparameters. Naturally, this leads to the removal of\\nm = 8 layers. The depthwise scaling process with\\nn = 32, s = 48, and m = 8 is depicted in ‘Step 1:\\nDepthwise Scaling’ of Fig. 1.We note that a method in the community that also\\nscale the model in the same manner 2 as ‘Step 1:\\nDepthwise Scaling’ of Fig. 1 has been concurrently\\ndeveloped.Continued pretraining. The performance of the\\ndepthwise scaled model initially drops below that\\nof the base LLM. Thus, we additionally apply\\nthe continued pretraining step as shown in ‘Step\\n2: Continued Pretraining’ of Fig. 1. 
Experimen-\\ntally, we observe rapid performance recovery of\\nthe scaled model during continued pretraining, a\\nphenomenon also observed in Komatsuzaki et al.\\n(2022). We consider that the particular way of\\ndepthwise scaling has isolated the heterogeneity\\nin the scaled model which allowed for this fast\\nperformance recovery.Delving deeper into the heterogeneity of the\\nscaled model, a simpler alternative to depthwise\\nscaling could be to just repeat its layers once more,\\ni.e., from n to 2n layers. Then, the ‘layer distance’,\\nor the difference in the layer indices in the base\\nmodel, is only bigger than 1 where layers n and\\nn + 1 are connected, i.e., at the seam.However, this results in maximum layer distance\\nat the seam, which may be too significant of a\\ndiscrepancy for continued pretraining to quickly\\nresolve. Instead, depthwise scaling sacrifices the\\n2m middle layers, thereby reducing the discrep-\\nancy at the seam and making it easier for continued2https://huggingface.co/Undi95/\\nMistral-11B-v0.1Properties Instruction Training Datasets Alignment\\n Alpaca-GPT4 OpenOrca Synth. Math-Instruct Orca DPO Pairs Ultrafeedback Cleaned Synth. Math-Alignment\\n Total # Samples 52K 2.91M 126K 12.9K 60.8K 126K\\n Maximum # Samples Used 52K 100K 52K 12.9K 60.8K 20.1K\\n Open Source O O ✗ O O ✗Table 1: Training datasets used for the instruction and alignment tuning stages, respectively. For the instruction\\ntuning process, we utilized the Alpaca-GPT4 (Peng et al., 2023), OpenOrca (Mukherjee et al., 2023), and Synth.\\nMath-Instruct datasets, while for the alignment tuning, we employed the Orca DPO Pairs (Intel, 2023), Ultrafeedback\\nCleaned (Cui et al., 2023; Ivison et al., 2023), and Synth. Math-Alignment datasets. The ‘Total # Samples‘ indicates\\nthe total number of samples in the entire dataset. The ‘Maximum # Samples Used‘ indicates the actual maximum\\nnumber of samples that were used in training, which could be lower than the total number of samples in a given\\ndataset. ‘Open Source‘ indicates whether the dataset is open-sourced.pretraining to quickly recover performance. We\\nattribute the success of DUS to reducing such dis-\\ncrepancies in both the depthwise scaling and the\\ncontinued pretraining steps. We also hypothesize\\nthat other methods of depthwise scaling could also\\nwork for DUS, as long as the discrepancy in the\\nscaled model is sufficiently contained before the\\ncontinued pretraining step.Comparison to other up-scaling methods. Un-\\nlike Komatsuzaki et al. (2022), depthwise scaled\\nmodels do not require additional modules like gat-\\ning networks or dynamic expert selection. Conse-\\nquently, scaled models in DUS do not necessitate\\na distinct training framework for optimal training\\nefficiency, nor do they require specialized CUDA\\nkernels for fast inference. A DUS model can seam-\\nlessly integrate into existing training and inference\\nframeworks while maintaining high efficiency.3 Training DetailsAfter DUS, including continued pretraining, we\\nperform fine-tuning of SOLAR 10.7B in two stages:\\n1) instruction tuning and 2) alignment tuning.Instruction tuning. In the instruction tuning\\nstage, the model is trained to follow instructions in\\na QA format (Zhang et al., 2023). We mostly use\\nopen-source datasets but also synthesize a math QA\\ndataset to enhance the model’s mathematical capa-\\nbilities. A rundown of how we crafted the dataset is\\nas follows. 
First, seed math data are collected from\\nthe Math (Hendrycks et al., 2021) dataset only, to\\navoid contamination with commonly used bench-\\nmark datasets such as GSM8K (Cobbe et al., 2021).\\nThen, using a process similar to MetaMath (Yu\\net al., 2023), we rephrase the questions and an-\\nswers of the seed math data. We use the resulting\\nrephrased question-answer pairs as a QA datasetand call it ‘Synth. Math-Instruct‘.Alignment tuning. In the alignment tuning stage,\\nthe instruction-tuned model is further fine-tuned\\nto be more aligned with human or strong AI\\n(e.g., GPT4 (OpenAI, 2023)) preferences using\\nsDPO (Kim et al., 2024a), an improved version\\nof direct preference optimization (DPO) (Rafailov\\net al., 2023). Similar to the instruction tuning stage,\\nwe use mostly open-source datasets but also syn-\\nthesize a math-focused alignment dataset utilizing\\nthe ‘Synth. Math-Instruct‘ dataset mentioned in the\\ninstruction tuning stage.The alignment data synthesis process is as\\nfollows. We take advantage of the fact that\\nthe rephrased question-answer pairs in Synth.\\nMath-Instruct data are beneficial in enhancing the\\nmodel’s mathematical capabilities (see Sec. 4.3.1).\\nThus, we speculate that the rephrased answer to the\\nrephrased question is a better answer than the orig-\\ninal answer, possibly due to the interim rephrasing\\nstep. Consequently, we set the rephrased question\\nas the prompt and use the rephrased answer as the\\nchosen response and the original answer as the re-\\njected response and create the {prompt, chosen,\\nrejected} DPO tuple. We aggregate the tuples from\\nthe rephrased question-answer pairs and call the\\nresulting dataset ‘Synth. Math-Alignment‘.4 Results4.1 Experimental DetailsTraining datasets. We present details regarding\\nour training datasets for the instruction and align-\\nment tuning stages in Tab. 1. We do not always\\nuse the entire dataset and instead subsample a set\\namount. Note that most of our training data is\\nopen-source, and the undisclosed datasets can be\\nsubstituted for open-source alternatives such as theModel Size Type H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n SOLAR 10.7B-Instruct ∼ 11B Alignment-tuned 74.20 71.08 88.16 66.21 71.43 83.58 64.75\\n Qwen 72B ∼ 72B Pretrained 73.60 65.19 85.94 77.37 60.19 82.48 70.43\\n Mixtral 8x7B-Instruct-v0.1 ∼ 47B Instruction-tuned 72.62 70.22 87.63 71.16 64.58 81.37 60.73\\n Yi 34B-200K ∼ 34B Pretrained 70.81 65.36 85.58 76.06 53.64 82.56 61.64\\n Yi 34B ∼ 34B Pretrained 69.42 64.59 85.69 76.35 56.23 83.03 50.64\\n Mixtral 8x7B-v0.1 ∼ 47B Pretrained 68.42 66.04 86.49 71.82 46.78 81.93 57.47\\n Llama 2 70B ∼ 70B Pretrained 67.87 67.32 87.33 69.83 44.92 83.74 54.06\\n Falcon 180B ∼ 180B Pretrained 67.85 69.45 88.86 70.50 45.47 86.90 45.94\\n SOLAR 10.7B ∼ 11B Pretrained 66.04 61.95 84.60 65.48 45.04 83.66 55.50\\n Qwen 14B ∼ 14B Pretrained 65.86 58.28 83.99 67.70 49.43 76.80 58.98\\n Mistral 7B-Instruct-v0.2 ∼ 7B Instruction-tuned 65.71 63.14 84.88 60.78 68.26 77.19 40.03\\n Yi 34B-Chat ∼ 34B Instruction-tuned 65.32 65.44 84.16 74.90 55.37 80.11 31.92\\n Mistral 7B ∼ 7B Pretrained 60.97 59.98 83.31 64.16 42.15 78.37 37.83Table 2: Evaluation results in the Open LLM Leaderboard for SOLAR 10.7B and SOLAR 10.7B-Instruct along with\\nother top-performing models. We report the scores for the six tasks mentioned in Sec. 4.1 along with the H6 score\\n(average of six tasks). We also report the size of the models in units of billions of parameters. 
The type indicates the\\ntraining stage of the model and is chosen from {Pretrained, Instruction-tuned, Alignment-tuned}. Models based on\\nSOLAR 10.7B are colored purple. The best scores for H6 and the individual tasks are shown in bold.MetaMathQA (Yu et al., 2023) dataset.We reformatted the instruction datasets with an\\nAlpaca-styled chat template. For datasets such as\\nOpenOrca, which are derived from FLAN (Long-\\npre et al., 2023), we filter data that overlaps with\\nthe benchmark datasets (see Tab. 8 in Appendix. C\\nfor more information). The alignment datasets\\nare in the {prompt, chosen, rejected} triplet for-\\nmat. We preprocess the alignment datasets follow-\\ning Zephyr (Tunstall et al., 2023). We use Data-\\nverse (Park et al., 2024) for data preprocessing.Evaluation. In the HuggingFace Open LLM\\nLeaderboard (Beeching et al., 2023), six types of\\nevaluation methods are presented: ARC (Clark\\net al., 2018), HellaSWAG (Zellers et al., 2019),\\nMMLU (Hendrycks et al., 2020), TruthfulQA (Lin\\net al., 2022), Winogrande (Sakaguchi et al., 2021),\\nand GSM8K (Cobbe et al., 2021). We utilize these\\ndatasets as benchmarks for evaluation and also re-\\nport the average scores for the six tasks, e.g., H6.\\nWe either submit directly to the Open LLM Leader-\\nboard or utilize Evalverse (Kim et al., 2024b) for\\nrunning evaluations locally.Model merging. Model merging methods such\\nas Yadav et al. (2023) can boost model perfor-\\nmance without further training. We merge some\\nof the models that we trained in both the instruc-\\ntion and alignment tuning stages. We implement\\nour own merging methods although popular open\\nsource also exist such as MergeKit3.4.2 Main ResultsWe present evaluation results for our SOLAR\\n10.7B and SOLAR 10.7B-Instruct models along3https://github.com/cg123/mergekitwith other top-performing models in Tab. 2. SO-\\nLAR 10.7B outperforms other pretrained models\\nof similar sizes, such as Qwen 14B and Mistral\\n7B, which shows that DUS is an effective method\\nto up-scale base LLMs. Furthermore, despite the\\nsmaller size, SOLAR 10.7B-Instruct scores the\\nhighest in terms of H6, even surpassing the recent\\ntop-performing open-source LLM Mixtral 8x7B-\\nInstruct-v0.1 or Qwen 72B. The above results indi-\\ncate DUS can up-scale models that are capable of\\nachieving state-of-the-art performance when fine-\\ntuned. We also report data contamination results\\nfor SOLAR 10.7B-Instruct in Appendix C.4.3 Ablation StudiesWe present ablation studies for both the instruction\\nand alignment tuning stages. Note that the evalua-\\ntion results for the following studies are ran locally\\nand may vary from results obtained by submitting\\nto the Open LLM Leaderboard.4.3.1 Instruction TuningAblation on the training datasets. We present\\nablation studies using different training datasets\\nfor the instruction tuning in Tab. 3. The ablated\\nmodels are prefixed with SFT for supervised fine-\\ntuning. ‘SFT v1’ only uses the Alpaca-GPT4\\ndataset, whereas ‘SFT v2’ also uses the OpenOrca\\ndataset. ‘SFT v3’ uses the Synth. Math-Instruct\\ndataset along with the datasets used in ‘SFT v2’.\\nSimilarly, ‘SFT v4’ uses the Synth. Math-Instruct\\ndataset along with the datasets used in ‘SFT v1’.First, we analyze how Alpaca-GPT4 and\\nOpenOrca affect the trained models. The first ab-\\nlated model, ‘SFT v1’, which used only the Alpaca-\\nGPT4 dataset for training, resulted in 69.15 for H6.Model Alpaca-GPT4 OpenOrca Synth. Math-Instruct H6 (Avg.) 
ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n SFT v1 O ✗ ✗ 69.15 67.66 86.03 65.88 60.12 82.95 52.24\\n SFT v2 O O ✗ 69.21 65.36 85.39 65.93 58.47 82.79 57.32\\n SFT v3 O O O 70.03 65.87 85.55 65.31 57.93 81.37 64.14\\n SFT v4 O ✗ O 70.88 67.32 85.87 65.87 58.97 82.48 64.75\\n SFT v3 + v4 O O O 71.11 67.32 85.96 65.95 58.80 82.08 66.57Table 3: Ablation studies on the different datasets used for instruction tuning. ‘SFT v3+v4’ indicates that the model\\nis merged from ‘SFT v3’ and ‘SFT v4’ by simply averaging the model weights. The best scores for H6 and the\\nindividual tasks are shown in bold.Model Ultrafeedback Clean Synth. Math-Alignment H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n DPO v1 O ✗ 73.06 71.42 88.49 66.14 72.04 81.45 58.83\\n DPO v2 O O 73.42 71.50 88.28 65.97 71.71 82.79 60.27\\n DPO v1 + v2 O O 73.21 71.33 88.36 65.92 72.65 82.79 58.23Table 4: Ablation studies on the different datasets used during the direct preference optimization (DPO) stage.\\n‘SFT v3’ is used as the SFT base model for DPO. We name ablated models with the ‘DPO’ prefix to indicate the\\nalignment tuning stage. ‘DPO v1+v2’ indicates that the model is merged from ‘DPO v1’ and ‘DPO v2’ by simply\\naveraging the model weights. The best scores for H6 and the individual tasks are shown in bold.Model Base SFT Model H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n DPO v2 SFT v3 73.42 71.50 88.28 65.97 71.71 82.79 60.27\\n DPO v3 SFT v3 + v4 73.58 71.33 88.08 65.39 72.45 81.93 62.32Table 5: Ablation studies on the different SFT base models used during the direct preference optimization (DPO)\\nstage. Ultrafeedback Clean and Synth. Math-Alignment datasets are used. We name ablated models with the ‘DPO’\\nprefix to indicate the alignment tuning stage. The best scores for H6 and the individual tasks are shown in bold.When we add the OpenOrca dataset to train the\\nsecond ablated model, ‘SFT v2’, the resulting H6\\nscore is 69.21, which is little change from 69.15 of\\n‘SFT v1’. However, the task scores vary more as\\n‘SFT v2’ gets a substantially higher GSM8K score\\nof 57.32 compared to 52.24 of ‘SFT v1’ but also\\ngets noticeably lower scores across the board for\\nARC, HellaSwag, and TruthfulQA. This seems to\\nindicate that using OpenOrca results in a model that\\nbehaves differently from using only Alpaca-GPT4.Second, we investigate whether Synth. Math-\\nInstruct dataset is beneficial. For ‘SFT v3’, we\\nadd the Synth. Math-Instruct dataset, which boosts\\nGSM8K scores to 64.14 and achieves comparable\\nscores for the other tasks. Interestingly, when we\\nadd the Synth. Math-Instruct dataset to ‘SFT v1’\\nto train ‘SFT v4’, we get our highest H6 score of\\n70.88 with higher scores than ‘SFT v3’ for all tasks.\\nFrom the above, we can see that adding the Synth.\\nMath-Instruct dataset is helpful.Lastly, we see whether merging models trained\\nwith and without OpenOrca can boost performance.\\nIn the first analysis, we saw that using OpenOrca re-\\nsulted in a model that behaved differently from the\\nmodel that was trained without OpenOrca. Build-\\ning on this intuition, we merge ‘SFT v3’ and ‘SFT\\nv4’ as they are the best-performing models withand without OpenOrca. To our surprise, the result-\\ning merged model ‘SFT v3+v4’ retains the high\\nscores for non-GSM8K tasks from ‘SFT v4’ but\\nalso achieves a higher GSM8K score than ‘SFT v3’\\nor ‘SFT v4’. 
Thus, we see that merging models\\nthat specialize in different tasks is a promising way\\nto obtain a model that performs well generally.4.3.2 Alignment TuningAs we utilize sDPO for practical alignment tun-\\ning, there are additional aspects to ablate such as\\nthe SFT base models used. Thus, we present ab-\\nlations for the different training datasets used for\\ntraining, the different SFT base models to initialize\\nthe sDPO training, and finally, the model merging\\nstrategy to obtain the final alignment-tuned model.Ablation on the training datasets. We ablate on\\nthe different alignment datasets used during DPO\\nin Tab. 4. We use ‘SFT v3’ as the SFT base model\\nfor DPO. ‘DPO v1’ only uses the Ultrafeedback\\nClean dataset while ‘DPO v2’ also used the Synth.\\nMath-Alignment dataset.First, we test how Ultrafeedback Clean and\\nSynth. Math-Alignment impacts model perfor-\\nmance. For ‘DPO v1’, it achieves 73.06 in H6,\\nwhich is a substantial boost from the SFT base\\nmodel score of 70.03. However, we note that whileModel H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n Cand. 1 73.73 70.48 87.47 65.73 70.62 81.53 66.57\\n Cand. 2 73.28 71.59 88.39 66.14 72.50 81.99 59.14Table 6: Performance comparison amongst the merge candidates. ‘Cand. 1’ and ‘Cand. 2’ are trained using the\\nsame setting as ‘DPO v2’ and ‘DPO v3’, respectively, but with slightly different hyper-parameters. The best scores\\nfor H6 and the individual tasks are shown in bold.Model Merge Method H6 (Avg.) ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n Merge v1 Average (0.5, 0.5) 74.00 71.16 88.01 66.14 71.71 82.08 64.90\\n Merge v2 Average (0.4, 0.6) 73.93 71.08 88.08 66.27 71.89 81.77 64.52\\n Merge v3 Average (0.6, 0.4) 74.05 71.08 87.88 66.13 71.61 82.08 65.50\\n Merge v4 SLERP 73.96 71.16 88.03 66.25 71.79 81.93 64.59Table 7: Ablation studies on the different merge methods used for obtaining the final model. We use ‘Cand. 1’\\nand ‘Cand. 2’ from Tab. 6 as our two models for merging. We name the merged models with the ‘Merge’ prefix to\\nindicate they are merged. The best scores for H6 and the individual tasks are shown in bold.scores for tasks like ARC, HellaSwag, and Truth-\\nfulQA all improved by good margins, the score\\nfor GSM8K is 58.83, which is lower than the\\nSFT base model score of 64.14. Adding Synth.\\nMath-Alignment to train ‘DPO v2’, we see that\\nthe GSM8k score improves to 60.27, which is\\nlower than the SFT base model but still higher\\nthan ‘DPO v1’. Other task scores are also not nega-\\ntively impacted by adding Synth. Math-Alignment.\\nThus, we can conclude that adding Synth. Math-\\nAlignment is beneficial for H6.Then, we experiment whether merging ‘DPO\\nv1’ and ‘DPO v2’ is beneficial. Unfortunately,\\n‘DPO v1+v2’ scores 73.21 in H6, which is worse\\nthan ‘DPO v2’. More importantly, the gain in\\nthe GSM8K score from adding Synth. Math-\\nAlignment is gone, which is undesirable. One\\nreason for this could be that ‘DPO v2’ is a strict\\nimprovement over ‘DPO v1’, unlike the case for\\nmerging ‘SFT v3’ and ‘SFT v4’ where the models\\nhad different strengths and weaknesses.Ablation on the SFT base models. When ap-\\nplying DPO, we start from a model that is already\\ninstruction tuned ,i.e., the SFT base model and ab-\\nlate on using different SFT base models. We use\\nUltrafeedback Clean and Synth. Math-Alignment\\ndatasets for this ablation. Each of the ablated mod-\\nels is trained as follows. 
‘DPO v2’ uses ‘SFT v3’\\nas the base SFT model, while ‘DPO v3’ uses ‘SFT\\nv3+v4’ as the SFT base model instead.Note that ‘SFT v3+v4’ has higher scores on all\\ntasks compared to ‘SFT v3’, and the gap is espe-\\ncially large for ARC (+1.45) and GSM8K (+2.43).\\nSurprisingly, the two models perform similarly in\\nterms of H6. A closer look at the scores for theindividual tasks shows only a small margin in the\\nGSM8K scores, and other task scores show little\\ndifference. Thus, the performance gaps in certain\\ntasks in the SFT base models do not always carry\\nover to the alignment-tuned models.Ablation on different merge methods. From\\nTab. 3, we saw that merging two models that have\\ndifferent strengths can be beneficial to performance.\\nTo utilize this for the alignment-tuned model as\\nwell, we train two models named ‘Cand. 1’ and\\n‘Cand. 2’ using the same training dataset and SFT\\nbase model as ‘DPO v2’ and ‘DPO v3’ but with dif-\\nferent hyper-parameters to maximize each model’s\\nrespective strengths. We compare ‘Cand. 1’ and\\n‘Cand. 2’ in Tab. 6 where we can see that ‘Cand. 1’\\nhas high GSM8K scores but relatively low scores\\nfor the other tasks, whereas ‘Cand. 2’ has low\\nscores for GSM8K but high scores for the other\\ntasks. We merge these two models using various\\nmethods and ablate the results in Tab.. 7.We use two merge methods: 1) Average (a, b),\\nwhere a and b denote the weighting for ‘Cand.\\n1’ and ‘Cand. 2’ when averaging weights and 2)\\nSLERP (Shoemake, 1985). We use (0.5, 0.5), (0.4,\\n0.6), and (0.6, 0.4) for Average (a, b). From Tab. 7,\\nwe can see that the different merge methods have\\nlittle effect on the H6 scores. The scores for the\\nindividual tasks also do not differ by much, suggest-\\ning that as long as the merge candidates have suffi-\\nciently different strengths, the exact merge method\\nmay not be as crucial. Thus, we chose ‘Merge v1’\\nas our SOLAR 10.7B-Instruct model.5 ConclusionWe introduce SOLAR 10.7B and its fine-tuned vari-\\nant SOLAR 10.7B-Instruct, which are depth up-\\nscaled (DUS) models with 10.7 billion parameters4.\\nThey show superior performance over models like\\nLlama 2, Mistral 7B, and Mixtral-7B-Instruct in es-\\nsential NLP tasks while maintaining computational\\nefficiency. Thus, DUS is effective in scaling-up\\nhighly performant LLMs from smaller ones. With\\nmore exploration, DUS could be further improved,\\npaving a new path to efficiently scaling LLMs.AcknowledgementsWe would like to extend our gratitude to the teams\\nat Hugging Face, particularly Clémentine Four-\\nrier, Lewis Tunstall, Omar Sanseviero, and Philipp\\nSchmid. Our appreciation also extends to the\\nteams at AWS, notably Rahul Sharma, Jeongwon\\nYoon, Nieves Garcia, Ritesh Vajaria, Gal Oshri, Jay\\nKwon, Brandon Lee and Effie Bae. We are grateful\\nto the teams at Korea Telecom (KT), especially Jin\\nHyoung Lee, Jungsuk Park, Sungjoon Park, Hong-\\nrae Wang, Kyeongsoo Jung, and Sunyoong Yoon,\\nwhose significant support has been instrumental in\\nensuring the broad compatibility of our model. Ad-\\nditionally, we would like to extend our thanks to the\\nopen community for their invaluable contributions\\nand feedback.LimitationsOur study on the Depth Up-Scaling (DUS) has im-\\nportant limitations and considerations. One key\\nlimitation is the need for more thorough explo-\\nrations of hyperparameters used in the DUS ap-\\nproach. Namely, we removed m = 8 layers from\\nboth ends of our base model, primarily due to hard-\\nware limitations. 
However, we have not yet deter-\\nmined if this value is optimal for enhancing perfor-\\nmance. The extended time and cost of continued\\npretraining made it challenging to conduct more\\ncomprehensive experiments, which we aim to ad-\\ndress in future work through various comparative\\nanalyses.In terms of the model’s broader implications,\\nthere are several points to note. The model’s sig-\\nnificant computational demands for training and\\ninference might limit its use, especially for those\\nwith restricted computational resources. Addition-4Preprint version is available on https://arxiv.\\norg/abs/2312.15166.ally, like all machine learning models, it is vulnera-\\nble to biases in its training data, which could lead\\nto skewed outcomes in certain situations. Further-\\nmore, the substantial energy consumption required\\nfor training and operating the model raises environ-\\nmental concerns, which are critical in the pursuit\\nof sustainable AI development.Lastly, while the fine-tuned variant of the model\\nshows improved performance in following instruc-\\ntions, it still requires task-specific fine-tuning for\\noptimal performance in specialized applications.\\nThis fine-tuning process can be resource-intensive\\nand not always effective. Recognizing and address-\\ning these limitations is essential for a comprehen-\\nsive understanding of the proposed Large Language\\nModel’s capabilities and for guiding future research\\nand development in the field of LLMs.Ethics StatementWe conscientiously address and emphasize the\\ncommitment of SOLAR 10.7B in maintaining the\\nhighest ethical standards. First, we highlight that\\nSOLAR 10.7B-Instruct has shown low levels of\\ndata contamination in our evaluations, a testament\\nto our rigorous data handling and processing pro-\\ntocols. This aspect is crucial, as it underpins the\\nreliability and integrity of the results obtained from\\nSOLAR.Furthermore, during the course of our experi-\\nments, we ensured that all setups and methodolo-\\ngies employed steer clear of any potential ethical\\npitfalls. This preemptive consideration and avoid-\\nance of ethically questionable practices underscore\\nour dedication to conducting research that is not\\nonly innovative but also responsible.Additionally, we ensure that SOLAR complies\\nwith general ethical considerations in all aspects\\nof its operation. This includes adherence to pri-\\nvacy norms, respect for intellectual property, and\\nensuring the absence of bias in our algorithms. Our\\ncommitment to these ethical principles is unwaver-\\ning, and we believe it significantly contributes to\\nthe credibility and societal acceptance of SOLAR.In conclusion, the ethical framework within\\nwhich SOLAR operates is robust and comprehen-\\nsive, ensuring that our advancements in this field\\nare not only scientifically sound but also ethically\\nresponsible.ReferencesIan L Alberts, Lorenzo Mercolli, Thomas Pyka, George\\nPrenosil, Kuangyu Shi, Axel Rominger, and Ali\\nAfshar-Oromieh. 2023. Large language models\\n(llm) and chatgpt: what will the impact on nuclear\\nmedicine be? European journal of nuclear medicine\\nand molecular imaging, 50(6):1549–1552.Rohan Anil, Andrew M Dai, Orhan Firat, Melvin John-\\nson, Dmitry Lepikhin, Alexandre Passos, Siamak\\nShakeri, Emanuel Taropa, Paige Bailey, Zhifeng\\nChen, et al. 2023. Palm 2 technical report. 
arXiv\\npreprint arXiv:2305.10403.Aram Bahrini, Mohammadsadra Khamoshifar, Hos-\\nsein Abbasimehr, Robert J Riggs, Maryam Esmaeili,\\nRastin Mastali Majdabadkohne, and Morteza Pase-\\nhvar. 2023. Chatgpt: Applications, opportunities,\\nand threats. In 2023 Systems and Information Engi-\\nneering Design Symposium (SIEDS), pages 274–279.\\nIEEE.Edward Beeching, Clémentine Fourrier, Nathan\\nHabib, Sheon Han, Nathan Lambert, Nazneen\\nRajani, Omar Sanseviero, Lewis Tunstall, and\\nThomas Wolf. 2023. Open llm leaderboard.\\nhttps://huggingface.co/spaces/\\nHuggingFaceH4/open_llm_leaderboard.Tom Brown, Benjamin Mann, Nick Ryder, Melanie\\nSubbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind\\nNeelakantan, Pranav Shyam, Girish Sastry, Amanda\\nAskell, et al. 2020. Language models are few-shot\\nlearners. Advances in neural information processing\\nsystems, 33:1877–1901.Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,\\nAshish Sabharwal, Carissa Schoenick, and Oyvind\\nTafjord. 2018. Think you have solved question an-\\nswering? try arc, the ai2 reasoning challenge. arXiv\\npreprint arXiv:1803.05457.Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,\\nMark Chen, Heewoo Jun, Lukasz Kaiser, Matthias\\nPlappert, Jerry Tworek, Jacob Hilton, Reiichiro\\nNakano, et al. 2021. Training verifiers to solve math\\nword problems. arXiv preprint arXiv:2110.14168.Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao,\\nWei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and\\nMaosong Sun. 2023. Ultrafeedback: Boosting lan-\\nguage models with high-quality feedback. arXiv\\npreprint arXiv:2310.01377.Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Ger-\\nstein, and Arman Cohan. 2023. Investigating data\\ncontamination in modern benchmarks for large lan-\\nguage models. arXiv preprint arXiv:2311.09783.Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan,\\nShizhe Diao, Jipeng Zhang, Kashun Shum, and\\nTong Zhang. 2023. Raft: Reward ranked finetuning\\nfor generative foundation model alignment. arXiv\\npreprint arXiv:2304.06767.Mohammad Fraiwan and Natheer Khasawneh. 2023. A\\nreview of chatgpt applications in education, market-\\ning, software engineering, and healthcare: Benefits,\\ndrawbacks, and research directions. arXiv preprint\\narXiv:2305.00237.Trevor Gale, Deepak Narayanan, Cliff Young, and Matei\\nZaharia. 2023. Megablocks: Efficient sparse training\\nwith mixture-of-experts. Proceedings of Machine\\nLearning and Systems, 5.Andrea Gesmundo and Kaitlin Maile. 2023. Compos-\\nable function-preserving expansions for transformer\\narchitectures. arXiv preprint arXiv:2308.06103.Shahriar Golchin and Mihai Surdeanu. 2023. Time\\ntravel in llms: Tracing data contamination in large\\nlanguage models. arXiv preprint arXiv:2308.08493.Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou,\\nMantas Mazeika, Dawn Song, and Jacob Steinhardt.\\n2020. Measuring massive multitask language under-\\nstanding. In International Conference on Learning\\nRepresentations.Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul\\nArora, Steven Basart, Eric Tang, Dawn Song, and Ja-\\ncob Steinhardt. 2021. Measuring mathematical prob-\\nlem solving with the math dataset. arXiv preprint\\narXiv:2103.03874.Danny Hernandez, Jared Kaplan, Tom Henighan, and\\nSam McCandlish. 2021. Scaling laws for transfer.\\narXiv preprint arXiv:2102.01293.Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang,\\nZe Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin\\nJose, Prabhat Ram, et al. 2023. Tutel: Adaptive\\nmixture-of-experts at scale. 
Proceedings of Machine\\nLearning and Systems, 5.Intel. 2023. Supervised fine-tuning and direct prefer-\\nence optimization on intel gaudi2.Hamish Ivison, Yizhong Wang, Valentina Pyatkin,\\nNathan Lambert, Matthew Peters, Pradeep Dasigi,\\nJoel Jang, David Wadden, Noah A. Smith, Iz Belt-\\nagy, and Hannaneh Hajishirzi. 2023. Camels in a\\nchanging climate: Enhancing lm adaptation with tulu\\n2.Albert Q Jiang, Alexandre Sablayrolles, Arthur Men-\\nsch, Chris Bamford, Devendra Singh Chaplot, Diego\\nde las Casas, Florian Bressand, Gianna Lengyel, Guil-\\nlaume Lample, Lucile Saulnier, et al. 2023. Mistral\\n7b. arXiv preprint arXiv:2310.06825.Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale\\nMinervini, and Matt J Kusner. 2023. No train no\\ngain: Revisiting efficient training algorithms for\\ntransformer-based language models. arXiv preprint\\narXiv:2307.06440.Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\\nBrown, Benjamin Chess, Rewon Child, Scott Gray,\\nAlec Radford, Jeffrey Wu, and Dario Amodei. 2020.Scaling laws for neural language models. arXiv\\npreprint arXiv:2001.08361.Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo\\nKim, Yunsu Kim, Sanghoon Kim, and Chanjun Park.\\n2024a. sdpo: Don’t use your data all at once.Jihoo Kim, Wonho Song, Dahyun Kim, Yunsu Kim,\\nYungi Kim, and Chanjun Park. 2024b. Evalverse:\\nUnified and accessible library for large language\\nmodel evaluation.Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp,\\nCarlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie,\\nYi Tay, Mostafa Dehghani, and Neil Houlsby.\\n2022. Sparse upcycling: Training mixture-of-\\nexperts from dense checkpoints. arXiv preprint\\narXiv:2212.05055.Wing Lian. 2023. https://huggingface.co/\\nwinglian/omega-3b.Stephanie Lin, Jacob Hilton, and Owain Evans. 2022.\\nTruthfulqa: Measuring how models mimic human\\nfalsehoods. In Proceedings of the 60th Annual Meet-\\ning of the Association for Computational Linguistics\\n(Volume 1: Long Papers), pages 3214–3252.Shayne Longpre, Le Hou, Tu Vu, Albert Webson,\\nHyung Won Chung, Yi Tay, Denny Zhou, Quoc V\\nLe, Barret Zoph, Jason Wei, et al. 2023. The flan\\ncollection: Designing data and methods for effective\\ninstruction tuning. arXiv preprint arXiv:2301.13688.Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawa-\\nhar, Sahaj Agarwal, Hamid Palangi, and Ahmed\\nAwadallah. 2023. Orca: Progressive learning from\\ncomplex explanation traces of gpt-4. arXiv preprint\\narXiv:2306.02707.OpenAI. 2023. Gpt-4 technical report.Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng\\nShang, Xin Jiang, and Qun Liu. 2023. Reusing pre-\\ntrained models by multi-linear operators for efficient\\ntraining. arXiv preprint arXiv:2310.10699.Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, Yungi\\nKim, Dahyun Kim, and Chanjun Park. 2024. Data-\\nverse: Open-source etl (extract, transform, load)\\npipeline for large language models.Baolin Peng, Chunyuan Li, Pengcheng He, Michel Gal-\\nley, and Jianfeng Gao. 2023. Instruction tuning with\\ngpt-4. arXiv preprint arXiv:2304.03277.Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\nDario Amodei, Ilya Sutskever, et al. 2019. Language\\nmodels are unsupervised multitask learners. OpenAI\\nblog, 1(8):9.Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie\\nMillican, Jordan Hoffmann, Francis Song, John\\nAslanides, Sarah Henderson, Roman Ring, Susan-\\nnah Young, et al. 2021. 
Scaling language models:\\nMethods, analysis & insights from training gopher.\\narXiv preprint arXiv:2112.11446.Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano\\nErmon, Christopher D Manning, and Chelsea Finn.\\n2023. Direct preference optimization: Your language\\nmodel is secretly a reward model. arXiv preprint\\narXiv:2305.18290.Oscar Sainz, Jon Ander Campos, Iker García-Ferrero,\\nJulen Etxaniz, Oier Lopez de Lacalle, and Eneko\\nAgirre. 2023. Nlp evaluation in trouble: On the\\nneed to measure llm data contamination for each\\nbenchmark. arXiv preprint arXiv:2310.18018.Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavat-\\nula, and Yejin Choi. 2021. Winogrande: An adver-\\nsarial winograd schema challenge at scale. Commu-\\nnications of the ACM, 64(9):99–106.Malik Sallam, Nesreen Salim, Muna Barakat, and Alaa\\nAl-Tammemi. 2023. Chatgpt applications in medical,\\ndental, pharmacy, and public health education: A\\ndescriptive study highlighting the advantages and\\nlimitations. Narra J, 3(1):e103–e103.Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,\\nAndy Davis, Quoc Le, Geoffrey Hinton, and Jeff\\nDean. 2017. Outrageously large neural networks:\\nThe sparsely-gated mixture-of-experts layer. arXiv\\npreprint arXiv:1701.06538.Tianxiao Shen, Myle Ott, Michael Auli, and\\nMarc’Aurelio Ranzato. 2019. Mixture models for\\ndiverse machine translation: Tricks of the trade. In\\nInternational conference on machine learning, pages\\n5719–5728. PMLR.Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo\\nHuang, Daogao Liu, Terra Blevins, Danqi Chen,\\nand Luke Zettlemoyer. 2023. Detecting pretraining\\ndata from large language models. arXiv preprint\\narXiv:2310.16789.Ken Shoemake. 1985. Animating rotation with quater-\\nnion curves. In Proceedings of the 12th annual con-\\nference on Computer graphics and interactive tech-\\nniques, pages 245–254.Mingxing Tan and Quoc Le. 2019. Efficientnet: Re-\\nthinking model scaling for convolutional neural net-\\nworks. In International conference on machine learn-\\ning, pages 6105–6114. PMLR.Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-\\nbert, Amjad Almahairi, Yasmine Babaei, Nikolay\\nBashlykov, Soumya Batra, Prajjwal Bhargava, Shruti\\nBhosale, et al. 2023. Llama 2: Open founda-\\ntion and fine-tuned chat models. arXiv preprint\\narXiv:2307.09288.Lewis Tunstall, Edward Beeching, Nathan Lambert,\\nNazneen Rajani, Kashif Rasul, Younes Belkada,\\nShengyi Huang, Leandro von Werra, Clémentine\\nFourrier, Nathan Habib, et al. 2023. Zephyr: Di-\\nrect distillation of lm alignment. arXiv preprint\\narXiv:2310.16944.Peihao Wang, Rameswar Panda, Lucas Torroba Hen-\\nnigen, Philip Greengard, Leonid Karlinsky, Roge-\\nrio Feris, David Daniel Cox, Zhangyang Wang, and\\nYoon Kim. 2023. Learning to grow pretrained mod-\\nels for efficient transformer training. arXiv preprint\\narXiv:2303.00980.Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Al-\\nisa Liu, Noah A Smith, Daniel Khashabi, and Han-\\nnaneh Hajishirzi. 2022. Self-instruct: Aligning lan-\\nguage model with self generated instructions. arXiv\\npreprint arXiv:2212.10560.Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin\\nGuu, Adams Wei Yu, Brian Lester, Nan Du, An-\\ndrew M Dai, and Quoc V Le. 2021. Finetuned lan-\\nguage models are zero-shot learners. arXiv preprint\\narXiv:2109.01652.Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel,\\nBarret Zoph, Sebastian Borgeaud, Dani Yogatama,\\nMaarten Bosma, Denny Zhou, Donald Metzler, et al.\\n2022a. 
Emergent abilities of large language models.\\narXiv preprint arXiv:2206.07682.Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten\\nBosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,\\net al. 2022b. Chain-of-thought prompting elicits rea-\\nsoning in large language models. Advances in Neural\\nInformation Processing Systems, 35:24824–24837.Thomas Wolf, Lysandre Debut, Victor Sanh, Julien\\nChaumond, Clement Delangue, Anthony Moi, Pier-\\nric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,\\net al. 2019. Huggingface’s transformers: State-of-\\nthe-art natural language processing. arXiv preprint\\narXiv:1910.03771.Prateek Yadav, Derek Tam, Leshem Choshen, Colin\\nRaffel, and Mohit Bansal. 2023. Ties-merging: Re-\\nsolving interference when merging models. In Thirty-\\nseventh Conference on Neural Information Process-\\ning Systems.Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu,\\nQuoc V Le, Denny Zhou, and Xinyun Chen. 2023.\\nLarge language models as optimizers. arXiv preprint\\narXiv:2309.03409.Yiqun Yao, Zheng Zhang, Jing Li, and Yequan\\nWang. 2023. 2x faster language model pre-training\\nvia masked structural growth. arXiv preprint\\narXiv:2305.02869.Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu,\\nZhengying Liu, Yu Zhang, James T Kwok, Zhen-\\nguo Li, Adrian Weller, and Weiyang Liu. 2023.\\nMetamath: Bootstrap your own mathematical ques-\\ntions for large language models. arXiv preprint\\narXiv:2309.12284.Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang,\\nSongfang Huang, and Fei Huang. 2023. Rrhf:\\nRank responses to align language models with\\nhuman feedback without tears. arXiv preprint\\narXiv:2304.05302.Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali\\nFarhadi, and Yejin Choi. 2019. Hellaswag: Can a\\nmachine really finish your sentence? In Proceedings\\nof the 57th Annual Meeting of the Association for\\nComputational Linguistics, pages 4791–4800.Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang,\\nXiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tian-\\nwei Zhang, Fei Wu, et al. 2023. Instruction tuning\\nfor large language models: A survey. arXiv preprint\\narXiv:2308.10792.Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,\\nXiaolei Wang, Yupeng Hou, Yingqian Min, Beichen\\nZhang, Junjie Zhang, Zican Dong, et al. 2023. A\\nsurvey of large language models. arXiv preprint\\narXiv:2303.18223.Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen,\\nWayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong\\nWen, and Jiawei Han. 2023. Don’t make your llm\\nan evaluation benchmark cheater. arXiv preprint\\narXiv:2311.01964.Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B\\nBrown, Alec Radford, Dario Amodei, Paul Chris-\\ntiano, and Geoffrey Irving. 2019. Fine-tuning lan-\\nguage models from human preferences. arXiv\\npreprint arXiv:1909.08593.A ContributionsThe contributions of this study are as follows:• Introduction of the SOLAR 10.7 Billion-\\nParameter Model: We have released the SO-\\nLAR 10.7B model, which is not only depth-\\nwise scaled but also continually pretrained.\\nThe availability of SOLAR 10.7B under the\\nApache 2.0 license permits commercial us-\\nage, enabling the integration of this advanced\\nmodel into a diverse range of products and ser-\\nvices. 
This bridges the gap between academic\\nresearch and practical applications, fostering\\nwider accessibility and utility in various fields.• Superior Performance Across Diverse\\nBenchmarks: SOLAR 10.7B excels in var-\\nious benchmarks, outperforming established\\nmodels like Llama 2 and Mistral 7B in reason-\\ning, mathematics, and the MMLU framework.• Advancement in Instruction-Following Ca-\\npabilities: The introduction of SOLAR 10.7B-\\nInstruct, a variant fine-tuned for enhanced\\ninstruction-following abilities, marks a sig-\\nnificant improvement in the model’s ability to\\nunderstand and execute complex instructions.Sanghoon Kim, Dahyun Kim, Chanjun Park,\\nWonsung Lee, Wonho Song, Yunsu Kim and\\nHyeonwoo Kim contributed equally to this paper.\\nSanghoon Kim led the Foundation Model part,\\nwith Dahyun Kim, Wonho Song, Yunsu Kim, and\\nHyeonwoo Kim. Chanjun Park led the Data and\\nEvaluation (Data-Centric LLM) part, with Yungi\\nKim, Jihoo Kim, Changbae Ahn, Seonghoon Yang,\\nSukyung Lee, and Hyunbyung Park. Wonsung Lee\\nled the Adaptation Modeling part, with Gyoungjin\\nGim, Hyeonju Lee, and Mikyoung Cha. Hwalsuk\\nLee performed the role of the overall project opera-\\ntion. Dahyun Kim and Chanjun Park were the main\\ntechnical writers. All these individuals contributed\\nto the creation of SOLAR 10.7B.B Related Works and BackgroundB.1 Large Language ModelsFollowing the advent of context-based language\\nmodels, various studies have revealed a “scaling\\nlaw” (Kaplan et al., 2020; Hernandez et al., 2021;\\nAnil et al., 2023), demonstrating a positive corre-\\nlation between the size of model and training dataand model performance. This has led to the emer-\\ngence of Large Language Models (LLMs). Un-\\nlike previous language models, LLMs possess the\\nability for In-context learning, including Zero-shot\\nlearning (Radford et al., 2019) and Few-shot learn-\\ning (Brown et al., 2020), allowing them to perform\\nnew tasks without updating model weights. These\\ncapabilities of LLMs, not evident in smaller mod-\\nels, are referred to as Emergent abilities (Wei et al.,\\n2022a).B.2 Mixture of ExpertsIn the landscape of machine learning architectures,\\nthe Mixture of Experts (MoE) models like (Shazeer\\net al., 2017; Shen et al., 2019; Komatsuzaki et al.,\\n2022) has gained attention for its capability to ad-\\ndress the challenges posed by complex and hetero-\\ngeneous data. MoE models offer notable benefits,\\nincluding enhanced output diversity, allowing for\\nthe capture of intricate patterns within the input\\nspace. Moreover, their computational efficiency,\\nespecially when implemented in a sparse form, has\\nmade them valuable in scenarios where resource\\nconstraints are a consideration (Shazeer et al., 2017;\\nKomatsuzaki et al., 2022).However, efficient implementation of MoE mod-\\nels poses a considerable challenge, primarily due to\\nthe intricacies associated with dynamic routing and\\nload-imbalanced computation (Gale et al., 2023).\\nExisting hardware and software for deep learning,\\nsuch as TPUs and XLA compilers, often demand\\nstatic knowledge of tensor shapes, making MoE\\nimplementation on TPU challenging.While GPU implementation offers more flexi-\\nbility, sparse computation compatibility becomes\\na hurdle. Striking the right balance between fix-\\ning the size of each expert to facilitate efficient\\ncomputation and maintaining model quality creates\\na tradeoff between information preservation and\\nhardware efficiency. 
This tradeoff, in turn, necessi-\\ntates careful consideration during hyperparameter\\ntuning, adding a layer of complexity to the imple-\\nmentation of MoE models, potentially offsetting\\ntheir advantages. Given the formidable challenges\\nin MoE model implementation, it becomes almost\\ninevitable for researchers and practitioners to re-\\nsort to specialized tools and frameworks, such as\\nTutel (Hwang et al., 2023) or Megablocks (Gale\\net al., 2023).Departing from the horizontal expansion char-\\nacteristic of MoE models, the DUS method intro-duces model scaling in the vertical dimension. No-\\ntably, DUS does not introduce dynamism in the\\nscaled model, which significantly reduces the com-\\nplexity when compared to MoE. This shift in ap-\\nproach offers a unique and more straightforward\\nway of working, moving away from conventional\\nMoE challenges. Not only that, DUS also under-\\ngoes continued pretraining to quickly recover per-\\nformance of the scaled model.B.3 Prompt EngineeringA key research area to harness the emergent abil-\\nities of LLMs is prompt engineering. Prompt en-\\ngineering is the study of how to design inputs\\n(prompts) that enable LLMs to better perform spe-\\ncific tasks. A prime example of this research\\nis Chain-of-Thought (CoT) (Wei et al., 2022b),\\nwhich proposes CoT prompting that decomposes\\nmulti-step problems into a series of intermedi-\\nate reasoning steps. Moreover, efforts are under-\\nway to replace even such prompt engineering with\\nLLMs (Yang et al., 2023).B.4 Instruction TuningTo enhance the steerability of LLMs, instruction\\ntuning (Wei et al., 2021) has emerged as a learning\\ntechnique. This involves fine-tuning LLMs using\\ndata formatted as (instruction, input, output) for\\nvarious tasks (Wang et al., 2022). Instruction tuning\\nallows for targeted adjustments, providing a more\\ncontrolled and task-oriented improvement to the\\nmodel’s capabilities.Before instruction tuning, existing methods\\nfaced challenges in effectively guiding and control-\\nling the behavior of large language models (Zhang\\net al., 2023). The sheer complexity of these models\\nmade it difficult to ensure precise and task-oriented\\nresponses. The need for a more targeted approach\\narose from the limitations of existing methods, lead-\\ning to the development of instruction tuning. This\\ntargeted approach enables better control over the\\nmodel’s behavior, making it more suitable for spe-\\ncific tasks and improving its overall performance in\\nalignment with user-defined objectives. Therefore,\\ninstruction tuning is computationally efficient and\\nfacilitates the rapid adaptation of LLMs to a spe-\\ncific domain without requiring extensive retraining\\nor architectural changes.B.5 Alignment TuningLLM has been observed to generate sentences that\\nmay be perceived as linguistically incongruent byhuman readers since they learned not human inten-\\ntion, but only vast knowledge across various do-\\nmains in the pretraining step (Ziegler et al., 2019).\\nTo overcome this limitation and align with human\\nintentions, previous research (Ziegler et al., 2019)\\nhave proposed Reinforcement Learning with Hu-\\nman Feedback (RLHF). RLHF operates by learning\\na reward model based on human preferences, em-\\nploying reinforcement learning to guide the LLM\\ntowards prioritizing answers with the highest re-\\nward scores. This process enhances the safety,\\npropriety, and overall quality of the generated re-\\nsponses. 
Despite demonstrating satisfactory per-\\nformance, RLHF encounters challenges such as\\nmanaging numerous hyperparameters and necessi-\\ntating the incorporation of multiple models (policy,\\nvalue, reward, and reference models).In response to these challenges, the supervised\\nfine-tuning based approaches have proposed, such\\nas Rank Responses to align Human Feedback\\n(RRHF) (Yuan et al., 2023), Reward rAnked Fine-\\nTuning (RAFT) (Dong et al., 2023), and Direct\\nPolicy Optimization (DPO) (Intel, 2023). They\\navoid the complexities associated with reinforce-\\nment learning while achieving empirical perfor-\\nmance comparable to RLHF. Among them, DPO\\nthat we used directly guides the LLM to increase\\nthe probability of positive responses and decrease\\nthe probability of negative responses through a \"di-\\nrect\" approach. Interestingly, DPO demonstrates\\nmore stable learning results compared to RLHF,\\ndespite its simple training approach.B.6 Data ContaminationRecent researches (Zhou et al., 2023; Sainz et al.,\\n2023; Golchin and Surdeanu, 2023; Deng et al.,\\n2023) emphasize the need to measure whether a\\nspecific benchmark was used to train the large lan-\\nguage models. There are three types of the data\\ncontamination: guideline, raw text and annota-\\ntion (Sainz et al., 2023). Guideline contamination\\noccurs when a model accesses detailed annotation\\nguidelines for a dataset, providing advantages in\\nspecific tasks, and its impact should be considered,\\nespecially in zero and few-shot evaluations. Raw\\ntext contamination occurs when a model has ac-\\ncess to the original text. Wikipedia is widely used\\nas a pretraining data, but also as a source for cre-\\nating new datasets. The caution is advised in the\\ndevelopment of automatically annotated datasets\\nsourced from the web. Annotation contamina-tion occurs when the annotations of the specific\\nbenchmark are exposed during model training.C Additional InformationWe present additional information for the sake of\\nspace in the main paper.Filtered task names. We present task names\\nwe use to filter FLAN dervied datasets such as\\nOpenOrca in Table 8.Filtered Task Name\\n task228_arc_answer_generation_easy\\n ai2_arcARCChallenge:1.0.0\\n ai2_arcARCEasy:1.0.0\\n task229_arc_answer_generation_hard\\n hellaswag:1.1.0\\n task1389_hellaswag_completion\\n cot_gsm8k\\n cot_gsm8k_ii\\n drop:2.0.0\\n winogrande:1.1.0Table 8: Task names that we use to filter data for FLAN\\nderived datasets such as OpenOrca.ARC HellaSwag MMLU TruthfulQA Winogrande GSM8K\\n 0.06 N/A 0.15 0.28 N/A 0.70Table 9: Data contamination test results for SOLAR\\n10.7B-Instruct. We show ‘result < 0.1, %‘ values where\\na value higher than 0.9 indicates high probability of data\\ncontamination. HellaSwag and Winogrande datasets are\\nnot currently supported. We set SOLAR 10.7B as our\\nreference model when performing the data contamina-\\ntion tests.Results on data contamination. To show the in-\\ntegrity of SOLAR 10.7B-Instruct, we also report\\nthe data contamination test (Shi et al., 2023) results\\nin Table. 9. All four tested benchmark datasets\\nyield results well below the contamination thresh-\\nold, affirming the absence of data contamination\\nin our model. One interesting point is that the\\nvalue for GSM8K is noticeably higher than for\\nother datasets, even without contamination. One\\npotential reason for this is the stronger data similar-\\nity in math-related instruction datasets.'"
- ],
- "application/vnd.google.colaboratory.intrinsic+json": {
- "type": "string"
- }
- },
- "metadata": {},
- "execution_count": 5
- }
- ],
- "source": [
- "text"
- ]
- },
{
"cell_type": "code",
"source": [