TLDR

For simple Wikipedia article Q&A, I compared OpenAI GPT 3.5, FastChat-T5, FLAN-T5-XXL, and FLAN-T5-XL. GPT 3.5 provided the best answers, but FastChat-T5 was very close in performance (with a basic guardrail). The T5 models I tested are all licensed under Apache 2.0, so they are commercially viable.

For the embedding model, I compared OpenAI text-embedding-ada-002 and the open source INSTRUCTOR-XL models. The INSTRUCTOR-XL model performed better, which is encouraging since INSTRUCTOR-XL is also licensed under Apache 2.0. (Update: I just noticed both INSTRUCTOR-XL/LARGE models also perform better on the MTEB Leaderboard)

The code for running these comparisons is available in my GitHub repo, which includes a link to a Colab notebook and an interactive Gradio app.

Intro

The recent developments in open source LLMs to match ChatGPT have been incredibly exciting. There are great places to try out these new LLMs, e.g. on HuggingFace Spaces. Inspired by The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs …, I wanted to try out different LLMs for Q&A on multiple articles with various questions. I was also particularly interested in commercially viable open source LLMs (so not Llama-based, but I’m curious!), since the use cases for these are broader, e.g. document Q&A/chat on internal company documentation. Thus I decided to try out the following open source LLMs and embedding models, using the OpenAI models as a benchmark:

LLMs:

  • OpenAI GPT 3.5 (gpt-3.5-turbo)
  • lmsys/fastchat-t5-3b-v1.0
  • google/flan-t5-xxl
  • google/flan-t5-xl

Embedding models:

  • OpenAI text-embedding-ada-002
  • hkunlp/instructor-xl

There are many more exciting open source + commercially viable models to try out; I just haven’t had time to test them yet 😄

Note on terminology: In this blog post I only refer to generative LLMs as “LLMs”, even though large embedding models can also be considered “Large Language Models”.

Basic document Q&A system

First, below is the basic Wikipedia article Q&A system I built (code). I used Wikipedia-API to search/retrieve/beautify Wikipedia articles, LangChain for the Q&A framework, and OpenAI & HuggingFace models for embeddings and LLMs.

Search and index Wikipedia article

[Diagram: searching for and indexing a Wikipedia article]
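
As a rough illustration, the indexing step could look something like the sketch below. This is not the exact code from the repo: it assumes the Wikipedia-API package, LangChain's text splitter and embedding wrappers, and FAISS as the vector store, with "GPT-4" used only as an example article title.

import wikipediaapi
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

# Fetch the article text (constructor arguments vary slightly by Wikipedia-API version)
wiki = wikipediaapi.Wikipedia(user_agent="wiki-qa-demo", language="en")
page = wiki.page("GPT-4")
assert page.exists()

# Split the article into overlapping chunks for retrieval
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(page.text)

# Embed the chunks and build a vector index (FAISS is just one option here)
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
db = FAISS.from_texts(chunks, embeddings)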

Q&A on article

[Diagram: question answering over the indexed article]
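
The Q&A step then wraps the vector index and an LLM in a retrieval chain. A minimal sketch, assuming the db index built above and LangChain's RetrievalQA chain (the repo may wire this slightly differently):

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Any LangChain-compatible LLM can be plugged in here; GPT 3.5 shown as an example
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",           # stuff the retrieved chunks directly into the prompt
    retriever=db.as_retriever(),  # vector index built in the indexing step above
)
print(qa.run("Who created GPT-4?"))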

Results

In the results below:

  • ❌ = incorrect answer
  • 🤔 = partial credit
  • (no remarks) = correct answer

Embeddings comparison

In these runs, I compared the embedding models used to create the embeddings/vectors for semantic search and retrieval of relevant context. I fixed the LLM to OpenAI GPT 3.5, since it is the strongest LLM in these comparisons. I only present the results for two embedding models below; I did try a few other open source embeddings (listed here), but the results were not good.
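
For reference, switching between the two embedding backends is essentially a one-object change in this kind of setup. A sketch, assuming LangChain's embedding wrappers (the INSTRUCTOR models additionally require the InstructorEmbedding and sentence-transformers packages):

from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings

# Closed source baseline (defaults to text-embedding-ada-002)
ada_embeddings = OpenAIEmbeddings()

# Open source alternative, Apache 2.0 licensed
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# Either object can be passed to the vector store, e.g. FAISS.from_texts(chunks, ...)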

Interestingly, the open source INSTRUCTOR-XL embeddings perform better than the closed source OpenAI ada-002 embeddings.

Article Question hkunlp/instructor-xl text-embedding-ada-002
GPT-4 Who created GPT-4? GPT-4 was created by OpenAI. GPT-4 was created by OpenAI.
GPT-4 How is GPT-4 better than its predecessors? According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task. GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, a significant improvement over GPT-3.5 and GPT-3, which were limited to 4,096 and 2,049 tokens respectively. It is also a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so.
GPT-4 Who played in the finals? There is no information provided in the given context to answer this question. There is no information provided in the context about any finals being played, so it is not possible to answer this question.
2022 FIFA World Cup Where was the event held? Name the city and country The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor. The event was held in Qatar, across five cities. The opening match was held in Al Khor.
2022 FIFA World Cup Who won the tournament? Argentina won the tournament. The context does not provide information on who won the tournament. ❌
2022 FIFA World Cup Who played in the finals? Argentina and France played in the finals. Argentina and France played in the finals.
2022 FIFA World Cup Who had the broadcasting rights? Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup. Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup in Canada and the United States.
2022 FIFA World Cup Which two teams qualified for the knock-out round from Group D? France and Australia qualified for the knock-out round from Group D. The context does not provide information on which two teams qualified for the knock-out round from Group D. ❌
2022 FIFA World Cup How is GPT-4 better than its predecessors? There is no information provided about GPT-4 in the given context, so it is not possible to answer this question. There is no information in the given context about GPT-4 or any comparison with its predecessors.
Diablo IV What classes are available? The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer. The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer.
Diablo IV When is the release date? The release date for Diablo IV is June 6, 2023. The release date for Diablo IV is June 6, 2023.
Diablo IV How is the vanishing gradient issue addressed? There is no information provided about the vanishing gradient issue in the given context. There is no information provided about the vanishing gradient issue in the given context.
Stable Diffusion How does Stable Diffusion work? Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. The model weights have been released publicly and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. The model weights have been released publicly and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM.
Stable Diffusion Who created Stable Diffusion? Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations. Stable Diffusion was developed by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations.
Stable Diffusion Describe the model architecture Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. The Stable Diffusion model consists of a latent diffusion model (LDM) that includes a variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block denoises the output from forward diffusion backwards to obtain a latent representation, and the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. The model was trained on subsets of captioned images using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours.
Stable Diffusion Who played in the finals? There is no information provided about any finals, so it is not possible to answer this question. There is no information provided about any finals being played, so it is not possible to answer this question.

LLMs comparison

To compare different LLMs, I fixed the embeddings model to the INSTRUCTOR-XL model, since it performed better in the comparisons above.
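
Similarly, each LLM is just a different LangChain LLM object dropped into the same Q&A chain. A sketch, assuming the OpenAI chat wrapper and the HuggingFace pipeline wrapper; generation settings are omitted here, see the repo for the exact parameters:

from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFacePipeline

# Closed source benchmark
gpt35 = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Open source models, loaded via a HuggingFace text2text-generation pipeline
fastchat_t5 = HuggingFacePipeline.from_model_id(
    model_id="lmsys/fastchat-t5-3b-v1.0",
    task="text2text-generation",
)
flan_t5_xl = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-xl",
    task="text2text-generation",
)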

Naive comparison

Article Question gpt-3.5-turbo lmsys/fastchat-t5-3b-v1.0 google/flan-t5-xl google/flan-t5-xxl
GPT-4 Who created GPT-4? GPT-4 was created by OpenAI. OpenAI. OpenAI OpenAI
GPT-4 How is GPT-4 better than its predecessors? According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task. GPT-4 is better than its predecessors in several ways:
* It has a larger context window of 8,192 and 32,768 tokens, which allows it to handle more complex instructions
* It is a multimodal model, which allows it to take images as well as text as input
* It has a system message that allows it to specify its tone of voice and task
* It can use APIs when instructed to do so, which allows it to fulfill requests beyond its normal capabilities
GPT-4 is a multimodal model: it can take images as well as text as input GPT-4 is a multimodal model: it can take images as well as text as input
GPT-4 Who played in the finals? There is no information provided in the given context to answer this question. Answer: GPT-4 played in the finals. ❌ unanswerable GPT-4 ❌
2022 FIFA World Cup Where was the event held? Name the city and country The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor. The event was held in Doha, Qatar. 🤔 Qatar 🤔 Qatar 🤔
2022 FIFA World Cup Who won the tournament? Argentina won the tournament. Argentina won the tournament. Argentina Argentina
2022 FIFA World Cup Who played in the finals? Argentina and France played in the finals. Argentina and France. Argentina and France Argentina and France
2022 FIFA World Cup Who had the broadcasting rights? Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup. The broadcasting rights for the 2022 FIFA World Cup were held by Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish). Fox Sports 🤔 Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish)
2022 FIFA World Cup Which two teams qualified for the knock-out round from Group D? France and Australia qualified for the knock-out round from Group D. France and Australia. France and Australia France and Australia
2022 FIFA World Cup How is GPT-4 better than its predecessors? There is no information provided about GPT-4 in the given context, so it is not possible to answer this question. GPT-4 is better than its predecessors in that it is a more advanced version of the GPT series of games. It has more features and is more advanced than its predecessors. It also has more levels and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels ❌ It is more powerful ❌ It is a more powerful version of GPT-3. ❌
Diablo IV What classes are available? The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer Barbarian, Sorceress, Druid, Rogue, and Necromancer
Diablo IV When is the release date? The release date for Diablo IV is June 6, 2023. June 6, 2023. June 6, 2023 June 6, 2023
Diablo IV How is the vanishing gradient issue addressed? There is no information provided about the vanishing gradient issue in the given context. The vanishing gradient issue is addressed by using a more detailed and realistic texture map for the environment. The texture map is a 3D model of the environment that is used to create the textures for the environment. The texture map is created using a combination of textures from the environment and the player’s character. The texture map is then used to create the textures for the environment. This process is repeated for each texture in the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the ❌ Both Blizzard and Nvidia reported in March 2023 that they were working to resolve the issue. ❌ Blizzard and Nvidia reported in March 2023 that they were working to resolve the issue. ❌
Stable Diffusion How does Stable Diffusion work? Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions using a latent diffusion model, a kind of deep generative neural network. It supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, and can also be used to partially alter existing images via inpainting and outpainting. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. Stable Diffusion works by using a latent diffusion model, a kind of deep generative neural network, to generate detailed images conditioned on text descriptions. The model can generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, or existing images can be re-drawn by the model to incorporate new elements described by a text prompt through its diffusion-denoising mechanism. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt 🤔 Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. 🤔
Stable Diffusion Who created Stable Diffusion? Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations. The start-up company Stability AI in collaboration with a number of academic researchers and non-profit organizations. Stability AI Stability AI
Stable Diffusion Describe the model architecture Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder.
Stable Diffusion Who played in the finals? There is no information provided about any finals, so it is not possible to answer this question. I don’t know, I don’t know who played in the finals. Stability AI ❌ Stability AI ❌

Overall, OpenAI GPT 3.5 performed the best. Apart from the “trick questions” (questions whose correct answer should be “I don’t know”), FastChat-T5 came very close to GPT 3.5 on these Q&A tasks. There was a clear gap between FastChat-T5 and the original FLAN-T5 models: FastChat-T5 at 3B parameters performed better than even FLAN-T5-XXL at 11B parameters, which speaks to the importance of the training dataset (and/or training duration?).

However, note the answers produced for the “trick questions”, where the question cannot be answered from the provided context. An example is asking “Who played in the finals?” against the Wikipedia article about GPT-4. The LLM should say the equivalent of “I don’t know”. GPT 3.5 correctly declines to answer these trick questions, but the open source models tend to make things up.

Comparison using question-checking

Can we just dismiss the open source LLMs from here? Not necessarily. I ran the open source LLMs in a straightforward manner, without any special guardrails. Under the hood, I am guessing OpenAI has built guardrails around the raw GPT 3.5 model, so that the results served by the GPT 3.5 API are safer: less prone to hallucination, non-toxic, etc. Returning to the trick questions, one simple guardrail I found was to use the LLM itself to determine whether a question can be answered given the context.

In a separate LLM invocation, which I’ll call the “question-checking” LLM (as opposed to the “question-answering” LLM), we can slightly modify LangChain’s default Q&A prompt to the following:

"""Given the following pieces of context, determine if the question is able to be answered by the information in the context.
Respond with 'yes' or 'no'.
{context}
Question: {question}
"""

The question-checking LLM should respond with “yes” or “no”. If we don’t get a “yes”, we assume the question cannot be answered: any response from the question-answering LLM is ignored, and a simple “I don’t know” is returned to the user.
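
Concretely, the question-checking step can be implemented as a second chain that shares the retrieved context with the question-answering chain. The following is a minimal sketch of the idea (not the exact code from the repo), assuming LangChain's load_qa_chain and a retriever built as in the indexing step:

from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

check_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Given the following pieces of context, determine if the question "
        "is able to be answered by the information in the context.\n"
        "Respond with 'yes' or 'no'.\n"
        "{context}\n"
        "Question: {question}\n"
    ),
)

# llm = any LangChain LLM, e.g. one of the models from the comparison above.
# Same LLM instance, two prompts: one to check answerability, one to answer.
check_chain = load_qa_chain(llm, chain_type="stuff", prompt=check_prompt)
answer_chain = load_qa_chain(llm, chain_type="stuff")  # LangChain's default Q&A prompt

def guarded_answer(question, retriever):
    docs = retriever.get_relevant_documents(question)
    verdict = check_chain.run(input_documents=docs, question=question)
    if "yes" not in verdict.strip().lower():
        return "I don't know"
    return answer_chain.run(input_documents=docs, question=question)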

In my code, the question-checking and question-answering LLMs use the same LLM instance, just with different prompts. Thus, for an answerable question, inference involves two sequential calls to the LLM. If latency is a concern, one can load two LLM instances (they don’t even need to be the same type of LLM) and run the question-checking and question-answering LLMs in parallel. Visually, it may look something like this:

[Diagram: running the question-checking and question-answering LLMs in parallel]

The results are much better:

Article Question gpt-3.5-turbo lmsys/fastchat-t5-3b-v1.0 google/flan-t5-xl google/flan-t5-xxl
GPT-4 Who created GPT-4? GPT-4 was created by OpenAI. OpenAI. OpenAI OpenAI
GPT-4 How is GPT-4 better than its predecessors? According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task. GPT-4 is better than its predecessors in several ways:
* It has a larger context window of 8,192 and 32,768 tokens, which allows it to handle more complex instructions
* It is a multimodal model, which allows it to take images as well as text as input
* It has a system message that allows it to specify its tone of voice and task
* It can use APIs when instructed to do so, which allows it to fulfill requests beyond its normal capabilities
GPT-4 is a multimodal model: it can take images as well as text as input GPT-4 is a multimodal model: it can take images as well as text as input
GPT-4 Who played in the finals? There is no information provided in the given context to answer this question. I don’t know I don’t know I don’t know
2022 FIFA World Cup Where was the event held? Name the city and country The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor. The event was held in Doha, Qatar. 🤔 Qatar 🤔 Qatar 🤔
2022 FIFA World Cup Who won the tournament? Argentina won the tournament. Argentina won the tournament. Argentina Argentina
2022 FIFA World Cup Who played in the finals? Argentina and France played in the finals. Argentina and France. Argentina and France Argentina and France
2022 FIFA World Cup Who had the broadcasting rights? Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup. The broadcasting rights for the 2022 FIFA World Cup were held by Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish). Fox Sports 🤔 Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish)
2022 FIFA World Cup Which two teams qualified for the knock-out round from Group D? France and Australia qualified for the knock-out round from Group D. France and Australia. I don’t know ❌ France and Australia
2022 FIFA World Cup How is GPT-4 better than its predecessors? There is no information provided about GPT-4 in the given context, so it is not possible to answer this question. I don’t know I don’t know I don’t know
Diablo IV What classes are available? The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer Barbarian, Sorceress, Druid, Rogue, and Necromancer
Diablo IV When is the release date? The release date for Diablo IV is June 6, 2023. June 6, 2023. June 6, 2023 June 6, 2023
Diablo IV How is the vanishing gradient issue addressed? There is no information provided about the vanishing gradient issue in the given context. I don’t know I don’t know I don’t know
Stable Diffusion How does Stable Diffusion work? Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. Stable Diffusion works by using a latent diffusion model, a kind of deep generative neural network, to generate detailed images conditioned on text descriptions. The model can generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, or existing images can be re-drawn by the model to incorporate new elements described by a text prompt through its diffusion-denoising mechanism. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt 🤔 Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. 🤔
Stable Diffusion Who created Stable Diffusion? Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations. The start-up company Stability AI in collaboration with a number of academic researchers and non-profit organizations. Stability AI Stability AI
Stable Diffusion Describe the model architecture Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder.
Stable Diffusion Who played in the finals? There is no information provided about any finals, so it is not possible to answer this question. I don’t know I don’t know I don’t know

Now the open source models can admit they don’t know something, instead of making up a silly answer. All the unanswerable questions correctly resulted in the Q&A system returning “I don’t know”, and all but one of the answerable questions were answered. The exception was FLAN-T5-XL on the “2022 FIFA World Cup” article and the question “Which two teams qualified for the knock-out round from Group D?”: the question-checking FLAN-T5-XL determined the question was unanswerable, when in fact it was answerable from the context. Thus, the performance of different LLMs as the question-checker also needs to be considered.

Conclusion

For Q&A over individual Wikipedia articles, the open source FastChat-T5 performs almost as well as GPT 3.5, which is very promising. Granted, this was a relatively simple task, and the questions were pretty straightforward. For more complicated Q&A systems, such as over a collection of documents, I think the challenge is not just in the LLM itself, but in the overall system design.

In terms of the embedding models, this is a win for open source, since the INSTRUCTOR-XL embedding model actually performed better than OpenAI’s text-embedding-ada-002 model. This is good news even for those who wish to use GPT 3.5/4. One can embed a potentially large collection of documents offline, which (1) saves cost, and (2) decouples the vector datastore from third-party embedding changes (e.g. if OpenAI modifies and/or deprecates ada-002, one would otherwise have to re-index the entire collection of documents).
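
As a concrete illustration of that decoupling, a local index built with an open source embedding model can be persisted and reloaded without ever re-embedding the documents. A sketch using FAISS (any persistent vector store would do; "wiki_index" is just an example directory name):

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# One-time, offline: embed the document collection and persist the index locally
db = FAISS.from_texts(chunks, embeddings)   # chunks = pre-split document texts
db.save_local("wiki_index")

# At query time: reload the index without re-embedding anything, independent of
# any third-party embedding API changes or deprecations
db = FAISS.load_local("wiki_index", embeddings)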

Overall, this is a strong showing for open source models on single-document Q&A tasks, and I’m looking forward to further developments in the open source landscape 🚀