TLDR

For simple Wikipedia article Q&A, I compared OpenAI GPT 3.5, FastChat-T5, FLAN-T5-XXL, and FLAN-T5-XL. GPT 3.5 provided the best answers, but FastChat-T5 was very close in performance (with a basic guardrail). The T5 models I tested are all licensed under Apache 2.0, so they are commercially viable.

For the embedding model, I compared OpenAI text-embedding-ada-002 and the open source INSTRUCTOR-XL models. The INSTRUCTOR-XL model performed better, which is encouraging since INSTRUCTOR-XL is also licensed under Apache 2.0. (Update: I just noticed both INSTRUCTOR-XL/LARGE models also perform better on the MTEB Leaderboard)

The code for running these comparisons is available in my GitHub repo, which includes a link to a Colab notebook and an interactive Gradio app.

Intro

The recent developments in open source LLMs to match ChatGPT have been incredibly exciting. There are great places to try out these new LLMs, e.g. on HuggingFace Spaces. Inspired by The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs …, I wanted to try out different LLMs for Q&A on multiple articles with various questions. I was also particularly interested in commercially viable open source LLMs (so not Llama-based, but I’m curious!), since the use cases for these are broader, e.g. document Q&A/chat on internal company documentation. Thus I decided to try out the following open source LLMs and embedding models, using the OpenAI models as a benchmark:

LLMs:

  • OpenAI GPT 3.5 (gpt-3.5-turbo)
  • lmsys/fastchat-t5-3b-v1.0
  • google/flan-t5-xxl
  • google/flan-t5-xl

Embedding models:

  • OpenAI text-embedding-ada-002
  • hkunlp/instructor-xl

There are many more exciting open source + commercially viable models to try out; I just haven’t had time to test them yet 😄

Note on terminology: In this blog post I only refer to generative LLMs as “LLMs”, even though large embedding models can also be considered “Large Language Models”.

Basic document Q&A system

First, below is the basic Wikipedia article Q&A system I built (code). I used Wikipedia-API to search/retrieve/beautify Wikipedia articles, LangChain for the Q&A framework, and OpenAI & HuggingFace models for embeddings and LLMs.

Search and index Wikipedia article

[Diagram: searching for and indexing a Wikipedia article]
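
As a rough illustration, the indexing step could look something like the sketch below. This is not the exact code from the repo: it assumes the Wikipedia-API package, LangChain's text splitter and embedding wrappers, and FAISS as the vector store, with "GPT-4" used only as an example article title.

import wikipediaapi
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

# Fetch the article text (constructor arguments vary slightly by Wikipedia-API version)
wiki = wikipediaapi.Wikipedia(user_agent="wiki-qa-demo", language="en")
page = wiki.page("GPT-4")
assert page.exists()

# Split the article into overlapping chunks for retrieval
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(page.text)

# Embed the chunks and build a vector index (FAISS is just one option here)
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
db = FAISS.from_texts(chunks, embeddings)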

Q&A on article

[Diagram: question answering over the indexed article]
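
The Q&A step then wraps the vector index and an LLM in a retrieval chain. A minimal sketch, assuming the db index built above and LangChain's RetrievalQA chain (the repo may wire this slightly differently):

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Any LangChain-compatible LLM can be plugged in here; GPT 3.5 shown as an example
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",           # stuff the retrieved chunks directly into the prompt
    retriever=db.as_retriever(),  # vector index built in the indexing step above
)
print(qa.run("Who created GPT-4?"))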

Results

In the results below:

  • ❌ = incorrect answer
  • 🤔 = partial credit
  • (no remarks) = correct answer

Embeddings comparison

In these runs, I compared the embedding models used to create the embeddings/vectors for semantic search and retrieval of relevant context. I fixed the LLM to OpenAI GPT 3.5, since it is the strongest LLM in these comparisons. I only present the results for two embedding models below; I did try a few other open source embeddings (listed here), but the results were not good.
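
For reference, switching between the two embedding backends is essentially a one-object change in this kind of setup. A sketch, assuming LangChain's embedding wrappers (the INSTRUCTOR models additionally require the InstructorEmbedding and sentence-transformers packages):

from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings

# Closed source baseline (defaults to text-embedding-ada-002)
ada_embeddings = OpenAIEmbeddings()

# Open source alternative, Apache 2.0 licensed
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# Either object can be passed to the vector store, e.g. FAISS.from_texts(chunks, ...)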

Interestingly, the open source INSTRUCTOR-XL embeddings perform better than the closed source OpenAI ada-002 embeddings.

Article Question hkunlp/instructor-xl text-embedding-ada-002
GPT-4 Who created GPT-4? GPT-4 was created by OpenAI. GPT-4 was created by OpenAI.
GPT-4 How is GPT-4 better than its predecessors? According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task. GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, a significant improvement over GPT-3.5 and GPT-3, which were limited to 4,096 and 2,049 tokens respectively. It is also a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so.
GPT-4 Who played in the finals? There is no information provided in the given context to answer this question. There is no information provided in the context about any finals being played, so it is not possible to answer this question.
2022 FIFA World Cup Where was the event held? Name the city and country The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor. The event was held in Qatar, across five cities. The opening match was held in Al Khor.
2022 FIFA World Cup Who won the tournament? Argentina won the tournament. The context does not provide information on who won the tournament. ❌
2022 FIFA World Cup Who played in the finals? Argentina and France played in the finals. Argentina and France played in the finals.
2022 FIFA World Cup Who had the broadcasting rights? Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup. Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup in Canada and the United States.
2022 FIFA World Cup Which two teams qualified for the knock-out round from Group D? France and Australia qualified for the knock-out round from Group D. The context does not provide information on which two teams qualified for the knock-out round from Group D. ❌
2022 FIFA World Cup How is GPT-4 better than its predecessors? There is no information provided about GPT-4 in the given context, so it is not possible to answer this question. There is no information in the given context about GPT-4 or any comparison with its predecessors.
Diablo IV What classes are available? The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer. The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer.
Diablo IV When is the release date? The release date for Diablo IV is June 6, 2023. The release date for Diablo IV is June 6, 2023.
Diablo IV How is the vanishing gradient issue addressed? There is no information provided about the vanishing gradient issue in the given context. There is no information provided about the vanishing gradient issue in the given context.
Stable Diffusion How does Stable Diffusion work? Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. The model weights have been released publicly and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. The model weights have been released publicly and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM.
Stable Diffusion Who created Stable Diffusion? Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations. Stable Diffusion was developed by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations.
Stable Diffusion Describe the model architecture Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. The Stable Diffusion model consists of a latent diffusion model (LDM) that includes a variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block denoises the output from forward diffusion backwards to obtain a latent representation, and the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. The model was trained on subsets of captioned images using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours.
Stable Diffusion Who played in the finals? There is no information provided about any finals, so it is not possible to answer this question. There is no information provided about any finals being played, so it is not possible to answer this question.

LLMs comparison

To compare different LLMs, I fixed the embeddings model to the INSTRUCTOR-XL model, since it performed better in the comparisons above.
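
Similarly, each LLM is just a different LangChain LLM object dropped into the same Q&A chain. A sketch, assuming the OpenAI chat wrapper and the HuggingFace pipeline wrapper; generation settings are omitted here, see the repo for the exact parameters:

from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFacePipeline

# Closed source benchmark
gpt35 = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Open source models, loaded via a HuggingFace text2text-generation pipeline
fastchat_t5 = HuggingFacePipeline.from_model_id(
    model_id="lmsys/fastchat-t5-3b-v1.0",
    task="text2text-generation",
)
flan_t5_xl = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-xl",
    task="text2text-generation",
)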

Naive comparison

Article Question gpt-3.5-turbo lmsys/fastchat-t5-3b-v1.0 google/flan-t5-xl google/flan-t5-xxl
GPT-4 Who created GPT-4? GPT-4 was created by OpenAI. OpenAI. OpenAI OpenAI
GPT-4 How is GPT-4 better than its predecessors? According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task. GPT-4 is better than its predecessors in several ways:
* It has a larger context window of 8,192 and 32,768 tokens, which allows it to handle more complex instructions
* It is a multimodal model, which allows it to take images as well as text as input
* It has a system message that allows it to specify its tone of voice and task
* It can use APIs when instructed to do so, which allows it to fulfill requests beyond its normal capabilities
GPT-4 is a multimodal model: it can take images as well as text as input GPT-4 is a multimodal model: it can take images as well as text as input
GPT-4 Who played in the finals? There is no information provided in the given context to answer this question. Answer: GPT-4 played in the finals. ❌ unanswerable GPT-4 ❌
2022 FIFA World Cup Where was the event held? Name the city and country The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor. The event was held in Doha, Qatar. 🤔 Qatar 🤔 Qatar 🤔
2022 FIFA World Cup Who won the tournament? Argentina won the tournament. Argentina won the tournament. Argentina Argentina
2022 FIFA World Cup Who played in the finals? Argentina and France played in the finals. Argentina and France. Argentina and France Argentina and France
2022 FIFA World Cup Who had the broadcasting rights? Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup. The broadcasting rights for the 2022 FIFA World Cup were held by Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish). Fox Sports 🤔 Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish)
2022 FIFA World Cup Which two teams qualified for the knock-out round from Group D? France and Australia qualified for the knock-out round from Group D. France and Australia. France and Australia France and Australia
2022 FIFA World Cup How is GPT-4 better than its predecessors? There is no information provided about GPT-4 in the given context, so it is not possible to answer this question. GPT-4 is better than its predecessors in that it is a more advanced version of the GPT series of games. It has more features and is more advanced than its predecessors. It also has more levels and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels of difficulty. It also has more levels of difficulty and more levels ❌ It is more powerful ❌ It is a more powerful version of GPT-3. ❌
Diablo IV What classes are available? The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer Barbarian, Sorceress, Druid, Rogue, and Necromancer
Diablo IV When is the release date? The release date for Diablo IV is June 6, 2023. June 6, 2023. June 6, 2023 June 6, 2023
Diablo IV How is the vanishing gradient issue addressed? There is no information provided about the vanishing gradient issue in the given context. The vanishing gradient issue is addressed by using a more detailed and realistic texture map for the environment. The texture map is a 3D model of the environment that is used to create the textures for the environment. The texture map is created using a combination of textures from the environment and the player’s character. The texture map is then used to create the textures for the environment. This process is repeated for each texture in the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the textures for the environment. The texture map is then used to create the ❌ Both Blizzard and Nvidia reported in March 2023 that they were working to resolve the issue. ❌ Blizzard and Nvidia reported in March 2023 that they were working to resolve the issue. ❌
Stable Diffusion How does Stable Diffusion work? Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions using a latent diffusion model, a kind of deep generative neural network. It supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, and can also be used to partially alter existing images via inpainting and outpainting. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. Stable Diffusion works by using a latent diffusion model, a kind of deep generative neural network, to generate detailed images conditioned on text descriptions. The model can generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, or existing images can be re-drawn by the model to incorporate new elements described by a text prompt through its diffusion-denoising mechanism. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt 🤔 Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. 🤔
Stable Diffusion Who created Stable Diffusion? Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations. The start-up company Stability AI in collaboration with a number of academic researchers and non-profit organizations. Stability AI Stability AI
Stable Diffusion Describe the model architecture Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder.
Stable Diffusion Who played in the finals? There is no information provided about any finals, so it is not possible to answer this question. I don’t know, I don’t know who played in the finals. Stability AI ❌ Stability AI ❌

Overall, OpenAI GPT 3.5 performed the best. Apart from the “trick questions” (questions whose correct answer should be “I don’t know”), FastChat-T5 came very close to GPT 3.5 on these Q&A tasks. There was a clear gap between FastChat-T5 and the original FLAN-T5 models: FastChat-T5 at 3B parameters performed better than even FLAN-T5-XXL at 11B parameters, which speaks to the importance of the training dataset (and/or training duration?).

However, note the answers produced for the “trick questions”, where the question cannot be answered from the provided context. An example is asking “Who played in the finals?” against the Wikipedia article about GPT-4. The LLM should say the equivalent of “I don’t know”. GPT 3.5 correctly declines to answer these trick questions, but the open source models tend to make things up.

Comparison using question-checking

Can we just dismiss the open source LLMs from here? Not necessarily. I ran the open source LLMs in a straightforward manner, without any special guardrails. Under the hood, I am guessing OpenAI has built guardrails around the raw GPT 3.5 model, so that the results served by the GPT 3.5 API are safer: less prone to hallucination, non-toxic, etc. Returning to the trick questions, one simple guardrail I found was to use the LLM itself to determine whether a question can be answered given the context.

In a separate LLM invocation, which I’ll call the “question-checking” LLM (as opposed to the “question-answering” LLM), we can slightly modify LangChain’s default Q&A prompt to the following:

"""Given the following pieces of context, determine if the question is able to be answered by the information in the context.
Respond with 'yes' or 'no'.
{context}
Question: {question}
"""

The question-checking LLM should respond with “yes” or “no”. If we don’t get a “yes”, we assume the question cannot be answered: any response from the question-answering LLM is ignored, and a simple “I don’t know” is returned to the user.
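
Concretely, the question-checking step can be implemented as a second chain that shares the retrieved context with the question-answering chain. The following is a minimal sketch of the idea (not the exact code from the repo), assuming LangChain's load_qa_chain and a retriever built as in the indexing step:

from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

check_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Given the following pieces of context, determine if the question "
        "is able to be answered by the information in the context.\n"
        "Respond with 'yes' or 'no'.\n"
        "{context}\n"
        "Question: {question}\n"
    ),
)

# llm = any LangChain LLM, e.g. one of the models from the comparison above.
# Same LLM instance, two prompts: one to check answerability, one to answer.
check_chain = load_qa_chain(llm, chain_type="stuff", prompt=check_prompt)
answer_chain = load_qa_chain(llm, chain_type="stuff")  # LangChain's default Q&A prompt

def guarded_answer(question, retriever):
    docs = retriever.get_relevant_documents(question)
    verdict = check_chain.run(input_documents=docs, question=question)
    if "yes" not in verdict.strip().lower():
        return "I don't know"
    return answer_chain.run(input_documents=docs, question=question)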

In my code, the question-checking and question-answering LLMs use the same LLM instance, just with different prompts. Thus, for an answerable question, inference involves two sequential calls to the LLM. If latency is a concern, one can load two LLM instances (they don’t even need to be the same type of LLM) and run the question-checking and question-answering LLMs in parallel. Visually, it may look something like this:

[Diagram: running the question-checking and question-answering LLMs in parallel]

The results are much better:

Article Question gpt-3.5-turbo lmsys/fastchat-t5-3b-v1.0 google/flan-t5-xl google/flan-t5-xxl
GPT-4 Who created GPT-4? GPT-4 was created by OpenAI. OpenAI. OpenAI OpenAI
GPT-4 How is GPT-4 better than its predecessors? According to OpenAI, GPT-4 is “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” It has larger context windows of 8,192 and 32,768 tokens, is a multimodal model that can take images as well as text as input, and can use APIs when instructed to do so. It also introduced the “system message” to gain further control over its tone of voice and task. GPT-4 is better than its predecessors in several ways:
* It has a larger context window of 8,192 and 32,768 tokens, which allows it to handle more complex instructions
* It is a multimodal model, which allows it to take images as well as text as input
* It has a system message that allows it to specify its tone of voice and task
* It can use APIs when instructed to do so, which allows it to fulfill requests beyond its normal capabilities
GPT-4 is a multimodal model: it can take images as well as text as input GPT-4 is a multimodal model: it can take images as well as text as input
GPT-4 Who played in the finals? There is no information provided in the given context to answer this question. I don’t know I don’t know I don’t know
2022 FIFA World Cup Where was the event held? Name the city and country The event was held in Qatar, with matches played in eight venues across five cities. The opening ceremony took place at the Al Bayt Stadium in Al Khor. The event was held in Doha, Qatar. 🤔 Qatar 🤔 Qatar 🤔
2022 FIFA World Cup Who won the tournament? Argentina won the tournament. Argentina won the tournament. Argentina Argentina
2022 FIFA World Cup Who played in the finals? Argentina and France played in the finals. Argentina and France. Argentina and France Argentina and France
2022 FIFA World Cup Who had the broadcasting rights? Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish) had the broadcasting rights for the 2022 FIFA World Cup. The broadcasting rights for the 2022 FIFA World Cup were held by Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish). Fox Sports 🤔 Bell Media (Canada), Fox (U.S. English), and NBCUniversal (U.S. Spanish)
2022 FIFA World Cup Which two teams qualified for the knock-out round from Group D? France and Australia qualified for the knock-out round from Group D. France and Australia. I don’t know ❌ France and Australia
2022 FIFA World Cup How is GPT-4 better than its predecessors? There is no information provided about GPT-4 in the given context, so it is not possible to answer this question. I don’t know I don’t know I don’t know
Diablo IV What classes are available? The five classes available in Diablo IV are Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer. Barbarian, Sorceress, Druid, Rogue, and Necromancer Barbarian, Sorceress, Druid, Rogue, and Necromancer
Diablo IV When is the release date? The release date for Diablo IV is June 6, 2023. June 6, 2023. June 6, 2023 June 6, 2023
Diablo IV How is the vanishing gradient issue addressed? There is no information provided about the vanishing gradient issue in the given context. I don’t know I don’t know I don’t know
Stable Diffusion How does Stable Diffusion work? Stable Diffusion is a deep learning, text-to-image model that generates detailed images conditioned on text descriptions. It is a latent diffusion model, a kind of deep generative neural network, and supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB VRAM. Stable Diffusion works by using a latent diffusion model, a kind of deep generative neural network, to generate detailed images conditioned on text descriptions. The model can generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output, or existing images can be re-drawn by the model to incorporate new elements described by a text prompt through its diffusion-denoising mechanism. It also allows the use of prompts to partially alter existing images via inpainting and outpainting. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt 🤔 Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. 🤔
Stable Diffusion Who created Stable Diffusion? Stable Diffusion was created by the start-up company Stability AI in collaboration with academic researchers and non-profit organizations. The start-up company Stability AI in collaboration with a number of academic researchers and non-profit organizations. Stability AI Stability AI
Stable Diffusion Describe the model architecture Stable Diffusion uses a latent diffusion model (LDM) architecture developed by the CompVis group at LMU Munich. It consists of three parts: a variational autoencoder (VAE) encoder, a U-Net block composed of a ResNet backbone, and a VAE decoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, while the U-Net block denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The model can be flexibly conditioned on a string of text, an image, or another modality, and the encoded conditioning data is exposed to denoising U-Nets via a cross-attention mechanism. Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM) developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder.
Stable Diffusion Who played in the finals? There is no information provided about any finals, so it is not possible to answer this question. I don’t know I don’t know I don’t know

Now the open source models can admit they don’t know something, instead of making up a silly answer. All the unanswerable questions correctly resulted in the Q&A system returning “I don’t know”, and all but one of the answerable questions were answered. The exception was FLAN-T5-XL on the “2022 FIFA World Cup” article and the question “Which two teams qualified for the knock-out round from Group D?”: the question-checking FLAN-T5-XL determined the question was unanswerable, when in fact it was answerable from the context. Thus, the performance of different LLMs as the question-checker also needs to be considered.

Conclusion

For Q&A over individual Wikipedia articles, the open source FastChat-T5 performs almost as well as GPT 3.5, which is very promising. Granted, this was a relatively simple task, and the questions were pretty straightforward. For more complicated Q&A systems, such as over a collection of documents, I think the challenge is not just in the LLM itself, but in the overall system design.

In terms of the embedding models, this is a win for open source, since the INSTRUCTOR-XL embedding model actually performed better than OpenAI’s text-embedding-ada-002 model. This is good news even for those who wish to use GPT 3.5/4. One can embed a potentially large collection of documents offline, which (1) saves cost, and (2) decouples the vector datastore from third-party embedding changes (e.g. if OpenAI modifies and/or deprecates ada-002, one would otherwise have to re-index the entire collection of documents).
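
As a concrete illustration of that decoupling, a local index built with an open source embedding model can be persisted and reloaded without ever re-embedding the documents. A sketch using FAISS (any persistent vector store would do; "wiki_index" is just an example directory name):

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# One-time, offline: embed the document collection and persist the index locally
db = FAISS.from_texts(chunks, embeddings)   # chunks = pre-split document texts
db.save_local("wiki_index")

# At query time: reload the index without re-embedding anything, independent of
# any third-party embedding API changes or deprecations
db = FAISS.load_local("wiki_index", embeddings)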

Overall, this is a strong showing for open source models on single-document Q&A tasks, and I’m looking forward to further developments in the open source landscape 🚀