
RAG Fundamentals and Advanced Techniques – Full Course

freeCodeCamp.org
This course will guide you through the basics of Retrieval-Augmented Generation (RAG), starting with its fundamental concepts and components. You'll learn how to build a RAG system for chatting with documents, explore advanced techniques, and understand the pitfalls of naive RAG.

✏️ Course created by @vincibits

For the first fundamentals of the RAG part of the course, here's the GitHub link: https://github.com/pdichone/rag-intro-chat-with-docs
For the advanced RAG techniques part of the course, here's the GitHub link: https://github.com/pdichone/advanced-rag-techniques

❤️ Try interactive AI courses we love, right in your browser: https://scrimba.com/freeCodeCamp-AI (Made possible by a grant from our friends at Scrimba)

⭐️ Contents ⭐️
⌨️ (0:00:00) Intro
⌨️ (0:02:22) RAG Fundamentals
⌨️ (0:03:21) Components of RAG
⌨️ (0:05:56) RAG Deep Dive
⌨️ (0:07:56) Building a RAG System - Build an Application for Chatting with Our Documents
⌨️ (0:32:52) Using Advanced RAG Techniques - Overview
⌨️ (0:36:07) Naive RAG Overview and Its Pitfalls
⌨️ (0:42:16) Naive RAG Drawbacks Breakdown
⌨️ (0:48:28) Advanced RAG Techniques as the Solution - Query Expansion with Generated Answers
⌨️ (0:54:23) Query Expansion with Generated Answers - Hands-on
⌨️ (1:16:21) Query Expansion Summary
⌨️ (1:17:44) Query Expansion with Multiple Queries - Overview
⌨️ (1:20:57) Query Expansion with Multiple Queries - Hands-on
⌨️ (1:33:27) Your Turn - Challenge
⌨️ (1:35:19) The End - Next Steps

🎉 Thanks to our Champion and Sponsor supporters:
👾 davthecoder
👾 jedi-or-sith
👾 南宮千影
👾 Agustín Kussrow
👾 Nattira Maneerat
👾 Heather Wcislo
👾 Serhiy Kalinets
👾 Justin Hual
👾 Otis Morgan
👾 Oscar Rahnama

--

Learn to code for free and get a developer job: https://www.freecodecamp.org
Read hundreds of articles on programming: https://freecodecamp.org/news
Hosts: Paulo, Course Introduction
📅August 01, 2024
⏱️01:36:48
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "RAG Fundamentals and Advanced Techniques – Full Course" from "freeCodeCamp.org". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=ea2W8IogX80

00:00:00Course Introduction

This course will guide you through the basics of Retrieval Augmented Generation, or RAG, starting with its fundamental concepts and components. You'll learn how to build a RAG system for chatting with documents, explore advanced techniques, and understand the pitfalls of naive RAG.

00:00:20Course Introduction

Paulo created this course. He is a senior software engineer and experienced teacher.

00:00:25Paulo

In this video, I'm going to go through a quick introduction of RAG. RAG stands for Retrieval Augmented Generation. Now, if you have never heard of RAG, no worries. That's what I'm going to be covering in this video. The main idea is that a large language model is essentially a model that was trained on certain data. So for instance, if you go to ChatGPT and you type in, "What is the capital of France?" it will of course give you the capital of France, because it was trained on information including, in this case, the capitals of countries in the world.

00:01:08Paulo

But if you were to ask ChatGPT, "What is the name of my first dog?" of course, ChatGPT wouldn't know because it's using that large language model, the model that was trained on something that is not related to your information—information that is particular to you, that is specific to you. And that, of course, is a problem. And RAG essentially allows us to take our own information, our own data—databases, video, textual information, raw data, or unstructured data as they call it—and sort of inject it to the large language model.

00:01:42Paulo

So now the large language model has more information, including your own information. And so now when you ask questions related to your specific data, you are able to get the answer from the large language model because it's able to connect to your data that you have injected. Happy day, so you get the right answer. So that is the idea of RAG. That's what we're going to be doing in this mini-course or in this video, and I hope you enjoy it.

00:02:09Paulo

All right, let's go ahead and get started.

00:02:11Paulo

In order for you to follow along in this course, you need to have your development environment set up. Particularly, I expect you to, of course, have Python set up on your machine, also VS Code or any other code editor of your preference, but I will be using VS Code, so I would encourage you to also use it, but that is not a requirement.

00:02:32Paulo

Also, make sure that you have an OpenAI account, which means you also need to have an API key. That way you're able to follow along if you want to actually do the hands-on with me, which I believe—I want to believe—that's what you're going to be doing. So go ahead and have all those things set up, and we should be good.

00:02:51Paulo

Now, for you to set up the OpenAI account, again, you can just go to openai.com and go through the process if you haven't done that already, and just set yourself up, create an account, and create an OpenAI API key, which we'll be using in this course. And if you are wanting to install Python and you don't have Python installed, it's very simple. Just follow this link, and they have all of the things or all of the directions you will need to set up Python on your machine. So I encourage you to go through that and have everything set up. Okay, so they have Python for Windows, Mac, Linux, and everything. This is all in case you don't have anything set up, but go ahead and do that if you don't have that set up, and I'll see you next.

00:03:37Paulo

All right, so let's go ahead and start doing the deep dive on RAG. So I know that most of you who are here may already know what RAG is, and that's wonderful. But I'm going to do just a quick deep dive overview so that we have some sort of a summary overview again of what RAG is. So we're going to look at what is RAG, the motivation behind RAG, and also advantages.

00:04:04Paulo

Now, what is RAG? RAG stands for Retrieval Augmented Generation. The key points here are retrieval, augmented, and generation. So the idea is that we have a system that retrieves information, we have a way of augmenting whatever we are passing through, and we then push that information into a machine, quote-unquote, that will generate a result.

00:04:37Paulo

So RAG has two main components. The first is the retriever: it identifies and retrieves relevant documents. And then we have the generator: it takes the retrieved documents and the input query to generate coherent and contextually relevant responses, because that is the whole idea, to get coherent and contextually relevant responses. These are the main components of RAG, but we still haven't defined RAG, really.

00:05:05Paulo

So what is RAG? The definition will go as follows: a framework that combines the strengths of retrieval-based systems and generation-based models to produce more accurate and contextually relevant responses. And we have the keys again, the keywords: contextually relevant response. That is the whole goal of RAG. Okay, that sounds great. But translating all of that, we would say an efficient way to customize an LLM, a large language model, with your own data.

00:05:35Paulo

Well, what that means is, what are we doing really? Is that, as we know, a large language model like GPT and many others out there, they only know so much. Okay, so what we're doing is we are injecting our own data into this large language model so that it knows more than the things that it was trained on. So now, the large language model is going to know about specific contextual data in addition to what it was trained on.

00:06:03Paulo

Let's look at an overview of RAG. We have documents. These documents are cut into small chunks, and then these chunks are put through an embedding model to create embeddings. And those embeddings are then stored. Okay.

00:06:22Paulo

So now, the question or the query comes in, goes through the same process, transforms to embedding, and then we have this embedding which then is used to go ahead and find in our retrieval system, in our vector database, most similar items. Which then is pushed into a general large language model, which knows how to take that information—in this case, the most similar results with the question, in this case the prompt—and get the response that is needed, that we're looking for.

00:06:53Paulo

So that is how RAG works. Notice here when we say RAG—Retrieval Augmented Generation—that means that the generated response is augmented by the data retrieved from the documents in our case, hence the name RAG.

00:07:09Paulo

So really, if you want to do a deep dive into naive RAG, this is what happens. We have the documents, and these documents go through a phase of parsing and pre-processing. Essentially, we cut them up into smaller pieces. This is the chunking process. Then those smaller chunks are passed through the embedding model to create vectors out of them. Okay, so we're vectorizing those chunks, and that is what is saved into a vector store or a vector database.

00:07:41Paulo

So this is the part of indexing that happens here. Of course, this is the indexing part, as I have shown you. This is the part where we cut the documents and pre-process everything and chunk it up, and then create those embeddings or vectorize those chunks and save them into a vector store. And then what happens is, then we have a user who has a query or question of some sort, and that also has to go through the embedding model to vectorize that query. And then that is actually what is sent to search into the vector database. So we have vectors and vectors that are easily used in a vector database to do all sorts of things, mainly to search.

00:08:27Paulo

And then the information is retrieved. The relevant documents are retrieved and packed up, in this case, together with a prompt and the query. But notice here, this is the part that's different: this is the augmentation phase of RAG. So we're augmenting, we're adding something to what we had before. So not only do we have a query, but we also have a prompt, which is part of the query, and relevant documents and so forth. Okay. So once that is augmented, we pass that information through a large language model—so it could be any kind of large language model—and then that's when the response is generated, which is then returned to the user.

00:09:10Paulo

All right, so now you have the basics of understanding what RAG is, how RAG really works. The idea is that we have our own documents, we're going to go through the process of extracting those documents, splitting those up, and then passing them through the large language model. Of course, we're going to be saving that into a vector database. Now, if you don't know what a vector database is, I actually have yet another video where I talk about vector databases somewhere at the top here. Okay, so go ahead and check that out.

00:09:39Paulo

So we're going to do a hands-on here where I'm going to show you how to use RAG to create a system, a RAG system, that allows us to pass through some documents. In this case, we're going to be using a bunch of articles that we're going to be reading in, saving those to a vector database, and then form a RAG to start conversing, or in this case, querying our documents so we can ask questions and get the correct answers along with the large language model.

00:10:11Paulo

In this demonstration here, I'm going to be using OpenAI, which means that you need to go and have an OpenAI API key for you to be able to do this with me. Now, if you don't want to use OpenAI, you can use other large language models out there, and things will be a little bit different, of course, but the main idea is going to be the same. All right, okay, let's go ahead and get started.

00:10:32Paulo

And so I have this project called RAG intro, and I have a few things here. One of the important things is that you have here the OpenAI API key. So you need to get that, you need to have that. Then, of course, I have the app.py, which is empty at this point. So this is where we're going to start doing our magic.

00:10:51Paulo

Now, before we do that, I need to make sure that I have a virtual environment. If you want to learn more about Python, I have a full video of one hour or so that you can go ahead and check it out also. You'll see somewhere here. All right, or you can search on my channel, you will find that. All right, so I have my virtual environment created there. And let's go ahead and say source venv and activate that real quick. So we have that set up. So now it's active.

00:11:21Paulo

We are going to install a few dependencies. The first one that I need here, let's see, I have my cheat sheet here. So the first one that we need is python-dotenv. So pip install... I'm going to pass that. This is going to allow us to retrieve information from our environment (.env) file.

00:11:44Paulo

Okay, and then next I'm going to get the OpenAI because we're going to be using OpenAI. So I say pip install openai. Okay. And because of the nature of large language models and RAG systems, we need to save this information, this data that we're going to split up, these documents, into a vector database. If you don't know what a vector database is, I do have a course that talks about vector databases. And so there are many kinds of vector databases. We're going to be using ChromaDB, which is light and easy to use. So I'm going to say pip install chromadb as such. So we have that set up for us.

00:12:30Paulo

I'm going to go ahead and import a few things that I need here. Now, just to make sure that this is fast, I'm not going to type everything because you should have access to this code. The OS, because we're going to be needing that to access operating system folders and files and so forth and other functions. I have ChromaDB, and I have dotenv here to load all of our environment variables. And of course, I'm importing EmbeddingFunctions. We're going to use that to create embeddings, because those are the representations of our data that need to go, that we need to have in order to put that into our database, vector database. Of course, we have OpenAI here, which we're going to be using soon.
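For reference, a minimal sketch of those imports (module paths follow the current chromadb and openai v1 packages; the course repo may differ slightly):

```python
import os

import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv
from openai import OpenAI
```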

00:13:15Paulo

All right. So next, what we'll do here, I'm going to load all of our environment variables. And then I'm going to set up the OpenAI key from my environment variable, as I said, from our environment file there. And then what we're going to do is we're going to create the function, the embedding function. This is what's going to allow us to create those embeddings. Again, once we chop up all of our data, which I'm going to show you in a second here, we want to transform that into embeddings, these zeros and ones, vector space. And then that is what's going to be saved into the vector database, the Chroma vector database.

00:14:03Paulo

And when instantiating this embedding function, you need to pass the API key, the OpenAI key, and it needs to know what model is going to be used to do that. So we're going to pass the actual model name, which is going to be text-embedding-3-small. This is just a very small, light embedding model that allows us to create embeddings.
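A sketch of that setup, assuming the key lives in the .env file under the name OPENAI_API_KEY:

```python
# Load environment variables and read the OpenAI key from the .env file
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

# Embedding function backed by OpenAI's text-embedding-3-small model
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_key,
    model_name="text-embedding-3-small",
)
```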

00:14:24Paulo

And next, I'm going to go ahead and, of course, initialize the Chroma client persistence. So I want to be able to persist, or in this case, I want to be able to save the actual database. Now, looking at the data here, you see that I also have these news articles. So this is where I have all of these news articles that I found online. This is what we're going to be using as the documents. And then we are going to chop it all up, put it into a database, right? Not a normal database, this is going to be a vector database. And then we're going to use other techniques to start conversing, talking, and getting the documents that we need to answer the questions that we are asking.

00:15:09Paulo

So what are we doing here? We are initializing the Chroma client. You can see it's very simple really. You say chromadb.PersistentClient, and then we pass the path we want this to have. Now, I said chroma_persistent_storage. This is kind of long, but you can make it shorter if you want. And then the collection name, which can be whatever name we want. And then we actually call get_or_create_collection, which allows us to create the actual collection. A collection is just like a table where we can put all of these documents. Okay.

00:15:45Paulo

And then we need to pass the embedding function. Notice that now we are passing what we instantiated at the top here, the actual OpenAI embedding function that is going to allow us to create those embeddings, right? The vector embeddings along with the collection name. There we go. So now we have this collection that indeed we created with Chroma.
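Roughly, the client and collection setup looks like this (the collection name here is just an illustrative choice):

```python
# Persistent Chroma client that stores its data on disk
chroma_client = chromadb.PersistentClient(path="chroma_persistent_storage")

# Create (or fetch) a collection that uses the OpenAI embedding function
collection = chroma_client.get_or_create_collection(
    name="document_qa_collection",
    embedding_function=openai_ef,
)
```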

00:16:11Paulo

All right, so let's go ahead and create our client. This is our OpenAI client. We pass the API key and the OpenAI key, of course. So now we have our client. We can do all sorts of things, meaning we can, for instance, say client.chat.completions and I can go and create. And then here I can pass a few things, such as the model, I believe. Let's say model and I'm going to say gpt-3.5-turbo, and I can pass messages and roles and everything.

00:16:45Paulo

In fact, let's just do that real quick here so you can see this client working. You can see we have the messages, and the system says, 'You're a helpful assistant. What is human life expectancy in the United States?' Well, that's pretty good. Let's go ahead and see if this works, making sure that you have everything set up, of course. So I'm going to say print... actually let me put this... say res like this, and say I can say res.choices[0].message.content. All right, so if I go ahead and run this, I should be able to get something. So Python like this. So it should let us know that indeed we have everything set up and we should get some results in a second.
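A hedged sketch of that smoke test; splitting the instruction and the question into separate system and user messages is my structuring, not necessarily exactly how it appears in the course code:

```python
# OpenAI client used for chat completions and embeddings
client = OpenAI(api_key=openai_key)

# Quick smoke test to confirm the key and client are working
res = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is human life expectancy in the United States?"},
    ],
)
print(res.choices[0].message.content)
```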

00:17:30Paulo

Okay, looks like we have some issues here. res.choices[0].message. Let's just go and get the response, the whole payload. I think I'm missing the actual object. It's okay. Let's run again. Okay, so we can see that we have the response ChatCompletion and it went ahead and got: "As of 2020, the average life expectancy in the United States is around 78.8 years old," and so forth. Okay, so we're getting a payload to tell us that this is actually working. Of course, you can get to the actual payload if you want this content here... I think... okay, there we go. So now we get the actual content: "As of 2020," blah, blah, blah, and so forth.

00:18:20Paulo

Okay, so at least we know that this is working. That's all we really wanted. That's not our end goal. So I'm going to get rid of that.

00:18:25Paulo

Okay, so the first thing we need to do is, of course, to load our documents from our articles. As you can see here, we need to load all of them and then start doing something. So I have all of the code already so that we don't spend too much time, and you'll have access to all this code anyway. And what I'm going to do, it's a function that I created before this. So what are we doing here? We're loading documents from a certain directory. So I have a print statement here just to give us what's happening. And I go through, in this case, I know that all of these articles documents end with .txt. They are .txt files. It's kind of hard to see, you just have to believe me. You can see here they are .txt files. Okay, so that's what we're doing. We're going through and starting to load all of them and return the actual documents. Right? It's going to be a list of documents.
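A sketch of that loader, assuming each article becomes a dict with an id (the filename) and its text:

```python
# Load every .txt article in a directory into a list of {"id", "text"} dicts
def load_documents_from_directory(directory_path):
    print("==== Loading documents from directory ====")
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".txt"):
            path = os.path.join(directory_path, filename)
            with open(path, "r", encoding="utf-8") as f:
                documents.append({"id": filename, "text": f.read()})
    return documents
```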

00:19:20Paulo

All right, so the next thing is we need to split these documents. Once we get them, we've got to split them up so that we can then pass them into our database. So I already have a function that does just that. You can see we have split_text, and we say the chunk size is 1,000 and the overlap is 20. The overlap essentially says that once we split these documents, we want to make sure the chunks overlap. That way the contextual meaning of each piece of text is kept, because once we split all these documents into small chunks, they're going to be very disconnected. So the more overlap we have, the more context is kept; the less overlap, the less context. So that is the idea, really. And so we just go through the splitting process, and then we return the chunks.
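A minimal sketch of such a splitter, with the 1,000-character chunks and 20-character overlap mentioned here:

```python
# Split a text into fixed-size character chunks that overlap slightly,
# so neighbouring chunks share some context
def split_text(text, chunk_size=1000, chunk_overlap=20):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - chunk_overlap
    return chunks
```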

00:20:11Paulo

Okay, so we have those functions we're going to be using soon. And so now we're going to go ahead and load documents from the directory. So the directory, as you see here, is... let's see. Right. I should have said news_articles. Okay, so I'm just going to remove this like that. Should have put it under data, but it's okay. So it's under news_articles, which is this guy here. And it's going to go ahead and get all of them. And then for documents, I'm going to go and then I'm going to call load_documents_from_directory, pass the path, which is this one here. And at this point, we should have all the documents. I'm going to go ahead and print real quick here so we can hopefully see, we should have the length of the documents once we have those documents loaded. Right? So in this case here, you know that this should return a list of documents because documents is indeed a list. Okay, we should have something. So let's go ahead and save this. And I'm going to quickly run again.
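That call looks something like this:

```python
# Load all the news articles and report how many were found
directory_path = "./news_articles"
documents = load_documents_from_directory(directory_path)
print(f"Loaded {len(documents)} documents")
```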

00:21:15Paulo

Okay, "Loaded 21 documents." Very good. So it went and got all of those documents that we loaded in, essentially all of these documents here. So they're about 21. Now, once we have these documents, of course, now it's time for us to do something else. What we need to do really is to get those documents, split up. So I'm going to go ahead and do that.

00:21:41Paulo

So now I create the "split documents into chunks" step. So I have a list, and I go through those documents that we just received. And then we call split_text. split_text is indeed what we have here. So it's going to go through and return chunks, a list of all those documents split up. But remember, we have this overlap for each document so the context is kept. Okay, very good. So we do all that stuff. And then I can go ahead and print again: "Split documents into" the length of chunks. Okay, because I should have something at this point. Let's run again.
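A sketch of that chunking loop, with hypothetical chunk IDs built from the document ID:

```python
# Split each document into chunks, keeping an id per chunk
chunked_documents = []
for doc in documents:
    chunks = split_text(doc["text"])
    print("==== Splitting docs into chunks ====")
    for i, chunk in enumerate(chunks):
        chunked_documents.append({"id": f"{doc['id']}_chunk{i + 1}", "text": chunk})

print(f"Split documents into {len(chunked_documents)} chunks")
```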

00:22:27Paulo

Okay, so you can see it went through the process, splitting docs, splitting docs, and this is telling me how many splits I got, because that is what indeed I asked, length and so forth. Okay, so we know this is working, which is essentially what we want.

00:22:41Paulo

Okay, so the next function I need here is a function that will generate those actual embeddings. Because remember, once we split up all the documents, we need to take those splits that we did here and create embeddings. This is what's actually saved into our database, into our vector database. So I have a function here that is going to be helpful for us to use. So essentially, what this does is we use OpenAI to create those embeddings from the text. Okay, that's what we're doing here. So you can say client.embeddings.create, and we pass the text, the pieces that we are putting through. And then we say the model that we want to use to create those embeddings. That's all we're doing here. And then we get those embeddings and return them.
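A sketch of that helper using the OpenAI embeddings endpoint (the function name is illustrative):

```python
# Create an embedding vector for a piece of text with OpenAI
def get_openai_embedding(text):
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    print("==== Generating embeddings... ====")
    return response.data[0].embedding
```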

00:23:26Paulo

Okay, this is going to be helpful in a second here. Then I'm going to generate the actual embeddings. Why? Because I already have the function. So to generate embeddings, we go through all the chunked documents, yes, you remember these guys here. And then as we go through, we call these get_openai_embeddings and we pass the actual information through to then create a document embedding field. So each time we go, we're actually creating those embeddings. So we can actually print to see our embeddings. So I can say doc_embedding, and what will happen is, let's go ahead and run real quick so you can see. We're going to go through the whole process, splitting, and look at that, it's creating those embeddings. We'll take a bit. And in a second here, we should see actual embeddings. So after a little while, so be patient, but this will take a while. You can see now we have the embeddings, the vector spaces, right? And so there we go of all the documents.
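Generating the embeddings for every chunk then looks roughly like this:

```python
# Attach an embedding to every chunk (one API call per chunk,
# which is why this step takes a while)
for doc in chunked_documents:
    doc["embedding"] = get_openai_embedding(doc["text"])
    print(doc["embedding"])
```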

00:24:26Paulo

Now we have all the embeddings. These are the vectors that we're able to add into the database. So this is good. So I'm going to go ahead and clear this. I just wanted to show you. Okay, so now that we have our embeddings, let's go ahead and comment this out so we don't run that. Let's go ahead and insert each one of these embeddings into our database. Okay, so of course, I have the code for that. We know this chunked_documents, which is what we have here, has all of the information. What we do now is take those chunks, the real text chunks before any embedding, and add them into our vector database, and at the same time we add the actual embeddings along with the documents. So now we're going to have these chunks of the documents, these little pieces (not embeddings, just the text chunks), and then we're going to have the actual embeddings. They are all going to be sitting in our database. Ah, very cool then.
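The insertion step, sketched with Chroma's upsert call:

```python
# Store both the raw text chunk and its embedding in the vector database
for doc in chunked_documents:
    print("==== Inserting chunks into db ====")
    collection.upsert(
        ids=[doc["id"]],
        documents=[doc["text"]],
        embeddings=[doc["embedding"]],
    )
```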

00:25:28Paulo

And we're going to create a function to query our documents. So I have all of that, and I'm going to just copy that and put it here so we don't have to do all the things. So the idea here is that query_documents. Now, this query_documents is very simple. We pass in the question like, 'Tell me about GPT-4,' something like that, anything pertaining to our data that we've just saved. Okay, and then we say how many results we're expecting to receive, how many documents essentially. Because what will happen is we're going to be able to retrieve the documents corresponding to the query that we're passing in, right? So in the background, what will happen is the database is going to be able to go and search, do similarity search, until it finds what is congruent with the question that we have inserted. That's pretty cool, right?

00:26:18Paulo

And so we say collection.query, passing the question and the number of results we want, the number of documents that we want. And then we put it in a variable, and then we extract the relevant chunks from our list of documents, because this result here is going to have the list of documents. That's why we can go through those documents and get the relevant chunks. And then once we have them, we just return those relevant chunks. I have this other code here you can check out. This is going to give us the relevancy distances, essentially telling us how close these documents are to the actual answer. Okay, and you can play with that.
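A sketch of that query helper; results["documents"] comes back as one list of chunks per query text, so it gets flattened:

```python
# Retrieve the n_results most similar chunks for a question
def query_documents(question, n_results=2):
    results = collection.query(query_texts=[question], n_results=n_results)
    relevant_chunks = [doc for sublist in results["documents"] for doc in sublist]
    print("==== Returning relevant chunks ====")
    return relevant_chunks
```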

00:27:00Paulo

Okay, so next what we'll do once we have this done, I'm going to then, of course, have a function that will generate the response because, think about it, we have taken the documents that we have, we've chopped them up, we put them, we created a vector database, and then we put inside of that vector database. But before that, we were able to create embeddings because we want to save those embeddings because it's easier for the search to happen, right, for the right document once we ask the question.

00:27:32Paulo

And so now we want to generate the actual response. So now we are going to use the large language model again, OpenAI in this case, to do all the work with all these pieces that we have right now. So as you can see here, we pass in the question and we need the relevant chunks, right? We are taking the relevant chunks that we created, we were able to query the database, right? And then we are passing that along with the question. So now we have the question, our question asking questions about these documents, right, whatever. And then now we have only relevant chunks that were passed through the large language model. And the large language model now has more information or more knowledge of what we want to get the answer from, right?

00:28:19Paulo

And so that is what's happening. So here I'm creating the context. Essentially, I'm getting this relevant chunks and joining in with other stuff. And then I have a prompt here, a prompt for our large language model to say, 'Hey, this is what you need to be aware of when you are answering these questions. You're an assistant for question answering tasks. Use the following pieces of data, blah, blah, blah, retrieve context to answer a question. If you don't know, please say 'I don't know',' and things like that.

00:28:45Paulo

Now, a prompt is actually its own thing, and you have to be really good at prompting to get the right result from a large language model. And then, of course, we pass that context and the question. We need to pass those two things, right? So now the large language model will have the question that we typed in, and it will have the relevant documents, because we've passed those through already, you see? Okay, and then we call create again. We go to client.chat.completions.create, of course. And now we are going to the actual model saying, 'Hey, here's the information, go ahead and give me the answer.' That's all we're doing here. Okay, we pass in the prompt as well as the question, and then we get the answer. This is what this will return.
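A sketch of that generation step; the prompt wording below is paraphrased from the video, not copied from the repo:

```python
# Build a prompt from the retrieved chunks and ask the chat model to answer
def generate_response(question, relevant_chunks):
    context = "\n\n".join(relevant_chunks)
    prompt = (
        "You are an assistant for question-answering tasks. Use the following "
        "pieces of retrieved context to answer the question. If you don't know "
        "the answer, say that you don't know. Keep the answer concise.\n\n"
        "Context:\n" + context
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```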

00:29:27Paulo

So now it's time for us to check out to see how this will work. So now I'm going to have here this query that I'm going to start doing here. So here's an example question: 'Tell me about AI replacing TV writers in strike.' Now, I know this question would work because in one of these documents here, we talk about AI replacing jobs and so forth. And so I'm going to see if this works. So here's what's happening: I have the question and I need to get relevant chunks, right, from the document, in this case, from the database. Well, I call the query_documents and I pass the question. So it's going to go ahead and query the documents that are in our database, finding documents that are relevant to the question that we're asking, which is this here, right?

00:30:14Paulo

And once we get these relevant chunks, we're going to need that along again with the question, the first question, and the relevant chunks we got to get the answer, right? Because this generate_response here, this is what we just talked about, is going to go ahead and pull in the context, relevant chunks, as well as the question, create a prompt, and then pass that through that prompt and then call the large language model to then answer that question, and we get that answer. Okay, let's see if this works. And then we're going to print the answer here.
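Put together, that last part of the script looks roughly like this:

```python
# Ask a question about our own documents, end to end
question = "Tell me about AI replacing TV writers in the strike."
relevant_chunks = query_documents(question)
answer = generate_response(question, relevant_chunks)
print(answer)
```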

00:30:48Paulo

All right, let's run this again and see if this works. So it's going to go through the whole process, of course, it will generate everything. And one thing also, in the beginning, of course, it will go through the process, but once you run once, because we will have that data and everything, we should be able to just comment out the first part of this code essentially. So everything is good, but in any case, everything will still work. It should work because we have that data already. The other thing you will notice is that now we should have... you can see now we have this chroma_persistent_storage, which is indeed the Chroma SQLite3, which is the database that we created. This is the actual Chroma database. Pretty cool, pretty cool indeed. Okay, so we have that set up. So this will take a little bit, of course.

00:31:36Paulo

Okay, now it says, "Retrieving relevant chunks," and voilà! Says here, "TV writers are currently on strike due to the Writers Guild of America demanding regulations on the use of AI in writers' rooms." So the Writers Guild, blah, blah, blah. So all this information actually pertains to the articles that we have here. So 'AI Replace TV Writers', if I click here, you will see that indeed I should have something related to that. Let's see. Okay, "regulate the use of AI" and so forth. So it goes ahead and looks at the correct other ones here that relate to exactly that.

00:32:14Paulo

So for instance, I can go ahead and ask something else. Let's see. Let's say something about Databricks. Okay, so let's say, 'Tell me about Databricks.' And let's go ahead and go through the process. It went through the whole process, inserting chunks into DB, blah, blah, blah, went through all of that. We know that. And at some point, it went ahead and said, "Retrieving relevant chunks," and then of course, we hit the large language model and it says, "Databricks is a data and AI company that recently acquired Okera," blah, blah, blah, and so forth.

00:32:51Paulo

So just like that, we're able to take the information, our own data in this case here—this could be anything, our own data in this case—and we parsed all that through, we extracted everything, we created these little chunks of this data. And then we used the OpenAI API to actually create the embeddings, that's very important. And then we saved that into a vector database. This is very important also. This is not a normal database, this is a vector database. And then we're able to search that vector database according to the question that we're passing through it. And then we got the right chunks, right? And then we passed those chunks of documents and the question and passed that through the large language model, and then we're able to get the answer.

00:33:39Paulo

This is the power, as you see here, because now we're able to take our own data and sort of inject it into the large language model so that we are able to ask questions about our particular data. And I hope you can see so many use cases that you can use this particular RAG system here to help you with analyzing data and so forth.

00:34:06Paulo

All right, so now that you have the basics, the fundamentals of RAG, how to create a simple RAG system that allows you to converse or chat with your own documents, so essentially injecting some information, your custom information, your data with the large language model so you can converse and start talking, chatting, and getting responses, conveniently responses that are attached or that are congruent with your own data. Okay, so you know how to do that. And of course, now you know how to create, of course, a database, a vector database, which is very important for us to be able to save those pieces of information of our data, documents in this case, and then save all of that information along with the embeddings, which is really important because all of this is being saved in a vector space, which is easier for the vector database to be able to find things faster, things that have meaning and relevancy.

00:35:10Paulo

You might think that this is where it ends, but obviously this is not where it ends, because you will see that RAG as it is right now, what we call naive RAG, has its own pitfalls. And so now we are ready to move forward and learn about the pitfalls of naive RAG, which is what we've been doing. And then we're going to learn and implement certain techniques that will take our RAG system to the next level, okay, so we can actually get consistent results. Because as it is, you can get results that are not very consistent and that may not necessarily be congruent or semantically related to your query. And that is what we're going to be touching on next. Okay, let's go ahead and do that.

00:36:02Paulo

So naive RAG is the most simple, straightforward way of dealing with large language models. So essentially it has indexing, retrieval, and generation. So indexing, it is the process of cleaning up and extracting data from the documents, just like as we saw earlier. And then we have the retrieval. So this is the part where we turn questions into a vector, a vector space, and that's what's used for comparison, which allows us to retrieve closely related chunks, which then are pushed with the query into the generation phase, which then the query and chosen documents are combined into a prompt. So we have a prompt, the query, and the documents that were chosen, pushed to the model to generate an answer.

00:36:50Paulo

So this is naive RAG. So really, if you want to do a deep dive into naive RAG, this is what happens. We have the documents, and these documents go through a phase of parsing and pre-processing. Essentially, we cut them up into smaller pieces. This is the chunking process. Then those smaller chunks are passed through the embedding model to create vectors out of them. Okay, so we're vectorizing those chunks, and that is what is saved into a vector store or a vector database.

00:37:26Paulo

So this is the part of indexing that happens here. Of course, this is the indexing part, as I have shown you. This is the part where we cut the documents and pre-process everything and chunk it up, and then create those embeddings or vectorize those chunks and save them into a vector store. And then what happens is, then we have a user who has a query or question of some sort, and that also has to go through the embedding model to vectorize that query. And then that is actually what is sent to search into the vector database. So we have vectors and vectors that are easily used in a vector database to do all sorts of things, mainly to search.

00:38:11Paulo

And then the information is retrieved. The relevant documents are retrieved and packed up, in this case, together with a prompt and the query. But notice here, this is the part that's different: this is the augmentation phase of RAG. So we're augmenting, we're adding something to what we had before. So not only do we have a query, but we also have a prompt, which is part of the query, and relevant documents and so forth. Okay. So once that is augmented, we pass that information through a large language model—so it could be any kind of large language model—and then that's when the response is generated, which is then returned to the user. So this is how naive RAG works.

00:38:57Paulo

As wonderful as this sounds and looks, there's some issues with this. Well, naive RAG has some challenges, some pitfalls, some drawbacks. The first one is that we have limited contextual understanding. So for example, if a user asks about the impact of climate change on polar bears, so a naive RAG might retrieve documents broadly discussing climate change and polar bears separately, but will fail to find the most relevant documents discussing both topics in context. Okay, so that is a problem because naive RAG models often retrieve documents based solely on keyword matching or basic semantic similarity, which can lead to retrieving irrelevant or partially relevant documents.

00:39:49Paulo

And then, of course, we get inconsistent relevance and quality of retrieved documents. So what that means is that the quality and relevance of the retrieved documents can vary significantly because naive RAG models may not rank the documents effectively, which leads to poor quality inputs to the generative model. And the third one is that we have poor integration between retrieval and generation. What that means is that in naive RAG systems, the retriever and the generator components often operate independently without optimizing their interactions. So you can see two things are working independently. They're not optimized to work together. That's a problem.

00:40:31Paulo

So this lack of synergy could lead to suboptimal performance where the generative model doesn't fully leverage the retrieved information. So as an example, the generative model might generate a response that actually ignores critical context that was provided by the retrieved documents, which in this case will result in generic or off-topic answers.

00:40:58Paulo

Also, we have inefficient handling of large-scale data. Naive RAG systems may struggle to scale to large datasets because of their inefficient retrieval mechanisms, which leads to slower response times and decreased performance. For example, in a large knowledge base, a naive retriever might take too long to find relevant documents or even miss critical information due to inadequate indexing and search strategies.

00:41:27Paulo

We also have lack of robustness and adaptability. The issue here is that naive RAG models often lack mechanisms to handle ambiguous or even complex queries robustly. So they're not adaptable to changing contexts or user needs without significant manual intervention. You can imagine if a user query is vague or multifaceted, a naive RAG might retrieve documents that partially address different aspects of the query but fail to provide a coherent and comprehensive answer. Okay, so these are some of the main drawbacks or challenges or even pitfalls of naive RAG.

00:42:13Paulo

Okay, let's go ahead and break down each one of these pitfalls of naive RAG, so we can have a better idea. Okay, so let's look at the limited contextual understanding we focused on in the last lecture. What does that even look like? Again, in naive RAG, limited contextual understanding comes from focusing on keyword matching or basic semantic search, in this case retrieving irrelevant or partially relevant documents. So essentially, as I said, we have a query, for instance, that says, "the impact of climate change on polar bears." And the idea is that that question goes through the whole naive RAG process, which we don't have to go through in detail, as we know it. So this is just an illustration. It goes through all of that, and then it just retrieves non-relevant docs on both topics.

00:43:05Paulo

Okay, so that is the limited contextual understanding that is lacking, which is one of the pitfalls of naive RAG. Okay, because naive RAG models often retrieve documents based solely on the keyword matching or basic semantic similarity, which you can see here can lead to retrieving irrelevant or partially relevant documents, non-relevant docs on both topics.

00:43:31Paulo

Now, the next one we talked about is the inconsistent relevance and quality of retrieved documents. So the problem is the same with naive RAG. We have this plethora of documents varying in quality and relevance, which means we have poor-quality inputs for the model. Again, the same example. We'll have a query that goes through and that says, "Latest research on AI ethics," for instance. And so that goes through the naive RAG process, and then we may end up getting outdated or less credible sources, because the quality and relevance of the retrieved documents can vary significantly. So naive RAG models may not rank the documents effectively, which leads to poor-quality inputs for the generative model, which, you know, will produce very poor results.

00:44:27Paulo

The next one that we discussed is the poor integration between retrieval and generation. And so we know that because of the nature of naive RAG, the retriever and the generator components often operate independently. They are operating alone without optimizing the interactions between them, and as you know, if two systems are not working in conjunction, the lack of synergy can lead to suboptimal performance, which means we end up with very unoptimized documents or information being passed through.

00:45:03Paulo

So again, we have, in this case, the retrieved documents, and these retrieved documents are, of course, used through the naive RAG system. And what happens is that we may lose what's actually important, what is relevant, and we end up with very generic or even off-topic answers through this whole process, because these two systems are working—the generator in this case, generator and the retriever—are components that are working independently without optimizing their interactions.

00:45:31Paulo

And the next one is inefficient handling of large scale. Let's look at the overview. The idea is that using naive RAG systems, these tend to struggle with scaling to large datasets because of the inefficient retrieval mechanisms, which lead to very slow response times and, of course, decreased performance because it takes too long in larger-scale datasets and so forth to find relevant docs, which of course leads to missing information due to bad indexing, essentially. So essentially, it's the same thing. We have the index process, which creates, of course, these index data structures, which then, in the grand scheme of things, is going to have trouble trying to find the right information in a very large knowledge base if you have a lot of documents and so forth, because it's taking too long. And of course, it may end up missing critical information because of this inadequate indexing and search strategies.

00:46:27Paulo

Okay, so going to the lack of robustness and adaptability, this is another one. Models often lack mechanisms to handle ambiguous or complex queries. So remember that queries are not always very direct. So you can have a query that is loaded, per se, it's ambiguous, it has more information, more questions that are a little bit complex. So with the naive RAG, they just don't have the mechanisms to handle that because they're not adaptable. So that is the problem. Here's an example. So we have a query: "Tell me more about index funds and anything related to finances." So this is a little bit loaded, this query, isn't it? And you can see that naive RAG doesn't have a way of deciphering through this query to get to the bottom of the actual query or to get the correct, coherent, and comprehensive answer.

00:47:20Paulo

So these are the drawbacks, doing a little bit of a deep dive into each one of the drawbacks or challenges or pitfalls of using naive RAG.

00:47:32Paulo

So in summary, naive RAG has, of course, some pitfalls as we've seen. And we can subdivide those pitfalls in two categories. We have first, the retrieval challenges. So in this case, they lead to the selection of misaligned or even irrelevant chunks, which of course leads to missing of crucial information, which, as you know, is not a good thing. And then we have the other side, which is the generative challenges. So under this umbrella, we have issues that the model might struggle with hallucination and have issues with relevance, toxicity, or bias in its outputs. So these are the two umbrellas, I would say, of the drawbacks that naive RAG brings to the table: retrieval challenges as well as generative challenges.

00:48:27Paulo

Now, let's talk about the solutions here. In this case, we are going to go from the naive RAG that we've looked at, with its pitfalls, to advanced RAG techniques that help us make this whole process more efficient, okay, for our RAG. And so, advanced RAG techniques and their solutions.

00:48:50Paulo

Now let's look at the first one. Now, first, let's look at advanced RAGs' benefits. Why is it important? The beauty here is that they introduce the specific improvements, this plethora of improvements that allow us to overcome the limitations of naive RAG, as we've seen previously. The main goal here is that with advanced RAG benefits, we are focusing on enhancing retrieval quality.

00:49:17Paulo

So advanced RAG employs the following strategies. The first one is pre-retrieval. The idea is to improve the indexing structure and user's query, because that is very important, the first step. If you don't have a good indexing structure and user's query, then everything else falls apart. And also we're going to improve the data details, in this case, organizing indexes better, adding extra information—augmentation part of it now—and aligning things correctly.

00:49:48Paulo

And then we have the second part, which is the post-retrieval. At this stage, we are combining what we got from the pre-retrieval stage, okay, the data that we got there with the original query. So we're combining all of that to finalize this augmentation part of the RAG. In this case, we're going to be reranking to highlight the most important content or doing all other techniques that goes ahead and enhance our retrieval processes.

00:50:19Paulo

Now, there are many advanced RAG techniques out there. There have been a lot of studies and written papers that you can go and take a look. One thing I will let you know of is that as you go through all of these other different techniques, even beyond this course, you will realize that most of them tend to overlap, and sometimes the namings can also overlap, and that's okay. But the main idea is the same, which is having techniques that allow for a better workflow of our RAG systems.

00:50:48Paulo

So the first one I'm going to look at here is query expansion, in this case with generated answers. What does that really mean? It means we're going to generate potential answers to the query. In this case, we're going to actually use a large language model to get relevant context. So query expansion, as an advanced retrieval technique, is used to improve the relevance—again, keep in mind these keywords—of search results by augmenting the original query with additional terms or phrases. These additional terms are selected based on various strategies, such as synonym expansion, related terms, or contextually similar words. So the goal is to capture more relevant documents that might not match the original query terms exactly, but are semantically related.

00:51:50Paulo

So to show you here, here's a simple diagram. You have a query, and this query goes through a large language model to create an answer. This is what I would call a hallucinated answer. Okay. And then we take that answer, concatenate it with the original query, and use that as a new query, which we pass through the vector database, in this case the retrieval system. And then you get your query results back, which are passed through the large language model again to get the actual result, the answer.

00:52:29Paulo

So that's all we're doing here, is that we are using the large language model in this case to hallucinate a little bit on that first query. And then we concatenate the original query, or in this case, the answer, that answer that we got, right, and the original query, and use that as a new query, which then we pass through the vector database, the retrieval system, and then return the query results again, pass those through a large language model again to get the results.
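As a rough sketch of that flow (reusing the OpenAI client from the earlier code; the function name, prompt wording, and example query are purely illustrative):

```python
# Query expansion with a generated answer: ask the model for a plausible,
# possibly hallucinated answer, then search with the query and that answer
# joined together instead of the bare query.
def augment_query_generated(query, model="gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Provide a brief, plausible example answer to the user's question.",
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content


original_query = "What was the total revenue for the year?"
hypothetical_answer = augment_query_generated(original_query)

# This joint query is what gets embedded and sent to the vector database
joint_query = f"{original_query} {hypothetical_answer}"
```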

00:52:59Paulo

So what happens here is that with this expansion here, we have a few use cases, right? This query expansion with generated answers, we can use that in the information retrieval. Can you imagine with this system now, we can enhance the effectiveness of search engines because we're providing more comprehensive search results. And also in question answering systems because now we're improving the retrieval of relevant documents or passages that potentially help in answering user queries. In e-commerce search, because it increases the accuracy and relevance of product search by expanding user queries with related terms. Academic research, it makes sense because now we are able to find more relevant papers by expanding their search queries with related scientific terms and concepts.

00:53:52Paulo

Okay, so let's go ahead and do a hands-on. So essentially, again, this is the diagram: we have the query, pass that through a large language model, and we get an answer, so a hallucinated answer. And that's okay, but that's going to be used, of course, to concatenate with the answer and the previous query. And we get some results from our vector database, which then we pass all of that into a large language model, and we get an answer. So in the next video, we're going to see how all that works in code.

00:54:24Paulo

Okay, so I have a project here. You should have access to all of this code and data and everything. And at the top here, we have this data folder which has this Microsoft annual report. Okay, so if you click on that, it's just a large PDF with about 126 or 116 pages. So it talks about Microsoft annual report 2023. Okay, so I just found this and I thought it would be a good idea to use that for our demonstration here. Okay, so you'll have access to all of this. And also make sure that you create a virtual environment on your project so this works, as well as having, in this case, an OpenAI API key. You should be able to go and create an account with OpenAI and get a key so if you want to follow along. And so forth, I've got this helper_utils, other utility methods and so forth.

💬 0 comments
Add to My Notes
00:55:19Paulo

Okay, all right. So first, let's go ahead and install some dependencies. I'm going to say pip install chromadb, because that's what we're going to be using. Next, let's pip install pypdf, because we're going to use that to read a PDF file and extract everything. And let's also pip install openai.

💬 0 comments
Add to My Notes
00:55:52Paulo

Okay, so let's go ahead and create a new file here. Let's call this xp_expansion_insert.py. I already have the code, so I'm going to pull in parts of it and we'll go from there. First, I'll add some imports—PyPDF, for example. Let's make sure everything is set up: we've got our PyPDF PdfReader, we have OpenAI, all of that. And I'm going to set up the environment variables as well, so the OpenAI API key—make sure you have that set up inside your .env file. It's very important.
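Here's a minimal sketch of that setup step, assuming the dependencies mentioned above are installed and the key lives in a .env file; the variable names are illustrative, not the course's verbatim code.

```python
# Install first (as described above):
#   pip install chromadb pypdf openai python-dotenv sentence-transformers
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY from the .env file in the project root.
load_dotenv()
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
```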

💬 0 comments
Add to My Notes
00:56:32Paulo

First, let's go ahead and read our Microsoft annual report. It's under data—you can see it here, and you should have access to all of that. Then, for the PDF text, I'm extracting everything from the pages and filtering out the empty strings. While I'm here, let's run this real quick to see if it works: I'll print the PDF text we get, with the empty strings filtered out, passed through the word wrapper so we can read it. Let's go ahead and run this.
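A small sketch of that extraction step with pypdf; the exact filename under data/ is an assumption based on the transcript.

```python
from pypdf import PdfReader

# Read the annual report and pull the text out of every page.
# Filename is assumed; adjust it to match the file in your data folder.
reader = PdfReader("data/microsoft_annual_report_2023.pdf")
pdf_texts = [page.extract_text().strip() for page in reader.pages]

# Drop any pages that came back empty.
pdf_texts = [text for text in pdf_texts if text]

print(pdf_texts[0][:500])  # quick sanity check on the first page
```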

💬 0 comments
Add to My Notes
00:57:13Paulo

Okay, so I need to import pandas. So that's something—make sure you import pandas as well. Let's go ahead and run. Okay, we can see that we're able to extract our PDF and we get all of that output. This is really good.

💬 0 comments
Add to My Notes
00:57:35Paulo

Okay, so now that we have the information we need—we've extracted the text—I'm just going to comment this out. We extracted our PDF and got the information from our document. What we need to do next is split the text into chunks. To do that, we're going to use LangChain, a framework that lets us work with large language models and do all sorts of things. So first, let's go ahead and get it: pip install langchain. From LangChain's text splitters, I can import the RecursiveCharacterTextSplitter as well as the SentenceTransformersTokenTextSplitter, because we're going to need both before we start embedding everything.

💬 0 comments
Add to My Notes
00:58:31Paulo

Okay, so let's take care of splitting our document using the RecursiveCharacterTextSplitter, passing a chunk size of 1,000 and a chunk overlap of zero. Then we split the text—all of the PDF text we created above—and that gives us the character splits. I'm also going to print the character-split text, the 10th chunk, and then show the total number of chunks. So I'm going to save this and do a quick run.
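A sketch of that splitting step, continuing from pdf_texts above; the separator list shown here is a reasonable default and an assumption on my part.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the concatenated PDF text into ~1,000-character chunks with no overlap.
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # assumed separator hierarchy
    chunk_size=1000,
    chunk_overlap=0,
)
character_split_texts = character_splitter.split_text("\n\n".join(pdf_texts))

print(character_split_texts[10])                      # the 10th chunk
print(f"Total chunks: {len(character_split_texts)}")  # ~410 in the video
```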

💬 0 comments
Add to My Notes
00:59:13Paulo

Okay, so we can see that we have the information we wanted: the word-wrapped text of one of the character splits—in this case, the 10th one. And the total chunk count is 410 chunks. Very good.

💬 0 comments
Add to My Notes
00:59:32Paulo

Next, we use the SentenceTransformersTokenTextSplitter to split the text into chunks of about 256 tokens, again with a chunk overlap of zero. The reason we do this is that we need to be mindful of the token-size limitations that large language models can impose. These token chunks are what we'll use to generate the embeddings we need. Okay, so let's do that.

💬 0 comments
Add to My Notes
01:00:04Paulo

So here I'm creating the token splitter using the SentenceTransformersTokenTextSplitter, passing an overlap of zero and 256 tokens per chunk. Then we actually do the splitting: I create the token split texts, loop through all of the character splits, and split them up. Now let's print some information so we can see the difference. Notice that the first time around the total chunk count was 410; now the total will be a little higher, which makes sense.
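A sketch of the token-level split, continuing from character_split_texts above.

```python
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Re-split each character chunk into ~256-token chunks so every piece fits
# comfortably within the embedding model's token limit.
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0,
    tokens_per_chunk=256,
)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(f"Total token chunks: {len(token_split_texts)}")  # slightly more than before
```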

💬 0 comments
Add to My Notes
01:00:51Paulo

Okay, so it looks like it could not import SentenceTransformer—I need to install sentence-transformers. That's right. So let's go ahead and say pip install sentence-transformers. Okay, we have that. Let's run this. So first we have the 410 total chunks, but next we should have a little more because of what we just did. And you can see it's a little more now—419 as opposed to 410. Very good. That's the total number of chunks from the split we did here, the token split texts.

💬 0 comments
Add to My Notes
01:01:28Paulo

Okay, so now that we have all the tokens, we can start the embedding process. First, we're going to import ChromaDB and get an embedding function from it. We could have used other embedding functions—OpenAI's, for example—but I'm going to use the SentenceTransformerEmbeddingFunction for this demonstration. So we import that and then instantiate the SentenceTransformerEmbeddingFunction, as you see here. And just to check, I'm going to print the embeddings of the 10th token chunk—it comes as a list, so I index into it. To avoid printing a lot of output, I'm going to comment out the other prints we had before.
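A sketch of that embedding-function setup; by default Chroma's SentenceTransformerEmbeddingFunction uses a small local model (all-MiniLM-L6-v2).

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Local sentence-transformers embedding function (no API key needed for embeddings).
embedding_function = SentenceTransformerEmbeddingFunction()

# Peek at the embedding of the 10th token chunk just to confirm it works.
print(embedding_function([token_split_texts[10]]))
```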

💬 0 comments
Add to My Notes
01:02:26Paulo

Okay, so you can see we now have the embeddings for the 10th chunk. Not very useful to read, but at least we can see that things are working—the embedding function is indeed working. Since we know that, we no longer need to see it, so I'm going to close this out and leave it there.

💬 0 comments
Add to My Notes
01:02:45Paulo

Next, we're going to instantiate the Chroma client and create a collection—give it a name and pass in the embedding function, so that each item we add gets embedded. So with Chroma, I create a client and then create a collection; in this case I'm going to call it microsoft_collection, but you can name it whatever you want. The embedding function is attached when we create the collection. And once we have that, we can get the embeddings for the token splits by just going through them like this.

💬 0 comments
Add to My Notes
01:03:26Paulo

We're going to take those token splits and add them to our collection. To do so, we call chroma_collection.add, passing the IDs as we go along with the actual documents—the documents being the token splits. Then I call chroma_collection.count and, for now, just print the count so we can see it. Okay, the count is 419, which is essentially what we had before. Very good—at least we know it's working. So I don't need that output anymore.
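A sketch of the collection setup and ingestion, continuing from the objects above; the collection name is just the one mentioned in the transcript.

```python
# Create an in-memory Chroma collection; the embedding function is attached so
# documents are embedded automatically as they are added.
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection(
    "microsoft_collection", embedding_function=embedding_function
)

# Use the chunk index as a simple string ID for each document.
ids = [str(i) for i in range(len(token_split_texts))]
chroma_collection.add(ids=ids, documents=token_split_texts)

print(chroma_collection.count())  # should match the number of token chunks (419 here)
```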

💬 0 comments
Add to My Notes
01:04:11Paulo

Next, we're going to create the actual query. We've extracted the document, split it all up, passed it through an embedding function to get the embeddings, and added all of that to our ChromaDB—our vector store. So I'll have a query here, for instance: "What was the total revenue for the year?" Then we query using the collection: chroma_collection.query, passing the query text—we could pass more than one query, which is why it's passed as a list—and the number of results we want. Then for the retrieved documents, I take the results and get the documents. And so we can see what comes back, I'm going to loop through all of those documents. Let's run again.
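A sketch of that baseline retrieval step against the collection built above.

```python
query = "What was the total revenue for the year?"

# query_texts takes a list, so several queries could be sent at once.
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]  # documents for the first (only) query

for document in retrieved_documents:
    print(document)
    print("-" * 40)
```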

💬 0 comments
Add to My Notes
01:05:08Paulo

All right, so you can see that we got some of the documents—"revenue" and "this year"—and of course these are split up. That's the whole idea: they may not make sense where they start, but you can see we have one, two, three, four... several documents coming back here. So this is indeed working.

💬 0 comments
Add to My Notes
01:05:28Paulo

So now we know we're getting the documents. The next step is to see how this technique actually works: we have the first query and the documents we just retrieved, and now we're going to generate the augmented query, because that's the whole idea of this technique.

💬 0 comments
Add to My Notes
01:05:50Paulo

Okay, the first thing is to create a function called augment_query_generated. This is where we use a large language model—in this case OpenAI—to generate an actual answer (or answers; here, just one). First, I have to create a client for OpenAI, so I'm going to copy that in. This augment_query_generated function lets us pass in the query, and the model is set to GPT-3.5 Turbo—you can change that if you want.

💬 0 comments
Add to My Notes
01:06:28Paulo

Then I have the prompt here that says, 'You are a helpful expert financial research assistant. Provide an example answer to the given question that might be found in a document like an annual report.' So we're prompting it, making sure it knows what to do. We pass that as the system message—this tells the model what it needs to do and what it needs to be knowledgeable about—and then we pass the query as the user message: the question, the first query we pass along. Then we use the chat completions API—create, pass it the model and the messages—and we get the response back.
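A sketch of that function, reusing the openai_client from the setup sketch; the prompt text mirrors what's quoted above.

```python
def augment_query_generated(query, model="gpt-3.5-turbo"):
    """Ask the LLM to hallucinate a plausible answer to the query,
    as it might appear in a document like an annual report."""
    prompt = (
        "You are a helpful expert financial research assistant. "
        "Provide an example answer to the given question that might "
        "be found in a document like an annual report."
    )
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```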

💬 0 comments
Add to My Notes
01:07:07Paulo

So essentially, this is the part of the diagram we're looking at: we generate this one answer, which we then take together with the original query. Both go through our database, we get the query results, and then we put all of that together into the large language model and get the answer.

💬 0 comments
Add to My Notes
01:07:27Paulo

All right, so we have that function, which we'll be using soon. Next, we create an original query, for instance: "What was the total profit for the year, and how does it compare to the previous year?" Then I have this hypothetical answer: I call the augment_query_generated function and pass in the original query. Then I join the two—the original query and the hypothetical answer—and print them out so we can see how they look. Let's go ahead and run this.
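A sketch of that joining step; the simple space-separated concatenation is an assumption about how the two pieces are glued together.

```python
original_query = (
    "What was the total profit for the year, and how does it compare "
    "to the previous year?"
)

# Hallucinate an example answer, then attach it to the original query.
hypothetical_answer = augment_query_generated(original_query)
joint_query = f"{original_query} {hypothetical_answer}"

print(joint_query)
```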

💬 0 comments
Add to My Notes
01:08:06Paulo

You can see here we have the original query, "What was the total profit for the year and how does it compare to the previous year?" Very good. And then, look at this—we got an answer. We hallucinated: we used a large language model to create that answer. The answer is, "The total profit for the year was 10 billion," and so forth, "driven by increased sales and successful cost-cutting initiatives implemented throughout the year." Very good. So now we have the query and the answer that was hallucinated—that was generated—which is part of this technique. Now we're golden, because once we have those two, it's time to put them together, send that into our Chroma collection, and query using that joint query.

💬 0 comments
Add to My Notes
01:08:55Paulo

In this case, I call it joint_query, but it's the original query plus the hypothetical answer. So I call chroma_collection.query and put the output in a results variable, because we're going to use that to retrieve the documents. For query_texts, I pass joint_query—the original query and the hypothetical answer we just received from the large language model. Now, you could have written that hypothetical answer yourself, but it's always nice to use the large language model, because it can create something more useful and you don't have to think about it.

💬 0 comments
Add to My Notes
01:09:34Paulo

All right, so now we have this joint_query. We say we want five results, and—most importantly—we want to include documents as well as embeddings in the results. The reason I want the embeddings is that we're going to use them to create a graph that shows the relationships between those embeddings visually, so we can see how this works. Then we have the retrieved documents, which I print out. Let's run. And there we go—we're getting the documents back. We took the original query and the generated answer, put them together, and passed them through our collection, saying, 'This is the query text, the joint_query; give us five results, and include documents and embeddings.' So now we have all the documents related to the query.
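A sketch of that retrieval call with the expanded query, asking Chroma to return embeddings alongside the documents for the plot later.

```python
# Query Chroma with the expanded query; embeddings are included in the
# results so they can be projected and plotted afterwards.
results = chroma_collection.query(
    query_texts=[joint_query],
    n_results=5,
    include=["documents", "embeddings"],
)
retrieved_documents = results["documents"][0]

for document in retrieved_documents:
    print(document)
    print("-" * 40)
```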

💬 0 comments
Add to My Notes
01:10:38Paulo

So we're seeing the query results from the answer concatenated with the original query, passed through the vector database. Now that we have our retrieved documents, which is exactly what we want, we want to project this dataset onto a nice graph so we can see the relationships—so we can see the improvement in the documents we get relative to the original query.

💬 0 comments
Add to My Notes
01:11:03Paulo

For that, we're going to import a few things. First, I create the embeddings—I go to Chroma and get the embeddings, because we're going to be using those. Then I'm going to use the UMAP library to create the projections, so we need to install it: pip install umap-learn. Okay, it's done. Import umap so we can use it further down, and then say from helper_utils import project_embeddings. There we go.

💬 0 comments
Add to My Notes
01:11:41Paulo

Okay, so now we have the projected dataset embeddings. Notice that we're getting these embeddings from our Chroma collection—when we added the documents, the embeddings were stored along with the other information. That's the beauty of it. So I'm going to use project_embeddings—a function from our helper_utils—together with the umap_transform.
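A minimal sketch of what this projection step could look like. The course's helper_utils provides its own project_embeddings; the version below is a simple stand-in I'm assuming behaves similarly (fit UMAP on the stored embeddings, then transform embeddings into 2D).

```python
import numpy as np
import umap

# Pull every stored embedding back out of the collection and fit a 2D UMAP on it.
embeddings = chroma_collection.get(include=["embeddings"])["embeddings"]
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)


def project_embeddings(embeddings, umap_transform):
    """Project a list of embeddings into 2D with the fitted UMAP transform."""
    # Transform one embedding at a time for a stable, per-item output order.
    return np.array([umap_transform.transform([emb])[0] for emb in embeddings])


projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)
```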

💬 0 comments
Add to My Notes
01:12:05Paulo

Next, we retrieve the embeddings from our results—the collection query we just ran. Then I get the original query embedding by calling the embedding function with the original query, and the augmented query embedding by passing in the joint query. This lets us see the difference between the results for the original query embedding and for the joint query—the original query plus the answer we hallucinated. That's all we're doing here.

💬 0 comments
Add to My Notes
01:12:43Paulo

Then we create some projection variables. Here I create the projected original query embedding by calling project_embeddings, passing the original query embedding and the umap_transform. We do the same for the projected augmented query embedding, passing the augmented query embedding, and for the projected retrieved embeddings. So I'm just projecting them all.

💬 0 comments
Add to My Notes
01:13:14Paulo

All right, next we just import Matplotlib—if you don't have it, pip install it—so we can see the figures and graphs. Once that's all set up, I plot everything: the projected query and the retrieved documents in the embedding space. This will just be 2D. You know that the embedding vector space is N-dimensional, but we're going to use 2D to make it easy to see the differences. That's what I'm doing here with scatter—creating the different point sets in our graph—and then showing the graph. Let's run this real quick so you can actually see what we're talking about.
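A sketch of the plotting step, continuing from the projection sketch above; colors and marker sizes are choices made to match the plot described in the next paragraph, not the course's exact code.

```python
import matplotlib.pyplot as plt

# Embed both queries, project everything into the same 2D space, and plot.
retrieved_embeddings = results["embeddings"][0]
original_query_embedding = embedding_function([original_query])
augmented_query_embedding = embedding_function([joint_query])

projected_original = project_embeddings(original_query_embedding, umap_transform)
projected_augmented = project_embeddings(augmented_query_embedding, umap_transform)
projected_retrieved = project_embeddings(retrieved_embeddings, umap_transform)

plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1],
            s=10, color="gray", label="dataset")
plt.scatter(projected_retrieved[:, 0], projected_retrieved[:, 1],
            s=100, facecolors="none", edgecolors="g", label="retrieved")
plt.scatter(projected_original[:, 0], projected_original[:, 1],
            s=150, marker="X", color="r", label="original query")
plt.scatter(projected_augmented[:, 0], projected_augmented[:, 1],
            s=150, marker="X", color="orange", label="augmented query")
plt.gca().set_aspect("equal", "datalim")
plt.title(original_query)
plt.legend()
plt.show()
```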

💬 0 comments
Add to My Notes
01:14:01Paulo

So this is beautiful. You can see here a two-dimensional vector space showing all of the information we plugged in. The red X is the original query, and all of these gray dots are the dataset embeddings. The orange X is the augmented query, which, as you remember, contains the hypothetical—in this case hallucinated—answer plus the original query. And these green circles are the retrieved documents. Notice that the retrieved documents sit close to the augmented query.

💬 0 comments
Add to My Notes
01:14:47Paulo

So you can see that the retrieved documents are close to the augmented query in the embedding space, which is an improvement—whereas the original query sits far away from all the retrieved documents. That tells you this technique really works. The bottom line is that the augmented query is closer to the retrieved documents than the original query is in the embedding space. It tells us that using a hallucinated answer—one we asked the large language model to create beforehand—as part of the query improves the retrieval results.

💬 0 comments
Add to My Notes
01:15:34Paulo

Looking at this, you can see it's quite an improvement. Now, again, it's not perfect, but you can see that we can use it to refine the results we got from our previous naive RAG. Using this technique, we get something much closer to the most relevant documents retrieved. So what I'd like you to do is ask a different question and see how the graph shows the improvement between the original query and the augmented query.

💬 0 comments
Add to My Notes
01:16:11Paulo

Okay, so we just saw how to use query expansion with generated answers as an advanced RAG technique: we generate a potential answer to the query using a large language model in order to get relevant context. Essentially, we have the query, we pass it through the LLM to generate an answer, and then we combine the original query and the generated answer to get the query results. That, of course, is then passed through the large language model to get the final answer. And we saw, in a 2D space, how the relevant documents are closer to the augmented query than to the original query.

💬 0 comments
Add to My Notes
01:16:56Paulo

Now, what are the use cases for query expansion? It can be used in information retrieval systems, in question-answering systems, in e-commerce search, and in academic research, as we've seen. What I want you to do next is create different queries and look at the differences in the plots, so you can start seeing how query expansion with generated answers behaves. It's a really good technique. Obviously it's not perfect, but in many cases the differences are not negligible at all—they're significant enough that you can see the power and usefulness of query expansion as an advanced RAG technique.

💬 0 comments
Add to My Notes
01:17:42Paulo

Okay, so now let's look at query expansion again, but this time with multiple queries. The idea is that we use the large language model to hallucinate—or generate—additional queries that might help us get the most relevant answer. Just like before, we have a query and we pass it through the large language model, but now we get back more queries. Essentially, we use the large language model to suggest additional queries, retrieve results for the original and new queries from the vector database, and then send all of those results to the large language model for a final, relevant answer.

💬 0 comments
Add to My Notes
01:18:30Paulo

So that's the overall workflow of query expansion with multiple queries. Essentially, instead of one generated answer, we now have more queries—we're creating queries rather than answers. In a nutshell, first there's the original query analysis, where we analyze the original user query to understand its intent and context. Then there's subquery generation, where we generate multiple subqueries that expand on different aspects or interpretations of the original query, giving us a breadth of subqueries. This can be done using synonym expansion, related terms, or contextually similar phrases. You could maintain a list of these subqueries in your organization or business, but we're going to use the large language model to hallucinate them.

💬 0 comments
Add to My Notes
01:19:24Paulo

Then comes document retrieval: we retrieve documents for each subquery separately—that's the beauty here. Then the combination, the aggregation step, where we combine the documents retrieved from all subqueries, ensuring a diverse and more comprehensive set of relevant documents. And finally, response generation, where we use the aggregated documents to generate a more informative and contextually relevant response.

💬 0 comments
Add to My Notes
01:19:54Paulo

Here are some use cases for this kind of query expansion. Exploratory data analysis: it can help analysts explore different facets of the data by generating varied subqueries. Academic research, which we've seen before: it gives researchers different angles on a research question by generating multiple subqueries. Customer support: it helps cover all aspects of a user's query by breaking it down into smaller, specific subqueries. And healthcare information systems: it helps retrieve comprehensive medical information by expanding queries to include related symptoms, treatments, and diagnoses.

💬 0 comments
Add to My Notes
01:20:43Paulo

Let's go ahead and put all of that together in code. I've put together this expansion_queries.py—you should have access to all of this anyway. Let's get started. I went ahead and did some imports; most of this is exactly the same as before, but I'll go through it again for completeness. Okay, so we've imported everything, we have the OpenAI client, and everything is good.

💬 0 comments
Add to My Notes
01:21:09Paulo

First, we do what we did before: read everything in using the PDF reader to get our data. So we get our data, Microsoft_Annual_Report.pdf, extract the text from the pages, and filter out the empty strings, as you see here. I'm not going to run this again.

💬 0 comments
Add to My Notes
01:21:33Paulo

Then I split all of those pieces into smaller chunks, as you can see here, using the RecursiveCharacterTextSplitter from LangChain, followed by the SentenceTransformersTokenTextSplitter, as we saw before. Essentially the same thing we did earlier: use the sentence transformer to split into tokens. We have the text splits, we loop through them, and we add them to the list of token splits.

💬 0 comments
Add to My Notes
01:22:12Paulo

Next, we import ChromaDB and all of that again—I know this is a bit of overkill, but I'm including it for completeness. We create our embedding function using the SentenceTransformerEmbeddingFunction, and we instantiate our ChromaDB client. Then we add all of the token splits to our ChromaDB; but first we extract the embeddings of the tokens as we did before, and then add them all. We have the count there as well.

💬 0 comments
Add to My Notes
01:22:49Paulo

Now I add the query, pretty much the same as before: 'What was the total revenue for the year?' We pass that query through our collection to see what we get—chroma_collection.query, passing the query, up to here. Let's make sure we're actually getting the documents: I'll just loop through all the retrieved documents. Okay, very good—the same thing we've seen before. We're getting our documents, at least five: one, two, three, four, and five.

💬 0 comments
Add to My Notes
01:23:24Paulo

Next, let's write a multi-query function responsible for generating the multiple queries. The difference here is the prompt: in this generate_multi_query function we still pass the actual query, and the model defaults to GPT-3.5 Turbo. And here the prompt says, 'You are a knowledgeable financial research assistant, and your users are inquiring about an annual report. For the given question, propose up to five relevant or related questions to assist them in finding the information they need.'

💬 0 comments
Add to My Notes
01:23:56Paulo

Now, this is very important, because everything is driven by the prompt, so your prompt should be something you put real thought into. The key is to make sure the prompt encourages the model to generate related questions or queries: it should ask for different aspects of the topic and a variety of questions that help cover it—concise, single-topic questions without compound sentences, each one complete and directly related to the original inquiry. It's very important to have a well-refined prompt. Of course, we pass the prompt along with the query, which is passed into the function.
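A sketch of the generate_multi_query function described above. The system prompt paraphrases what the transcript quotes; the extra instruction to list one question per line, and the line-splitting of the response, are assumptions added so the output is easy to parse.

```python
def generate_multi_query(query, model="gpt-3.5-turbo"):
    """Ask the LLM for up to five related sub-questions, one per line."""
    prompt = (
        "You are a knowledgeable financial research assistant. "
        "Your users are inquiring about an annual report. "
        "For the given question, propose up to five relevant or related questions "
        "to assist them in finding the information they need. "
        "Provide concise, single-topic questions (without compounding sentences) "
        "that cover various aspects of the topic. "
        "Ensure each question is complete and directly related to the original inquiry. "
        "List each question on a separate line without numbering."  # assumed formatting hint
    )
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    content = response.choices[0].message.content
    # Split the response into one sub-query per non-empty line.
    return [line.strip() for line in content.split("\n") if line.strip()]
```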

💬 0 comments
Add to My Notes
01:24:48Paulo

Let's go ahead and generate the multi-query. At the bottom here, I'm going to paste in the original query: "What details can you provide about the factors that led to revenue growth?" for instance. For the augmented queries, I call generate_multi_query and pass in the original query, because that's what it needs. At this point, I should get a result with what I need—step through and get the augmented queries. I'll loop through all the augmented queries we get and print them out. And at the top, let me make sure... Okay, very good, that's already commented out. Just going through the process, and soon we should see...

💬 0 comments
Add to My Notes
01:25:32Paulo

And just like that, you can see we have our augmented queries. Here's the first one: "How do changes in pricing strategy impact revenue growth?" Then: "What role did new product launches play in driving revenue growth?" "Were there any specific marketing or advertising campaigns that significantly contributed to revenue growth?" "How did changes in customer demographics influence revenue growth?" "Did partnerships or collaborations with other companies impact revenue growth?" This is wonderful. You can see that our large language model was able to take the original query we passed in and generate at least five new augmented queries.

💬 0 comments
Add to My Notes
01:26:25Paulo

Now that we've seen our augmented queries—we looped through and saw everything that was generated—let's go ahead and join, or concatenate, the original query with the augmented queries. Let's print joint_query so we can see it. Okay, the augmented queries are still being generated—very good—and then we have this list containing the original query together with each of our augmented queries. So here's "What details can you provide about the factors that led to revenue growth?" and then the augmented query "What were the key products," which is the first one here. For each of these, we're attaching the original query and the augmented query together, and you can see they're all in a list.
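A sketch of one plausible way to build that joint query list, under the assumption (one reading of the transcript) that the original query is simply placed alongside the generated sub-queries so that all of them are sent to the vector database together.

```python
original_query = (
    "What details can you provide about the factors that led to revenue growth?"
)
augmented_queries = generate_multi_query(original_query)

# Assumed construction: one list of query texts containing the original query
# plus every generated sub-query.
joint_query = [original_query] + augmented_queries

for q in joint_query:
    print(q)
```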

💬 0 comments
Add to My Notes
01:27:25Paulo

Now we pass this joint_query—which contains the original query and the augmented queries—through our collection, in this case our vector database, and query it. So let's do that. And there we go: now we have the retrieved results.

💬 0 comments
Add to My Notes
01:27:42Paulo

The next thing we need to do—because, by the nature of things, we might end up with multiple duplicates—is remove the duplicates and sanitize the list of retrieved documents that comes back from collection.query. So let's go through and deduplicate the list, and then output the result documents: for each query, print the query, then loop through and print each of its results. Let's run this real quick.
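A sketch of the multi-query retrieval and a simple set-based deduplication, continuing from joint_query above; the exact dedup approach here is my assumption rather than the course's verbatim code.

```python
# Retrieve for all queries at once, then de-duplicate the combined documents.
results = chroma_collection.query(
    query_texts=joint_query,
    n_results=5,
    include=["documents", "embeddings"],
)
retrieved_documents = results["documents"]  # one list of documents per query

# A set keeps only unique document texts across all sub-queries.
unique_documents = set()
for documents in retrieved_documents:
    unique_documents.update(documents)
print(f"{len(unique_documents)} unique documents retrieved")

# Show which documents came back for which query.
for i, documents in enumerate(retrieved_documents):
    print(f"Query: {joint_query[i]}")
    for doc in documents:
        print(doc)
    print("-" * 40)
```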

💬 0 comments
Add to My Notes
01:28:30Paulo

Okay, there we go. There's a lot here—it looks like gibberish, but that's exactly what we're doing: outputting the result documents. At this point, we've passed everything through, and now we're looking at the query results—all the documents we were able to retrieve using the augmented queries together with the original query.

💬 0 comments
Add to My Notes
01:28:55Paulo

All right, so the next step, really, would be to pull all that information together and pass it through a large language model so we can get the answer. For now, what we can do is plot everything so we can see the results in a graph. So let's project everything: we get all the embeddings from our collection, instantiate the UMAP transform, and use it to project the dataset embeddings—the same thing we've seen before.

💬 0 comments
Add to My Notes
01:29:32Paulo

Next, we visualize the results in the embedding space by creating the original query embedding and the augmented query embeddings, so we can see the differences. We're going to project the original query and the augmented queries into the embedding space. But because of how the results come back, we first flatten the list of retrieved embeddings into a single list, because project_embeddings expects a single list of embeddings. That's what we're doing there.

💬 0 comments
Add to My Notes
01:30:02Paulo

Next, we retrieve the embeddings themselves—the result embeddings—by looping through all of them, and then we project those result embeddings with the UMAP transform we're passing in, as you see there. And, as always, we plot everything using Matplotlib, so we import that, set up the figure, and plot everything. Let's run, and if all goes well, we should see something projected.

💬 0 comments
Add to My Notes
01:30:46Paulo

Again, you can see our results here; everything is pretty squished in. The red X is the original query. The orange X's are the augmented queries—the new queries generated by the large language model. The green circles are the documents retrieved by the vector database search. And the gray dots are the dataset embeddings.

💬 0 comments
Add to My Notes
01:31:14Paulo

Now, the retrieved documents are close to the augmented queries, as you can see—all these orange X's are the augmented queries, and most of the retrieved documents cluster around them, which is really good. It tells us, again, that the relevant documents sit close to the augmented queries and not as close to the original query—even though the original and augmented queries are themselves close together, the documents that are relevant cluster around the augmented queries.

💬 0 comments
Add to My Notes
01:31:48Paulo

Okay, much better. You can see we have—one, two, three, four, five, six—about five or six augmented queries, and of course only one original query. This should make it easier to visualize. So we see that with query expansion, we can retrieve relevant documents that we might have missed with the original query alone, which gives us a better understanding of the topic and helps in answering the original query.

💬 0 comments
Add to My Notes
01:32:20Paulo

The model-generated queries—the hallucinated queries—help capture different aspects of the original query and provide a more comprehensive view of the topic. So now we can better answer the original query, especially for complex topics like financial reports.

💬 0 comments
Add to My Notes
01:32:49Paulo

Now, there's a downside to this approach: the model-generated queries might not always be relevant or useful, and they can sometimes introduce noise into the search results. So it's important to carefully evaluate the generated queries and the retrieved documents. In the next videos, we're going to tackle this issue of noise in the search results by using a more advanced technique that lets us rank all of these documents, so we can surface the most relevant ones.

💬 0 comments
Add to My Notes
01:33:25Paulo

Before we move forward, I'd like you to do a challenge. I want you to play with different prompts and queries and see what results you get each time—keep refining the prompt and watch the results. This is very important because, as we've discussed, the prompt guides everything: it determines the multiple related queries you get back, which in turn influences the documents you end up retrieving from the vector database.

💬 0 comments
Add to My Notes
01:34:01Paulo

So we just finished going through query expansion with multiple queries, which lets us use the large language model to hallucinate—or generate—additional queries that might help us get the most relevant answer. This is the overall flow: we create queries with the large language model, concatenate the original query with the augmented queries, extract the relevant information—the query results—from the vector database, and then use that to get the actual answer.

💬 0 comments
Add to My Notes
01:34:38Paulo

Use cases for query expansion range from exploratory data analysis and academic research—which we've seen before—to customer support, healthcare information systems, and so forth. As I said, there are downsides to this technique: because we get lots of results, some queries might not be relevant or useful, and we end up with noise. Hence, we need another technique to find the most relevant results—once we get results from this technique, we should go a little further and refine what we just received.

💬 0 comments
Add to My Notes
01:35:16Paulo

Okay, very well. I hope you enjoyed this mini-course—I made it for you—and that you now see the whole picture of RAG systems. The idea is that you take this and build your own RAG systems, understanding the advanced techniques you can use to make them even better, so they don't hallucinate as much and you retrieve the closest, most correct pieces of information from your documents. You then pass those through the large language model, so you get a fuller response and can be confident it reflects exactly what was retrieved from the system.

💬 0 comments
Add to My Notes
01:36:01Paulo

So thank you so much for being here. If you're interested in learning more about RAG, AI agents, Python, and programming in general, I have a channel called Vinci Bits—it's right here. I'm also working on a larger, more comprehensive course (or courses) on AI, large language models, and building AI-based applications from zero to hero. If you're interested, there should be a link somewhere in the description where you can drop your email on the mailing list—the waiting list, I should say—and as soon as I have everything set up, you'll be the first to be notified. Again, thank you so much for your time, and until next time, be well.

💬 0 comments
Add to My Notes