Episodes / #34

Is your AI still hallucinating, even with RAG?

October 29, 2025 Β· 29:50

While Retrieval Augmented Generation is an excellent step towards getting LLMs to ground answers, you don't have to settle for what your provider gives you.

Topics Covered

AI


**[00:00:00]** Hello everyone, my name is Arando Presceno and welcome to the Web Talk Show. Today we're talking about hallucinations and RAG, retrieval augmented generation: what it means for your business and why you should consider doing something differently. If you're just joining us, make sure to subscribe and like if you enjoy the content. We're on all the networks, so you'll find us on YouTube, LinkedIn, X, Instagram, TikTok, and so on. So what is RAG, retrieval augmented generation? If you've ever talked with an LLM like ChatGPT or Claude and asked it questions, it will either give you results based on whatever knowledge it had when it was trained or last updated, or it might give you actual contextual information from your business knowledge base. When does it do which? That's what I'm going to explain today. And why is it important? If you're just talking to an LLM as an everyday thing on your phone, asking what to do in your city, looking for a plumber, doing research, it will give you a response. The first iterations of ChatGPT were mostly disconnected: they were trained on a big corpus of data. Basically everything public that has ever been written on the internet was fed into it, so you could ask it about ancient Egypt, Rome, even current events, but only up to a cutoff date. That's how it worked, and then other tools came **[00:02:00]** along that gave it access to the world. Perplexity did a very good job of creating a tool that kept the same chat interface but took it a little further and let you get results from the web.
So when you ran a search, it would go do searches for you, get relevant data, and then ground the response in that data, which made it more valid because that data was more up to date. Why did that matter? Because if you asked those initial versions about current events, they simply wouldn't know; they were stuck in the past. With Perplexity and tools like it, you could get current information, which was really good. After that, of course, ChatGPT, Claude, and the rest started adding web search as well. So now you can talk to the model privately against its own data store, or enable that little checkbox and have it go search the web for you. But let's say you're using it for business. You're a business owner, you work in a business, you're a CEO, and you want it to ground its responses in information from your knowledge base. Say you have training manuals for all your products, and you want answers drawn from those manuals, not from what the model "knows," because it very likely **[00:04:00]** does not know about your particular products, especially if they're very technical. If you need things to be precise and you don't want it to hallucinate, and that's where we're going with this today, then you'll want to feed it information from your company. How does that work? There's this thing called vector embeddings, and it's used a lot to provide context to LLMs so they can give you more relevant answers. In very basic terms, and you can do this in ChatGPT, Claude, Gemini, and so on, you can just give it a file.
You can dump a bunch of files in, and depending on how things are set up, it will either use them as a vector store, which I'll get into in more technical detail in a bit, or it will just include them in the context of the conversation. What is context? Let's talk about that a little. When you're chatting with an LLM, it's not really an entity; it's code that understands natural language and is very good at guessing what the next word in a conversation should be. It feels like you're talking to something, but in reality, on every turn the whole conversation is sent back to the model. You say something, it responds, you reply, and each time your reply goes out, the entire conversation goes with it. That's how it knows what you've been talking about and gives responses that make sense. **[00:06:00]** And it works; it's been working really well. But say you want to give it a manual so you can talk to your manual, or to a contract, for example. Yes, you can just feed it the PDF, and this is where context comes in. If the document is fairly large, you might run into the context window. Think of it as a buffer: how much the LLM can hold at one time has a limit, and it's been growing. If you read the press releases from the ChatGPTs, Claudes, and Geminis of the world, you'll see announcements that the context window is now this size or that size, which just means it can fit more tokens.
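To pin that down, here's a toy sketch of the point: the model is stateless, so each turn re-sends the full conversation, and the context window caps how much fits. The message shape and the whitespace-based token count below are simplifications for illustration, not any provider's real API.

```python
# Toy sketch: each turn appends to the history, and the WHOLE history is
# what gets sent to the model; a context window limits the total tokens.
# Token counting here is naive whitespace splitting, purely illustrative.

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    """The full transcript plus the new turn is what the model sees."""
    return history + [{"role": "user", "content": user_turn}]

def fits_context(messages: list[dict], max_tokens: int = 8192) -> bool:
    """Rough check that the conversation still fits the context window."""
    total = sum(len(m["content"].split()) for m in messages)
    return total <= max_tokens
```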
So if you give it a huge document, it needs a large context window to fit the whole thing and still talk to you. If you feed it a document that's too big for its context window, here's the real issue: it starts compressing and losing some of that data. It might still respond well, but it might also hallucinate. Hallucinating just means it doesn't really know and makes something up. That's a big problem with LLMs: instead of saying "I don't know," a lot of them will make things up and make it sound like they know what they're **[00:08:00]** saying. You get a false positive where you think it's telling you the truth. I've seen a lot of people come to me with "but the LLM told me," not knowing it basically told them a lie. We have to be careful, especially in business, and especially in healthcare, finance, and other regulated industries; that is simply not okay there. So what's the other approach? That's where vector stores come in. Retrieval augmented generation, RAG, is how you can give an LLM access to a big corpus of data in a bite-sized way. So let's talk about vector stores: what they are and how they work. I'll try not to get too technical, because that makes a lot of people glaze over. You take the document and run it through something called an embedding model, which splits the document into a bunch of little chunks and places them in a special type of database. Imagine a plot of dots in three-dimensional space.
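Before getting to how the comparison works, the chunking step just described can be sketched in a few lines. The chunk size, overlap, and character-based windows below are illustrative choices, not any particular embedding pipeline's defaults.

```python
# Minimal sketch of ingestion: split a document into fixed-size,
# overlapping windows; each chunk would then be embedded and stored.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reaches the end of the text
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, which is why it is a common default in real pipelines.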
Those dots are semantically organized, so when you ask something, the embedding process grabs your question, transforms it into that same kind of structure, and compares it. To put it very simply, it's doing a numbers comparison, so it finds things that are semantically relevant. It's very good **[00:10:00]** at that. So say you have a general knowledge base: documentation with terms most people understand, frequently asked questions, that sort of thing. You can ask it questions and it will very likely give you a good response. The problem with embeddings and pure vector semantic search is that it can have a hard time finding very specific terms, serial numbers, part numbers, or a very specific keyword that isn't contextually relevant. If you're searching for something like that, it might simply not find it. That's one of the downsides of retrieval augmented generation using semantic search alone. By the way, I'll pause here: if you have any questions about this, put them in the chat wherever you are, LinkedIn, YouTube, Facebook, Instagram, and if I see them live, I'll answer; if not, I'll answer in the comments afterwards.
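That "numbers comparison" can be made concrete: both the stored chunks and the query become vectors, and retrieval ranks chunks by cosine similarity to the query vector. The two-dimensional vectors below stand in for real embeddings, which have hundreds or thousands of dimensions.

```python
# Toy semantic retrieval: rank chunk vectors by cosine similarity to the
# query vector. In practice the vectors come from an embedding model.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k chunks most similar to the query."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```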
So doing this sort of vector embedding and retrieval augmented generation for an LLM in your company is very useful. Say you're building a chatbot for people to ask "what is your refund policy?": it will very easily give them a response based on your data. And even with basic, traditional retrieval augmented generation, you can tell the LLM that if the query returned no **[00:12:00]** results, it should just say "I don't know." That's one of the pluses: right from the start, when it doesn't find something, it can say "I don't know," and that's already a step beyond a model that might make stuff up. So again, in business you might have a chatbot, a voice agent, or something else, and you want to make sure the output is grounded in truth, specifically your truth, from your company's knowledge base. Now, if there's the downside of "I need it to find serial numbers, part numbers, or very specific things," or you've been playing around with it, added a corpus of many documents, and it sort of works but sometimes just doesn't give the right answer, then there's a next step. There are many approaches; this is not exhaustive, there are too many ways to do this, but I'll explain a few so you have an idea of what's possible. If your RAG is still hallucinating, which is supposedly today's topic, and you don't know what's going on, my first suggestion is: don't assume you have to go with whatever your provider gives you. Say you use ChatGPT; even on the developer platform, it lets you have a vector store. Great.
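The "just say I don't know" guardrail mentioned above is worth sketching: skip the LLM entirely when retrieval comes back empty, and otherwise instruct it to answer only from the retrieved chunks. `call_llm` here is a placeholder for whatever provider API you're using, not a real client.

```python
# Sketch of the "just say I don't know" fallback: no retrieved chunks means
# no LLM call at all, and when there are chunks, the prompt restricts the
# model to them. call_llm is a stand-in for a real provider call.

NO_ANSWER = "I don't know."

def grounded_answer(query: str, chunks: list[str], call_llm) -> str:
    if not chunks:
        return NO_ANSWER  # refuse instead of letting the model guess
    prompt = (
        "Answer ONLY from the context below. If the context does not "
        f"contain the answer, reply exactly: {NO_ANSWER}\n\n"
        "Context:\n" + "\n---\n".join(chunks) +
        f"\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```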
And that provider vector store works very nicely: you can upload documents and it will work. You can do the same if you're using Voiceflow to **[00:14:00]** create chat or voice agents, or Vapi to create telephone voice agents, or anything like that. These tools give you a knowledge base or a vector store, but you have no idea what happens behind the scenes. It's a very good first step, and if it works for you, perfect, you can leave it at that. But if you're having issues and want the data to be better grounded, here are a few other approaches to consider. The first step, which we've done many times before, is to separate the concerns. Maybe you're using n8n: you have workflows, you're building things that connect, you have a chatbot that talks to your knowledge base, and n8n makes some of those connections. Instead of going directly to OpenAI, using the Responses API, for example, to send the query and have its model talk to the embeddings behind the scenes as a black box, then bring everything back with little visibility into what's going on, you can separate the steps. You can have your own ingestion flow, and I can cover this visually in detail in another video. You connect the embedding model, which could still be an OpenAI embedding model, which is very good, and write the results into your Supabase database. You can have a vector store inside Supabase, which is really nice because it's basically free. And just by doing that, **[00:16:00]** without anything else I'm going to talk about later, you get the flexibility of attaching additional data to it.
So instead of chunking however the system dictates, you have more control over what goes into the vector store. Instead of just storing the chunks, you can add context, and I'll mention some of these approaches in a second. You can add metadata: what type of document it is, a specific category, or a specific sport if you're covering multiple sports. You can add metadata to the documents and to the chunks themselves, so that when retrieval happens later, it's more context-aware: it knows more about the document and the chunk instead of seeing an isolated chunk. That isolation is another downside. Sometimes a chunk says something like "click on the next page to find more," and a vector search can find that text, but what does it mean? What section is it in? Especially if you're retrieving from an instruction manual, a phrase like that might appear a hundred times in the document, and you only care about, say, the instructions for the microwave, not the other parts of the manual. So there are many ways to augment how this augmented generation works. All right, let's talk about some of these approaches. I'll try to show some graphics so **[00:18:00]** that it makes more sense for those of you who are more visual, and we'll see with numbers how it can actually get better. First, like I was saying, you can do the embeddings on your own, which is great, and then you can use any LLM. You can have ChatGPT talk to your knowledge base. You can have Claude talk to your knowledge base. You can have Gemini talk to your knowledge base.
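To make the metadata idea concrete, here's a sketch of the rows you might build in your own ingestion flow. The field names and the `embed` stub are illustrative; in practice, rows like these would land in, for example, a Supabase table with a vector column.

```python
# Sketch of owning the ingestion step: each chunk becomes a row carrying
# metadata (document type, section) so retrieval can filter cheaply
# before any vector comparison. Field names are illustrative.

def build_rows(doc_id: str, doc_type: str, chunks: list[tuple[str, str]], embed) -> list[dict]:
    """chunks is a list of (section, text) pairs; embed maps text to a vector."""
    rows = []
    for i, (section, text) in enumerate(chunks):
        rows.append({
            "doc_id": doc_id,
            "chunk_index": i,
            "doc_type": doc_type,   # e.g. "manual", "contract"
            "section": section,     # where in the document this chunk lives
            "content": text,
            "embedding": embed(text),
        })
    return rows

def filter_rows(rows: list[dict], doc_type: str) -> list[dict]:
    """Cheap metadata pre-filter applied before the similarity step."""
    return [r for r in rows if r["doc_type"] == doc_type]
```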
You can have all of them talk to your knowledge base directly, and you're not tied to whatever vector store comes with the tool you're using. In any tool that gives you access to vector stores, you can typically also plug in your own. And as I said, this gives you more flexibility over what you embed along with the chunks. Now, I mentioned some of the downsides, and next I'll talk about what you can do to make it better. I'll see if I can show some imagery here. We initially talked about standard retrieval augmented generation, so I'm going to share an image. Let me see if I can share my screen and show this particular image. Yeah, it's there. And can I see myself down here? Perfect. So this is what standard RAG would be, with one small added step I'll get to in a moment. If you look at this image, and I'll **[00:20:00]** explain it anyway if you're just listening: with standard RAG, you have your corpus of data on the left side, all your documents. They were cut into little pieces, chunked, and put in the vector store. In the top part, they go through an embedding model and into the vector database. When you ask a query, it goes to that last step: it grabs your query, runs it through the embedding model again, compares, and gets results back. Say you get five results back; those go directly to the final step, the generative model, in this case Claude. So your LLM gets five or ten items from the vector store: all the chunks.
Imagine little paragraphs, little pieces of the document. The model gets them, says "okay, I sort of have my context," and gives you an answer based on that. That's the basic way to do it. Now, if you want to take it further, there's something called hybrid search, which is very powerful. Hybrid search combines the best of both worlds. You have semantic search, which is what we've been talking about with the vector store, but you can also have free text search, which is keyword search. Say you're using Supabase, which is what we like to use for clients. Supabase lets you do full text search on documents **[00:22:00]** within your database. What does that mean? If I search for PG-001, it can find that exact string within the documents, like a regular search would. We were just talking about how regular semantic search can miss those things; with full text search, you catch them. But you don't want just one or the other, and here's where it gets really interesting. With hybrid retrieval augmented generation, as we see here, the data goes into the database, and at query time two things happen in parallel: a full text search and a vector search. It gets both result sets back and does reciprocal rank fusion. That's a mouthful, but what it does is merge them and decide which results are more relevant to what you asked. It hands those to your LLM, which can then say, "Perfect, I have what I need, here's your answer." That gives you very good results.
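Reciprocal rank fusion itself is only a few lines. Each result list contributes `1/(k + rank)` per document, the scores are summed, and the combined order is the fused ranking; `k = 60` is the conventional smoothing constant.

```python
# Minimal reciprocal rank fusion: merge ranked ID lists (e.g. one from
# vector search, one from full text search) into a single ranking.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest combined score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

A document that appears near the top of both lists outranks one that scores highly in only one of them, which is exactly the behavior you want from hybrid search.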
There's another technique: contextual retrieval pre-processing. This can be done in several ways, but the idea is to add extra context to each chunk, like the metadata we discussed earlier. Instead of storing just the chunk, you record where in the document it is, what part of the document it **[00:24:00]** comes from, what the previous section covers, what the main containing section covers. This gets really interesting, because with contextual embeddings your data is stored so that a query returns not just chunks but knowledge of each chunk's containing context. Going back to what we discussed: semantic search is very good for casting a wide net, fetching lots of different chunks that might make sense. Keyword search, run at the same time, gets you very specific results. You can combine both, and contextual retrieval with pre-processing adds context on top. I heard a very good example the other day: say you're looking at an insurance coverage sheet and you ask, "is tennis elbow covered?" The search finds a chunk that mentions tennis elbow, but nothing preceding it was retrieved, so the model says, "yes, I found it in the document, it's covered." If you looked at that document as a human, though, you'd see a heading higher in the hierarchy saying "excluded conditions." That would seem like a hallucination, but it's not really a hallucination.
It's that **[00:26:00]** the model found data that wasn't accurately described, because it found it out of context. Contextual retrieval lets you expand on that, and there are many ways to do it. You can do expansion, where after a match you also fetch the higher level around it, so you get more data. Some people go as far as saying, "I found a good chunk among thousands of documents; let me bring the whole document into the context." That also works, but it's expensive; it really depends on what you need. So contextual expansion is another way to get better data around those chunks. As you can see, you can go down a very deep rabbit hole of what can be done with retrieval augmented generation just to get the right data for your business. But that's the great thing: it's possible. You're not stuck with just dropping a document into the chat; you can make it precise and tailor it to give exactly the type of responses you want. So we talked about contextual retrieval, which is great, but when you look at the numbers, that's where it gets really interesting. I'm going to share this other image. What we're seeing here is the retrieval error rate: for regular embeddings it's about 5.7%. **[00:28:00]** If you add something like BM25, full text search on Postgres like we were discussing, it goes down to about 5%. Contextual embeddings on their own bring it to about 3.7%, and contextual embeddings plus contextual BM25 bring it to 2.9%. It just keeps getting better. And then you can combine everything.
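The two ideas just described, contextual pre-processing and contextual expansion, can both be sketched briefly. Real pipelines often have an LLM write the context sentence for each chunk; this sketch just uses the heading path, which is enough to show why the tennis elbow chunk would no longer be ambiguous. The document title and headings are made up for illustration.

```python
# Contextual pre-processing: embed the chunk together with where it sits
# in the document, so "tennis elbow" under "Excluded conditions" embeds
# and retrieves differently from the same words elsewhere.

def contextualize(chunk: str, doc_title: str, heading_path: list[str]) -> str:
    prefix = f"Document: {doc_title}. Section: {' > '.join(heading_path)}."
    return prefix + "\n" + chunk

# Contextual expansion: given the index of a matching chunk, return it
# with `window` neighbors on each side; a large window approximates
# "bring in the whole document", at higher cost.

def expand(chunks: list[str], hit_index: int, window: int = 1) -> list[str]:
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return chunks[lo:hi]

text = contextualize(
    "Tennis elbow and other repetitive-strain injuries.",
    "Policy coverage sheet",
    ["Coverage", "Excluded conditions"],
)
```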
So we were talking about combining things: contextual embeddings, plus the vector search, the full text search or BM25, reciprocal rank fusion, and then the final step I haven't talked about, the reranker. The reranker is a last step in the chain that very intelligently narrows down the results. If we go back to everything we've discussed: all of this is just handing a bunch of chunks to your LLM so it can generate an answer. But the retrieval pulled discrete, separate things and brought them together, and the LLM is the one that has to decide which ones to use. Everything we've mentioned is very good and will give you a much better, down-to-earth response. But sometimes the original query is complex: it has many layers, abstractions, maybe negations, "make sure you don't do X, Y, Z." That is not understood very well by cosine similarity and the like. So **[00:30:00]** what a reranker model like Cohere's does is take whatever the results were; the fetching is already done, and in this case you might even over-fetch, asking for 20 or 50 results instead of a handful. Cohere then takes all of them and, based on the original query, uses a model trained on relevance to find which ones are the most valid responses and reranks them. It puts them in a better order and feeds them to the model.
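To show the shape of that step without calling a real reranker, here's a toy version: over-fetch candidates, score each against the original query, and keep only the top few. A production pipeline would replace the word-overlap score below with a trained relevance model such as Cohere's rerank endpoint.

```python
# Toy reranker: re-order over-fetched candidates by relevance to the
# original query and keep only top_n. The word-overlap score is a
# stand-in for a trained reranking model.

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    q_words = set(query.lower().split())
    def score(candidate: str) -> int:
        # crude relevance: how many query words the candidate shares
        return len(q_words & set(candidate.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:top_n]
```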
So by the time the LLM gets them, it already has all the data it needs, ranked by priority, and it can just say, "Perfect, here's your answer." At the end you might hand only three items to the LLM, three items highly tuned to be exactly what it needs to provide a grounded answer. I hope that makes sense. Now, the benefit of doing all that: I'm going to share one last thing on the screen, if it lets me, of course. Here we are. We were looking at percentages: regular embeddings alone fail on 5.7% of retrievals, but **[00:32:00]** adding reranking on top of hybrid search brings that down to 3.5%, and doing everything we talked about brings it down to 1.9%. I believe that's about a 67% decrease in failed retrievals, which is great. So that's something to consider if you're using an agent or any sort of LLM for your business, whether a chatbot, a voice agent receiving calls as a receptionist, or your own AI companion agent like I was talking about the other day, where as a CEO you can ask about what's going on in your business. You want to make sure it's giving you the most reliable, grounded data. So instead of blindly trusting whatever OpenAI, Anthropic, or Google gave you, you can build this flow in something like n8n, ingest the data exactly the way you need it, and then extract the data exactly the way you need it.
All of this becomes much faster and more efficient, and you get much better results, much more grounded in the truths of your knowledge base. If you want help with any of this, please reach out. You can send me a message through any of the platforms or go to prescaro.com; you can find us there as well. And if you're just joining us, you can find the full recording on the Web Talk Show podcast on Spotify, Apple Podcasts, **[00:34:00]** wherever podcasts are found. If you have any questions, write them in the comments; I'll go through all the platforms and answer them directly. For direct questions, you can also use the website, but any of the platforms work just as well. All right, I tried not to make this too technical, though it still got a little technical. I can do another stream where we walk through the contents of a workflow like this, so you can see how to build it in n8n, with your Supabase store and a way to retrieve the data more accurately. All right, well, thanks everyone for joining.