Who am I? How to give an LLM access to new data



At some point in life, you've probably googled your own name. Google will usually return links to things such as your Facebook profile, your LinkedIn profile, information about someone with the same name as you (there's a Nathan Brake who was on The Voice Australia), or maybe a news article from when you made the honor roll freshman year of high school.

When I ask ChatGPT "Who is Nathan Brake?", here's what I get back:

I'm sorry, but I don't have any information on a person named Nathan Brake in my pre-2022 knowledge base. It's possible that Nathan Brake is a private individual or a relatively unknown figure, or the name might be misspelled. If you have more context or details about who Nathan Brake is or why you're inquiring about them, I may be able to provide more relevant information.

This is a major roadblock for using a Large Language Model (LLM): if you want to use it as a general knowledge source, it needs access to up-to-date information about the world. Although we could re-train or fine-tune the LLM on newer data, this quickly becomes extremely expensive and inefficient, because new information that the model would be expected to know is created constantly.

One way to give the LLM knowledge of current data is called Retrieval Augmented Generation (RAG) (paper). Originally proposed for seq2seq (encoder-decoder) transformer models, it has seen great success in the world of autoregressive (decoder-only) transformers as well.

In this blog post I'll explain what RAG is and walk through code to add this power to the LlaMA-2 LLM. We will build a simple example where we load the LlaMA-2 13b model, create and populate a datastore, and enable the LLM to answer the question "Who is Nathan Brake".

You can navigate to the Google Colab Jupyter notebook I provide and run this code to try it out for yourself!

What is Retrieval Augmented Generation (RAG)

In simple terms, RAG is a method that adds a datastore alongside an LLM so that selected knowledge from the datastore can be inserted into the context of the LLM prompt. Jacky has a good, slightly longer explanation on Medium.

Because LLMs typically support only a limited context length, we can't simply feed in a large number of internet resources, Wikipedia, etc. Instead, we need to selectively retrieve the right data to insert into the context.
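To make the context limit concrete, here is a minimal sketch of packing retrieved text into a fixed token budget. It is not what llama-index does internally: the pack_context helper is hypothetical, the chunks are made up, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def pack_context(ranked_chunks, max_tokens):
    """Keep the best-ranked chunks, in order, until the token budget runs out."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        n_tokens = len(chunk.split())  # crude stand-in for a real tokenizer
        if used + n_tokens > max_tokens:
            break  # the next chunk would overflow the context window
        selected.append(chunk)
        used += n_tokens
    return selected, used


# Hypothetical chunks, already sorted from most to least relevant.
chunks = [
    "Nathan Brake is a Machine Learning Engineer in Pittsburgh",
    "He writes a blog about NLP and large language models",
    "Unrelated headline about the weather",
]
selected, used = pack_context(chunks, max_tokens=20)
print(f"Kept {len(selected)} chunks using {used} tokens")
```

The point is only that something has to decide what goes into the prompt and what stays out; the rest of this post is about making that decision with document embeddings.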

In order to populate this datastore, we will ingest documents of our choice (in our example, we will scrape some websites), and then convert them to "document embeddings". In my previous post here I explain how word embeddings are n-dimensional vectors that help to organize words/tokens such that similar tokens are close to each other in the vector space. What if instead of single words, we used multiple words? If we map chunks of words into an n-dimensional vector space, we then have a way to place chunks with similar content close to each other.

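We will use the all-MiniLM-L6-v2 model to convert chunks of up to 1024 tokens into vectors in a 384-dimensional space. (Note that this model truncates inputs longer than 256 word pieces by default, so only the beginning of a long chunk influences its vector.)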
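Here is a toy sketch of the idea of searching an embedding space. To keep it runnable without downloading a model, I swap the learned 384-dimensional embedding for a simple word-count vector over a tiny made-up vocabulary; the embed, cosine_similarity, and top_k helpers are hypothetical names for illustration, not part of any library used in this post.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, vocabulary, k):
    """Return the k (score, chunk) pairs closest to the query, best first."""
    query_vec = embed(query, vocabulary)
    scored = [(cosine_similarity(query_vec, embed(c, vocabulary)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


vocabulary = ["nathan", "brake", "engineer", "weather", "stocks", "rain"]
chunks = [
    "nathan brake is a machine learning engineer",
    "rain and more rain in the weather forecast",
    "stocks fell sharply today",
]
results = top_k("who is nathan brake", chunks, vocabulary, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

Notice that asking for k=2 returns two chunks even though only one has anything to do with the query. The same thing shows up later in this post when the real search returns irrelevant pages.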


Thankfully, due to the popularity of all things LLM, plenty of people are building libraries that make our lives easier. We will use the following libraries in this project:

  • Huggingface Transformers (docs): Python library for loading transformer models (for us, an LLM and a Sentence Transformer)
  • BitsAndBytes (docs): Python library add-on to Transformers that allows for easy quantization of the LLM so that we can fit it into the GPU memory of a cheap/free T4 GPU on Google Colab
  • Llama-Index (docs): Python data framework for connecting custom data sources to LLMs. We use this to store and access the data we parse from websites
  • Llama-Hub (docs): Python library of custom datastore plugins (this is the library that connects the web scraping functionality to llama-index)
  • Langchain (docs): Python library for context-aware reasoning. We will only use it for its text embedding model framework, which helps us perform a search over the document embedding space

Let's Code

Let's get to the fun stuff!

I put all of this in a Google Colab Jupyter Notebook here. You can run it directly and play around with it after doing four setup steps (if you google around, you can find detailed instructions for each of them):

  1. Change Colab Runtime to use a T4 GPU
  2. Log in/Create a Huggingface Account and request access to the LlaMA-2 model family
  3. Create an access token for your Huggingface account
  4. Add your Huggingface Account Token to the Colab Secrets

If you want to run this and hit issues, feel free to reach out for some help 😁

Install Pre-reqs

First, install all the libraries we need:

!pip install torch langchain einops accelerate transformers bitsandbytes llama-index llama_hub sentence-transformers --quiet

Next, import everything that we will use:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
from llama_index.embeddings import LangchainEmbedding
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.prompts.prompts import SimpleInputPrompt
from llama_index.llms import HuggingFaceLLM
from llama_index import set_global_service_context, ServiceContext, VectorStoreIndex, SimpleDirectoryReader, download_loader
from google.colab import userdata

Let's load up the model and the tokenizer. There are a few tricks happening here:

  • Grab the auth token from where it was added in the Google Colab secrets (never put authentication tokens in source code or share your secrets publicly)
  • Use Rotary Position Embedding (RoPE) scaling to increase the maximum supported context length of the model (docs). LlaMA-2 only supports a ~4k token context length out of the box, but we can use RoPE scaling to boost it up to 8k without the need to fine-tune it again.
  • Quantize the model to 4 bits (docs). The model weights are generally 16-bit float numbers. By quantizing all of the weights to 4-bit representations, we can fit the LlaMA-2 13b (13 billion parameter) model into the 16GB T4 GPU that is available for free use on Google Colab.
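
name = "meta-llama/Llama-2-13b-chat-hf"
# Read the Huggingface token from the Colab secrets
auth_token = userdata.get('hf_token')
tokenizer = AutoTokenizer.from_pretrained(name, use_auth_token=auth_token)
# Create model
# NOTE: the original listing is cut off after rope_scaling; the remaining arguments are an assumed completion
model = AutoModelForCausalLM.from_pretrained(name,
    rope_scaling={"type": "dynamic", "factor": 2},
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
    use_auth_token=auth_token)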
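To see why the 4-bit trick matters on a T4, here is a back-of-the-envelope estimate of how much memory the weights alone need. This is a rough sketch: it ignores activations, the KV cache, and the small overhead that quantization adds for its scaling constants, and weight_memory_gb is just a helper name I made up.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 13e9  # LlaMA-2 13b has roughly 13 billion parameters
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")

# RoPE scaling with a factor of 2 doubles the ~4k context length to ~8k.
base_context = 4096
scaled_context = base_context * 2
print(f"Context length with RoPE scaling: {scaled_context} tokens")
```

At 16 bits the weights alone are well past the 16GB on a T4, while at 4 bits they leave room to spare for everything else.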

Now that the model is ready, we will put the model inside of the llama-index wrapper. This is setting the stage for when we put both the LLM and the document embedding model into the main llama-index wrapper that will handle most of the complexity around RAG for us (so many wrappers!). We will use the new chat_template feature in huggingface that I talked about in a previous blog post. One of the fun things about working with newly released open source is that it can have bugs that we get to find and help contribute back to the community to fix! You'll see a bit of a hack put in below with a link to the issue I filed to help notify the huggingface maintainers of the need to fix a bug in the chat_template logic for the LlaMA-2 model family.

Once the system_prompt is ready (this is what "sets the stage" for what the LLM's purpose is for us), we pass it, together with a SimpleInputPrompt wrapper, to llama-index, which uses them to properly format our queries.

chat = [
  {"role": "system", "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible."},
  {"role": "user", "content": ""},  # empty user turn needed because of bug https://discuss.huggingface.co/t/issue-with-llama-2-chat-template-and-out-of-date-documentation/61645; the added [/INST] is removed below
]
system_prompt = tokenizer.apply_chat_template(chat, tokenize=False).replace(" [/INST]", "")
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")

# NOTE: the original listing is cut off after context_window; the remaining arguments are an assumed completion
llm = HuggingFaceLLM(context_window=4096,
                     system_prompt=system_prompt,
                     query_wrapper_prompt=query_wrapper_prompt,
                     model=model,
                     tokenizer=tokenizer)
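If you're curious what the system prompt and query wrapper boil down to, here is a hand-written approximation of the LlaMA-2 chat format. The function names are mine, and this sketch does not call the tokenizer's chat_template; the exact whitespace llama-index uses when it joins the two pieces may differ.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_system_prefix(system_message):
    """Opening of a LlaMA-2 chat prompt: BOS token, [INST], and the system block."""
    return f"<s>{B_INST} {B_SYS}{system_message}{E_SYS}"


def wrap_query(query):
    """Mirror of the "{query_str} [/INST]" wrapper: close the instruction."""
    return f"{query} {E_INST}"


def build_prompt(system_message, query):
    # Assumption: the two pieces are simply concatenated.
    return build_system_prefix(system_message) + wrap_query(query)


prompt = build_prompt("You are a helpful assistant.", "Who is Nathan Brake?")
print(prompt)
```

Splitting the prompt this way is what lets the system prompt be prepared once while each query is wrapped later.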

Now that the LLM is good to go, we'll load up the document embedding model. Langchain and llama-index abstract away all of the complexity of using and querying the embedding model.

Putting it all together, we create a llama-index ServiceContext object, which is the interface that manages passing data between the LLM, the document embedding model, and back to us. The service splits the documents it receives into chunks of 1024 tokens with an overlap of 20 tokens (the default settings).


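# Embedding model (assumed reconstruction: the original listing is cut off and does not show these arguments)
embeddings = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
# Create new service context instance
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model=embeddings)
# And set the service context
set_global_service_context(service_context)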
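The chunking with overlap described above can be sketched as a sliding window over a list of tokens. This is a simplified illustration, not llama-index's actual splitter (which, as far as I know, also tries to respect sentence boundaries); chunk_tokens is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Small example so the overlap is easy to see.
small = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
print(small)

# With the settings from this post, a 3000-token page becomes 3 chunks.
page = chunk_tokens(list(range(3000)), chunk_size=1024, overlap=20)
print(len(page))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one of the two neighboring chunks, as long as it is shorter than the overlap.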

As the second-to-last step, we will pass in some websites that we want to be scraped. Once again, a library handles all of the messy complexity for us: the llama_hub BeautifulSoupWebReader has all of the web-scraping logic to pull the text and metadata out of the websites we provide. We'll also add a debug handler so that we can print out the final prompt that was created with the injected data.

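BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
# NOTE: the original URL list is cut off; these two URLs are the ones that appear in the debug output shown later
urls = [
    "https://www.natebrake.com/about",
    "https://www.nytimes.com/",
]
# Scrape the pages, then chunk, embed, and store them in the vector index
documents = loader.load_data(urls=urls)
index = VectorStoreIndex.from_documents(documents)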

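# Debug logging setup
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager, llm=llm
)
# Assumed completion (the original listing is cut off): build the query engine with a top_k of 3
query_engine = index.as_query_engine(service_context=service_context, similarity_top_k=3)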

Finally! The most fun part. Try out some queries and see what comes up:

response = query_engine.query("Who is Nate Brake? Be straightforward, this isn't a pump up speech. Talk like an australian and cite your sources")

# Print info on llm inputs/outputs - returns start/end events for each LLM call
event_pairs = llama_debug.get_llm_inputs_outputs()
print("The input prompt to LLM was:")
# Assumed completion (the original listing omits this line): show the payload of the first LLM start event
print(event_pairs[0][0].payload)
# Print the response
print(f"Response is: \n{response}")

Let's look at the output that comes from this code block.

The input prompt sent to the LLM after the search over the document embeddings is shown below. This is the exact text that was sent to the model.generate() function of the LlaMA-2 model. llama-index and langchain handle the search of the vector database using the document embeddings and prepend the retrieved data to our prompt context. Note that the search returns 3 items. The first one (the about page from my website) is relevant and has the info we need, but the other 2 are irrelevant. They were returned because we specified a top_k of 3, which means the search returns the 3 closest matches even if only 1 of them is actually relevant.

Context information is below.

URL: https://www.natebrake.com/about

About - Nathan BrakeNathan BrakeAI BlogTagsAboutAI BlogTagsAboutAboutNathan BrakeMachine Learning Engineer3M | M*ModalmailgithublinkedintwitterHi! My name is Nathan Brake. I'm a Machine Learning Engineer and Researcher at 3M in Pittsburgh, PA. I focus on designing and training deep neural networks for clinical note generation.I graduated from Grove City college in 2015 with a B.S. in Electrical Engineering, and from Georgia Tech in 2022 with an M.S. in Computer Science, with a concentration on Machine Learning & AI.I'm passionate about software engineering and working out how to fit together all the pieces in order to deliver technology that excites our customers.I spend most of my time trying to keep up with the blazing fast pace of innovation in the NLP space.Subscribe to the newsletterEmail addressSign upmailgithublinkedintwitterNathan Brake • © 2023 • Nathan Brake

URL: https://www.nytimes.com/

Given the context information and not prior knowledge, answer the query. Query: Who is Nate Brake? Be straightforward, this isn't a pump up speech. Talk like an australian and cite your sources Answer:
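The structure of the prompt above can be approximated with a small template function. This is only a sketch that mimics the text shown here: build_rag_prompt is a hypothetical helper, and the real llama-index template may use different separators and whitespace.

```python
def build_rag_prompt(retrieved, query):
    """Prepend retrieved (url, text) pairs to the query, mimicking the prompt above."""
    context = "\n\n".join(f"URL: {url}\n\n{text}" for url, text in retrieved)
    return (
        "Context information is below.\n\n"
        f"{context}\n\n"
        "Given the context information and not prior knowledge, answer the query. "
        f"Query: {query} Answer:"
    )


# One retrieved item, using a short excerpt of the about page shown above.
retrieved = [("https://www.natebrake.com/about", "Hi! My name is Nathan Brake.")]
prompt = build_rag_prompt(retrieved, "Who is Nate Brake?")
print(prompt)
```

Everything the model "knows" about me arrives through that context section; the instruction to ignore prior knowledge is what nudges it to answer from the retrieved text.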

With that final prompt provided to the LlaMA LLM, this is the response that it gives:

Response is:

G'day mate! I'm here to help answer yer query about who this Nate Brake bloke is. Now, I've done me research and found some top-notch sources that'll give ye the lowdown on this fella.

First off, Nate Brake is a machine learning engineer and researcher at 3M in Pittsburgh, Pennsylvania. He's got a swag of qualifications under his belt, including a B.S. in Electrical Engineering from Grove City College and an M.S. in Computer Science from Georgia Tech, with a concentration on machine learning and AI. (Source: https://www.natebrake.com/about)

Now, Nate's a bit of a whiz when it comes to designing and training deep neural networks for clinical note generation. He's passionate about software engineering and figuring out how to make all the pieces fit together to deliver technology that excites customers. (Source: https://www.natebrake.com/about)

Oh, and did I mention he's also a bit of a wordsmith? He's got a nifty blog where he shares his thoughts on machine learning, AI, and the like. It's worth checkin' out if ye want to learn more about the inner workings of these technologies. (Source: https://www.natebrake.com/blog/)

There ye have it, mate! That's the scoop on Nate Brake. Hope that helps ye out. Cheers! (Source: https://www.natebrake.com/about)

There we have it! You can add additional webpages which will give LlaMA more information to draw from when it answers questions.