Who am I? How to give an LLM access to new data


DALL-E 3 prompt: A picture of Pittsburgh from Mt. Washington in a cyberpunk, neon, utopian future.

Introduction

At some point in life, you've probably googled your own name. Google usually returns links to things like your Facebook profile, your LinkedIn profile, information about someone who shares your name (there's a Nathan Brake who was on The Voice Australia), or maybe a news article from when you made the honor roll freshman year of high school.

When I ask ChatGPT "Who is Nathan Brake?", here's what I get back:

I'm sorry, but I don't have any information on a person named Nathan Brake in my pre-2022 knowledge base. It's possible that Nathan Brake is a private individual or a relatively unknown figure, or the name might be misspelled. If you have more context or details about who Nathan Brake is or why you're inquiring about them, I may be able to provide more relevant information.

This is a major roadblock for using a Large Language Model (LLM): if you want to use it as a general knowledge source, it needs access to up-to-date information about the world. Although we could re-train or fine-tune the LLM on newer data, this quickly becomes extremely expensive and inefficient because new information that the model is expected to know is created so rapidly.

One way to give the LLM knowledge of current data is called Retrieval Augmented Generation (RAG) (paper). Originally proposed for seq2seq (encoder-decoder) transformer models, it has seen great success in the world of autoregressive (decoder-only) transformers as well.

In this blog post I'll explain what RAG is and walk through the code to add this power to the LlaMA-2 LLM. We will build a simple example where we load the LlaMA-2 13b model, create and populate a datastore, and enable the LLM to answer the question "Who is Nathan Brake?"

You can navigate to the Google Colab Jupyter notebook I provide and run the code to try it out for yourself!

What is Retrieval Augmented Generation (RAG)?

In simple terms, RAG is a method for adding a datastore alongside an LLM so that selected knowledge from the datastore can be inserted into the context of the LLM prompt. Jacky has a good, slightly longer explanation on Medium.

Because LLMs typically support only a limited context length, we can't simply feed in a large pile of internet resources, Wikipedia articles, and so on; we need to selectively retrieve the right data to insert into the context.
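To make this concrete before we touch any libraries, here is a simplified sketch of what RAG does at query time. It mirrors the prompt template that llama-index will build for us near the end of this post (the function name and structure are just for illustration):

def build_rag_prompt(retrieved_chunks, question):
    # Join the chunks returned by the datastore search
    context = "\n\n".join(retrieved_chunks)
    # Prepend them to the user's question, instructing the LLM to
    # answer from the supplied context rather than prior knowledge
    return (
        "Context information is below.\n\n"
        f"{context}\n\n"
        "Given the context information and not prior knowledge, "
        f"answer the query. Query: {question} Answer:"
    )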

In order to populate this datastore, we will ingest documents of our choice (in our example, we will scrape some websites) and convert them to "document embeddings". In my previous post here I explain how word embeddings are n-dimensional vectors that organize words/tokens so that similar tokens are close to each other in the vector space. What if, instead of single words, we embedded chunks of text? If we map chunks of words into an n-dimensional vector space, we have a way to group content based on what those chunks actually say.

We will use the all-MiniLM-L6-v2 model to map our 1024-token chunks into a 384-dimensional vector space. (Note that this embedding model truncates inputs longer than 256 word pieces, but the resulting embeddings are still useful for retrieval.)
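Here's a small standalone sketch of that idea using the sentence-transformers library directly (the two example chunks are made up; in the full pipeline below, llama-index drives this model for us):

from sentence_transformers import SentenceTransformer, util

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "Nathan Brake is a machine learning engineer at 3M in Pittsburgh.",
    "A first-class dinner menu from the Titanic sold at auction.",
]
query = "Who is Nathan Brake?"

chunk_vecs = embed_model.encode(chunks)  # shape (2, 384)
query_vec = embed_model.encode(query)    # shape (384,)
# Cosine similarity against each chunk: the one about Nathan scores far higher
print(util.cos_sim(query_vec, chunk_vecs))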

Toolkit

Thankfully, due to the popularity of all things LLM, there are plenty of people writing libraries to make our lives easier. We will use the following libraries in this project:

  • Huggingface Transformers docs: Python library for loading transformer models (for us, an LLM and a sentence transformer)
  • BitsAndBytes docs: Python library add-on to Transformers that allows for easy quantization of the LLM so that it fits in the GPU memory of a cheap/free T4 GPU on Google Colab
  • Llama-Index docs: Python data framework for connecting custom data sources to LLMs. We use this to store and access the data we parse from websites
  • Llama-Hub docs: Python library of custom datastore plugins (this is the library that connects the web-scraping functionality to llama-index)
  • Langchain docs: Python library for context-aware reasoning. We will only use its text embedding model framework, which helps us perform the search over the document embedding space.

Let's Code

Let's get to the fun stuff!

I put all of this in a Google Colab Jupyter Notebook here. You can run it directly and play around with it after doing four setup steps (if you google around you can find detailed instructions for each):

  1. Change Colab Runtime to use a T4 GPU
  2. Log in/Create a Huggingface Account and request access to the LlaMA-2 model family
  3. Create an access token for your Huggingface account
  4. Add your Huggingface Account Token to the Colab Secrets

If you want to run this and hit issues, feel free to reach out for some help 😁

Install Pre-reqs

First, install all the libraries we need:

!pip install torch langchain einops accelerate transformers bitsandbytes llama-index llama_hub sentence-transformers --quiet

Next, import all the functions that we will use:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
from llama_index.embeddings import LangchainEmbedding
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.prompts.prompts import SimpleInputPrompt
from llama_index.llms import HuggingFaceLLM
from llama_index import set_global_service_context, ServiceContext, VectorStoreIndex, SimpleDirectoryReader, download_loader
from google.colab import userdata

Let's load up the model and the tokenizer. There are a few tricks happening here:

  • Grab the auth token from where it was added in the Google Colab secrets (never put authentication tokens in source code or share your secrets publicly)
  • Use Rotary Position Embedding (RoPE) scaling to increase the maximum supported context length of the model (docs). LlaMA-2 only supports a ~4k token context length out of the box, but we can use RoPE scaling to boost it to 8k without re-finetuning.
  • Quantize the model to 4 bits (docs). The model weights are generally 16-bit floats. By quantizing all of the weights to 4-bit representations, we can fit the LlaMA-2-13b (13 billion parameter) model into the 16GB T4 GPU that is available for free on Google Colab.
name = "meta-llama/Llama-2-13b-chat-hf"
auth_token = userdata.get('hf_token')
tokenizer = AutoTokenizer.from_pretrained(name, use_auth_token=auth_token)
# Create model
model = AutoModelForCausalLM.from_pretrained(name,
    use_auth_token=auth_token,
    rope_scaling={"type": "dynamic", "factor": 2},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
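If you want to confirm that the quantization worked, transformers models expose a memory footprint helper. This optional check should report a figure far below the ~26GB the 13B model would need in 16-bit:

print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")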

Now that the model is ready, we will put it inside the llama-index wrapper. This sets the stage for when we put both the LLM and the document embedding model into the main llama-index wrapper that handles most of the complexity around RAG for us (so many wrappers!). We will use the new chat_template feature in huggingface that I talked about in a previous blog post. One of the fun things about working with newly released open-source software is that it can have bugs that we get to find and help the community fix! You'll see a bit of a hack below, with a link to the issue I filed to notify the huggingface maintainers of a bug in the chat_template logic for the LlaMA-2 model family.

Once the system_prompt is ready (this is what "sets the stage" for the LLM's purpose), we pass it into the SimpleInputPrompt wrapper, which llama-index will use to properly format our queries.

chat = [
  {"role": "system", "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible."},
  {"role": "user", "content": ""}, # need to because of bug https://discuss.huggingface.co/t/issue-with-llama-2-chat-template-and-out-of-date-documentation/61645 we will remove the added [/INST]
]
system_prompt = tokenizer.apply_chat_template(chat, tokenize=False).replace(" [/INST]", "")
query_wrapper_prompt = SimpleInputPrompt("{query_str} [/INST]")
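If you're curious, printing system_prompt at this point should show something like the following, assuming the standard LlaMA-2 chat format (the exact whitespace may differ):

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.
<</SYS>>

The query wrapper then appends each query plus the closing [/INST] tag, completing a single-turn LlaMA-2 prompt.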

llm = HuggingFaceLLM(context_window=4096,
                    max_new_tokens=1024,
                    system_prompt=system_prompt,
                    query_wrapper_prompt=query_wrapper_prompt,
                    model=model,
                    tokenizer=tokenizer)
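Before wiring in any retrieval, you can optionally sanity-check the wrapped model with a direct completion call (no datastore involved yet; the query here is just illustrative):

print(llm.complete("What is retrieval augmented generation?"))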

Now that the LLM is good to go, we'll load up the document embedding model. Langchain and llama-index abstract away all of the complexity of using and querying the embedding model.

Putting it all together, we create a llama-index ServiceContext object, which is the interface that manages passing data between the LLM, the document embedding model, and us. The service splits the documents it receives into chunks of 1024 tokens with an overlap of 20 tokens (the default overlap).

# Wrap the sentence-transformers embedding model for use with llama-index
embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
)

# Create new service context instance
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embeddings
)
# And set the service context
set_global_service_context(service_context)

Almost lastly, we pass in the websites that we want scraped. Once again, a library handles all of the messy complexity for us: the llama_hub BeautifulSoupWebReader contains the web-scraping logic to pull the text and metadata out of the websites we provide. We'll also add a debug handler so that we can print out the final prompt that was created with the injected data; it has to be attached to the service context before the index is built, so we recreate the service context with the callback manager included.

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
urls = [
    'https://natebrake.com',
    'https://www.natebrake.com/blog/2023/11-02-executive-order',
    'https://www.natebrake.com/about',
    'http://www.youtube.com/',
    'https://github.com/',
    'https://www.nytimes.com/',
    'https://news.3m.com/']
documents = loader.load_data(urls=urls)

# Debug logging setup: attach the handler before building the index so
# that the LLM calls made at query time are traced
from llama_index.callbacks import (
    CallbackManager,
    LlamaDebugHandler,
    CBEventType,
)
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager,
    chunk_size=1024,
    llm=llm,
    embed_model=embeddings,
)
set_global_service_context(service_context)

# Build the vector index and create a query engine that retrieves the
# top 3 most similar chunks for each query
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
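One optional extra: if you'd like to reuse the index across sessions instead of re-scraping the websites each time, llama-index can persist it to disk (a sketch using the storage API; the directory name is arbitrary):

index.storage_context.persist(persist_dir="./storage")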

Finally! The most fun part. Try out some queries and see what comes back.

response = query_engine.query("Who is Nate Brake? Be straightforward, this isn't a pump up speech. Talk like an australian and cite your sources")

# Print info on llm inputs/outputs - returns start/end events for each LLM call
event_pairs = llama_debug.get_llm_inputs_outputs()
print("The input prompt to LLM was:")
print(event_pairs[0][0].payload['formatted_prompt'])
print("-----------------------------")
# Print the response
print(f"Response is: \n{response}")

Let's look at the output that comes from this code block.

The input prompt sent to the LLM after the document-embedding search is shown below. This is the exact text that was sent to the model.generate() function of the LlaMA-2 model. llama-index and langchain handle the search of the vector database using the document embeddings and prepend the retrieved data to the beginning of our prompt context. Note that the search returns 3 items. The first is relevant and has the info we need (the about page from my website); the other 2 are irrelevant. They were returned because we specified a top_k of 3, meaning the search returns the 3 closest matches even if only 1 of them is actually relevant.
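As an aside, if you just want to see which chunks were retrieved and how they scored, without reading the raw prompt, the response object also exposes them. A small sketch, assuming the legacy Response API and the URL metadata key that BeautifulSoupWebReader attaches:

for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("URL"))

With that aside out of the way, here is the full prompt.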

Context information is below.


URL: https://www.natebrake.com/about

About - Nathan BrakeNathan BrakeAI BlogTagsAboutAI BlogTagsAboutAboutNathan BrakeMachine Learning Engineer3M | M*ModalmailgithublinkedintwitterHi! My name is Nathan Brake. I'm a Machine Learning Engineer and Researcher at 3M in Pittsburgh, PA. I focus on designing and training deep neural networks for clinical note generation.I graduated from Grove City college in 2015 with a B.S. in Electrical Engineering, and from Georgia Tech in 2022 with an M.S. in Computer Science, with a concentration on Machine Learning & AI.I'm passionate about software engineering and working out how to fit together all the pieces in order to deliver technology that excites our customers.I spend most of my time trying to keep up with the blazing fast pace of innovation in the NLP space.Subscribe to the newsletterEmail addressSign upmailgithublinkedintwitterNathan Brake • © 2023 • Nathan Brake

URL: https://www.nytimes.com/

They remained unopened for nearly 250 years.4 min readThe National Archives

ThanksgivingBest RecipesMake-AheadPopular SidesDesserts We LoveVegetarianPie RecipesFive Classic Thanksgiving Pies With a Delicious TwistMake one or all of these updated classics and brand-new favorites from Claire Saffitz.3 min readOpinionOpinion Columnists and ContributorsWe Watched the Republican Debate. Here’s Who Won, Who Lost and Who Is ‘Unhinged.’Thomas L. FriedmanI Now Understand Why So Much Has Changed in Israel10 min readAmir Levy/Getty ImagesFrank BruniAre We Looking at George H.W. Biden?10 min readPamela PaulIt’s Not Kids With the Cellphone Problem, It’s Parents4 min readCharles M. BlowThe Joe Biden Re-election Dilemma4 min readHank AzariaTo Matthew Perry, God Was a Bunch of Drunks in a Room6 min readNicholas KristofMeet the Champions of Nuance and Empathy We Need5 min readMatter of OpinionAudioVoters Like Democrats. Why Not Biden?37 min listenPeter CoyWhy Voters Are So Down on the Biden Economy5 min readKatherine MillerNikki Haley’s Rise Could Turn the Republican Nomination Race Brutal8 min readJesse WegmanHow Are Students Expected to Live Like This on Campuses?6 min readDavid Wallace-WellsThe Godfather of Climate Science Turns Up the Heat5 min readMichelle CottleAbout That Biden-Trump Poll5 min readDavid FrenchLove Can Win Trump the Nomination. It Will Take Hate to Win Back the White House.7 min readMolly WhiteSam Bankman-Fried and the People Who Gave Up Their Money for Nothing5 min readJessica GroseAre We Finally Ready for a Male Contraceptive?5 min readThe MorningThe election in Ohio provides a useful case study in abortion politics, David Leonhardt writes.7 min readThe DailyAudioThe Supreme Court is testing its own limits on gun control.01:23Listen to the full episodeIn Case You Missed ItTop picks from The Times, recommended for youAdvertisementSKIP ADVERTISEMENTMore NewsGetting Hollywood Back Up and Running Won’t Be EasyWith the industry hustling to rebound after strikes, juggling production schedules and the availability of actors and crew members will be complicated.5 min readFederal Trial of Pelosi Attack Suspect Renews Focus on Political ViolenceDavid DePape faces charges of attempted kidnapping and assault after the 2022 attack against Paul Pelosi.4 min readTracy Chapman’s ‘Fast Car,’ From 1988, Wins a Song of the Year Award She is the first Black songwriter to receive the honor from the Country Music Awards. Her old hit reached new fans as a cover by Luke Combs.2 min readHenry Aldridge & SonA First-Class Dinner Menu From the Titanic Could Fetch Thousands at AuctionThere were oysters, salmon with Hollandaise sauce, beef, squab, duck, roast chicken, green peas, parsnip purée and Victoria pudding.4 min readChina’s Prices Fall Again, Renewing Fears of Deflation3 min read3 Charged With Running Prostitution Service Used by Politicians and Others2 min readIn N.Y.P.D.’s Harsh Tactics Under Adams, Critics See a Broken Promise7 min readU.S. Strikes Iran-Linked Facility in Syria in Round of Retaliation4 min readA Dedicated Taylor Swift Reporter Faces Swift Criticism Online5 min readDon’t Bedazzle Your Steering Wheel, U.S.

URL: https://www.nytimes.com/

The New York Times - Breaking News, US News, World News and Videos

Skip to contentSkip to site indexSKIP ADVERTISEMENTSearch & Section NavigationSection NavigationSEARCHU.S.InternationalCanadaEspañol 中文 Today’s PaperU.S.SectionsU.S.PoliticsNew YorkCaliforniaEducationHealthObituariesScienceClimateSportsBusinessTechThe UpshotThe MagazineU.S. Politics2024 ElectionsSupreme CourtCongressBiden AdministrationNewslettersThe MorningMake sense of the day’s news and ideas.The UpshotAnalysis that explains politics, policy and everyday life.See all newslettersPodcastsThe DailyThe biggest stories of our time, in 20 minutes a day.The Run-UpOn the campaign trail with Astead Herndon.See all podcastsWorldSectionsWorldAfricaAmericasAsiaAustraliaCanadaEuropeMiddle EastScienceClimateHealthObituariesNewslettersMorning Briefing: EuropeGet what you need to know to start your day.The InterpreterOriginal analysis on the week’s biggest global stories.Australia LetterNews, features and opinion for readers in the region.Canada LetterBackstories and analysis from our Canadian correspondents.See all newslettersBusinessSectionsBusinessTechEconomyMediaFinance and MarketsDealBookPersonal TechEnergy TransitionYour MoneyNewslettersDealBookThe most crucial business and policy news you need to know.See all newslettersPodcastsHard ForkOur tech journalists help you make sense of the rapidly changing tech world.See all podcastsArtsSectionsArtsBooksBest SellersDanceMoviesMusicTelevisionTheaterPop CultureT MagazineVisual ArtsRecommendationsCritic’s PicksWhat to ReadWhat to WatchWhat to Listen To5 Minutes to Make You Love MusicNewslettersRead Like the WindBook recommendations from our critics.WatchingStreaming TV and movie recommendations.See all newslettersPodcastsBook ReviewThe podcast that takes you inside the literary world.PopcastPop music news, new songs and albums, and artists of note.See all podcastsLifestyleSectionsLifestyleHealthWellFoodLoveTravelStyleFashionT MagazineYour MoneyPersonal TechReal EstateColumnsModern LoveThe HuntSocial Q’sThe EthicistWellEatMoveMindFamilyLiveAsk WellNewslettersOpen ThreadThe latest news on what we wear, by our chief fashion critic.Love LetterReal stories of relationship highs, lows and woes.See all newslettersPodcastsModern LoveThe complicated love lives of real people.See all podcastsOpinionSectionsOpinionGuest EssaysEditorialsOp-DocsVideosLettersTopicsPoliticsWorldBusinessTechClimateHealthCultureColumnistsCharles M. BlowJamelle BouieDavid BrooksGail CollinsRoss DouthatMaureen DowdDavid FrenchThomas L. FriedmanMichelle GoldbergEzra KleinNicholas KristofPaul KrugmanCarlos LozadaFarhad ManjooTressie McMillan CottomPamela PaulLydia PolgreenBret StephensZeynep TufekciPodcastsMatter of OpinionThoughts, aloud. With Michelle Cottle, Ross Douthat, Carlos Lozada and Lydia Polgreen.The Ezra Klein ShowDiscussions of ideas that matter, plus book recommendations.See all podcastsAudioAudioPodcasts and narrated articles covering news, tech, culture and more.Download the Audio app on iOS.ListenThe HeadlinesThe DailyHard ForkThe Ezra Klein ShowMatter of OpinionSerial ProductionsThe Book Review PodcastModern LoveThe Run-UpPopcastReporter ReadsThe Sunday ReadSee all audioFeaturedThe HeadlinesYour morning listen. Top stories, in 10 minutes.The Kids of Rutherford CountyA series about how one county illegally jailed children.Reporter ReadsRecent articles read by the reporters behind them.NewslettersAudioOur editors share their favorite listens from the New York Times Audio app.See all newslettersAudio is included in an All Access subscription. 
Learn more.GamesGamesWord games, logic puzzles and crosswords, including an extensive archive.PlaySpelling BeeThe Mini CrosswordWordleThe CrosswordVertexConnectionsSudokuLetter BoxedTilesCommunitySpelling Bee ForumWordplay ColumnWordle ReviewSubmit a CrosswordMeet Our Crossword ConstructorsMini to MaestroWordlebotNewslettersGameplayPuzzles, brain teasers, solving tips and more.See all newslettersGames is included in an All Access subscription.

Given the context information and not prior knowledge, answer the query. Query: Who is Nate Brake? Be straightforward, this isn't a pump up speech. Talk like an australian and cite your sources Answer:


With that final prompt provided to the LlaMA-2 model, this is the response it gives:

Response is:

G'day mate! I'm here to help answer yer query about who this Nate Brake bloke is. Now, I've done me research and found some top-notch sources that'll give ye the lowdown on this fella.

First off, Nate Brake is a machine learning engineer and researcher at 3M in Pittsburgh, Pennsylvania. He's got a swag of qualifications under his belt, including a B.S. in Electrical Engineering from Grove City College and an M.S. in Computer Science from Georgia Tech, with a concentration on machine learning and AI. (Source: https://www.natebrake.com/about)

Now, Nate's a bit of a whiz when it comes to designing and training deep neural networks for clinical note generation. He's passionate about software engineering and figuring out how to make all the pieces fit together to deliver technology that excites customers. (Source: https://www.natebrake.com/about)

Oh, and did I mention he's also a bit of a wordsmith? He's got a nifty blog where he shares his thoughts on machine learning, AI, and the like. It's worth checkin' out if ye want to learn more about the inner workings of these technologies. (Source: https://www.natebrake.com/blog/)

There ye have it, mate! That's the scoop on Nate Brake. Hope that helps ye out. Cheers! (Source: https://www.natebrake.com/about)

There we have it! You can add additional webpages to give LlaMA-2 more information to draw from when it answers questions.