How to Organize Data for RAG

Full Guide to Organizing and Structuring your Documents for your Knowledge Base

Mar 19, 2024

Next, we will dive deeply into how to organize a Knowledge Base. For this, I created a video presentation along with the experiments, and below are the main concepts and how to apply them.

Hierarchical Structures

What are information hierarchies?

Information hierarchies are a way to organize information so that it sets a clear context and conveys how entities relate to each other and to the overall topic. A hierarchy is a clear signal of which information is most important, and it helps clarify the proper context.

Topical Map

The goal is to create a topical map that includes all of the relevant entities within the proper contextual hierarchies so that the LLM can quickly identify the context.

In essence, we are creating content that is easily identifiable by the LLM. It is like creating a mirror image of what the LLM already knows about your main topic and then filling in the details from your own data.

Let’s look at our example:

Central Entity: This is the main thing that our content is about.
- Conference
Source Context: Going to a Conference. Going implies time and place. In fact, you can’t go to a conference that is outside of time or space, so these are required entities that must be included in the content.
- conferences/time
- conferences/location
Root, Seed, Node: This is the hierarchy in which things happen related to our Central Entity and Source Context. Here, you get to decide how to order things based on your preferences. For example, if this were an event that only happened in NYC, then location would be more important than the year.
- root/seed/node = root/topic/subtopic = conferences/year/location
Other Topics & Subtopics: To fill out your Topical Map, you will need to include all of the other relevant topics and subtopics. Here are a few examples:
- Topics
  - What is the main theme of the conference?
  - What are the main topics?
- Speakers
  - Who is speaking?
  - What are they speaking about?
  - Why them?
- Agenda
  - What is the agenda?
  - Time and Place?
- Attendees
  - Who attends?
  - Why attend?
- Sponsors

A great topical map gives the LLM a lot of details to go on, which makes it easier to answer detailed questions about the event.
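The topical map above can be sketched as a small data structure. This is only an illustration: the nested-dict layout and the `flatten` helper are my assumptions, not a required format. The point is that every question ends up attached to a path that states its context.

```python
# Hypothetical topical map for the conference example. The entries mirror the
# list above; the nested-dict layout itself is an assumption for illustration.
TOPICAL_MAP = {
    "conferences": {
        "topics": [
            "What is the main theme of the conference?",
            "What are the main topics?",
        ],
        "speakers": [
            "Who is speaking?",
            "What are they speaking about?",
            "Why them?",
        ],
        "agenda": ["What is the agenda?", "Time and place?"],
        "attendees": ["Who attends?", "Why attend?"],
        "sponsors": [],
    }
}


def flatten(node, prefix=""):
    """Yield (path, question) pairs; a topic with no questions yields (path, None)."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from flatten(child, f"{prefix}/{name}" if prefix else name)
    elif node:
        for question in node:
            yield prefix, question
    else:
        yield prefix, None


PATHS = list(flatten(TOPICAL_MAP))
for path, question in PATHS:
    print(path, "->", question)
```

Each printed line pairs a question with its place in the hierarchy, for example `conferences/speakers` with "Who is speaking?".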

Now, how do we actually organize it?

Organize Knowledge in Folders

Hierarchies can be organized into folders, sub-folders, etc. This is a parent-child type of relationship.
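As a minimal sketch of the parent-child idea, the snippet below creates conferences/year/city folders in a temporary directory. The years and cities other than 2019 and NYC are made up for illustration, and `build_folders` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def build_folders(base, years_to_cities):
    """Create conferences/<year>/<city> folders under base and return them."""
    created = []
    for year, cities in years_to_cities.items():
        for city in cities:
            folder = Path(base) / "conferences" / str(year) / city
            # parents=True creates the parent folders (root and seed) as needed.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Hypothetical events, used only to show the folder layout.
with tempfile.TemporaryDirectory() as tmp:
    folders = build_folders(tmp, {2019: ["nyc"], 2020: ["nyc", "sf"]})
    relative = [f.relative_to(tmp).as_posix() for f in folders]

print(relative)
```

Each city folder is a child of a year folder, which is a child of the conferences folder, so the path alone tells you where a document sits in the hierarchy.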

Taxonomy 101: Definition, Best Practices, and How It Complements Other IA Work

Organize Knowledge via URLs

Well-structured websites have a proper URL hierarchy that conveys the correct meaning, context, and relationship between entities.

You can use URLs to showcase these relationships. Consider the difference in meaning between:

ChatbotConferences.com/conferences/2019/nyc: This URL implies that there are multiple events in multiple cities.
ChatbotConferences.com/new-york-city: This URL only shows the city, so all we can infer is that something is happening in NYC.
ChatbotConferences.com/nyc/2019: This URL conveys that there are multiple events in NYC.
ChatbotConferences.com/blog/104335923032: This means nothing…
ChatbotConferences.com/Chatbot-Conf-NYC-2019: This URL refers to a single event. We don’t know if it is part of a category or if it is related to any other events.

Now, which example do you think is best suited for our knowledge base?

The first example is the strongest. It has a ROOT of conferences, which is the main topic of the website. The second level of the hierarchy, the SEED, is the year, and the final NODE is the city. Organizing by year is much easier than by city, which is why year sits at a higher level in the hierarchy.
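To make the ROOT, SEED, and NODE levels concrete, here is a small sketch that labels the path segments of a URL. The `split_hierarchy` function is a hypothetical helper, and mapping the first three segments to root, seed, and node is my reading of the example URLs, not a standard.

```python
from urllib.parse import urlparse


def split_hierarchy(url):
    """Return the URL path segments labelled as root, seed, and node."""
    # urlparse needs a scheme to separate the host from the path.
    if "://" not in url:
        url = "https://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    # zip stops at the shorter input, so flat URLs only get a root.
    return dict(zip(["root", "seed", "node"], segments))


GOOD = split_hierarchy("ChatbotConferences.com/conferences/2019/nyc")
FLAT = split_hierarchy("ChatbotConferences.com/Chatbot-Conf-NYC-2019")

print(GOOD)
print(FLAT)
```

The first URL yields all three levels, while the single-slug URL yields only one segment, so nothing about its category or related events can be read from the path.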

Test

The Hierarchy Test checks how much difference these ideas make with LLMs. I built two chatbots for this test. Both have the same information; the main difference is the hierarchical organization.

1) Good Bot: The bot is trained on the following pages:

2) Bad Bot: The bot is trained on a single page:

https://www.chatbotconference.com/knowledge-bases/all-agendas

Want to try it yourself? Go to the Test and chat with the bots.

Documents: Context & Semantics

Does the way information is organized within a document matter?

Semantics helps LLMs identify the main context of the document and the relationship of entities with each other. These relationships inform the LLM on the proper way to connect concepts and the topic, which later helps give rise to meaning.

Macro & Micro Semantics

The overall context of a document is set by Macro and Micro Semantics.

The Macro Context is the overarching topic of the document and is organized by the H1, H2, H3, and H4 tags.

These tags are in order of importance and set up the Macro Context. The H1 tag is the title of the document and represents its overall macro context.

The H2 tags are the main topics within the document and support the overall thesis. The H3 tags are sub-topics of the H2 tags.
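One way to see the macro context of a page is to pull out only its heading tags. The sketch below does this with Python's standard `html.parser`; the `OutlineParser` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (level, text) for every h1-h4 heading in an HTML document."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # heading level currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4"):
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    return parser.outline


# Hypothetical page content, used only to show the output shape.
SAMPLE = (
    "<h1>Chatbot Conference</h1><p>Intro</p>"
    "<h2>Speakers</h2><h3>Who is speaking?</h3>"
)
OUTLINE = outline(SAMPLE)
for level, text in OUTLINE:
    print("  " * (level - 1) + text)
```

If the indented outline reads like a sensible table of contents, the macro context is probably clear. If it is empty or flat, the document is relying on body text alone to signal its structure.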

Consider the difference between:

1. Macro Context: H1, H2, H3, H4

2. Micro Context: Definitions, questions, phrases, and word order within each heading.

Well-written documents are generally well organized and easy to follow, read, and understand.

They typically have a hierarchical structure, which allows the reader to go deeper into a topic.

Topics are laid out in a logical and coherent manner. Topics often have sub-topics and supporting information. Well-written documents are able to answer our questions.

Contextual Layers

Consider the questions below.

We broke each question down into its elements so you can gain insight into how an LLM reads it. If "children" is well defined, or the document has contextual layers, then an LLM can answer detailed questions like these.

What are the best bikes [knowledge domain] for [functional word] short boys [contextual domain]?
What are the most useful diets [knowledge domain] for [functional word] children with insomnia [contextual domain] for kids under six [contextual layers]?

What is the overall structure of a good document?

A good document should have the following structure:

Macro Context: H1 Title Tag
10% Summary: Extractive & Abstractive Summary
60% Main Topics: H2 Tags & Micro Context: Definitions, paragraphs, etc
30% Supplementary Context: H2 & H3 Subtopics, related topics, synonyms, antonyms, etc
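As a rough worked example of the 10/60/30 split, the sketch below turns a target length into a word budget per section. Measuring the percentages in words is my assumption (the split could also be read as a share of headings or paragraphs), and `word_budget` is a hypothetical helper.

```python
def word_budget(total_words):
    """Split a target word count into the 10/60/30 sections described above."""
    # Integer arithmetic avoids floating-point rounding surprises.
    summary = total_words * 10 // 100
    main_topics = total_words * 60 // 100
    # The remainder goes to supplementary context so the parts sum to the total.
    supplementary = total_words - summary - main_topics
    return {
        "summary": summary,
        "main_topics": main_topics,
        "supplementary": supplementary,
    }


BUDGET = word_budget(1500)
print(BUDGET)
```

For a 1,500-word article this gives 150 words of summary, 900 words of main topics, and 450 words of supplementary context.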

Test

We will test two articles that have the same information. Article 1 will have all of the attributes shared above. Article 2 will be missing most of them. Each article will be used to train a bot, and we can all play around with the differences.

1) Good Bot 1: The doc has the following:

H1 Title Tag
Summary
H2 & H3
Definitions
Supplementary Content

2) Bad Bot 2: The doc is designed to mirror a poorly structured article:

All tags converted to Paragraph Format
No summary
No supplementary content
No definitions or questions
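If you want to build a similar "bad" version of your own document, one possible way to convert all heading tags to paragraph format is a simple regular expression. This is a sketch for reasonably clean HTML and not necessarily how the test documents were prepared; `flatten_headings` is a hypothetical helper.

```python
import re

# Matches opening and closing h1-h6 tags, with or without attributes.
HEADING_TAG = re.compile(r"<(/?)h[1-6]\b[^>]*>", re.IGNORECASE)


def flatten_headings(html):
    """Replace every heading open/close tag with a plain <p> or </p>."""
    return HEADING_TAG.sub(lambda m: f"<{m.group(1)}p>", html)


# Hypothetical snippet, used only to show the transformation.
FLATTENED = flatten_headings("<h1>Agenda</h1><h2 class='t'>Day 1</h2><p>Talks</p>")
print(FLATTENED)
```

The text stays the same, but the heading levels that carried the macro context are gone, which is the difference the test is meant to isolate.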

Join the Journey

Join me as I journey and navigate through these waters.

Over the next few weeks and months, I will share my insights on AI, philosophy, and the results of my experiments in implementing the best versions of these technologies.

Stefan Speaks
