
Build a RAG pipeline with GitHub, Pinecone and OpenAI

Project
Artificial Intelligence
Intermediate

This is a 'Project' template, which means it contains a group of workflows that work together to achieve a particular aim.

Overview

By deploying this template you can quickly build a RAG (retrieval-augmented generation) pipeline which crawls markdown documentation from a GitHub repository.

This can act as the foundation for AI knowledge agents and AI-powered integrations.

It is the same method we use to build our own docs knowledge agent.

Please see our documentation on Building knowledge agents for a step-by-step explanation of how a RAG pipeline works in Tray.
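For orientation, the crawl performed by the '1 - pull markdown files from GitHub' workflow is conceptually equivalent to walking the repository's contents via the GitHub REST API. A minimal Python sketch, where the owner, repo and folder names are placeholders:

```python
import requests

GITHUB_API = "https://api.github.com/repos/{owner}/{repo}/contents/{path}"

def crawl_markdown(owner: str, repo: str, path: str, token: str) -> list[dict]:
    """Recursively collect markdown files under `path` in a GitHub repo."""
    headers = {"Authorization": f"Bearer {token}"}
    resp = requests.get(
        GITHUB_API.format(owner=owner, repo=repo, path=path), headers=headers
    )
    resp.raise_for_status()
    pages = []
    for item in resp.json():
        if item["type"] == "dir":
            # Descend into each subfolder of the documentation folder.
            pages.extend(crawl_markdown(owner, repo, item["path"], token))
        elif item["name"].endswith(".md"):
            raw = requests.get(item["download_url"], headers=headers)
            pages.append({"path": item["path"], "content": raw.text})
    return pages

# Hypothetical repo layout matching the prerequisites below:
# docs = crawl_markdown("your-org", "your-docs-repo", "documentation", GITHUB_TOKEN)
```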

Prerequisites

To deploy this template you will need to have:

  • A GitHub repository containing a source 'documentation' folder, with subfolders holding markdown documentation files

  • An OpenAI account

  • A Pinecone vector database account

Getting Live

After deploying the template, you will need to take the following steps:

1 - Add authentications

  • Create a new GitHub authentication (or use an existing one) for the relevant steps in the '1 - pull markdown files from GitHub' workflow

  • Create a new Google Sheets authentication (or use an existing one) for the relevant step in the '3 - create embeddings' workflow

  • Create a new OpenAI authentication (or use an existing one) for the relevant step in the '3 - create embeddings' workflow

  • Create a new Pinecone authentication (or use an existing one) for the relevant step in the '4 - add to Pinecone' workflow

2 - Set project config

To make this template work, you will need to edit the project config to match your GitHub and Pinecone setup:
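The exact config slots depend on your deployment, but they cover values along these lines (every key and value below is a hypothetical placeholder, expressed as a Python dict purely for illustration):

```python
# Hypothetical project config values, replace with your own setup.
PROJECT_CONFIG = {
    "github_owner": "your-org",                   # owner of the docs repository
    "github_repo": "your-docs-repo",              # repository containing the docs
    "docs_folder": "documentation",               # source folder of markdown files
    "pinecone_index": "docs-knowledge",           # target Pinecone index name
    "embedding_model": "text-embedding-3-small",  # OpenAI embedding model used
}
```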

3 - Edit metadata creation steps

In the 'Create page object' step we have made some assumptions about the metadata you want to capture and add to Pinecone, and we have used a JSONata script to construct a public page URL based on the GitHub folder structure.

You may need to edit this according to your needs.

This metadata structure is then reflected in the 'Vectors + metadata object' step:
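As a rough Python sketch of what these two steps produce (the field names, base URL, and URL construction below are assumptions modeled on a typical docs folder layout; the template's actual JSONata script may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_page_object(path: str, content: str) -> dict:
    """Rough analogue of the 'Create page object' step.

    Derives a public page URL from the GitHub folder structure, e.g.
    'documentation/connectors/github.md' -> '.../connectors/github'.
    The base URL and slug handling here are hypothetical.
    """
    slug = path.removeprefix("documentation/").removesuffix(".md")
    return {
        "title": slug.split("/")[-1].replace("-", " ").title(),
        "url": f"https://docs.example.com/{slug}",
        "path": path,
        "content": content,
    }

def make_vectors(page: dict, chunks: list[str]) -> list[dict]:
    """Rough analogue of the 'Vectors + metadata object' step:
    one Pinecone record per chunk, each carrying the page metadata."""
    embeddings = client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    )
    return [
        {
            "id": f"{page['path']}#{i}",
            "values": e.embedding,
            "metadata": {"title": page["title"], "url": page["url"], "text": chunk},
        }
        for i, (chunk, e) in enumerate(zip(chunks, embeddings.data))
    ]
```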

4 - Edit chunking script

The 'chunk page' step splits each page into chunks at '##' (h2) headers, so that retrieved chunks are a manageable size for the LLM to process.

You can adjust this script according to your needs:
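A minimal Python equivalent of that splitting logic, for illustration (the template's actual step is a script inside the workflow, so treat this as a sketch):

```python
def chunk_page(markdown: str) -> list[str]:
    """Split a markdown page into chunks at '##' (h2) headers.

    Each chunk keeps its header line so the LLM sees the section context.
    """
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk at each h2 header ('## ...' but not '###').
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```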

Next steps

Chat completions

To make use of the Pinecone vector DB created by this pipeline, you will need to create a 'Chat completion' workflow which acts as an endpoint for your application interface.

This workflow should do the following (a minimal sketch is shown after the list):

  1. Accept a user query

  2. Create a vector from that query

  3. Find the closest matching chunk vectors in your Pinecone DB

  4. Ask an LLM to formulate a response to the query based on the matched chunks

  5. Return that response to the user
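A minimal Python sketch of such an endpoint, assuming the official OpenAI and Pinecone clients and a 'text' metadata field holding each chunk (the model names and index name are placeholders):

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key="YOUR_API_KEY").Index("docs-knowledge")  # hypothetical index

def answer(query: str) -> str:
    # Steps 1-2: accept the user query and create a vector from it.
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding

    # Step 3: find the closest matching chunk vectors in Pinecone.
    results = index.query(vector=vector, top_k=5, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)

    # Step 4: ask an LLM to formulate a response from the matched chunks.
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    # Step 5: return that response to the user.
    return completion.choices[0].message.content
```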

Updating the vector DB

When new or updated content is added to your docs repo, you will need a process to update the vector DB.

One simplistic approach is to periodically 'nuke' the DB and run the crawl pipeline again.

More likely, you will want to build a system which is triggered by updates to the repo.

In this case, it is recommended that you (see the sketch after this list):

  1. Attach a unique identifier to each markdown file in GitHub

  2. Store this identifier as metadata for each chunk in Pinecone

  3. When a page is updated, delete all Pinecone entries that have that page's identifier

  4. Re-embed and store the chunks for the updated page
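A sketch of steps 3 and 4 in Python, assuming each chunk was stored with a 'page_id' metadata field as described above. Note that delete-by-metadata-filter is supported on pod-based Pinecone indexes but not on serverless ones, where you would instead list and delete record IDs by a shared prefix:

```python
from pinecone import Pinecone

index = Pinecone(api_key="YOUR_API_KEY").Index("docs-knowledge")  # hypothetical index

def refresh_page(page_id: str, new_vectors: list[dict]) -> None:
    # Step 3: delete every chunk carrying the updated page's identifier.
    # (On serverless indexes, delete by ID prefix instead of a filter.)
    index.delete(filter={"page_id": {"$eq": page_id}})
    # Step 4: upsert the freshly embedded chunks for the updated page.
    index.upsert(vectors=new_vectors)
```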