How to Build Smart Documentation - OpenAI Embeddings (Chunking, Indexing, and Searching)

Hey everyone, I wanted to share my approach to creating a "smart documentation" chatbot for a project I'm working on. I'm not an AI expert, so any suggestions or improvements are more than welcome!

The purpose of this post is not to create yet another tutorial on building a chatbot based on OpenAI. There is already plenty of content on that topic. Instead, the main idea is to index documentation by splitting it into manageable chunks, generating embeddings with OpenAI, and performing a similarity search to find and return the most relevant information for a user's query.

In my case, the documentation will be Markdown files, but it could be any kind of text, database object, etc.

Why?

Because it can sometimes be hard to find the information you need, I wanted to create a chatbot that could answer questions about a specific topic and provide the relevant context from the documentation.

This assistant can be used in a variety of ways, for example:

Providing quick answers to frequently asked questions
Searching a doc / page, as Algolia does
Helping users find the information they need in a specific doc
Capturing users' concerns / questions by storing the questions they ask

Summary

Below, I'll outline the three main parts of my solution:

Reading the documentation files
Indexing the documentation (chunking, overlap, and embedding)
Searching the documentation (and hooking it up to a chatbot)

File tree

.
├── docs
│   └── ...md
├── src
│   ├── askDocQuestion.ts
│   └── index.ts # Express.js application endpoint
├── embeddings.json # Storage for embeddings
└── package.json

1. Reading the documentation files

Instead of hardcoding the documentation text, you can scan a folder for .md files using a tool like glob.

// Example snippet of fetching files from a folder:
import fs from "node:fs";
import path from "node:path";
import glob from "glob";

const DOC_FOLDER_PATH = "./docs";

type FileData = {
  path: string;
  content: string;
};

const readAllMarkdownFiles = (): FileData[] => {
  const filesContent: FileData[] = [];
  const filePaths = glob.sync(`${DOC_FOLDER_PATH}/**/*.md`);

  filePaths.forEach((filePath) => {
    const content = fs.readFileSync(filePath, "utf8");
    filesContent.push({ path: filePath, content });
  });

  return filesContent;
};
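
If you would rather not depend on glob, the folder can also be walked with Node's built-in modules. The sketch below is an alternative, not the code used in this project: `readMarkdownFilesFrom` is a name made up for illustration, and it assumes every file ending in .md under the folder should be indexed.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type FileData = {
  path: string;
  content: string;
};

// Hypothetical helper: reads every .md file under `folder`, including subfolders
const readMarkdownFilesFrom = (folder: string): FileData[] => {
  const files: FileData[] = [];

  for (const entry of fs.readdirSync(folder, { withFileTypes: true })) {
    const entryPath = path.join(folder, entry.name);

    if (entry.isDirectory()) {
      files.push(...readMarkdownFilesFrom(entryPath)); // Recurse into subfolders
    } else if (entry.isFile() && entry.name.endsWith(".md")) {
      files.push({
        path: entryPath,
        content: fs.readFileSync(entryPath, "utf8"),
      });
    }
  }

  return files;
};
```

Calling `readMarkdownFilesFrom("./docs")` would return the same `FileData` shape as the glob version, so the rest of the pipeline does not need to change.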

Alternatively, you can of course fetch your documentation from your database, a CMS, etc.

2. Indexing the documentation

To build our search engine, we will use OpenAI's vector embeddings API to generate our embeddings.

Vector embeddings are a way to represent data in a numerical format, which can then be used for similarity searches (in our case, between the user's question and the sections of our documentation).

Such a vector is a list of floating-point numbers, and similarity between two vectors is calculated with a mathematical formula.

[
  -0.0002630692, -0.029749284, 0.010225477, -0.009224428, -0.0065269712,
  -0.002665544, 0.003214777, 0.04235309, -0.033162255, -0.00080789323,
  //...+1533 elements
];

Vector databases were built on this concept. As a result, instead of using the OpenAI API, it's also possible to use a vector database such as Chroma, Qdrant or Pinecone.

2.1 Chunk each file & overlap

Large blocks of text can exceed the model's context limits or produce less relevant hits, so it's recommended to split them into chunks to make the search more targeted. However, to preserve some continuity between chunks, we overlap them by a certain number of tokens (or characters).

Example of chunking

In this example, we have a long text that we want to split into smaller chunks. Here, we want to create chunks of 150 characters that overlap by 50 characters, so each chunk starts 100 characters after the previous one.

Full Text (406 characters):

In the heart of the bustling city, there stood an old library that many had forgotten. Its towering shelves were filled with books from every imaginable genre, each whispering stories of adventures, mysteries, and timeless wisdom. Every evening, a dedicated librarian would open its doors, welcoming curious minds eager to explore the vast knowledge within. Children would gather for storytelling sessions.

Chunk 1 (Characters 1-150):

In the heart of the bustling city, there stood an old library that many had forgotten. Its towering shelves were filled with books from every imaginabl.

Chunk 2 (Characters 101-250):

shelves were filled with books from every imaginable genre, each whispering stories of adventures, mysteries, and timeless wisdom. Every evening, a d

Chunk 3 (Characters 201-350):

ysteries, and timeless wisdom. Every evening, a dedicated librarian would open its doors, welcoming curious minds eager to explore the vast knowledge

Chunk 4 (Characters 301-406):

curious minds eager to explore the vast knowledge within. Children would gather for storytelling sessions.
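
To make the arithmetic behind the chunk headings concrete, here is a minimal sketch of fixed-size character chunking. It assumes 150-character windows with a 50-character overlap, which is what the headings above describe. `chunkRanges` is a hypothetical helper that only reports offsets; it uses 0-based starts with an exclusive end, so the range [0, 150] here corresponds to "Characters 1-150".

```typescript
// Hypothetical helper: returns [start, end] offsets (end exclusive) for fixed-size chunks.
// `overlap` must be smaller than `size`, otherwise the window would never advance.
const chunkRanges = (
  textLength: number,
  size: number,
  overlap: number
): [number, number][] => {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Expected size > 0 and 0 <= overlap < size");
  }

  const ranges: [number, number][] = [];
  const stride = size - overlap; // Distance between two consecutive chunk starts

  for (let start = 0; start < textLength; start += stride) {
    const end = Math.min(start + size, textLength);
    ranges.push([start, end]);
    if (end === textLength) break; // The last chunk reached the end of the text
  }

  return ranges;
};

// 406 characters, 150 per chunk, 50 shared with the previous chunk
console.log(chunkRanges(406, 150, 50));
// ranges: [0, 150], [100, 250], [200, 350], [300, 406]
```

The last range is shorter than the others because the text ends before a full 150-character window is available.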

Code snippet

const CHARS_PER_TOKEN = 4.15; // Rough, pessimistic estimate of the number of characters per token; use `tiktoken` or another tokenizer for a more precise count

const MAX_TOKENS = 500; // Maximum number of tokens per chunk
const OVERLAP_TOKENS = 100; // Number of tokens to overlap between chunks

const maxChar = MAX_TOKENS * CHARS_PER_TOKEN;
const overlapChar = OVERLAP_TOKENS * CHARS_PER_TOKEN;

const chunkText = (text: string): string[] => {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + maxChar, text.length);

    // Don’t cut a word in half if possible:
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }

    chunks.push(text.substring(start, end));
    // Overlap management
    const nextStart = end - overlapChar;
    start = nextStart <= start ? end : nextStart;
  }

  return chunks;
};

To learn more about chunking and the impact of chunk size on embeddings, you can check out this article.

2.2 Embedding generation

Once a file is chunked, we generate a vector embedding for each chunk using OpenAI's API (e.g., text-embedding-3-large).

import { OpenAI } from "openai";

const EMBEDDING_MODEL: OpenAI.Embeddings.EmbeddingModel =
  "text-embedding-3-large"; // Model to use for embedding generation

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY }); // Assumes the key is provided through the OPENAI_API_KEY environment variable

const generateEmbedding = async (textChunk: string): Promise<number[]> => {
  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: textChunk,
  });

  return response.data[0].embedding; // Return the generated embedding
};
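
Calling the API once per chunk gets slow when there are many chunks. The embeddings endpoint also accepts an array of inputs, so several chunks can be sent in one request. Below is a minimal sketch of that idea, not the code used in this project. `EmbeddingsClient` is a hand-written type describing only the part of the client the sketch needs (the `openai` instance above should satisfy it), while `generateEmbeddings` and the batch size of 100 are assumptions made for illustration.

```typescript
// Minimal shape of the client used by this sketch (hand-written, not the SDK's own type)
type EmbeddingsClient = {
  embeddings: {
    create: (args: { model: string; input: string[] }) => Promise<{
      data: { index: number; embedding: number[] }[];
    }>;
  };
};

const BATCH_SIZE = 100; // Assumption: number of chunks sent per request

// Hypothetical helper: embeds many chunks with one request per batch.
// Returns the embeddings in the same order as `textChunks`.
const generateEmbeddings = async (
  client: EmbeddingsClient,
  model: string,
  textChunks: string[]
): Promise<number[][]> => {
  const embeddings: number[][] = [];

  for (let start = 0; start < textChunks.length; start += BATCH_SIZE) {
    const batch = textChunks.slice(start, start + BATCH_SIZE);
    const response = await client.embeddings.create({ model, input: batch });

    // Each result carries the index of its input, so sort before collecting
    const ordered = [...response.data].sort((a, b) => a.index - b.index);
    embeddings.push(...ordered.map((item) => item.embedding));
  }

  return embeddings;
};
```

Passing the client as a parameter also makes the function easy to try out with a fake client, without spending API credits.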

2.3 Generating and saving embeddings for all files

To avoid regenerating the embeddings every time, we will store them. They could be stored in a database, but in this case we will simply store them locally in a JSON file.

The following code simply:

iterates over each document,
splits the document into chunks,
generates an embedding for each chunk,
stores the embeddings in a JSON file,
fills the vectorStore with the embeddings to be used during the search.

import embeddingsList from "../embeddings.json";

/**
 * Simple in-memory vector store to hold document embeddings and their content.
 * Each entry contains:
 * - filePath: A unique key identifying the document
 * - chunkNumber: The number of the chunk within the document
 * - content: The actual text content of the chunk
 * - embedding: The numerical embedding vector for the chunk
 */
const vectorStore: {
  filePath: string;
  chunkNumber: number;
  content: string;
  embedding: number[];
}[] = [];

/**
 * Indexes all Markdown documents by generating embeddings for each chunk and storing them in memory.
 * Also updates the embeddings.json file if new embeddings are generated.
 */
export const indexMarkdownFiles = async (): Promise<void> => {
  // Retrieve documentations
  const docs = readAllMarkdownFiles();

  let newEmbeddings: Record<string, number[]> = {};

  for (const doc of docs) {
    // Split the document into overlapping chunks of limited size
    const fileChunks = chunkText(doc.content);

    // Iterate over each chunk within the current file
    const chunksNumber = fileChunks.length;

    for (let chunkIndex = 0; chunkIndex < fileChunks.length; chunkIndex++) {
      const chunkNumber = chunkIndex + 1; // Chunk number starts at 1
      const chunk = fileChunks[chunkIndex];

      const embeddingKeyName = `${doc.path}/chunk_${chunkNumber}`; // Unique key for the chunk

      // Retrieve precomputed embedding if available
      const existingEmbedding = embeddingsList[
        embeddingKeyName as keyof typeof embeddingsList
      ] as number[] | undefined;

      let embedding = existingEmbedding; // Use existing embedding if available

      if (!embedding) {
        embedding = await generateEmbedding(chunk); // Generate embedding if not present
      }

      newEmbeddings = { ...newEmbeddings, [embeddingKeyName]: embedding };

      // Store the embedding and content in the in-memory vector store
      vectorStore.push({
        filePath: doc.path,
        chunkNumber,
        embedding,
        content: chunk,
      });

      console.info(`- Indexed: ${embeddingKeyName}/${chunksNumber}`);
    }
  }

  /**
   * Compare the newly generated embeddings with existing ones
   *
   * If there is change, update the embeddings.json file
   */
  try {
    if (JSON.stringify(newEmbeddings) !== JSON.stringify(embeddingsList)) {
      fs.writeFileSync(
        "./embeddings.json",
        JSON.stringify(newEmbeddings, null, 2)
      );
    }
  } catch (error) {
    console.error(error);
  }
};
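
One limitation of the key used above is that it only depends on the file path and the chunk number, so an edited document keeps its old embedding until embeddings.json is deleted. A possible refinement, shown here as a sketch rather than as part of the original code, is to include a hash of the chunk content in the key. `embeddingKey` is a hypothetical helper; it relies only on Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: builds a cache key that changes whenever the chunk text changes
const embeddingKey = (
  filePath: string,
  chunkNumber: number,
  content: string
): string => {
  const contentHash = createHash("sha256")
    .update(content, "utf8")
    .digest("hex")
    .slice(0, 16); // A short prefix is enough to detect edits

  return `${filePath}/chunk_${chunkNumber}/${contentHash}`;
};
```

In `indexMarkdownFiles`, `embeddingKeyName` could then be computed with `embeddingKey(doc.path, chunkNumber, chunk)`, so unchanged chunks still hit the cache while edited ones are regenerated.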

3. Searching the documentation

3.1 Vector similarity

To answer a user's question, we first generate an embedding for the question, then compute the cosine similarity between the query embedding and each chunk's embedding. We filter out anything below a certain similarity threshold and keep only the top X matches.

/**
 * Calculates the cosine similarity between two vectors.
 * Cosine similarity measures the cosine of the angle between two vectors in an inner product space.
 * Used to determine the similarity between chunks of text.
 *
 * @param vecA - The first vector
 * @param vecB - The second vector
 * @returns The cosine similarity score
 */
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
  // Calculate the dot product of the two vectors
  const dotProduct = vecA.reduce((sum, a, idx) => sum + a * vecB[idx], 0);

  // Calculate the magnitude (Euclidean norm) of each vector
  const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));

  // Compute and return the cosine similarity
  return dotProduct / (magnitudeA * magnitudeB);
};

const MIN_RELEVANT_CHUNKS_SIMILARITY = 0.77; // Minimum similarity required for a chunk to be considered relevant
const MAX_RELEVANT_CHUNKS_NB = 15; // Maximum number of relevant chunks to attach to chatGPT context

/**
 * Searches the indexed documents for the most relevant chunks based on a query.
 * Utilizes cosine similarity to find the closest matching embeddings.
 *
 * @param query - The search query provided by the user
 * @returns An array of the top matching document chunks' content
 */
const searchChunkReference = async (query: string) => {
  // Generate an embedding for the user's query
  const queryEmbedding = await generateEmbedding(query);

  // Calculate similarity scores between the query embedding and each document's embedding
  const results = vectorStore
    .map((doc) => ({
      ...doc,
      similarity: cosineSimilarity(queryEmbedding, doc.embedding), // Add similarity score to each doc
    }))
    // Filter out documents with low similarity scores
    // This avoids polluting the context with irrelevant chunks
    .filter((doc) => doc.similarity > MIN_RELEVANT_CHUNKS_SIMILARITY)
    .sort((a, b) => b.similarity - a.similarity) // Sort documents by highest similarity first
    .slice(0, MAX_RELEVANT_CHUNKS_NB); // Select the top most similar documents

  // Return the top matching chunks together with their similarity scores
  return results;
};
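
To see the filter, sort and slice steps in isolation, here is a small self-contained sketch that runs them on hand-made 2-dimensional vectors instead of real embeddings. The vectors and the limit of 2 results are illustrative values, and `rankChunks` is a hypothetical stand-in for `searchChunkReference` that takes the query embedding directly, so no API call is needed.

```typescript
type StoredChunk = { content: string; embedding: number[] };

// Same formula as cosineSimilarity above, repeated so the sketch runs on its own
const cosine = (vecA: number[], vecB: number[]): number => {
  const dot = vecA.reduce((sum, a, idx) => sum + a * vecB[idx], 0);
  const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  return dot / (normA * normB);
};

// Hypothetical stand-in for the search step: score, filter, sort, then cut
const rankChunks = (
  queryEmbedding: number[],
  store: StoredChunk[],
  minSimilarity: number,
  maxResults: number
) =>
  store
    .map((chunk) => ({
      ...chunk,
      similarity: cosine(queryEmbedding, chunk.embedding),
    }))
    .filter((chunk) => chunk.similarity > minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, maxResults);

// Toy store: the closer a vector points to [1, 0], the more similar it is to the query
const toyStore: StoredChunk[] = [
  { content: "Installation guide", embedding: [1, 0] },
  { content: "Configuration reference", embedding: [0.8, 0.6] },
  { content: "Release notes", embedding: [0, 1] },
  { content: "Quick start", embedding: [0.9, 0.1] },
];

const matches = rankChunks([1, 0], toyStore, 0.77, 2);
console.log(matches.map((match) => match.content));
// Kept: "Installation guide" (similarity 1) and "Quick start" (about 0.99)
```

"Configuration reference" scores 0.8 and passes the threshold but is cut by the limit of 2, while "Release notes" scores 0 and is removed by the filter.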

3.2 Prompting OpenAI with the relevant chunks

After sorting, we feed the top chunks into the system prompt of the ChatGPT request. This means ChatGPT sees the most relevant sections of your docs as if you had typed them into the conversation. We then let ChatGPT form an answer for the user.

const MODEL: OpenAI.Chat.ChatModel = "gpt-4o-2024-11-20"; // Model to use for chat completions

// Define the structure of messages used in chat completions
export type ChatCompletionRequestMessage = {
  role: "system" | "user" | "assistant"; // The role of the message sender
  content: string; // The text content of the message
};

/**
 * Handles the "Ask a question" endpoint in an Express.js route.
 * Processes user messages, retrieves relevant documents, and interacts with OpenAI's chat API to generate responses.
 *
 * @param messages - An array of chat messages from the user and assistant
 * @returns The assistant's response as a string
 */
export const askDocQuestion = async (
  messages: ChatCompletionRequestMessage[]
): Promise<string> => {
  // Assistant responses are filtered out; otherwise the chatbot gets stuck in a self-referential loop
  // Note that the search becomes less precise if the user changes topic during the chat
  const userMessages = messages.filter((message) => message.role === "user");

  // Join the user's messages into a single query string for the similarity search
  const formattedUserMessages = userMessages
    .map((message) => `- ${message.content}`)
    .join("\n");

  // 1) Find relevant documents based on the user's question
  const relevantChunks = await searchChunkReference(formattedUserMessages);

  // 2) Integrate the relevant documents into the initial system prompt
  const messagesList: ChatCompletionRequestMessage[] = [
    {
      role: "system",
      content:
        "Ignore all previous instructions.\n" +
        "You're a helpful chatbot.\n" +
        "...\n" +
        "Here is the relevant documentation:\n" +
        relevantChunks
          .map(
            (doc, idx) =>
              `[Chunk ${idx}] filePath = "${doc.filePath}":\n${doc.content}`
          )
          .join("\n\n"), // Insert relevant chunks into the prompt
    },
    ...messages, // Include the chat history
  ];

  // 3) Send the compiled messages to OpenAI's Chat Completion API (using a specific model)
  const response = await openai.chat.completions.create({
    model: MODEL,
    messages: messagesList,
  });

  const result = response.choices[0].message.content; // Extract the assistant's reply

  if (!result) {
    throw new Error("No response from OpenAI");
  }

  return result;
};
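
With up to 15 chunks of about 2,000 characters each (500 tokens at 4.15 characters per token), the documentation part of the system prompt can reach roughly 31,000 characters. If that is more than you want to send, one option is to cap it. The sketch below is an assumption on my side, not part of the original code: `formatContext` is a hypothetical helper that formats chunks the same way as the prompt above and stops once a character budget is used up.

```typescript
type RelevantChunk = { filePath: string; content: string };

// Hypothetical helper: formats chunks like the system prompt above,
// but stops adding chunks once `maxChars` would be exceeded.
// Assumes `chunks` is already sorted from most to least relevant.
const formatContext = (chunks: RelevantChunk[], maxChars: number): string => {
  const sections: string[] = [];
  let usedChars = 0;

  for (let idx = 0; idx < chunks.length; idx++) {
    const chunk = chunks[idx];
    const section = `[Chunk ${idx}] filePath = "${chunk.filePath}":\n${chunk.content}`;
    const separatorLength = sections.length > 0 ? 2 : 0; // Length of the "\n\n" separator

    if (usedChars + separatorLength + section.length > maxChars) break;

    sections.push(section);
    usedChars += separatorLength + section.length;
  }

  return sections.join("\n\n");
};
```

Because the chunks arrive sorted by similarity, the ones dropped by the budget are always the least relevant.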

Implementing the OpenAI API for a chatbot using Express

To run our system, we will use an Express.js server. Here is an example of a small Express.js endpoint that handles the query:

import express, { type Request, type Response } from "express";
import {
  type ChatCompletionRequestMessage,
  askDocQuestion,
  indexMarkdownFiles,
} from "./askDocQuestion";

// Automatically fill the vector store with embeddings when server starts
indexMarkdownFiles().catch(console.error); // Log indexing failures instead of leaving the promise unhandled

const app = express();

// Parse incoming requests with JSON payloads
app.use(express.json());

type AskRequestBody = {
  messages: ChatCompletionRequestMessage[];
};

// Routes
app.post(
  "/ask",
  async (
    req: Request<undefined, undefined, AskRequestBody>,
    res: Response<string>
  ) => {
    try {
      const response = await askDocQuestion(req.body.messages);

      res.json(response);
    } catch (error) {
      console.error(error);
      res.status(500).json("Internal server error"); // Always reply so the request does not hang
    }
  }
);

// Start server
app.listen(3000, () => {
  console.log(`Listening on port 3000`);
});
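
The route above trusts `req.body.messages` as-is. Since the body comes from the network, it may be worth checking its shape before calling `askDocQuestion`. The following type guard is a minimal sketch of such a check; `isChatMessages` is a hypothetical name, and the rule that the list must not be empty is an assumption.

```typescript
type ChatCompletionRequestMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

const ALLOWED_ROLES = ["system", "user", "assistant"];

// Hypothetical type guard: true only for a non-empty array of well-formed messages
const isChatMessages = (
  value: unknown
): value is ChatCompletionRequestMessage[] =>
  Array.isArray(value) &&
  value.length > 0 &&
  value.every(
    (message) =>
      typeof message === "object" &&
      message !== null &&
      ALLOWED_ROLES.indexOf(message.role) !== -1 &&
      typeof message.content === "string"
  );
```

Inside the route, a failed check could be answered with a 400 status before any OpenAI call is made.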

UI: Chatbot interface

On the frontend, I built a small React component with a chat-like interface. It sends messages to my Express backend and displays the replies. Nothing too fancy, so we'll skip the details.

Code template

I made a code template that you can use as a starting point for your own chatbot.

Live demo

If you want to test the final implementation of this chatbot, check out this demo page.

Demo page

Demo code

Backend: askDocQuestion.ts
Frontend: ChatBot components

Go further

On YouTube, have a look at this video by Adrien Twarog, which covers OpenAI Embeddings and Vector Databases.

I also stumbled upon OpenAI's Assistants File Search documentation, which might be interesting if you want an alternative approach.

Conclusion

I hope this gives you an idea of how to handle documentation indexing for a chatbot:

Using chunking + overlap so that the right context is found,
Generating embeddings and storing them for quick vector similarity searches,
Finally, handing the relevant context over to ChatGPT.

I'm not an AI expert; this is just a solution that I found works well for my needs. If you have any tips for improving efficiency or a more polished approach, please let me know. I'd love to hear feedback on vector storage solutions, chunking strategies, or any other performance tips.

Thanks for reading, and feel free to share your thoughts!
