An AI Agent That Reads Research Papers So You Don't Have To: A Complete Build Guide

Learn how to build an AI agent for research paper retrieval, search, and summarization.

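For researchers, keeping up with the latest findings is like finding a needle in a haystack. Imagine an AI-powered assistant that not only retrieves the most relevant papers but also summarizes key insights and answers your specific questions in real time.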

This article delves into constructing such an AI research agent using Superlinked's complex document embedding capabilities. By integrating semantic and temporal relevance, we eliminate the need for complex reranking, ensuring efficient and accurate retrieval of information.

TL;DR:

Build a real-time AI research agent using Superlinked's vector search. It skips complex RAG pipelines by embedding and querying documents directly, making research faster, simpler, and smarter.

(Want to jump straight to the code? Check out the open-source project on GitHub here. Ready to try semantic search for your own agentic use case? We are here to help.)

This article explains how to build an agent system that uses a kernel agent to handle queries. Here's the Colab.

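Where to start when building a research assistant system?

Traditionally, building such a system involves complexity and considerable resource investment. Search systems typically retrieve a broad initial set of documents based on relevance, then apply a secondary reranking process to refine and reorder the results. Reranking improves accuracy, but it significantly increases computational complexity, latency, and overhead because of the large amount of data that must be retrieved up front. Superlinked addresses this complexity by combining structured numerical and categorical embeddings with semantic text embeddings, providing comprehensive multimodal vectors.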

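Building an agentic system with Superlinked

This AI agent can do three main things:

Find papers: Search for research papers by topic (for example, "quantum computing"), then rank them by relevance and recency.
Summarize papers: Condense the requested papers into bite-sized insights.
Answer questions: Extract answers directly from specific research papers based on targeted user questions.

Superlinked improves vector search relevance, which removes the need for reranking methods. Superlinked's RecencySpace specifically encodes temporal metadata, prioritizing recent documents during retrieval and eliminating the need for computationally expensive reranking.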
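
To see why a single query can stand in for a retrieve-then-rerank pipeline, it helps to look at what happens when per-space vectors are weighted and concatenated. The sketch below is a simplified illustration in plain Python, not Superlinked's actual encoding: the toy vectors, the `combine` helper, and the weights are all assumptions made for the example. It shows that the dot product of two concatenated vectors equals the weighted sum of the per-space dot products, so relevance and recency are scored in one pass.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def combine(text_vec, recency_vec, text_weight, recency_weight):
    """Concatenate two per-space vectors, scaling each part by its weight."""
    return [text_weight * x for x in text_vec] + [recency_weight * x for x in recency_vec]

# Toy, hand-written vectors (hypothetical values, for illustration only)
query_text, query_recency = [0.6, 0.8], [1.0, 0.0]
paper_text, paper_recency = [0.8, 0.6], [0.5, 0.5]

# Weights are applied on the query side; the stored paper vector is unweighted
query_vec = combine(query_text, query_recency, text_weight=1.0, recency_weight=0.5)
paper_vec = combine(paper_text, paper_recency, text_weight=1.0, recency_weight=1.0)

# One dot product over the concatenated vectors...
single_pass = dot(query_vec, paper_vec)
# ...matches the weighted sum of the per-space similarities
per_space = 1.0 * dot(query_text, paper_text) + 0.5 * dot(query_recency, paper_recency)
assert math.isclose(single_pass, per_space)
```

Because the weights only scale parts of the query vector, they can be changed at query time without re-embedding the stored documents.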

Step 1: Setting up the toolbox

 %pip install superlinked

To keep things simple and modular, I created an abstract Tool class. This simplifies the process of building and adding tools.

import pandas as pd
import superlinked.framework as sl
from datetime import timedelta
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import os
from abc import ABC, abstractmethod
from typing import Any, Optional, Dict
from tqdm import tqdm
from google.colab import userdata

# Abstract Tool Class
class Tool(ABC):
    @abstractmethod
    def name(self) -> str:
        pass

    @abstractmethod
    def description(self) -> str:
        pass

    @abstractmethod
    def use(self, *args, **kwargs) -> Any:
        pass


# Get the API key from Google Colab secrets, falling back to an environment variable
try:
    api_key = userdata.get('OPENAI_API_KEY')
except Exception:
    # The secret is missing or this notebook has no access to it
    api_key = os.environ.get("OPENAI_API_KEY")

if not api_key:
    raise ValueError("OPENAI_API_KEY not found. Add it to your Colab secrets or set the OPENAI_API_KEY environment variable.")

# Initialize OpenAI Client
client = OpenAI(api_key=api_key)
model = "gpt-4"
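
As a quick check of the interface, here is a minimal sketch of a concrete tool. `WordCountTool` and the `registry` dictionary are hypothetical names used only for illustration; the `Tool` base class is repeated so the snippet runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class Tool(ABC):
    """Same interface as the abstract Tool class above."""
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def description(self) -> str: ...

    @abstractmethod
    def use(self, *args, **kwargs) -> Any: ...

class WordCountTool(Tool):
    """Hypothetical toy tool: counts the words in a piece of text."""
    def name(self) -> str:
        return "WordCountTool"

    def description(self) -> str:
        return "Counts the words in the given text."

    def use(self, text: str) -> int:
        return len(text.split())

# A registry keyed by tool name makes it easy to add more tools later
registry: Dict[str, Tool] = {}
tool = WordCountTool()
registry[tool.name()] = tool

word_count = registry["WordCountTool"].use("quantum computing and phase transitions")
```

Every tool exposes the same three methods, so an agent can look a tool up by name and call `use` without knowing its internals.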

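Step 2: Understanding the dataset

This example uses a dataset of approximately 10,000 AI research papers, available on Kaggle. To keep things simple, run the cell below to download the dataset automatically to your working directory. You can also use your own data sources, such as research papers or other academic content.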

import pandas as pd

!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1FCR3TW5yLjGhEmm-Uclw0_5PWVEaLk1j' -O arxiv_ai_data.csv

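For now, to make things run a little faster, we use a small subset of the papers, but feel free to try the example with the full dataset. An important technical detail here is that the timestamps in the dataset are converted from string timestamps (such as '1993-08-01 00:00:00+00:00') into pandas datetime objects. This conversion is necessary because it lets us perform date and time operations.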

df = pd.read_csv('arxiv_ai_data.csv').head(100)

# Convert to datetime but keep it as datetime (more readable and usable)
df['published'] = pd.to_datetime(df['published'])

# Ensure summary is a string
df['summary'] = df['summary'].astype(str)

# Add 'text' column for similarity search
df['text'] = df['title'] + " " + df['summary']

Debug: Columns in original DataFrame: ['authors', 'categories', 'comment', 'doi', 'entry_id', 'journal_ref', 'pdf_url', 'primary_category', 'published', 'summary', 'title', 'updated']
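
To make the timestamp conversion concrete, the sketch below parses the same kind of string with Python's standard library instead of pandas (an assumption made only to keep the snippet dependency-free) and shows the date arithmetic that becomes possible afterwards. The reference date and variable names are illustrative.

```python
from datetime import datetime, timezone

raw = "1993-08-01 00:00:00+00:00"        # string timestamp as stored in the CSV
published = datetime.fromisoformat(raw)  # timezone-aware datetime object

# Once parsed, date/time operations work directly
reference = datetime(1996, 8, 1, tzinfo=timezone.utc)  # hypothetical "now"
age_days = (reference - published).days

# A numeric Unix timestamp is also easy to derive when one is needed
unix_seconds = int(published.timestamp())
```

A plain string supports none of these operations, which is why the conversion happens before the data is indexed.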

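Understanding the dataset columns

Below is a brief overview of the key columns in the dataset that matter in the upcoming steps:

published: The publication date of the research paper.
summary: The abstract of the paper, providing a concise overview.
entry_id: The unique identifier for each paper from arXiv.

For this demonstration, we focus on four columns: entry_id, published, title, and summary. To optimize retrieval quality, the title and summary are combined into a single comprehensive text column, which forms the core of our embedding and search process.

A note on Superlinked's in-memory indexer: Superlinked's in-memory index stores the dataset directly in RAM, which makes retrieval extremely fast and is ideal for real-time search and rapid prototyping.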

Step 3: Defining the Superlinked schema

To move ahead, we need a schema to map our data. We set up PaperSchema with key fields:

class PaperSchema(sl.Schema):
    text: sl.String
    published: sl.Timestamp  # This will handle datetime objects properly
    entry_id: sl.IdField
    title: sl.String
    summary: sl.String

paper = PaperSchema()
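
Before mapping rows onto the schema, it can help to confirm that the DataFrame actually has every column the schema expects. The following is a small sketch in plain Python; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is copied from the debug output shown earlier.

```python
# Columns the schema fields are mapped from (entry_id, published, title, summary, text)
REQUIRED_COLUMNS = {"entry_id", "published", "title", "summary", "text"}

def missing_columns(columns):
    """Return the required columns that are absent, sorted for stable output."""
    return sorted(REQUIRED_COLUMNS - set(columns))

# Column names of the raw CSV, before the combined 'text' column is added
raw_columns = [
    "authors", "categories", "comment", "doi", "entry_id", "journal_ref",
    "pdf_url", "primary_category", "published", "summary", "title", "updated",
]

before = missing_columns(raw_columns)            # 'text' has not been created yet
after = missing_columns(raw_columns + ["text"])  # all schema fields are covered
```

With a DataFrame, the same check can be run as `missing_columns(df.columns)`.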

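Defining Superlinked spaces for effective retrieval

An essential step in organizing and effectively querying our dataset is defining two specialized vector spaces: TextSimilaritySpace and RecencySpace.

TextSimilaritySpace

The TextSimilaritySpace is designed to encode textual information, such as the titles and abstracts of research papers, into vectors. By converting text into embeddings, this space significantly improves the ease and accuracy of semantic search.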

text_space = sl.TextSimilaritySpace(
    text=sl.chunk(paper.text, chunk_size=200, chunk_overlap=50),
    model="sentence-transformers/all-mpnet-base-v2"
)
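
The `chunk_size` and `chunk_overlap` arguments describe a sliding window over the text. The sketch below is a simplified, character-based illustration of that idea. `chunk_text` is a hypothetical helper, and Superlinked's `sl.chunk` may split on different boundaries, so treat it as a mental model rather than the library's implementation.

```python
from typing import List

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

The overlap means text near a chunk boundary appears in two neighboring chunks, so a sentence cut by one window is still seen whole by the next.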

RecencySpace

The RecencySpace captures temporal metadata, emphasizing how recent a research publication is. By encoding timestamps, this space assigns greater significance to newer documents.

recency_space = sl.RecencySpace(
    timestamp=paper.published,
    period_time_list=[
        sl.PeriodTime(timedelta(days=365)),      # papers within 1 year
        sl.PeriodTime(timedelta(days=2*365)),    # papers within 2 years
        sl.PeriodTime(timedelta(days=3*365)),    # papers within 3 years
    ],
    negative_filter=-0.25
)

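Think of RecencySpace as a time-based filter, similar to sorting your email by date or viewing the newest posts first on Instagram. It helps answer the question, "How fresh is this paper?"

Smaller timedeltas (such as 365 days) allow for more granular, yearly time-based ranking.
Larger timedeltas (such as 1095 days) create broader time periods.

The negative_filter penalizes very old papers. To make this clearer, consider the following example, in which two papers have identical content relevance but their rankings depend on their publication dates.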

Paper A: Published in 1996 
Paper B: Published in 1993

Scoring example:
- Text similarity score: Both papers get 0.8
- Recency score:
  - Paper A: Receives the full recency boost (1.0)
  - Paper B: Gets penalized (-0.25 due to negative_filter)

Final combined scores:
- Paper A: Higher final rank
- Paper B: Lower final rank
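
The scoring example can be reproduced with a toy model. This is a deliberately simplified sketch, not Superlinked's actual recency encoding: the linear decay, the `recency_score` helper, the mid-1996 reference date, and the exact publication dates are all assumptions chosen so that Paper A counts as fresh and Paper B falls outside every period. The weights 1.0 and 0.5 are the ones used later in the retrieval tool.

```python
from datetime import datetime, timedelta

def recency_score(published, now, periods, negative_filter):
    """Toy recency score: linear decay inside each period, penalty beyond the longest one."""
    age = now - published
    if age > max(periods):
        return negative_filter  # older than every period: penalized
    # 1.0 at age zero, falling to 0.0 at each period boundary, averaged over periods
    decays = [max(0.0, 1.0 - age / period) for period in periods]
    return sum(decays) / len(decays)

periods = [timedelta(days=365), timedelta(days=2 * 365), timedelta(days=3 * 365)]
now = datetime(1996, 6, 1)  # hypothetical reference date

recency_a = recency_score(datetime(1996, 6, 1), now, periods, negative_filter=-0.25)
recency_b = recency_score(datetime(1993, 1, 1), now, periods, negative_filter=-0.25)

# Both papers share the same text similarity, so recency decides the order
text_similarity = 0.8
final_a = 1.0 * text_similarity + 0.5 * recency_a
final_b = 1.0 * text_similarity + 0.5 * recency_b
```

Under these assumptions Paper A ends up with the higher combined score, which matches the ranking described above.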

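These spaces are key to making the dataset more accessible and efficient to query. They enable both content-based and time-based search, and they help capture both the relevance and the recency of research papers.

Step 4: Building the index

Next, the spaces are fused into an index, which is the core of the search engine: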

paper_index = sl.Index([text_space, recency_space])

Then the DataFrame is mapped to the schema and loaded into an in-memory store in batches (10 papers at a time):

# Parser to map DataFrame columns to schema fields
parser = sl.DataFrameParser(
    paper,
    mapping={
        paper.entry_id: "entry_id",
        paper.published: "published",
        paper.text: "text",
        paper.title: "title",
        paper.summary: "summary",
    }
)

# Set up in-memory source and executor
source = sl.InMemorySource(paper, parser=parser)
executor = sl.InMemoryExecutor(sources=[source], indices=[paper_index])
app = executor.run()

# Load the DataFrame with a progress bar using batches
batch_size = 10
data_batches = [df[i:i + batch_size] for i in range(0, len(df), batch_size)]
for batch in tqdm(data_batches, total=len(data_batches), desc="Loading Data into Source"):
    source.put([batch])
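
The batching logic in the loop above is independent of Superlinked and can be checked on its own. Here is a minimal sketch using a plain list in place of the DataFrame; `batched` is a hypothetical helper name.

```python
from typing import Iterator, Sequence

def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 25 stand-in records split into batches of 10
papers = [f"paper-{i}" for i in range(25)]
batch_lengths = [len(batch) for batch in batched(papers, batch_size=10)]
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, so no rows are dropped.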

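The in-memory executor is why Superlinked shines here: 1,000 papers fit comfortably in RAM, and queries fly without disk I/O bottlenecks.

Step 5: Crafting the query

Next comes query creation. This is where the template for crafting queries is defined. To manage this, we need a query template that can balance both relevance and recency: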

# Define the query
knowledgebase_query = (
    sl.Query(
        paper_index,
        weights={
            text_space: sl.Param("relevance_weight"),
            recency_space: sl.Param("recency_weight"),
        }
    )
    .find(paper)
    .similar(text_space, sl.Param("search_query"))
    .select(paper.entry_id, paper.published, paper.text, paper.title, paper.summary)
    .limit(sl.Param("limit"))
)

This lets us choose whether to prioritize content (relevance_weight) or recency (recency_weight), a very useful combination for our agent's needs.
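
The effect of the two weights can be seen on a toy example. In the sketch below the candidate papers and their per-space scores are made up, and `rank` is a hypothetical helper that orders candidates by a weighted sum; it is meant only to show how changing the weights at query time can flip the order.

```python
def rank(candidates, relevance_weight, recency_weight):
    """Order candidates by the weighted sum of their similarity and recency scores."""
    def score(candidate):
        return (relevance_weight * candidate["similarity"]
                + recency_weight * candidate["recency"])
    return [c["title"] for c in sorted(candidates, key=score, reverse=True)]

# Hypothetical candidates with precomputed per-space scores
candidates = [
    {"title": "Older but on-topic", "similarity": 0.9, "recency": 0.1},
    {"title": "Newer but looser match", "similarity": 0.7, "recency": 0.9},
]

relevance_only = rank(candidates, relevance_weight=1.0, recency_weight=0.0)
balanced = rank(candidates, relevance_weight=1.0, recency_weight=0.5)
```

With recency switched off the on-topic paper wins; with a recency weight of 0.5 the newer paper moves to the top.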

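Step 6: Building tools

Now comes the tooling part.

We will create three tools...

Retrieval Tool: This tool is built by plugging into Superlinked's index, which lets it pull the top 5 papers for a query. It balances relevance (weight 1.0) and recency (weight 0.5) to accomplish the "find papers" goal. What we want is to find the papers that are relevant to the query. So if the query is "What quantum computing papers were published between 1993 and 1994?", the retrieval tool fetches those papers, and the agent lists them one by one with a short excerpt of each summary.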

class RetrievalTool(Tool):
    def __init__(self, df, app, knowledgebase_query, client, model):
        self.df = df
        self.app = app
        self.knowledgebase_query = knowledgebase_query
        self.client = client
        self.model = model

    def name(self) -> str:
        return "RetrievalTool"

    def description(self) -> str:
        return "Retrieves a list of relevant papers based on a query using Superlinked."

    def use(self, query: str) -> pd.DataFrame:
        result = self.app.query(
            self.knowledgebase_query,
            relevance_weight=1.0,
            recency_weight=0.5,
            search_query=query,
            limit=5
        )
        df_result = sl.PandasConverter.to_pandas(result)
        # Ensure summary is a string
        if 'summary' in df_result.columns:
            df_result['summary'] = df_result['summary'].astype(str)
        else:
            print("Warning: 'summary' column not found in retrieved DataFrame.")
        return df_result

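Next up is the Summarization Tool. This tool is designed for cases where a concise summary of a paper is needed. To use it, you provide paper_ids, a list of the IDs of the papers that need to be summarized. If no paper_ids are provided, the tool will not work, because these IDs are required to find the corresponding papers in the dataset.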

class SummarizationTool(Tool):
    def __init__(self, df, client, model):
        self.df = df
        self.client = client
        self.model = model

    def name(self) -> str:
        return "SummarizationTool"

    def description(self) -> str:
        return "Generates a concise summary of specified papers using an LLM."

    def use(self, query: str, paper_ids: list) -> str:
        papers = self.df[self.df['entry_id'].isin(paper_ids)]
        if papers.empty:
            return "No papers found with the given IDs."
        summaries = papers['summary'].tolist()
        summary_str = "\n\n".join(summaries)
        prompt = f"""
        Summarize the following paper summaries:\n\n{summary_str}\n\nProvide a concise summary.
        """
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=500
        )
        return response.choices[0].message.content.strip()

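Finally, we have the QuestionAnsweringTool. This tool chains the RetrievalTool to fetch relevant papers and then uses them to answer questions. If no relevant papers are found, it provides an answer based on general knowledge.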

class QuestionAnsweringTool(Tool):
    def __init__(self, retrieval_tool, client, model):
        self.retrieval_tool = retrieval_tool
        self.client = client
        self.model = model

    def name(self) -> str:
        return "QuestionAnsweringTool"

    def description(self) -> str:
        return "Answers questions about research topics using retrieved paper summaries or general knowledge if no specific context is available."

    def use(self, query: str) -> str:
        df_result = self.retrieval_tool.use(query)
        if 'summary' not in df_result.columns:
            # Tag as a general question if summary is missing
            prompt = f"""
            You are a knowledgeable research assistant. This is a general question tagged as [GENERAL]. Answer based on your broad knowledge, not limited to specific paper summaries. If you don't know the answer, provide a brief explanation of why.

            User's question: {query}
            """
        else:
            # Use paper summaries for specific context
            contexts = df_result['summary'].tolist()
            context_str = "\n\n".join(contexts)
            prompt = f"""
            You are a research assistant. Use the following paper summaries to answer the user's question. If you don't know the answer based on the summaries, say 'I don't know.'

            Paper summaries:
            {context_str}

            User's question: {query}
            """
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=500
        )
        return response.choices[0].message.content.strip()
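
The branching between a context-grounded prompt and a general-knowledge prompt can be isolated and tested without calling an LLM. The sketch below uses a hypothetical `build_qa_prompt` helper with shortened prompt wording. Note one deliberate difference, stated as an assumption: the tool above falls back only when the `summary` column is missing, whereas this sketch also falls back when the list of summaries is empty.

```python
from typing import List

def build_qa_prompt(query: str, summaries: List[str]) -> str:
    """Return a context-grounded prompt when summaries exist, else a general one."""
    cleaned = [s.strip() for s in summaries if s and s.strip()]
    if not cleaned:
        # No usable context: ask the model to rely on general knowledge
        return (
            "You are a knowledgeable research assistant. "
            "This is a general question tagged as [GENERAL].\n\n"
            f"User's question: {query}"
        )
    context = "\n\n".join(cleaned)
    return (
        "You are a research assistant. Use the following paper summaries "
        "to answer the user's question.\n\n"
        f"Paper summaries:\n{context}\n\nUser's question: {query}"
    )

general_prompt = build_qa_prompt("What is quantum annealing?", [])
grounded_prompt = build_qa_prompt("What is quantum annealing?", ["A survey of annealing."])
```

Keeping prompt construction in a pure function makes the fallback behavior easy to verify before any API call is made.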

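Step 7: Building the Kernel Agent

Next is the Kernel Agent. It acts as the central controller, ensuring smooth and efficient operation. As the core component of the system, the Kernel Agent coordinates communication by routing queries according to their intent when multiple agents operate simultaneously. In a single-agent system such as this one, the Kernel Agent directly uses the relevant tools to manage tasks effectively.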

class KernelAgent:
    def __init__(self, retrieval_tool: RetrievalTool, summarization_tool: SummarizationTool, question_answering_tool: QuestionAnsweringTool, client, model):
        self.retrieval_tool = retrieval_tool
        self.summarization_tool = summarization_tool
        self.question_answering_tool = question_answering_tool
        self.client = client
        self.model = model

    def classify_query(self, query: str) -> str:
        prompt = f"""
        Classify the following user prompt into one of the three categories:
        - retrieval: The user wants to find a list of papers based on some criteria (e.g., 'Find papers on AI ethics from 2020').
        - summarization: The user wants to summarize a list of papers (e.g., 'Summarize papers with entry_id 123, 456, 789').
        - question_answering: The user wants to ask a question about research topics and get an answer (e.g., 'What is the latest development in AI ethics?').

        User prompt: {query}

        Respond with only the category name (retrieval, summarization, question_answering).
        If unsure, respond with 'unknown'.
        """
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=10
        )
        classification = response.choices[0].message.content.strip().lower()
        print(f"Query type: {classification}")
        return classification

    def process_query(self, query: str, params: Optional[Dict] = None) -> str:
        query_type = self.classify_query(query)
        if query_type == 'retrieval':
            df_result = self.retrieval_tool.use(query)
            response = "Here are the top papers:\n"
            for rank, (_, row) in enumerate(df_result.iterrows(), start=1):
                # Number results by position, independent of the DataFrame index
                summary = str(row['summary']) if pd.notna(row['summary']) else ""
                response += f"{rank}. {row['title']} \nSummary: {summary[:200]}...\n\n"
            return response
        elif query_type == 'summarization':
            if not params or 'paper_ids' not in params:
                return "Error: Summarization query requires a 'paper_ids' parameter with a list of entry_ids."
            return self.summarization_tool.use(query, params['paper_ids'])
        elif query_type == 'question_answering':
            return self.question_answering_tool.use(query)
        else:
            return "Error: Unable to classify query as 'retrieval', 'summarization', or 'question_answering'."
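
Because the category comes back as free text from an LLM, small variations such as capitalization, a trailing period, or a space instead of an underscore would cause the exact string comparisons above to fail. The sketch below shows one possible way to normalize the label before routing; `normalize_category` and `VALID_CATEGORIES` are hypothetical names and are not part of the agent above.

```python
VALID_CATEGORIES = ("retrieval", "summarization", "question_answering")

def normalize_category(raw: str) -> str:
    """Map a raw model reply onto a known category, or 'unknown' if none matches."""
    cleaned = raw.strip().lower().strip(".'\"` ")            # drop quotes and trailing punctuation
    cleaned = cleaned.replace(" ", "_").replace("-", "_")    # unify word separators
    return cleaned if cleaned in VALID_CATEGORIES else "unknown"

labels = [
    normalize_category("Retrieval."),
    normalize_category(" Question Answering "),
    normalize_category("'summarization'"),
    normalize_category("something else"),
]
```

A classifier prompt like this one is also commonly run with a low temperature so that the same query is routed the same way each time.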

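At this stage, all components of the research agent system have been configured. The system can now be initialized by providing the Kernel Agent with the appropriate tools, after which the research agent system is fully operational.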

retrieval_tool = RetrievalTool(df, app, knowledgebase_query, client, model)
summarization_tool = SummarizationTool(df, client, model)
question_answering_tool = QuestionAnsweringTool(retrieval_tool, client, model)

# Initialize KernelAgent
kernel_agent = KernelAgent(retrieval_tool, summarization_tool, question_answering_tool, client, model)

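Now let's test the system...

# Test query
print(kernel_agent.process_query("Find papers on quantum computing in last 10 years"))

Running this activates the RetrievalTool. It fetches the relevant papers based on both relevance and recency and returns the relevant columns. If the returned results include the summary column (indicating the papers came from the dataset), the agent uses those summaries and returns them to us.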

Query type: retrieval
Here are the top papers:
1. Quantum Computing and Phase Transitions in Combinatorial Search
Summary: We introduce an algorithm for combinatorial search on quantum computers that
is capable of significantly concentrating amplitude into solutions for some NP
search problems, on average. This is done by...

2. The Road to Quantum Artificial Intelligence
Summary: This paper overviews the basic principles and recent advances in the emerging
field of Quantum Computation (QC), highlighting its potential application to
Artificial Intelligence (AI). The paper provi...

3. Solving Highly Constrained Search Problems with Quantum Computers
Summary: A previously developed quantum search algorithm for solving 1-SAT problems in
a single step is generalized to apply to a range of highly constrained k-SAT
problems. We identify a bound on the number o...

4. The model of quantum evolution
Summary: This paper has been withdrawn by the author due to extremely unscientific
errors....

5. Artificial and Biological Intelligence
Summary: This article considers evidence from physical and biological sciences to show
machines are deficient compared to biological systems at incorporating
intelligence. Machines fall short on two counts: fi...

Let's try one more query. This time, let's summarize a paper.

print(kernel_agent.process_query("Summarize this paper", params={"paper_ids": ["http://arxiv.org/abs/cs/9311101v1"]}))

Query type: summarization
This paper discusses the challenges of learning logic programs that contain the cut predicate (!). Traditional learning methods cannot handle clauses with cut because it has a procedural meaning. The proposed approach is to first generate a candidate base program that covers positive examples, and then make it consistent by inserting cut where needed. Learning programs with cut is difficult due to the need for intensional evaluation, and current induction techniques may need to be limited to purely declarative logic languages.

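I hope this example is helpful for developing AI agents and agent-based systems. Much of the retrieval functionality demonstrated here was made possible by Superlinked, so please consider starring the repository for future reference when you need accurate retrieval capabilities for your AI agents!

Takeaways

Notebook code

Combining semantic and temporal relevance eliminates complex reranking while maintaining search accuracy for research papers.
Time-based penalties (negative_filter=-0.25) prioritize recent research when papers have similar content relevance.
The modular, tool-based architecture lets specialized components handle distinct tasks (retrieval, summarization, question answering) while keeping the system cohesive.
Loading data in small batches (batch_size=10) with progress tracking improves system stability when processing large research datasets.
Adjustable query weights let users balance relevance (1.0) and recency (0.5) based on specific research needs.
The question-answering component degrades gracefully to general knowledge when paper-specific context is unavailable, preventing dead-end user experiences.

Keeping up to date with the large number of research papers published regularly can be challenging and time-consuming. An agentic AI assistant workflow that can efficiently locate relevant research, summarize key insights, and answer specific questions from these papers could significantly streamline this process.

Contributors

Vipul Maheshwari, author
Filip Makraduli, reviewer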
