Using AI for Financial Analysis
As AI continues to gain adoption across many industries, a clear trend is emerging. AI is being used to automate well-structured, “unmessy” tasks like programming or web design, where the goal is well-defined and the tools are cleanly inventoried. Less attention has been paid to tasks like financial analysis, where the goal is constantly evolving and the data is obscure.
That is unfortunate, because one of the most interesting attributes of AI, and LLMs in particular, is their ability to analyze unstructured data and make independent decisions within an unpredictable data environment. This makes certain agent-based AI architectures particularly suited for analyzing economic and financial data.
In today’s article, we will build an autonomous and agentic data pipeline that highlights these capabilities. Anyone with a Python installation and access to a modern OpenAI model will be able to replicate this pipeline. I hope that this article will inspire you to examine your own data processes and explore ways to make them more efficient with AI.
Problem Statement
As Bitcoin gains mainstream adoption, many consolidated Bitcoin mines have become publicly traded institutions with substantial data center operations. These data centers collectively mine billions of dollars in bitcoin every month and draw significant energy, either from their own sources or from the grid, making them consequential regional and national economic actors.
The growing difficulty of the Bitcoin network requires these miners to constantly expand and update their equipment to stay profitable.
[Chart: Bitcoin network mining difficulty over time. Image credit: Blockchain.com]
Unfortunately, most of the cutting-edge Bitcoin mining equipment is made in China, which is currently subject to major tariffs by the United States. To circumvent these tariffs, the mining equipment providers have set up operations outside of China. However, due to capital and technology transfer constraints, these operations are growing slowly and are unlikely to satisfy the significant demand from the miners.
In today’s article, we will build a pipeline that analyzes financial filings of Core Scientific (CORZ), one of the world’s largest Bitcoin miners, and estimates the tariff bill for its mining equipment yet to be delivered.
Technologies Used
We will primarily use Python and OpenAI to build this pipeline, but you should be able to substitute these with any language with good data analysis capabilities (such as R or Ruby) and LLMs from any of the major providers (such as Claude or Gemini).
We will use a vector database to search through the text of the filing. For this tutorial, we will use ChromaDB for its flexible in-memory database, convenient utilities, and easy setup.
We will use Google’s Search API for programmatic web search capabilities, combined with some light web scraping to get CORZ’s filings and pricing information from the mining equipment manufacturers.
Techniques
The pipeline relies on a combination of techniques: Retrieval-Augmented Generation (RAG), Agentic AI techniques, and Function Calling.
- Retrieval-Augmented Generation is an AI engineering technique in which the engineer takes control of the knowledge retrieval process to make generation more accurate. For this data pipeline tutorial, we will use two kinds of RAG: one that relies on semantic search and one that relies on programmatic web search. Both will help the AI generate more accurate and truthful responses.
- Agentic AI is a technique where AI processes interact with other programs, including other AI systems. This allows AI to interact with the outside world, increasing the flexibility of the AI application. AI agents, where different AI processes interact with one another, enable greater specialization within prompts. This leads to higher-quality final outputs.
- Function Calling is a feature of many modern LLMs. Function calling allows the AI engineer to define tools that the AI model can use, and the AI can decide if and when to invoke these tools to better perform its task.
One bonus technique is Chain-of-Thought Prompting, which plays a minor role in the article but is an important example of how prompt engineering can be used to improve model skill on difficult tasks in deeply nuanced ways.
Building the Data Pipeline
Setup
We will build the entire pipeline in Python, so let’s first open a file, and name it “corz_tariff_analysis.py”.
We begin by importing some requisite packages. If you don’t have the required packages, install them using pip:
pip install requests beautifulsoup4 chromadb markdownify
Now let’s import the necessary packages into the script. We could have used the Python OpenAI package, but I prefer using the Requests package here for simplicity.
import requests # helps us make GPT-4o calls and run scrapes
from bs4 import BeautifulSoup # process the html
import chromadb # an in-memory vector database for faster retrieval
from markdownify import markdownify as md # converting HTML to markdown for better chunking
import json # parse JSON outputs
Let’s also define our API keys:
openai_key = "<your OpenAI API key here>"
google_api_key = "<your Google API key here (make sure programmatic search is enabled on your account!)>"
google_cx_id = "<your Google Programmable Search Engine ID>"
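If you prefer not to hard-code secrets in the script, a common alternative is to read them from environment variables instead. A minimal sketch (the variable names here are just a convention I chose, not anything required by the APIs):

import os

# read the secrets from the environment instead of hard-coding them
openai_key = os.environ["OPENAI_API_KEY"]
google_api_key = os.environ["GOOGLE_API_KEY"]
google_cx_id = os.environ["GOOGLE_CX_ID"]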
Great! We are now ready to start coding.
Download and Chunk the SEC Filing
First, let’s download the SEC filing from which we will make our tariff estimate. For this article, I am using the most recent filing as of May 2025. CORZ’s IR page has a convenient HTML version that we can download.
filing_res = requests.get('https://investors.corescientific.com/sec-filings/all-sec-filings/content/0001628280-25-008302/core-20241231.htm')
print(len(filing_res.text))
Run the script and you should see that the filing is about 3.5 million characters long. A document this long will exceed the context window of most commercially available models.
Even if a large enough model existed, we wouldn't want to feed the whole filing to the model directly: we would waste a lot of model time (and thus money) processing tokens that are not needed, and a large amount of irrelevant context can dramatically reduce model accuracy and performance.
In order to find the most relevant parts of the filing, we will have to use a technique called vector search.
Vector Search: A Quick Primer
Vector search is a technique that splits text into chunks, uses a specialized AI model to create a numerical representation (a vector) of what each chunk is about, and compares the vectors to find the chunks most relevant to a particular subject or query.
While the underlying technology is relatively simple, there are several choices we have to make about the chunking methodology. One of the key skills of a modern AI engineer is the ability to identify the optimal chunking methodology based on the data, problem set, and the models being used.
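As a toy illustration of the idea, consider the sketch below. The three-dimensional vectors are made up for demonstration (real embeddings typically have hundreds of dimensions), but the principle is the same: the chunk whose vector points in the most similar direction to the query vector is the best match.

import math

def cosine_similarity(a, b):
    # a higher value means the two vectors point in more similar directions
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.2]  # pretend embedding of "Bitcoin miner purchase"
chunk_vecs = {
    "miner purchase section": [0.8, 0.2, 0.1],
    "executive compensation section": [0.1, 0.9, 0.3],
}
for name, vec in chunk_vecs.items():
    print(name, round(cosine_similarity(query_vec, vec), 3))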
Cleaning, Chunking, and Embedding
We’ll split the filing into relatively small chunks of 500 characters, but include an entire page of the filing where a relevant chunk was found. This approach allows for good search accuracy while also ensuring that the subsequent model has enough information to analyze.
Let’s first clean and split the data into sections. Normally, we would have to parse the HTML and remove unnecessary content, but we can save a significant amount of effort by converting HTML to Markdown, a lightweight text markup format. Markdown is also well-represented in the training data, making it easier for the AI to understand.
soup = BeautifulSoup(filing_res.text, 'html.parser') # use BeautifulSoup for initial processing
all_sections = md(str(soup)).split("\n---\n") # now feed the clean HTML to markdown and split by page breaks
content_sections = [s for s in all_sections if len(s) > 1000] # some pages are mostly empty
use_sections = content_sections[2:] # first two sections are disclosures and boilerplate
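Before moving on, it is worth adding a quick sanity check so you can confirm the split produced a sensible number of pages (the exact counts will depend on the version of the filing you download):

print(len(all_sections), 'raw sections')
print(len(use_sections), 'content sections kept')
print(use_sections[0][:300])  # preview the start of the first kept section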
Now we can split the pages and turn the chunks into vectors (i.e., “embed” them). In previous articles, I managed this process using native Python data structures, but today we have access to more advanced tools that streamline the task.
We’ll use ChromaDB, which features an optimized in-memory vector database, built-in vector embedding capabilities, and the ability to store metadata. The metadata will be especially helpful later.
We can chunk and vectorize the filing as follows:
chroma_client = chromadb.Client()
try:
    chroma_client.delete_collection(name="full_sections")  # start fresh on re-runs
except Exception:
    pass  # the collection does not exist yet on the first run
collection = chroma_client.get_or_create_collection(name="full_sections")
padding = 100
window = 500
for idx, section in enumerate(use_sections):
    for i in range(0, len(section), window):
        start_idx = max(0, i - padding)
        end_idx = min(len(section) - 1, i + window + padding)
        collection.upsert(
            documents=[
                section[start_idx: end_idx],
            ],
            ids=["idx-%d-%d-%s" % (idx, start_idx, end_idx)],
            metadatas=[{'section': idx, 'start': start_idx, 'end': end_idx}],
        )
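A note on embeddings: by default, ChromaDB embeds documents with a small local sentence-transformers model, which is good enough for this tutorial. If you would rather use OpenAI's embedding endpoint, ChromaDB lets you pass a different embedding function when creating the collection. A sketch of that alternative (the collection name and embedding model here are just illustrative choices):

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_key,
    model_name="text-embedding-3-small",  # any OpenAI embedding model can be used
)
collection = chroma_client.get_or_create_collection(
    name="full_sections_openai",
    embedding_function=openai_ef,
)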
And with that, we have a vector database that we can use to efficiently search the document. Let’s query the database to find the most relevant sections.
Querying the Vector DB
To find the sections most relevant to CORZ’s Bitcoin miner purchase, we just need to directly query the ChromaDB collection. ChromaDB will automatically embed the query and compare the query vector against the document vectors to find the source data sections most relevant to the query.
Simply run the following code:
results = collection.query(
query_texts=["Bitcoin Miners Purchase"],
n_results=3, # how many results to return
)
print(results['metadatas'])
This identifies sections 45 and 47 of the filing (roughly pages 48 and 50 of the actual filing). If you inspect the actual document sections, you will see that section 47 discusses the need to buy increasingly powerful mining equipment to compete with other miners. Section 45 talks about the specific equipment CORZ is purchasing, along with the quantity of each type of miner being purchased.
Vector search is highly adaptable - even if the language or the structure of the filing changes substantially in the future, vector search will still be able to extract the relevant sections.
Let’s use AI to summarize these pages and extract the information about CORZ’s miner purchase:
# define this function for later use
def generate_content(instruction, json):
    config_js = {
        'model': 'gpt-4o-mini',
        "messages": [{"role": "user", "content": instruction}],
        "temperature": 0.7,
    }
    if json:
        config_js['response_format'] = {"type": "json_object"}
    res = requests.post('https://api.openai.com/v1/chat/completions', headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer %s' % openai_key,
    }, json=config_js)
    return res.json()['choices'][0]['message']['content']
miners_rs = generate_content('''
Consider the following filing sections from CORZ's 10-K:
=====
%s
====
What types of bitcoin miners is CORZ in the process of acquiring from Bitmain, and in what quantity (in number of miners, dollar amounts, or hashrates)?
Please reply in JSON format.
''' % '\n------\n'.join(
[
use_sections[idx] for idx in
set([m['section'] for m in results['metadatas'][0]])
]
), json=True)
print(miners_rs)
Run the script and you should see an output similar to this:
{
"miners": [
{
"model": "Antminer S19J XP",
"quantity": 28400,
"exahash": 4.1
},
{
"model": "Antminer S21",
"quantity": 12900,
"exahash": 2.5
}
]
}
Now that we have successfully found the relevant filing sections, let’s use AI to find and calculate the tariff on these miners!
AI Orchestration and Function-Calling
While GPT-4o is very powerful, we can't simply ask it about the economic characteristics of the miner purchase. That's because commercially available models tend to freeze their training data roughly a year before release. Given that tariffs change quickly, we need to provide the AI with the most up-to-date information.
We can use Google's programmatic search API to retrieve that data and supply it to the AI, using a technique similar to the vector-enabled RAG above. Because this RAG is structurally much simpler, we can ask the AI to carry it out autonomously.
Implementing Function Calling
Function-calling allows us to provide the AI with a list of external tools and let it autonomously invoke these tools as needed. This powerful feature allows the AI to access proprietary data and processes, acting as a semi-executive decision-making layer within the data pipeline.
While these features are powerful, we must be careful when invoking them. AI will not always be an expert at executing external functions, and certain human-like behaviors, such as trial-and-error reasoning, remain a challenge for commercial AI models. A key skill for AI engineers is a deep and up-to-date understanding of current model capabilities and the ability to determine whether a given technique fits the use case.
In this case, we grant the AI access to an external search function that is itself AI-powered. Specifically, we will give it the ability to call the function below, which searches Google and summarizes the resulting snippets using our generate_content helper:
def search_snippets_summary(query, summary_prompt):
    res = requests.get('https://www.googleapis.com/customsearch/v1',
                       params={
                           "key": google_api_key,
                           "cx": google_cx_id,
                           "q": query
                       })
    snippets = [i['snippet'] for i in res.json()['items']]
    rs = generate_content('''
Consider the following search engine snippets:
=====
%s
====
%s
''' % (
        json.dumps(snippets),
        summary_prompt
    ), False)
    return rs
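Before wiring this helper into the function-calling loop, you can test it on its own. The query below is just an example, and the exact wording of the summary will vary with whatever Google returns at the time:

print(search_snippets_summary(
    'Antminer S21 price',
    'What is the unit price of the Antminer S21?'
))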
To expose this function to the model, we define a tool schema describing it, along with a wrapper that interacts with GPT-4o's function-calling feature:
tools_library = {
'search_snippets_summary': {
"type": "function",
"function": {
"name": "search_snippets_summary",
"description": "Search the web using the Google search API and summarize it using GPT-4o",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "the search query to hit with the Google API"
},
"summary_prompt": {
"type": "string",
"description": "The prompt to use to summarize the snippets"
}
},
"required": ["query", "summary_prompt"]
}
}
}
}
functions = {
'search_snippets_summary': search_snippets_summary,
}
def function_enabled_call(messages, tool_names):
    tools = [tools_library[tool_name] for tool_name in tool_names]
    print('message length', len(messages))
    payload = {
        "model": "gpt-4o",
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",
        "response_format": {"type": "json_object"},
    }
    response = requests.post("https://api.openai.com/v1/chat/completions", headers={
        "Authorization": f"Bearer {openai_key}",
        "Content-Type": "application/json"
    }, json=payload)
    response_data = response.json()
    if 'choices' not in response_data:
        print(messages)
        print(response_data)
    message = response_data["choices"][0]["message"]
    tool_calls = message.get("tool_calls", [])
    if not tool_calls or len(tool_calls) == 0:
        return response_data["choices"][0]["message"]["content"]
    messages.append(message)
    for tool_call in tool_calls:
        function_name = tool_call["function"]["name"]
        arguments = json.loads(tool_call["function"]["arguments"])
        tc_id = tool_call["id"]
        print(tc_id, function_name, arguments)
        function = functions.get(function_name)
        if function is not None:
            result = function(**arguments)
            print(result)
            messages.append({
                "role": "tool",
                "tool_call_id": tc_id,
                "content": json.dumps(result)
            })
    return function_enabled_call(messages, tool_names)
While this function is quite a bit longer than the generate_content function, the main takeaway is that we invoke the AI with specific tools defined, and the AI can either reply with generated content or with a request to use the available tools. We then invoke the tools on its behalf and send back the results, which the AI then uses to generate the final output.
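For reference, when the model decides to use a tool, the assistant message we receive looks roughly like the dictionary below (the id and arguments are illustrative), and the tool result we append must reference the same tool_call_id:

assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",  # illustrative id
        "type": "function",
        "function": {
            "name": "search_snippets_summary",
            "arguments": '{"query": "Antminer S21 price", "summary_prompt": "..."}'
        }
    }]
}
tool_result_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",  # must match the id of the tool call above
    "content": '"The unit price of the Antminer S21 base model is $3,695.00."'
}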
Let’s use this function to determine the current price of the Bitmain products CORZ is buying, using the results from the previous call:
prices_rs = function_enabled_call([ {
"role": "user", "content": """
What's the unit price of the following bitcoin miners?
%s
If there are multiple sub-models, return the price for the base model
Reply in JSON format""" % miners_rs
} ], ['search_snippets_summary'])
print(prices_rs)
You should see the following output:
message length 1
<call_id> search_snippets_summary {'query': 'Antminer S19J XP base model price', 'summary_prompt': "What's the unit price of Antminer S19J XP base model?"}
The unit price of the Antminer S19J XP is mentioned as $2,288.75 in one of the snippets.
<call_id> search_snippets_summary {'query': 'Antminer S21 base model price', 'summary_prompt': "What's the unit price of Antminer S21 base model?"}
The unit price of the Antminer S21 base model is $3,695.00.
message length 4
{
"miners": [
{
"model": "Antminer S19J XP",
"unit_price": 2288.75
},
{
"model": "Antminer S21",
"unit_price": 3695.00
}
]
}
You can see that GPT-4o is quite effective at orchestrating both Google Search and the AI-powered summarizer! Please note that because there are so many secondary-market transactions in these machines, prices may vary from call to call but will mostly remain within a consistent range.
We can do the same to determine the tariff rate:
tariff_rs = function_enabled_call([ {
"role": "user", "content": '''
What is the tariff rate, in percentage, for a U.S. firm on Bitmain products?
Reply in JSON format
'''
} ], ['search_snippets_summary'])
print(tariff_rs)
You should see an output similar to the one below:
message length 1
<call_id> search_snippets_summary {'query': 'US tariff rate Bitmain products', 'summary_prompt': 'Find the current tariff rate for Bitmain products imported to the US.'}
Based on the snippets provided, the current tariff rate for Bitmain products imported to the United States is 25%. This is mentioned in multiple snippets, indicating that if shipping to the U.S., customers will need to pay an extra 25% tariff on products made in China.
message length 3
{"tariff_rate_percentage":25}
Now that we have the tariff rate and prices, let’s calculate the total tariffs owed by CORZ on its Bitcoin miner order.
A Quick Note on AI Math
Since we have the tariff rates, prices, and units, it’s tempting to simply give all of the information to the AI and ask it to calculate the tariff. This would unfortunately lead to highly inaccurate answers. During my experiments, a simple “calculate total tariff” query returned different results each time, sometimes off by more than an order of magnitude.
This is because an LLM processes inputs as tokens of text rather than as numbers, so something as simple as adding two numbers becomes an error-prone prediction task rather than an exact computation. Some prompt engineering techniques have been developed to alleviate these weaknesses, but basic arithmetic remains a known weakness in GPT-4o.
Summarization and Calculation
We can make up for this shortcoming with function calling: we define simple external functions that perform the arithmetic on the AI's behalf.
To do this, define the following two functions:
def product(iterable):
    a = 1
    for i in iterable:
        a *= i
    return a

def sum_nums(iterable):
    return sum(iterable)
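As a quick check that these helpers do what the model will need, you can run them on the quantities and an illustrative unit price from the earlier search results:

print(product([28400, 2288.75]))  # total purchase price for one miner model
print(sum_nums([17363973.0, 11916375.0]))  # total tariff across both models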
And now modify your tools library like so:
tools_library = {
'search_snippets_summary': {
"type": "function",
"function": {
"name": "search_snippets_summary",
"description": "Search the web using the Google search API and summarizes it",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "the search query to hit with the Google API"
},
"summary_prompt": {
"type": "string",
"description": "The prompt to use to summarize the snippets"
}
},
"required": ["query", "summary_prompt"]
}
}
},
'sum_nums': {
"type": "function",
"function": {
"name": "sum_nums",
"description": "sum numbers in the list",
"parameters": {
"type": "object",
"properties": {
"iterable": {
"type": "array",
"items": {
"type": "number"
},
"description": "list of number to sum"
}
},
"required": ["iterable"]
}
}
},
'product': {
"type": "function",
"function": {
"name": "product",
"description": "calculate product of numbers in the list",
"parameters": {
"type": "object",
"properties": {
"iterable": {
"type": "array",
"items": {
"type": "number"
},
"description": "list of number to sum"
}
},
"required": ["iterable"]
}
}
},
}
functions = {
'search_snippets_summary': search_snippets_summary,
'sum_nums': sum_nums,
'product': product
}
Now we can make the final calculation call. Because this problem requires multi-step calculations, we can increase the AI's accuracy by using a simple chain-of-thought prompt:
final_rs = function_enabled_call([ {
"role": "user", "content": '''
What is the total tariff for the following purchase order of bitcoin miners?
Miners:
%s
Unit Prices:
%s
Tariff rate:
%s
Reply in JSON format, with the following JSON object:
{
"steps": [
"name": <step_name>,
"action": <step_action>,
"step_result": <step_result>
],
"total_tariff": <total_tariff>,
"itemized_tariff": {
<miner_type_1>: <tariff_type_1>,
<miner_type_2>: <tariff_type_2>
}
}
''' % (miners_rs, prices_rs, tariff_rs)
} ], ['search_snippets_summary', 'sum_nums', 'product'])
print(final_rs)
You should see something like the output below:
{
"steps": [
{
"name": "Calculate total price for Antminer S19J XP",
"action": "Quantity (28400) * Unit Price (2445.63)",
"step_result": 69455892.0
},
{
"name": "Calculate total price for Antminer S21",
"action": "Quantity (12900) * Unit Price (3695)",
"step_result": 47665500
},
{
"name": "Calculate total order price",
"action": "Sum of total prices of all miners",
"step_result": 117121392
},
{
"name": "Calculate tariff for Antminer S19J XP",
"action": "Total price (69455892) * Tariff Rate (25%)",
"step_result": 17363973.0
},
{
"name": "Calculate tariff for Antminer S21",
"action": "Total price (47665500) * Tariff Rate (25%)",
"step_result": 11916375.0
},
{
"name": "Calculate total tariff",
"action": "Sum of tariffs on all miners",
"step_result": 29280348.0
}
],
"total_tariff": 29280348.0,
"itemized_tariff": {
"Antminer S19J XP": 17363973.0,
"Antminer S21": 11916375.0
}
}
In the final prompt, we ask the AI to output the individual steps, forcing a chain-of-thought approach. This has been shown to increase accuracy in multi-step tasks like this one. We also specify the format of the final output JSON, because this object will likely be read by other (non-AI) programs. We therefore need to ensure the JSON output will have a consistent and predictable structure.
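Because the structure is fixed, downstream (non-AI) code can consume the result directly. As a small example, the snippet below parses the response and prints the totals; the field names come from the JSON schema we specified in the prompt:

final_js = json.loads(final_rs)
print('Total tariff: $%s' % format(final_js['total_tariff'], ',.2f'))
for miner, tariff in final_js['itemized_tariff'].items():
    print('  %s: $%s' % (miner, format(tariff, ',.2f')))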
Conclusion
Our tutorial explored building a data pipeline that analyzes financial filings to extract specific pieces of information. In the process, we used several advanced techniques, such as vector search, RAG, agents, chain-of-thought prompting, and function calling. To build modern AI applications, an AI engineer often has to combine several techniques to get the best performance out of AI.
Using advanced prompting techniques and modern AI features, we turned AI into something more than an information processor, something more akin to an executive/planning layer in our data pipeline. As AI models and applications become more advanced, this will likely become an increasingly common use case for AI technology.
Likewise, as AI features evolve, understanding the capabilities and limitations of these increasingly intricate models will become crucial, allowing AI engineers to get optimal results while avoiding costly mistakes.
Why AI Is Great for Financial Analysis
Today's article demonstrates why AI is a great tool for financial analysts. Financial data is often complex, messy, and fast-moving. Traditional data pipelines must be continuously redeveloped to adapt to new data formats and changing information demands.
AI is uniquely suited to the challenges of the financial industry. Not only does it excel at analyzing unstructured data such as filings and text data, but it can adapt to changing conditions. An AI-powered financial analysis application will likely be more flexible, resilient, and adaptive to changing economic conditions than one that relies solely on traditional methods.
What Do We Do Next?
I hope today’s article has inspired you to think about your own use cases and explore the world of AI. As a former quant at a top hedge fund and now the Director of AI at Talkoot, I’m always excited to think about new and challenging problems in the field of AI. If you have a problem or dataset that you’d like to tackle but don’t know where to start, feel free to reach out on LinkedIn or GitHub!