Using the OpenAI API and LangChain to assist exploratory data analysis

Chayan Shrang Raj
9 min read · Jan 22, 2024


By now, we are sufficiently aware of the capabilities offered by Generative AI and its use cases across various industries. It represents a paradigm shift in how AI can assist and augment our decision-making processes and create more efficient workflows.

As a data scientist, my first reaction was to leverage this technology to assist in my data analysis journey, for example, to generate data analysis ideas, create dataset summaries for specific columns, generate code to create plots, etc.

I have been reading a lot about LLM fundamentals: how they are trained, how they work, and what their applications are. This blog aims to leverage GPT functionality to assist a data scientist's workflow. Additionally, I wanted to gain a deeper understanding of LLMs and build applications with them.


Let’s connect — LinkedIn
Let’s build something — GitHub

In this article, we’ll cover:

  • Set up an OpenAI developer account and integrate it with a Python environment.
  • Utilize the chat functionality in the OpenAI API, with and without LangChain.
  • Perform prompt engineering.
  • Build longer conversations with GPT.
  • Ideas for incorporating GPT into a data analysis or data science workflow.

I am using a Jupyter Notebook to execute all the code in this blog; please feel free to use the IDE of your preference.

All the files for this blog are stored in:
https://github.com/chayansraj/GPT-and-LangChain-for-Data-analysis

We will explore an electric vehicle dataset, which shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) currently registered through the Washington State Department of Licensing (DOL). Link: https://catalog.data.gov/dataset/electric-vehicle-population-data/resource/fa51be35-691f-45d2-9f3e-535877965e69

What are the OpenAI API and LangChain?

OpenAI API: OpenAI has released an API for accessing the new AI models it develops. Unlike most AI systems, which are designed for a single use case, this API provides a general-purpose "text in, text out" interface, allowing users to try it on virtually any English-language task.

LangChain: LangChain is a framework for developing applications powered by language models. It connects a language model to sources of context (prompt instructions, content to ground its response in, etc.) and relies on the language model to reason (about how to answer based on the provided context, what actions to take, etc.).

Set up an OpenAI developer account and integrate it with a Python environment.

  1. Go to the API signup page.
  2. Create your account (you’ll need to provide your email address and your phone number).
  3. Go to the API keys page.
  4. Create a new secret key.
  5. Take a copy of it. (If you lose it, delete the key and create a new one.)
OpenAI API keys page (Image credit: Author)

6. Set this key as an environment variable (the code below reads it as OPENAI_KEY), since we will use it while working with the chat functionality.

# Import the os package
import os

# Import the openai package
import openai

# Set openai.api_key to the OPENAI_KEY environment variable
openai.api_key = os.environ['OPENAI_KEY']
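Note: the snippet above targets the pre-1.0 openai package (for example, openai==0.28), where openai.ChatCompletion is still available; openai 1.0 and later moved to a client-based interface. Also, if you have not exported OPENAI_KEY in your shell, one simple way to set it for the current notebook session is the convenience sketch below (not part of the original workflow):

# Prompt for the key once, so it never gets hard-coded into the notebook
from getpass import getpass

if 'OPENAI_KEY' not in os.environ:
    os.environ['OPENAI_KEY'] = getpass('Paste your OpenAI API key: ')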

Utilizing the chat functionality in the OpenAI API, with and without LangChain.

# Install the langchain package
!pip install langchain==0.0.300

# Update the typing_extensions package
!pip install typing_extensions==4.8.0

# Import the langchain package as lc
import langchain as lc

# From the langchain.chat_models module, import ChatOpenAI
from langchain.chat_models import ChatOpenAI

# From the langchain.schema module, import AIMessage, HumanMessage, SystemMessage
from langchain.schema import AIMessage, HumanMessage, SystemMessage

Import the Electric Cars Dataset: https://catalog.data.gov/dataset/electric-vehicle-population-data/resource/fa51be35-691f-45d2-9f3e-535877965e69

It can be pre-processed; the pre-processing notebook has been added to the GitHub repo, and a minimal sketch of the idea is shown after the column list below.

Each row in the dataset contains the count of cars registered within a city for a particular model.

The dataset contains the following columns.

  • city (object): The city in which the registered owner resides.
  • county (object): The county in which the registered owner resides.
  • model_year (integer): The model year of the car.
  • make (object): The manufacturer of the car.
  • model (object): The model of the car.
  • electric_vehicle_type (object): "Plug-in Hybrid Electric Vehicle (PHEV)" or "Battery Electric Vehicle (BEV)".
  • n_cars (integer): The count of the number of vehicles registered.
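For reference, here is a minimal pre-processing sketch. It assumes the raw CSV from data.gov uses column names such as "City", "County", "Model Year", "Make", "Model", and "Electric Vehicle Type" (the exact column names and the file name below are assumptions; check your download against the notebook in the repo):

import pandas as pd

# Load the raw CSV downloaded from data.gov (file name is assumed)
raw = pd.read_csv('Electric_Vehicle_Population_Data.csv')

# Rename the raw columns (assumed names) to the ones used in this blog
raw = raw.rename(columns={
    'City': 'city',
    'County': 'county',
    'Model Year': 'model_year',
    'Make': 'make',
    'Model': 'model',
    'Electric Vehicle Type': 'electric_vehicle_type',
})

# Aggregate to one row per city/county/model combination, counting cars
electric_cars = (
    raw.groupby(['city', 'county', 'model_year', 'make', 'model',
                 'electric_vehicle_type'])
       .size()
       .reset_index(name='n_cars')
)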

Asking GPT questions

Initially, we establish a connection between GPT and humans using messages, as described in the Chat Completions API for text generation models, which have been trained to understand natural language, code, and images. Typically, a conversation is initiated with a system message, followed by alternating user and assistant messages.

One thing to understand is that foundation models are trained on terabytes of internet data, and unless we consciously steer GPT toward the subject at hand, the output will not be of high quality. The system message helps set the behavior of the assistant.

Note, however, that the system message is optional, and the model's behavior without a system message is likely to be similar to using a generic message such as "You are a helpful assistant."

The system and user messages could be defined as follows:

# Define the system message. Assign to system_msg_test.
system_msg_test = "You are a helpful assistant who has a good understanding of data science. Your way of writing is very clear and simple. You keep your answers brief."

# Define the user message. Assign to user_msg_test.
user_msg_test = "Tell me some uses of GPT for data analysis."

# Create a message list from the system and user messages. Assign to msgs_test.
msgs_test = [
    {"role": "system", "content": system_msg_test},
    {"role": "user", "content": user_msg_test},
]

# Send the messages to GPT. Assign to rsps_test.
rsps_test = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=msgs_test,
)
# print(rsps_test)

The response takes about 10 seconds and arrives in a JSON format as follows:

<OpenAIObject chat.completion id=chatcmpl-8igis8VqYX9ydhcIfCnpJIHBWPE0J at 0x7fbb3d0d2f40> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "GPT (which stands for Generative Pre-trained Transformer) can be helpful for data analysis in a few different ways. Firstly, it can help with data cleaning and pre-processing, which means making sure the data is in a good condition for analysis. GPT can also be used for data visualization, which means creating pictures or graphs to help understand the data better. Lastly, GPT can be used for data prediction, which means making guesses about what might happen based on the data that has already been collected. Overall, GPT is quite useful for analyzing and understanding data.",
        "role": "assistant"
      }
    }
  ],
  "created": 1705661350,
  "id": "chatcmpl-8igis8VqYX9ydhcIfCnpJIHBWPE0J",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 115,
    "prompt_tokens": 52,
    "total_tokens": 167
  }
}

Next, asking GPT a question about the dataset

Now that we have received a response from the API, we can start to give GPT some information about our dataset in the prompt and ask questions about it. As witnessed above, the response from the raw API is a bit unwieldy, and it becomes an overhead to extract the core content. This is where LangChain comes into the picture: it acts as an abstraction layer and serves the response in a more intuitive way.
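To see that overhead concretely, this is how you would dig the assistant's reply out of the nested response object with the raw openai package (continuing from rsps_test above):

# Extract just the assistant's reply from the nested JSON structure
answer = rsps_test["choices"][0]["message"]["content"]
print(answer)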

Perform Prompt Engineering

One of the advantages of the langchain package is that it simplifies the code for some tasks, letting you streamline response handling and skip the drudgery caused by provider-specific syntax.

Secondly, it promotes easy experimentation: if we would like to swap GPT for a different model at a later date, it is easier to do so using the langchain package than using the openai package directly.

The LangChain message types are named slightly differently from the OpenAI message types.

  • SystemMessage is the equivalent of OpenAI's system message.
  • HumanMessage is the equivalent of OpenAI's user message.
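As a quick illustration, the raw-API messages from msgs_test earlier translate one-to-one into LangChain objects:

# The same system/user pair as msgs_test, as LangChain message objects
msgs_test_lc = [
    SystemMessage(content=system_msg_test),
    HumanMessage(content=user_msg_test),
]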

Prompt Setup:

Designing a prompt that contains the dataset details:

  • Describe the dataset, its purpose, and its columns. Assign to dataset_description.
  • Create a task for the AI. Assign to suggest_questions. Use the text "Suggest some questions about this dataset for performing data analysis and creating insights.".
  • Concatenate the dataset description and the request. Assign to msgs_suggest_questions. The first message is a system message with the content "You are a data analysis expert.". The second message is a human message with dataset_description and suggest_questions concatenated with two line breaks in between.
# A description of the dataset
dataset_description = """
You have a dataset about electric cars registered in Washington state, USA in
2020. It is available as a pandas DataFrame named `electric_cars`.

Each row in the dataset represents the count of the number of cars
registered within a city, for a particular model.

The dataset contains the following columns.

- `city` (character): The city in which the registered owner resides.
- `county` (character): The county in which the registered owner resides.
- `model_year` (integer): The [model year](https://en.wikipedia.org/wiki/Model_year#United_States_and_Canada) of the car.
- `make` (character): The manufacturer of the car.
- `model` (character): The model of the car.
- `electric_vehicle_type` (character): Either "Plug-in Hybrid Electric Vehicle (PHEV)" or "Battery Electric Vehicle (BEV)".
- `n_cars` (integer): The count of the number of vehicles registered.
"""

# Create a task for the AI. Assign to suggest_questions.
suggest_questions = (
    "Suggest some questions about this dataset for "
    "performing data analysis and creating insights."
)

# Concatenate the dataset description and the request.
# Assign to msgs_suggest_questions.
msgs_suggest_questions = [
    SystemMessage(content="You are a data analysis expert."),
    HumanMessage(content=f"{dataset_description}\n\n{suggest_questions}"),
]

Next, create a ChatOpenAI object and assign it to chat, passing in your OpenAI API key. Then pass msgs_suggest_questions into chat and print the response.

# Create a ChatOpenAI object. Assign to chat.
chat = ChatOpenAI(openai_api_key=os.environ['OPENAI_KEY'])

# Pass your message to GPT. Assign to rsps_suggest_questions.
rsps_suggest_questions = chat(msgs_suggest_questions)

# Print the text content of the response.
print(rsps_suggest_questions.content)

The response is actually a set of questions that could be used to analyse this dataset further.

1. What is the total number of electric cars registered in Washington state in 2020?
2. Which city had the highest number of electric cars registered in 2020?
3. What was the most popular electric car model in Washington state in 2020?
4. Which county had the highest number of battery electric vehicles (BEVs) registered in 2020?
5. How many plug-in hybrid electric vehicles (PHEVs) were registered in each county in 2020?
6. How does the distribution of electric car models differ between different cities in Washington state?
7. What is the average number of electric cars registered per city in 2020?
8. Is there a correlation between the model year of the car and the number of registered cars for each model?
9. How does the distribution of electric vehicle types (BEVs vs PHEVs) vary across different counties in Washington state?
10. Are there any outliers in the number of registered electric cars for specific models?

Build longer conversations with GPT.

While a single prompt and response can be useful in some settings, often we want to have a longer conversation with GPT to dive deeper into the context. In this scenario, you can pass previous messages, creating a chain, so that GPT can “remember” what was said before.

Now, based on the set of questions provided by GPT, it would be fun to implement some of the code in Python. Let's check it out.

Append a new prompt to the conversation and chat with GPT again.

  • Append the response and a new message to the previous messages. Assign to msgs_python_code.
  • Pass your message to GPT. Assign to rsps_python_code.
# Append the response and a new message to the previous messages.
# Assign to msgs_python_code.
msgs_python_code = msgs_suggest_questions + [
    rsps_suggest_questions,
    HumanMessage(content=(
        "Write code to find out if there is a correlation between the "
        "model year and the count of electric cars registered in "
        "Washington state in 2020"
    )),
]

# Pass your message to GPT. Assign to rsps_python_code.
rsps_python_code = chat(msgs_python_code)

# Display the response's Markdown content
from IPython.display import display, Markdown

display(Markdown(rsps_python_code.content))

The output from GPT turns out to be this:

# To find out if there is a correlation between the model year and the 
# count of electric cars registered in Washington state in 2020, you can
# use the corr() function in pandas.

# Here's the code to calculate the correlation coefficient:

import pandas as pd

# Assuming the dataset is already loaded into the 'electric_cars' DataFrame

# Convert the 'model_year' column to numeric data type
electric_cars['model_year'] = pd.to_numeric(electric_cars['model_year'])

# Calculate the correlation coefficient
correlation = electric_cars['model_year'].corr(electric_cars['n_cars'])
print('Correlation coefficient:', correlation)

"""
The corr() function computes the correlation between two series, in this
case, the 'model_year' column and the 'n_cars' column from the 'electric_cars'
DataFrame. The correlation coefficient ranges from -1 to 1, where 0 indicates
no correlation, -1 indicates a perfect negative correlation, and 1 indicates
a perfect positive correlation.

The result will be printed as the correlation coefficient between the model
year and the count of electric cars registered in Washington state in 2020.

"""

Correlation coefficient: 0.039856022914093334

The coefficient is close to zero, suggesting essentially no linear relationship between model year and registration counts. Now, in software development, we want our code to be modular, scalable, and reusable. How do we achieve this when it comes to LLM prompts?

This is where prompt templates come into the picture! They enable dynamic prompts, with built-in verification that all inputs have been supplied (which eases the load on testing). They can easily be saved, versioned, and integrated into the code base.

It is considered a best practice to build and use prompt templates wherever repetitive prompts or commands are sent to a language model. They reduce overhead, introduce a modular approach, and reduce human error in prompt engineering. A minimal sketch is shown below.
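As a minimal sketch using langchain's PromptTemplate (the variable names here are illustrative), the dataset prompt from earlier could be rebuilt as a reusable template:

from langchain.prompts import PromptTemplate

# A reusable template; format() fails if a declared variable is missing
analysis_prompt = PromptTemplate(
    input_variables=["dataset_description", "task"],
    template="{dataset_description}\n\n{task}",
)

# Re-create the earlier prompt text from the template
prompt_text = analysis_prompt.format(
    dataset_description=dataset_description,
    task=suggest_questions,
)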

The fun doesn't end here, since there are many techniques that fall under prompt engineering, and covering all of them deserves a separate article. In the next article, we shall dive deeper into prompt engineering for GPT.

Thanks for checking in and reading!

