Lately, I am a bit obsessed with OpenAI and ChatGPT.
One feature that really excited me went comparatively under the radar: Advanced Data Analysis. The name hides the true nature of the feature. The previous version was called Code Interpreter, and it generated quite the hype among the people who got to test it.
Advanced Data Analysis allows ChatGPT to work with Python code. It can write and execute the code in a sandbox environment. This is groundbreaking for multiple reasons.
ChatGPT can execute and test your code
The most obvious use case is to connect it with actual coding.
If you have ever asked ChatGPT to help you with a piece of code, you know how frustrating it is when it returns a chunk of code that does not work because of a syntax error, a call to a non-existent function or simply wrong logic.
With the ability to execute code, you can make sure that ChatGPT verifies its own response before spitting it out. For now, it is only capable of running Python. We are yet to see if OpenAI will add support for other languages.
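To make the idea concrete, here is a minimal sketch of what "verify before answering" can look like. It is only an illustration, not a description of how ChatGPT works internally: the `check_candidate` helper, the two sample snippets and the test case are all made up for this example.

```python
def check_candidate(source, func_name, cases):
    """Compile, run and test a code snippet; return (ok, message).

    Hypothetical helper, for illustration only.
    """
    try:
        # compile() catches syntax errors without running anything
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return False, f"syntax error: {err.msg}"

    namespace = {}
    try:
        exec(code, namespace)  # defines the function inside `namespace`
        func = namespace[func_name]
        for args, expected in cases:
            result = func(*args)
            if result != expected:
                return False, f"{func_name}{args} returned {result!r}, expected {expected!r}"
    except Exception as err:  # e.g. a call to a non-existent function
        return False, f"{type(err).__name__}: {err}"
    return True, "all checks passed"


# Two made-up answers: the first one calls a function that does not exist.
broken = "def average(values):\n    return sum(values) / count(values)\n"
fixed = "def average(values):\n    return sum(values) / len(values)\n"
cases = [(([2, 4, 6],), 4.0)]

print(check_candidate(broken, "average", cases))
print(check_candidate(fixed, "average", cases))
```

Keep in mind that `exec` runs whatever it is given, so a check like this only makes sense inside an isolated environment such as a sandbox.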
It is not magic. With complex frameworks and large systems you will still have to engineer everything yourself. However, this is a great step toward making ChatGPT more useful for coding.
While this feature is amazing, it is only relevant to developers. There was a reason why the feature was renamed to Advanced Data Analysis.
Advanced Data Analysis with ChatGPT
The new name really describes what it can do. ChatGPT is now able to read files, analyze them and provide visualizations, files and valuable analysis. And I am not talking about visualizations using DALL-E. Actual graphs, spreadsheets, presentations.
Let's take a look at how to use ChatGPT for data analysis.
To demonstrate the power of Advanced Data Analysis, I will take some data from Free Data Sets. The first thing that caught my eye was the report on stolen vehicles; there is a lot to visualize there.
The first thing you will notice is that you can upload multiple files and ChatGPT will understand how they relate to each other. At the moment you can upload up to 750MB in total (a single file can be at most ~25MB).
First, ChatGPT will read in the files. You can even see what Python it is executing to do that:
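As a rough, hand-written illustration of that loading step (not the exact code ChatGPT produced), the sketch below reads two small CSV files and links them through a shared column. The file contents and the column names `vehicle_id`, `make_id` and `make_name` are assumptions made up for the example. In the sandbox ChatGPT tends to reach for pandas for this kind of work; this version sticks to the standard library so it runs anywhere.

```python
import csv
import io
from collections import Counter

# Made-up stand-ins for two uploaded files; the column names are assumptions.
VEHICLES_CSV = """vehicle_id,make_id,color
1,10,Silver
2,11,White
3,10,Black
"""
MAKES_CSV = """make_id,make_name
10,Toyota
11,Ford
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


vehicles = read_rows(VEHICLES_CSV)
makes = {row["make_id"]: row["make_name"] for row in read_rows(MAKES_CSV)}

# Relate the two files through the shared make_id column, then count per make.
thefts_per_make = Counter(makes.get(v["make_id"], "Unknown") for v in vehicles)
print(thefts_per_make.most_common())
```

With real uploads you would pass an open file to `csv.DictReader` instead of the inline strings.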
Next, it will describe what it finds in the data and explain how it will draw the graph, eventually ending up with something like this:
You can also click on the [>_] icon to view the analysis, which once again shows the Python code needed to create the chart:
If we take a look at the graph, it looks almost good, but not great. Some of the labels are too close to each other. But hey, given that it was generated in a few seconds, this is not bad.
We should not stop here. Let's improve it by asking ChatGPT to change the visualization to something a bit more specific. Let's try making a pie chart with a colored legend.
It was correct on the first go. If you ask me, this is pretty impressive already. But before we move on to the next example, I want to demonstrate that you can also download this file as an image or in any other format.
And you don't need to stop here. You can ask it to include this file in a presentation, and even generate a whole presentation around this type of data.
Analyzing Location Data
Moving on to different types of visualizations: we can also draw maps using location data. For this, I will take a different set of data, which contains the latitude and longitude of certain places. I decided to use a sample of motor vehicle collisions reported by the New York City Police Department from January 2021 to April 2023.
Initially I got the map of traffic collisions like this:
In general it does what I asked, but there is no map in the background for context. I tinkered with it a bit more and was able to generate some interesting heatmaps. It was not perfect right away and it did take me some time to find the right prompts. In the end I got ChatGPT to generate an HTML map that I can embed anywhere on the web like this.
If you ask me, it looks pretty damn good for the little effort that was needed.
I did not want to ruin my page speed, so I generated 2 versions. The embedded version you see above is generated from only 1,000 rows of data, so the heatmap is a bit plain. To test the limits of processing, I also generated the full version with 240k rows, which returned a 4.5MB HTML file. You can see it here: https://agilemerchants.com/content/images/2023/11/nyc_full_collisions_heatmap.html
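One way to cut a large file down to a lightweight version is to drop rows without usable coordinates and then take a random sample. The sketch below shows that idea under a few assumptions: the column names `latitude` and `longitude`, the rule that a (0, 0) pair is a placeholder, and random sampling itself are illustrative choices, not necessarily how the 1,000-row version above was produced.

```python
import random


def sample_points(rows, limit=1000, seed=42):
    """Keep rows with usable coordinates, then draw at most `limit` of them."""
    points = []
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed coordinates
        if lat == 0 and lon == 0:
            continue  # assumed to be a placeholder, not a real location
        points.append((lat, lon))
    if len(points) <= limit:
        return points
    # A fixed seed makes the sample reproducible between runs.
    return random.Random(seed).sample(points, limit)


# Made-up rows for illustration.
rows = [
    {"latitude": "40.7128", "longitude": "-74.0060"},
    {"latitude": "", "longitude": ""},
    {"latitude": "40.6782", "longitude": "-73.9442"},
]
print(sample_points(rows, limit=2))
```

The resulting list of (latitude, longitude) pairs is the kind of input a heatmap layer typically expects.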
Time Series Analysis
With Advanced Data Analysis you can perform trend analysis, seasonality detection and forecasting. To experiment with this, we will use the same NYC collisions data we used in the visualization examples.
After uploading the files, simply ask ChatGPT to perform trend analysis or seasonality detection. In my case, I decided to upload a few months of data and ask for a trend analysis based on the day of the week. To me, it is interesting to see whether more crashes happen on certain days, like Mondays. Or is it just completely random?
From this data, it appears that Wednesday has the highest number of collisions, followed by Thursday and Friday. The weekend and the beginning of the week tend to have fewer collisions compared to the middle of the week. This is interesting and not something I would have guessed. Keep in mind that I only fed in a few months of data, so I decided to test with a larger data set to see if I got the same results. So I fed in 2 years of data from 2021 to 2023.
As you can see, the results are different, but these make more sense to me. On Fridays everyone is rushing to get somewhere, so it seems reasonable that there would be more crashes. Either way, the tool handled my medium-sized dataset (32MB) admirably.
Similarly, you could analyze the time of day, the season of the year or any other time-based trend. It does a pretty good job of cleaning and preparing the data for this analysis.
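Under the hood, the day-of-week breakdown described above boils down to parsing each date and counting per weekday. Here is a minimal sketch of that step. The `MM/DD/YYYY` date format and the sample dates are assumptions for illustration; adjust `fmt` to whatever your file actually uses.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def collisions_by_weekday(dates, fmt="%m/%d/%Y"):
    """Count date strings per weekday, skipping values that do not parse."""
    counts = Counter()
    for text in dates:
        try:
            day = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # unparseable date, leave it out of the counts
        counts[WEEKDAYS[day.weekday()]] += 1
    # Return the days in calendar order so the result is easy to chart.
    return {name: counts[name] for name in WEEKDAYS}


# Made-up sample values, including one bad entry.
sample = ["01/01/2021", "01/08/2021", "01/04/2021", "not a date", "01/06/2021"]
print(collisions_by_weekday(sample))
```

Swapping `day.weekday()` for `day.hour` or `day.month` (with a matching format string) gives the time-of-day or seasonal view mentioned above.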
Data cleaning, transformation and aggregation
In all of my previous examples my datasets were fed in as is. ChatGPT decided on multiple occasions to clean up the data and prepare it in a way that suits the analysis. It does well understanding the existing structure and preparing the data for analysis.
You can also request this cleanup explicitly. If you have some really dirty data that you need to use, you can ask ChatGPT to go through it, restructure it, convert field types and prepare it any way you wish.
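To give a feel for what such a cleanup involves, here is a small sketch that trims whitespace, normalizes dates written in different formats and converts numeric strings. The field names `city`, `date` and `count`, the list of date formats and the sample rows are all made up for this example.

```python
from datetime import datetime

# Date formats we are willing to accept (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(text):
    """Try each known format and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_number(text):
    """Turn strings like ' 1,250 ' into an int; return None if not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None


def clean_row(row):
    """Return a tidy copy of one record with consistent types."""
    return {
        "city": row["city"].strip().title(),
        "date": parse_date(row["date"]),
        "count": parse_number(row["count"]),
    }


dirty = [
    {"city": "  new york ", "date": "03/15/2022", "count": "1,250"},
    {"city": "BOSTON", "date": "2022-03-16", "count": "n/a"},
]
for row in dirty:
    print(clean_row(row))
```

Values that cannot be parsed become `None` rather than being guessed, which keeps the problem rows easy to find afterwards.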
The main limitation here is the size of the files. You can't really process huge files. I was able to process a 32MB file which contained ~240,000 rows of data. Most actions worked, but some of the longer ones exceeded the 60-second processing limit and failed.
So for large datasets this might not work, yet. I am sure that these limits will increase and we will have a huge amount of processing power available to us. I write about the current limits at the end of this article.
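When a task runs close to a time limit, one general technique is to stream the file row by row and stop cleanly once a time budget is used up, instead of loading everything at once. The sketch below shows that pattern. The 50-second default is an assumed safety margin below the 60-second limit, the `borough` column is made up, and whether this is enough to stay within the sandbox limits is untested here.

```python
import csv
import io
import time
from collections import Counter


def count_by_column(lines, column, budget_seconds=50.0):
    """Stream rows and count the values in `column` until the budget runs out.

    Returns (counts, rows_read, finished).
    """
    deadline = time.monotonic() + budget_seconds
    counts = Counter()
    rows_read = 0
    for row in csv.DictReader(lines):
        if time.monotonic() > deadline:
            return counts, rows_read, False  # partial result, out of time
        counts[row[column]] += 1
        rows_read += 1
    return counts, rows_read, True


# Made-up sample data; with a real upload, pass an open file instead.
sample = io.StringIO("borough,injured\nBrooklyn,0\nQueens,1\nBrooklyn,2\n")
counts, rows_read, finished = count_by_column(sample, "borough")
print(counts, rows_read, finished)
```

If `finished` comes back as `False`, the partial counts and `rows_read` tell you how far the run got, so the file can be split at that point and the rest processed in a second step.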
Generating QR Codes
Yes, you can use Advanced Data Analysis to generate a QR code. Simply ask:
Here is the code that it generated, and yes, it works. You can try it below. This is actually mind-blowing: there are services that used to live off such features.
Other things you can do with Advanced Data Analysis
There are some other things worth mentioning:
- Statistical Analysis: You can conduct statistical tests and analyses, including hypothesis testing, correlation analysis, regression analysis, and more complex statistical modeling.
- Machine Learning: You can build and evaluate machine learning models. This includes tasks like data preparation, feature engineering, model selection, training, evaluation, and prediction. I believe this would deserve a separate article.
- Text Analysis: For text data, you can perform tasks like text preprocessing, sentiment analysis, topic modeling, and other natural language processing (NLP) techniques. This goes beyond what default ChatGPT can do, because you can try using other Python libraries like NLTK, spaCy, CoreNLP, etc.
- I am sure there are many more things that neither I nor anyone else has thought of yet.
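As a small taste of the correlation analysis mentioned in the list above, here is a hand-rolled Pearson correlation coefficient. It is a sketch for illustration only: the numbers are made up, and in practice you would ask ChatGPT to use a statistics library rather than write this by hand.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Made-up values, chosen so that one series rises and the other falls.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]
print(pearson(hours, rising), pearson(hours, falling))
```

A value near 1 means the two series move together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.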
How to enable Advanced Data Analysis
Whether you are interested in how to access Code Interpreter in ChatGPT, or you want to try Advanced Data Analysis, you will need to head to the same place.
Currently, Beta features are available only to ChatGPT Plus users, and they are not enabled by default. You have to enable them in your profile settings. Here is how to enable Advanced Data Analysis:
- Click on your Profile
- Click on Settings & Beta
- Find Beta features
- Enable Advanced data analysis
If you do not see this feature in the Beta features configuration, it may not have been rolled out to your account or region yet. OpenAI is very careful with mass deployments: features are initially available only to certain users, and the more stable they get, the more users can access them.
Limitations of Advanced Data Analysis
Code Interpreter using libraries it does not have
Sometimes Code Interpreter decides to use Python libraries that it does not have access to. This is not a deal breaker, because it tries again with another package or solution and usually gets it right on the next go.
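The "try one package, fall back to another" behavior described above is a common pattern you can also write yourself. Here is a minimal sketch: the `first_available` helper is hypothetical, and `not_a_real_package_xyz` is a deliberately fake name standing in for a library that is not installed.

```python
import importlib


def first_available(module_names):
    """Import and return the first module in the list that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed here, try the next candidate
    raise ImportError(f"none of these modules are available: {module_names}")


# The first name is fake on purpose; json ships with Python.
module = first_available(["not_a_real_package_xyz", "json"])
print(module.__name__)
```

Listing the preferred library first and a standard-library fallback last means the code still runs when the preferred one is missing.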
Advanced Data Analysis Processing Power Limitations
The sandbox environment has limitations in terms of processing power, execution time (60 seconds limit per execution), and lack of internet access. You have to consider these when attempting to process large datasets or do complex operations with the data.
So far, I was able to hit only the processing time limitation - it was when trying to do analysis on a 30MB file.
Based on the information I got by asking ChatGPT itself, these are the limits at the time of writing (November 2023):
- Upload Size Limit: The maximum size of a single file you can upload is ~25MB. In practice this did not hold: I was able to upload a file of ~32MB, so I guess the limit may depend on other factors as well.
- Storage Capacity: In a single conversation, you can upload multiple files, but there is a cumulative limit of 750 MB for all uploads. This means you can upload several files as long as their total size does not exceed 750 MB. Each individual file must still adhere to the maximum file size limit of 25 MB.
- Output File Size: The largest file size that can be returned in the environment is approximately 25 MB. This size limit applies to any file type, including images, data files (like CSV, Excel), and any other documents.
- Processing Power: There is also a memory limit, although the exact amount is not specified. If your data processing or analysis task requires a large amount of memory, you might encounter issues.
- Execution Time Limit: Each Python execution or code cell run has a time limit of 60 seconds. If a task takes longer than this to complete, it will be terminated.
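If you want a quick sanity check before uploading, the size limits from the list above can be turned into a few lines of code. This is only a sketch using the numbers quoted in this article; the `check_uploads` helper and the file names are made up, and since a ~32MB upload did work for me, treat the warnings as hints rather than hard rules.

```python
MAX_FILE_MB = 25     # approximate per-file limit quoted above
MAX_TOTAL_MB = 750   # cumulative per-conversation limit quoted above


def check_uploads(sizes_mb):
    """Return a list of warnings for a {filename: size_in_mb} mapping."""
    warnings = []
    for name, size in sizes_mb.items():
        if size > MAX_FILE_MB:
            warnings.append(
                f"{name}: {size} MB is over the ~{MAX_FILE_MB} MB per-file limit"
            )
    total = sum(sizes_mb.values())
    if total > MAX_TOTAL_MB:
        warnings.append(
            f"total {total} MB is over the {MAX_TOTAL_MB} MB conversation limit"
        )
    return warnings


# Hypothetical file names and sizes.
for message in check_uploads({"collisions.csv": 32, "stolen_vehicles.csv": 4}):
    print(message)
```

To check real files, build the mapping from `os.path.getsize(path) / (1024 * 1024)` for each file you plan to upload.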
What have you used Advanced Data Analysis for? Care to share any interesting features?