My First Foray into ChatGPT for Data and Analytics
Recently, the OI Metrics committee hosted a webinar on Using ChatGPT for Email Data Analysis. I must admit I was skeptical when we started. Until then, my exposure had been limited to subject line tools built on ChatGPT, which I consistently beat for one of my clients. I also doubted it would be on par with the built-in data prep tools in programs I already use, such as IBM Statistics. But by the time the webinar aired, I had used ChatGPT enough to form a judgment, and I had become a believer. Every time I investigated the data behind a wonky answer, I found the problem was in the original data, not the AI. As with any other software, the results are only as good as the data you put in: broken data leads to broken results.
When I approached using ChatGPT for the webinar, I had a specific use in mind: data prep for email analysis. This struck me as an area where ChatGPT had the potential to save me real time. Anyone who does data analysis knows that most of the time goes into the prep work, not the analysis. This is why you often see data prep features touted as points of differentiation among analysis software vendors. All of them can load a file and start running procedures within a few seconds to a few minutes, depending on the number of cases in the file. But if your data is in the wrong format or doesn’t meet the assumptions of the analysis, you get wrong answers.
The hallmark of a good analyst is knowing how to prep the data to make it usable, reliable, and valid. I don’t see AI replacing analysts; instead, it saves them time for better or deeper projects. Whether I am typing in a prompt or running a procedure, I still need to know the right questions to ask, the right techniques to apply to the data, and whether my data meets the requirements.
Let me dive into some of the tips I learned using ChatGPT. The first thing I noticed is that if you like free/open-source tools, then you will love working with the free version of ChatGPT. The first time I used it I didn’t have a paid version and realized I couldn’t upload files or data into it. ChatGPT told me it could not support that, and instead it returned the Python code to do the data checks and transformations locally. Interestingly, it did not mention that the paid version of ChatGPT does allow uploads.
This is a gold mine for any data geeks who like working with tools like Python’s pandas library or similar, as it writes the code for you, complete with correct syntax. No more hours spent trying to find the one place that’s missing a semicolon or has a stray comma!
The paid version of ChatGPT is only $20 a month, which, considering the time savings, is a bargain. The paid plan does allow you to load in data, and it was still quick with results for datasets over 100,000 rows. Personally, I ran one file of 125K rows and loaded another of over 300K without major delays, and was ready to start working in under 3 minutes.
Next, I started running the checks you would do for most projects: seeing which variables we have to work with, the coverage of the data, whether the formats are correct, and so on. Just like in my other tools, I had it run a frequency table along with the min/max of the values to check that the results made sense. There’s no way a human can visually check 100,000+ rows to find random missing values, or to spot a column that shifted during export or read-in. For example, I can find the 4 rows missing data for a variable out of 40,000, but it will take me a few minutes; for ChatGPT it takes a few seconds. Not only could the AI do that for me, but it was also good at finding anomalies that would normally take a while to notice, and then to locate and correct in the data.
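For anyone running these checks locally, the code ChatGPT hands back tends to look something like the sketch below. This is only an illustration in pandas: the sample data and the column names (email, source, opens) are made up for the example, not taken from the webinar files.

```python
import pandas as pd

# Hypothetical sample standing in for an email export; the column names are assumptions.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "source": ["web", "web", None, "event"],
    "opens": [3, 0, 12, 7],
})

# Which variables do we have, and how was each one read in?
print(df.dtypes)

# Coverage: number of missing values in each column.
missing_per_column = df.isna().sum()

# Frequency table for a categorical variable, keeping missing values visible.
source_counts = df["source"].value_counts(dropna=False)

# Min/max of a numeric variable to confirm the values make sense.
opens_range = df["opens"].agg(["min", "max"])

# Pull the actual rows with missing data so they can be inspected or corrected.
rows_missing_source = df[df["source"].isna()]
print(rows_missing_source)
```

On a real export you would swap the sample frame for pd.read_csv on your own file; the checks themselves stay the same.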
Time-based variables are some of the most useful for analyzing email data. But working with date-based data is often not easy and can be confusing. For example, think about your own segmentation rules – we don’t write them as a formula, rather we say things like “clicked in the past 180 days”. But when building time-based variables, you are writing that formula and it’s easy to make a mistake. ChatGPT was good at doing all the normal transformations I would do such as calculating lifespan by determining the number of days since opt-in to present for every subscriber. If we stopped there, that would be easy, but the reality is that’s not too useful. Instead, we need to roll that up into cohorts based on weeks or months and then graph them out to see trends. ChatGPT was helpful with doing the aggregation and then marking the people into groups for further analysis.
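To make the formula side of this concrete, here is a minimal pandas sketch of a lifespan calculation, a “clicked in the past 180 days” flag, and a monthly cohort roll-up. The column names, the dates, and the fixed as_of date are all assumptions chosen so the example is reproducible; in practice you would likely use today’s date.

```python
import pandas as pd

# Hypothetical subscriber data; column names and dates are invented for illustration.
subscribers = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "opt_in_date": ["2023-01-15", "2023-01-28", "2023-03-02"],
    "last_click_date": ["2023-06-20", "2022-11-05", "2023-02-01"],
})
for col in ("opt_in_date", "last_click_date"):
    subscribers[col] = pd.to_datetime(subscribers[col])

# Fixed reference date (an assumption) so the numbers do not change from day to day.
as_of = pd.Timestamp("2023-07-01")

# Lifespan: days from opt-in to the reference date.
subscribers["lifespan_days"] = (as_of - subscribers["opt_in_date"]).dt.days

# The plain-language rule "clicked in the past 180 days", written as a formula.
days_since_click = (as_of - subscribers["last_click_date"]).dt.days
subscribers["clicked_last_180"] = days_since_click <= 180

# Roll subscribers up into monthly opt-in cohorts and count each cohort.
subscribers["cohort_month"] = subscribers["opt_in_date"].dt.to_period("M")
cohort_sizes = subscribers.groupby("cohort_month").size()
print(cohort_sizes)
```

Switching to_period("M") to to_period("W") gives weekly cohorts instead, and cohort_sizes.plot() is one way to graph the trend.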
As I mentioned in the webinar, on rare occasions I would see wonky results for a specific person, such as ‘Weeks Since Last Action’ being longer than lifespan, or a negative number. Sometimes an error like this was due to a misread time stamp or a transposition. (If you’re using CSV files, I found there was the occasional column transposition due to a comma in the value in another column.) ChatGPT was able to find and fix these, but it went beyond that. I could also tell it that the dates in a variable are written one way and ask it to convert them into another format for uniformity. IBM Statistics, for example, only recognizes a few date formats as dates. If the dates are not in one of those formats, then you must fix that in the file and re-upload, or write your own syntax to do it within the program. Either way, it’s not as easy as simply writing a single prompt to find and change them.
One of the issues with many software packages, or with moving data into your ESP, is that they either don’t allow joins or don’t work well with joined datasets. You need to upload a flat file for most tools. In fact (and this may be just me), I often switch between programs: one appends variables from another file, and then I export the result back to the other program to create a flat file to work with. ChatGPT gives me a better way to do this. It performed well joining files based on email address and then flattening across datasets for export. Of course, why export at all when you can simply have it run the stats right now?
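Both of these checks are easy to sketch in pandas. The example below flags the impossible values described above and converts one date format into another. The column names are hypothetical, and it assumes the raw dates are written as MM/DD/YYYY.

```python
import pandas as pd

# Hypothetical per-subscriber data; column names and values are invented for illustration.
activity = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "lifespan_weeks": [40, 12, 25],
    "weeks_since_last_action": [3, 15, -2],
    "signup_date_raw": ["03/15/2023", "12/01/2022", "07/04/2023"],
})

# Flag impossible values: a negative number, or a gap longer than the lifespan itself.
suspect = activity[
    (activity["weeks_since_last_action"] < 0)
    | (activity["weeks_since_last_action"] > activity["lifespan_weeks"])
]
print(suspect)

# Convert dates assumed to be MM/DD/YYYY into ISO format (YYYY-MM-DD).
# errors="coerce" turns anything that does not match into NaT instead of raising,
# so the bad values can be counted and reviewed afterwards.
parsed = pd.to_datetime(activity["signup_date_raw"], format="%m/%d/%Y", errors="coerce")
activity["signup_date"] = parsed.dt.strftime("%Y-%m-%d")
```

Flagging rather than silently fixing keeps the decision with the analyst, which matters when the cause could be either a misread time stamp or a shifted column.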
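A join on email address followed by flattening might look roughly like this in pandas. The two small tables and their column names are stand-ins for real files, and the lowercase/strip step is my own assumption about how the keys should be cleaned before matching.

```python
import pandas as pd

# Hypothetical tables standing in for two separate files.
profiles = pd.DataFrame({
    "email": ["A@Example.com", "b@example.com", "c@example.com"],
    "opt_in_source": ["web", "event", "web"],
})
clicks = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "clicks": [2, 1, 4],
})

# Normalize the join key so differences in case or stray spaces do not break the match.
for frame in (profiles, clicks):
    frame["email"] = frame["email"].str.strip().str.lower()

# Flatten the event-level table to one row per email before joining.
clicks_per_email = clicks.groupby("email", as_index=False)["clicks"].sum()

# Left join keeps every profile; validate raises an error if either side has duplicate keys.
flat = profiles.merge(clicks_per_email, on="email", how="left", validate="one_to_one")
flat["clicks"] = flat["clicks"].fillna(0).astype(int)
print(flat)

# To export the flat file for another tool:
# flat.to_csv("flat_file.csv", index=False)
```

Aggregating before the merge is what keeps the result flat: joining the raw click rows directly would repeat each profile once per click.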
Another place where ChatGPT really shone was aggregating and grouping data in text or string fields. This is something most stats programs don’t do in the base program. For example, I use IBM Statistics (SPSS for you old timers) and it does have this functionality, but it’s an add-on that’s not part of the basic program.
Regardless of the tool, one thing to watch out for is characters breaking when the character set changes. Have you ever exported your subject lines with campaign statistics and noticed a lot of weird characters in them, like Greek letters or Wingdings? This is usually due to a change of character set during the export. A common piece of advice when I started was never to copy and paste from Word into your ESP’s editor, because it creates those weirdo characters. That was just an accepted fact for years, but now I’ll tell you why. On most Windows PCs, Word did not default to the UTF-8 character set; rather, it normally used Latin-1 (more precisely, the closely related Windows-1252 code page). Latin-1 doesn’t support emojis or many other characters. Similar issues used to happen when crossing from Mac to PC and back, which also changes the character encoding: older Macs and Windows PCs used different 8-bit encodings (Mac OS Roman versus Windows-1252), so the same byte value could mean a different character on each. When you see those broken characters in an export from your ESP, it’s likely due to how the characters were encoded during the export, and only Unicode encodings such as UTF-8 support emoji characters.
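You can reproduce the broken-character effect in a few lines of Python. The sketch below uses a made-up subject line: it writes the text as UTF-8, reads it back as Windows-1252 (cp1252 in Python), and then reverses the mistake. The repair step only works when no bytes were lost along the way, so treat it as a diagnostic trick rather than a guaranteed fix.

```python
# A made-up subject line with an accented letter and a curly apostrophe.
original = "Café’s sale"

# Write the text as UTF-8, then read those bytes back with the wrong character set.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# If nothing was lost, reversing the two steps recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

# Latin-1 has no slot for emoji at all, so encoding one fails outright.
try:
    "🎉".encode("latin-1")
    emoji_fits_latin1 = True
except UnicodeEncodeError:
    emoji_fits_latin1 = False
```

The garbled string shows the telltale pattern: each accented or curly character turns into two or three unrelated symbols.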
Since the webinar I’ve turned to ChatGPT to solve another data prep problem: I received an export of email event data that was JSON saved inside a CSV file. The problem is that the event data is wrapped in curly brackets, with the variable names repeated on every line followed by the actual values. With more than 700,000 rows of data, there is no way to fix this manually in a timely fashion. Instead, ChatGPT to the rescue. It was able to split the event data into separate columns for analysis. Now I have the data in a format I can access and use to find their deliverability issues.
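I can’t share that export, but the general approach can be sketched with a made-up one. The example below assumes the JSON sits in a single CSV column named event and that every row holds valid JSON; the field names (type, code, ts) are invented for illustration.

```python
import io
import json

import pandas as pd

# Hypothetical export: a CSV whose "event" column holds a JSON object as text.
raw_csv = io.StringIO(
    'email,event\n'
    'a@example.com,"{""type"": ""bounce"", ""code"": 550, ""ts"": ""2024-01-05T10:00:00""}"\n'
    'b@example.com,"{""type"": ""delivered"", ""code"": 250, ""ts"": ""2024-01-05T10:01:00""}"\n'
)
export = pd.read_csv(raw_csv)

# Parse each JSON string into a dict, then expand the keys into their own columns.
events = pd.json_normalize(export["event"].apply(json.loads).tolist())

# Put the new columns next to the original ones and drop the raw JSON text.
tidy = pd.concat([export.drop(columns="event"), events], axis=1)
print(tidy)
```

Rows with malformed JSON would make json.loads raise an error, so on a real file it is worth wrapping the parse in a small function that returns an empty dict for bad rows and counting how many there are.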
Again and again my experience with ChatGPT was that it saved me time on data prep to spend more time on applying results and solving email problems. I would encourage all email marketers to give ChatGPT a try for different things. While we often think of AI as helping with copywriting, it does so much more than that. Try it out and post some creative ways you’ve used ChatGPT in your email program.