Shopify’s a name you can’t miss if you’re in the e-Commerce business. After all, the platform has a whopping 1.7 million customers and is used to automate much of the online shopping experience for customers and businesses alike. If you haven’t already, be sure to check out our other post on Analyzing Shopify Data Exports.
If you’re looking to scale up your shop, learn trends and patterns from your customers, and analyze your Shopify big data, Gigasheet can greatly help you out! In this article, I’ll explore a sample sales dataset acquired from Shopify using standard data analysis techniques. Let’s see what we find! 🚀
If you're not already familiar with Gigasheet, it's a free big data analysis spreadsheet. You can use it as an online CSV viewer of course, or to do things like merge tons of CSV files. But perhaps most relevant for this post, we'll be using it for big data analysis. The great thing about Gigasheet is that it works much like a spreadsheet, so no training is required.
The dataset we’ll be analyzing was part of Shopify’s recruitment challenges from 2021. If you’d like to follow along or just get a feel for the data, you can acquire the dataset from Kaggle here. Once done, head over to Gigasheet, upload the dataset, and you’re good to follow along – starting with Data Exploration.
Let’s first take a quick look at the data and see what we’ve got here. From the UI itself, we can see it has 7 columns and 5,000 rows. Out of the 7 columns, 3 are ID fields (shop, user, and order) and the remaining 4 show more information about the orders themselves.
As part of exploration, we can also look for missing values which can throw off our analysis moving forward.
There are several ways to look for missing values in Gigasheet – for example, filtering a column for blank values, or grouping on it and checking for an empty group.
We’ll look at missing values in the Data Cleanup section in a little more detail.
If the filter or grouping turns up any hits, you can delete the matching rows or substitute a replacement value. Say you’d like to use the average value: group the data by the relevant column and calculate the average. Here’s an example of the data grouped by “shop_id” with the average of “order_amount” (click the down arrow for more options).
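If you’d rather script this step, a rough pandas equivalent might look like the following. This is a minimal sketch with made-up rows; the column names shop_id and order_amount mirror the dataset described above.

```python
import pandas as pd

# Hypothetical mini-version of the dataset's schema, with one missing amount.
df = pd.DataFrame({
    "shop_id":      [42, 42, 78, 78],
    "order_amount": [200.0, None, 150.0, 160.0],
})

# Substitute each missing value with the average for that shop,
# mirroring the group-then-average approach described above.
shop_avg = df.groupby("shop_id")["order_amount"].transform("mean")
df["order_amount"] = df["order_amount"].fillna(shop_avg)

print(df["order_amount"].tolist())  # [200.0, 200.0, 150.0, 160.0]
```

The transform("mean") call computes the per-shop average (skipping the missing value) and broadcasts it back to every row, so fillna only touches the gaps.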
Luckily, this dataset doesn’t have any missing values, and the rest of the data doesn’t need cleanup either. As a bonus, if you’d like to map the dataset against days, months, or years, you can explode the date. Here’s an example of how the “exploded” date looks:
No more memorizing Python datetime functions or reaching for the tz library to adjust time zones – it’s super simple to work with dates in Gigasheet!
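For the curious, here’s roughly the kind of pandas code that Gigasheet’s date explode saves you from writing. The timestamps are samples, and the created_at column name is an assumption based on typical Shopify order exports.

```python
import pandas as pd

# Sample timestamps in the style of a Shopify export's created_at column (assumed name).
df = pd.DataFrame({"created_at": ["2017-03-13 12:36:56", "2017-03-14 04:00:00"]})

# "Explode" the date into separate year/month/day/hour columns.
dt = pd.to_datetime(df["created_at"])
df["year"], df["month"] = dt.dt.year, dt.dt.month
df["day"], df["hour"] = dt.dt.day, dt.dt.hour

print(df[["year", "month", "day", "hour"]].values.tolist())
```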
Grouping the shop_id column, we see the dataset contains sales data from 100 stores (the “Rows” count shows the number of grouped values). Using the “Row Count” aggregation on the order_id column, we can retrieve the number of orders against each of these shops. Here are the top five stores:
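As a point of comparison, the same group-and-count step could be sketched in pandas like this (toy data standing in for the real 5,000 rows):

```python
import pandas as pd

# Toy orders table with the same shape as the challenge dataset (shop_id, order_id).
df = pd.DataFrame({
    "shop_id":  [42, 42, 42, 78, 78, 13],
    "order_id": [1, 2, 3, 4, 5, 6],
})

# Number of orders per shop, equivalent to grouping on shop_id with a
# "Row Count" aggregation on order_id, sorted to surface the busiest shops.
orders_per_shop = (
    df.groupby("shop_id")["order_id"].count().sort_values(ascending=False)
)
print(orders_per_shop.head())
```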
However, there’s nothing anomalous about the number of orders per store – they all average out to similar figures – so let’s pivot away from this field for now.
Next, let’s sort the data by the average of the order_amount column. Two shops stick out as clear outliers: shops 42 and 78, with average order amounts of 235,101 and 49,213 respectively. How about we check which users account for these insanely large orders?
We can group the data by user ID and then use the average aggregation on the order_amount column. It’s likely we’re observing some form of fraud from users 607 and 878 – or, at the very least, a normal user spending far more than the average figure.
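Scripted, this group-by-user step might look something like the sketch below. The figures are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative data: one user makes huge repeated orders while others are normal-sized.
df = pd.DataFrame({
    "user_id":      [607, 607, 878, 901, 902],
    "order_amount": [704000, 704000, 154350, 160, 220],
})

# Average order amount per user, sorted so the outliers float to the top.
avg_per_user = (
    df.groupby("user_id")["order_amount"].mean().sort_values(ascending=False)
)
print(avg_per_user.head())
```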
We can also use a sweet bar graph to view this data visually using charts:
Did you know? You can create all kinds of visual diagrams and graphs in Gigasheet, customize them, and save them as PNGs for your reporting!
Let’s first pivot our analysis to user 607, since this user shows the clearest pattern of abnormal activity. I’ll use a filter to select all of this user’s rows.
Once filtered, we see an even clearer pattern of automated fraud activity: the user purchases 2,000 items at 04:00 UTC every day from shop 42, paying with a credit card. That’s millions of dollars in potentially bogus sales!
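If you wanted to reproduce this filter-and-inspect step outside Gigasheet, a pandas sketch could look like this. The rows are invented to mimic the pattern described above.

```python
import pandas as pd

# Invented rows mimicking the pattern: identical amounts, always at 04:00.
df = pd.DataFrame({
    "user_id":      [607, 607, 999],
    "order_amount": [704000, 704000, 150],
    "created_at":   ["2017-03-02 04:00:00", "2017-03-07 04:00:00", "2017-03-02 10:15:00"],
})

# Filter to the suspicious user and check whether amounts and hours repeat.
suspect = df[df["user_id"] == 607]
hours = pd.to_datetime(suspect["created_at"]).dt.hour.unique()
print(suspect["order_amount"].nunique(), hours.tolist())
```

A single distinct amount combined with a single distinct hour across many orders is exactly the kind of machine-like regularity that flags automation.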
Now that the analysis of 607 is concluded, let’s pivot to the second most suspicious user, 878. Filtering results in a few rows with one outlier: a total order_amount of 154,350 from shop 78. This isn’t substantial evidence of anything on its own.
Finally, I’ll pivot to a third user, 766. Again, one order looks suspicious, with a total order_amount of 77,175 – also from shop 78. The store seems to be behind several suspicious orders, so maybe that should be our next pivot.
Filtering on a shop_id of 78, we get 46 results, almost all with anomalous order_amount values. Some of these amounts repeat exactly across multiple user IDs. The shop seems to be a hub for fraud as well.
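The repeated-amount check could be sketched in pandas as well (again with invented rows):

```python
import pandas as pd

# Invented rows: the same order_amount appearing under different user IDs
# at one shop is a classic fraud tell.
df = pd.DataFrame({
    "shop_id":      [78, 78, 78, 42],
    "user_id":      [878, 766, 990, 607],
    "order_amount": [154350, 77175, 77175, 704000],
})

# Filter to shop 78 and count how often each exact amount repeats.
shop_78 = df[df["shop_id"] == 78]
repeated = shop_78["order_amount"].value_counts()
print(repeated[repeated > 1])  # amounts shared by more than one row
```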
At the start of our analysis, the average order value across the dataset’s 100 shoe stores was around $3,145.13, which is clearly anomalous for something like shoes. Through our data analysis in Gigasheet, we uncovered two fraud scenarios, centered on shop 78 and user 607.
After removing the two outliers from our dataset, we get an average order value of $302.58, much more in line with what you’d expect for shoe sales.
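As a sanity check, here’s a sketch of how the before-and-after average order value could be computed in pandas. The numbers below are made up, so the outputs won’t match the article’s figures.

```python
import pandas as pd

# Made-up data: two outliers (shop 78, user 607) plus a few normal orders.
df = pd.DataFrame({
    "shop_id":      [42, 78, 10, 11, 12],
    "user_id":      [607, 878, 701, 702, 703],
    "order_amount": [704000, 154350, 280, 310, 295],
})

# Average order value with and without the two outliers.
aov_raw = df["order_amount"].mean()
clean = df[(df["shop_id"] != 78) & (df["user_id"] != 607)]
aov_clean = clean["order_amount"].mean()
print(round(aov_raw, 2), round(aov_clean, 2))
```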
That’s all for this Shopify fraud data analysis, folks! Hopefully techniques like this will help you prevent fraud in your shop. Beyond fraud prevention, Gigasheet offers a wide range of features for Shopify owners and online sales analysts to explore large datasets, create graphs, and report results in beautiful visuals.
Want to take Gigasheet out on a test ride? Well, you can… for free! Sign up for an account today, upload your data, and go crazy – billions of rows and we don’t even break a sweat!