Imagine you’re performing competitive analysis and want to verify data published on the internet. For instance, let’s say we want to find the original publish dates of several blog posts. Even for just 100 articles, manually inspecting each article’s source code to find its publish date could take you more than a day.
Wouldn’t it be great if you could just put a list of URLs into Microsoft Excel and it would scrape the HTML from each location into a cell? Then you could use Excel formulas to extract the data you need from the HTML, such as the publish date in our example above.
While this is technically possible in Excel, it requires writing Visual Basic (VBA) macros to fetch each URL and parse the HTML yourself.
Great news! There’s a better, faster way to perform no code web scraping.
We ran an experiment on over 500 articles published on Neil Patel’s website and found the original publish date for all of them—in under 10 minutes. And it was all done in a familiar spreadsheet format that any Microsoft Excel user will love.
Here’s a breakdown of the steps we followed to accomplish this with Gigasheet, the big data spreadsheet, using its custom enrichment feature.
Whether you want to make a list of price points for all your competitors’ products or create a database of topics you’ve covered on your site so far, you can collect any data in minutes with Gigasheet.
Here’s how we ran an experiment to find the publish date for over 500 articles on a website:
To start the experiment, we created a new file on Gigasheet and made a list of links for all the blogs we wanted to analyze.
You can log into your Gigasheet account (or create a new one) and click on the Blank Sheet option to get started.
Then give your file a name and paste all the links into a single column. Be sure to give the column a relevant name as well.
Once you’ve added all the links, go to Enrichments and scroll down to select the Custom Enrichments option.
You can also run enrichment requests for other apps, like Apollo.io, Agify, ChatGPT, etc.
Next, you’ll be prompted to add an API key link for your enrichment request. We are going to use the ScrapingBee API to perform the no code web scraping for us, extracting the HTML source code for all the articles. ScrapingBee comes with a free version, so you can try this out yourself before ever having to pay. The nice thing about using a service like ScrapingBee is that it handles headless browsers and rotates proxies for you so you don't have to worry about getting blocked while scraping. We add the ScrapingBee API key link in this step. Then hit next.
In this step, you can customize the API key URL to scrape data for the links you added to your list in step 1. To do that, select the part of the link starting after the = sign.
Then click on the drop-down menu on the right and select the column with all the links. Hit Insert Column Reference and your API key link will change to this. Now, press Test to run the request.
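Under the hood, this column-reference substitution simply builds one ScrapingBee request URL per row. Here’s a minimal Python sketch of the same idea, assuming a placeholder API key and hypothetical article links:

```python
from urllib.parse import urlencode

# Placeholder values for illustration; substitute your own key and links.
API_KEY = "YOUR_SCRAPINGBEE_API_KEY"
ENDPOINT = "https://app.scrapingbee.com/api/v1/"

article_links = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
]

def enrichment_url(link: str) -> str:
    """Build the ScrapingBee request URL for one row's link,
    mirroring the column reference inserted after the = sign."""
    return ENDPOINT + "?" + urlencode({"api_key": API_KEY, "url": link})

request_urls = [enrichment_url(link) for link in article_links]
for url in request_urls:
    print(url)
```

Fetching each `request_urls` entry (e.g., with `requests.get`) would return the page’s HTML, which is exactly what the enrichment writes into your sheet.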
The custom enrichment process will start immediately. It’ll take a few minutes to complete, depending on the number of data points you’re covering.
Before displaying the data, Gigasheet will show you a preview of what the results will look like. Click on Next to continue the enrichment process and get the entire source code for the selected blogs.
Here’s how the results will appear:
We successfully used a spreadsheet to perform no code web scraping. Now that it’s in our sheet, let’s pull out the relevant data!
Once all the data is available in your Gigasheet file, go to the Data Cleanup menu and select the Split Column feature. This will help you split a column apart and isolate whatever data you need.
Here, you have to select the column containing the source code for each article. In the Separator field, we entered the text that appears just before the publish date. Then hit Apply.
This step divided the HTML code into two columns: part 1 held the code up to the publish date, and part 2 held the code starting from the publish date. Check the results below:
Now, click on the plus icon on the top-right corner of your file and add a formula.
We used a formula to get the exact publish date from the part 2 column. The LEFT formula creates a new column containing the first n characters of a chosen column.
So, in our case, we named the new column “Published Date”, added the LEFT formula with the part 2 column as its reference, and set it to take the first 10 characters from that column.
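The Split Column and LEFT steps can be sketched in a few lines of Python. The separator string and HTML snippet below are hypothetical; the real separator depends on how the site marks up its dates:

```python
# Hypothetical separator: text appearing just before the date in the HTML.
SEPARATOR = '"datePublished":"'

# Hypothetical fragment of scraped source code.
html = 'lots of markup ... "datePublished":"2021-03-15T08:00:00Z" ... more markup'

def published_date(source: str, separator: str, n: int = 10) -> str:
    """Split the source on the separator (the Split Column step),
    then take the first n characters of part 2 (the LEFT formula)."""
    part1, _, part2 = source.partition(separator)
    return part2[:n]

print(published_date(html, SEPARATOR))  # → 2021-03-15
```

Taking 10 characters works here because a YYYY-MM-DD date is exactly 10 characters long; adjust n to match the date format on the pages you scrape.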
Here are the results: a complete list of only the published dates for all the blogs.
We repeated these steps to find the meta descriptions for all the blogs. Here are the results:
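For meta descriptions, the same split-and-trim approach works, though a small regular expression is often more robust. A sketch, using a hypothetical HTML snippet (real pages vary in attribute order and quoting):

```python
import re

# Hypothetical scraped source code for one article.
html = '<head><meta name="description" content="How to scrape without code."></head>'

def meta_description(source: str):
    """Pull the content attribute of the description meta tag,
    or return None if the tag is missing."""
    match = re.search(
        r'<meta\s+name="description"\s+content="([^"]*)"',
        source,
        re.IGNORECASE,
    )
    return match.group(1) if match else None

print(meta_description(html))  # → How to scrape without code.
```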
While our experiment presented a great use case for content managers, you can replicate these steps to extract virtually any data from web pages. No code web scraping is a powerful tool!
Once you have the data in spreadsheet form, you can organize it like a database by filtering, sorting, or grouping results. Besides these basic features, you can try the IF/THEN function or data cleanup functions to better represent your dataset. Gigasheet’s AI Sheet Assistant is another efficient way to navigate a large dataset and find what you need.
Sign up for free to extract data from anywhere with Gigasheet.