More advanced, and faster, than =IMPORTXML(), that is. In Google Sheets you can use =IMPORTXML() with XPath expressions to return values from an HTML page. While it is pretty neat and easy to use, it quickly becomes slow once you want to scrape more than a few URLs and a few elements.
Let's say I want to know the following for each of 250 URLs: the meta title, meta description, meta robots, canonical, first H1, and the Open Graph url, title, image and description.
These are 9 HTML elements, for 250 URLs. Instead of 2,250 requests with =IMPORTXML(), we can do it in 250 requests (where every request returns all 9 elements at once), making scraping with Google Sheets a lot faster!
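To put numbers on that difference, here is a quick back-of-the-envelope sketch in plain JavaScript (not part of the Apps Script itself, just the arithmetic):

```javascript
// One =IMPORTXML() call can only return one XPath result per cell,
// so every element of every URL costs a separate request.
var urls = 250;
var elements = 9;
var importXmlRequests = urls * elements; // one request per element per URL
var scraperRequests = urls;              // one request per URL, all elements at once
console.log(importXmlRequests); // 2250
console.log(scraperRequests);   // 250
```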
Let's get started.
Did you know you can load external libraries into Google Apps Script and use their goodness? For this, I've used Cheerio, which basically mimics jQuery's element-selection ease. It can be a bit picky when dealing with other character encodings, but most of the time it gets the job done.
How to add Cheerio
See here for instructions on how to add Cheerio to a Google Apps Script, and then return here.
We need a custom function in Apps Script for this. Don't worry: it's easy, because I've already written the code for you ;)
The only thing you need to do is copy the code and paste it into the Script Editor:
/**
 * Scrape a URL, return whatever you choose with jQuery-style selectors.
 * Dependency: cheeriogs, see https://github.com/fgborges/cheeriogs
 *
 * @param {string} url A valid start URL.
 * @return {Array} One row of result values.
 *
 * @customfunction
 */
function scraper(url) {
  var result = [];
  var title, ogurl, ogtitle, ogimage, ogdescription, description, h1, robots, canonical;
  var options = {
    'muteHttpExceptions': true,
    'followRedirects': false,
  };
  try {
    // trim the url to prevent (rare) errors
    url = url.toString().trim();
    var r = UrlFetchApp.fetch(url, options);
    var c = r.getResponseCode();
    // only parse the body on a 200 OK
    if (c == 200) {
      var html = r.getContentText();
      var $ = Cheerio.load(html); // make sure this lib is added to your project!
      // meta robots
      if ($('meta[name="robots"]').attr('content')) {
        robots = $('meta[name="robots"]').attr('content').trim();
      }
      // canonical
      if ($("link[rel='canonical']").attr('href')) {
        canonical = $("link[rel='canonical']").attr('href');
      }
      // meta title
      if ($('title').length) {
        title = $('title').text().trim();
      }
      // meta description
      if ($('meta[name="description"]').attr('content')) {
        description = $('meta[name="description"]').attr('content').trim();
      }
      // h1
      if ($('h1').length) {
        h1 = $('h1').text().trim();
      }
      // open graph url
      if ($('meta[property="og:url"]').attr('content')) {
        ogurl = $('meta[property="og:url"]').attr('content').trim();
      }
      // open graph title
      if ($('meta[property="og:title"]').attr('content')) {
        ogtitle = $('meta[property="og:title"]').attr('content').trim();
      }
      // open graph image
      if ($('meta[property="og:image"]').attr('content')) {
        ogimage = $('meta[property="og:image"]').attr('content');
      }
      // open graph description
      if ($('meta[property="og:description"]').attr('content')) {
        ogdescription = $('meta[property="og:description"]').attr('content').trim();
      }
    }
    result.push([c, canonical, robots, title, description, h1, ogurl, ogtitle, ogimage, ogdescription]);
  } catch (error) {
    result.push([error.toString()]);
  } finally {
    return result;
  }
}
That's it. Now you can give this function a URL, and it will return all fields (and others, if you want) for you. I'll show you how.
So to use this function, simply pass a URL to it: scraper(url). The output is one row of 10 columns: the HTTP response code plus the 9 elements. Elements that have a value fill their cells; elements without a value leave their cells empty.
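To see why empty cells appear, here is a plain-JavaScript sketch (runnable outside Apps Script) of how the returned row behaves: fields that were never assigned stay undefined, and Sheets renders undefined values as blank cells. The buildRow helper and sample values are mine, for illustration; the column order mirrors the function above.

```javascript
// Sketch: build a result row the way scraper() does, with some fields unset.
function buildRow(c, fields) {
  // Same column order as scraper(): code, canonical, robots, title,
  // description, h1, og:url, og:title, og:image, og:description
  return [[c, fields.canonical, fields.robots, fields.title,
           fields.description, fields.h1, fields.ogurl, fields.ogtitle,
           fields.ogimage, fields.ogdescription]];
}
var row = buildRow(200, { title: 'Hello', h1: 'Hello world' });
console.log(row[0].length); // 10 columns
console.log(row[0][3]);     // "Hello"
console.log(row[0][2]);     // undefined, rendered as an empty cell in Sheets
```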
For sanity's sake, I usually check whether the URL actually IS a valid URL:
=if(isurl(url);scraper(url);"non-valid url")
(Depending on your locale, the argument separator is ; or ,.)
That's it. Copy the formula if you want to use it for multiple URLs.
It is also quite easy to modify this script. Do you want to extract the text in a tag called 'my_custom_tag'? Add this:
if ($('my_custom_tag').text()) {
  var my_custom_tag = $('my_custom_tag').text();
}
and push my_custom_tag into the result array:
result.push([c, canonical, my_custom_tag, ... etc
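If you want to test a selector idea outside Apps Script (where Cheerio and UrlFetchApp aren't available), a small plain-JavaScript stand-in for $('my_custom_tag').text() can help. This regex-based sketch is my own quick approximation, fine for a sanity check but no substitute for a real HTML parser:

```javascript
// Hypothetical stand-in for $('my_custom_tag').text() when Cheerio isn't at hand.
// A regex is fine for a quick check, not for production HTML parsing.
function extractTagText(html, tag) {
  var re = new RegExp('<' + tag + '[^>]*>([\\s\\S]*?)</' + tag + '>', 'i');
  var m = html.match(re);
  return m ? m[1].trim() : undefined;
}
var html = '<html><body><my_custom_tag> scraped text </my_custom_tag></body></html>';
console.log(extractTagText(html, 'my_custom_tag')); // → "scraped text"
```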
That's it! Now go and use it, for example with Google Sheets' auto-updating features to provide your client with real-time SEO content reporting through Data Studio (with this sheet as a self-updating datastore of your meta texts ;))
If you think this was useful to you, it might be for others also. Feel free to share: