More advanced web scraping with Google Sheets

More advanced, and faster, than =IMPORTXML(), that is.

In Google Sheets you can use =IMPORTXML() with XPath expressions to return values from an HTML page. While it's pretty neat and easy to use, it quickly becomes slow once you want to scrape more than a few URLs or more than a few elements per URL: every element needs its own formula, and every formula fires its own request.
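For example, getting just the title and the canonical of one URL already takes two formulas, and two separate requests (assuming the URL sits in cell A2):

=IMPORTXML(A2; "//title")
=IMPORTXML(A2; "//link[@rel='canonical']/@href")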

So let's say I want to know the following from 250 URLs:

  • URL stuff: status code, meta robots, canonical
  • meta: title, description
  • open graph: og:title, og:image, og:url, og:description
  • and the h1 tag

That's 9 HTML elements (the status code comes along with the request anyway), for 250 URLs. Instead of 9 × 250 = 2,250 requests with =IMPORTXML(), we can do it in 250 requests, where every request returns all 9 elements at once. That makes scraping with Google Sheets a lot faster!

Let's get started.

Step 1: add Cheerio to your sheet

Did you know you can load external libraries into Google Apps Script and use their goodness? For this I've used Cheerio, which basically mimics jQuery's easy element selection. It's sometimes a bit picky when dealing with other character encodings, but most of the time it gets the job done.
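To add it: in the Script Editor, open Resources > Libraries (or click the '+' next to Libraries in the newer editor), paste the project key / script ID from the cheeriogs README, and add the latest version. Make sure the identifier is set to Cheerio, because that's the name the script below uses.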

Step 2: Copy the script below

We need to use a custom function in Apps Script for this. Don't worry: it's easy, because I've already written the code for you ;)

The only thing you need to do is copy the code and paste it into the Script Editor:

  1. go to Tools > Script Editor
  2. delete the 'function myFunction() {}' example code
  3. and replace it with the following code:
/**
 * Scrape URL, return whatever you choose with jquery-style selectors. Dependency: cheeriogs, see https://github.com/fgborges/cheeriogs
 *
 * @param {string} url A valid start URL.
 * @return {Array} The scraped values, as one row.
 *
 * @customfunction
 */
function scraper(url) {
    var result = [];
    var title, ogurl, ogtitle, ogimage, ogdescription, description, h1, robots, canonical;
    var options = {
        'muteHttpExceptions': true, // don't throw on 4xx/5xx responses, so we can report the status code
        'followRedirects': false,   // don't follow redirects, so 301/302 status codes stay visible
    };
    try {
        // trim the url to prevent (rare) errors
        url = url.toString().trim();
        var r = UrlFetchApp.fetch(url, options);
        var c = r.getResponseCode();
        // only parse the HTML if the response is 200 OK
        if (c == 200) {
            var html = r.getContentText();
            var $ = Cheerio.load(html); // make sure this lib is added to your project!
            // meta robots
            if ($('meta[name="robots"]').attr('content')) {
                robots = $('meta[name="robots"]').attr('content').trim();
            }
            // canonical
            if ($("link[rel='canonical']").attr("href")) {
                canonical = $("link[rel='canonical']").attr("href");
            }
            // meta title
            if ($('title').length) {
                title = $('title').text().trim();
            }
            // meta description
            if ($('meta[name=description]').attr("content")) {
                description = $('meta[name=description]').attr("content").trim();
            }
            // h1 (first one only, in case the page has several)
            if ($('h1').length) {
                h1 = $('h1').first().text().trim();
            }
            // open graph url
            if ($('meta[property="og:url"]').attr('content')) {
                ogurl = $('meta[property="og:url"]').attr('content').trim();
            }
            // open graph title
            if ($('meta[property="og:title"]').attr('content')) {
                ogtitle = $('meta[property="og:title"]').attr('content').trim();
            }
            // open graph image
            if ($('meta[property="og:image"]').attr('content')) {
                ogimage = $('meta[property="og:image"]').attr('content');
            }
            // open graph description
            if ($('meta[property="og:description"]').attr('content')) {
                ogdescription = $('meta[property="og:description"]').attr('content').trim();
            }
        }
        result.push([c, canonical, robots, title, description, h1, ogurl, ogtitle, ogimage, ogdescription]);
    } catch (error) {
        result.push([error.toString()]); // surface the error message in the cell
    } finally {
        return result;
    }
}
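If you want to check that everything works before heading back to your sheet, you can run a quick test from the Script Editor and inspect the output via View > Logs. A minimal sketch (example.com stands in for any URL you like):

function testScraper() {
    Logger.log(scraper('https://www.example.com/'));
}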

That's it. Now you can give this function a URL, and it will return all of these fields (and any others you add) for you. I'll show you how.

Step 3: use it

To use this function, simply pass a URL to it: =scraper(url). The output is a single row of 10 columns (the status code plus the 9 scraped elements), with an empty cell wherever an element has no value.

For sanity's sake, I usually check first whether the URL actually IS a valid URL: =IF(ISURL(url);scraper(url);"not a valid url") (depending on your locale, the argument separator is a comma instead of a semicolon). That's it. Copy the formula down if you want to use it for multiple URLs.
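Concretely: put your URLs in column A and the formula in column B. A minimal sketch, assuming your first URL sits in A2:

A2: https://www.example.com/
B2: =IF(ISURL(A2); scraper(A2); "not a valid url")

The 10 result values spill into B2:K2; copy B2 down for the rest of your URLs.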

Step 4: easily expand, if you want

It's also quite easy to modify this script. Do you want to extract the text of a tag called 'my_custom_tag'? Add this:

if ($('my_custom_tag').length) {
    my_custom_tag = $('my_custom_tag').text().trim();
}

then declare my_custom_tag at the top with the other variables, and push it into the result array: result.push([c, canonical, my_custom_tag ... etc.
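Here's what that looks like end-to-end for a hypothetical h2 extraction (h2 isn't part of the original script; the pattern is the same for any selector Cheerio supports):

// hypothetical addition: grab the first <h2> on the page
var h2; // declare it at the top, alongside title, ogurl, etc.
if ($('h2').length) {
    h2 = $('h2').first().text().trim();
}
// and extend the row that gets pushed:
result.push([c, canonical, robots, title, description, h1, h2, ogurl, ogtitle, ogimage, ogdescription]);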

That's it! Now go and use it, for example by combining Google Sheets' auto-updating features with Data Studio to give your client (near) real-time SEO content reporting, with this sheet as a self-updating datastore of your meta texts ;)

What's next?

If you think this was useful to you, it might be for others too. Feel free to share it.