(Table of contents & other stuff:  )

Minimal (but powerful) keyword preprocessing in BigQuery

Speed up keyword research tasks in BigQuery by preprocessing keywords first. But also make sure you don't preprocess too much (and thus overwrite or delete valuable information).

This UDF does that for you.

I needed to speed up keyword research, and was trying to do that in a huge dataset (421 GB). So I needed a way to make keyword preprocessing faster, without sacrificing (a lot) of information within them.

So, here is the UDF I usually use for this. It contains - and this is important - a very MINIMAL set of preprocessing tasks that still make a text dataset easier to parse and interpret, without removing (too much) valuable information.

Here is what it does #

  1. 1.

    It replaces uncommon accents with a common letter. So a ü becomes a u, an  becomes an A, for example.

  2. 2.

    Since we're dealing with keywords, everything is converted to lowercase. However if you need uppercase letters, you can of course comment the let processedText = normalizedText.toLowerCase(); line

  3. 3.

    It removes characters that are outside the common ASCII range, like emojis and most symbols

  4. 4.

    Non-word characters are replaces by a space

  5. 5.

    Double spaces are replaced by a single space

  6. 6.

    It removes telephone numbers (9 or more digits, optional spaces, optionally starting with a + sign). You could also include another line that filters email addresses, but i'll leave that up to you.

The code #

Replace project.dataset with your own BigQuery project and dataset.

CREATE OR REPLACE FUNCTION `project.dataset.basic_text_preprocess`(input STRING) RETURNS STRING LANGUAGE js AS R"""
const accentsMap = {
    'ç': 'c', 'æ': 'ae', 'œ': 'oe', 'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u',
    'à': 'a', 'è': 'e', 'ì': 'i', 'ò': 'o', 'ù': 'u', 'ä': 'a', 'ë': 'e', 'ï': 'i',
    'ö': 'o', 'ü': 'u', 'ÿ': 'y', 'â': 'a', 'ê': 'e', 'î': 'i', 'ô': 'o', 'û': 'u',
    'å': 'a', 'ø': 'o', 'Ø': 'O', 'Å': 'A', 'Á': 'A', 'À': 'A', 'Â': 'A', 'Ä': 'A',
    'È': 'E', 'É': 'E', 'Ê': 'E', 'Ë': 'E', 'Í': 'I', 'Î': 'I', 'Ï': 'I', 'Ì': 'I',
    'Ò': 'O', 'Ó': 'O', 'Ô': 'O', 'Ö': 'O', 'Ú': 'U', 'Ù': 'U', 'Û': 'U', 'Ü': 'U',
    'Ÿ': 'Y', 'Ç': 'C', 'Æ': 'AE', 'Œ': 'OE', 'ñ': 'n', 'Ñ': 'N'
};

// Normalize accented characters
let normalizedText = input.split('').map(char => accentsMap[char.toLowerCase()] || char).join('');

// Convert to lowercase
let processedText = normalizedText.toLowerCase();

// This regex targets characters outside the basic ASCII range, which includes emojis and most symbols
processedText = processedText.replace(/[^\x00-\x7F]/g, "");

// Replace non-word characters with spaces and collapse multiple spaces into one
processedText = processedText.replace(/[^\w-]+/g, ' ').replace(/\s+/g, ' ').trim();

// Remove words that are too short
processedText = processedText.split(' ').filter(word => word.length > 2).join(' ');

// Removes sequences of 9 or more digits that might include spaces/dashes, optionally starting with a plus sign
processedText = processedText.replace(/\+?\b(?:\d[ -]?){9,}\b/g, ' ');

return processedText;
""";

Usage #

I assume you know your way in BigQuery. I usually create an extra STRING column. The value is whatever your keyword column is called, with this function wrapped around it:

`project.dataset.basic_text_preprocess`(keyword)

It's also nice to add another column that checks if the keyword was changed, by comparing the two text strings with each other. That could be a BOOLEAN value of something more elaborate, like a diff type function. But that is beyond the scope of this article ;)

Happy keyword researching!

→ Call to action ←

Want more?

Then let's close the virtual gap between us by one step.