(Table of contents & other stuff:  )

How to turn 1.2 million domain names into 4000 partitions in BigQuery

I have a table in BigQuery with SERP data. It contains 12.484.559 different domain names. How to select all data from specific domains without needing to go over the full table (421 GB) over and over again?

Those who've worked with relational databases and made the switch to BigQuery know what I'm talking about: BigQuery doesn't support indexes. That really sucks. It makes specific SELECTs, JOINs etc. slower and more expensive. What to do?

Create your own index #

The way to deal with this, is to create indexes anyway, but then utilise the tools we have available. In the case of BigQuery, that means partitioning.

However, you can only create one partition type, it has to be an integer (or date), and there is a limit of 4000 partitions.

So what we do, is translate any given domain name into an integer, and make sure that integer is always between 0 and 4000.

Extract the first 3 characters #

First, we extract the first 3 characters from any domain name. We will use these three to translate it into an integer. We should also first make sure that only characters are extracted that are valid: 0-9, a-z, the dot (.) and a dash (-).

This is the SQL CTE I use for that:

WITH input_check AS (
      SELECT
        DISTINCT domain AS main_domain,
        REGEXP_CONTAINS(domain, r'^[^\\.]+\\.[^\\.]+$') AS valid_format,
        NOT REGEXP_CONTAINS(domain, r'\\s') AS no_spaces
      FROM (select domain)
    ),
    domain_extraction AS (
      SELECT
        main_domain,
        CASE 
          WHEN LENGTH(SUBSTR(main_domain, 1, 3)) >= 3 THEN SUBSTR(main_domain, 1, 3)
          WHEN LENGTH(SUBSTR(main_domain, 1, 3)) >= 2 THEN SUBSTR(main_domain, 1, 2)
          WHEN LENGTH(SUBSTR(main_domain, 1, 3)) = 1 THEN SUBSTR(main_domain, 1, 1)
          ELSE NULL
        END AS hash_basis,
        valid_format,
        no_spaces
      FROM input_check
    ),

Translate to ascii #

Next, the extracted valid characters need to be to an numerical equivalent. For that, we can use a classic ascii lookup:

   character_mapping AS (
      SELECT
        main_domain,
        hash_basis,
        (
          CASE
            WHEN ASCII(SUBSTR(hash_basis, 1, 1)) BETWEEN 48 AND 57 THEN ASCII(SUBSTR(hash_basis, 1, 1)) - 48
            WHEN ASCII(SUBSTR(hash_basis, 1, 1)) BETWEEN 97 AND 122 THEN ASCII(SUBSTR(hash_basis, 1, 1)) - 97 + 10
            WHEN SUBSTR(hash_basis, 1, 1) = '.' THEN 36
            WHEN SUBSTR(hash_basis, 1, 1) = '-' THEN 37
            ELSE NULL
          END +
          IF(LENGTH(hash_basis) > 1,
            CASE
              WHEN ASCII(SUBSTR(hash_basis, 2, 1)) BETWEEN 48 AND 57 THEN (ASCII(SUBSTR(hash_basis, 2, 1)) - 48) * 38
              WHEN ASCII(SUBSTR(hash_basis, 2, 1)) BETWEEN 97 AND 122 THEN (ASCII(SUBSTR(hash_basis, 2, 1)) - 97 + 10) * 38
              WHEN SUBSTR(hash_basis, 2, 1) = '.' THEN 36 * 38
              WHEN SUBSTR(hash_basis, 2, 1) = '-' THEN 37 * 38
              ELSE 0
            END, 0) +
          IF(LENGTH(hash_basis) > 2,
            CASE
              WHEN ASCII(SUBSTR(hash_basis, 3, 1)) BETWEEN 48 AND 57 THEN (ASCII(SUBSTR(hash_basis, 3, 1)) - 48) * 38 * 38
              WHEN ASCII(SUBSTR(hash_basis, 3, 1)) BETWEEN 97 AND 122 THEN (ASCII(SUBSTR(hash_basis, 3, 1)) - 97 + 10) * 38 * 38
              WHEN SUBSTR(hash_basis, 3, 1) = '.' THEN 36 * 38 * 38
              WHEN SUBSTR(hash_basis, 3, 1) = '-' THEN 37 * 38 * 38
              ELSE 0
            END, 0)
        ) AS domain_hash
      FROM domain_extraction
    )

Make sure the integer is between 0 and 4000 #

And finally, we SELECT the integer value we created, divide it by the maximum number it can have, multiply it by 4000 and make sure it is a rounded number:

CAST(ROUND((domain_hash / 54872.0) * 4000) AS INT64) AS domain_partition

Putting it all together: an SQL UDF #

To be able to re-use it, we need to put all this into an SQL UDF, so it becomes a function.

I use it to create the partitions in the source table, but also to filter SELECT statements (thus making sure the SELECT is faster and only looks in the designated partition(s), instead of in the whole table: this divides the 421GB table by approx 4000 for every single query!).

Make sure: replace your_dataset.your_procedures_folder with the values applicable to your own BigQuery environment. And also make sure that domain is the actual domain name, and omits the protocol (https://) and optional subdomain (like 'www'). So in my case, that would be: eikhart.com.

This is the full code of the UDF:

CREATE OR REPLACE FUNCTION `your_dataset.your_procedures_folder.HASH_DOMAIN`(domain STRING)
RETURNS INT64
AS (
  (
    WITH input_check AS (
      SELECT
        DISTINCT domain AS main_domain,
        REGEXP_CONTAINS(domain, r'^[^\\.]+\\.[^\\.]+$') AS valid_format,
        NOT REGEXP_CONTAINS(domain, r'\\s') AS no_spaces
      FROM (select domain)
    ),
    domain_extraction AS (
      SELECT
        main_domain,
        CASE 
          WHEN LENGTH(SUBSTR(main_domain, 1, 3)) >= 3 THEN SUBSTR(main_domain, 1, 3)
          WHEN LENGTH(SUBSTR(main_domain, 1, 3)) >= 2 THEN SUBSTR(main_domain, 1, 2)
          WHEN LENGTH(SUBSTR(main_domain, 1, 3)) = 1 THEN SUBSTR(main_domain, 1, 1)
          ELSE NULL
        END AS hash_basis,
        valid_format,
        no_spaces
      FROM input_check
    ),
    character_mapping AS (
      SELECT
        main_domain,
        hash_basis,
        (
          CASE
            WHEN ASCII(SUBSTR(hash_basis, 1, 1)) BETWEEN 48 AND 57 THEN ASCII(SUBSTR(hash_basis, 1, 1)) - 48
            WHEN ASCII(SUBSTR(hash_basis, 1, 1)) BETWEEN 97 AND 122 THEN ASCII(SUBSTR(hash_basis, 1, 1)) - 97 + 10
            WHEN SUBSTR(hash_basis, 1, 1) = '.' THEN 36
            WHEN SUBSTR(hash_basis, 1, 1) = '-' THEN 37
            ELSE NULL
          END +
          IF(LENGTH(hash_basis) > 1,
            CASE
              WHEN ASCII(SUBSTR(hash_basis, 2, 1)) BETWEEN 48 AND 57 THEN (ASCII(SUBSTR(hash_basis, 2, 1)) - 48) * 38
              WHEN ASCII(SUBSTR(hash_basis, 2, 1)) BETWEEN 97 AND 122 THEN (ASCII(SUBSTR(hash_basis, 2, 1)) - 97 + 10) * 38
              WHEN SUBSTR(hash_basis, 2, 1) = '.' THEN 36 * 38
              WHEN SUBSTR(hash_basis, 2, 1) = '-' THEN 37 * 38
              ELSE 0
            END, 0) +
          IF(LENGTH(hash_basis) > 2,
            CASE
              WHEN ASCII(SUBSTR(hash_basis, 3, 1)) BETWEEN 48 AND 57 THEN (ASCII(SUBSTR(hash_basis, 3, 1)) - 48) * 38 * 38
              WHEN ASCII(SUBSTR(hash_basis, 3, 1)) BETWEEN 97 AND 122 THEN (ASCII(SUBSTR(hash_basis, 3, 1)) - 97 + 10) * 38 * 38
              WHEN SUBSTR(hash_basis, 3, 1) = '.' THEN 36 * 38 * 38
              WHEN SUBSTR(hash_basis, 3, 1) = '-' THEN 37 * 38 * 38
              ELSE 0
            END, 0)
        ) AS domain_hash
      FROM domain_extraction
    )
    SELECT
      -- Calculate the partition
      CAST(ROUND((domain_hash / 54872.0) * 4000) AS INT64) AS domain_partition
    FROM character_mapping
    WHERE domain_hash IS NOT NULL
    LIMIT 1
  )
);

The result #

Now, if you first create these partitions, and then SELECT by them, you're basically filtering your BigQuery table, WITHOUT UNNECESSARY COST. It's not perfect (a higher partition limit would be nice ;)), but it does save significant costs and probably also noticable speed.

→ Call to action ←

Want more?

Then let's close the virtual gap between us by one step.