Introducing the Assist of Lateral Column Alias


We’re thrilled to introduce the assist of a brand new SQL function in Apache Spark and Databricks: Lateral Column Alias (LCA). This function simplifies advanced SQL queries by permitting customers to reuse an expression specified earlier in the identical SELECT checklist, eliminating the necessity to use nested subqueries and Frequent Desk Expressions (CTEs) in lots of instances. This weblog submit discusses the use instances of the function and the advantages it brings to Spark and Databricks customers.

What’s Lateral Column Alias Assist?

Lateral Column Alias (LCA) supplies customers the aptitude to reuse an expression specified earlier throughout the similar SELECT checklist.
This function will be higher understood by means of the instance offered under. Right here is an easy question:


SELECT 1 AS a, a + 1

Within the absence of LCA assist, customers will get an error on this question that the latter a within the SELECT checklist can’t be resolved:

 

[UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column or perform parameter with title `a` can't be resolved. ; line 1 pos 15;

 

Thankfully, with the LCA function, this second a within the question now efficiently identifies because the beforehand outlined alias in the identical SELECT checklist : 1 AS a. Customers are now not confronted with an error, however as a substitute supplied with the next outcomes:


> SELECT 1 AS a, a + 1
1 2

Eradicate Advanced Subqueries and CTEs with LCA Chaining

Whereas the earlier examples showcase the essential idea of LCA, the true energy of this function lies in its capability to get rid of advanced subqueries and CTEs.

Earlier than the introduction of LCA, customers needed to take care of a number of subqueries and CTEs when attempting to reference any attribute outlined by a earlier alias. This elevated the complexity and verbosity of SQL queries, making them onerous to learn, write and keep. In distinction, LCA assist basically simplifies these queries, making them extra user-friendly and manageable.

Let’s take an instance. Suppose there’s a merchandise desk storing product info akin to title, class, worth and buyer score. Our purpose is to compute an adjusted worth primarily based on a number of influencing components. The situation will clearly delineate how LCA can flip a convoluted question right into a considerably simplified model.

Right here is the desk construction:


CREATE TABLE merchandise (
  id INT,
  title STRING,
  class INT,
  worth DECIMAL(10, 2),
  member_price DECIMAL(10, 2) COMMENT 'Particular worth for members',
  score INT COMMENT 'Buyer score ranges between 1 to five'
) USING delta;

INSERT INTO
  merchandise
VALUES
  (1, 'Product 1', 0, 100.00,  90.00, 3),
  (2, 'Product 2', 1, 150.00, 120.00, 4),
  (3, 'Product 3', 0, 200.00, 190.00, 5),
  (4, 'Product 4', 2, 250.00, 210.00, 2),
  (5, 'Product 5', 1, 300.00, 150.00, 1);

We want to calculate the adjusted worth for every product on the better worth of two components: the value improve share primarily based on customers’ score of the product and primarily based on the rank of the product inside its class. With out LCA assist, the question seems to be like this:


-- BEFORE: with out LCA
WITH cte1 AS (
  SELECT id, title, class, score, worth, 
         CASE
           WHEN score = 1 THEN 0.02
           WHEN score = 2 THEN 0.04
           WHEN score = 3 THEN 0.06
           WHEN score = 4 THEN 0.08
           ELSE 0.1
         END AS increase_percentage_based_on_rating,
         rank() OVER (PARTITION BY class ORDER BY score) AS rating_rank
  FROM merchandise
), cte2 AS (
  SELECT id, title, class, score, worth, increase_percentage_based_on_rating,
         CASE
           WHEN rating_rank = 1 THEN 0.2
           WHEN rating_rank = 2 THEN 0.1
           ELSE 0
         END AS increase_percentage_based_on_rank
  FROM cte1
), cte3 AS (
  SELECT id, title, class, score, worth, 
         GREATEST(increase_percentage_based_on_rating,
                  increase_percentage_based_on_rank) AS final_increase_percentage
  FROM cte2
)
SELECT id, title, class, score, worth, 
CAST(worth * (1 + final_increase_percentage) AS DECIMAL(10, 2)) 
  AS adjusted_price
FROM cte3

The logic accommodates many chaining operations whereby a latter calculation is determined by beforehand calculated outcomes. Due to this fact it requires a number of CTEs to retailer every intermediate calculation in a fashion appropriate for later references within the subsequent levels of the question.

Nonetheless, with LCA, it’s attainable to specific the question as one single SELECT assertion as a substitute:


-- AFTER: with LCA
SELECT id, title, class, worth, score,
       CASE
         WHEN score = 1 THEN 0.02
         WHEN score = 2 THEN 0.04
         WHEN score = 3 THEN 0.06
         WHEN score = 4 THEN 0.08
         ELSE 0.1
       END AS increase_percentage_based_on_rating,
       rank() OVER (PARTITION BY class ORDER BY score) AS rating_rank,
       CASE
         WHEN rating_rank = 1 THEN 0.2
         WHEN rating_rank = 2 THEN 0.1
         ELSE 0
       END AS increase_percentage_based_on_rank,
       GREATEST(increase_percentage_based_on_rating,
                increase_percentage_based_on_rank) AS final_increase_percentage,
       CAST(worth * (1 + final_increase_percentage) AS DECIMAL(10, 2))
         AS adjusted_price
FROM merchandise

LCAs can be chained! This implies the present alias expression, which will be referenced by subsequent expressions, can reference a beforehand outlined lateral alias. For instance, the definition of final_increase_percentage is determined by two lateral column aliases: increase_percentage_based_on_rating and increase_percentage_based_on_rank. The next calculation of adjusted_price then refers to final_increase_percentage. This chaining energy of LCA permits customers to create a sequence of dependent calculations, the place the outcomes of 1 calculation are used as inputs for the following.

As we will see within the above instance, LCA largely simplifies the question, eliminating repeated calculation or the necessity for a number of CTEs, making it simpler to grasp, keep and debug. It additionally improves readability because the calculation definition and the utilization are shut collectively within the question.

LCA Every thing

Easy, aggregation or window expressions

Virtually each expression can reside inside a lateral column alias. The examples within the final part present that advanced CASE-WHEN expressions, in addition to GREATEST perform expressions and even window features, can reside inside a lateral column alias for additional use in subsequent expressions.

By the identical token, we may nest aggregation expressions on this means. Right here is an instance on the identical merchandise desk:


SELECT class AS c, string(c) AS c_str, avg(score) AS avg_rating,
       concat('class ', c_str, ' has common score ', string(avg_rating))
FROM merchandise
GROUP BY class


| 1 | 1 | 2.5 | class 1 has common score 2.5 |
| 2 | 2 | 2   | class 2 has common score 2.0 |
| 0 | 0 | 4   | class 0 has common score 4.0 |

Advanced information sorts

LCA additionally works nicely with advanced information sorts like struct, array and map. For instance,


SELECT named_struct('a', named_struct('b', 1)) AS foo1, foo1.a.b + 1 AS bar1,
       map('a', 1, 'b', 2) AS foo2, foo2['b'] AS bar2,
       array(named_struct('a', 1)) AS foo3, foo3[0].a AS bar3;


| foo1            | bar1 | foo2             | bar2 | foo3       | bar3 |
|-----------------|------|------------------|------|------------|------|
| {"a": {"b": 1}} | 2    | {"a": 1, "b": 2} | 2    | [{"a": 1}] | 1    |

Non-deterministic expressions

LCA ensures that non-deterministic expressions are evaluated solely as soon as, mirroring the “run-once” semantics that CTEs provide. This ensures constant outcomes when utilizing non-deterministic expressions within the question.

For instance, take into account a situation the place there’s a member_price for every product within the above merchandise desk. We want to apply a random low cost share between 0% and 5% to every product after which calculate the discounted worth of each the worth and member_price. This train ought to assure that the low cost share utilized to each costs stays the identical.

With LCA, we will write:


SELECT id, worth, member_price,
       1 - RAND() * 0.05 AS discounted_rate,
       CAST(discounted_rate * worth AS DECIMAL(10, 2)) AS adjusted_price,
       CAST(discounted_rate * member_price AS DECIMAL(10, 2))
         AS adjusted_member_price
FROM merchandise
WHERE id = 3;

| 3 | 200.00 | 190.00 | 0.961144856978617 | 192.23 | 182.62 |

On this instance, Databricks calculates the discounted_rate as soon as, and this worth stays the identical by means of all subsequent references together with the calculation of adjusted_price and adjusted_member_price.

Alternatively, if we’re merely copying non-deterministic expressions, this conduct doesn’t apply as a result of it will consider every expression individually, inflicting inconsistent low cost charges for the 2 costs:


SELECT id, title, worth, member_price,
       CAST((1 - RAND() * 0.05) * worth AS DECIMAL(10, 2)) AS adjusted_price,
       CAST((1 - RAND() * 0.05) * member_price AS DECIMAL(10, 2))
         AS adjusted_member_price
FROM merchandise
WHERE id = 3;

| 3 | 200.00 | 190.00 | 191.44 | 184.32 |

Attempt LCA!

In abstract, Lateral Column Alias is a robust function that considerably simplifies SQL queries by permitting customers to outline a named alias over an expression tree after which reference this alias later throughout the similar SELECT clause.

  • This protects repeating the identical expressions a number of occasions or the necessity for subqueries or CTEs, as a substitute producing concise and readable SELECT queries.
  • It’s appropriate with all types of expressions and sophisticated information sorts. The SQL syntax helps chaining these aliases for better flexibility as nicely.
  • It ensures that every non-deterministic expression is evaluated solely as soon as, thus enabling constant outcomes throughout a number of references.

LCA is absolutely out there and enabled by default in Databricks Runtime 12.2 LTS and later, in Databricks SQL 2023.20 and above, and Apache Spark 3.4.

Learn Extra

  • Decision order
    Curious readers could also be within the title decision order in SQL queries with the introduction of LCA. This Databricks Identify decision doc (AWS, Azure) defines a transparent set of ordered guidelines and concrete examples to resolve references, together with the position of LCA on this course of.

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles