. : Analytic functions in BigQuery in Google Cloud Platform (GCP)

Analytic functions in BigQuery perform calculations across a set of rows related to the current row. These functions are essential for ranking, accessing data from other rows, and calculating relative positions. They are commonly used in reporting, leaderboards, time-series analysis, and cumulative calculations.

ROW_NUMBER()

- Explanation: The ROW_NUMBER() function assigns a unique, sequential integer to rows within a result set, starting from 1. The numbering is based on the order specified in the OVER() clause.

- Usage:

- ROW_NUMBER() OVER (PARTITION BY partition_column ORDER BY sort_column): Assigns a unique number to each row within a partition and orders them by the specified column.

- Example:

SELECT order_id, customer_id, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS row_num

FROM orders;

This query assigns a unique number to each order for every customer, ordered by the date of the order.

- Practical Use:

ROW_NUMBER() is useful for identifying the first or last occurrence of an event or for numbering records. For example, finding the most recent order for each customer:

WITH ranked_orders AS (

SELECT order_id, customer_id, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rank

FROM orders

)

SELECT order_id, customer_id

FROM ranked_orders

WHERE rank = 1;

This query returns the most recent order for each customer.

RANK()

- Explanation: The RANK() function assigns a rank to rows based on the values in a specific column. Unlike ROW_NUMBER(), RANK() assigns the same rank to rows with the same value and skips subsequent ranks. It’s useful for ranking rows where ties need to be handled.

- Usage:

- RANK() OVER (PARTITION BY partition_column ORDER BY sort_column): Assigns a rank to each row within a partition, ordered by the specified column.

- Example:

SELECT customer_id, order_amount, RANK() OVER (ORDER BY order_amount DESC) AS rank

FROM orders;

This query ranks orders by their order_amount in descending order. If two orders have the same amount, they will get the same rank, and the next rank will be skipped (e.g., if two rows are ranked 1, the next row gets rank 3).

- Practical Use:

RANK() is useful for creating leaderboards or ranking lists, especially when there are ties. For example, ranking employees by performance:

SELECT employee_id, sales, RANK() OVER (ORDER BY sales DESC) AS sales_rank

FROM employees;

LEAD() / LAG()

- Explanation: The LEAD() and LAG() functions allow access to data from a subsequent or preceding row in a result set. These functions are useful for comparisons between rows, such as identifying changes in values or calculating differences between periods.

- LEAD(): Accesses the data from the following row.

- LAG(): Accesses the data from the previous row.

- Usage:

- LEAD(column, offset, default_value) OVER (PARTITION BY partition_column ORDER BY sort_column): Fetches the value from a subsequent row.

- LAG(column, offset, default_value) OVER (PARTITION BY partition_column ORDER BY sort_column): Fetches the value from a preceding row.

- Example:

SELECT order_id, order_date, LAG(order_date, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS previous_order_date

FROM orders;

This query returns the date of each order along with the date of the previous order for the same customer.

SELECT order_id, order_amount, LEAD(order_amount) OVER (ORDER BY order_date) AS next_order_amount

FROM orders;

This query retrieves the order amount and the amount for the following order in chronological order.

- Practical Use:

LEAD() and LAG() are widely used for calculating differences between rows, such as time differences between events or tracking changes in values over time:

SELECT order_id, order_amount, LAG(order_amount) OVER (ORDER BY order_date) AS previous_order,

(order_amount - LAG(order_amount) OVER (ORDER BY order_date)) AS change_in_order

FROM orders;

This query calculates the change in order amount from one order to the next.

PERCENT_RANK()

- Explanation: The PERCENT_RANK() function calculates the relative rank of a row within its partition as a percentage. This function is useful for comparing rows across a range of values and is particularly valuable in percentile calculations or distribution analysis.

- Usage:

- PERCENT_RANK() OVER (PARTITION BY partition_column ORDER BY sort_column): Computes the relative rank of each row within a partition.

- Example:

SELECT order_id, order_amount, PERCENT_RANK() OVER (ORDER BY order_amount) AS percent_rank

FROM orders;

This query calculates the relative rank of each order based on order_amount, where the rank is expressed as a percentage between 0 and 1.

- How It Works:

The first row has a percent rank of 0, and the last row has a percent rank of 1. Rows in between have a rank that is a fractional percentage based on the total number of rows.

- Practical Use:

PERCENT_RANK() is often used for statistical analysis, like calculating the distribution of sales performance:

SELECT employee_id, sales, PERCENT_RANK() OVER (ORDER BY sales) AS sales_percent_rank

FROM employees;

This query assigns a relative rank (as a percentage) to each employee based on their sales performance.

Summary of Use Cases

Analytic functions in BigQuery are essential for working with ranked data and accessing values from other rows:

- ROW_NUMBER() is useful for generating unique row numbers and finding the first or last occurrence of an event.

- RANK() helps rank data with ties, making it suitable for leaderboards and ranked lists.

- LEAD() and LAG() provide access to data from preceding or following rows, which is invaluable for time series analysis and tracking changes.

- PERCENT_RANK() calculates relative ranks as percentages, useful for analyzing data distributions and percentiles.

These functions empower users to perform sophisticated analytical calculations over large datasets.

.

Friday, 20 September 2024

Analytic functions in BigQuery in Google Cloud Platform (GCP)

No comments:

Post a Comment