.

Window functions perform calculations across rows that are related to the current row, but without collapsing the rows into one. Two key window functions in BigQuery include NTILE() and CUME_DIST(), which are used for grouping and distribution analysis.

NTILE()

- Explanation: The NTILE() function divides the rows in a result set into a specified number of groups or "buckets" and assigns a bucket number to each row. This function is useful when you want to break a dataset into equally sized groups.

- Usage:

- NTILE(n) OVER (PARTITION BY partition_column ORDER BY sort_column): Divides the result set into n groups and assigns a group number to each row.

- Example:

SELECT employee_id, salary, NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile

FROM employees;

This query divides employees into 4 salary quartiles (groups), based on their salary, assigning each employee a group number between 1 and 4. The highest salaries will be in the 1st quartile and the lowest in the 4th.

- Practical Use:

NTILE() is useful when you want to create percentiles or quartiles for a dataset. For instance, dividing customers into top, middle, and bottom spenders:

SELECT customer_id, total_spent, NTILE(3) OVER (ORDER BY total_spent DESC) AS spending_group

FROM customers;

This divides customers into three groups based on how much they have spent.

CUME_DIST()

- Explanation: The CUME_DIST() function calculates the cumulative distribution of a value in a result set. It returns the relative position of each row within the partition as a value between 0 and 1, indicating the fraction of rows that have a value less than or equal to the current row’s value.

- Usage:

- CUME_DIST() OVER (PARTITION BY partition_column ORDER BY sort_column): Calculates the cumulative distribution of a value in the result set.

- Example:

SELECT employee_id, salary, CUME_DIST() OVER (ORDER BY salary DESC) AS salary_cume_dist

FROM employees;

This query calculates the cumulative distribution of each employee’s salary, showing the fraction of employees that have a salary less than or equal to the current row's salary.

- How It Works:

The cumulative distribution starts from 0 and moves towards 1. For example, the employee with the highest salary will have a cumulative distribution closer to 1, and the employee with the lowest salary will have a value closer to 0.

- Practical Use:

CUME_DIST() is valuable for understanding how values are distributed within a dataset. For instance, if you want to see how a student’s test score ranks relative to the rest of the class:

SELECT student_id, test_score, CUME_DIST() OVER (ORDER BY test_score DESC) AS test_rank

FROM test_scores;

This query calculates how each student's test score ranks compared to others in the class, with scores distributed between 0 and 1.

Summary of Use Cases

Window functions like NTILE() and CUME_DIST() are useful in ranking and distribution analysis:

- NTILE() is great for dividing rows into equal groups, such as creating percentiles, quartiles, or deciles.

- CUME_DIST() helps measure the relative position or cumulative distribution of a row within a dataset, useful for ranking and comparison purposes.

These functions enable deeper insights into ranking and distribution patterns in datasets.

.

Friday, 20 September 2024

Window functions in BigQuery in Google Cloud Platform (GCP)