. : Window functions in BigQuery in Google Cloud Platform (GCP)

Window functions perform calculations across rows that are related to the current row, but without collapsing the rows into one. Two key window functions in BigQuery include NTILE() and CUME_DIST(), which are used for grouping and distribution analysis.

NTILE()

- Explanation: The NTILE() function divides the rows in a result set into a specified number of groups or "buckets" and assigns a bucket number to each row. This function is useful when you want to break a dataset into equally sized groups.

- Usage:

- NTILE(n) OVER (PARTITION BY partition_column ORDER BY sort_column): Divides the result set into n groups and assigns a group number to each row.

- Example:

SELECT employee_id, salary, NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile

FROM employees;

This query divides employees into 4 salary quartiles (groups), based on their salary, assigning each employee a group number between 1 and 4. The highest salaries will be in the 1st quartile and the lowest in the 4th.

- Practical Use:

NTILE() is useful when you want to create percentiles or quartiles for a dataset. For instance, dividing customers into top, middle, and bottom spenders:

SELECT customer_id, total_spent, NTILE(3) OVER (ORDER BY total_spent DESC) AS spending_group

FROM customers;

This divides customers into three groups based on how much they have spent.

CUME_DIST()

- Explanation: The CUME_DIST() function calculates the cumulative distribution of a value in a result set. It returns the relative position of each row within the partition as a value between 0 and 1, indicating the fraction of rows that have a value less than or equal to the current row’s value.

- Usage:

- CUME_DIST() OVER (PARTITION BY partition_column ORDER BY sort_column): Calculates the cumulative distribution of a value in the result set.

- Example:

SELECT employee_id, salary, CUME_DIST() OVER (ORDER BY salary DESC) AS salary_cume_dist

FROM employees;

This query calculates the cumulative distribution of each employee’s salary, showing the fraction of employees that have a salary less than or equal to the current row's salary.

- How It Works:

The cumulative distribution starts from 0 and moves towards 1. For example, the employee with the highest salary will have a cumulative distribution closer to 1, and the employee with the lowest salary will have a value closer to 0.

- Practical Use:

CUME_DIST() is valuable for understanding how values are distributed within a dataset. For instance, if you want to see how a student’s test score ranks relative to the rest of the class:

SELECT student_id, test_score, CUME_DIST() OVER (ORDER BY test_score DESC) AS test_rank

FROM test_scores;

This query calculates how each student's test score ranks compared to others in the class, with scores distributed between 0 and 1.

Summary of Use Cases

Window functions like NTILE() and CUME_DIST() are useful in ranking and distribution analysis:

- NTILE() is great for dividing rows into equal groups, such as creating percentiles, quartiles, or deciles.

- CUME_DIST() helps measure the relative position or cumulative distribution of a row within a dataset, useful for ranking and comparison purposes.

These functions enable deeper insights into ranking and distribution patterns in datasets.

.

Friday, 20 September 2024

Window functions in BigQuery in Google Cloud Platform (GCP)

No comments:

Post a Comment