Window functions perform calculations across rows that are related to the current row, but without collapsing the rows into one. Two key window functions in BigQuery include NTILE() and CUME_DIST(), which are used for grouping and distribution analysis.
NTILE()
- Explanation: The NTILE() function divides the rows in a result set into a specified number of groups or "buckets" and assigns a bucket number to each row. This function is useful when you want to break a dataset into equally sized groups.
- Usage:
- NTILE(n) OVER (PARTITION BY partition_column ORDER BY sort_column): Divides the result set into n groups and assigns a group number to each row.
- Example:
SELECT employee_id, salary, NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile
FROM employees;
This query divides employees into 4 salary quartiles (groups), based on their salary, assigning each employee a group number between 1 and 4. The highest salaries will be in the 1st quartile and the lowest in the 4th.
- Practical Use:
NTILE() is useful when you want to create percentiles or quartiles for a dataset. For instance, dividing customers into top, middle, and bottom spenders:
SELECT customer_id, total_spent, NTILE(3) OVER (ORDER BY total_spent DESC) AS spending_group
FROM customers;
This divides customers into three groups based on how much they have spent.
CUME_DIST()
- Explanation: The CUME_DIST() function calculates the cumulative distribution of a value in a result set. It returns the relative position of each row within the partition as a value between 0 and 1, indicating the fraction of rows that have a value less than or equal to the current row’s value.
- Usage:
- CUME_DIST() OVER (PARTITION BY partition_column ORDER BY sort_column): Calculates the cumulative distribution of a value in the result set.
- Example:
SELECT employee_id, salary, CUME_DIST() OVER (ORDER BY salary DESC) AS salary_cume_dist
FROM employees;
This query calculates the cumulative distribution of each employee’s salary, showing the fraction of employees that have a salary less than or equal to the current row's salary.
- How It Works:
The cumulative distribution starts from 0 and moves towards 1. For example, the employee with the highest salary will have a cumulative distribution closer to 1, and the employee with the lowest salary will have a value closer to 0.
- Practical Use:
CUME_DIST() is valuable for understanding how values are distributed within a dataset. For instance, if you want to see how a student’s test score ranks relative to the rest of the class:
SELECT student_id, test_score, CUME_DIST() OVER (ORDER BY test_score DESC) AS test_rank
FROM test_scores;
This query calculates how each student's test score ranks compared to others in the class, with scores distributed between 0 and 1.
Summary of Use Cases
Window functions like NTILE() and CUME_DIST() are useful in ranking and distribution analysis:
- NTILE() is great for dividing rows into equal groups, such as creating percentiles, quartiles, or deciles.
- CUME_DIST() helps measure the relative position or cumulative distribution of a row within a dataset, useful for ranking and comparison purposes.
These functions enable deeper insights into ranking and distribution patterns in datasets.