String functions in BigQuery are used to manipulate and process text data. These functions help format, extract, and transform strings for data cleaning, reporting, or matching patterns. Below are the most commonly used string functions:
CONCAT()
- Explanation: The CONCAT() function concatenates (joins) two or more strings into a single string. It can be used to combine different text columns or string literals into one output.
- Usage:
- CONCAT(string1, string2, ..., stringN): Concatenates the input strings in the given order.
- Example:
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM customers;
This query combines the first_name and last_name columns to create a full_name string with a space in between.
- Multiple Strings:
You can concatenate multiple strings together:
SELECT CONCAT('Order ID: ', CAST(order_id AS STRING), ' - ', customer_name) AS order_info
FROM orders;
This example creates a string containing the order ID and customer name.
FORMAT()
- Explanation: The FORMAT() function formats strings by replacing placeholders with specified values. It works similarly to string formatting functions in other programming languages, like Python’s .format().
- Usage:
- FORMAT(format_string, value1, value2, ...): The format_string contains placeholders like %s (for strings), %d (for integers), or %f (for floating-point numbers), which are replaced by the respective values.
- Example:
SELECT FORMAT('Customer %s has made %d purchases.', customer_name, purchase_count) AS message
FROM customers;
This example outputs a message like "Customer John has made 5 purchases."
- Use for Number Formatting:
SELECT FORMAT('The total amount is $%.2f', total_amount) AS formatted_amount
FROM orders;
This formats a floating-point number (the total amount) to two decimal places.
SUBSTR()
- Explanation: The SUBSTR() function extracts a substring from a given string starting at a specific position for a given length. It is useful for breaking up larger strings or extracting specific parts of text.
- Usage:
- SUBSTR(string, start_position[, length]): Extracts the substring starting at start_position (1-based index). If length is provided, it extracts that many characters; otherwise, it returns the substring from the start position to the end of the string.
- Example:
SELECT SUBSTR(customer_name, 1, 3) AS name_initials
FROM customers;
This example extracts the first three characters from each customer_name.
- Without Length:
If you omit the length, it extracts from the start position to the end of the string:
SELECT SUBSTR(customer_name, 5) AS part_of_name
FROM customers;
This would return the string starting from the 5th character to the end.
UPPER()/LOWER()
- Explanation: These functions convert text to either uppercase or lowercase.
- UPPER(): Converts a string to all uppercase letters.
- LOWER(): Converts a string to all lowercase letters.
- Usage:
- UPPER(string): Converts all characters in the string to uppercase.
- LOWER(string): Converts all characters in the string to lowercase.
- Example:
SELECT UPPER(customer_name) AS upper_name
FROM customers;
This query converts all customer names to uppercase.
SELECT LOWER(email) AS normalized_email
FROM customers;
This query converts email addresses to lowercase for normalization purposes (e.g., for case-insensitive matching).
- Combining with Other Functions:
These functions are often used in combination with other string functions, such as:
SELECT CONCAT(UPPER(SUBSTR(customer_name, 1, 1)), LOWER(SUBSTR(customer_name, 2))) AS formatted_name
FROM customers;
This query capitalizes the first letter of the name while converting the rest to lowercase.
REGEXP_EXTRACT()
- Explanation: The REGEXP_EXTRACT() function extracts a substring that matches a regular expression pattern from a string. It is highly useful for extracting specific parts of text based on patterns, such as extracting numbers, emails, or specific tokens.
- Usage:
- REGEXP_EXTRACT(string, regex): The regex is a regular expression pattern, and the function extracts the first matching substring.
- Example:
SELECT REGEXP_EXTRACT(email, r'@(.+)$') AS email_domain
FROM customers;
This query extracts the domain part of an email address (everything after the @ symbol).
- More Complex Patterns:
You can use more advanced regular expressions to match more specific patterns. For example, to extract the first set of digits from a string:
SELECT REGEXP_EXTRACT(order_info, r'\d+') AS first_number
FROM orders;
This query extracts the first numeric value from the order_info field, which could be useful for finding order IDs or quantities.
Summary of Use Cases
These string functions are vital in handling text data, transforming values for reporting, or extracting specific parts of a string:
- CONCAT() is useful for combining columns or literals to create composite text values.
- FORMAT() is excellent for formatting messages or numbers into human-readable strings.
- SUBSTR() allows you to extract specific portions of a string, which can be helpful in parsing and working with structured text.
- UPPER() and LOWER() are often used for standardizing text (e.g., normalizing email addresses).
- REGEXP_EXTRACT() is a powerful tool for extracting information from structured or semi-structured text data using regular expressions.
These functions provide flexible and powerful ways to manipulate strings for data transformation and analysis.