SQL is one of the most in-demand skills that you find in job descriptions from the data domain industry, whether you’re applying for a data analyst, a data engineer, a data scientist, or any other role.

To demonstrate the importance of SQL for data-related jobs, I have analyzed more than 72,000 job listings on Indeed, looking at key skills mentioned in job posts with “data” in the title.

As we can see, SQL is the most in-demand skill among all jobs in data, appearing in 45.4% of job postings.

Interestingly, the proportion of data jobs listing SQL has been increasing! I performed this same analysis multiple times over the years, and here’s what I found:

  • 2017: 35.7% (SQL #1)
  • 2021: 42.7% (SQL #1)
  • 2022: 45.4% (SQL #1)

But why exactly is it that it is so frequently used. And why isn’t it dead even though it has been around for such a long time?

There are several reasons: one of the main reasons would be that companies mostly store data in Relational Database Management Systems (RDBMS) or in Relational Data Stream Management Systems (RDSMS) and you need SQL to access that data. SQL is the lingua franca of data: it gives you the ability to interact with almost any database or to even build your own locally!

Write Fast SQL Query

SQL is a declarative language, that is, each query declares what we want the SQL engine to do, but it doesn’t say how. As it turns out, the how — the “plan” — is what affects the efficiency of the queries, however, so it’s pretty important.

An inefficient query will drain the production database’s resources, and cause slow performance or loss of service for other users if the query contains errors. It’s vital you optimize your queries for minimum impact on database performance.

There is no step-by-step guide for the same. In turn, we must use general guidelines for writing queries, and which operators to use.

Benefits of Query Optimization

  • Minimize production issues: Inefficient queries require high CPU, memory, and I/O operations that mostly lead to deadlocks, failed transactions, system hangs, and other production issues.
  • Performance issues: Slow query means slower response time of applications using the query, which results in poor user experience.
  • Save infra cost: As non-optimized query requires more CPU and memory, they lead to higher infrastructure cost.

Query Optimization Best Practices

Some of the commonly used tips to optimize SQL queries are

Query 1: Use Column Names Instead of * in a SELECT Statement
If you are selecting only a few columns from a table there is no need to use SELECT *. Though this is easier to write, it will cost more time for the database to complete the query. By selecting only the columns you need, you are reducing the size of the result table, reducing the network traffic, and also in turn boosting the overall performance of the query.

Original QueryImproved Query
SELECT * FROM SH.Sales;SELECT s.prod_id FROM SH.sales s;

Query 2: Avoid including a HAVING clause in SELECT statements
The HAVING clause is used to filter the rows after all the rows are selected and it is used like a filter. It is quite useless in a SELECT statement. It works by going through the final result table of the query parsing out the rows that don’t meet the HAVING condition.

Original QueryImproved Query
SELECT s.cust_id,count(s.cust_id)
FROM SH.sales s
GROUP BY s.cust_id
HAVING s.cust_id != ‘1660’ AND s.cust_id != ‘2’;
SELECT s.cust_id,count(cust_id)
FROM SH.sales s
WHERE s.cust_id != ‘1660’
AND s.cust_id !=’2′
GROUP BY s.cust_id;

Query 3: Eliminate Unnecessary DISTINCT Conditions
Considering the case of the following example, the DISTINCT keyword in the original query is unnecessary because the table_name contains the primary key p.ID, which is part of the result set.

Original QueryImproved Query
SELECT DISTINCT * FROM SH.sales s
JOIN SH.customers c
ON s.cust_id= c.cust_id
WHERE c.cust_marital_status = ‘single’;
SELECT * FROM SH.sales s JOIN
SH.customers c
ON s.cust_id = c.cust_id
WHERE c.cust_marital_status=’single’;

Query 4: Un-nest subqueries

Rewriting nested queries as joins often leads to more efficient execution and more effective optimization. In general, sub-query un-nesting is always done for correlated sub-queries with, at most, one table in the FROM clause, which is used in ANY, ALL, and EXISTS predicates. An uncorrelated sub-query, or a sub-query with more than one table in the FROM clause, is flattened if it can be decided, based on the query semantics, that the sub-query returns at most one row.

Original QueryImproved Query
SELECT *
FROM SH.products p
WHERE p.prod_id =
(SELECT s.prod_id
FROM SH.sales s
WHERE s.cust_id = 100996
AND s.quantity_sold = 1 );
SELECT p.*
FROM SH.products p, sales s
WHERE p.prod_id = s.prod_id
AND s.cust_id = 100996
AND s.quantity_sold = 1;

Query 5: Consider using an IN predicate when querying an indexed column

The IN-list predicate can be exploited for indexed retrieval and also, the optimizer can sort the IN-list to match the sort sequence of the index, leading to more efficient retrieval. Note that the IN-list must contain only constants or values that are constant during one execution of the query block, such as outer references.

Original QueryImproved Query
Original query:
SELECT s.*
FROM SH.sales s
WHERE s.prod_id = 14
OR s.prod_id = 17;
SELECT s.*
FROM SH.sales s
WHERE s.prod_id IN (14, 17);

Query 6: Use EXISTS instead of DISTINCT when using table joins that involve tables having one-to-many relationships

The DISTINCT keyword works by selecting all the columns in the table and then parsing out any duplicates. Instead, if you use a sub query with the EXISTS keyword, you can avoid having to return an entire table.

Original QueryImproved Query
SELECT DISTINCT c.country_id, c.country_name
FROM SH.countries c,SH.customers e
WHERE e.country_id = c.country_id;
SELECT c.country_id, c.country_name
FROM SH.countries c
WHERE EXISTS (SELECT ‘X’ FROM SH.customers e WHERE e.country_id = c.country_id);

Query 7: Try to use UNION ALL in place of UNION

The UNION ALL statement is faster than UNION, because UNION ALL statement does not consider duplicates, and the UNION statement does look for duplicates in a table while selecting rows, whether or not they exist.

Original QueryImproved Query
SELECT cust_id
FROM SH.sales
UNION
SELECT cust_id
FROM customers;
SELECT cust_id
FROM SH.sales
UNION ALL
SELECT cust_id
FROM customers;

Query 8: Avoid using OR in join conditions

Any time you place an ‘OR’ in the join condition, the query will slow down by at least a factor of two.

Original QueryImproved Query
SELECT *
FROM SH.costs c
INNER JOIN SH.products p ON c.unit_price = p.prod_min_price OR c.unit_price = p.prod_list_price;
SELECT *
FROM SH.costs c
INNER JOIN SH.products p ON c.unit_price = p.prod_min_price
UNION ALL
SELECT *
FROM SH.costs c
INNER JOIN SH.products p ON c.unit_price = p.prod_list_price;

Query 9: Avoid functions on the right-hand side of the operator

Functions or methods are used very often with their SQL queries. Rewriting the query by removing aggregate functions will increase the performance tremendously.

Original QueryImproved Query
SELECT *
FROM SH.sales
WHERE EXTRACT (YEAR FROM TO_DATE (time_id, ‘DD-MON-RR’)) = 2001 AND EXTRACT (MONTH FROM TO_DATE (time_id, ‘DD-MON-RR’)) =12;
SELECT * FROM SH.sales
WHERE TRUNC (time_id) BETWEEN TRUNC(TO_DATE(‘12/01/2001’, ’mm/dd/yyyy’)) AND TRUNC (TO_DATE (‘12/30/2001’,’mm/dd/yyyy’));

Query 10: Remove any redundant mathematics

There will be times when you will be performing mathematics within an SQL statement. They can be a drag on the performance if written improperly. For each time the query finds a row it will recalculate the math. So eliminating any unnecessary math in the statement will make it perform faster.

Original queryImproved query
SELECT *
FROM SH.sales s
WHERE s.cust_id + 10000 < 35000;
SELECT *
FROM SH.sales s
WHERE s.cust_id < 25000;

Query 11: Use ‘regexp_like’ to replace ‘LIKE’ clauses

Original queryImproved query
SELECT *
FROM
table1
WHERE
lower(item_name) LIKE ‘%samsung%’ OR
lower(item_name) LIKE ‘%xiaomi%’ OR
lower(item_name) LIKE ‘%iphone%’ OR
lower(item_name) LIKE ‘%huawei%’
SELECT *
FROM
table1
WHERE
REGEXP_LIKE(lower(item_name),
‘samsung|xiaomi|iphone|huawei’)

Query 11: Use ‘regexp_extract’ to replace ‘Case-when Like’

Original queryImproved query
SELECT
CASE
WHEN concat(‘ ‘,item_name,’ ‘) LIKE ‘%acer%’ then ‘Acer’
WHEN concat(‘ ‘,item_name,’ ‘) LIKE ‘%advance%’ then ‘Advance’
WHEN concat(‘ ‘,item_name,’ ‘) LIKE ‘%alfalink%’ then ‘Alfalink’

AS brand
FROM item_list
SELECT
regexp_extract(item_name,'(asus|lenovo|hp|acer|dell|zyrex|…)’)
AS brand
FROM item_list

Query 12: Always order your JOINs from the largest tables to the smallest tables

Original queryImproved query
SELECT
*
FROM
small_table
JOIN
large_table
ON small_table.id = large_table.id
SELECT
*
FROM
large_table
JOIN
small_table
ON small_table.id = large_table.id

Query 13: Always “GROUP BY” by the attribute/column with the largest number of unique entities/values

Original queryImproved query
select
main_category,
sub_category,
itemid,
sum(price)
from
table1
group by
main_category, sub_category, itemid
select
main_category,
sub_category,
itemid,
sum(price)
from
table1
group by
itemid, sub_category, main_category

Query 14: Use Max instead of Rank

Original queryImproved query
SELECT *
from (
select
userid,
rank() over (order by prdate desc) as rank
from table1
)
where ranking = 1
SELECT userid, max(prdate)
from table1
group by 1

Other Optimization Technique

  • Choose Datatype wisely: Using Varchar(10) for storing fixed-length data (like Mobile Number) will incur a 20% higher query executing penalty when compared to using char(10) for fixed-length data.
  • Avoid using implicit conversion: When a query compares different data types, it uses implicit datatype conversion. Hence, in queries, one should avoid comparing different datatype columns and values.
  • Avoid using DISTINCT and GROUP BY at the same time: If you already have GROUP BY in your query, there is no need to having DISTINCT separately.
  • Avoid using Order by in the Sub-query: Ordering in sub-queries is mostly redundant and leads to significant performance issues.
  • Minimize use of SELF Joins: Self-joins are computationally more expensive and can be replaced with the Window function in many cases.
  • Don’t join tables with OR conditions: It can be optimized by using UNION ALL in place of OR-based joins.
  • Avoid join with not equal condition: When a query joins with NOT EQUAL the operator searches all rows and uses a full table scan which is highly inefficient.
  • Avoid Full-Text Search: When a query searches for keywords with a wildcard in the beginning it does not utilize an index, and the database is tasked with searching all records for a match anywhere within the selected field. Hence, if needed prefer to use wildcards at the end of a phrase only.
  • Use approx_distinct() instead of count(distinct) for very large datasets
  • Use approx_percentile(metric, 0.5) for median

Query optimization is a common task performed by database administrators and application designers in order to tune the overall performance of the database system. The purpose of this paper is to provide SQL scenarios to serve as a quick and easy reference guide during the development phase and maintenance of database queries. Even if you have a powerful infrastructure, the performance can be significantly degraded by inefficient queries. Query optimization has a very big impact on the performance of a DBMS and it continuously evolves with new, more sophisticated optimization strategies. So, we should try to follow the general tips mentioned above to get better performance of queries. Optimization can be achieved with some effort if we make it a general practice to follow the rules. The main focus was on query optimizations.

Please follow our other blogs related to SQL:

References

  1. 10 Ways to Improve SQL Query Performance
  2. 15 Ways to Optimize Your SQL Queries
  3. Optimize SQL Server queries with these advanced tuning techniques.
  4. Query Optimization Techniques in Microsoft SQL

0 Comments

Leave a Reply

Avatar placeholder

Your email address will not be published.