Wednesday, 25 February 2015

when using an OUTER JOIN, you should put criteria on the outer table in the join condition, not in the WHERE clause

As this nice SQLTeam article explains, when using an OUTER JOIN, you should put criteria on the outer table in the join condition, not in the WHERE clause.  However, I often see a "workaround" to avoid this simple and solid rule, which might seem to work but actually doesn't.  Since it is hard to explain why over and over in forum posts, I thought it might be helpful to address that here once and for all with an example.  Confused? Stick with me, I promise this will make sense eventually.

Suppose we have the following tables:

create table Depts  (deptID int, DeptName varchar(20))
create table Emps (empID int, EmpName varchar(20), deptID int, salary money)

insert into Depts (deptID, deptName)
select 1, 'Dept A' union all
select 2, 'Dept B' union all
select 3, 'Dept C'

insert into Emps (empID, EmpName, DeptID, salary)
select 1, 'Emp 1', 1, 50 union all
select 2, 'Emp 2', 1, 75 union all
select 3, 'Emp 3', 2, 60 union all
select 3, 'Emp 4', 2, 45

Notice the following:
  • We have 3 departments
  • Each employee is assigned to a department
  • There are no employees assigned to 'Dept C'
Now, if we want to select all departments along with any employees that happen to be in that department, we use an OUTER JOIN since an INNER JOIN will only return departments with matching employees:

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    Emps e on e.deptID = d.deptID

deptID      DeptName             empID       empName              salary

----------- -------------------- ----------- -------------------- ---------
1           Dept A               1           Emp 1                50.00
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00
2           Dept B               3           Emp 4                45.00
3           Dept C               NULL        NULL                 NULL

(5 row(s) affected)


Ok, so far so good.  Now, what if we wish to return only employees with a salary of >50?  If we add the criteria to our WHERE clause (WHERE e.salary > 50) this means that the last row in the above result -- 'Dept C' --  will not be returned, since the salary column is null:

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    Emps e on e.deptID = d.deptID
where
    e.salary > 50

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00

(2 row(s) affected)


Thus, as Mark's article explains quite well, you simply put that criteria directly into the JOIN condition and you get the results you need:

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    Emps e on e.deptID = d.deptID and e.salary > 50

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00
3           Dept C               NULL        NULL                 NULL

(3 row(s) affected)

However, you might be tempted to look at your data and think that to return these results, instead of putting the criteria in the JOIN, you can simply handle that NULL situation in your WHERE clause like this:

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    Emps e on e.deptID = d.deptID 
where
    e.salary > 50 or e.empID is null

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00
3           Dept C               NULL        NULL                 NULL

(3 row(s) affected)


That works, right?  The NULL results are returned, and we only have employees with a salary above 50.  Case closed!  This would seem to indicate that we can indeed reference our outer tables in our WHERE clause, we just have to make sure that we deal with the NULL's that come back so that we don't filter them out.

However, that does not work!  Why not?  Well, right now we have no employees in Dept C at all, but what if we add one, with a salary below $50?  Let's try it:

insert into Emps (empID, EmpName, deptID, salary) values(4, 'Emp 5', 3, 35)

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    Emps e on e.deptID = d.deptID 
where
    e.salary > 50 or e.empID is null

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00

(2 row(s) affected)

What happened here?  Where did Dept C go?  Suddenly, our SQL statement no longer returns the desired results.  Why not?  Well, like pretty much all programming bugs, it's because that silly computer did exactly what you told it to do.  Let's look at our data again, without any WHERE clause at all:

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    Emps e on e.deptID = d.deptID 

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               1           Emp 1                50.00
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00
2           Dept B               3           Emp 4                45.00
3           Dept C               4           Emp 5                35.00

(5 row(s) affected)

Notice that we have no NULL values now, and there is at least 1 employee for each department.  Now, let's go through those results and manually apply our WHERE clause, row by row, removing all rows that don't meet the criteria:

where e.salary > 50 or e.empID is null

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               1           Emp 1                50.00
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00
2           Dept B               3           Emp 4                45.00
3           Dept C               4           Emp 5                35.00


So, you can see now that your WHERE clause simply was not correctly written!  What condition, then, it is applying on our data?  Well, remember that, when using OUTER JOINS, the expression "outertable.primarykeycolumn is null" means "there is no matching row in the outer table meeting the join criteria".  Thus, our WHERE clause really reads:

"return all departments and employees with a salary > 50, or in which there are no employees at all."

So, don't try to use this "trick" to avoid putting criteria in your outer joins.   Of course, this also applies if you use ISNULL, CASE or COALESCE to handle the nulls in your WHERE clause -- the same (wrong) results will be returned.

 The rule is very simple and very easy to remember:

Never reference OUTER JOIN-ed tables in your WHERE clause.

The exception is when you want to test if your outer join expression matched any rows at all; i.e.,  checking to see if the outer table's primary key column is NULL.  Otherwise, put any criteria on your outer tables within the join expression itself.

. . .

Side note: To apply criteria to your outer tables, in addition to the JOIN clause, you could also use a derived table or CTE, like this:

select d.deptID, d.DeptName, e.empID, e.empName, e.salary
from Depts d
left outer join
    (select * from Emps where salary > 50) e on e.deptID = d.deptID 

deptID      DeptName             empID       empName              salary
----------- -------------------- ----------- -------------------- ---------------------
1           Dept A               2           Emp 2                75.00
2           Dept B               3           Emp 3                60.00
3           Dept C               NULL        NULL                 NULL

(3 row(s) affected)

...

Any questions or comments are welcome!  I hope this helps clear a few things up, and if you've ever used the "where ... or outertable.pk is null" technique in the past, you might want to double-check your code and your results, they may not be what you expect!

see also:

No comments:

Post a Comment