Wednesday, June 6, 2012

COUNT DISTINCT in Access: Part 5

Comparison of the Methods

SQL Server has a nice built-in function called COUNT DISTINCT, which is missing in Access SQL.

Over the last four posts, I've discussed different ways to simulate it in Access :

  1. Subqueries the FROM Clause
  2. Subqueries in the Field List
  3. User-Defined Function
  4. Crosstab Query (reader submitted method)

Each of these methods have advantages and disadvantages, and as promised, I'll address them here.

Subquery in FROM Clause

The main advantage of creating a subquery in the FROM clause is ease of use, that is, it's the easiest to figure out. It's possible to approach it step-wise by first removing the duplicates from the list to be counted.  See Problem 1 in Subqueries the FROM Clause.

The main disadvantage is lack of flexibility.  As I showed in Problems 2 and 3 in Subqueries the FROM Clause, you can't easily created other levels of aggregation nor simply add a second aggregate to the same level.  The reason is you've pre-limited the values available, so you need to create extra levels of subqueries to compensate.

Over all, this method is useful for simple distinct counts, but not for more complex ones.

Subquery in Field List

The main advantage of create a subquery in the Field List is it's flexibility.  Once you've created the initial query, adding a second aggregate can be added just like in any other aggregate query, that is, just add another field to the field list with an aggregate function.

The main disadvantage is that it's a little harder to figure out in the first place.  The subquery must be a correlated subquery, which is conceptually more difficult.  A correlated subquery is evaluated for each row in the main query, so it must be tied to the main query, and that's a more difficult concept.

This method is useful for more complex queries.  It also more closely simulates the T-SQL Count Distinct, which also works at the Field List level.  This means that if you need to upsize this query to T-SQL, it's as simple as replacing the subquery with the COUNT DISTINCT.

User-Defined Function

One advantage of with user-defined function is that once created, you don't have to figure out how to do a subquery for each query.  You can simply call the function and send in the appropriate values.  It also appears to perform fairly well, but I'll address performance below.

The main disadvantage is that it's fairly easy to produce an incorrect result. The Where and Group By arguments must match the main Where and Group By clauses of the main query or the value will be be incorrect.  However, In an extremely complex query, this method may be useful to reduce the level of complexity in the main query.

Crosstab Query Method

If you're familiar with crosstab queries, this method is nice and clean.  And it performs very well (see below).

But as I've noted before, the main problem is that you can't add additional aggregates.  For instance, you can't show the sum of one field and the count distinct of another in the same query.  This makes it more limited than the other methods.

Performance

First of all, it's silly to talk about performance without discussing indexes.  On non-indexed fields, each of the methods will perform much worse.  In this case, I indexed all of the fields involved in the aggregation: OrderID (primary key), Customer, and Order_Date.

When each of the methods were run against the sample data listed, they all ran nearly instantaneously.  That table consists of 4 customers and 3 days, totaling 22 records.

To test the performance, I created a table consisting of 19 customers over the course of 30 days totaling 113,000 rows.  Then I wrote a subroutine that opens each query, recording the time when it opens and when the query completes.  The code looks like this:

Sub test()
Dim starttime As Date
Dim endtime As Date

'test From
starttime = Now
DoCmd.OpenQuery "TestFROMLarge"
endtime = Now
Debug.Print "TestFROMLarge: " & DateDiff("s", starttime, endtime)
DoCmd.Close acQuery, "TestFROMLarge"

'test UDF
starttime = Now
DoCmd.OpenQuery "TestUDF_Large"
endtime = Now
Debug.Print "TestUDF_Large: " & DateDiff("s", starttime, endtime)
DoCmd.Close acQuery, "TestUDF_Large"

'test Xtab
starttime = Now
DoCmd.OpenQuery "XTab_Prob2_Large"
endtime = Now
Debug.Print "XTab_Prob2_Large: " & DateDiff("s", starttime, endtime)
DoCmd.Close acQuery, "XTab_Prob2_Large"

'test FieldList
starttime = Now
DoCmd.OpenQuery "TestFieldListLarge"
endtime = Now
Debug.Print "TestFieldListLarge: " & DateDiff("s", starttime, endtime)
DoCmd.Close acQuery, "TestFieldListLarge"

End Sub

Running this code against the 100K table produced this:

TestFROMLarge: 1
TestUDF_Large: 1
XTab_Prob2_Large: 1
TestFieldListLarge: 8

That gives me some information, but not enough.  So I created an even larger file.  Still using the same 19 customers, but with data spanning a whole year.  This new table had nearly 1 million records.

Running my test code against the 1M table produced this:

TestFROMLarge: 10
TestUDF_Large: 4
XTab_Prob2_Large: 6
TestFieldListLarge: 74

This file gives me the granularity to see differences.  The first three still execute within similar time frames.  The Field List method takes nearly 8 times longer.

Surprisingly, the User Defined Function performs the best of all of them.  My expectation would have been that it was the slowest.  Also surprisingly, the Field List method was the slowest.  I would have thought that a correlated subquery would execute faster.

If you'd like to test this for yourself, I've bundled this whole series with the database into a sample available my my website: http://www.rogersaccesslibrary.com/forum/countdistinctmdb-intermediate_topic595.html

Because of size considerations, it does not have the 1M table, so you'd have to create that yourself.

No comments: