Below we have data from our Stack Overflow database.
SELECT TOP 10
U.AccountId,
U.DisplayName,
P.Title AS PostTitle
FROM
Posts AS P
INNER JOIN Users AS U ON P.OwnerUserId = U.Id
WHERE
PostTypeId = 1 /* 1 = Posts */
And this is what our data looks like.
Suppose I want to know how many posts each USER had. We can do that by using the GROUP BY.
What does GROUP BY do exactly? If you’ve done any type of programming, you’ve probably heard of a for each loop. For each object (whatever that is), do something. That’s how GROUP BY works.
Using this example, we can say: for each DisplayName, give me the count of posts.
SELECT TOP 10
U.DisplayName,
COUNT(*) AS PostsCreated
FROM
Posts AS P
INNER JOIN Users AS U ON P.OwnerUserId = U.Id
WHERE
PostTypeId = 1
GROUP BY
U.DisplayName
This is our output.
We’re not quite done yet. We need to validate our data. One question we can ask is, can multiple users have the same display name? We can use GROUP BY to figure that out.
SELECT
DisplayName,
COUNT(*) AS DisplayNameCount
FROM
Users
GROUP BY
DisplayName
ORDER BY
COUNT(*) desc
Whoa! Good thing we checked! The Display Name John is used 932 times! I bet you’re wondering what impact does that have on our initial group by statement? I’m so glad you asked! This has a HUGE impact.
Let’s take our validation one step further. We know that 932 users have a Display Name of John. Let’s see how many posts each John has. But… how do we do that? We’ll need to find a unique column for each User. The primary key of the Users table is a great option. Primary keys are always unique. This ensures that one id belongs to only one user. Let’s take a look at the below query.
SELECT
U.Id,
U.DisplayName,
count(*) AS PostsCreated
FROM
Posts AS P
INNER JOIN Users AS U ON P.OwnerUserId = U.Id
WHERE
PostTypeId = 1
AND U.DisplayName = 'John'
GROUP BY
U.Id,
U.DisplayName
ORDER BY
COUNT(*) DESC
Results of the query.
This paints us a very different picture. The first John 167 posts; second John has 124 posts. If we kept our initial query, we would have said the Display Name, John, has 2,443 posts, that isn’t each USER. All we need to do now is remove line 10 from our query.
SELECT
U.Id,
U.DisplayName,
COUNT(*) AS PostsCreated
FROM
Posts AS P
INNER JOIN Users AS U ON P.OwnerUserId = U.Id
WHERE
PostTypeId = 1
GROUP BY
U.Id,
U.DisplayName
ORDER BY
COUNT(*) DESC
I encourage you to start thinking about scenarios like this, because they do happen. You’ll quickly find out that a query that sounds easy, might not be as easy as you thought. It’s not all bad news, there’s a silver lining. Once you start asking customers questions like this, you’ll be able to nail down requirements precisely and have accurate data. It’s a win win!