Aggregating Data Within Groups

Log Parser

Aggregating Data Within Groups

All the query examples that we have seen so far share a common characteristic: the values of each output record were built upon the values of a single input record.
Sometimes, however, we might need to aggregate multiple input records together and perform some operation on groups of input records.
To accomplish this task, the Log Parser SQL-Like language has a special set of functions that can be used to perform basic calculations on multiple records. These aggregate functions (also referred to as "SQL functions") include SUM, COUNT, MAX, MIN, and AVG.

Aggregating Data

To show a classic example of the use of aggregate functions, assume that given an IIS W3C log file, we want to calculate the total number of bytes sent by the IIS server during the whole period recorded in the log file.
Considering that the number of bytes sent by the IIS server for each HTTP request is logged in the "sc-bytes" field, our command will look like the following example:

C:\>LogParser -i:IISW3C -o:NAT "SELECT SUM(sc-bytes) FROM ex040528.log"
Since the SELECT clause of this query makes use of the SUM aggregate function, the query will automatically aggregate all the input records, and calculate the sum of all the values of the "sc-bytes" field across all the input records; the output of this command will then look like the following output:
SUM(sc-bytes)
-------------
242834732
As the example shows, the result of the query is a single output record, containing a single value calculated across all the input records.


As another example, we might want to calculate how many requests have been logged in the log file.
Considering that each log file entry represents a single HTTP request, this task can be accomplished by simply counting how many input records are logged in the file:

C:\>LogParser -i:IISW3C -o:NAT "SELECT COUNT(*) FROM ex040528.log"
The example above makes use of the COUNT aggregate function. When used with the special "*" argument, the COUNT function returns the total number of input records processed by the query.

If we want to calculate how many requests satisfy a particular condition, for example how many requests were for an ASP page, we can add a WHERE clause to the query, and the COUNT function will only count input records satisfying the WHERE condition:

SELECT COUNT(*) FROM ex040528.log WHERE EXTRACT_EXTENSION(cs-uri-stem) LIKE 'asp'


Creating Groups

In the examples above, we have been using aggregate functions to calculate a value across all the input records; sometimes, however, we might want to calculate values across groups of input records.

As an example, we might want to calculate the total number of bytes sent by the IIS server for each URL. To perform this task, we need to divide all the input records into groups according to the URL requested, and then use the SUM aggregate function separately on each group.

This can be accomplished by using another building block of the Log Parser SQL language: the GROUP BY clause.
The GROUP BY clause is used to specify which fields we want the group subdivision to be based on; after the input records have been divided into these groups, all the aggregate functions in the SELECT clause will be calculated separately on each of these groups, and the query will return an output record for each group created.

Using the GROUP BY clause, our example query and its output will look like this:

SELECT cs-uri-stem, COUNT(*) FROM ex040528.log GROUP BY cs-uri-stem
cs-uri-stem            COUNT(*)
---------------------- --------
/Home/default.asp      5
/Home/images/bckgd.gif 419
/Docs/expl.htm         12
/Docs/main.htm         26
/login/frmx.dll        1


To make another example, assume that we want to calculate how many requests have been served for each page type (ASP, html, CSS, etc.).
First of all, we need to create separate groups according to the extension of the URL; after this group subdivision has been done, we can calculate a COUNT(*) on each group:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) 
FROM ex040528.log 
GROUP BY PageType
The output will look like:
PageType COUNT(ALL *)
-------- ------------
htm      115
css      22
gif      585
exe      25
nsf      142
swf      11
jpg      77
html     1
dll      1
asp      5
js       11
class    5

If we sort the output above according to the number of requests for each group, we will be creating a list showing the most requested page types first:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) AS PageTypeHits
FROM ex040528.log 
GROUP BY PageType
ORDER BY PageTypeHits DESC
The output will look like:
PageType PageTypeHits
-------- ------------
gif      585
nsf      142
htm      115
jpg      77
exe      25
css      22
js       11
swf      11
asp      5
class    5
dll      1
html     1


Groups can also be built on multiple fields, thus creating a hierarchy of groups.

For example, consider the following query:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, COUNT(*) 
FROM ex040528.log 
GROUP BY PageType, sc-status
This query creates groups according to the requested page type, and within each of these groups, sub-groups are created according to the HTTP status sent by the IIS server for the group page type; the aggregate function "COUNT" will then be calculated on each sub-group.
The output will look like:
PageType sc-status PageTypeHits
-------- --------- ------------
htm      304       79
css      304       10
gif      304       450
exe      200       25
nsf      200       129
swf      200       3
gif      404       12
css      404       9
htm      200       34
css      200       3
jpg      200       17
gif      200       123
jpg      304       60
swf      304       8
nsf      403       3
html     404       1
dll      500       1
asp      200       5
js       304       7
class    304       4
js       200       4
htm      404       2
class    200       1
nsf      304       9
nsf      302       1


It's important to note a particular language constraint derived from the use of the GROUP BY clause.
Whenever a query contains a GROUP BY clause, its SELECT clause can only contain any of the following:

In other words, the following example is a correct query:
SELECT 'hello', TO_UPPERCASE(cs-uri-stem), COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY cs-uri-stem
In fact, the SELECT clause in the example above contains:
  • A constant ("'hello'");
  • A field-expression ("TO_UPPERCASE(cs-uri-stem)") whose argument appears in the GROUP BY clause;
  • Two aggregate functions.

However, the following example is NOT a correct query:
SELECT date, COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY cs-uri-stem
The SELECT clause in the example above contains a field-expression ("date") that does not appear in the GROUP BY clause.

The following example is also NOT a correct query:
SELECT TO_UPPERCASE(cs-uri-stem), COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5)
The SELECT clause in the example above contains a field-expression ("TO_UPPERCASE(cs-uri-stem)") that is not derived from any field-expression in the GROUP BY clause; in this case, it's actually the field-expression in the GROUP BY clause that is derived from a field-expression in the SELECT clause.
The previous example can be corrected as follows:
SELECT SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5), COUNT(*), SUM(sc-bytes)
FROM ex040528.log 
GROUP BY SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5)


© 2004 Microsoft Corporation. All rights reserved.