Aggregating Data Within Groups
All the query examples that we have seen so far share a common characteristic: the values
of each output record were built upon the values of a single input record.
Sometimes, however, we might need to aggregate multiple input records together
and perform some operation on groups of input records.
To accomplish this task, the Log Parser SQL-Like language has a special set of functions
that can be used to perform basic calculations on
multiple records. These aggregate functions (also referred to as "SQL functions") include SUM, COUNT, MAX,
MIN, and AVG.
Aggregating Data
To show a classic example of the use of aggregate functions, assume that given an IIS W3C
log file, we want to calculate the total number of bytes sent by the IIS server during
the whole period recorded in the log file.
Considering that the number of bytes sent by the IIS server for each HTTP request is
logged in the "sc-bytes" field, our command will look like the following example:
C:\>LogParser -i:IISW3C -o:NAT "SELECT SUM(sc-bytes) FROM ex040528.log"Since the SELECT clause of this query makes use of the SUM aggregate function, the query will automatically aggregate all the input records, and calculate the sum of all the values of the "sc-bytes" field across all the input records; the output of this command will then look like the following output:
SUM(sc-bytes) ------------- 242834732As the example shows, the result of the query is a single output record, containing a single value calculated across all the input records.
As another example, we might want to calculate how many requests have been logged
in the log file.
Considering that each log file entry represents a single HTTP request, this task can
be accomplished by simply counting how many input records are logged in the file:
C:\>LogParser -i:IISW3C -o:NAT "SELECT COUNT(*) FROM ex040528.log"The example above makes use of the COUNT aggregate function. When used with the special "*" argument, the COUNT function returns the total number of input records processed by the query.
If we want to calculate how many requests satisfy a particular condition, for example
how many requests were for an ASP page, we can add a WHERE clause to the query, and the
COUNT function will only count input records satisfying the WHERE condition:
SELECT COUNT(*) FROM ex040528.log WHERE EXTRACT_EXTENSION(cs-uri-stem) LIKE 'asp'
Creating Groups
In the examples above, we have been using aggregate functions to calculate a value across
all the input records; sometimes, however, we might want to calculate values across
groups of input records.
As an example, we might want to calculate the total number of bytes sent by the IIS server for each URL. To perform this task, we need to divide all the input records into groups according to the URL requested, and then use the SUM aggregate function separately on each group.
This can be accomplished by using another building block of the Log Parser SQL language:
the GROUP BY clause.
The GROUP BY clause is used to specify which fields we want the group subdivision to be
based on; after the input records have been divided into these groups, all the aggregate
functions in the SELECT clause will be calculated separately on each of these groups,
and the query will return an output record for each group created.
Using the GROUP BY clause, our example query and its output will look like this:
SELECT cs-uri-stem, COUNT(*) FROM ex040528.log GROUP BY cs-uri-stem
cs-uri-stem COUNT(*) ---------------------- -------- /Home/default.asp 5 /Home/images/bckgd.gif 419 /Docs/expl.htm 12 /Docs/main.htm 26 /login/frmx.dll 1
To make another example, assume that we want to calculate how many requests have been served
for each page type (ASP, html, CSS, etc.).
First of all, we need to create separate groups according to the extension of the URL; after
this group subdivision has been done, we can calculate a COUNT(*) on each group:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) FROM ex040528.log GROUP BY PageTypeThe output will look like:
PageType COUNT(ALL *) -------- ------------ htm 115 css 22 gif 585 exe 25 nsf 142 swf 11 jpg 77 html 1 dll 1 asp 5 js 11 class 5
If we sort the output above according to the number of requests for each group, we will
be creating a list showing the most requested page types first:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) AS PageTypeHits FROM ex040528.log GROUP BY PageType ORDER BY PageTypeHits DESCThe output will look like:
PageType PageTypeHits -------- ------------ gif 585 nsf 142 htm 115 jpg 77 exe 25 css 22 js 11 swf 11 asp 5 class 5 dll 1 html 1
Groups can also be built on multiple fields, thus creating a hierarchy of groups.
For example, consider the following query:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, COUNT(*) FROM ex040528.log GROUP BY PageType, sc-statusThis query creates groups according to the requested page type, and within each of these groups, sub-groups are created according to the HTTP status sent by the IIS server for the group page type; the aggregate function "COUNT" will then be calculated on each sub-group.
The output will look like:
PageType sc-status PageTypeHits -------- --------- ------------ htm 304 79 css 304 10 gif 304 450 exe 200 25 nsf 200 129 swf 200 3 gif 404 12 css 404 9 htm 200 34 css 200 3 jpg 200 17 gif 200 123 jpg 304 60 swf 304 8 nsf 403 3 html 404 1 dll 500 1 asp 200 5 js 304 7 class 304 4 js 200 4 htm 404 2 class 200 1 nsf 304 9 nsf 302 1
It's important to note a particular language constraint derived from the use of the GROUP BY
clause.
Whenever a query contains a GROUP BY clause, its SELECT clause can only contain
any of the following:
- Aggregate functions
- Field-expressions appearing also in the GROUP BY clause, or deriving from the field-expressions used in the GROUP BY clause
- Constants
SELECT 'hello', TO_UPPERCASE(cs-uri-stem), COUNT(*), SUM(sc-bytes) FROM ex040528.log GROUP BY cs-uri-stemIn fact, the SELECT clause in the example above contains:
- A constant ("'hello'");
- A field-expression ("TO_UPPERCASE(cs-uri-stem)") whose argument appears in the GROUP BY clause;
- Two aggregate functions.
However, the following example is NOT a correct query:
SELECT date, COUNT(*), SUM(sc-bytes) FROM ex040528.log GROUP BY cs-uri-stemThe SELECT clause in the example above contains a field-expression ("date") that does not appear in the GROUP BY clause.
The following example is also NOT a correct query:
SELECT TO_UPPERCASE(cs-uri-stem), COUNT(*), SUM(sc-bytes) FROM ex040528.log GROUP BY SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5)The SELECT clause in the example above contains a field-expression ("TO_UPPERCASE(cs-uri-stem)") that is not derived from any field-expression in the GROUP BY clause; in this case, it's actually the field-expression in the GROUP BY clause that is derived from a field-expression in the SELECT clause.
The previous example can be corrected as follows:
SELECT SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5), COUNT(*), SUM(sc-bytes) FROM ex040528.log GROUP BY SUBSTR(TO_UPPERCASE(cs-uri-stem), 0, 5)