Calculating Percentages

Log Parser

Calculating Percentages

When working with groups and aggregate functions, it is often needed to represent an aggregate value as a percentage, rather than as an absolute value.
We might want, for example, to calculate the number of hits per page type from a Web server log as a percentage relative to the total number of hits, rather than as the absolute number itself.

Consider the previous example query, that calculates the count of hits per requested page type:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) 
FROM ex040528.log 
GROUP BY PageType
PageType COUNT(ALL *)
-------- ------------
htm      115
css      22
gif      585
exe      25
nsf      142
swf      11
jpg      77
html     1
dll      1
asp      5
js       11
class    5
If we wanted to calculate the percentage of hits for each group, we would need to divide the number of hits within each group by the total number of hits in the whole log file; however, the use of the GROUP BY clause restricts each aggregate function to operate within the single groups, thus making it impossible to calculate at the same time the total number of hits across all groups.


To workaround this problem, we use two special aggregate functions available in the Log Parser SQL language: PROPCOUNT and PROPSUM.
When used in their basic forms, these functions calculate the ratio of the COUNT or ADD aggregate functions within a group to the COUNT or ADD aggregate functions on all of the input records.

Using the PROPCOUNT function, we can change the query above as follows:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, PROPCOUNT(*) 
FROM ex040528.log 
GROUP BY PageType
And obtain:
PageType PROPCOUNT(ALL *)
-------- ----------------
htm      0.115000
css      0.022000
gif      0.585000
exe      0.025000
nsf      0.142000
swf      0.011000
jpg      0.077000
html     0.001000
dll      0.001000
asp      0.005000
js       0.011000
class    0.005000
To show real percentages, we can multiply the aggregate function values by 100:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, MUL( PROPCOUNT(*), 100.0 ) AS PageTypeHits
FROM ex040528.log 
GROUP BY PageType
PageType PageTypeHits
-------- ------------
htm      11.500000
css      2.200000
gif      58.500000
exe      2.500000
nsf      14.200000
swf      1.100000
jpg      7.700000
html     0.100000
dll      0.100000
asp      0.500000
js       1.100000
class    0.500000
From the results of this query we can infer that, for example, requests to "css" pages represent the 2.2% of the total number of requests in this log file.

Calculating Percentages Across Multiple Group Hierarchies

The examples above show the basic form of the PROPCOUNT and PROPSUM functions, which calculates the percentage of an aggregate function within a group relative to all of the input records.
However, it is also possible to use the PROPCOUNT and PROPSUM functions to calculate percentages relative to hierarchically higher groups.
To do so, we can use the ON keyword after the PROPCOUNT or PROPSUM function name followed by a list of the GROUP BY field-expressions identifying which hierarchically higher group we want the percentage to be relative to.

Consider one of the previous examples, in which we calculated the total number of hits per page type per HTTP status code, modified to show percentages rather than absolute numbers:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*), 100.0) AS Hits
FROM ex040528.log 
GROUP BY PageType, sc-status
ORDER BY PageType, sc-status
PageType sc-status Hits
-------- --------- ------------
asp      200       0.500000
class    200       0.100000
class    304       0.400000
css      200       0.300000
css      304       1.000000
css      404       0.900000
dll      500       0.100000
exe      200       2.500000
gif      200       12.300000
gif      304       45.000000
gif      404       1.200000
htm      200       3.400000
htm      304       7.900000
htm      404       0.200000
html     404       0.100000
jpg      200       1.700000
jpg      304       6.000000
js       200       0.400000
js       304       0.700000
nsf      200       12.900000
nsf      302       0.100000
nsf      304       0.900000
nsf      403       0.300000
swf      200       0.300000
swf      304       0.800000
The "Hits" field shows the percentage of hits for a page type and HTTP status code relative to the total number of hits.

If we wanted to calculate the percentage of hits for a page type and HTTP status code relative to the number of hits for that page type (i.e. the distribution of HTTP status codes within each page type), we would have written the query as follows:

SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*) ON (PageType), 100.0) AS Hits
FROM ex040528.log 
GROUP BY PageType, sc-status
ORDER BY PageType, sc-status
The output would be:
PageType sc-status Hits
-------- --------- ----------
asp      200       100.000000
class    200       20.000000
class    304       80.000000
css      200       13.636364
css      304       45.454545
css      404       40.909091
dll      500       100.000000
exe      200       100.000000
gif      200       21.025641
gif      304       76.923077
gif      404       2.051282
htm      200       29.565217
htm      304       68.695652
htm      404       1.739130
html     404       100.000000
jpg      200       22.077922
jpg      304       77.922078
js       200       36.363636
js       304       63.636364
nsf      200       90.845070
nsf      302       0.704225
nsf      304       6.338028
nsf      403       2.112676
swf      200       27.272727
swf      304       72.727273
We can now infer that, for example, about 45% of requests to "css" pages returned an HTTP status code of 304.

Here we have used the ON keyword followed by the "PageType" GROUP BY field-expression. This notation indicates that we want the PROPCOUNT function to calculate the ratio of the COUNT aggregate function within a single group to the COUNT aggregate function within the hierarchically higher group identified by the "PageType" field-expression.

As another example, we can modify the previous example query to create groups based on the time the request was made at (quantized at 20-second intervals), the page type, and the HTTP status code:

SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status
ORDER BY Interval, PageType, sc-status
For each group, we can calculate the percentage of hits relative to the number of hits within the time interval and page type, the percentage of hits relative to the number of hits within the time interval alone, and the percentage of hits relative to the total number of hits:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status,
     MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1,
     MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2,
     MUL( PROPCOUNT(*), 100.0) AS Hits3 
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status
ORDER BY Interval, PageType, sc-status
Interval PageType sc-status Hits1      Hits2     Hits3
-------- -------- --------- ---------- --------- ---------
00:28:40 css      200       20.000000  1.470588  0.100000
00:28:40 css      304       60.000000  4.411765  0.300000
00:28:40 css      404       20.000000  1.470588  0.100000
00:28:40 exe      200       100.000000 7.352941  0.500000
00:28:40 gif      200       10.000000  1.470588  0.100000
00:28:40 gif      304       70.000000  10.294118 0.700000
00:28:40 gif      404       20.000000  2.941176  0.200000
00:28:40 htm      200       11.764706  2.941176  0.200000
00:28:40 htm      304       88.235294  22.058824 1.500000
00:28:40 jpg      200       25.000000  1.470588  0.100000
00:28:40 jpg      304       75.000000  4.411765  0.300000
00:28:40 nsf      200       100.000000 35.294118 2.400000
00:28:40 swf      200       33.333333  1.470588  0.100000
00:28:40 swf      304       66.666667  2.941176  0.200000
00:29:00 ASP      200       100.000000 0.216920  0.100000
00:29:00 GIF      200       100.000000 0.433839  0.200000
00:29:00 asp      200       100.000000 0.216920  0.100000
00:29:00 class    200       50.000000  0.216920  0.100000
00:29:00 class    304       50.000000  0.216920  0.100000
00:29:00 css      200       14.285714  0.216920  0.100000
00:29:00 css      304       28.571429  0.433839  0.200000
00:29:00 css      404       57.142857  0.867679  0.400000
00:29:00 dll      500       100.000000 0.216920  0.100000
00:29:00 exe      200       100.000000 1.952278  0.900000
00:29:00 gif      200       21.794872  14.750542 6.800000
00:29:00 gif      304       76.923077  52.060738 24.000000
00:29:00 gif      404       1.282051   0.867679  0.400000
00:29:00 htm      200       34.090909  3.253796  1.500000
00:29:00 htm      304       63.636364  6.073753  2.800000
00:29:00 htm      404       2.272727   0.216920  0.100000
00:29:00 html     404       100.000000 0.216920  0.100000
00:29:00 jpg      200       35.000000  1.518438  0.700000
00:29:00 jpg      304       65.000000  2.819957  1.300000
00:29:00 js       200       50.000000  0.433839  0.200000
00:29:00 js       304       50.000000  0.433839  0.200000
00:29:00 nsf      200       94.339623  10.845987 5.000000
00:29:00 nsf      403       5.660377   0.650759  0.300000
00:29:00 swf      200       50.000000  0.433839  0.200000
00:29:00 swf      304       50.000000  0.433839  0.200000
00:29:20 NSF      200       100.000000 2.127660  0.300000
00:29:20 asp      200       100.000000 0.709220  0.100000
00:29:20 class    304       100.000000 0.709220  0.100000
00:29:20 css      304       60.000000  2.127660  0.300000
00:29:20 css      404       40.000000  1.418440  0.200000
00:29:20 exe      200       100.000000 2.836879  0.400000
00:29:20 gif      304       97.142857  48.226950 6.800000
00:29:20 gif      404       2.857143   1.418440  0.200000
00:29:20 htm      200       15.789474  2.127660  0.300000
00:29:20 htm      304       78.947368  10.638298 1.500000
00:29:20 htm      404       5.263158   0.709220  0.100000
00:29:20 jpg      200       15.384615  1.418440  0.200000
00:29:20 jpg      304       84.615385  7.801418  1.100000
00:29:20 js       200       50.000000  1.418440  0.200000
00:29:20 js       304       50.000000  1.418440  0.200000
00:29:20 nsf      200       61.111111  7.801418  1.100000
00:29:20 nsf      302       5.555556   0.709220  0.100000
00:29:20 nsf      304       33.333333  4.255319  0.600000
00:29:20 swf      304       100.000000 2.127660  0.300000
From the query results we can infer, for example, that during the "00:29:20" time interval, about 78% of the requests to "htm" pages returned the HTTP status code 304.
In the same time interval, requests to "htm" pages returning the HTTP status code 304 made up for about 10% of the requests, and these requests represent the 1.5% of the total number of requests in the log.

The example above shows that a PROPCOUNT or PROPSUM function with no ON keyword is logically equivalent to using the ON keyword followed by an empty list of GROUP BY field-expressions, meaning that the percentage to be calculated should be relative to the highest hierarchical group identified by no field-expression, i.e. the whole set of input records.

In addition, it is also worth mentioning that the list of field-expressions specified after the ON keyword must be a proper prefix of the GROUP BY field-expressions. If, for example, the ON keyword is followed by three field-expressions, then these three field-expressions must match the first three field-expressions in the GROUP BY clause, and they must also appear in the same order as they do in the GROUP BY clause.
In other words, each PROPCOUNT function in the following query is correct, since the lists of field-expressions after the ON keyword are all a proper prefix of the GROUP BY field-expressions:

SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status,
     MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1,
     MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status
However, none of the PROPCOUNT functions in the following query is correct, since the lists of field-expressions after the ON keyword are not a proper prefix of the GROUP BY field-expressions:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status,
     MUL( PROPCOUNT(*) ON (PageType, sc-status), 100.0) AS Hits1,
     MUL( PROPCOUNT(*) ON (PageType), 100.0) AS Hits2,
     MUL( PROPCOUNT(*) ON (Interval, sc-status), 100.0) AS Hits2,
FROM ex040528.log 
GROUP BY Interval, PageType, sc-status


© 2004 Microsoft Corporation. All rights reserved.