Calculating Percentages
When working with groups and aggregate functions, it is often needed to represent an
aggregate value as a percentage, rather than as an absolute value.
We might want, for example, to calculate the number of hits per page type from a Web
server log as a percentage relative to the total number of hits, rather than as
the absolute number itself.
Consider the previous example query, that calculates the count of hits per requested
page type:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, COUNT(*) FROM ex040528.log GROUP BY PageType
PageType COUNT(ALL *) -------- ------------ htm 115 css 22 gif 585 exe 25 nsf 142 swf 11 jpg 77 html 1 dll 1 asp 5 js 11 class 5If we wanted to calculate the percentage of hits for each group, we would need to divide the number of hits within each group by the total number of hits in the whole log file; however, the use of the GROUP BY clause restricts each aggregate function to operate within the single groups, thus making it impossible to calculate at the same time the total number of hits across all groups.
To workaround this problem, we use two special aggregate functions available in the
Log Parser SQL language: PROPCOUNT and PROPSUM.
When used in their basic forms, these functions calculate the ratio of the COUNT or ADD
aggregate functions within a group to the COUNT or ADD aggregate functions on all of
the input records.
Using the PROPCOUNT function, we can change the query above as follows:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, PROPCOUNT(*) FROM ex040528.log GROUP BY PageTypeAnd obtain:
PageType PROPCOUNT(ALL *) -------- ---------------- htm 0.115000 css 0.022000 gif 0.585000 exe 0.025000 nsf 0.142000 swf 0.011000 jpg 0.077000 html 0.001000 dll 0.001000 asp 0.005000 js 0.011000 class 0.005000To show real percentages, we can multiply the aggregate function values by 100:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, MUL( PROPCOUNT(*), 100.0 ) AS PageTypeHits FROM ex040528.log GROUP BY PageType
PageType PageTypeHits -------- ------------ htm 11.500000 css 2.200000 gif 58.500000 exe 2.500000 nsf 14.200000 swf 1.100000 jpg 7.700000 html 0.100000 dll 0.100000 asp 0.500000 js 1.100000 class 0.500000From the results of this query we can infer that, for example, requests to "css" pages represent the 2.2% of the total number of requests in this log file.
Calculating Percentages Across Multiple Group Hierarchies
The examples above show the basic form of the PROPCOUNT and PROPSUM functions, which
calculates the percentage of an aggregate function within a group relative to
all of the input records.
However, it is also possible to use the PROPCOUNT and PROPSUM functions to calculate
percentages relative to hierarchically higher groups.
To do so, we can use the ON keyword after the PROPCOUNT or PROPSUM function name
followed by a list of the GROUP BY field-expressions identifying which hierarchically
higher group we want the percentage to be relative to.
Consider one of the previous examples, in which we calculated the total number of hits
per page type per HTTP status code, modified to show percentages rather than
absolute numbers:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*), 100.0) AS Hits FROM ex040528.log GROUP BY PageType, sc-status ORDER BY PageType, sc-status
PageType sc-status Hits -------- --------- ------------ asp 200 0.500000 class 200 0.100000 class 304 0.400000 css 200 0.300000 css 304 1.000000 css 404 0.900000 dll 500 0.100000 exe 200 2.500000 gif 200 12.300000 gif 304 45.000000 gif 404 1.200000 htm 200 3.400000 htm 304 7.900000 htm 404 0.200000 html 404 0.100000 jpg 200 1.700000 jpg 304 6.000000 js 200 0.400000 js 304 0.700000 nsf 200 12.900000 nsf 302 0.100000 nsf 304 0.900000 nsf 403 0.300000 swf 200 0.300000 swf 304 0.800000The "Hits" field shows the percentage of hits for a page type and HTTP status code relative to the total number of hits.
If we wanted to calculate the percentage of hits for a page type and HTTP status code
relative to the number of hits for that page type (i.e. the distribution of HTTP status codes
within each page type), we would have written the query as follows:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL(PROPCOUNT(*) ON (PageType), 100.0) AS Hits FROM ex040528.log GROUP BY PageType, sc-status ORDER BY PageType, sc-statusThe output would be:
PageType sc-status Hits -------- --------- ---------- asp 200 100.000000 class 200 20.000000 class 304 80.000000 css 200 13.636364 css 304 45.454545 css 404 40.909091 dll 500 100.000000 exe 200 100.000000 gif 200 21.025641 gif 304 76.923077 gif 404 2.051282 htm 200 29.565217 htm 304 68.695652 htm 404 1.739130 html 404 100.000000 jpg 200 22.077922 jpg 304 77.922078 js 200 36.363636 js 304 63.636364 nsf 200 90.845070 nsf 302 0.704225 nsf 304 6.338028 nsf 403 2.112676 swf 200 27.272727 swf 304 72.727273We can now infer that, for example, about 45% of requests to "css" pages returned an HTTP status code of 304.
Here we have used the ON keyword followed by the "PageType"
GROUP BY field-expression. This notation indicates that we want the PROPCOUNT function to
calculate the ratio of the COUNT aggregate function within a single group to the COUNT
aggregate function within the hierarchically higher group identified by the "PageType"
field-expression.
As another example, we can modify the previous example query to create groups based on
the time the request was made at (quantized at 20-second intervals), the page type, and
the HTTP status code:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status FROM ex040528.log GROUP BY Interval, PageType, sc-status ORDER BY Interval, PageType, sc-statusFor each group, we can calculate the percentage of hits relative to the number of hits within the time interval and page type, the percentage of hits relative to the number of hits within the time interval alone, and the percentage of hits relative to the total number of hits:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1, MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2, MUL( PROPCOUNT(*), 100.0) AS Hits3 FROM ex040528.log GROUP BY Interval, PageType, sc-status ORDER BY Interval, PageType, sc-status
Interval PageType sc-status Hits1 Hits2 Hits3 -------- -------- --------- ---------- --------- --------- 00:28:40 css 200 20.000000 1.470588 0.100000 00:28:40 css 304 60.000000 4.411765 0.300000 00:28:40 css 404 20.000000 1.470588 0.100000 00:28:40 exe 200 100.000000 7.352941 0.500000 00:28:40 gif 200 10.000000 1.470588 0.100000 00:28:40 gif 304 70.000000 10.294118 0.700000 00:28:40 gif 404 20.000000 2.941176 0.200000 00:28:40 htm 200 11.764706 2.941176 0.200000 00:28:40 htm 304 88.235294 22.058824 1.500000 00:28:40 jpg 200 25.000000 1.470588 0.100000 00:28:40 jpg 304 75.000000 4.411765 0.300000 00:28:40 nsf 200 100.000000 35.294118 2.400000 00:28:40 swf 200 33.333333 1.470588 0.100000 00:28:40 swf 304 66.666667 2.941176 0.200000 00:29:00 ASP 200 100.000000 0.216920 0.100000 00:29:00 GIF 200 100.000000 0.433839 0.200000 00:29:00 asp 200 100.000000 0.216920 0.100000 00:29:00 class 200 50.000000 0.216920 0.100000 00:29:00 class 304 50.000000 0.216920 0.100000 00:29:00 css 200 14.285714 0.216920 0.100000 00:29:00 css 304 28.571429 0.433839 0.200000 00:29:00 css 404 57.142857 0.867679 0.400000 00:29:00 dll 500 100.000000 0.216920 0.100000 00:29:00 exe 200 100.000000 1.952278 0.900000 00:29:00 gif 200 21.794872 14.750542 6.800000 00:29:00 gif 304 76.923077 52.060738 24.000000 00:29:00 gif 404 1.282051 0.867679 0.400000 00:29:00 htm 200 34.090909 3.253796 1.500000 00:29:00 htm 304 63.636364 6.073753 2.800000 00:29:00 htm 404 2.272727 0.216920 0.100000 00:29:00 html 404 100.000000 0.216920 0.100000 00:29:00 jpg 200 35.000000 1.518438 0.700000 00:29:00 jpg 304 65.000000 2.819957 1.300000 00:29:00 js 200 50.000000 0.433839 0.200000 00:29:00 js 304 50.000000 0.433839 0.200000 00:29:00 nsf 200 94.339623 10.845987 5.000000 00:29:00 nsf 403 5.660377 0.650759 0.300000 00:29:00 swf 200 50.000000 0.433839 0.200000 00:29:00 swf 304 50.000000 0.433839 0.200000 00:29:20 NSF 200 100.000000 2.127660 0.300000 00:29:20 asp 200 100.000000 0.709220 0.100000 00:29:20 class 304 100.000000 0.709220 0.100000 00:29:20 css 304 60.000000 2.127660 0.300000 00:29:20 css 404 40.000000 1.418440 0.200000 00:29:20 exe 200 100.000000 2.836879 0.400000 00:29:20 gif 304 97.142857 48.226950 6.800000 00:29:20 gif 404 2.857143 1.418440 0.200000 00:29:20 htm 200 15.789474 2.127660 0.300000 00:29:20 htm 304 78.947368 10.638298 1.500000 00:29:20 htm 404 5.263158 0.709220 0.100000 00:29:20 jpg 200 15.384615 1.418440 0.200000 00:29:20 jpg 304 84.615385 7.801418 1.100000 00:29:20 js 200 50.000000 1.418440 0.200000 00:29:20 js 304 50.000000 1.418440 0.200000 00:29:20 nsf 200 61.111111 7.801418 1.100000 00:29:20 nsf 302 5.555556 0.709220 0.100000 00:29:20 nsf 304 33.333333 4.255319 0.600000 00:29:20 swf 304 100.000000 2.127660 0.300000From the query results we can infer, for example, that during the "00:29:20" time interval, about 78% of the requests to "htm" pages returned the HTTP status code 304.
In the same time interval, requests to "htm" pages returning the HTTP status code 304 made up for about 10% of the requests, and these requests represent the 1.5% of the total number of requests in the log.
The example above shows that a PROPCOUNT or PROPSUM function with no ON keyword is logically equivalent to using the ON keyword followed by an empty list of GROUP BY field-expressions, meaning that the percentage to be calculated should be relative to the highest hierarchical group identified by no field-expression, i.e. the whole set of input records.
In addition, it is also worth mentioning that the list of field-expressions specified after
the ON keyword must be a proper prefix of the GROUP BY field-expressions.
If, for example, the ON keyword is followed by
three field-expressions, then these three field-expressions must match the first
three field-expressions in the GROUP BY clause, and they must also appear in the same
order as they do in the GROUP BY clause.
In other words, each PROPCOUNT function in the following query is correct, since the
lists of field-expressions after the ON keyword are all a proper prefix of the GROUP
BY field-expressions:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL( PROPCOUNT(*) ON (Interval, PageType), 100.0) AS Hits1, MUL( PROPCOUNT(*) ON (Interval), 100.0) AS Hits2 FROM ex040528.log GROUP BY Interval, PageType, sc-statusHowever, none of the PROPCOUNT functions in the following query is correct, since the lists of field-expressions after the ON keyword are not a proper prefix of the GROUP BY field-expressions:
SELECT QUANTIZE(time, 20) AS Interval, EXTRACT_EXTENSION(cs-uri-stem) AS PageType, sc-status, MUL( PROPCOUNT(*) ON (PageType, sc-status), 100.0) AS Hits1, MUL( PROPCOUNT(*) ON (PageType), 100.0) AS Hits2, MUL( PROPCOUNT(*) ON (Interval, sc-status), 100.0) AS Hits2, FROM ex040528.log GROUP BY Interval, PageType, sc-status