Parsing Input Incrementally


Log Parser is often used to parse logs that grow over time.
For example, the IIS logs and the Windows Event Log are continuously updated with new information, and in some cases, we would like to parse these logs periodically and only retrieve the new records that have been logged since the last time.
This is especially true for scenarios in which, for example, we use Log Parser to consolidate logs to a database in an almost real-time fashion, or when we use Log Parser to build a monitoring system that periodically scans logs for new entries of interest.

For these scenarios, Log Parser offers a feature that allows sequential executions of the same query to only process new data that has been logged since the last execution.
This feature is enabled with the "iCheckPoint" parameter of the input formats that support it:

The "iCheckPoint" parameter is used to specify the name of a "checkpoint" file that Log Parser uses to store and retrieve information about the "position" of the last entry parsed from each of the logs that appear in a command.
When we execute a command with a checkpoint file for the first time (i.e. when the specified checkpoint file does not exist), Log Parser executes the query normally and processes all the logs in the command, saving for each the "position" of the last parsed entry to the checkpoint file.
If later on we execute the same command specifying the same checkpoint file, Log Parser will parse again all the logs in the command, but each log will be parsed starting after the entry that was last parsed by the previous command, thus producing records for new entries only. When the new command execution is complete, the information in the checkpoint file is updated with the new "position" of the last entry in each log.
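The mechanism described above can be illustrated with a minimal Python sketch. This is not Log Parser's actual implementation or file format; the JSON checkpoint file and the function names are illustrative assumptions, used only to show the store-position-and-resume pattern:

```python
import json
import os

# Hypothetical analogue of the .lpc checkpoint file (not the real format).
CHECKPOINT = "myCheckPoint.json"

def load_checkpoint():
    # Return {filename: number of lines already parsed}; empty on the first run,
    # i.e. when the checkpoint file does not exist yet.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}

def parse_incrementally(log_files):
    checkpoint = load_checkpoint()
    new_records = []
    for path in log_files:
        last = checkpoint.get(path, 0)   # files not in the checkpoint start at line 0
        with open(path) as f:
            lines = f.readlines()
        new_records.extend(lines[last:])  # only lines after the saved position
        checkpoint[path] = len(lines)     # remember the new position
    # Write the checkpoint only after all files parsed without errors,
    # mirroring Log Parser's rule of not updating the file on failure.
    with open(CHECKPOINT, "w") as f:
        json.dump(checkpoint, f)
    return new_records
```

Running `parse_incrementally` twice over the same growing file returns every line the first time and only the newly appended lines the second time.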

Note: Checkpoint files are updated only when a query executes successfully. If an error causes the execution of a query to abort, the checkpoint file is not updated.


As an example, let's assume that the "MyLogs" folder contains the following text files:

  • Log1.txt, 50 lines
  • Log2.txt, 100 lines
  • Log3.txt, 20 lines
  • Log4.txt, 30 lines
Let's also assume that we want to parse these text files incrementally using the TEXTLINE Input Format, which returns an input record for each line in the input text files.
In order to parse these logs incrementally, we specify the name of a checkpoint file, making sure that the file does not exist prior to the command execution. Our command would look like this:
logparser "SELECT * FROM MyLogs\*.*" -i:TEXTLINE -iCheckPoint:myCheckPoint.lpc
When this command is executed for the first time, Log Parser will return all 200 lines from the four log files, and it will create the "myCheckPoint.lpc" checkpoint file containing the position of the last line in each of the four log files.

Tip: When the checkpoint file is specified without a path, Log Parser will create the checkpoint file in the folder currently set for the %TEMP% environment variable, usually "\Documents and Settings\<user name>\Local Settings\Temp".

Let's now assume that the "Log3.txt" file is updated, and that ten new lines are added to the log file.
At this moment, the log files and the information stored in the checkpoint file will look like this:
  Log Files             Checkpoint file
  Log1.txt, 50 lines    Log1.txt, line 50
  Log2.txt, 100 lines   Log2.txt, line 100
  Log3.txt, 30 lines    Log3.txt, line 20
  Log4.txt, 30 lines    Log4.txt, line 30
If we execute the same command again, Log Parser will use the "myCheckPoint.lpc" file to determine where to start parsing each of the log files, and it will only parse and return the ten new lines in the "Log3.txt" file. When the command execution is complete, the "myCheckPoint.lpc" checkpoint file is updated to reflect the new position of the last line in the "Log3.txt" file.

If now a new "Log5.txt" file is created containing ten lines, the log files and the information stored in the checkpoint file will look like this:
  Log Files             Checkpoint file
  Log1.txt, 50 lines    Log1.txt, line 50
  Log2.txt, 100 lines   Log2.txt, line 100
  Log3.txt, 30 lines    Log3.txt, line 30
  Log4.txt, 30 lines    Log4.txt, line 30
  Log5.txt, 10 lines    not recorded
If we execute the command again, Log Parser will only parse the new "Log5.txt" file, returning its ten lines.

As another example showing how the checkpoint file is updated, let's assume now that the "Log2.txt" file is deleted.
The log files and the information stored in the checkpoint file will now look like this:
  Log Files             Checkpoint file
  Log1.txt, 50 lines    Log1.txt, line 50
  non-existing          Log2.txt, line 100
  Log3.txt, 30 lines    Log3.txt, line 30
  Log4.txt, 30 lines    Log4.txt, line 30
  Log5.txt, 10 lines    Log5.txt, line 10
When we execute the command, Log Parser will detect that there are no new entries to parse, and it will return no records. However, upon updating the checkpoint file, it will determine that the "Log2.txt" file doesn't exist anymore, and it will remove all the information associated with the log file from the checkpoint file, which will now look like this:
  Log Files             Checkpoint file
  Log1.txt, 50 lines    Log1.txt, line 50
  Log3.txt, 30 lines    Log3.txt, line 30
  Log4.txt, 30 lines    Log4.txt, line 30
  Log5.txt, 10 lines    Log5.txt, line 10
At this point the checkpoint file no longer contains any information about the "Log2.txt" file; should a new "Log2.txt" file appear again for any reason, a subsequent command would treat it as a new file, and all of its entries would be parsed from the beginning of the file.
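The pruning of deleted files can be sketched as a one-line filter over the checkpoint data. As before, this is an illustrative Python model, not Log Parser's real code; the dictionary-of-positions representation is an assumption:

```python
import os

def prune_checkpoint(checkpoint):
    # Drop checkpoint entries for log files that no longer exist, so that a
    # re-created file with the same name is later parsed from the beginning.
    return {path: pos for path, pos in checkpoint.items() if os.path.exists(path)}
```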

As a last example, let's now assume that the "Log1.txt" file is updated, but this time its size shrinks and it ends up containing ten lines only.
The log files and the information stored in the checkpoint file will now look like this:
  Log Files             Checkpoint file
  Log1.txt, 10 lines    Log1.txt, line 50
  Log3.txt, 30 lines    Log3.txt, line 30
  Log4.txt, 30 lines    Log4.txt, line 30
  Log5.txt, 10 lines    Log5.txt, line 10
When we execute the command, Log Parser will detect that the size of the "Log1.txt" file has changed, but instead of growing larger, the file is actually smaller. In this situation, Log Parser assumes that the file has been replaced with a new one, and it will parse it as if it were a new file, returning all ten of its entries.
After the command execution is complete, the "myCheckPoint.lpc" checkpoint file is updated to reflect the new situation, and the log files and the information stored in the checkpoint file will look like this:
  Log Files             Checkpoint file
  Log1.txt, 10 lines    Log1.txt, line 10
  Log3.txt, 30 lines    Log3.txt, line 30
  Log4.txt, 30 lines    Log4.txt, line 30
  Log5.txt, 10 lines    Log5.txt, line 10
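The shrunken-file rule described above reduces to a simple comparison, sketched here in Python under the same illustrative assumptions as the earlier snippets:

```python
def start_position(current_line_count, saved_position):
    # A file smaller than its saved checkpoint position is assumed to have
    # been replaced: restart from line 0 and parse it in its entirety.
    # Otherwise, resume from the saved position.
    if current_line_count < saved_position:
        return 0
    return saved_position
```

For "Log1.txt" above, the file now has 10 lines while the checkpoint records line 50, so parsing restarts at line 0.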

Incremental Parsing and Aggregated Data

It's important to note that the checkpoint file only records information about the files being parsed; it does not record information about the query being executed.
In other words, when we execute a query multiple times on a set of growing files using a checkpoint file, each time the query results are calculated on the new entries only. This means that queries using aggregated data need to be handled carefully when used with checkpoint files.

As an example, consider again the four text files in the first scenario above, and the following command:
logparser "SELECT COUNT(*) AS Total FROM MyLogs\*.*" -i:TEXTLINE -iCheckPoint:myCheckPoint.lpc
When the command is executed for the first time, the "Total" field in the output record returned by the query will be equal to 200, that is, the total number of lines in the four log files.
As in the first example, let's now assume that the "Log3.txt" file is updated, and that ten new lines are added to the log file.
When we execute the command again, the "Total" field in the output record returned by the query will now be equal to 10, the number of new lines added since the previous execution, and not to 210, as one might expect from the total number of lines across all files.

In cases where it is desirable to calculate aggregated data across multiple executions of the same query when using incremental parsing, a possible solution is to save the partial results of each query to temporary files, and then aggregate all the partial results with an additional step.
Using the example above, we could save the result of the first query ("200") to the "FirstResults.csv" file, and the result of the second query ("10") to the "LastResults.csv" file. The two files could then be consolidated into a single file with a command like this:
logparser "SELECT SUM(Total) FROM FirstResults.csv, LastResults.csv" -i:CSV
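The same consolidation step can be modeled outside Log Parser. The following Python sketch sums a "Total" column across partial result files; the file names match the example above, but the code itself is an illustrative assumption, not part of Log Parser:

```python
import csv

def sum_totals(csv_files):
    # Sum the "Total" column across partial result files, e.g.
    # FirstResults.csv (200) + LastResults.csv (10) -> 210.
    total = 0
    for path in csv_files:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += int(row["Total"])
    return total
```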


© 2004 Microsoft Corporation. All rights reserved.