Using Microsoft Internet Information Services and Indexing Service for File Content Searches
Microsoft® Internet Information Services (IIS) 4.0 and Indexing Service version 2.0 (both part of the Microsoft Windows NT® 4.0 Option Pack) combine to provide property filtering and searching as well as full-text indexing and searching of file data. Text query support against file data has an advantage over text query support against database data because, in Microsoft SQL Server™, the latter is limited to queries against character-based columns. These file content search capabilities are independent of SQL Server and support SQL-based queries through ADO (ActiveX® Data Objects). The SQL used in ADO queries is consistent with the SQL extensions explained here.
Format Filters
Indexing Service provides filters for several popular file formats, including Microsoft Word, Microsoft PowerPoint®, Microsoft Excel, and HTML. A filter is also available for plain text. Customers and third-party vendors can write filters for other formats as well. A filter has two purposes. The first is to extract text from documents that are not plain text. The second is to capture property values, both from the file content and about the file itself. Assuming that each file is a document, examples of properties include each document's title, the number of pages with notes in each PowerPoint document, the number of paragraphs in each document, the last date and time each file was accessed, and the physical path to each file. For a list of properties, see Using File Properties for File Content Searches. For a complete list, see the Indexing Service documentation.
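The two roles of a filter (extracting text and capturing properties) can be illustrated with a minimal Python sketch. This is not the Indexing Service filter interface; the function names filter_plain_text and run_filter, the FILTERS table, and the property names are hypothetical and chosen only for illustration.

```python
import os
from datetime import datetime


def filter_plain_text(path):
    """Return (content, properties) for a plain-text file (illustrative only)."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    info = os.stat(path)
    properties = {
        # Properties about the file.
        "path": os.path.abspath(path),
        "size": info.st_size,
        "accessed": datetime.fromtimestamp(info.st_atime),
        # A property derived from the file content: blank-line-separated blocks.
        "paragraphs": sum(1 for block in content.split("\n\n") if block.strip()),
    }
    return content, properties


# Hypothetical registry mapping a file extension to the filter that handles it.
FILTERS = {".txt": filter_plain_text}


def run_filter(path):
    """Pick a filter by file extension and apply it to the file."""
    extension = os.path.splitext(path)[1].lower()
    handler = FILTERS.get(extension)
    if handler is None:
        raise ValueError("no filter registered for " + repr(extension))
    return handler(path)
```

Under this sketch, supporting another format would amount to registering another function in FILTERS, which loosely mirrors how additional filters can be supplied for other formats.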
Phrase and Proximity Searches
Full-text indexes for file system searches are created by scanning the content of files. The process consists of keeping track of the significant words that are used and where they are located. For example, a full-text index may indicate that the word Canada is found at word number 227, word number 473, and word number 1017 in a given file. This index structure supports an efficient search for all items containing indexed words, as well as advanced search operations such as phrase searches and proximity searches. An example of a phrase search is looking for "white elephant", where white is immediately followed by elephant. An example of a proximity search is looking for text in which big occurs near house. To prevent the full-text index from becoming bloated with words that do not help the search, noise words, such as a, and, and the, are ignored.
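The word-position idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the Indexing Service implementation; the function names, the tokenizing rule, the three-word noise list, and the default proximity distance of five words are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Illustrative subset of a noise-word list (an assumption for this sketch).
NOISE_WORDS = {"a", "and", "the"}


def build_index(text):
    """Map each significant word to the 1-based word numbers where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9']+", text.lower())
    for position, word in enumerate(words, start=1):
        # Noise words still consume a word number but are not stored.
        if word not in NOISE_WORDS:
            index[word].append(position)
    return index


def phrase_match(index, first, second):
    """True if `second` occurs at the word number right after `first`."""
    second_positions = set(index.get(second, []))
    return any(p + 1 in second_positions for p in index.get(first, []))


def near_match(index, first, second, max_distance=5):
    """True if the two words occur within `max_distance` words of each other."""
    return any(
        abs(p - q) <= max_distance
        for p in index.get(first, [])
        for q in index.get(second, [])
    )
```

For the text "The big red house stood near a white elephant", phrase_match(index, "white", "elephant") succeeds because the two words occupy consecutive word numbers, while big and house match only as a proximity search.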
Noise Words
Noise-word lists for many languages are provided, and the set of supported languages is growing. The noise-word list is chosen according to the language of the material, which is determined during the filtering process in a way that depends on the file format: some formats record the language setting by section (for example, by paragraph), whereas others specify the language as a property of the whole document. These noise-word lists should be sufficient for most normal operations, but they can be modified for specific environments with a text editor. For more information, see the Indexing Service 2.0 documentation in the Windows NT 4.0 Option Pack.
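The selection of a noise-word list by language can be sketched as follows. The language codes, the tiny word lists, and the function name significant_words are hypothetical; they do not reproduce the lists shipped with Indexing Service. A section tagged with None stands for a file whose language is given once for the whole document.

```python
# Hypothetical, heavily abbreviated noise-word lists keyed by language code.
NOISE_WORDS_BY_LANGUAGE = {
    "en": {"a", "and", "the"},
    "fr": {"le", "la", "et"},
}


def significant_words(sections, default_language="en"):
    """Drop noise words from (language, text) sections.

    A section whose language is None falls back to the document-level
    language; an unknown language gets an empty noise-word list.
    """
    result = []
    for language, text in sections:
        noise = NOISE_WORDS_BY_LANGUAGE.get(language or default_language, set())
        result.extend(word for word in text.lower().split() if word not in noise)
    return result
```

Calling significant_words([("en", "the cat and the dog"), ("fr", "le chat et la souris")]) keeps only cat, dog, chat, and souris, because each section is checked against the list for its own language.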
Text-search Catalog
Indexing Service stores indexes and property values in a text-search catalog. By default, a text-search catalog named Web is created when Indexing Service is installed. A given text-search catalog references one or more IIS virtual directories (also known as virtual roots). A virtual directory references one or more physical directories and, optionally, other virtual directories. After files are linked to the text-search catalog through a virtual directory, Indexing Service is notified of the new files that must be indexed and begins filtering and indexing the properties and content associated with them. Indexing Service is also notified of any subsequent changes to these files and refilters and reindexes the updated files.
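The relationship between a catalog, its virtual directories, and the physical directories behind them can be modeled with a small Python sketch. The dictionary layout, the directory names, and the function physical_directories are assumptions for illustration only; they do not describe how IIS or Indexing Service stores this configuration.

```python
def physical_directories(virtual_dirs, name, _seen=None):
    """Return the physical directories reachable from one virtual directory.

    `virtual_dirs` maps a virtual directory name to the list of targets it
    references. A target that is itself a key is treated as another virtual
    directory; anything else is treated as a physical path.
    """
    seen = set() if _seen is None else _seen
    if name in seen:  # guard against virtual directories that reference each other
        return []
    seen.add(name)
    result = []
    for target in virtual_dirs.get(name, []):
        if target in virtual_dirs:
            result.extend(physical_directories(virtual_dirs, target, seen))
        else:
            result.append(target)
    return result


# Hypothetical configuration: the catalog lists the virtual directories it covers.
VIRTUAL_DIRS = {
    "/docs": [r"C:\InetPub\docs", "/archive"],
    "/archive": [r"D:\Archive\2019", r"D:\Archive\2020"],
}
CATALOG = {"Web": ["/docs"]}

# Every physical directory whose files would be filtered and indexed.
indexed = [
    path
    for root in CATALOG["Web"]
    for path in physical_directories(VIRTUAL_DIRS, root)
]
```

In this model, the files under each path in indexed are the ones that would be handed to the filtering and indexing steps, and any later change to them would trigger refiltering.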