Indexing

WinHex & X-Ways

 

Part of volume snapshot refinement. Available only with a forensic license. Reads the data with the same logic as a logical search, with the same advantages (see that topic).

 

Creates indexes of all words in all or selected files in the volume snapshot. Words are recognized based on the characters that you provide, interpreted in the Unicode character set and/or in up to two code pages that you select. It is possible to have up to three such indexes per evidence object (e.g. Cyrillic characters indexed in Unicode and two Cyrillic code pages). X-Ways Forensics allows you to conveniently select characters from more than 22 languages for indexing. Currently, most European and many Asian languages are predefined, e.g. German, Spanish, French, Portuguese, Italian, Scandinavian languages, Russian, South Slavic languages, Eastern European languages, Greek, Turkish, Hebrew, Arabic, Thai, Vietnamese. You may specify each and every character explicitly. Alternatively, if the edit box for the character pool starts with "range:", you may specify ranges of characters, optionally followed by additional single characters (e.g. a-zA-Zäöü). To index the dash itself (not recommended), specify it as the last character in the edit box.
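
The following Python sketch illustrates how a character pool string of the kind described above could be expanded into a set of characters. It is not X-Ways code. The function name expand_pool and the exact parsing rules (for instance that every character is taken literally when the "range:" prefix is missing) are assumptions made for illustration only.

```python
def expand_pool(spec: str) -> set[str]:
    """Expand a character-pool string into the set of characters it denotes.

    Hypothetical helper: the parsing rules are an assumption based on the
    description of the "range:" prefix, not the program's actual logic.
    """
    prefix = "range:"
    if not spec.startswith(prefix):
        # Assumption: without the prefix, every character is taken literally.
        return set(spec)

    body = spec[len(prefix):]
    pool: set[str] = set()
    i = 0
    while i < len(body):
        # "x-y" denotes a range, unless the dash is the last character.
        if i + 2 < len(body) and body[i + 1] == "-":
            first, last = ord(body[i]), ord(body[i + 2])
            pool.update(chr(code) for code in range(first, last + 1))
            i += 3
        else:
            pool.add(body[i])
            i += 1
    return pool


latin_german = expand_pool("range:a-zA-Zäöü")   # 26 + 26 + 3 characters
with_dash = expand_pool("range:a-z-")           # trailing dash is literal
```

Because the dash is only treated as a range operator when another character follows it, placing it last is what makes it a literal member of the pool in this sketch.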

 

Indexing is a potentially time-consuming process and may require a large amount of drive space (rule of thumb for default settings and average data: 5-25% of the original amount of data). However, the index will allow you to conduct further searches very quickly and spontaneously. The index files are saved in the subdirectories of the metadata folder of the corresponding evidence object. The scope of the index, i.e. which files are to be indexed, can be fine-tuned. Note that the index of partitioned media such as physical hard disks solely covers unpartitioned areas. That's because each partition can have its own index.
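
As a quick planning aid, the rule of thumb above can be turned into a small calculation. The function name estimate_index_size is hypothetical, and the result is only as reliable as the 5-25% rule itself, which applies to default settings and average data.

```python
def estimate_index_size(data_bytes: int) -> tuple[int, int]:
    """Return (low, high) estimates of index size in bytes.

    Applies the 5-25% rule of thumb using integer arithmetic.
    """
    low = data_bytes * 5 // 100
    high = data_bytes * 25 // 100
    return low, high


GB = 10**9
low, high = estimate_index_size(500 * GB)   # a 500 GB evidence object
```

For 500 GB of data this gives a range of 25 GB to 125 GB, so the drive that holds the metadata folder should have at least that much free space before indexing starts.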

 

Words shorter than a lower limit you specify are ignored. The longer the minimum length in characters, the smaller the index and the faster the indexing procedure. The default lower limit is 4 characters. The larger the range of accepted word lengths, the larger the index becomes and the more time indexing takes. Frequent irrelevant words can be excluded from the index by listing them in the exception list with a minus prefix (e.g. -and, if 3-letter words are already accepted), which reduces the size of the index and the time needed to create it. Important short words can be added to the exception list with a plus prefix (e.g. +xtc), which overrides the lower limit (by default 4 characters) for those words. The exception list does not have to be sorted alphabetically. Words in the exception list that are longer than the upper limit you specify are truncated in the index. Words in the exception list are bound by the character pool and cannot contain characters outside it.
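
The interplay of lower limit, upper limit and exception list can be sketched as follows. This is a simplified model, not the program's implementation. The names split_words and index_words are hypothetical, the index is assumed to be case-insensitive, and the upper limit of 7 characters is taken from the default mentioned in the Search in Index description.

```python
def split_words(text, pool):
    """Split text into runs of characters that belong to the pool."""
    words, current = [], []
    for ch in text:
        if ch in pool:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def index_words(text, pool, min_len=4, max_len=7, exceptions=()):
    """Return the words that would enter the index under this model."""
    forced = {e[1:].lower() for e in exceptions if e.startswith("+")}
    banned = {e[1:].lower() for e in exceptions if e.startswith("-")}
    result = []
    for word in split_words(text.lower(), pool):
        if word in banned:
            continue                      # "-word": never indexed
        if len(word) < min_len and word not in forced:
            continue                      # too short, not rescued by "+word"
        result.append(word[:max_len])     # truncate to the upper limit
    return result


pool = set("abcdefghijklmnopqrstuvwxyz")
indexed = index_words("He sold xtc and ridiculous things",
                      pool, exceptions=["+xtc", "-things"])
```

In this example "he" and "and" fall below the lower limit, "xtc" is rescued by its plus prefix, "things" is suppressed by its minus prefix, and "ridiculous" is stored truncated as "ridicul".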

 

X-Ways Forensics can optionally distinguish between uppercase and lowercase letters, i.e. create a case-sensitive index. This can be useful e.g. if you create the index for the purpose of later exporting a word list for a customized dictionary attack.

 

If you have X-Ways Forensics include substrings in the index, this will further slow down index creation (by a factor of 3 to 5) and inflate the index. However, you will later be able to find e.g. "wife" in "housewife" and "solve" in "resolve". If you do not include substrings in the index, it will still be possible to search the index for substrings later, but the results will be incomplete and the search much slower. Please note that it is the responsibility of the user to enable substring indexing if the words in the language to be indexed are not delimited with spaces (Chinese, Japanese, Thai, ...).
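
One common way to make substrings searchable is to store every sufficiently long suffix of each word, so that a lookup by word start also finds matches inside longer words. The sketch below uses that technique purely to illustrate why substring indexing inflates the index. How X-Ways Forensics actually stores substrings is not documented here, and the names suffixes, build_index and lookup are hypothetical.

```python
def suffixes(word, min_len=4):
    """All suffixes of word that are at least min_len characters long."""
    return [word[i:] for i in range(len(word) - min_len + 1)]


def build_index(words, substrings=False, min_len=4):
    """Build a toy index as a set of entries."""
    index = set()
    for word in words:
        if len(word) < min_len:
            continue
        index.update(suffixes(word, min_len) if substrings else [word])
    return index


def lookup(index, term):
    """A term is found if some index entry starts with it."""
    return any(entry.startswith(term) for entry in index)


words = ["housewife", "resolve"]
plain = build_index(words)                   # 2 entries
full = build_index(words, substrings=True)   # 10 entries
```

With the plain index, lookup(plain, "wife") fails because no entry starts with "wife". With suffixes stored, both "wife" and "solve" are found, at the cost of five times as many entries for these two words.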

 

Indexing will be unnecessarily slow if the data to be indexed resides on the same disk as the case file and the directory in which the index is created. Try to avoid indexing with an active Internet connection if your Windows system is configured to download updates and reboot automatically after installing them.

 

Optionally, text in certain file types can be decoded for indexing (cf. Logical Search), and it is possible to create indexes for multiple selected computer media/images associated with a case in a single step. You can index in up to six different code pages simultaneously.

 

It is possible to define a character substitution list in Unicode that causes certain letters to be indexed as other letters (e.g. "é" as just "e"). This will allow you to find certain spelling variations with a single index search, e.g. both the name "René" with an accented e at the end and "Rene" without, with either spelling. This list must have the structure

é>e

è>e

à>a

...

(i.e. one substitution per line) and needs to be present as a Unicode text file named "indexsub.txt" that starts with the little-endian Unicode indicator (byte order mark) 0xFF 0xFE. "indexsub.txt" is an optional file and is expected in the X-Ways Forensics installation directory.
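
A file with the required byte layout can be produced with a few lines of Python. This is a minimal sketch: the function name write_indexsub is hypothetical, the use of CR LF line endings is an assumption (the description above only requires one substitution per line), and the demo writes to a temporary directory rather than to the installation directory.

```python
import tempfile
from pathlib import Path


def write_indexsub(path, substitutions):
    """Write substitutions as UTF-16 little-endian text with a leading BOM."""
    lines = [f"{source}>{target}" for source, target in substitutions.items()]
    text = "\r\n".join(lines) + "\r\n"      # assumption: Windows line endings
    # Write the BOM explicitly so the file always starts with 0xFF 0xFE,
    # regardless of the byte order of the machine running this script.
    Path(path).write_bytes(b"\xff\xfe" + text.encode("utf-16-le"))


demo_file = Path(tempfile.mkdtemp()) / "indexsub.txt"
write_indexsub(demo_file, {"é": "e", "è": "e", "à": "a"})
raw = demo_file.read_bytes()
```

Writing the bytes explicitly avoids relying on the platform default of a generic "utf-16" encoder. Copy the resulting file into the installation directory only after checking it in a hex editor or text editor.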

 

You will be warned if you define a space character as part of words. That is because space characters are meant to delimit words and are not part of the words themselves. If a space character is defined to be part of words, a whole sentence like "Mike Smith lost his credit card today." is considered just a single word.

 

You can delete all indexes for an evidence object by removing the "Already done" check mark in the Refine Volume Snapshot dialog. This will also clear the "i" flag from all indexed files in the volume snapshot.

 

Search in Index: After indexing files, you may search the index for keywords very quickly, using the Simultaneous Search function. Select "Search in Index" from the drop-down box at the bottom. Anything in excess of the maximum word length used for indexing is ignored (so that "ridiculous" is found in the index even if in the index that word was truncated to "ridicul" based on a maximum word length of 7 letters). X-Ways Forensics does not distinguish between uppercase and lowercase letters except if a case-sensitive index was created. In a search hit list populated by an index search, physical offsets are not available.
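
The effect of the maximum word length on index searches can be modeled in a few lines. This is an illustrative sketch only: the function name matches_index is hypothetical, and the index is represented as a plain set of already truncated words.

```python
def matches_index(term, indexed_words, max_len=7, case_sensitive=False):
    """Check a search term against an index of truncated words."""
    if not case_sensitive:
        term = term.lower()
    # Anything beyond the maximum indexed word length is ignored.
    return term[:max_len] in indexed_words


index = {"ridicul", "sold"}
found_full = matches_index("ridiculous", index)
found_upper = matches_index("Ridiculous", index)
found_other = matches_index("ridicule", index)
```

All three lookups succeed in this model. The last one shows the trade-off of a short maximum word length: different words that share the same first 7 letters cannot be told apart by the index alone.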

 

You may conveniently run non-GREP index searches for search terms that contain space characters, just like in conventional searches. This is very important for names (e.g. "John Doe" or "XYZ Technology Ltd") and spaced compound words (e.g. "bank account" or "credit card limit"). This works even if the individual components of the compound already exceed the maximum word length that was indexed (by default 7 characters), so that you will have no trouble finding "basketball positions" (10+9 letters) or "skyscraper architecture" (10+12 letters). As always, the components are matched only up to the length that was indexed, which is not a big problem because there are not many words other than "basketball" and "skyscraper" that start with "basketb" or "skyscra", respectively. In fact, the spaces in the search terms match unindexed word delimiters other than spaces as well, such as hyphens, so you will also find "Spider-Man" and "freeze-dried" when searching for "spider man" and "freeze dried", or underscores as in "bank_account" (think of a filename like "bank_account.html"), or plus signs as in "credit+card" (e.g. common in Google search URLs when searching for more than one word), or periods as in "interview.pdf". In that respect, index searches are even more powerful than conventional searches. Defining spaces as being part of words is a big no-no.
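
Why spaces in a search term also match hyphens, underscores, plus signs and periods can be seen in a simplified model in which both the text and the search term are reduced to sequences of truncated words. This is a sketch under stated assumptions, not the program's algorithm: the names tokens and phrase_found are hypothetical, the character pool is limited to a-z, the index is case-insensitive, and the lower word length limit is ignored for brevity.

```python
import re

WORD = re.compile(r"[a-z]+")   # assumption: pool consists of a-z only


def tokens(text, max_len=7):
    """Lowercase the text, split at non-pool characters, truncate words."""
    return [word[:max_len] for word in WORD.findall(text.lower())]


def phrase_found(term, text, max_len=7):
    """True if the words of term occur consecutively in text."""
    needle = tokens(term, max_len)
    haystack = tokens(text, max_len)
    n = len(needle)
    return n > 0 and any(haystack[i:i + n] == needle
                         for i in range(len(haystack) - n + 1))


hits = [
    phrase_found("spider man", "The Spider-Man comic"),
    phrase_found("bank account", "bank_account.html"),
    phrase_found("credit card", "search?q=credit+card"),
    phrase_found("basketball positions", "All basketball positions explained"),
]
miss = phrase_found("bank account", "account at the bank")
```

Because every character outside the pool acts as a delimiter, the kind of delimiter is lost when the text is split into words, which is exactly why "spider man" finds "Spider-Man" in this model. If the space were part of the pool, the text would no longer split at spaces and this word-by-word matching would not work.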