
WinHex & X-Ways

Search Options

 

Case sensitive: If a search is case-sensitive, upper-case and lower-case characters are distinguished, so that e.g. “Option” with a capital “O” is not found in the word “optionally”. By unchecking the checkbox, you search for all upper-case/lower-case variants of the search terms. Searches are fully case-insensitive only with the Simultaneous Search; with the Find Text command, case insensitivity covers only letters from the Latin/English alphabet and German umlauts. In the Simultaneous Search you may use case-sensitive and non-case-sensitive search terms at the same time if the “Match case” option is half selected. In that case you may prepend search terms with “case:” to mark them as case-sensitive.

 

Unicode: The specified text is searched in UTF-16 Little Endian. The Simultaneous Search allows you to search for the same text at the same time in Unicode and in other code pages.
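To illustrate what searching in UTF-16 Little Endian means at the byte level, here is a minimal Python sketch. The function name utf16le_pattern and the sample data are assumptions made for this illustration; they are not part of WinHex.

```python
def utf16le_pattern(term: str) -> bytes:
    """Return the UTF-16 Little Endian byte sequence for a search term."""
    return term.encode("utf-16-le")

# Each Latin letter becomes two bytes: the character code followed by 0x00.
pattern = utf16le_pattern("Option")

# Hypothetical sample data: two stray bytes, then UTF-16 LE text, then one more byte.
data = b"\x00\x01" + "An Option here".encode("utf-16-le") + b"\xff"
offset = data.find(pattern)
```

The same term encoded in an 8-bit code page would be only half as long, which is why a Unicode search and a code page search look for different byte sequences.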

 

You may specify a wildcard (one character or a two-digit hex value), which represents one byte. For example this option can be used to find "Speck" as well as "Spock" when searching for "Sp?ck" with the question mark as the wildcard.
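The wildcard behaviour described above can be sketched as follows in Python. This is an illustration under assumptions, not the WinHex implementation; find_with_wildcard is a hypothetical helper name.

```python
def find_with_wildcard(data: bytes, pattern: bytes, wildcard: int = ord("?")) -> list:
    """Return all offsets where pattern occurs; the wildcard byte matches any one byte."""
    hits = []
    for start in range(len(data) - len(pattern) + 1):
        window = data[start:start + len(pattern)]
        # A position matches if the pattern byte is the wildcard or equals the data byte.
        if all(p == wildcard or p == d for p, d in zip(pattern, window)):
            hits.append(start)
    return hits

hits = find_with_wildcard(b"Speck and Spock", b"Sp?ck")
```

Here both "Speck" (offset 0) and "Spock" (offset 10) are reported, because the question mark stands for exactly one arbitrary byte.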

 

Only whole words: The search term is found only if it occurs as a whole word, i.e. if it is delimited from other words by any character other than a...z, A...Z and German and French letters (e.g. by punctuation marks, blanks, binary control codes, digits). If this option is enabled, for example "tomato" is not found in "automaton". This reliably reduces the number of hits only for English, German, and French text. In a Simultaneous Search either all search terms are searched as whole words, or only those that are indented (prepended with a tab character), or none, depending on the state of the corresponding check box. If you wish to combine the indentation for a whole-word search with the "case:" prefix for case sensitivity, enter the "case:" prefix first and then insert the tab character for the indentation.

 

For a Simultaneous Search you may customize the word boundary detection for languages that utilize the Latin 1 code page, i.e. make it more strict (for fewer search hits) or more relaxed (for more search hits), by defining the alphabet of characters that are considered letters (i.e. characters belonging to words) as opposed to non-word characters. A word character followed by a non-word character or the other way around is considered a word boundary. There are three easy-to-use pre-defined settings. The setting for the most thorough search results is the default. Users that are overwhelmed by garbage hits for short keywords in non-text data such as Base64 or binary garbage may want to try the other two options. These other two options could lead to valid search hits being missed in some constellations (depending on the file format), but can still be justifiable as a great time saver for searches in text documents, i.e. more so in electronic discovery than in computer forensics.

 

For more explanation and an example of how the whole words option works, please read on: A word boundary is a boundary between two consecutive characters of which one character is a word character and the other character is not a word character. If two consecutive characters are both word characters (e.g. "ns"), then obviously the "s" does not start a new whole word, and the "n" cannot be the end of a whole word. It can be somewhere in the middle of a whole word (e.g. "mansion"), but in between these two characters "ns" there is definitely no word boundary. If both characters are non-word characters (e.g. "! ", exclamation mark followed by a space), then obviously the position between the two is not a word boundary either. The exclamation mark cannot be the end of a word (cannot occur anywhere within a word), and the space cannot be the start of a word (cannot occur anywhere within a word either, excluding compound words). If you are searching for "man" as a whole word within "our mansion", then XWF will provisionally/internally find "man", and then first check whether the character before the "m" is a word character. That character is a space. A space character is not a word character. Then it also checks whether "m" is a word character according to the alphabet. It is. That means there is a word boundary before the "m". Next XWF needs to check whether "n" and "s" are word characters. Both are. That means that after the "n" there is no word boundary. Hence the three letters "man" within "mansion" are not considered a whole word occurrence of "man".
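The boundary check walked through above can be sketched in Python. This is a simplified model, assuming an alphabet of plain ASCII letters; WORD_CHARS, is_word_boundary and whole_word_hits are hypothetical names, and the real alphabet in the program is user-defined.

```python
import string

# Assumed alphabet for this sketch: ASCII letters only.
WORD_CHARS = set(string.ascii_letters)

def is_word_boundary(text: str, pos: int, alphabet=WORD_CHARS) -> bool:
    """True if exactly one of text[pos-1] and text[pos] is a word character.
    Positions outside the text are treated as non-word characters."""
    before = pos > 0 and text[pos - 1] in alphabet
    after = pos < len(text) and text[pos] in alphabet
    return before != after

def whole_word_hits(text: str, term: str, alphabet=WORD_CHARS) -> list:
    """Find provisional hits first, then keep those with a boundary on both sides."""
    hits = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        if is_word_boundary(text, start, alphabet) and is_word_boundary(text, end, alphabet):
            hits.append(start)
        start = text.find(term, start + 1)
    return hits

in_mansion = whole_word_hits("our mansion", "man")   # "n" and "s" are both word characters
standalone = whole_word_hits("our man!", "man")      # "n" followed by "!" is a boundary
```

In "our mansion" the provisional hit is rejected because there is no boundary after the "n"; in "our man!" the hit at offset 4 is kept.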

 

The whole words only restriction of the Simultaneous Search is not applied to search hits that are not words according to the user's selected alphabet definition (checking only the first and the last character in the search hit). For example if you are searching for "LOL!!", then this cannot possibly be a whole word because the exclamation mark is not a letter and thus not contained in the defined alphabet (well, unless you have added the exclamation mark to it manually). However, the GREP word boundary indicator \b is still applied in such a case, for example to be able to search for certain data in between words, data that is not considered a word itself.

 

In addition to the alphabet of characters for the Latin 1 code page (for all Western European languages), an optional additional alphabet can be defined for letters of another language. If activated, it is used for searches in UTF-16 and for searches in regional ANSI/OEM/IBM/ISO/Mac code pages with only 1 byte per character, such as those for Cyrillic, Greek, Turkish, Arabic, Hebrew, Vietnamese, and various Central/Eastern/South Eastern European languages. The Cyrillic alphabet is predefined.

 

Search direction: Decide whether WinHex shall search from the beginning to the end, or downwards or upwards from the current position.

 

Condition: Offset modulo x = y: The search algorithm accepts search string occurrences only at offsets that meet the given requirements. E.g. if you search for data that typically occurs at the 10th byte of a hard disk sector, you may specify x=512, y=10. If you are looking for DWORD-aligned data, you may use x=4, y=0 to narrow down the number of hits.
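As a rough illustration of the offset condition, the following Python sketch keeps only hits whose offset satisfies offset mod x = y. The helper name hits_at_offsets and the sample buffer are assumptions for this example.

```python
def hits_at_offsets(data: bytes, pattern: bytes, x: int, y: int) -> list:
    """Return offsets of pattern in data, keeping only those with offset % x == y."""
    hits = []
    start = data.find(pattern)
    while start != -1:
        if start % x == y:
            hits.append(start)
        start = data.find(pattern, start + 1)
    return hits

# Hypothetical 4-sector buffer with the marker at offset 10 of sectors 0 and 1,
# and at offset 11 of sector 2 (which does not meet the condition).
disk = bytearray(2048)
disk[10:12] = b"\xAA\xBB"
disk[512 + 10:512 + 12] = b"\xAA\xBB"
disk[1024 + 11:1024 + 13] = b"\xAA\xBB"

sector_hits = hits_at_offsets(bytes(disk), b"\xAA\xBB", 512, 10)
all_hits = hits_at_offsets(bytes(disk), b"\xAA\xBB", 1, 0)  # x=1, y=0 accepts every offset
```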

 

Search in block only: The search operation is limited to the current block.
 

Search in all open windows: The search operation is applied to all open edit windows. Press F4 to continue the search in the next window. If "Search in block only" is enabled at the same time, the search operation is limited to the current block in each window.

 

Count occurrences/List search hits: Causes WinHex not to jump to each single occurrence, but to count or list them.

 

Search for "non-matches": In "Find Hex Values" you may specify a single hex value with an exclamation mark as a prefix (e.g. !00) to make WinHex stop when it encounters the first byte value that differs.
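The effect of a "non-match" search such as !00 can be pictured with this small Python sketch, which stops at the first byte that differs from the given value. The function name is hypothetical.

```python
def find_first_non_match(data: bytes, value: int, start: int = 0) -> int:
    """Return the offset of the first byte that differs from value, or -1 if none does."""
    for offset in range(start, len(data)):
        if data[offset] != value:
            return offset
    return -1

# Equivalent in spirit to searching for !00 in a zero-filled region.
first_data = find_first_non_match(b"\x00\x00\x00\x41\x00", 0x00)
```

This is useful, for example, to skip over long runs of zero bytes and land on the first byte that carries data.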

 

Options and advantages of the logical search

 

GREP syntax: Search option available with the Simultaneous Search only. Regular expressions are a powerful search tool. A single regular expression may match many different words. Either all search terms are considered GREP expressions or only those prepended with "grep:" or none, depending on the state of the corresponding checkbox. You may prepend a search term with both "case:" (see above) and "grep:" in that order. The following characters have a special meaning in regular expressions, as explained below: ( ) [ ] { } | \ . # + ?. Where these special characters are to be taken literally, you need to prefix them with a backslash character (\).

 

The | operator is used to denote alternative matches. You can use the regular expression car (wheel|tire) to search for the words "car wheel" and "car tire". Any match must equal the parts before, after, or between any | operators present. The effect of | is only limited by parentheses.

 

. and # are wildcards: . matches any character, # matches any numeric character. You can define sets of characters with the help of square brackets: [xyz] will match any of the characters x, y, z. [^xyz] will match any character except x, y, or z. You can define ranges of characters using a dash: [a-z] matches any lower-case letter. [^a-z] matches all characters except lower-case letters. The listing may comprise individually listed characters and ranges at the same time: [aceg-loq] matches a, c, e, g, h, i, j, k, l, o, and q. All characters except [, ], -, and \ are taken literally between square brackets, even the wildcard characters . and #.

 

\b stands for the start or end of a word, i.e. the boundary between a word character and a non-word character. Which characters/letters are considered word characters by the Simultaneous Search is user-defined. The start and end of a file also count as word boundaries. \b is only supported at the start and/or at the end of the search term, and not in conjunction with |. The \b, ^, and $ anchors work only when searching in evidence objects of a case, and not for index searches.

 

Byte values that correspond to ASCII characters that cannot be easily produced with a keyboard can be specified in decimal or hexadecimal notation: For example, \032 and \x20 are both equivalent to the space character in the ASCII character set. This kind of notation is supported even in between square brackets. E.g. [\000-\x1f] matches non-printable ASCII characters.

 

Multiplier characters (*, +, and ?) indicate that the preceding character(s) may or must occur more than once (see below). Complex example: a(b|cd|e[f-h]i)*j matches aj, abj, acdj, aefij, aegij, aehij, abcdj, and abefij.
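The complex example can be checked with Python's re module, assuming that for this particular expression (grouping, |, character range, and *) Python interprets the syntax the same way as described here. That equivalence is an assumption of this sketch and does not hold for all GREP features of the Simultaneous Search.

```python
import re

# The expression from the text; it uses only syntax that Python's re shares.
pattern = re.compile(r"a(b|cd|e[f-h]i)*j")

listed = ["aj", "abj", "acdj", "aefij", "aegij", "aehij", "abcdj", "abefij"]
all_listed_match = all(pattern.fullmatch(s) is not None for s in listed)

# Counter-examples: "acj" lacks the "d", "aeij" lacks a letter from f-h.
rejected = [s for s in ["acj", "aeij", "ab"] if pattern.fullmatch(s) is None]
```

The group in parentheses may repeat any number of times, including zero, which is why both "aj" and "abefij" match.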

 

Within [] brackets, the characters .*+?{}()| are not treated as special characters, but literally.

 

Brief overview of supported syntax features (everything else is interpreted literally)

.         A period matches any single character.
#         A pound sign matches any numeric character [0-9].
\nnn      A byte value specified with three decimal digits (\000...\255).
\xnn      A byte value specified with two hexadecimal digits (\x00...\xFF). For example, \x0D\x0A is a Windows line break.
\unnnn    A Unicode value specified with four hexadecimal digits. Depending on the selected code page(s), corresponds to different byte values.
?         Matches one or zero occurrences of the preceding character or set.
*         Matches any number of occurrences of the preceding character, including zero.
+         A plus sign after a character matches any number of occurrences of that character except zero.
[XYZ]     Characters in brackets match any one character that appears in the brackets.
[^XYZ]    A circumflex at the start of the string in brackets means NOT.
[A-Z]     A dash within the brackets signifies a range of characters.
\         Indicates that the following special GREP character is to be treated literally.
{X,Y}     Repeats the preceding character or group of characters X to Y times.
(ab)      Functions like a parenthesis in a mathematical expression. Groups ab together for +, ?, *, | and { }.
a|b       The pipe acts as a logical OR. So it would read "a or b".
\b        Matches a word boundary.
^         Matches the start of a file.
$         Matches the logical or physical end of a file, depending on the search options.
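Two of the features listed above, the # wildcard and the three-digit decimal byte notation \nnn, differ from the regular expression dialect of Python. The following sketch translates just these two features so that an expression can be tried out with Python's re module. The function xways_grep_to_python is a hypothetical helper; it assumes, as stated earlier in this topic, that # is literal between square brackets, and it does not attempt to reproduce the program's handling of ., ^, $ or \b.

```python
import re

def xways_grep_to_python(expr: str) -> str:
    """Translate '#' (outside brackets) and decimal '\\nnn' into Python re syntax."""
    out = []
    i = 0
    in_set = False  # True while between '[' and ']'
    while i < len(expr):
        ch = expr[i]
        if ch == "\\" and i + 1 < len(expr):
            m = re.match(r"\d{3}", expr[i + 1:])
            if m:
                value = int(m.group())  # decimal, not octal
                if value > 255:
                    raise ValueError("byte value out of range: \\" + m.group())
                out.append("\\x%02x" % value)
                i += 4
            else:
                # \xnn, \unnnn, \b and escaped special characters are copied as-is.
                out.append(expr[i:i + 2])
                i += 2
            continue
        if ch == "[":
            in_set = True
        elif ch == "]":
            in_set = False
        out.append("[0-9]" if ch == "#" and not in_set else ch)
        i += 1
    return "".join(out)

space_expr = xways_grep_to_python(r"a\032b")        # \032 is decimal 32, the space
card_expr = xways_grep_to_python(r"[45]###-####")
```

For example, the translated form of a\032b matches the three characters "a b" when passed to re.fullmatch.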

 

GREP Examples

 

E-mail addresses

[a-zA-Z0-9_\-\+\.]{1,20}@[a-zA-Z0-9\-\.]{2,20}\.[a-zA-Z]{2,7}

(the + before the @ is supported in Gmail addresses)

 

Internet addresses starting with http://, https://, ftp://

[a-zA-Z]+://[a-zA-Z0-9/_\?$&=\-\.]+

 

Visa and Mastercard credit card numbers

[^#a-z][45]###############[^#a-z]

[45]###-####-####-####

[45]### #### #### ####

(ideally check results via an X-Tension with the Luhn algorithm to reduce the number of false hits and search without [^#a-z])
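The Luhn check mentioned above is a standard checksum for payment card numbers. A minimal Python sketch of it follows; it only illustrates the algorithm and is not an X-Tension (the function name luhn_valid is an assumption of this example).

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum.
    Non-digit characters such as spaces and dashes are ignored."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

valid = luhn_valid("4111 1111 1111 1111")    # widely used test number
invalid = luhn_valid("4111-1111-1111-1112")  # last digit changed
```

Filtering GREP hits through such a check discards most random 16-digit sequences, since only about one in ten passes the checksum by chance.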

 


 

Allow overlapping hits: If you use GREP syntax to search for hits of variable length, multiple valid hits at the same location may be the result. If you search for example for e-mail addresses, and the search algorithm is fed with an e-mail address whose local part is "mail" and which ends in ".com", then it will determine that the characters from the "m" in "mail" match the GREP expression and it will record a hit. After that, it proceeds with the "a" in "mail" and realizes that the shorter address beginning with "ail@" fits the bill as well, and so do the addresses beginning with "il@" and "l@". All of these might be valid e-mail addresses. So the search algorithm is entirely right, but typically users do not wish to see those additional hits. So if you do not allow overlapping hits, new hits are recorded only after the "m" in ".com". Not allowing overlapping hits means exclusively assigning the characters covered by a hit to that hit and no longer to potential other hits.
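The difference between the two settings can be sketched with Python's re module and the e-mail expression from the GREP examples. The sample address and the helper find_hits are assumptions for this illustration, and the sketch only varies the start position of hits; it does not claim to mirror the program's internal algorithm.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9_\-\+\.]{1,20}@[a-zA-Z0-9\-\.]{2,20}\.[a-zA-Z]{2,7}")

def find_hits(text: str, allow_overlapping: bool) -> list:
    """Collect hits; with overlapping allowed, resume one character after each hit start."""
    hits = []
    pos = 0
    while pos <= len(text):
        m = EMAIL.search(text, pos)
        if m is None:
            break
        hits.append(m.group())
        # Without overlapping, the characters of a hit belong to that hit only.
        pos = m.start() + 1 if allow_overlapping else m.end()
    return hits

sample = "write to mail@example.com today"  # hypothetical sample text
exclusive = find_hits(sample, allow_overlapping=False)
overlapping = find_hits(sample, allow_overlapping=True)
```

With overlapping hits allowed, the single address yields four hits, each starting one character later than the previous one.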

 

 

Search window, proximity searches

 

The GREP search window width is 128 bytes by default. That means it is not guaranteed that a variable-length GREP search term (i.e. one that uses multipliers such as +, * or {X,Y}) finds data that is longer than 128 bytes. You may increase the search window width if you need to cover more than that.

 

This is needed for example for proximity searches. If you require that a document contains two search terms at the same time, and that the search terms should occur close to one another, you could search for these search terms with two GREP expressions and specify the maximum distance allowed between them as the second parameter in the braces:

keyword1.{0,maxdistance}keyword2

keyword2.{0,maxdistance}keyword1

The search window width in bytes required when searching with an 8-bit character set is the sum of maxdistance, length(keyword1) and length(keyword2).
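The window width rule for an 8-bit character set can be written out as a small calculation. In this Python sketch the keywords, the distance, and the function names are made-up illustration values; keywords that contain special GREP characters would additionally need to be escaped with a backslash.

```python
DEFAULT_WINDOW = 128  # default GREP search window width in bytes, as stated above

def proximity_expressions(keyword1: str, keyword2: str, max_distance: int) -> list:
    """Build the two GREP expressions, one for each order of the keywords."""
    gap = ".{0,%d}" % max_distance
    return [keyword1 + gap + keyword2, keyword2 + gap + keyword1]

def required_window_width(keyword1: str, keyword2: str, max_distance: int) -> int:
    """Window width in bytes for an 8-bit character set (one byte per character)."""
    return max_distance + len(keyword1) + len(keyword2)

exprs = proximity_expressions("invoice", "payment", 100)
width = required_window_width("invoice", "payment", 100)
needs_wider_window = width > DEFAULT_WINDOW

wide = required_window_width("invoice", "payment", 200)
```

With a maximum distance of 100 the required width of 114 bytes still fits into the default window, whereas a maximum distance of 200 requires 214 bytes and therefore a wider search window.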

 

Please note that the preferred method to find two search terms near to each other is the NEAR combination in the search term list, when two search terms are already combined with a logical AND, after they have been searched for separately.