Tokens

FreeBASIC

Interface

The basic public interface of the lexer is from lex.bas:
    • lexGetToken(): Retrieve current token's id, an FB_TK_* value.
    • lexGetLookAhead(N): Look ahead N tokens
    • lexSkipToken(): Go to next token
    • lexGetText(): Returns a zstring ptr to the text of the current token. This is how the values of string/number literals are retrieved; for other tokens (e.g. operators) it is their text representation.
    • Some more lexGet*() accessors for data of the current token
    • lexPeekLine(): Used by error reporting to retrieve the current line of code.

Current token + look ahead tokens

Tokens are fairly short-lived. There is only the current token and a few look ahead tokens in the token queue. That is all the parser needs to decipher FB code. The usual pattern is to check the current token, decide what to do next based on what it is, then skip it and move on. Backward movement is not possible. The file name, line number and token position shown during error reporting also come from the current lexer state.
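The check/decide/skip pattern can be sketched as follows. This is an illustration in C, not the compiler's FreeBASIC source: the TK_* ids, the Lexer struct and the lex_get_token()/lex_skip_token() helpers are hypothetical stand-ins for the FB_TK_* values, lexGetToken() and lexSkipToken(), and the "lexer" here is just a pre-tokenized array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token ids; the real ones are the FB_TK_* values. */
enum { TK_EOF, TK_EOL, TK_ID, TK_COMMA };

/* Stand-in lexer state: a pre-tokenized array instead of a real lexer. */
typedef struct {
    const int *tokens;  /* must end with TK_EOF */
    size_t pos;         /* index of the current token */
} Lexer;

/* Stand-in for lexGetToken(): id of the current token. */
static int lex_get_token(const Lexer *lx) {
    return lx->tokens[lx->pos];
}

/* Stand-in for lexSkipToken(): move forward only, never past EOF. */
static void lex_skip_token(Lexer *lx) {
    if (lx->tokens[lx->pos] != TK_EOF)
        lx->pos++;
}

/* Parses  ID (',' ID)*  and returns the number of identifiers,
   or -1 on a syntax error. Each step checks the current token,
   decides what to do, then skips it. */
static int parse_id_list(Lexer *lx) {
    int count = 0;
    assert(lx != NULL && lx->tokens != NULL);
    for (;;) {
        if (lex_get_token(lx) != TK_ID)
            return -1;              /* expected an identifier */
        lex_skip_token(lx);
        count++;
        if (lex_get_token(lx) != TK_COMMA)
            return count;           /* list ends; token is left for the caller */
        lex_skip_token(lx);
    }
}
```

After a successful call the first token that does not belong to the list (for example the EOL) is still the current token, so the caller can continue from there without ever moving backwards.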

The token queue is a static array of tokens with space for the current token plus the few look ahead tokens. Each token structure contains a fairly large (static) buffer for the token text. Each token has a pointer to the next one, so together they form a circular list. This is a cheap way to move forward and skip tokens without having to maintain an array index. Copying the tokens themselves around is out of the question because of the large text buffers. The "head" points to the current token; the next "k" tokens are look ahead tokens; the rest are unused. Skipping is simply "head = head->next". Unless the new head already contains a token (from an earlier look ahead), a new token is loaded into the new current token struct (via lexNextToken()). Look ahead works by loading the following tokens in the queue, without skipping the current one.
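A minimal C sketch of such a circular queue is shown here. It is an illustration under assumptions, not the layout used in lex.bas: the names TokenQueue, queue_look_ahead() and queue_skip(), the queue length, the buffer size and the "loaded" counter are all made up for this example, and load_token() is a toy stand-in for lexNextToken() that treats every non-blank character as one token.

```c
#include <assert.h>

#define QUEUE_LEN 4    /* current token + 3 look ahead slots (assumed size) */
#define TEXT_MAX  64   /* stand-in for the much larger real text buffers */

typedef struct Token {
    int id;
    char text[TEXT_MAX];
    struct Token *next;        /* link to the next slot; the last links to the first */
} Token;

typedef struct {
    Token slots[QUEUE_LEN];
    Token *head;               /* current token */
    int loaded;                /* slots holding a token, counted from head */
    const char *src;           /* toy input: one token per non-blank char */
} TokenQueue;

/* Toy stand-in for lexNextToken(): fills t from the input. */
static void load_token(TokenQueue *q, Token *t) {
    while (*q->src == ' ')
        q->src++;
    t->id = (unsigned char)*q->src;    /* id 0 marks the end of input */
    t->text[0] = *q->src;
    t->text[1] = '\0';
    if (*q->src != '\0')
        q->src++;
}

static void queue_init(TokenQueue *q, const char *src) {
    for (int i = 0; i < QUEUE_LEN; i++)
        q->slots[i].next = &q->slots[(i + 1) % QUEUE_LEN];
    q->head = &q->slots[0];
    q->src = src;
    load_token(q, q->head);
    q->loaded = 1;
}

/* Returns the n-th token after the current one, loading it if needed.
   The current token stays where it is. */
static const Token *queue_look_ahead(TokenQueue *q, int n) {
    Token *t = q->head;
    assert(n >= 0 && n < QUEUE_LEN);
    for (int i = 1; i <= n; i++) {
        t = t->next;
        if (i >= q->loaded) {          /* this slot has not been filled yet */
            load_token(q, t);
            q->loaded = i + 1;
        }
    }
    return t;
}

/* Moves to the next token: just follow the link, no copying. */
static void queue_skip(TokenQueue *q) {
    q->head = q->head->next;
    if (q->loaded > 1)
        q->loaded--;                   /* new head was filled by a look ahead */
    else
        load_token(q, q->head);        /* nothing pre-loaded: read a new token */
}
```

Because skipping only changes the head pointer, the text buffers never move, which is the point of linking the slots instead of shifting array elements.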

Tokenization
lex.bas:lexNextToken()

The lexer breaks down the file input into tokens. Conceptually, a token is an identifier, a keyword, a string literal, a number literal, an operator, EOL or EOF, or another character such as a parenthesis or comma. Each token has a unique value assigned to it that the parser uses to identify it, instead of doing string comparisons (which would be too slow).

lexNextToken() uses the current char, and if needed also the look ahead char, to parse the input. Number and string literals are handled here too. Alphanumeric identifiers are looked up in the symb hash table, which tells whether an identifier is a keyword, a macro, or another FB symbol (type, procedure, variable, ...).
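The following C sketch shows the general shape of such a tokenizer step: decide by the current char, peek at the look ahead char for two-character operators, and classify identifiers with a lookup. It is not the code of lexNextToken(). The token ids, the three-entry keyword table (a stand-in for the symb hash table) and the upper-casing of identifiers are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token ids for this sketch. */
enum { TK_EOF, TK_ID, TK_KEYWORD, TK_NUMBER, TK_LT, TK_LE, TK_NE, TK_OTHER };

typedef struct {
    int id;
    char text[32];
} Token;

/* Tiny stand-in for the symbol hash table lookup. */
static const char *const keywords[] = { "DIM", "IF", "THEN" };

static int is_keyword(const char *upper) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(upper, keywords[i]) == 0)
            return 1;
    return 0;
}

/* Reads one token starting at *src and advances *src past it. */
static Token next_token(const char **src) {
    Token t = { TK_EOF, "" };
    const char *p = *src;
    size_t n = 0;

    assert(src != NULL && *src != NULL);
    while (*p == ' ' || *p == '\t')
        p++;

    unsigned char c = (unsigned char)*p;          /* current char */
    if (c == '\0') {
        /* end of input: t.id is already TK_EOF */
    } else if (isalpha(c) || c == '_') {
        t.id = TK_ID;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < sizeof t.text - 1)            /* overlong names are truncated */
                t.text[n++] = (char)toupper((unsigned char)*p);
            p++;
        }
    } else if (isdigit(c)) {
        t.id = TK_NUMBER;
        while (isdigit((unsigned char)*p)) {
            if (n < sizeof t.text - 1)
                t.text[n++] = *p;
            p++;
        }
    } else if (c == '<') {
        char la = p[1];                           /* look ahead char */
        t.text[n++] = *p++;
        if (la == '=')      { t.id = TK_LE; t.text[n++] = *p++; }
        else if (la == '>') { t.id = TK_NE; t.text[n++] = *p++; }
        else                { t.id = TK_LT; }
    } else {
        t.id = TK_OTHER;
        t.text[n++] = *p++;
    }
    t.text[n] = '\0';

    /* The parser later compares t.id, never the text. */
    if (t.id == TK_ID && is_keyword(t.text))
        t.id = TK_KEYWORD;

    *src = p;
    return t;
}
```

For the input "if x1 <= 10" this yields a keyword, an identifier, a single TK_LE token and a number; the parser then only ever compares the integer ids.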

Identifiers containing dots (QB compatibility) and identifier type suffixes (as in stringvar$) are handled here too (but not namespace/structure member access). Tokens can have a data type associated with them. This is also used for number literals, which can have type suffixes (as in &hFFFFFFFFFFFFFFFFull).
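As an illustration of attaching a data type to a number literal token, the C sketch below parses a hexadecimal literal with an optional suffix. It is an assumption-laden example, not the lexer's code: the DataType enum and parse_hex_literal() are invented here, only the letter suffixes u, l, ul, ll and ull are handled, and their mapping to types follows the usual FreeBASIC literal rules as far as this sketch assumes them.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical data type tags for this sketch. */
typedef enum {
    DT_NONE,       /* no suffix: the type is decided by default rules elsewhere */
    DT_UINT, DT_LONG, DT_ULONG, DT_LONGINT, DT_ULONGINT
} DataType;

typedef struct {
    uint64_t value;
    DataType dtype;
} NumLiteral;

/* Parses "&h<hex digits>[suffix]" (case-insensitive).
   Returns 1 on success, 0 on malformed input or 64-bit overflow. */
static int parse_hex_literal(const char *s, NumLiteral *out) {
    static const struct { const char *suffix; DataType dtype; } suffixes[] = {
        { "", DT_NONE }, { "u", DT_UINT }, { "l", DT_LONG },
        { "ul", DT_ULONG }, { "ll", DT_LONGINT }, { "ull", DT_ULONGINT },
    };
    char lower[4];
    size_t n = 0;
    int digits = 0;

    assert(s != NULL && out != NULL);
    if (s[0] != '&' || tolower((unsigned char)s[1]) != 'h')
        return 0;
    s += 2;

    out->value = 0;
    while (isxdigit((unsigned char)*s)) {
        int c = tolower((unsigned char)*s++);
        if (out->value >> 60)
            return 0;                      /* another digit would overflow */
        out->value = out->value * 16
                   + (uint64_t)(isdigit(c) ? c - '0' : c - 'a' + 10);
        digits++;
    }
    if (digits == 0)
        return 0;

    /* Whatever follows the digits must be one of the known suffixes. */
    if (strlen(s) > 3)
        return 0;
    while (*s)
        lower[n++] = (char)tolower((unsigned char)*s++);
    lower[n] = '\0';

    for (size_t i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        if (strcmp(lower, suffixes[i].suffix) == 0) {
            out->dtype = suffixes[i].dtype;
            return 1;
        }
    }
    return 0;
}
```

With this sketch, the literal from the text, &hFFFFFFFFFFFFFFFFull, parses to the largest 64-bit unsigned value tagged DT_ULONGINT, while &h10 parses to 16 with no explicit type.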

Side note on single-line comments

Quite unusually, single-line comments are handled by the parser instead of being skipped in the lexer. This is done so that usage of REM can easily be restricted as in QB; after all, REM is more like a statement than a comment. Besides that, comments can contain QB meta statements, so comments cannot simply be ignored. Note that the parser will still skip the rest of a comment (without tokenizing it) if it does not find a QB meta statement.
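The decision the parser has to make for each comment can be sketched in C as below. Both helpers are hypothetical and simplified: the sketch assumes that a meta statement is recognized by a '$' as the first non-blank character after the comment marker (as in '$INCLUDE), and it works on a plain string rather than on the lexer's input buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Takes the text that follows the comment marker (' or REM).
   Returns a pointer to the '$' that starts a meta statement,
   or NULL if this is an ordinary comment. */
static const char *find_meta_statement(const char *comment) {
    assert(comment != NULL);
    while (*comment == ' ' || *comment == '\t')
        comment++;
    return (*comment == '$') ? comment : NULL;
}

/* Returns how many characters remain before the end of the line,
   i.e. how much an ordinary comment lets the parser skip untokenized. */
static size_t skip_to_eol(const char *p) {
    size_t n = 0;
    assert(p != NULL);
    while (p[n] != '\0' && p[n] != '\n' && p[n] != '\r')
        n++;
    return n;
}
```

An ordinary comment takes the skip_to_eol() path and is never tokenized; only when find_meta_statement() returns a position would the rest of the line be handed on for further parsing.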

(Multi-line comments are completely handled during tokenization though.)

File input
lex.bas:hReadChar()

The input file is opened in fb.bas:fbCompile(), and the file number is stored in the global env context (#includes are handled similarly in fb.bas:fbIncludeFile()). The lexer reads its input using the file number from the env context. It has a static zstring buffer that is used to stream the file contents in chunks (instead of reading character by character). For Unicode input, the lexer uses a wstring buffer and decodes UTF32 or UTF8 to UTF16. Once the lexer has advanced through all the chars in the buffer, it reads the next chunk from the file. EOF is represented by returning a NUL character (character code 0).
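The buffered reading scheme can be sketched in C like this. It is a simplified illustration, not hReadChar() itself: CharReader, reader_init() and reader_next() are invented names, the chunk size is deliberately tiny, C stdio replaces FreeBASIC file numbers, and the Unicode decoding path is left out.

```c
#include <assert.h>
#include <stdio.h>

#define CHUNK_SIZE 8   /* tiny on purpose; a real buffer would be far larger */

typedef struct {
    FILE *file;
    char buf[CHUNK_SIZE];
    size_t len;        /* number of valid bytes in buf */
    size_t pos;        /* index of the next unread byte */
} CharReader;

static void reader_init(CharReader *r, FILE *file) {
    assert(r != NULL && file != NULL);
    r->file = file;
    r->len = 0;
    r->pos = 0;
}

/* Returns the next character, or '\0' once the file is exhausted.
   The file is only touched when the buffer has been used up. */
static char reader_next(CharReader *r) {
    if (r->pos == r->len) {
        r->len = fread(r->buf, 1, sizeof r->buf, r->file);
        r->pos = 0;
        if (r->len == 0)
            return '\0';               /* EOF (or read error) */
    }
    return r->buf[r->pos++];
}
```

One consequence of using the NUL character as the EOF marker is that a zero byte inside the file cannot be told apart from the end of input in this sketch.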