226 lines
8.8 KiB
226 lines
8.8 KiB
14 years ago
|
How to write a scintilla lexer
|
||
|
|
||
|
A lexer for a particular language determines how a specified range of
|
||
|
text shall be colored. Writing a lexer is relatively straightforward
|
||
|
because the lexer need only color given text. The harder job of
|
||
|
determining how much text actually needs to be colored is handled by
|
||
|
Scintilla itself, that is, the lexer's caller.
|
||
|
|
||
|
|
||
|
Parameters
|
||
|
|
||
|
The lexer for language LLL has the following prototype:
|
||
|
|
||
|
static void ColouriseLLLDoc (
|
||
|
unsigned int startPos, int length,
|
||
|
int initStyle,
|
||
|
WordList *keywordlists[],
|
||
|
Accessor &styler);
|
||
|
|
||
|
The styler parameter is an Accessor object. The lexer must use this
|
||
|
object to access the text to be colored. The lexer gets the character
|
||
|
at position i using styler.SafeGetCharAt(i);
|
||
|
|
||
|
The startPos and length parameters indicate the range of text to be
|
||
|
recolored; the lexer must determine the proper color for all characters
|
||
|
in positions startPos through startPos+length.
|
||
|
|
||
|
The initStyle parameter indicates the initial state, that is, the state
|
||
|
at the character before startPos. States also indicate the coloring to
|
||
|
be used for a particular range of text.
|
||
|
|
||
|
Note: the character at StartPos is assumed to start a line, so if a
|
||
|
newline terminates the initStyle state the lexer should enter its
|
||
|
default state (or whatever state should follow initStyle).
|
||
|
|
||
|
The keywordlists parameter specifies the keywords that the lexer must
|
||
|
recognize. A WordList class object contains methods that make simplify
|
||
|
the recognition of keywords. Present lexers use a helper function
|
||
|
called classifyWordLLL to recognize keywords. These functions show how
|
||
|
to use the keywordlists parameter to recognize keywords. This
|
||
|
documentation will not discuss keywords further.
|
||
|
|
||
|
|
||
|
The lexer code
|
||
|
|
||
|
The task of a lexer can be summarized briefly: for each range r of
|
||
|
characters that are to be colored the same, the lexer should call
|
||
|
|
||
|
styler.ColourTo(i, state)
|
||
|
|
||
|
where i is the position of the last character of the range r. The lexer
|
||
|
should set the state variable to the coloring state of the character at
|
||
|
position i and continue until the entire text has been colored.
|
||
|
|
||
|
Note 1: the styler (Accessor) object remembers the i parameter in the
|
||
|
previous calls to styler.ColourTo, so the single i parameter suffices to
|
||
|
indicate a range of characters.
|
||
|
|
||
|
Note 2: As a side effect of calling styler.ColourTo(i,state), the
|
||
|
coloring states of all characters in the range are remembered so that
|
||
|
Scintilla may set the initStyle parameter correctly on future calls to
|
||
|
the
|
||
|
lexer.
|
||
|
|
||
|
|
||
|
Lexer organization
|
||
|
|
||
|
There are at least two ways to organize the code of each lexer. Present
|
||
|
lexers use what might be called a "character-based" approach: the outer
|
||
|
loop iterates over characters, like this:
|
||
|
|
||
|
lengthDoc = startPos + length ;
|
||
|
for (unsigned int i = startPos; i < lengthDoc; i++) {
|
||
|
chNext = styler.SafeGetCharAt(i + 1);
|
||
|
<< handle special cases >>
|
||
|
switch(state) {
|
||
|
// Handlers examine only ch and chNext.
|
||
|
// Handlers call styler.ColorTo(i,state) if the state changes.
|
||
|
case state_1: << handle ch in state 1 >>
|
||
|
case state_2: << handle ch in state 2 >>
|
||
|
...
|
||
|
case state_n: << handle ch in state n >>
|
||
|
}
|
||
|
chPrev = ch;
|
||
|
}
|
||
|
styler.ColourTo(lengthDoc - 1, state);
|
||
|
|
||
|
|
||
|
An alternative would be to use a "state-based" approach. The outer loop
|
||
|
would iterate over states, like this:
|
||
|
|
||
|
lengthDoc = startPos+lenth ;
|
||
|
for ( unsigned int i = startPos ;; ) {
|
||
|
char ch = styler.SafeGetCharAt(i);
|
||
|
int new_state = 0 ;
|
||
|
switch ( state ) {
|
||
|
// scanners set new_state if they set the next state.
|
||
|
case state_1: << scan to the end of state 1 >> break ;
|
||
|
case state_2: << scan to the end of state 2 >> break ;
|
||
|
case default_state:
|
||
|
<< scan to the next non-default state and set new_state >>
|
||
|
}
|
||
|
styler.ColourTo(i, state);
|
||
|
if ( i >= lengthDoc ) break ;
|
||
|
if ( ! new_state ) {
|
||
|
ch = styler.SafeGetCharAt(i);
|
||
|
<< set state based on ch in the default state >>
|
||
|
}
|
||
|
}
|
||
|
styler.ColourTo(lengthDoc - 1, state);
|
||
|
|
||
|
This approach might seem to be more natural. State scanners are simpler
|
||
|
than character scanners because less needs to be done. For example,
|
||
|
there is no need to test for the start of a C string inside the scanner
|
||
|
for a C comment. Also this way makes it natural to define routines that
|
||
|
could be used by more than one scanner; for example, a scanToEndOfLine
|
||
|
routine.
|
||
|
|
||
|
However, the special cases handled in the main loop in the
|
||
|
character-based approach would have to be handled by each state scanner,
|
||
|
so both approaches have advantages. These special cases are discussed
|
||
|
below.
|
||
|
|
||
|
Special case: Lead characters
|
||
|
|
||
|
Lead bytes are part of DBCS processing for languages such as Japanese
|
||
|
using an encoding such as Shift-JIS. In these encodings, extended
|
||
|
(16-bit) characters are encoded as a lead byte followed by a trail byte.
|
||
|
|
||
|
Lead bytes are rarely of any lexical significance, normally only being
|
||
|
allowed within strings and comments. In such contexts, lexers should
|
||
|
ignore ch if styler.IsLeadByte(ch) returns TRUE.
|
||
|
|
||
|
Note: UTF-8 is simpler than Shift-JIS, so no special handling is
|
||
|
applied for it. All UTF-8 extended characters are >= 128 and none are
|
||
|
lexically significant in programming languages which, so far, use only
|
||
|
characters in ASCII for operators, comment markers, etc.
|
||
|
|
||
|
|
||
|
Special case: Folding
|
||
|
|
||
|
Folding may be performed in the lexer function. It is better to use a
|
||
|
separate folder function as that avoids some troublesome interaction
|
||
|
between styling and folding. The folder function will be run after the
|
||
|
lexer function if folding is enabled. The rest of this section explains
|
||
|
how to perform folding within the lexer function.
|
||
|
|
||
|
During initialization, lexers that support folding set
|
||
|
|
||
|
bool fold = styler.GetPropertyInt("fold");
|
||
|
|
||
|
If folding is enabled in the editor, fold will be TRUE and the lexer
|
||
|
should call:
|
||
|
|
||
|
styler.SetLevel(line, level);
|
||
|
|
||
|
at the end of each line and just before exiting.
|
||
|
|
||
|
The line parameter is simply the count of the number of newlines seen.
|
||
|
It's initial value is styler.GetLine(startPos) and it is incremented
|
||
|
(after calling styler.SetLevel) whenever a newline is seen.
|
||
|
|
||
|
The level parameter is the desired indentation level in the low 12 bits,
|
||
|
along with flag bits in the upper four bits. The indentation level
|
||
|
depends on the language. For C++, it is incremented when the lexer sees
|
||
|
a '{' and decremented when the lexer sees a '}' (outside of strings and
|
||
|
comments, of course).
|
||
|
|
||
|
The following flag bits, defined in Scintilla.h, may be set or cleared
|
||
|
in the flags parameter. The SC_FOLDLEVELWHITEFLAG flag is set if the
|
||
|
lexer considers that the line contains nothing but whitespace. The
|
||
|
SC_FOLDLEVELHEADERFLAG flag indicates that the line is a fold point.
|
||
|
This normally means that the next line has a greater level than present
|
||
|
line. However, the lexer may have some other basis for determining a
|
||
|
fold point. For example, a lexer might create a header line for the
|
||
|
first line of a function definition rather than the last.
|
||
|
|
||
|
The SC_FOLDLEVELNUMBERMASK mask denotes the level number in the low 12
|
||
|
bits of the level param. This mask may be used to isolate either flags
|
||
|
or level numbers.
|
||
|
|
||
|
For example, the C++ lexer contains the following code when a newline is
|
||
|
seen:
|
||
|
|
||
|
if (fold) {
|
||
|
int lev = levelPrev;
|
||
|
|
||
|
// Set the "all whitespace" bit if the line is blank.
|
||
|
if (visChars == 0)
|
||
|
lev |= SC_FOLDLEVELWHITEFLAG;
|
||
|
|
||
|
// Set the "header" bit if needed.
|
||
|
if ((levelCurrent > levelPrev) && (visChars > 0))
|
||
|
lev |= SC_FOLDLEVELHEADERFLAG;
|
||
|
styler.SetLevel(lineCurrent, lev);
|
||
|
|
||
|
// reinitialize the folding vars describing the present line.
|
||
|
lineCurrent++;
|
||
|
visChars = 0; // Number of non-whitespace characters on the line.
|
||
|
levelPrev = levelCurrent;
|
||
|
}
|
||
|
|
||
|
The following code appears in the C++ lexer just before exit:
|
||
|
|
||
|
// Fill in the real level of the next line, keeping the current flags
|
||
|
// as they will be filled in later.
|
||
|
if (fold) {
|
||
|
// Mask off the level number, leaving only the previous flags.
|
||
|
int flagsNext = styler.LevelAt(lineCurrent);
|
||
|
flagsNext &= ~SC_FOLDLEVELNUMBERMASK;
|
||
|
styler.SetLevel(lineCurrent, levelPrev | flagsNext);
|
||
|
}
|
||
|
|
||
|
|
||
|
Don't worry about performance
|
||
|
|
||
|
The writer of a lexer may safely ignore performance considerations: the
|
||
|
cost of redrawing the screen is several orders of magnitude greater than
|
||
|
the cost of function calls, etc. Moreover, Scintilla performs all the
|
||
|
important optimizations; Scintilla ensures that a lexer will be called
|
||
|
only to recolor text that actually needs to be recolored. Finally, it
|
||
|
is not necessary to avoid extra calls to styler.ColourTo: the sytler
|
||
|
object buffers calls to ColourTo to avoid multiple updates of the
|
||
|
screen.
|
||
|
|
||
|
Page contributed by Edward K. Ream
|