Question (arch-what):
What is this project good for?
Answer:
TBD
Question (arch-overall):
Describe the overall architecture.
Answer:
Motivation
The Lexer module is being created to address problems that were discovered
with existing lexical support
(org.netbeans.editor.Syntax class and related classes).
The existing support is in fact a full lexical scanner
that is typically used only to recognize token identification
and boundaries; it does not produce permanent tokens.
This was important in JDK 1.1 times, when object creation was expensive,
but it is far less critical with today's JVMs.
Although the current approach is still usable, it has the following problems:
- Frequent lexing.
There are many operations in the editor that require lexing (typically
after modification) such as drawing, brace completion and highlighting,
code completion and text formatting, to name a few.
This leads to frequent lexing (typically of similar document regions).
To retain reasonable performance, the lexical scanner must remain
simple and efficient, which can be problematic in certain cases.
Producing any extra objects (e.g. temporary tokens from lexer generators
such as javacc or antlr) during scanning can noticeably slow down
the editor. This generally complicates the use of lexer generators
and leads to handcoded scanners, which only a few people
are willing to write and maintain.
- Poor language embedding capabilities.
The current approach
makes it difficult and inefficient to delegate lexing
among Syntax instances in a clean way
when language embedding is present.
Although a solution currently exists for embedding
Java and HTML into JSPs,
it is complex and hard to maintain
(see the more than 2000 lines long
web/jspsyntax/src/org/netbeans/modules/web/core/syntax/JspMultiSyntax.java).
- Suboptimal recovery after modifications.
As there are no permanent tokens,
recovery after a document modification requires relexing at least
one line of text (the lexer's state is stored at the beginning of each line,
so the lexer can only be restarted at line beginnings).
Although this is not an issue during regular typing, it becomes a problem
during operations that make many small modifications, such as
shifting a block of text or reformatting the document.
- Backward iteration through the tokens.
It is often necessary to traverse tokens in the backward
direction (e.g. when searching for a matching opening brace or during formatting).
With the current implementation, backward traversal is unnatural
and inefficient: temporary tokens are created in batches of increasing size,
and each batch requires restarting the lexer from the beginning
of the corresponding line.
- Large tokens handling.
As there are no permanent tokens,
it is not possible to determine whether a document modification
has really affected the validity of a particular token;
the whole token must always be relexed.
This has negative performance implications
when editing large tokens such as large XML comments
(see Issue 49296
for details).
Requirements
The requirements for the new framework essentially address the issues above:
- Eliminate frequent lexing.
This requirement leads to a solution that creates permanent tokens
that can be reused by subsequent lexing requests.
The lifetime of such permanent tokens is a subject for further discussion.
- Language embedding support.
The new framework should support an efficient and easy-to-use
language embedding capability.
Besides Java and HTML in JSPs, demand for several other language embeddings
already exists. For example, JSP support should be extended
to include other languages such as JavaScript.
Detailed syntax coloring of javadoc is another example.
Multi-level language embedding should be supported,
e.g. javadoc in Java in JSP.
- Improve recovery after modifications if possible.
Ideally, only the tokens really affected by the modification
should be relexed.
- Support backward iteration through the tokens.
If permanent tokens are created, it should be easy
to iterate through them in the backward direction.
- Efficient large tokens handling.
For large tokens there should be a mechanism
that checks whether a particular
document modification inside a large token
really affects it or whether just the token's boundaries
need to be updated.
- Reasonable memory requirements.
A typical file contains a large number of tokens;
for example, a line of Java code can have from 1 to 20 or even more tokens.
It is therefore important to shrink the memory requirements
per token instance to a minimum and to control the lifetime
of tokens as well.
Observations
When searching for the best lexical framework, several observations
were made about competing solutions:
- Line-based lexer state storing.
Some frameworks, including the existing NetBeans one, store
the lexer state at the beginning of each line
(line representation is meant here, e.g. a line element
in Swing editors).
The stored state allows
the lexer to be restarted at the beginning of a line
instead of lexing from the beginning of the document each time.
Although this approach works fine, it may be difficult to implement when
the updating of line representations is controlled separately.
For example, in Swing-based editors the line elements
are updated independently of the lexer states that
nevertheless need to be stored in them.
It is rather complex to properly
update the lexer states in all situations,
mainly during undo/redo handling.
- Linked lists of tokens.
Although it may be convenient to design token
instances to be linked together into a linked list,
this approach has drawbacks as well.
As many tokens share the same identification
and text content, flyweight tokens could be used
to represent them; linked lists would, however,
eliminate that possibility.
Moreover, linked lists do not naturally allow efficient
positioning by offset (e.g. by using binary search).
- Token's text body representation.
How the token body should be represented is a tough question.
The first idea is of course java.lang.String,
which has the best stability and support. Unfortunately it is too
expensive in terms of storage
(char array, offset, length and cached hash value).
A substring over a large string would eliminate the need to create
many small char arrays, but Swing document
contents are not modeled as strings (obviously because of modifications,
which would contradict String's immutability), so it would
lead to duplication of data.
Some frameworks use javax.swing.text.Segment;
however, segments are only stable until the next document modification
occurs.
Moreover, segments lead to data copying if the particular
text spans a character gap in the document's character data representation
(see e.g. javax.swing.text.GapContent).
Use of Segments in an API is also disputable, as they directly access
the document's content data, so a misbehaving client can damage the document's data.
The best choice seems to be the java.lang.CharSequence
interface introduced in J2SE 1.4.
Good implementations allow efficient access
to the document's character data without risk of corruption,
and the lifetime of CharSequence implementations can be well controlled
(see the first sketch after this list).
- Token identification.
Some frameworks use just a plain int
for the token identification.
An int has the great advantage that it can be used directly
in switch-case statements.
On the other hand, an int is just a number,
so there must be extra infrastructure to give it the right
semantics, and it can easily get misused.
It may be better to "encapsulate" the original int
identification in a token identification object
that can carry additional information
such as the token's name (see the second sketch after this list).
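A minimal sketch of the kind of CharSequence implementation meant above: a read-only view over shared character data that avoids copying and cannot be used to corrupt the underlying document data. The class name and the char-array storage are illustrative assumptions only, not part of the proposed API:

    // Illustrative sketch only; not part of the proposed API.
    // A read-only CharSequence view over shared character data.
    final class CharSubSequence implements CharSequence {
        private final char[] data; // shared with the owner; never exposed
        private final int start;
        private final int length;

        CharSubSequence(char[] data, int start, int length) {
            this.data = data;
            this.start = start;
            this.length = length;
        }

        public int length() {
            return length;
        }

        public char charAt(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index=" + index);
            }
            return data[start + index];
        }

        public CharSequence subSequence(int from, int to) {
            return new CharSubSequence(data, start + from, to - from);
        }

        public String toString() { // copies characters only on explicit request
            return new String(data, start, length);
        }
    }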
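Similarly, a sketch of the "encapsulated" token identification; the TokenId and JavaTokenId names are illustrative assumptions, not the final API (each type would live in its own source file):

    // Illustrative sketch only: wrapping the plain int identification
    // in an object that also carries the token's name.
    public interface TokenId {
        String name();    // unique name among the token ids of a language
        int ordinal();    // the original numeric identification
    }

    // A language then defines its token ids as constants:
    public final class JavaTokenId implements TokenId {
        public static final JavaTokenId KEYWORD = new JavaTokenId("keyword", 0);
        public static final JavaTokenId IDENTIFIER = new JavaTokenId("identifier", 1);

        private final String name;
        private final int ordinal;

        private JavaTokenId(String name, int ordinal) {
            this.name = name;
            this.ordinal = ordinal;
        }

        public String name() { return name; }
        public int ordinal() { return ordinal; }
    }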
Main Features
The proposed lexer module solution should address
the requirements above completely. The main features of the proposed solution are:
- Fully functional lexer.
The lexer module is a fully functional lexer
which, in its scope of functionality,
fully replaces the original solution.
It can be used for many purposes
including syntax coloring, brace matching,
code completion, formatting and parsing.
- Incremental lexing.
The lexer module is able to recover the existing
token sequence after document modifications.
It is able to precisely
identify the list of tokens that need to be replaced
due to a document modification, remove them
and add the new tokens so that the token sequence
is accurate again.
The list of removed and added tokens
is delivered to listening clients for analysis
(see the listener sketch after this feature list).
- Base for incremental parsing.
Thanks to the incremental lexing support,
the lexer module is a suitable base for an incremental
parser.
A well implemented incremental parser has advantages
that the traditional batch parser cannot match,
because the batch parser does not watch how the document evolves
through the individual edits.
For example, the incremental parser can adapt well
to the fact that the user
has added a left brace '{' to a syntactically well
formed document. Although this edit creates a syntax error,
the incremental parser can limit its scope
by creating a fake '}' at the appropriate place
in its syntax tree and
keep the rest of the document valid.
The batch parser, on the other hand, will ignore the rest
of the methods in the class (the extra '{' makes
their declarations part of the erroneous method's body)
and only at the end of the source will it find out
that one '}' is missing.
Although there are heuristic ways to improve
the situation of batch parsers, the essential difference
still remains.
- Optimal token replacement.
A mathematical proof exists that the tokens
replaced by the lexer module's incremental algorithm are the only ones
that really need to be replaced.
- Efficient large tokens handling.
Besides the optimal token replacement,
there is an additional mechanism of token validators
that improves efficiency, especially when editing
inside large tokens.
A token validator can confirm that a particular
token is not affected by the performed document modification
and can stay as is, with just its boundaries updated.
This improves performance because a token validator
implementation typically looks only at a few
characters adjacent to the modification,
while relexing the token would process
at least all the characters of the token
(see the validator sketch after this feature list).
- Batch lexing support.
Besides incremental lexing, regular batch lexing
is also supported by the lexer framework.
It allows many tools working with tokens (e.g. parsers)
to run without the necessity of creating a document
when the document is not going to be edited.
As a side effect of the batch lexing support, the incrementality
of the lexers can be well tested by comparing the token sequence
updated by the incremental algorithm with the token sequence
obtained by batch lexing (from the beginning to the end
of the document).
- Lexer generators support.
Besides handcoded lexers, the lexer module
can support lexers produced by popular lexer generators.
Currently the Antlr and JavaCC lexer generators
are supported.
There are only two extra requirements on a batch
lexer for it to be usable by the lexer framework
(see the interface sketch after this feature list):
- It must be transparent which characters
the batch lexer had to read in order
to recognize a token.
- The lexer must be able to express its internal state
(as a java.lang.Object) on token boundaries
and it must be able to restart itself
from that state.
- Language embedding.
Language embedding in the lexer module tries to be as natural as possible
while retaining simplicity and efficiency.
The language embedding primarily attempts to reuse
existing lexers for the nested languages.
- Arbitrary nesting of language embeddings.
The lexer module is specifically designed to support
many nesting levels without compromising performance.
Thus the number of nesting levels can be driven
by the logic of the languages rather than by the goal
of minimizing the number of levels to retain reasonable performance.
- Optimized memory footprint.
A significant amount of work was spent on fitting
a typical token instance into 24 bytes (not including the characters
of the token body).
In addition, there is flyweight token support that can further
decrease the memory footprint.
- Threading.
The lexer framework can be accessed by multiple threads
simultaneously. Each thread must first acquire
the document read lock (or write lock) to guarantee the document's stability.
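To illustrate the incremental lexing feature above: the delivery of removed and added tokens to listening clients could take roughly the following shape. Every name in this sketch is an assumption, not the final API (each type in its own source file):

    // Illustrative sketch only: notification about tokens removed and
    // added by the incremental algorithm after a document modification.
    public interface TokenChangeListener {
        void tokenChanged(TokenChange change);
    }

    public interface TokenChange {
        int tokenIndex();              // index of the first replaced token
        java.util.List removedTokens();
        java.util.List addedTokens();  // both lists contain Token instances
    }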
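The token validators mentioned above could have roughly the following form; the interface and method names are illustrative assumptions:

    // Illustrative sketch only: cheap validation of a large token
    // after an insertion inside it.
    public interface TokenValidator {
        /**
         * @return true if the token remains valid and only its length
         *         needs to be updated; false to force relexing of the token.
         */
        boolean validateInsertion(CharSequence tokenText,
                                  int offsetInToken,
                                  CharSequence insertedText);
    }

For example, a validator for a large XML comment token only needs to check whether the inserted characters, together with their immediate neighbors, could form the terminating "-->"; if not, the comment token stays valid and merely grows.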
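The two extra requirements on generated batch lexers listed above translate into roughly this kind of interface; the names are assumptions, not a commitment to a concrete API (TokenId is the identification object sketched in the Observations section):

    // Illustrative sketch only: a batch lexer (e.g. one wrapping an Antlr
    // or JavaCC generated lexer) that can be restarted at token boundaries.
    public interface RestartableLexer {
        /** Recognize the next token's identification; null at end of input. */
        TokenId nextToken();

        /** Number of input characters the last recognized token consumed. */
        int tokenLength();

        /** Internal state at the current token boundary (null for the initial state). */
        Object state();

        /** Restart lexing of the given input at the given offset from a previously obtained state. */
        void restart(CharSequence input, int offset, Object state);
    }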
Question (arch-usecases):
Describe the main
use cases of the new API. Who will use it under
what circumstances? What kind of code would typically need to be written
to use the module?
Answer:
General token exploration
Exploration of the token sequence can be performed
in a bidirectional, iterator-like way.
It will be possible to efficiently position (by binary search) in the token
sequence at a given offset, as sketched below.
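A usage sketch, assuming a TokenSequence-like bidirectional iterator with move/moveNext/movePrevious/token operations; all of these names are assumptions, not the final API:

    // Illustrative sketch only: bidirectional exploration of tokens.
    void exploreAround(TokenSequence ts, int offset) {
        ts.move(offset);            // binary search to the token at the offset
        while (ts.moveNext()) {     // iterate forward over the remaining tokens
            Token token = ts.token();
            System.out.println(token.id().name() + ": \"" + token.text() + "\"");
        }
        ts.move(offset);
        while (ts.movePrevious()) { // or iterate backward from the same place
            // process ts.token()
        }
    }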
Syntax Coloring
Swing views can be created for tokens provided by the lexer framework.
As each token's identification has a name that is unique among the token names
of the given language, the view can look up text attributes in a map keyed
by the token's name, as sketched below.
Upon modification, some tokens can be removed and new ones created.
The views will listen for token changes and update themselves accordingly.
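For example, with the same assumed TokenSequence names as above and a coloring map keyed by token name:

    // Illustrative sketch only: resolving coloring attributes by token name.
    void paintTokens(TokenSequence ts, java.util.Map colorings,
                     int startOffset, int endOffset) {
        ts.move(startOffset);
        while (ts.moveNext() && ts.offset() < endOffset) {
            javax.swing.text.AttributeSet attrs = (javax.swing.text.AttributeSet)
                    colorings.get(ts.token().id().name());
            // render ts.token() with attrs ...
        }
    }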
Brace matching
The brace matching code will first position to the starting brace
in the token sequence.
It will then iterate in either the forward or backward direction
until it reaches the matching brace or the start/end of the token sequence,
as sketched below.
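A backward-matching sketch under the same assumed names; the LBRACE/RBRACE token names are hypothetical:

    // Illustrative sketch only: searching backward for the matching '{'.
    int findMatchingOpenBrace(TokenSequence ts, int closingBraceOffset) {
        ts.move(closingBraceOffset);
        ts.moveNext();              // position on the '}' token itself
        int depth = 0;
        while (ts.movePrevious()) {
            String name = ts.token().id().name();
            if ("RBRACE".equals(name)) {
                depth++;            // a nested '}' opens one more level
            } else if ("LBRACE".equals(name)) {
                if (depth == 0) {
                    return ts.offset(); // offset of the matching '{'
                }
                depth--;
            }
        }
        return -1;                  // unbalanced; start of sequence reached
    }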
Language embedding
It will be possible to ask the token sequence whether
a particular token has language embedding (is branched into sub-tokens)
or not.
The branching will be done lazily, upon request, as sketched below.
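Under the same assumed names, descending into an embedded language could look like this; embedded() is an assumption for the lazy branching operation:

    // Illustrative sketch only: lazily branching a token into sub-tokens.
    void exploreEmbedding(TokenSequence ts, int offset) {
        ts.move(offset);
        ts.moveNext();
        TokenSequence embedded = ts.embedded(); // null if no embedding
        if (embedded != null) {
            while (embedded.moveNext()) {
                // process tokens of the embedded language
            }
        }
    }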
Question (arch-time):
What are the time estimates of the work?
Answer:
TBD
Question (arch-quality):
How will the quality
of your code be tested and
how are future regressions going to be prevented?
Answer:
The lexer module will primarily be tested by unit tests.
The primary subject of testing is the TokenUpdater,
which contains the incremental algorithm. It will be tested
with a special lexer which (when used by the algorithm)
verifies that the incremental algorithm is correct and optimal.
Another set of tests will be handy for all language developers
because it will allow them to verify that a lexer is correct
from the incrementality point of view.
Such a test simply maintains a document and makes modifications to it.
Once a modification is done, the underlying token sequence
gets updated by the token updater. The test then takes a snapshot
of the document's text and runs the batch lexer on it to verify
that the token sequence produced by the batch lexer is exactly the same
as the one maintained by the token updater (see the sketch below).
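A JUnit-style sketch of that consistency check; TokenSequence and batchLex() are assumptions standing for whatever the framework will actually provide:

    // Illustrative sketch only: after each modification, relex the whole
    // snapshot in batch mode and compare with the incremental sequence.
    void assertTokensConsistent(javax.swing.text.Document doc,
                                TokenSequence incremental) throws Exception {
        String snapshot = doc.getText(0, doc.getLength());
        TokenSequence batch = batchLex(snapshot); // hypothetical batch relexing
        incremental.move(0);
        while (batch.moveNext()) {
            assertTrue(incremental.moveNext());
            assertEquals(batch.token().id(), incremental.token().id());
            assertEquals(batch.token().text().toString(),
                         incremental.token().text().toString());
        }
        assertFalse(incremental.moveNext()); // same number of tokens
    }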
The modifications of the document can be randomized,
and it will be possible to assign individual characters a ratio
of how frequently they should appear in the inserted text.
Finally, there will be a set of tests that check the
memory requirements of token instances.