Question (arch-what):
What is this project good for?
Answer:
TBD
Question (arch-overall):
Describe the overall architecture.
Answer:
Motivation
The Lexer module is being created to address problems that were discovered
with existing lexical support
(org.netbeans.editor.Syntax class and related classes).
The existing support is in fact a full lexical scanner
that is typically used only to recognize token identification
and boundaries; it does not produce permanent tokens.
This was important in JDK 1.1 times, when object creation was expensive,
but it is far less critical with today's JVMs.
Although the current approach is still usable, it has the following problems:
- Frequent lexing.
There are many operations in the editor that require lexing (typically
after modification) such as drawing, brace completion and highlighting,
code completion and text formatting, to name a few.
This leads to frequent lexing (typically of similar document regions).
To retain reasonable performance, the lexical scanner must remain
simple and efficient, which can be problematic in certain cases.
Producing any extra objects (e.g. temporary tokens from lexer generators
such as javacc or antlr) during scanning can noticeably slow down
the editor. This generally complicates the use of lexer generators
and leads to handcoded scanners, which only a few people
are willing to write and maintain.
- Poor language embedding capabilities.
The current approach
makes it difficult and inefficient to delegate lexing
among Syntax instances in a clean way
when language embedding is present.
Although a solution currently exists for embedding
Java and HTML into JSPs,
it is complex and hard to maintain
(see the more than 2000 lines long
web/jspsyntax/src/org/netbeans/modules/web/core/syntax/JspMultiSyntax.java).
- Suboptimal recovery after modifications.
As there are no permanent tokens,
recovery after a document modification requires relexing at least
one line of text (the lexer's state is stored at the beginning of each line,
so the lexer can only be restarted at line beginnings).
Although this is not an issue during regular typing, it becomes a problem
during operations that make many small modifications, such as
shifting a block of text or reformatting the document.
- Backward iteration through the tokens.
It is often necessary to traverse tokens in the backward
direction (e.g. when searching for a matching opening brace or during formatting).
With the current implementation, backward traversal is unnatural
and inefficient: temporary tokens are created in batches of increasing size,
and each batch requires restarting the lexer from the beginning
of the corresponding line.
- Large tokens handling.
As there are no permanent tokens,
it is not possible to determine whether a document modification
has really affected the validity of a particular token;
the whole token must always be relexed.
This has negative performance implications
when editing large tokens such as large XML comments
(see Issue 49296
for details).
Requirements
The requirements for the new framework essentially address the issues above:
- Eliminate frequent lexing.
This requirement leads to a solution that creates permanent tokens
that can be reused by subsequent lexing requests.
The lifetime of such permanent tokens is a subject for further discussion.
- Language embedding support.
The new framework should support an efficient and easy-to-use
language embedding capability.
Besides Java and HTML in JSPs, demand for several other language embeddings
already exists. For example, JSP support should be extended
to include other languages such as JavaScript.
Detailed syntax coloring of javadoc is another example.
Multi-level language embedding should be supported,
e.g. javadoc in Java in JSP.
- Improve recovery after modifications if possible.
Ideally, only the tokens really affected by the modification
should be relexed.
- Support backward iteration through the tokens.
If permanent tokens are created, it should be easy
to iterate through them in the backward direction.
- Efficient large tokens handling.
For large tokens there should be a mechanism
that checks whether a particular
document modification inside a large token
really affects it or whether just the token's boundaries
need to be updated.
- Reasonable memory requirements.
A typical file contains a large number of tokens;
for example, a line of Java code can have from 1 to 20 or even more tokens.
It is therefore important to shrink the memory requirements
per token instance to a minimum and to control the lifetime
of tokens as well.
Observations
When searching for the best lexical framework, several observations
were made about competing solutions:
- Line-based lexer state storing.
Some frameworks, including the existing NetBeans one, store
the lexer state at the beginning of each line
(line representation is meant here, e.g. a line element
in Swing editors).
The stored state allows
the lexer to be restarted at the beginning of a line
instead of lexing from the beginning of the document each time.
Although this approach works fine, it may be difficult to implement when
the updating of line representations is controlled separately.
For example, in Swing-based editors the line elements
are updated independently of the lexer states that
nevertheless need to be stored in them.
It is rather complex to properly
update the lexer states in all situations,
mainly during undo/redo handling.
- Linked lists of tokens.
Although it may be convenient to design token
instances to be linked together into a linked list,
this approach has drawbacks as well.
As many tokens share the same identification
and text content, flyweight tokens could be used
to represent them; linked lists would, however,
eliminate that possibility.
Moreover, linked lists do not naturally allow efficient
positioning by offset (e.g. by using binary search).
- Token's text body representation.
How the token body should be represented is a tough question.
The first idea is of course java.lang.String,
which has the best stability and support. Unfortunately it is too
expensive in terms of storage
(char array, offset, length and cached hash value).
A substring over a large string would eliminate the need to create
many small char arrays, but Swing document
contents are not modeled as strings (obviously because of modifications,
which would contradict String's immutability), so it would
lead to duplication of data.
Some frameworks use javax.swing.text.Segment;
however, segments are only stable until the next document modification
occurs.
Moreover, segments lead to data copying if the particular
text spans a character gap in the document's character data representation
(see e.g. javax.swing.text.GapContent).
Use of Segments in an API is also disputable, as they directly access
the document's content data, so a misbehaving client can damage the document's data.
The best choice seems to be the java.lang.CharSequence
interface introduced in J2SE 1.4.
Good implementations allow efficient access
to the document's character data without risk of corruption,
and the lifetime of CharSequence implementations can be well controlled
(see the first sketch after this list).
- Token identification.
Some frameworks use just a plain int
for the token identification.
An int has the great advantage that it can be used directly
in switch-case statements.
On the other hand, an int is just a number,
so there must be extra infrastructure to give it the right
semantics, and it can easily get misused.
It may be better to "encapsulate" the original int
identification in a token identification object
that can carry additional information
such as the token's name (see the second sketch after this list).
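A minimal sketch of the kind of CharSequence implementation meant above: a read-only view over shared character data that avoids copying and cannot be used to corrupt the underlying document data. The class name and the char-array storage are illustrative assumptions only, not part of the proposed API:

    // Illustrative sketch only; not part of the proposed API.
    // A read-only CharSequence view over shared character data.
    final class CharSubSequence implements CharSequence {
        private final char[] data; // shared with the owner; never exposed
        private final int start;
        private final int length;

        CharSubSequence(char[] data, int start, int length) {
            this.data = data;
            this.start = start;
            this.length = length;
        }

        public int length() {
            return length;
        }

        public char charAt(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index=" + index);
            }
            return data[start + index];
        }

        public CharSequence subSequence(int from, int to) {
            return new CharSubSequence(data, start + from, to - from);
        }

        public String toString() { // copies characters only on explicit request
            return new String(data, start, length);
        }
    }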
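Similarly, a sketch of the "encapsulated" token identification; the TokenId and JavaTokenId names are illustrative assumptions, not the final API (each type would live in its own source file):

    // Illustrative sketch only: wrapping the plain int identification
    // in an object that also carries the token's name.
    public interface TokenId {
        String name();    // unique name among the token ids of a language
        int ordinal();    // the original numeric identification
    }

    // A language then defines its token ids as constants:
    public final class JavaTokenId implements TokenId {
        public static final JavaTokenId KEYWORD = new JavaTokenId("keyword", 0);
        public static final JavaTokenId IDENTIFIER = new JavaTokenId("identifier", 1);

        private final String name;
        private final int ordinal;

        private JavaTokenId(String name, int ordinal) {
            this.name = name;
            this.ordinal = ordinal;
        }

        public String name() { return name; }
        public int ordinal() { return ordinal; }
    }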
Main Features
The proposed lexer module solution should address
the requirements above completely. The main features of the proposed solution are:
- Fully functional lexer.
The lexer module is a fully functional lexer
which, in its scope of functionality,
fully replaces the original solution.
It can be used for many purposes
including syntax coloring, brace matching,
code completion, formatting and parsing.
- Incremental lexing.
The lexer module is able to recover the existing
token sequence after document modifications.
It is able to precisely
identify the list of tokens that need to be replaced
due to a document modification, remove them
and add the new tokens so that the token sequence
is accurate again.
The list of removed and added tokens
is delivered to listening clients for analysis
(see the listener sketch after this feature list).
- Base for incremental parsing.
Thanks to the incremental lexing support,
the lexer module is a suitable base for an incremental
parser.
A well implemented incremental parser has advantages
that the traditional batch parser cannot match,
because the batch parser does not watch how the document evolves
through the individual edits.
For example, the incremental parser can adapt well
to the fact that the user
has added a left brace '{' to a syntactically well
formed document. Although this edit creates a syntax error,
the incremental parser can limit its scope
by creating a fake '}' at the appropriate place
in its syntax tree and
keep the rest of the document valid.
The batch parser, on the other hand, will ignore the rest
of the methods in the class (the extra '{' makes
their declarations part of the erroneous method's body)
and only at the end of the source will it find out
that one '}' is missing.
Although there are heuristic ways to improve
the situation of batch parsers, the essential difference
still remains.
- Optimal token replacement.
A mathematical proof exists that the tokens
replaced by the lexer module's incremental algorithm are the only ones
that really need to be replaced.
- Efficient large tokens handling.
Besides the optimal token replacement,
there is an additional mechanism of token validators
that improves efficiency, especially when editing
inside large tokens.
A token validator can confirm that a particular
token is not affected by the performed document modification
and can stay as is, with just its boundaries updated.
This improves performance because a token validator
implementation typically looks only at a few
characters adjacent to the modification,
while relexing the token would process
at least all the characters of the token
(see the validator sketch after this feature list).
- Batch lexing support.
Besides incremental lexing, regular batch lexing
is also supported by the lexer framework.
It allows many tools working with tokens (e.g. parsers)
to run without the necessity of creating a document
when the document is not going to be edited.
As a side effect of the batch lexing support, the incrementality
of the lexers can be well tested by comparing the token sequence
updated by the incremental algorithm with the token sequence
obtained by batch lexing (from the beginning to the end
of the document).
- Lexer generators support.
Besides handcoded lexers, the lexer module
can support lexers produced by popular lexer generators.
Currently the Antlr and JavaCC lexer generators
are supported.
There are only two extra requirements on a batch
lexer for it to be usable by the lexer framework
(see the interface sketch after this feature list):
- It must be transparent which characters
the batch lexer had to read in order
to recognize a token.
- The lexer must be able to express its internal state
(as a java.lang.Object) on token boundaries
and it must be able to restart itself
from that state.
- Language embedding.
Language embedding in the lexer module tries to be as natural as possible
while retaining simplicity and efficiency.
The language embedding primarily attempts to reuse
existing lexers for the nested languages.
- Arbitrary nesting of language embeddings.
The lexer module is specifically designed to support
many nesting levels without compromising performance.
Thus the number of nesting levels can be driven
by the logic of the languages rather than by the goal
of minimizing the number of levels to retain reasonable performance.
- Optimized memory footprint.
A significant amount of work was spent on fitting
a typical token instance into 24 bytes (not including the characters
of the token body).
In addition, there is flyweight token support that can further
decrease the memory footprint.
- Threading.
The lexer framework can be accessed by multiple threads
simultaneously. Each thread must first acquire
the document read lock (or write lock) to guarantee the document's stability.
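To illustrate the incremental lexing feature above: the delivery of removed and added tokens to listening clients could take roughly the following shape. Every name in this sketch is an assumption, not the final API (each type in its own source file):

    // Illustrative sketch only: notification about tokens removed and
    // added by the incremental algorithm after a document modification.
    public interface TokenChangeListener {
        void tokenChanged(TokenChange change);
    }

    public interface TokenChange {
        int tokenIndex();              // index of the first replaced token
        java.util.List removedTokens();
        java.util.List addedTokens();  // both lists contain Token instances
    }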
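The token validators mentioned above could have roughly the following form; the interface and method names are illustrative assumptions:

    // Illustrative sketch only: cheap validation of a large token
    // after an insertion inside it.
    public interface TokenValidator {
        /**
         * @return true if the token remains valid and only its length
         *         needs to be updated; false to force relexing of the token.
         */
        boolean validateInsertion(CharSequence tokenText,
                                  int offsetInToken,
                                  CharSequence insertedText);
    }

For example, a validator for a large XML comment token only needs to check whether the inserted characters, together with their immediate neighbors, could form the terminating "-->"; if not, the comment token stays valid and merely grows.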
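The two extra requirements on generated batch lexers listed above translate into roughly this kind of interface; the names are assumptions, not a commitment to a concrete API (TokenId is the identification object sketched in the Observations section):

    // Illustrative sketch only: a batch lexer (e.g. one wrapping an Antlr
    // or JavaCC generated lexer) that can be restarted at token boundaries.
    public interface RestartableLexer {
        /** Recognize the next token's identification; null at end of input. */
        TokenId nextToken();

        /** Number of input characters the last recognized token consumed. */
        int tokenLength();

        /** Internal state at the current token boundary (null for the initial state). */
        Object state();

        /** Restart lexing of the given input at the given offset from a previously obtained state. */
        void restart(CharSequence input, int offset, Object state);
    }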
Question (arch-usecases):
Describe the main
use cases of the new API. Who will use it under
what circumstances? What kind of code would typically need to be written
to use the module?
Answer:
General token exploration
Exploration of the token sequence can be performed
in a bidirectional, iterator-like way.
It will be possible to efficiently position (by binary search) in the token
sequence at a given offset, as sketched below.
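A usage sketch, assuming a TokenSequence-like bidirectional iterator with move/moveNext/movePrevious/token operations; all of these names are assumptions, not the final API:

    // Illustrative sketch only: bidirectional exploration of tokens.
    void exploreAround(TokenSequence ts, int offset) {
        ts.move(offset);            // binary search to the token at the offset
        while (ts.moveNext()) {     // iterate forward over the remaining tokens
            Token token = ts.token();
            System.out.println(token.id().name() + ": \"" + token.text() + "\"");
        }
        ts.move(offset);
        while (ts.movePrevious()) { // or iterate backward from the same place
            // process ts.token()
        }
    }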
Syntax Coloring
Swing views can be created for tokens provided by the lexer framework.
As each token's identification has a name that is unique among the token names
of the given language, the view can look up text attributes in a map keyed
by the token's name, as sketched below.
Upon modification, some tokens can be removed and new ones created.
The views will listen for token changes and update themselves accordingly.
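For example, with the same assumed TokenSequence names as above and a coloring map keyed by token name:

    // Illustrative sketch only: resolving coloring attributes by token name.
    void paintTokens(TokenSequence ts, java.util.Map colorings,
                     int startOffset, int endOffset) {
        ts.move(startOffset);
        while (ts.moveNext() && ts.offset() < endOffset) {
            javax.swing.text.AttributeSet attrs = (javax.swing.text.AttributeSet)
                    colorings.get(ts.token().id().name());
            // render ts.token() with attrs ...
        }
    }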
Brace matching
The brace matching code will first position to the starting brace
in the token sequence.
It will then iterate in either the forward or backward direction
until it reaches the matching brace or the start/end of the token sequence,
as sketched below.
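A backward-matching sketch under the same assumed names; the LBRACE/RBRACE token names are hypothetical:

    // Illustrative sketch only: searching backward for the matching '{'.
    int findMatchingOpenBrace(TokenSequence ts, int closingBraceOffset) {
        ts.move(closingBraceOffset);
        ts.moveNext();              // position on the '}' token itself
        int depth = 0;
        while (ts.movePrevious()) {
            String name = ts.token().id().name();
            if ("RBRACE".equals(name)) {
                depth++;            // a nested '}' opens one more level
            } else if ("LBRACE".equals(name)) {
                if (depth == 0) {
                    return ts.offset(); // offset of the matching '{'
                }
                depth--;
            }
        }
        return -1;                  // unbalanced; start of sequence reached
    }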
Language embedding
It will be possible to ask the token sequence whether
a particular token has language embedding (is branched into sub-tokens)
or not.
The branching will be done lazily, upon request, as sketched below.
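Under the same assumed names, descending into an embedded language could look like this; embedded() is an assumption for the lazy branching operation:

    // Illustrative sketch only: lazily branching a token into sub-tokens.
    void exploreEmbedding(TokenSequence ts, int offset) {
        ts.move(offset);
        ts.moveNext();
        TokenSequence embedded = ts.embedded(); // null if no embedding
        if (embedded != null) {
            while (embedded.moveNext()) {
                // process tokens of the embedded language
            }
        }
    }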
Question (arch-time):
What are the time estimates of the work?
Answer:
TBD
Question (arch-quality):
How will the quality
of your code be tested and
how are future regressions going to be prevented?
Answer:
The lexer module will primarily be tested by unit tests.
The primary subject of testing is the TokenUpdater,
which contains the incremental algorithm. It will be tested
with a special lexer which (when used by the algorithm)
verifies that the incremental algorithm is correct and optimal.
Another set of tests will be handy for all language developers
because it will allow them to verify that a lexer is correct
from the incrementality point of view.
Such a test simply maintains a document and makes modifications to it.
Once a modification is done, the underlying token sequence
gets updated by the token updater. The test then takes a snapshot
of the document's text and runs the batch lexer on it to verify
that the token sequence produced by the batch lexer is exactly the same
as the one maintained by the token updater (see the sketch below).
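A JUnit-style sketch of that consistency check; TokenSequence and batchLex() are assumptions standing for whatever the framework will actually provide:

    // Illustrative sketch only: after each modification, relex the whole
    // snapshot in batch mode and compare with the incremental sequence.
    void assertTokensConsistent(javax.swing.text.Document doc,
                                TokenSequence incremental) throws Exception {
        String snapshot = doc.getText(0, doc.getLength());
        TokenSequence batch = batchLex(snapshot); // hypothetical batch relexing
        incremental.move(0);
        while (batch.moveNext()) {
            assertTrue(incremental.moveNext());
            assertEquals(batch.token().id(), incremental.token().id());
            assertEquals(batch.token().text().toString(),
                         incremental.token().text().toString());
        }
        assertFalse(incremental.moveNext()); // same number of tokens
    }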
The modifications of the document can be randomized,
and it will be possible to assign individual characters a ratio
of how frequently they should appear in the inserted text.
Finally, there will be a set of tests that check the
memory requirements of token instances.