Parsing simple XML files is much slower than with the older XML4J release from IBM (e.g. 2.0.15). Even after turning off all features, it is still almost twice as slow as XML4J 2.0.15. I only tested parsing (very) large files, not parsing many small files.
You don't give any specific details about which parser you are using: do you use DOM or SAX? Is it a validating parser? Do you use DTDs or XML Schemas? Since XML4J 2.0.15 we've added several enhancements to the parser, such as the W3C DOM Level 2 and W3C XML Schema implementations, so it is to be expected that the parser has become slower. We are shifting our development efforts towards Xerces2 and have stopped working on Xerces (1.4.4 is probably the last release). If you provide more information and patches to the code, we will gladly accept them. Thank you!
OK, a few details: I'm using a SAX parser through the SAX 2.0 framework, although I don't use any features specific to 2.0. I don't use validation. I have a DTD embedded in the file. I see the same performance in both 1.4.4 and 2.0.0 beta3. Any tips on making the parsing faster would also be welcome! (I have already applied the ones I found on the web.) Genady
I'll also try to benchmark the parser and send you the results. Genady
Genady, given your requirements you should use Xerces2. In Xerces2 there are different parser configurations that include different components in the pipeline. By default, Xerces2b4 parsers are created with xerces.parsers.StandardParserConfiguration, which includes: Scanner, DTDValidator, DTDScanner, NamespaceBinder. A validating parser must read the DTD if it is present, even if you don't need validation. If you don't want the external DTD to be read, set http://apache.org/xml/features/nonvalidating/load-external-dtd to false [the internal subset will always be read]. If you have more questions about performance, email the xerces-j-dev list.
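For illustration, here is a minimal sketch of turning that feature off through JAXP. It assumes the JAXP implementation on the classpath is Xerces or Xerces-derived (another implementation may reject the feature with SAXNotRecognizedException). The class name NoExternalDtd, the helper countElements, and the sample document referring to missing.dtd are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NoExternalDtd {
    // Xerces-specific feature named in the comment above.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical helper: parses the given XML text without validation and
    // without loading the external DTD, and returns the number of elements seen.
    static int countElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        // Assumption: the underlying parser recognizes this feature.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);

        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // The external subset points at a file that does not exist; with the
        // feature off, the parser should not try to open it.
        String xml = "<?xml version=\"1.0\"?>"
                   + "<!DOCTYPE root SYSTEM \"missing.dtd\">"
                   + "<root><item/></root>";
        System.out.println(countElements(xml));
    }
}
```

Note that this only affects the external subset. A DTD embedded in the file itself, as in Genady's case, is still read, so the setting may not change the timings there.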