Bug 3013 - Large File Parsing
Summary: Large File Parsing
Status: NEW
Alias: None
Product: Xerces-J
Classification: Unclassified
Component: SAX
Version: 1.4.2
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Xerces-J Developers Mailing List
Depends on:
Reported: 2001-08-06 20:47 UTC by lgalanis
Modified: 2004-11-16 19:05 UTC


Description lgalanis 2001-08-06 20:47:58 UTC
Using the XMark benchmark (found at http://monetdb.cwi.nl/xml/index.html) I
tried to parse a really big file using SAX (doing nothing but parsing). When
piping the output of

<xmarkbinary> -f 20 through SAX (approx. 2 GB) I got the following:

java.lang.RuntimeException: Internal Error: fPreviousChunk == NULL
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1094)
        at niagara.search_engine.xmark.DummyParser.main(DummyParser.java:22)

For values of -f such as 10, 15, and 18 there is no problem. The binary can be
made using the file at http://monetdb.cwi.nl/xml/Assets/unix.c
Comment 1 jjc 2001-08-07 11:53:23 UTC
I reproduced this.

The problem is that the input file is more than 2^31 bytes long.

The offset (XMLEntityReader.fCurrentOffset) therefore wraps around to a
negative value, and shortly afterwards Xerces falls over.

I don't know what should be done. I would guess this is a WONTFIX, but the
error messages could be improved. It is difficult to choose the best place to
catch it, though; I would assume that a minor change in the file would cause
the symptom (i.e. the exact place things go wrong) to be very different.

The value of the argument offset to UTF8DataChunk.addSymbol when it crashes is
-2147483551; there had been numerous prior calls to addSymbol with very large
offset values near Integer.MAX_VALUE.
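The wraparound described above can be demonstrated with a minimal snippet. This is illustrative only, not Xerces code: it just shows what happens to a 32-bit int offset once a stream passes 2^31 bytes, including how an offset like the -2147483551 in the stack trace arises.

```java
public class OffsetWrap {
    public static void main(String[] args) {
        // A 32-bit int offset tracking the byte position in a stream.
        int offset = Integer.MAX_VALUE; // 2147483647, the last valid offset

        // Crossing 2^31 bytes wraps the offset to Integer.MIN_VALUE.
        offset += 1;
        System.out.println(offset); // prints -2147483648

        // Further bytes keep the offset negative; 97 bytes past 2^31
        // gives the value reported in the crash above.
        System.out.println(Integer.MIN_VALUE + 97); // prints -2147483551
    }
}
```

Any code that assumes the offset is non-negative (chunk lookup, symbol tables) will then receive nonsense values, which matches the fPreviousChunk == NULL internal error.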

Comment 2 robw 2002-01-30 21:16:25 UTC
This is a show-stopper for many applications. Other Java parsers do not have
this problem...
Comment 3 Glenn Marcy 2002-01-30 23:29:20 UTC
While this is true, Xerces 1 is not really where the current focus of Apache
parser development lies at this point.  Has anyone tried this with Xerces 2?
If it is not a problem there, then the answer would be for you to switch to
the new version.  If the problem still exists, then the version of this defect
should be changed to reflect that.  There are a great many things that could
be done to improve Xerces 1 at this point, but with limited resources the main
development effort is on Xerces 2 now.  Considering that Xerces 1 has never
been able to parse documents that large, this is not a regression but a
limitation of the old architecture that Xerces 1 was based upon.