Bug 35834

Summary: File size limitation when specifying RE.MATCH_SINGLELINE | RE.MATCH_CASEINDEPENDENT in the RE constructor.
Product: Regexp
Reporter: Nancy Farnsworth <nancy.l.farnsworth>
Component: Other
Assignee: Jakarta Notifications Mailing List <notifications>
Status: CLOSED DUPLICATE    
Severity: critical    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: AIX   

Description Nancy Farnsworth 2005-07-23 01:19:14 UTC
When the flags RE.MATCH_SINGLELINE and RE.MATCH_CASEINDEPENDENT are both set, 
the program dies if the scanned file is over a specific length.  If I set either 
of the flags, but NOT BOTH, the file is scanned successfully.

code snippet:
int iFlags = RE.MATCH_CASEINDEPENDENT|RE.MATCH_SINGLELINE;			
RE re = new RE(strPattern,iFlags);			
Reader reader = new FileReader(strFilePathAndName);
CharacterIterator in = new ReaderCharacterIterator(reader);
int iEnd=0;
while(re.match(in,iEnd))
{
	iEnd = re.getParenEnd(0);
	String strFoundTag = re.getParen(0);
	...
}
Comment 1 Vadim Gritsenko 2005-08-11 16:26:45 UTC
How does the program 'die'? Please describe.
Please also provide sample strPattern and file.
Thanks.
Comment 2 Nancy Farnsworth 2005-08-16 18:15:10 UTC
die:
The program simply stops executing.
I do not receive any errors.  I do not
see any exceptions in the log.

comments:
The code executes successfully when I do
not set the flag to RE.MATCH_SINGLELINE.
Unfortunately, but as would be expected,
I do not get a match when the
content is continued onto the next line.
However, when I set the flag to
RE.MATCH_SINGLELINE, the program simply stops
executing partway through the file.  If the file
is short, it completes successfully.  However,
if the file is longer, execution stops.  I receive
no errors or thrown exceptions.
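
The line-continuation behaviour described above can be reproduced in isolation. The sketch below is only an approximation: it uses java.util.regex from the JDK as a stand-in for org.apache.regexp so that it runs without extra jars, and it assumes Pattern.DOTALL and Pattern.CASE_INSENSITIVE behave like RE.MATCH_SINGLELINE and RE.MATCH_CASEINDEPENDENT for this purpose. The class name SingleLineDemo and the sample tag are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleLineDemo {
    // Returns the first match of regex in input, or null when nothing matches.
    public static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A tag whose attributes continue onto the next line (hypothetical sample).
        String html = "<li><a\nhref=\"/pages/faq.html\">FAQ</a></li>";

        // Without DOTALL, '.' does not cross the line break, so no tag is found.
        System.out.println(firstMatch("<A(.)*>", Pattern.CASE_INSENSITIVE, html)); // prints null

        // With DOTALL, '.' also matches the line break, so the tag is found.
        System.out.println(firstMatch("<A(.)*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL, html) != null); // prints true
    }
}
```

This only shows why the single-line flag is needed for tags that wrap; it does not reproduce the failure on long files.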

notes:
I have since changed the pattern to read as follows:
(<A[^>]*>)|(<APPLET[^>]*>)|(<AREA[^>]*>)|       etc
It seems to work.  

thoughts:
Even if the old pattern and code are stupid, it
seems I should still get some type of error.
I would think that there would be some type
of exception that I could trap or at least see
in the log.

problem pattern:
(<A(.)*>)|(<APPLET(.)*>)|(<AREA(.)*>)|       etc
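
A small comparison may help show why the two patterns behave so differently once '.' is allowed to match line breaks. This is a sketch under stated assumptions, not the Regexp library's own behaviour: it uses java.util.regex from the JDK in place of org.apache.regexp, with Pattern.DOTALL standing in for RE.MATCH_SINGLELINE. The class name TagPatternDemo and the sample HTML are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagPatternDemo {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Collects every match of regex in input, in order of appearance.
    public static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, FLAGS).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">One</a>\n<p>text</p>\n<a href=\"two.html\">Two</a>\n";

        // Greedy (.)* runs to the end of the input and backs up to the LAST '>',
        // so a single match swallows everything from the first <a onward.
        System.out.println(allMatches("(<A(.)*>)", html).size());   // prints 1

        // [^>]* cannot cross a '>', so each match ends at its own tag.
        System.out.println(allMatches("(<A[^>]*>)", html));
        // prints [<a href="one.html">, <a href="two.html">]
    }
}
```

With the greedy form, the amount of text a single match has to walk over grows with the file, which is consistent with the failure appearing only on longer files.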

code:
//Search html file for pattern.			
try
	{
	//Construct an RE object
	int flags = RE.MATCH_CASEINDEPENDENT|RE.MATCH_SINGLELINE; 		
	
	RE re = new RE(strPattern,flags);

	//Use the object to match to the input.
	Reader r = new FileReader(strFilePathAndName);
	CharacterIterator in = new ReaderCharacterIterator(r);
	int end=0;

	while(re.match(in,end))
		{
		//Reset starting point in input file
		end = re.getParenEnd(0);
				
		//Retrieve Tag
		String strFoundTag = re.getParen(0);
		logger.debug("Found Tag:"+strFoundTag);

		//Process tag appropriately.
		//Retrieve urls from tag and add to array of urls.
		Iterator iterator = alResourceElementsList.iterator();
		while (iterator.hasNext())
			{
			//Search for each possible element of the tag.
			String strElement = (String)iterator.next();
			int iBeginUrl = strFoundTag.trim().toUpperCase().indexOf(strElement);
			//If an element is found, retrieve the url from the element.
			char cEndChar = '"';
			if (iBeginUrl >= 0)  //Element found
				{
				int iEndUrl = strFoundTag.trim().toUpperCase().indexOf(cEndChar,iBeginUrl+strElement.length()+2);
				if( ! ((iBeginUrl+2) <= iEndUrl) )
					{
					logger.error("Cannot retrieve url from element in scanHtmlFileForResourceTags.");
					logger.error("FilePathAndName: "+strFilePathAndName);
					logger.error("Tag: "+ strFoundTag);
					logger.error("Element: "+ strElement);
					return 1;
					}
				String strTempUrl = strFoundTag.substring(iBeginUrl+strElement.length()+2,iEndUrl);

				//TO DO: Test for CODEBASE/CODE for APPLET!	
					
				String strUrl;
				if (strElement.trim().equalsIgnoreCase("CODEBASE"))
					{
					//Do not code until determined that this code is necessary.
					strUrl = strTempUrl;
					logger.error("APPLET tag contains element CODEBASE.");
					logger.error("Program does not contain code to process CODEBASE.");
					logger.error("Base url to resolve relative url is current directory of html file.");
					logger.error("The corresponding database entry is incorrect.");

					}//EndProcessCodeBase
				else
					{
					strUrl = strTempUrl;
					}//EndProcessAllOtherTags
							
				logger.debug("Url:"+strUrl);
						
				//Save each url that does not start with "#"
				//(Tags A and FRAME can start w/# - see doc for details)
				if (strUrl != null)
					{
					if (strUrl.startsWith("#"))	
						break;
					}//EndUrlNotNull

				//Save url.
				alUrl.add(strUrl);
				//Save tag name only, not entire tag.
				String strSpace = " ";
				String strTag = strFoundTag.substring(1,strFoundTag.indexOf(strSpace));
				alLinkTagType.add(strTag);

				}///EndElementFound
					
			}//EndIterateThroughElements

		}//EndWhileMatchesInHtmlFile

	}//EndTryFindMatchesInHtml
catch(RESyntaxException e)
	{
	logger.error("Regular Expression syntax exception.");
	logger.error("File Path and Name:"+strFilePathAndName);
	return 1;
	}
catch(FileNotFoundException e)
	{
	logger.error("FileNotFoundException on scanHtmlFileForResourceTags");
	logger.error("File Not Found:"+strFilePathAndName);
	return 1;
	}
logger.info("End of Routine");


input file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 
<html>
<head>
<title>Index of Pages in AWStats Web Site</title>
<!--- link rel="STYLESHEET" type="text/css" href="../styles/w3c_oldstyle.css"--->
<link rel="STYLESHEET" type="text/css" href="../styles/default.css">
 
</head>
 
<body>
<div id="pageContent">
<h1>Documentation for AWStats Pilot Deployment in APHIS</h1>
<h2>Current Notices</h2>
<p>This weekend, 1-3 July, testing will be preformed to determine is AWStats 
can process logs
   for periods within a month which were missed in a previous run. The 
application
   will also be tested to determine if logs for a previous month can be run.  In
   other words, can logs for March be processed if the April logs have already 
be run.</p>
<h2>Web Server Reports</h2>
<ul class="menuLinks" title="Dynamic WWW Web Server Reports for Current Month">
  <ul class="menuLinkItem">
    <li class="source"><a href="/awstats/awstats.pl?config=www.aphis.usda.gov"
     target="_blank">AWStats Report for Web Server WWW for Current Month</a> 
(Opens new window)</li>
    <li class="desc">This link connects the viewer to a collection of AWStats 
reports
    for the APHIS Internet web server (www.aphis.usda.gov) for the current 
month.  As
    these report are created dynamically, be aware that the report can take up 
to 10
    seconds to appear at high-demand times.</li>
    <li class="format">(html)</li>
  </ul>
  <ul class="menuLinks" title="List of Static WWW Web Server Reports for 
Current Month">
    <ul class="menuLinkItem">
      <li class="source"><a href="./AWStats_report_index.html">List of AWStats 
Reports for Web Server WWW for Current Month</a></li>
      <li class="desc">This link connects the viewer to a list of static 
AWStats reports
        for the APHIS Internet web server (www.aphis.usda.gov) for the current 
month.  These reports are
        regenerated every day at 3:00 AM.</li>
      <li class="format">(html)</li>
    </ul>
  </ul>
</ul>
 
<h2>AWStats Internal On-Line Resources</h2>
<ul class="menuLinks" title="AWStats Internal On-Line Resources">
  <ul class="menuLinkItem">
    <li class="source"><a 
href="/pages/how_to_run_web_analytic_reports_using_AWStats.html">
    How to Run Web Analytic Reports Using AWStats</a></li>
    <li class="desc">This document briefly describes how to run standard
          reports using the AWStats web server log analysis tool as a CGI 
application
          from a browser</li>
    <li class="format">(html)</li>
  </ul>
  <ul class="menuLinkItem">
    <li class="source"><a href="/pages/example_AWStats_reports.html">
    Examples of Creating On-Line reports with AWStats</a></li>
    <li class="desc">This document provides several examples of running AWstats 
reports
      as a CGI script using a browser.</li>
    <li class="format">(html)</li>
  </ul>
  <ul class="menuLinkItem">
    <li class="source"><a href="/docs/AWStats_pilot_options.doc">
    AWStats Options for Pilot Deployment</a></li>
    <li class="desc">This document provides a brief overview of the
    business case for deploying AWStats web analytics application in
    a pilot mode. Also details configuration for AWStats during
    pilot mode.</li>
    <li class="format">(MS-Word)</li>
  </ul>
  <ul class="menuLinkItem">
    <li class="source"><a href="/pages/faq.html">
    Frequently Asked Questions</a></li>
    <li class="desc">This document provides answers to questions
    frequently asked by AWStats users.</li>
    <li class="format">(html)</li>
 </ul>
  <ul class="menuLinkItem">
    <li class="source"><a href="/pages/todo.html">
    AWStats Web Site To-Do List</a></li>
    <li class="desc">This document is a list of items that need to be 
accomplished
    to support the AWStats pilot deployment.</li>
    <li class="format">(html)</li>
  </ul>
</ul>
<h2>AWStats External On-Line Resources</h2>
<ul class="menuLinks" title="AWStats External On-Line Resources">
  <ul class="menuLinkItem">
    <li class="source"><a href="http://awstats.sourceforge.net/index.html">
    AWStats Project Page</a></li>
    <li class="desc">Main Sourceforge project web site for AWStats, which bills
    itself as a free powerful and featureful tool that generates advanced web,
    streaming, ftp or mail server statistics, graphically.</li>
    <li class="format">(html)</li>
  </ul>
  <ul class="menuLinkItem">
    <li class="source"><a href="http://sourceforge.net/forum/forum.php?forum_id=43428">
    AWStats Forum (General)</a></li>
    <li class="desc">AWStats forum for general users hosted by Sourceforge.</li>
    <li class="format">(php)</li>
  </ul>
  <ul class="menuLinkItem">
    <li class="source"><a 
href="http://awstats.sourceforge.net/docs/awstats.pdf">
    AWStats Documentation</a></li>
    <li class="desc">User documentation for AWStats.
    <li class="format">(pdf)</li>
  </ul>
</ul>
</div>
</body>
</html>




Comment 3 Vadim Gritsenko 2005-08-17 02:41:00 UTC
This simple program (adapted from what you have sent) shows what is really
happening:

import org.apache.regexp.CharacterIterator;
import org.apache.regexp.RE;
import org.apache.regexp.ReaderCharacterIterator;

import java.io.FileReader;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        RE re = new RE("(<A(.)*>)|(<APPLET(.)*>)|(<AREA(.)*>)",
                       RE.MATCH_CASEINDEPENDENT | RE.MATCH_SINGLELINE);
        CharacterIterator in = new ReaderCharacterIterator(new FileReader("index.html"));
        int end = 0;
        try {
            while (re.match(in, end)) {
                System.out.println("Matched " + re.getParen(0));
                end = re.getParenEnd(0);
            }
            System.out.println("Done");
        } catch (Throwable e) {
            System.out.println("Exception " + e);
        }
    }
}

If you run it with the input file 'index.html' in the same directory, you'd see:

  Exception java.lang.StackOverflowError
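
For context on why the original code logged nothing: java.lang.StackOverflowError is a subclass of Error, not Exception, so catch blocks for RESyntaxException or FileNotFoundException (or even a general catch of Exception) do not intercept it. The following is a minimal, self-contained sketch of that behaviour. The recursive consume method is a made-up stand-in that uses one stack frame per character; it is an assumption for illustration and not the library's actual matching code.

```java
public class StackOverflowDemo {
    // Recurses once per remaining character, loosely imitating a matcher that
    // needs one stack frame per consumed character (illustrative assumption).
    public static int consume(int remaining) {
        if (remaining == 0) {
            return 0;
        }
        return 1 + consume(remaining - 1);
    }

    // Runs the recursion and reports how it ended.
    public static String run(int length) {
        try {
            consume(length);
            return "Done";
        } catch (Exception e) {
            // Not reached on stack exhaustion: StackOverflowError is not an Exception.
            return "Exception " + e;
        } catch (StackOverflowError e) {
            // Reached only because the Error type is named explicitly.
            return "Error " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(100));          // short input completes
        System.out.println(run(100_000_000));  // deep recursion exhausts the stack
    }
}
```

Catching Throwable, as the test program above does, is another way to make the error visible while diagnosing.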

This is a duplicate of bug #764. If you have ideas on how to fix it, please
comment on bug #764.

*** This bug has been marked as a duplicate of 764 ***