Bug 37382 - stack over flow while using a Regex
Summary: stack over flow while using a Regex
Status: RESOLVED DUPLICATE of bug 3561
Alias: None
Product: ORO
Classification: Unclassified
Component: Main
Version: 2.0.7
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: Jakarta Notifications Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-07 14:11 UTC by Pushpesh Kr. Rajwanshi
Modified: 2005-11-09 12:10 UTC

Attachments
file used in code (41.60 KB, text/plain)
2005-11-07 14:12 UTC, Pushpesh Kr. Rajwanshi

Description Pushpesh Kr. Rajwanshi 2005-11-07 14:11:08 UTC
Hi,

I am using ORO Regex API version 2.0.7 and my objective is to extract some 
tagged data from HTML source. For example, I am interested in getting the source 
code for all the forms found in an HTML page. So I wrote my regex like this:

Regex formReg = new Regex("(?i)(<form(.|\\s)*?>(.|\\s)*?</form>)");

because the following one didn't work,

Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");

because . matches any character except a newline.

So my first regex worked well and I was able to get the complete form data, 
from <form ... to </form>.

BUT

when the form was big, say around 400 lines and 30K bytes, it failed with a 
stack overflow. I am pasting the output and the stack overflow error below:

Matched <form name="param" action="http://www/parametric/ProductParametric" 
method="post">
<input name="sterm" type="hidden">
</form>
matcher.getMatch().endOffset(1) 4480
Matched <form name="cross" action="http://www/crossref/search.jsp" 
method="post">
<input name="partNumber" type="hidden">
</form>
matcher.getMatch().endOffset(1) 127
java.lang.StackOverflowError
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	... (the same Perl5Matcher.__match frame repeats for the rest of the trace)


I am also pasting the method I wrote for the extraction; it can simply be 
called from a main method and run:

----------------------------------------------------------------------------

public static void testRegOro() {
	try {
		String html = IoUtils.readFile("file.txt");
//		String html = "all work and no play makes jack a dull boy"; //IoUtils.readFile("file.txt");
		Perl5Compiler compiler = new Perl5Compiler();
		Perl5Pattern pattern = (Perl5Pattern) compiler.compile(
				"(<form(.|\\s)*?>(.|\\s)*?</form>)",
				Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK);
		PatternMatcher matcher = new Perl5Matcher();
		int i = 0;
		while (matcher.contains(html, pattern) && i++ < 3) {
			System.out.println("Matched " + matcher.getMatch().group(1));
			System.out.println("matcher.getMatch().endOffset(1) " + matcher.getMatch().endOffset(1));
			html = html.substring(matcher.getMatch().endOffset(1));
			//System.out.println("html " + html);
		}
	} catch (Throwable e) {
		e.printStackTrace();
	}
}

------------------------------------------------------------------------------

As my code shows, I am reading a file named file.txt; I am attaching that file 
to the bug as well.

I would really appreciate it if you could look into this, shed some light on 
it, and say whether it can be improved.

Thanks in Advance!
Regards,
Pushpesh Kr. Rajwanshi
Comment 1 Pushpesh Kr. Rajwanshi 2005-11-07 14:12:25 UTC
Created attachment 16891
file used in code

This is the file that the code reads.
Comment 2 Daniel F. Savarese 2005-11-08 20:14:08 UTC
This is a duplicate of Bug #3561 (summary: rewrite the regular expression).

>because the following one didn't work,
>
>Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");
>
>because . matches any character except a newline.

If you want . to match newlines, then use the SINGLELINE_MASK
option when compiling the expression.
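
As a minimal sketch of the same idea, assuming java.util.regex as a stand-in
for ORO (it is not the library discussed in this bug): Pattern.DOTALL plays
the role that SINGLELINE_MASK plays in Perl5Compiler, so the simpler
expression can span lines without the (.|\s)*? alternation. The class and
method names below are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ORO or of the reporter's code.
final class FormExtractor {
    // DOTALL lets '.' match line terminators as well;
    // CASE_INSENSITIVE replaces the inline (?i) flag.
    private static final Pattern FORM = Pattern.compile(
            "<form.*?>.*?</form>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every <form ...> ... </form> span found in the input, in order.
    static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher m = FORM.matcher(html);
        while (m.find()) {
            forms.add(m.group());
        }
        return forms;
    }

    public static void main(String[] args) {
        String html = "<html>\n<FORM name=\"cross\"\n method=\"post\">\n"
                + "<input name=\"partNumber\" type=\"hidden\">\n</form>\n</html>";
        for (String form : extractForms(html)) {
            System.out.println("Matched " + form);
        }
    }
}
```

With ORO itself, the corresponding change would presumably be to pass
Perl5Compiler.SINGLELINE_MASK together with CASE_INSENSITIVE_MASK to
compiler.compile, as suggested above. Whether the simpler expression also
avoids the deep recursion depends on the engine, so it is worth checking
against the attached file.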

You should upgrade to version 2.0.8 as it fixed a couple of
problems.


*** This bug has been marked as a duplicate of 3561 ***
Comment 3 Pushpesh Kr. Rajwanshi 2005-11-08 20:42:20 UTC
Thanks Dan... I guessed something like this must exist but didn't know, so 
thanks for this. Also, is the javadoc the only way to get familiar with this 
regex API, or do you have a tutorial too?

Thanks again for the early response
Pushpesh
Comment 4 Daniel F. Savarese 2005-11-09 17:01:09 UTC
(In reply to comment #3)
> thanks for this. Also, is the javadoc the only way to get familiar with this 
> regex API, or do you have a tutorial too?

Only the javadoc :(  There used to be a user's guide of sorts for OROMatcher,
but it was never updated and expanded for Jakarta ORO.
Comment 5 Pushpesh Kr. Rajwanshi 2005-11-09 21:10:54 UTC
Hmmm... no problem. I've gone through it and it looks more or less simple to 
understand, and kind of similar to other regex APIs I have learnt... anyway, 
thanks for the quick reply...

Regards
Pushpesh