Hi,

I am using ORO Regex API version 2.0.7, and my objective is to extract tagged data from HTML source. For example, I am interested in getting the source code for all the forms found in an HTML page. So I wrote my regex like this:

    Regex formReg = new Regex("(?i)(<form(.|\\s)*?>(.|\\s)*?</form>)");

because the following one didn't work:

    Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");

since . is taken as any character but not newline.

The first regex worked well, and I was able to get the complete form data from <form..... to </form>. BUT when the form was big, say around 400 lines and 30K bytes, it failed with a stack overflow. The output and the stack overflow error are pasted below:

    Matched <form name="param" action="http://www/parametric/ProductParametric" method="post"> <input name="sterm" type="hidden"> </form>
    matcher.getMatch().endOffset(1) 4480
    Matched <form name="cross" action="http://www/crossref/search.jsp" method="post"> <input name="partNumber" type="hidden"> </form>
    matcher.getMatch().endOffset(1) 127
    java.lang.StackOverflowError
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
        at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)

I am also pasting the method I wrote for the extraction. It can simply be called from a main method and run:

----------------------------------------------------------------------------
public static void testRegOro() {
    try {
        String html = IoUtils.readFile("file.txt");
        // String html = "all work and no play makes jack a dull boy"; //IoUtils.readFile("file.txt");
        Perl5Compiler compiler = new Perl5Compiler();
        Perl5Pattern pattern = (Perl5Pattern) compiler.compile(
            "(<form(.|\\s)*?>(.|\\s)*?</form>)",
            Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK);
        PatternMatcher matcher = new Perl5Matcher();
        int i = 0;
        while (matcher.contains(html, pattern) && i++ < 3) {
            System.out.println("Matched " + matcher.getMatch().group(1));
            System.out.println("matcher.getMatch().endOffset(1) " + matcher.getMatch().endOffset(1));
            html = html.substring(matcher.getMatch().endOffset(1));
            //System.out.println("html " + html);
        }
    } catch (Throwable e) {
        e.printStackTrace();
    }
}
------------------------------------------------------------------------------

As the code shows, I am reading a file named file.txt; I am attaching that file to this bug as well. I would really appreciate it if you could look into this, shed some light on it, and say whether it can be improved.

Thanks in advance!

Regards,
Pushpesh Kr. Rajwanshi
Created attachment 16891
file used in code

This is the file I used; the code reads it.
This is a duplicate of Bug #3561 (summary: rewrite the regular expression).

>because following one didn't work,
>
>Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");
>
>because . is taken as any character but not newline.

If you want . to match newlines, then use the SINGLELINE_MASK option when compiling the expression. You should upgrade to version 2.0.8 as it fixed a couple of problems.

*** This bug has been marked as a duplicate of 3561 ***
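
To illustrate the advice above, here is a minimal sketch of the simpler pattern with "dot matches newline" switched on. Note the assumptions: it uses the JDK's java.util.regex instead of ORO so that it runs without the ORO jar, and Pattern.DOTALL stands in for ORO's SINGLELINE_MASK (both correspond to Perl's /s modifier). The class name FormExtractor and the method extractForms are hypothetical names chosen for the example; this is not code taken from ORO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class FormExtractor {

    // The simpler pattern from the report. With DOTALL, '.' also matches
    // line terminators, so the (.|\s) alternation is no longer needed.
    // ASSUMPTION: DOTALL here plays the role of ORO's SINGLELINE_MASK.
    private static final Pattern FORM = Pattern.compile(
            "<form.*?>.*?</form>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the source text of every <form>...</form> block, in document order.
    static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher matcher = FORM.matcher(html);
        // find() resumes after the previous match, so the input does not
        // have to be re-sliced with substring() on each iteration.
        while (matcher.find()) {
            forms.add(matcher.group());
        }
        return forms;
    }

    public static void main(String[] args) {
        String html = "<html>\n<form name=\"a\">\n<input type=\"hidden\">\n</form>\n</html>";
        for (String form : extractForms(html)) {
            System.out.println("Matched " + form);
        }
    }
}
```

With ORO itself, the corresponding change would be to pass SINGLELINE_MASK among the options given to the compiler, as described above. Backtracking matchers commonly recurse once per repetition of a quantified group such as (.|\s)*?, which is a plausible reason the original pattern exhausts the stack on a large form, although the thread does not spell out the cause.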
Thanks Dan. I guessed there must be something like this but didn't know about it, so thanks for that.

Also, is the javadoc the only way to get familiar with this regex API, or do you have a tutorial too?

Thanks again for the early response.

Pushpesh
(In reply to comment #3)
> thanks for this also is javadoc the only way to get familier with this regex
> api or u have some tutorial too?

Only the javadoc :( There used to be a user's guide of sorts for OROMatcher, but it was never updated and expanded for Jakarta ORO.
Hmm, no problem. I've gone through the javadoc, and it looks more or less simple to understand and fairly similar to the other regex APIs I have learnt. Anyway, thanks for the quick reply.

Regards,
Pushpesh