Issue 121962 - Endless loop with regular expression search
Summary: Endless loop with regular expression search
Status: ACCEPTED
Alias: None
Product: Writer
Classification: Application
Component: editing (show other issues)
Version: 4.0.1
Hardware: All All
: P3 Normal (vote)
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-26 14:42 UTC by zhivko
Modified: 2017-05-20 11:55 UTC (History)
5 users (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---
jsc: 4.0.0_release_blocker-


Attachments
attachment that will cause unusual behaviour (30.12 KB, application/vnd.oasis.opendocument.text)
2013-03-26 14:42 UTC, zhivko
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description zhivko 2013-03-26 14:42:24 UTC
Created attachment 80465 [details]
attachment that will cause unusual behaviour

Problem description: 

Steps to reproduce:
1. open openoffice 
2. Load attached odt
3. Click search and replace
4. in search text enter check\((.*)+\)
5. Mark regular expressions
6. Click FindAll

"c:\Temp\inOdt_withData_4695156_1360828411187_820742690.odt

Current behavior:
CPU goes to 100% and search operation never completes

Expected behavior:
search regular expression should be found without processor to go to 100%

              
Operating System: Windows 7
Version: 4.0.0.3 release
Comment 1 Ariel Constenla-Haile 2013-03-31 09:31:47 UTC
Changing the subject, assuming the reporter didn't submit the bug to the wrong bugzilla
Comment 2 Ariel Constenla-Haile 2013-03-31 09:37:37 UTC
Confirmed with 3.4.1

Attaching the debugger and interrupting, leads always to icu_4_0::RegexMatcher::MatchAt(int, signed char, UErrorCode&)

(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007f3c439c114d in icu_4_0::RegexMatcher::MatchAt(int, signed char, UErrorCode&) () from /opt/openoffice.org3/program/../basis-link/program/libicui18n.so.40
(gdb) bt
#0  0x00007f3c439c114d in icu_4_0::RegexMatcher::MatchAt(int, signed char, UErrorCode&) () from /opt/openoffice.org3/program/../basis-link/program/libicui18n.so.40
#1  0x00007f3c439c3f3a in icu_4_0::RegexMatcher::find() () from /opt/openoffice.org3/program/../basis-link/program/libicui18n.so.40
#2  0x00007f3c439c0029 in icu_4_0::RegexMatcher::find(int, UErrorCode&) () from /opt/openoffice.org3/program/../basis-link/program/libicui18n.so.40
#3  0x00007f3c232ccc15 in ?? () from /opt/openoffice.org3/program/../basis-link/program/i18nsearch.uno.so
#4  0x00007f3c232cb247 in ?? () from /opt/openoffice.org3/program/../basis-link/program/i18nsearch.uno.so
#5  0x00007f3c57589f9a in utl::TextSearch::SearchFrwrd(String const&, unsigned short*, unsigned short*, com::sun::star::util::SearchResult*) ()
   from /opt/openoffice.org3/program/../basis-link/program/libutl.so
#6  0x00007f3c2b34c08d in SwPaM::DoSearch(com::sun::star::util::SearchOptions const&, utl::TextSearch&, SwMoveFnCollection*, unsigned char, unsigned char, unsigned char, unsigned char, unsigned short&, unsigned short&, unsigned short, SwNode*, SwPaM*) () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#7  0x00007f3c2b34ca94 in SwPaM::Find(com::sun::star::util::SearchOptions const&, unsigned char, utl::TextSearch&, SwMoveFnCollection*, SwPaM const*, unsigned char) ()
   from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#8  0x00007f3c2b34cbaa in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#9  0x00007f3c2b351f0d in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#10 0x00007f3c2b35492c in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#11 0x00007f3c2b34bc5f in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#12 0x00007f3c2b33ec4e in SwCrsrShell::Find(com::sun::star::util::SearchOptions const&, unsigned char, SwDocPositions, SwDocPositions, unsigned char&, FindRanges, int) ()
   from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#13 0x00007f3c2b8bab76 in SwWrtShell::SearchPattern(com::sun::star::util::SearchOptions const&, unsigned char, SwDocPositions, SwDocPositions, FindRanges, int) ()
   from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#14 0x00007f3c2b859f75 in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#15 0x00007f3c2b85a433 in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#16 0x00007f3c2b85aea5 in SwView::ExecSearch(SfxRequest&, unsigned char) () from /opt/openoffice.org3/program/../basis-link/program/libsw.so
#17 0x00007f3c58ff2378 in ?? () from /opt/openoffice.org3/program/../basis-link/program/libsfx.so
#18 0x00007f3c58ff2695 in SfxDispatcher::_Execute(SfxShell&, SfxSlot const&, SfxRequest&, unsigned short) () from /opt/openoffice.org3/program/../basis-link/program/libsfx.so
Comment 3 Ariel Constenla-Haile 2013-03-31 09:46:17 UTC
Confirmed on trunk, too:

(gdb) bt
#0  0x00007f371a330208 in icu_4_0::RegexMatcher::MatchAt(int, signed char, UErrorCode&) () from /home/ariel/aoo/openoffice4/program/../basis-link/program/libicui18n.so.40
#1  0x00007f371a32c390 in icu_4_0::RegexMatcher::find() () from /home/ariel/aoo/openoffice4/program/../basis-link/program/libicui18n.so.40
#2  0x00007f371a32c9c9 in icu_4_0::RegexMatcher::find(int, UErrorCode&) () from /home/ariel/aoo/openoffice4/program/../basis-link/program/libicui18n.so.40
#3  0x00007f36f0fd77f7 in TextSearch::RESrchFrwrd (this=0x4943910, searchStr=
    "check(a14_c3,1) da ne bo enostransko prekinil naročniškega razmerja ali spremenil izbranega naročniškega paketa v naročniški paket SOS, Enotni paket, Podatkovni bonus, paket Telemetrija, paket Interne"..., startPos=0, endPos=429) at /build/aoo/src/playground/trunk/main/i18npool/source/search/textsearch.cxx:769
#4  0x00007f36f0fd4d2a in TextSearch::searchForward (this=0x4943910, searchStr=
    "check(a14_c3,1) da ne bo enostransko prekinil naročniškega razmerja ali spremenil izbranega naročniškega paketa v naročniški paket SOS, Enotni paket, Podatkovni bonus, paket Telemetrija, paket Interne"..., startPos=0, endPos=429) at /build/aoo/src/playground/trunk/main/i18npool/source/search/textsearch.cxx:241
#5  0x00007f372959164b in utl::TextSearch::SearchFrwrd (this=0x7fff588ac648, rStr=
    "check(a14_c3,1) da ne bo enostransko prekinil naročniškega razmerja ali spremenil izbranega naročniškega paketa v naročniški paket SOS, Enotni paket, Podatkovni bonus, paket Telemetrija, paket Interne"..., pStart=0x7fff588ac214, pEnde=0x7fff588ac212, pRes=0x0) at /build/aoo/src/playground/trunk/main/unotools/source/i18n/textsearch.cxx:241
#6  0x00007f37018824a3 in SwPaM::DoSearch (this=0x7f3719e5b050, rSearchOpt=..., rSTxt=..., fnMove=0x7f37028306c0 <aFwrd>, bSrchForward=1 '\001', bRegSearch=1 '\001', 
    bChkEmptyPara=0 '\000', bChkParaEnd=0 '\000', nStart=@0x7fff588ac214: 0, nEnde=@0x7fff588ac212: 429, nTxtLen=429, pNode=0x7f3719f03be8, pPam=0x7f3719f10370)
    at /build/aoo/src/playground/trunk/main/sw/source/core/crsr/findtxt.cxx:492
Comment 4 Ariel Constenla-Haile 2013-03-31 09:47:24 UTC
Adding hdu on Cc, who might be in the know.
Comment 5 hdu@apache.org 2013-04-08 08:17:04 UTC
ICU has some known problems [1] ("Some regular expression patterns can result in very slow match operations, sometimes so slow that it will appear as though the match has gone into an infinite loop").

We should investigate whether the reported problem is already solved in newer versions of ICU and file a bug report to ICU if not. Experimenting whether applying the experimental ICU regex loop breaking patch [2] helps might be a good idea too.

[1] http://userguide.icu-project.org/strings/regexp#TOC-Performance-Tips
[2] http://www.icu-project.org/trac/ticket/2905

Thanks for the report and the stacks!
Comment 6 hdu@apache.org 2013-04-08 08:39:15 UTC
Another issue [3] ("better handling of Exponential Time expressions") has an interesting pointer to things done in Perl for detecting such problematic search patterns and returning a "no match" result then. Unfortunately [2] mentioned in my previous comment doesn't have a patch yet.

[3[ http://www.icu-project.org/trac/ticket/5419
Comment 7 zhivko 2013-04-11 10:44:48 UTC
Another issue with regular expression search in OO (I don't know if it is corellated with this issue), is that searching for text that is located in frame sometimes founds solution to some regex and other times not - has somebody have idea how to troubleshoot debug this anomally?
Comment 8 Ariel Constenla-Haile 2013-04-11 14:24:34 UTC
(In reply to comment #7)
> Another issue with regular expression search in OO (I don't know if it is
> corellated with this issue), is that searching for text that is located in
> frame sometimes founds solution to some regex and other times not 

Please open another bug for this one.
Does this happen only with 3.4.*? Did you try with 3.3 or older? (3.4.* switched the regex engine, so it is good to test issues against 3.3 or older).

> has somebody have idea how to troubleshoot debug this anomally?

To debug this you have to build OO yourself. OO does not provide debug builds. But with a good and reproducible bug report, developers can debug this for you, and hopefully fix the bug :)
Comment 9 hdu@apache.org 2013-07-03 09:44:44 UTC
FWIW the critical part in the reported regexp is the "(.*)+" which ICU4.0 treats as infinitely many empty groups that are matching. If the regexp was written as "check\([^)]*\)" to match everything inside the right parentheses there would be no problem.
Comment 10 jsc 2013-07-03 09:48:16 UTC
remove showstopper request based on hdu's analysis. And a workaround exists. 

We should focus on an update to a newer ICU vewrsion
Comment 11 zhivko 2013-07-03 10:20:26 UTC
(In reply to hdu@apache.org from comment #9)
> FWIW the critical part in the reported regexp is the "(.*)+" which ICU4.0
> treats as infinitely many empty groups that are matching. If the regexp was
> written as "check\([^)]*\)" to match everything inside the right parentheses
> there would be no problem.

But are there infinitely many empty groups? Seems to me that there is finite number of such groups.

What is the plan for including newer version of ICU? Is this issue to be fixed in newer verision?
Comment 12 jsc 2013-07-03 10:54:36 UTC
an update would be appreciated but as always the work have to be done and updating ICU is a lot of work. Volunteers are welcome
Comment 13 hdu@apache.org 2013-07-03 12:05:02 UTC
> But are there infinitely many empty groups? Seems to me that there is finite
> number of such groups.

Run
   echo AB | grep "A\(.*\)\(.*\)\(.*\)B"
and see that it matches with three empty groups. Now extend the number of empty groups to N and check again. Now extend that number to N+1...

> What is the plan for including newer version of ICU?

ASAP (depending on resource constraints)

> Is this issue to be fixed in newer versions?

There is no known patch to ICU that would fix it. If there was one we'd backport it. Developing a solution shouldn't be too difficult for someone familiar with that part of ICU4C's code.
Comment 14 zhivko 2014-04-08 07:10:39 UTC
There is still same problem in 4.0.1 version - just confirmed it...
Can you raise the priority on this issue? Escalate this to somebody else maybe?
I changed version to 4.0.1.
Comment 15 zhivko 2014-04-08 07:12:09 UTC
The symptom is that CPU goes to 100% and openoffice program becomes unresponsive - user needs to manually kill it.
This bug is opened more than a year now. Please escalate this.
Comment 16 zhivko 2014-04-08 07:18:01 UTC
I don't understand how this could cause indefinite loop in ICU engine.
If I compare this to java regexp engine - the endless loops on finite text would not happened there. There sre a lot regexp expression testers online and you can try this regexp expression - you would never get endless loop there.

For me it means ICU engine has a bug - is there a chance of taking different engine? Php, Pearl, Java - everywhere this expression works without ending in endless loop - only in OpenOffice or ICU - user gets stucked!?

What are the chance to move to less bugy regexp engine for OO ?
Comment 17 Marcus 2017-05-20 10:45:25 UTC
Reset the assignee to the default "issues@openoffice.apache.org".