Apache OpenOffice (AOO) Bugzilla – Issue 108273
regular expression skips long paragraphs finding quoted text
Last modified: 2017-05-20 11:20:12 UTC
Hi! I hope this is not a duplicate issue. Here is a description of the problem.

I noticed that if a paragraph is relatively large (more than one Letter-size page), a search using regular expressions skips that paragraph. I am aware that there is a limit on the number of characters in a paragraph; however, the issue occurs with "normal" sizes.

A test doc file is attached. It contains 3 paragraphs, the 2nd being the largest. The point of interest is the quoted text, which is emphasized.

1. Put the cursor at the beginning of the 1st para.
2. Press Ctrl+F to open the search dialog.
3. In the "Search for" field enter “(.*)”
   Please copy and paste the above text exactly, since the quotation characters are different at the beginning and at the end of the string. You can see the difference better in the attached file. The search string should capture all the text between the two quotation marks.
4. Select Regular expressions under More Options.
5. Click the Find button once. The text “Evidently some ...” should be highlighted.
6. Click Find one more time.

Notice that the 2nd para is skipped although it has text in quotes, and only the text in the 3rd para is selected.

Two observations:
a) If the 2nd para is split into 2 parts of almost equal size and the same steps are performed, Find identifies the quoted text in the 2nd para (which is now in 2 parts).
b) Back to the 3rd para: the text identified includes 2 consecutive quoted sentences, i.e. the text FROM “The promises were ... TO with the promise”. This is not normal. It should find the two sentences separately: first “The promises were ...his seed” and after that the following sentence “What I mean ... with the promise”.

Note that if I use the add-on "AltSearch" extension http://extensions.services.openoffice.org/project/AltSearch the search in the 3rd para identifies the 2 sentences correctly. However, the issue with searching in the 2nd para persists.
Created attachment 67163 [details] doc
MRU-OS: I can confirm this. Perform the steps as described above and you will see that the search text in the second (long, page-breaking) paragraph is not found. When the paragraph is made shorter, the text is found, but so is the undesired text between the quoted sections.
I'm not so sure this is a defect; it's more an inherent limitation of the regular expression search, along with a misunderstanding of the pattern language.

Because the repetition elements (* or +) are defined as "greedy", searching for x.*y must look at every character in the paragraph, starting at the first "x", all the way to the end of the paragraph, then backtrack, looking for a "y" to end the match. It's not unreasonable to limit the amount of backtracking that the search will perform. In Writer, this limit seems to be around 5850 characters.

The "greedy" repetition is also the reason that the regex in question here matches multiple quoted passages in the same paragraph: the .* matches _any_ character, /including a closing quote/. So while the pattern may look like it should match a single quotation, it actually matches everything between the first open quote and the last closing quote in a paragraph: that could be one quotation or a hundred.

The greedy behavior and backtracking problem can be avoided by using something more specific than . (any character):

“([^”]*)”

I.e., repeat any character _except_ a close quote, any number of times.

This is foundational for using regexes, which is unfortunate for new users, but good explanations are easily found, including http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer

I'll attach another test document that has a number of paragraphs, each starting with a quoted word and followed by a varying number of characters. The search for “(.*)” works for all the sample paragraphs, up to 5840 characters after the quotation, then fails for the last paragraph, with 5850 characters after the quotation.
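For readers who want to see the greedy-versus-negated-class difference outside Writer, here is a minimal sketch using Python's re module. It only illustrates the general regex behavior described in this comment; it does not use Writer's search engine and does not reproduce Writer's length limit. The sample sentence is made up for illustration and is not taken from either attachment.

```python
import re

# Curly quotes written as escapes so the file encoding cannot mangle them.
OPEN = "\u201c"   # left double quotation mark
CLOSE = "\u201d"  # right double quotation mark

# Hypothetical paragraph with two separate quotations.
paragraph = (
    f"He said {OPEN}first quote{CLOSE} and later "
    f"{OPEN}second quote{CLOSE} to end."
)

# Greedy pattern: .* also consumes closing quotes, so the match runs from
# the first opening quote to the LAST closing quote in the paragraph.
greedy = re.compile(f"{OPEN}(.*){CLOSE}")

# Negated character class: the repetition stops at the first closing quote.
negated = re.compile(f"{OPEN}([^{CLOSE}]*){CLOSE}")

greedy_matches = greedy.findall(paragraph)
negated_matches = negated.findall(paragraph)

print(len(greedy_matches))   # 1 match spanning both quotations
print(len(negated_matches))  # 2 matches, one per quotation
```

With the greedy pattern the single captured group contains the text between the two quotations as well, which corresponds to observation (b) in the original report.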
Created attachment 67172 [details] Sample document for testing and further information
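A rough sketch of how similar test paragraphs could be generated as plain text, for anyone who wants to probe the limit without the attachment. This is not how the attached document was produced: the helper name make_paragraph, the filler text, and the chosen lengths are all assumptions made for illustration. The output would still need to be pasted into a Writer document and searched by hand.

```python
OPEN = "\u201c"   # left double quotation mark
CLOSE = "\u201d"  # right double quotation mark


def make_paragraph(filler_length: int, quoted_word: str = "word") -> str:
    """Return a quoted word followed by exactly filler_length filler characters.

    Hypothetical helper: the filler is nine 'x' characters plus a space,
    repeated and then cut to the requested length.
    """
    unit = "x" * 9 + " "
    filler = unit * (filler_length // len(unit) + 1)
    return f"{OPEN}{quoted_word}{CLOSE}" + filler[:filler_length]


# Lengths bracketing the reported threshold (5840 worked, 5850 failed);
# the exact range and step are assumptions.
lengths = list(range(5800, 5861, 10))
paragraphs = [make_paragraph(n) for n in lengths]

# One paragraph per line, ready to paste into a test document.
text = "\n".join(paragraphs)
```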
Reset assignee to the default "issues@openoffice.apache.org".