Issue 108273 - regular expression skips long paragraphs finding quoted text
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: editing
Version: OOo 3.1.1
Hardware: PC All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-01-13 03:55 UTC by badrian
Modified: 2017-05-20 11:20 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
doc (32.53 KB, application/vnd.oasis.opendocument.text)
2010-01-13 03:57 UTC, badrian
Sample document for testing and further information (11.10 KB, application/vnd.oasis.opendocument.text)
2010-01-13 14:16 UTC, Joe Smith

Description badrian 2010-01-13 03:55:59 UTC
Hi!

I hope this is not a duplicate issue.
Here is a description of the problem.
I noticed that if a paragraph is relatively large (more than one Letter-size
page), a search using regular expressions skips that paragraph. I am aware that
there is a limit on the number of characters in a paragraph; however, the issue
occurs with "normal" sizes.

A test doc file is attached. It includes 3 paragraphs, with the 2nd being the
largest. The point of interest is the quoted text, which is emphasized.

1. Put the cursor at the beginning of the 1st para.
2. Press Ctrl+F to open the search dialog.
3. In the "Search for" field put
“(.*)”
Please copy and paste the above text exactly, since the quotation characters are
different at the beginning and at the end of the string. You can see the
difference better in the attached file.
The search string should capture all the text between two quotation marks.
4. Select Regular expressions under More Options.
5. Click the Find button once.
The text
“Evidently some ...”
should be highlighted.
6. Click Find one more time.
Notice that the 2nd para is skipped although it has text in quotes, and only the
text in the 3rd para is selected.

Here are 2 observations.
a) If the 2nd para is split into 2 parts of almost equal size and the same
operations are performed as above, the find will identify the quoted text in the
2nd para (that is now in 2 parts).
b) Let's go back to the 3rd para. Notice that the text identified includes 2
consecutive quoted sentences, i.e. the text FROM “The promises were ...
TO with the promise”.
This is not normal. It should find the two different sentences, first “The
promises were ...his seed” and after that the following sentence “What I mean
... with the promise”.
Note that if I use the add-on "AltSearch" extension
http://extensions.services.openoffice.org/project/AltSearch
the search in the 3rd para identifies the 2 sentences correctly. However, the
issue with searching in the 2nd para persists.
Comment 1 badrian 2010-01-13 03:57:29 UTC
Created attachment 67163
doc
Comment 2 michael.ruess 2010-01-13 08:42:00 UTC
MRU-OS: can confirm this. Perform the steps as described above and you will see
that the search text in the second (long, page-breaking) paragraph will not be
found. When the paragraph is made shorter, the text will be found, but so will
the undesired text between the quoted sections.
Comment 3 Joe Smith 2010-01-13 14:15:43 UTC
I'm not so sure this is a defect; it's more an inherent limitation of the
regular expression search, along with misunderstanding of the pattern language.

Because the repetition elements (* or +) are defined as "greedy", searching for
x.*y must look at every character in the paragraph, starting at the first "x",
all the way to the end of the paragraph, then backtrack, looking for a "y" to
end the match.

It's not unreasonable to limit the amount of backtracking that the search will
perform. In Writer, this seems to be around 5850 characters.
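
The backtracking cost described above can be sketched in Python. This is only a toy model of how a greedy x.*y search behaves within one paragraph, not Writer's actual search code. The function name greedy_backtrack and the max_steps cutoff are assumptions made for illustration; the value 5850 is simply the approximate limit observed above.

```python
OPEN, CLOSE = "\u201c", "\u201d"  # the curly quotes “ and ”


def greedy_backtrack(paragraph, open_ch, close_ch, max_steps=None):
    """Toy model of matching open_ch .* close_ch within one paragraph.

    Returns (matched_text_or_None, backtrack_steps). max_steps is a
    hypothetical cutoff, not a documented Writer setting.
    """
    start = paragraph.find(open_ch)
    if start == -1:
        return None, 0
    steps = 0
    # Greedy .* first consumes everything up to the end of the paragraph,
    # then gives characters back one at a time while looking for close_ch.
    pos = len(paragraph)
    while pos > start + 1:
        steps += 1
        if max_steps is not None and steps > max_steps:
            return None, steps  # gave up: the paragraph appears "skipped"
        if paragraph[pos - 1] == close_ch:
            return paragraph[start:pos], steps
        pos -= 1
    return None, steps


short = OPEN + "word" + CLOSE + "a" * 10
long_ = OPEN + "word" + CLOSE + "a" * 6000

print(greedy_backtrack(short, OPEN, CLOSE))
print(greedy_backtrack(long_, OPEN, CLOSE, max_steps=5850))
```

In this model the number of steps grows with the amount of text after the last closing quote, which would be consistent with a long paragraph being skipped while a shorter one is matched.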

The "greedy" repetition is also the reason that the regex in question here
matches multiple quoted passages in the same paragraph: the .* matches _any_
character, /including a closing quote/. So while the pattern may look like it
should match a single quotation, it actually matches everything between the
first open quote and the last closing quote in a paragraph--that could be one
quotation or a hundred.
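
As a rough illustration of the greedy behavior, here is a short Python sketch. It uses Python's re module rather than Writer's search engine, and the sample paragraph is made up for the example (it is not taken from the attached document).

```python
import re

OPEN, CLOSE = "\u201c", "\u201d"  # the curly quotes “ and ”

# Made-up paragraph containing two separate quotations.
paragraph = (
    f"{OPEN}first quotation{CLOSE} some text "
    f"{OPEN}second quotation{CLOSE} more text"
)

# Greedy .* runs from the first open quote to the LAST close quote.
greedy = re.search(f"{OPEN}(.*){CLOSE}", paragraph)
print(greedy.group(1))
# prints: first quotation” some text “second quotation
```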

The greedy behavior and backtracking problem can be avoided by using something
more specific than . (any character): “([^”]*)” I.e., repeat any character
_except_ a close quote, any number of times.
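
A minimal Python sketch of the same idea, again using Python's re module as a stand-in for Writer's search and a made-up sample paragraph:

```python
import re

OPEN, CLOSE = "\u201c", "\u201d"  # the curly quotes “ and ”

# Made-up paragraph containing two separate quotations.
paragraph = (
    f"{OPEN}first quotation{CLOSE} some text "
    f"{OPEN}second quotation{CLOSE} more text"
)

# [^”]* cannot run past a close quote, so each quotation matches on its own.
pattern = re.compile(f"{OPEN}([^{CLOSE}]*){CLOSE}")
quotations = pattern.findall(paragraph)
print(quotations)
# prints: ['first quotation', 'second quotation']
```

Because the negated class stops at the first close quote, the matcher never has to scan to the end of the paragraph and back.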

This is foundational for using regexes, which is unfortunate for new users, but
good explanations are easily found, including
http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_Expressions_in_Writer

I'll attach another test document that has a number of paragraphs, starting with
a quoted word and followed by varying number of characters. The search for
“(.*)” works for all the sample paragraphs, up to 5840 characters after the
quotation, then fails for the last paragraph, with 5850 characters after the
quotation.
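
For anyone who wants to build similar test paragraphs, here is a hedged Python sketch. It does not reproduce the attached document: the helper name make_paragraph, the filler character, and the lengths other than 5840 and 5850 are assumptions. Python's re module is not subject to the limit seen in Writer, so the script only generates the data and shows the baseline result; the failure itself has to be checked by pasting the paragraphs into Writer.

```python
import re

OPEN, CLOSE = "\u201c", "\u201d"  # the curly quotes “ and ”


def make_paragraph(trailing_chars):
    """Return a quoted word followed by trailing_chars filler characters."""
    return f"{OPEN}word{CLOSE}" + "a" * trailing_chars


# 5840 and 5850 come from the comment above; the other lengths are arbitrary.
lengths = [10, 1000, 5840, 5850]
paragraphs = [make_paragraph(n) for n in lengths]

greedy = re.compile(f"{OPEN}(.*){CLOSE}")
for n, para in zip(lengths, paragraphs):
    match = greedy.search(para)
    print(n, match.group(1) if match else "no match")
```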
Comment 4 Joe Smith 2010-01-13 14:16:32 UTC
Created attachment 67172
Sample document for testing and further information
Comment 5 Marcus 2017-05-20 11:20:12 UTC
Reset assignee to the default "issues@openoffice.apache.org".