Issue 75525

Summary: Bad document correction -line breaks removal
Product: Writer
Reporter: tuharsky <tuharsky>
Component: editing
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P3
CC: issues
Version: OOo 2.1
Target Milestone: ---
Hardware: All
OS: All
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on: 75524
Issue Blocks:

Description tuharsky 2007-03-19 15:49:40 UTC
This is related to Issue 75524.

One option of the "Bad Document Correcting Tool" could offer intelligent removal
of unnecessary line breaks, as a sub-option of the general BDC Tool.

Purpose:
Some users type text on a PC the same way they did on mechanical typewriters:
they insert a line break at the end of every line. Such a document is impossible
to format; one must delete the line breaks manually. Moreover, if the document
has gone through some printer-aided reformatting, the situation is even worse:
a single line of text continues on the next line (for a single word or a few)
and THEN the line break suddenly appears. The next line behaves similarly, and
so on.

I'm talking about the same effect as in mail clients that insert line breaks
automatically. Then you open the mail in another mail client, forward it, etc.
In the end, you have the ugly, corrupted text formatting mentioned above.

So, the option should offer a convenient way of automatically removing such
mis-broken lines. An algorithm needs to be devised for proper detection of
mis-broken lines; for a start, a simple set of rules could do:


1, A text section should be considered "intended consistent" if it contains no
empty line. In other words, even if the text contains line breaks, it is
considered "should be consistent" as long as it doesn't contain an ENTIRELY
EMPTY line; the text between two empty lines is treated as a single consistent
block.
The "line mis-breaks" should be removed on this general basis, with the
following more finely tuned heuristic rules:

2, A line is considered "intentionally ended" (with a line break that should
remain untouched) if its length is less than, say, 3/4 of the full line length.

3, If the defined "intended consistent" block contains lines that are just a
few (up to, say, 20) letters longer than the full line length (so that just a
few characters spill onto the next line and are then ended with a line break),
the break is considered a probable line mis-break.

4, Lines that begin with a bullet or numbering are considered intentionally
(regularly) ended, so the line break at the end of such a line should remain
untouched.

5, If the whole line is set in a different font than the majority of the
"intended consistent" block, the probability of a line mis-break is smaller; the
line could also be a kind of header.
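
As a rough illustration only, rules 1, 2 and 4 could be sketched in Python on
plain text as below. This is not existing Writer code: every name here
(FULL_WIDTH, BULLET_RE, split_blocks, keep_break_after, rejoin_block,
remove_misbreaks) is hypothetical, the full line length of 80 characters and
the bullet shapes are assumptions, and rule 5 is left out because plain text
carries no font information.

```python
import re

# All names and thresholds are illustrative assumptions, not an existing API.
FULL_WIDTH = 80        # assumed "full line length" in characters
SHORT_RATIO = 0.75     # rule 2: "say, 3/4 of the full line length"
# Rule 4: assumed bullet/numbering shapes ("-", "*", "•", "1.", "1,", "1)")
BULLET_RE = re.compile(r"^\s*(?:[-*\u2022]|\d+[.,)])\s+")


def split_blocks(text):
    """Rule 1: text between entirely empty lines forms one block."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == "":
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def keep_break_after(line, full_width=FULL_WIDTH):
    """Return True if the break at the end of `line` looks intentional."""
    # Rule 2: a clearly short line was ended on purpose.
    if len(line.rstrip()) < SHORT_RATIO * full_width:
        return True
    # Rule 4: a line starting with a bullet or number keeps its break.
    return bool(BULLET_RE.match(line))


def rejoin_block(lines, full_width=FULL_WIDTH):
    """Merge lines of one block wherever the break looks unintentional."""
    out = [lines[0]]
    for prev, line in zip(lines, lines[1:]):
        if keep_break_after(prev, full_width):
            out.append(line)
        else:
            out[-1] = out[-1].rstrip() + " " + line.lstrip()
    return out


def remove_misbreaks(text, full_width=FULL_WIDTH):
    """Apply the rules block by block; blocks stay separated by one empty line."""
    blocks = split_blocks(text)
    return "\n\n".join("\n".join(rejoin_block(b, full_width)) for b in blocks)
```

Calling remove_misbreaks(text, full_width=40) on a paragraph typed to a
40-character margin merges its full-length lines into one, while a following
block of short bulleted lines is left as it is. Rule 3 needs no separate check
in this sketch: the full-length line is merged with the short remainder after
it, and the remainder keeps its own break under rule 2.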


Please add more rules if you wish.

In general, the function would analyse the text block (or the whole document,
if selected) and remove the line breaks suspected of being "unintentional" or
"mis-used".
Comment 1 tuharsky 2007-03-19 16:05:23 UTC
6, The probability of a line mis-break is higher if the line starts with a
lowercase letter.

7, The probability of a line mis-break is higher if the previous line doesn't
end with a period or a similar punctuation mark.

8, The probability of a line mis-break is higher if the previous line ends with
a comma.

9, The probability of a line mis-break is significantly lower if the majority of
lines in the "intended consistent" block also end with a comma (because this
could suggest some kind of bulleted list).
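
Rules 6 to 9 talk about probabilities rather than hard decisions, so one way to
picture them is as a score for each line break. The sketch below is only an
illustration: the function name misbreak_score, the weights (+1.0 and -2.0) and
the set of characters treated as "similar punctuation marks" are all
assumptions, not anything that exists in Writer.

```python
# Assumed set of sentence-ending or similar punctuation marks (rule 7).
END_PUNCTUATION = ".!?:;"


def misbreak_score(prev_line, line, block_lines):
    """Score the break between prev_line and line.

    A higher score means the break is more likely unintentional.
    The weights are arbitrary placeholders that would need tuning.
    """
    score = 0.0
    prev = prev_line.rstrip()
    cur = line.lstrip()

    # Rule 6: the line starts with a lowercase letter.
    if cur[:1].islower():
        score += 1.0

    # Rule 7: the previous line does not end with a period or similar mark.
    if prev and prev[-1] not in END_PUNCTUATION:
        score += 1.0

    # Rule 8: the previous line ends with a comma (this adds to rule 7).
    if prev.endswith(","):
        score += 1.0

    # Rule 9: most lines of the block end with a comma, which may be a list.
    comma_lines = sum(1 for l in block_lines if l.rstrip().endswith(","))
    if comma_lines * 2 > len(block_lines):
        score -= 2.0

    return score
```

A break would then be removed only when its score passes some threshold, which
would also have to be chosen by experiment. The scores could be combined with
the hard rules 2 and 4, for example by never removing a break that those rules
mark as intentional.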
Comment 2 michael.ruess 2007-03-19 16:44:44 UTC
Reassigned to requirements.
Comment 3 tuharsky 2007-03-20 08:07:04 UTC
Hi, mru

"Enhancement is an improvement to an existing feature.
Feature is an addition to the software to add a piece of functionality that does
not yet exist."

Do you mean such functionality already exists and just needs to be improved? I'd
like to see it, so that I could better cooperate on the improvements.