Apache OpenOffice (AOO) Bugzilla – Issue 75550
AutoCorrect option for recognizing typed pagenumber
Last modified: 2013-02-07 22:36:53 UTC
This is another option for the Bad Document Correction Tool (Issue 75524). Its purpose is to detect and remove hardwired page numbering.

Sometimes the user doesn't know how to put proper page numbering into a document, so they do it as on a mechanical typewriter: they place "-1-" or similar text at the bottom of every page. Since page boundaries are not fixed, any reformatting, whether intentional or caused by a different printer, corrupts this "page numbering". If the pages are "shortened", the "page numbers" move up on every page. If the pages are "prolonged", the "page numbers" spill over to the beginning of the next page. Either way, they're bad. The option should detect them and remove them completely, then let the user set up their own numbering or offer to call the Page Numbering Wizard (issue 7065).

Detecting "hardwired page numbering" could be a difficult task if done fully automatically, but much simpler if the user is asked to cooperate. For example, the user could be asked to select the first sample of the numbering. OOo could then clean up the sample (removing spaces before and after the pattern, removing line breaks, etc.). Variations of the pattern could also be considered, to allow flexibility and tolerate discrepancies and typos. For example, if the user selected the sample "- 1 -", then the variants "- 1-", "-1-", "-1 -" etc. should be considered as well.

The user's help would make the search much easier for OpenOffice.org. Not only could the patterns be detected more precisely than if they were totally unknown, but the distances between the "hardwired page numbers" could also be guessed from the user-selected sample.

OpenOffice.org should then search for the patterns. All lines that contain only a number, or that somehow resemble the searched pattern, are suspect. OOo should search for a similar pattern with the number increased. Then it could do some statistical analysis.
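To make the sample-based search more concrete, here is a rough Python sketch of how a user-selected sample might be turned into a whitespace-tolerant pattern and used to collect suspect lines. The function names (`build_pattern`, `find_candidates`, `increasing_run`) are made up for illustration and are not part of OpenOffice.org. The sketch is also narrower than the proposal: it only accepts lines that match the sample's shape, not every line that contains a bare number.

```python
import re

def build_pattern(sample):
    """Turn a user-selected sample such as '- 1 -' into a tolerant regex."""
    cleaned = sample.strip()
    number = re.search(r"\d+", cleaned)
    if number is None:
        raise ValueError("sample contains no number")
    # Literal text around the number, with all whitespace dropped.
    prefix = re.sub(r"\s+", "", cleaned[:number.start()])
    suffix = re.sub(r"\s+", "", cleaned[number.end():])

    def flexible(text):
        # Allow optional whitespace between the literal characters.
        return r"\s*".join(re.escape(ch) for ch in text)

    return re.compile(
        r"^\s*" + flexible(prefix) + r"\s*(\d+)\s*" + flexible(suffix) + r"\s*$"
    )

def find_candidates(lines, sample):
    """Return (line_index, number) for every line shaped like the sample."""
    pattern = build_pattern(sample)
    hits = []
    for index, line in enumerate(lines):
        match = pattern.match(line)
        if match:
            hits.append((index, int(match.group(1))))
    return hits

def increasing_run(hits, start=1):
    """Keep only hits whose numbers count up start, start+1, start+2, ..."""
    run, expected = [], start
    for index, number in hits:
        if number == expected:
            run.append((index, number))
            expected += 1
    return run
```

With the sample "- 1 -", lines such as "-2-" or " - 3-" would be accepted, while a line like "see 3 items" would not, because the pattern must cover the whole line.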
A Gaussian curve could be computed, representing how regularly such numbered patterns repeat inside the document. If the curve is very sharp and concentrated, meaning that the pattern repeats at quite predictable intervals, then we have probably found the "hardwired numbering", and such lines could be safely removed.

Please note: This option of the Bad Document Correction Tool should run BEFORE the Line breaks correction (Issue 75549), and probably before any other options, because the fewer corrections have already been applied to the document, the better the chance of detecting the "hardwired page numbers" precisely.
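The statistical check on the repetition intervals could look roughly like the Python sketch below. Instead of fitting a full Gaussian curve, it only computes the mean and standard deviation of the gaps between suspect lines and calls the distribution "sharp" when the spread is small relative to the mean. The function names and the 10% threshold (`max_relative_spread`) are assumptions chosen for illustration, not values from this issue.

```python
from statistics import mean, pstdev

def interval_stats(positions):
    """Mean and standard deviation of the gaps between suspect lines."""
    gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None  # too few samples to say anything useful
    return mean(gaps), pstdev(gaps)

def looks_like_page_numbering(positions, max_relative_spread=0.1):
    """True if the suspect lines repeat at nearly constant intervals."""
    stats = interval_stats(positions)
    if stats is None:
        return False
    average, spread = stats
    # A small spread relative to the mean corresponds to a sharp,
    # concentrated curve; the threshold is an arbitrary assumption.
    return average > 0 and spread / average <= max_relative_spread
```

For example, suspect lines at positions 40, 81, 120, 161 and 200 have gaps of about 40 lines with very little spread, so they would be accepted, whereas positions 3, 10, 90, 95 and 300 would be rejected as irregular.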
Reassigned to requirements.