Bug 45575

Summary: [PATCH] Code to know if a Range is in body, header/footer, footnote etc.
Product: POI Reporter: dnapoletano <domenico.napoletano>
Component: HWPF Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: enhancement CC: domenico.napoletano
Priority: P2    
Version: 3.0-dev   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Simple test doc with body, header/footer, annotations, footnotes and endnotes

Description dnapoletano 2008-08-06 01:32:22 UTC
Created attachment 22394
Simple test doc with body, header/footer, annotations, footnotes and endnotes

Using a small trick (based on text length) it's possible to get the location of a Range (body? header/footer? footnote? etc.). For example, suppose we have 5 character runs:
1) coded in ASCII, ending at 2000
2) coded in Unicode, ending at 4050
3) coded in ASCII, ending at 2100
4) coded in Unicode, ending at 4200
5) coded in Unicode, ending at 4500
and that the ccpText field of the document they belong to is 2100.
If every character run were in ASCII (we can tell whether a character run is Unicode or ASCII by comparing its length in characters, taken from the text, with its length in bytes, taken from end - start), the end values would be
1) 2000
2) 2025
3) 2100
4) 2100
5) 2250
and then, comparing *these* end values with ccpText, we can conclude that the character runs are
1) in body
2) in body
3) at end of body
4) at end of body
5) out of body, maybe in footnote
This same algorithm can be applied to all Range types (paragraph, section, and so on) and to all locations (body, header/footer, footnote, etc.)
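
The worked example above can be replayed with a few lines of stand-alone Java. This is only a sketch: the class and method names (NormalizedEndExample, endInChars, relativeToBody) are made up for illustration and are not part of POI, and it assumes, as the example does, that a Unicode run uses exactly two bytes per character.

```java
// Hypothetical helper, not part of POI: replays the five-run example.
class NormalizedEndExample {

    /** Converts a run's end offset to characters (Unicode runs: 2 bytes per character). */
    static int endInChars(int end, boolean unicode) {
        return unicode ? end / 2 : end;
    }

    /** Describes where a normalized end falls relative to ccpText. */
    static String relativeToBody(int endInChars, int ccpText) {
        if (endInChars < ccpText) {
            return "in body";
        }
        if (endInChars == ccpText) {
            return "at end of body";
        }
        return "out of body";
    }

    public static void main(String[] args) {
        int ccpText = 2100;
        // End offsets and encodings of the five runs from the example
        int[] ends = {2000, 4050, 2100, 4200, 4500};
        boolean[] unicode = {false, true, false, true, true};

        for (int i = 0; i < ends.length; i++) {
            int x = endInChars(ends[i], unicode[i]);
            System.out.println((i + 1) + ") " + x + " " + relativeToBody(x, ccpText));
        }
    }
}
```

Run 5 is only reported as "out of body" here; deciding whether it is a footnote, header/footer and so on needs the other ccp* fields.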

To make it possible, it's necessary to:

1) add the following new methods to the FileInformationBlock class

    public int getCcpFtn() {
    	return _longHandler.getLong(FIBLongHandler.CCPFTN);
    }
    
    public int getCcpHdd() {
    	return _longHandler.getLong(FIBLongHandler.CCPHDD);
    }
    
    public int getCcpAtn() {
    	return _longHandler.getLong(FIBLongHandler.CCPATN);
    }
    
    public int getCcpEdn() {
    	return _longHandler.getLong(FIBLongHandler.CCPEDN);
    }

to get the lengths in characters of the footnotes, header/footer, annotations and endnotes respectively

2) create a new enum in the "usermodel" package to represent locations
public enum Location {
	BODY,
	FOOTNOTE,
	HEADER_FOOTER,
	ANNOTATION,
	ENDNOTE,
	UNKNOWN;
}

Instead of an enum, a series of int constants defined in Range could also be used.

3) add to the Range class the new member variable

protected Location _location = null;

and the new method

public Location getLocationType() {
    // Computed on first call, then cached in _location
    if (_location == null) {
        // End of this Range, in characters
        int x;

        int charLen = this.text().length();
        int byteLen = _end - _start;
        if (byteLen == charLen) {
            x = _end;      // ASCII: one byte per character
        } else {
            x = _end / 2;  // Unicode: two bytes per character
        }

        // Compare against the cumulative lengths of the text areas, in file order
        FileInformationBlock fib = _doc.getFileInformationBlock();
        if (x <= fib.getCcpText())
            _location = Location.BODY;
        else if (x <= fib.getCcpText() + fib.getCcpFtn())
            _location = Location.FOOTNOTE;
        else if (x <= fib.getCcpText() + fib.getCcpFtn() + fib.getCcpHdd())
            _location = Location.HEADER_FOOTER;
        else if (x <= fib.getCcpText() + fib.getCcpFtn() + fib.getCcpHdd() + fib.getCcpAtn())
            _location = Location.ANNOTATION;
        else if (x <= fib.getCcpText() + fib.getCcpFtn() + fib.getCcpHdd() + fib.getCcpAtn() + fib.getCcpEdn())
            _location = Location.ENDNOTE;
        else
            _location = Location.UNKNOWN;
    }

    return _location;
}
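
The cascade of comparisons can also be tried without a Word file. The following is a minimal stand-alone sketch, not POI code: the class name LocationCascadeExample and the method classify are hypothetical, the enum simply mirrors the one proposed in step 2, and the lengths used in main are invented numbers rather than values read from a real FileInformationBlock.

```java
// Hypothetical stand-alone version of the cascade; nothing here calls POI.
class LocationCascadeExample {

    enum Location { BODY, FOOTNOTE, HEADER_FOOTER, ANNOTATION, ENDNOTE, UNKNOWN }

    /**
     * Classifies an end offset (in characters) against the cumulative
     * lengths of the text areas, taken in the same order as the patch:
     * body, footnotes, header/footer, annotations, endnotes.
     */
    static Location classify(int endInChars, int ccpText, int ccpFtn,
                             int ccpHdd, int ccpAtn, int ccpEdn) {
        int[] lengths = {ccpText, ccpFtn, ccpHdd, ccpAtn, ccpEdn};
        Location[] order = {
            Location.BODY, Location.FOOTNOTE, Location.HEADER_FOOTER,
            Location.ANNOTATION, Location.ENDNOTE
        };

        int limit = 0;
        for (int i = 0; i < lengths.length; i++) {
            limit += lengths[i];          // running upper bound of area i
            if (endInChars <= limit) {
                return order[i];
            }
        }
        return Location.UNKNOWN;          // past the last known area
    }

    public static void main(String[] args) {
        // Invented lengths: body 100, footnotes 20, header/footer 30,
        // annotations 10, endnotes 5 characters
        int[] samples = {100, 101, 150, 160, 165, 166};
        for (int end : samples) {
            System.out.println(end + " -> " + classify(end, 100, 20, 30, 10, 5));
        }
    }
}
```

Keeping a running limit gives the same result as summing the ccp* fields again in every branch, and makes it easy to see that a Range ending exactly on a boundary is assigned to the earlier area.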

This is a simple test class (perhaps it could be turned into a JUnit test case?) to test my code:

import java.io.FileInputStream;

import javax.swing.JFileChooser;
import javax.swing.JOptionPane;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class QuickTest
{
  public QuickTest()
  {
  }

  public static void main(String[] args)
  {
    try
    {
      // Let the user pick the .doc file to inspect
      JFileChooser jfc = new JFileChooser();

      int esito = jfc.showOpenDialog(null);

      if(esito != JFileChooser.APPROVE_OPTION)
      {
        JOptionPane.showMessageDialog(null, "No file selected");
      }
      else
      {
        String percorso = jfc.getSelectedFile().getAbsolutePath();

        HWPFDocument doc = new HWPFDocument(new FileInputStream(percorso));
        Range r = doc.getRange();
        for(int i = 0; i < r.numParagraphs(); i++)
        {
          //Paragraph, CharacterRun, Section... it's equivalent
          Paragraph cr = r.getParagraph(i);
          // Print the trimmed paragraph text followed by its location
          System.out.println("<" + cr.text().trim() + "> " + cr.getLocationType());
        }
      }
    }
    catch(Exception er)
    {
      er.printStackTrace();
    }
  }

}

which, applied to the test doc I have attached, produces the output

<BODY TEXT FRAGMENT 1> BODY
<BODY TEXT FRAGMENT 2> BODY
<> BODY
<FOOTNOTE TEXT 1> FOOTNOTE
<FOOTNOTE TEXT 2> FOOTNOTE
<> FOOTNOTE
<> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<HEADER TEXT FRAGMENT 1> HEADER_FOOTER
<HEADER TEXT FRAGMENT 2> HEADER_FOOTER
<> HEADER_FOOTER
<FOOTER TEXT FRAGMENT 1> HEADER_FOOTER
<FOOTER TEXT FRAGMENT 2> HEADER_FOOTER
<> HEADER_FOOTER
<> HEADER_FOOTER
<ANNOTATION 1> ANNOTATION
<ANNOTATION 2> ANNOTATION
<> ANNOTATION
<ENDNOTE TEXT> ENDNOTE
<> ENDNOTE
<> UNKNOWN
Comment 1 Nick Burch 2008-08-11 04:41:11 UTC
Something similar to this is now in svn

getRange() now only returns the main body, but getOverallRange() gives you the lot. There are also a few other Range getters, for the other things like header+footer

The unicode stuff has also been made a bit nicer, so the range detection stuff is much simpler now too :)
Comment 2 derf 2008-10-01 03:49:48 UTC
(In reply to comment #1)
> Something similar to this is now in svn
> 
> getRange() now only returns the main body, but getOverallRange() gives you the
> lot. There are also a few other Range getters, for the other things like
> header+footer
> 
> The unicode stuff has also been made a bit nicer, so the range detection stuff
> is much simpler now too :)
> 

I have poi-3.1-final but I don't see getOverallRange or any of the other Range getters.