Since 2007, Microsoft Office files have used OpenXML-based file formats (e.g. .docx, .pptx, .xlsx). XML stands for eXtensible Markup Language and was designed as a format for encoding documents that both humans and machines can easily interpret.
The contents of a document are stored as a series of tagged elements, with each element representing a different part of the document, such as a paragraph, a table, or an image. Because the format is extensible, the XML that defines a given file can be specific to the software it is intended for.
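To make the "series of tagged elements" concrete, here is a minimal Python sketch that lists the elements sitting directly under the document body. It assumes the main document part is stored at word/document.xml inside the .docx ZIP package, which is the usual location but is not guaranteed. The function names and the SAMPLE string are illustrative only.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_document_xml(docx_path):
    """Return the main document part of a .docx file as a string.

    Assumes the part lives at word/document.xml (the usual location).
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def body_element_names(document_xml):
    """List the local names of the elements directly under <w:body>."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    # ElementTree tags look like "{namespace}p"; keep only the local name.
    return [child.tag.split("}", 1)[1] for child in body]


# A hand-written, heavily simplified stand-in for a real document.xml.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:tbl/>"
    "<w:sectPr/>"
    "</w:body></w:document>"
)

print(body_element_names(SAMPLE))  # ['p', 'tbl', 'sectPr']
```

Note that nothing in this list is a "page" element: the body is a flat sequence of paragraphs, tables, and section properties.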
When you open a Word document and navigate to a specific page, you may notice that the page number is displayed in the status bar at the bottom of the screen. However, this page number is not actually stored as a separate element in the document. Instead, it is calculated dynamically based on the layout of the document and the settings for margins, page size, and other formatting options. The page layout can vary depending on the contents of the document, and there may be other elements that affect the pagination, such as section breaks, headers, and footers.
Furthermore, Word uses a technique called "page caching" to improve performance when displaying large documents. As a result, the content of a given page is not necessarily stored in an element of its own. Instead, content spanning several pages may sit in a single element (a long paragraph or table, for example), which makes it difficult to extract a specific page without also including content from other pages.
There are two tags within a Microsoft Word OpenXML file that define where a Page Break occurs.
Manual Page Break (Hard Break) | <w:br w:type="page" /> |
Implicit Page Break | <w:lastRenderedPageBreak /> |
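As a rough illustration of how these two tags might be located, the Python sketch below counts each kind in a document.xml string. The function name and the SAMPLE markup are hypothetical and hand-written for the example. A plain <w:br /> without w:type="page" is treated as an ordinary line break and is not counted.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_breaks(document_xml):
    """Count manual and implicit page-break tags in a document.xml string."""
    root = ET.fromstring(document_xml)
    # Manual breaks: <w:br> elements whose w:type attribute is "page".
    manual = sum(
        1
        for br in root.iter(f"{{{W_NS}}}br")
        if br.get(f"{{{W_NS}}}type") == "page"
    )
    # Implicit breaks: every <w:lastRenderedPageBreak> element.
    implicit = sum(1 for _ in root.iter(f"{{{W_NS}}}lastRenderedPageBreak"))
    return {"manual": manual, "implicit": implicit}


# Simplified example: one manual break, one plain line break, two implicit breaks.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>One</w:t><w:br/></w:r></w:p>"
    "<w:p><w:r><w:lastRenderedPageBreak/><w:t>Two</w:t></w:r></w:p>"
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    "<w:p><w:r><w:lastRenderedPageBreak/><w:t>Three</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

print(count_page_breaks(SAMPLE))  # {'manual': 1, 'implicit': 2}
```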
Most Word files today don't use Manual Page Breaks, relying instead on the Implicit Page Breaks created when your content 'overflows' onto the next page. When using Microsoft Word, you don't notice how these page breaks change the document, but they are essential for rendering pages in the way you would expect. In the XML code, these Implicit Page Breaks aren't always in the position you would expect from the visual page breaks within Word. However, if Manual Page Breaks are not utilised, they are the only tags available to look for when determining the pages to extract.
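The approach described here, using the break tags as the only available page boundaries, could be sketched in Python as follows. This is an approximation under stated assumptions, not how any particular import tool works: it walks the XML in document order, starts a new "page" at every break tag, and keeps only the text found in <w:t> elements. The function name and SAMPLE markup are hypothetical. Because it trusts the tags blindly, it may produce an extra empty page when an implicit break marker directly follows a manual break.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
BR = f"{{{W_NS}}}br"
BR_TYPE = f"{{{W_NS}}}type"
LAST_RENDERED = f"{{{W_NS}}}lastRenderedPageBreak"
TEXT = f"{{{W_NS}}}t"


def split_text_by_page_breaks(document_xml):
    """Split the document text into approximate pages at break tags."""
    root = ET.fromstring(document_xml)
    pages = [[]]
    # iter() visits elements in document order, so breaks and text interleave
    # in the same order they appear in the file.
    for node in root.iter():
        is_manual = node.tag == BR and node.get(BR_TYPE) == "page"
        if is_manual or node.tag == LAST_RENDERED:
            pages.append([])  # start collecting text for the next page
        elif node.tag == TEXT and node.text:
            pages[-1].append(node.text)
    return ["".join(parts) for parts in pages]


# Simplified example: an implicit break in the middle of a paragraph,
# followed by a manual break in a paragraph of its own.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Page one. </w:t></w:r>"
    "<w:r><w:lastRenderedPageBreak/><w:t>Page two.</w:t></w:r></w:p>"
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    "<w:p><w:r><w:t>Page three.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

print(split_text_by_page_breaks(SAMPLE))
# ['Page one. ', 'Page two.', 'Page three.']
```

The first paragraph in the sample shows the difficulty: a single <w:p> element holds text from two pages, so extracting "page two" means cutting a paragraph in half rather than selecting whole elements.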
In the end, the dynamic nature of Word's pagination, page caching, and the way page breaks are saved within the XML code make it difficult to obtain the exact pages requested from the Import Options.