Programming question regarding MS Word

Solarion

Hey guys. I am busy experimenting with using C# to read Microsoft Word documents. It is related to a task I may or may not have coming up in around July.

So I will not be able to use Interops for security reasons on the server itself. I have tried all day with Open XML and also with DocX, but so far nothing has been able to achieve what I want.

I have a Word document. In this document is a table which has 12 rows and a single column. Each row has some text, and each text field is numbered 1, 2, 3... up to 12 respectively. So 1.xxxx, 2.xxxx etc.

Now here is the catch. The number is auto-generated by Microsoft Word. It only displays when the document is opened with Microsoft Word. If I use a C# console app to read in this document using either Open XML or DocX, the numbers are gone, and the text in each cell is all I see when displayed in the console. I've tried this with all of the free packages, all with the same result.

After doing much digging and head scratching it turns out that these numbers do not actually appear on a Word document as text, but are formatted elements. I have tried with both packages to read these formatted elements to no avail. They simply do not have this capability.

I tried converting the document to text, html, even alien, but nothing helps. So at this point I am concluding that this project may not actually meet a feasibility analysis, which is the point it's at now. It's not a big project, not at all. But it is one of those things which could come in handy if accomplished. Which may be wishful thinking.

TL;DR: Well, hopefully you read at least a little bit, but essentially, I'm really just wanting to know if any of you have come across having to do this in the past?
 
I haven't used Open XML or DOCX, but yes, the automatic numbering will be under List Formatting.

From mucking around a bit with Word VBA, each cell in that table is going to have a Range object, and while the text you type in a cell can be set/get from Range.Text, the automatic numbering as it appears in the document (e.g., 1.2.3.) is a read-only property under Range.ListFormat.ListString.

Surprisingly, it doesn't look like the .docx file contains a cached string value of it though; it seems like the numbers are regenerated when opening the file.

Having a look at the XML inside of a Word document, a single line (paragraph) looks like this:
XML:
<w:p w14:paraId="54F6AB83" w14:textId="2A6F9071" w:rsidR="007454F5" w:rsidRPr="007454F5" w:rsidRDefault="007454F5" w:rsidP="007454F5">
    <w:pPr>
        <w:pStyle w:val="ListParagraph"/>
        <w:numPr>
            <w:ilvl w:val="0"/>
            <w:numId w:val="1"/>
        </w:numPr>
        <w:rPr><w:lang w:val="en-US"/></w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="007454F5">
        <w:rPr><w:lang w:val="en-US"/></w:rPr>
        <w:t>Charlie</w:t>
    </w:r>
</w:p>

where w:ilvl is the Indent Level of the number (0 for 1., 1 for 1.1., etc.).
w:numId appears to identify the list the paragraph belongs to, so numbering continues across paragraphs that share the same value.
The above is the 3rd appearance of <w:numPr> for the numbered list w:numId=1, and since the prior 2 occurrences are also w:ilvl=0, it displays in Word as 3. Charlie (using the "1, 2, 3" numbering format).
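
To make that counting rule concrete, here is a rough sketch in Python (standard library only, picked to keep it dependency-free; the same walk should be doable from C#). It assumes every list is single-level, decimal and starts at 1. The helper names `list_numbers` and `para`, and the texts "Alpha" and "Bravo", are made up for the illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def list_numbers(body_xml):
    """Return (number, text) for each paragraph; number is None if unnumbered.

    Assumption: every list is single-level (w:ilvl = 0) and counts 1, 2, 3...
    """
    counters = {}  # numId -> numbered paragraphs seen so far
    result = []
    for p in ET.fromstring(body_xml).iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_id = p.find("w:pPr/w:numPr/w:numId", NS)
        key = None if num_id is None else num_id.get(f"{{{W}}}val")
        if key is None or key == "0":  # numId 0 means "numbering removed"
            result.append((None, text))
            continue
        counters[key] = counters.get(key, 0) + 1
        result.append((counters[key], text))
    return result


def para(text):
    """Build a trimmed-down numbered paragraph like the one shown above."""
    return (
        "<w:p><w:pPr><w:numPr>"
        '<w:ilvl w:val="0"/><w:numId w:val="1"/>'
        "</w:numPr></w:pPr>"
        f"<w:r><w:t>{text}</w:t></w:r></w:p>"
    )


SAMPLE = (
    f'<w:body xmlns:w="{W}">'
    + "".join(para(t) for t in ["Alpha", "Bravo", "Charlie"])
    + "</w:body>"
)

for number, text in list_numbers(SAMPLE):
    print(f"{number}. {text}")
```

The third paragraph comes out as "3. Charlie" purely because two earlier paragraphs share its numId. Nothing in the paragraph itself stores the 3.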
 
Hi Swift, thanks for the response.

Yup, it is something to behold. I've been staring at these Word back-end XML definitions for a while. This is another example with 6 paragraphs numbered 1 through 6 with some basic text:

XML:
<w:p>
    <w:pPr>
        <w:pStyle w:val="ListParagraph"/>
        <w:numPr>
            <w:ilvl w:val="0"/>
            <w:numId w:val="1"/>
        </w:numPr>
    </w:pPr>
    <w:r>
        <w:t>Paragraph 1</w:t>
    </w:r>
</w:p>

<w:p>
    <w:pPr>
        <w:pStyle w:val="ListParagraph"/>
        <w:numPr>
            <w:ilvl w:val="0"/>
            <w:numId w:val="1"/>
        </w:numPr>
    </w:pPr>
    <w:r>
        <w:t>Paragraph 2</w:t>
    </w:r>
</w:p>

<!-- And so on for paragraphs 3 through 6 -->
 
ChatGPT's summary on a task such as the one I described.

Challenges in Extracting Auto-Formatted Numbering from Word Documents with Open XML and DocX

Overview:
Microsoft Word's auto-formatted numbering (such as in lists and tables) presents a significant challenge for automated extraction. This is due to the way numbering is implemented and stored in Word's internal structure. Understanding these challenges is crucial for setting realistic expectations and exploring viable solutions.

1. Auto-Formatting vs. Text Content:
In Word, auto-formatted numbers (like those in bulleted or numbered lists) are not part of the text content. Instead, they are part of Word's styling and formatting system. When Word renders a document for display or printing, it applies these styles and formats, including generating numbers for lists, but these numbers are not stored as raw text in the document's file (.docx).

2. Open XML SDK Limitations:
The Open XML SDK provides access to the underlying XML structure of Word documents. However, it does not perform document rendering. Numbering in Word documents is managed by a complex system of numbering definitions (numbering.xml) that reference styles and formatting rules rather than explicit text. Extracting auto-formatted numbering with the Open XML SDK would require manually interpreting these definitions and reconstructing the numbering, a process both complex and error-prone.

3. DocX Library Capabilities:
The DocX library is a .NET library designed to simplify working with .docx files, offering an easier interface compared to the raw Open XML SDK. Despite its user-friendliness, DocX faces similar challenges to Open XML in handling auto-formatted numbering. It can manipulate and read document elements but does not render them as Word does. As with Open XML, numbers that are part of Word's automatic numbering system are not directly accessible as plain text using DocX.

4. Practical Implications:
For documents where numbering is crucial, relying on text extraction alone (either through Open XML or DocX) will likely miss these auto-generated numbers. Any solution attempting to replicate Word's rendering logic would need to handle various complexities and exceptions, leading to significant development effort.

5. Alternatives and Workarounds:
Manual Inclusion of Numbers: If maintaining numbering is essential, consider manually including numbers in the document's text.
Use of Word's Capabilities: Where possible, leverage Word itself (e.g., saving the document as a text file from Word) to retain formatting.
Third-Party Libraries: Some third-party libraries can render Word documents similarly to Word itself but may come with licensing costs and other considerations.

Conclusion:
Extracting auto-formatted numbering from Word documents using Open XML or DocX is inherently challenging due to the nature of Word's formatting and rendering system. Alternative approaches or adjustments to the project requirements may be necessary to achieve the desired outcomes.
 
The only viable solution I can see, without deep diving into MS Word XML definitions and encoding until I grow grass around my chair, is to convert the Word document into a plain text document, save it and then try to extract the information I need. I'll give that a crack and see how that goes. I'm not going to spend too much time on this at this point.
 
I'm surprised neither of the libraries implemented it. But yeah, I wouldn't get stuck on it either.

You'd have to look up the <w:numId/> tags from document.xml, then find the matching <w:num> tag in numbering.xml, which then points to a <w:abstractNum/>, which contains what kind of number formatting it is (e.g., "a, b, c" numbering is <w:numFmt w:val="lowerLetter"/>). And if you have multi-level numbering, you need to handle each possible level separately, since the delimiter between them is not always a period, the final number might not have a period after it, it might have a right bracket, e.g. "ii) foobar", etc. And that's just the formatting.

Then you'd have to enumerate every <w:numPr> tag in the document. Way too much work for the gain IMHO.
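
For what it's worth, the lookup chain on its own (numId, then <w:num>, then <w:abstractNum>, then the per-level format) might look roughly like this in Python with only the standard library. Treat it as a sketch under assumptions: the sample numbering.xml is hand-written and trimmed, the function names `level_formats` and `render` are invented here, only three number formats are handled, and things like <w:lvlOverride>, non-default <w:start> values and style-linked numbering are ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def wattr(elem, name):
    """Read a w:-prefixed attribute; returns None if the element is missing."""
    return None if elem is None else elem.get(f"{{{W}}}{name}")


def level_formats(numbering_xml, num_id):
    """Map ilvl -> (numFmt, lvlText) for the list with the given numId."""
    root = ET.fromstring(numbering_xml)
    # Step 1: <w:num w:numId="..."> points at an abstract definition.
    abstract_id = None
    for num in root.findall("w:num", NS):
        if wattr(num, "numId") == num_id:
            abstract_id = wattr(num.find("w:abstractNumId", NS), "val")
    # Step 2: the abstract definition carries one <w:lvl> per indent level.
    formats = {}
    for abstract in root.findall("w:abstractNum", NS):
        if abstract_id is None or wattr(abstract, "abstractNumId") != abstract_id:
            continue
        for lvl in abstract.findall("w:lvl", NS):
            fmt = wattr(lvl.find("w:numFmt", NS), "val")
            txt = wattr(lvl.find("w:lvlText", NS), "val")
            formats[int(wattr(lvl, "ilvl"))] = (fmt, txt)
    return formats


def to_roman(n):
    """Lower-case roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out


def render(count, num_fmt, lvl_text, ilvl=0):
    """Fill this level's %N placeholder; parent levels are not handled."""
    if num_fmt == "lowerLetter":
        label = chr(ord("a") + (count - 1) % 26)  # simplification past "z"
    elif num_fmt == "lowerRoman":
        label = to_roman(count)
    else:
        label = str(count)  # treat everything else as decimal
    return lvl_text.replace(f"%{ilvl + 1}", label)


SAMPLE_NUMBERING = f"""<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/><w:lvlText w:val="%1."/></w:lvl>
    <w:lvl w:ilvl="1"><w:numFmt w:val="lowerRoman"/><w:lvlText w:val="%2)"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
</w:numbering>"""

formats = level_formats(SAMPLE_NUMBERING, "1")
print(render(3, *formats[0]))          # 3.
print(render(2, *formats[1], ilvl=1))  # ii)
```

Even this stripped-down version is most of a page, which rather supports the point that doing it properly is a lot of work.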

If you only ever have to handle the use case of a) single level lists that are b) always decimal numbers, and c) you only ever have one single list in the document, then it becomes much easier.
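
Under exactly those three restrictions, a minimal end-to-end sketch could be as short as the following (again Python, standard library only). It relies on a .docx being a zip archive with the body in word/document.xml, and on table cells holding ordinary <w:p> paragraphs, so no table-specific handling is needed. `numbered_lines` and `make_demo_docx` are hypothetical names; the demo archive contains only document.xml, so it is enough for this reader but is not a file Word would open. Numbering applied via a paragraph style rather than a direct <w:numPr> would be missed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def numbered_lines(docx_file):
    """Return paragraph texts, prefixing "N. " to every numbered paragraph.

    Assumes one single-level decimal list in the whole document.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    count = 0
    lines = []
    for p in root.iter(f"{{{W}}}p"):  # includes paragraphs inside table cells
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if p.find("w:pPr/w:numPr", NS) is not None:
            count += 1
            text = f"{count}. {text}"
        lines.append(text)
    return lines


def make_demo_docx(texts):
    """Hypothetical helper: a bare one-column table, one numbered row per text."""
    rows = "".join(
        "<w:tr><w:tc><w:p><w:pPr><w:numPr>"
        '<w:ilvl w:val="0"/><w:numId w:val="1"/>'
        "</w:numPr></w:pPr>"
        f"<w:r><w:t>{t}</w:t></w:r></w:p></w:tc></w:tr>"
        for t in texts
    )
    xml = (
        f'<w:document xmlns:w="{W}"><w:body>'
        f"<w:tbl>{rows}</w:tbl></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


for line in numbered_lines(make_demo_docx(["First", "Second", "Third"])):
    print(line)
```

For a real file you would pass its path instead of the in-memory buffer, e.g. numbered_lines("input.docx"), where the file name is just a placeholder.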
 
Microsoft SharePoint or Microsoft Forms may be a better option.

Why are you using Word? It sounds like you're trying to build a web form, though.
 
I'm not sure what this is for. That has not been made clear yet. At this point this is more of a test-the-waters feasibility check. Still going to give it a hard crack though. I have an idea which my brother gave me. He's an engineer and has worked with some pretty weird requirements, and he gave me an out-of-the-box approach. I'll post results by end of today.
 