🔴 The Problem (Observed Failure)
Pasting directly from Microsoft Word or Google Docs into a web editor (WordPress, Strapi, Custom CMS) results in “Dirty HTML”. The output is bloated with proprietary XML namespaces and non-semantic tags.
<!-- The infamous "Mso" junk code -->
<p class="MsoNormal" style="margin-bottom:0in;line-height:normal;mso-outline-level:1">
  <span style="font-size:14.0pt;font-family:'Segoe UI',sans-serif;mso-fareast-font-family:'Times New Roman';color:#222222">
    Technical Specification<o:p></o:p>
  </span>
</p>
The bloat ratio is often 10:1: a 1 KB document becomes 10 KB of markup, causing CSS conflicts, accessibility issues, and an oversized DOM.
❌ What Did NOT Work
- Standard ‘Paste as Plain Text’: Strips everything, including essential bold, italic, and link formatting, requiring hours of manual re-formatting.
- Word’s ‘Save as Web Page’: Even the “Filtered” export leaves behind v:shape, mso- attributes, and proprietary meta tags.
- Simple Regex: replace(/class=".*?"/g, '') often leaves empty tags like <span></span>, which still clutter the DOM and break rendering.
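The regex limitation is easy to reproduce in any JavaScript console. The snippet below is a minimal sketch; the sample markup and the variable names (pasted, stripped, emptySpans) are made up for illustration and are not taken from a real Word export.

```javascript
// Hypothetical pasted fragment: one styled span with text, one empty styled span.
const pasted = '<p class="MsoNormal"><span class="a">Spec</span><span class="b"></span></p>';

// The "simple regex" approach: delete every class attribute.
const stripped = pasted.replace(/class=".*?"/g, '');

// Count the spans that are now attribute-less and empty.
const emptySpans = (stripped.match(/<span\s*><\/span>/g) || []).length;

console.log(stripped);   // <p ><span >Spec</span><span ></span></p>
console.log(emptySpans); // 1
```

The attributes are gone, but every wrapper element is still in the markup, including the empty one. Removing elements (not just attributes) needs a tree-aware pass.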
✅ The Fix (Algorithm-Based Cleaning)
To solve this deterministically, you need a DOM parser that traverses the tree and applies an allowlist strategy.
1. Identify and Strip Proprietary Namespaces
A proper converter must target the mso- prefix and specific XML schemas. If you are building a custom pipeline, ensure you strip:
- mso-content-provider, mso-font-kerning, mso-ansi-language
- style attributes (unless specifically allowed)
- class and id attributes that don’t match your design system.
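If you decide to keep some inline styles, one possible way to drop only the mso- declarations is to filter the style text declaration by declaration. This is a sketch, not part of any particular tool: stripMsoDeclarations is a hypothetical helper name, and it assumes that style values never contain a ';' character.

```javascript
// Drop every CSS declaration whose property name starts with "mso-".
// Assumption: values contain no ';' characters, so a plain split is enough here.
function stripMsoDeclarations(styleText) {
  return styleText
    .split(';')
    .map(decl => decl.trim())
    .filter(decl => decl !== '' && !decl.toLowerCase().startsWith('mso-'))
    .join('; ');
}

const style = "font-size:14.0pt;mso-fareast-font-family:'Times New Roman';color:#222222";
console.log(stripMsoDeclarations(style)); // font-size:14.0pt; color:#222222
```

A style attribute that contains only mso- declarations comes back as an empty string, which is a signal that the whole attribute can be removed.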
2. Using IZHubs HTML Cleaner
Our HTML Cleaner uses a recursive DOM tree traversal to:
- Flatten Nested Spans: Removes redundant wrappers while keeping the text.
- Map Semantic Tags: Automatically converts Word “Heading” classes into real <h1>-<h6> tags.
- Attribute Sanitization: Removes all inline CSS while preserving <strong>, <em>, and <a> (with href).
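// Logic Example: Recursive Attribute Stripping
function cleanNode(node) {
  // Attributes that survive cleaning; everything else (style, class, id, ...) is removed.
  const allowedAttrs = ['href', 'src', 'alt', 'target'];
  if (node.attributes) {
    // Copy to an array first: removing from the live NamedNodeMap while iterating skips entries.
    Array.from(node.attributes).forEach(attr => {
      if (!allowedAttrs.includes(attr.name)) {
        node.removeAttribute(attr.name);
      }
    });
  }
  // Recurse into child nodes so the whole subtree is cleaned, not just the root.
  Array.from(node.childNodes || []).forEach(cleanNode);
}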
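The "Flatten Nested Spans" step can be illustrated in the same spirit. The sketch below is not the IZHubs implementation. To stay runnable outside a browser it works on a hypothetical simplified tree, where text nodes are plain strings and elements are { tag, attrs, children } objects, and flattenSpans is an illustrative name.

```javascript
// Hypothetical simplified tree: text nodes are strings, elements are
// { tag, attrs, children }. This stands in for a real DOM.
function flattenSpans(node) {
  if (typeof node === 'string') return [node];
  // Flatten the subtree first, so nested wrappers collapse bottom-up.
  const children = node.children.flatMap(flattenSpans);
  const isBareSpan = node.tag === 'span' && Object.keys(node.attrs).length === 0;
  // A bare span contributes only its flattened children; an empty one vanishes.
  return isBareSpan ? children : [{ ...node, children }];
}

const tree = {
  tag: 'p', attrs: {}, children: [
    { tag: 'span', attrs: {}, children: [
      { tag: 'span', attrs: {}, children: ['Technical Specification'] },
      { tag: 'span', attrs: {}, children: [] },
    ] },
  ],
};

const [cleaned] = flattenSpans(tree);
console.log(JSON.stringify(cleaned));
// {"tag":"p","attrs":{},"children":["Technical Specification"]}
```

Running this after attribute stripping matters: once a span has lost its style and class, it carries no information and can be replaced by its children.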
⚠️ Edge Cases & Trade-offs
- Complex Tables: Word tables use absolute widths (pt). Our tool flattens these to standard <table> tags, but complex merged cells may require manual review.
- Image Hosting: Word exports local paths (file:///C:/Users/...). You MUST upload images to your CMS separately; the cleaner will strip the broken <img> tags to prevent 404s.
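If you are building your own pipeline, a check along these lines can flag images that still point at the author's disk. It is only a sketch: isLocalImageSrc is a hypothetical helper, the sample paths are invented, and "local" is assumed to mean a file: URL or a Windows drive path.

```javascript
// Returns true when an <img> src points at the author's local disk.
// Assumption: "local" means a file: URL or a Windows drive path such as C:\...
function isLocalImageSrc(src) {
  const value = src.trim();
  return /^file:/i.test(value) || /^[a-z]:[\\/]/i.test(value);
}

// Hypothetical sample values, for illustration only.
console.log(isLocalImageSrc('file:///C:/Users/example/image001.png')); // true
console.log(isLocalImageSrc('https://example.com/image001.png'));      // false
```

Images that fail this check can be removed or queued for upload to the CMS before the content is published.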
🛠 Related Tool
- IZHubs HTML Cleaner: The fastest way to convert Google Docs or Word files to clean, production-ready HTML5.