2009-09-11

How to convert Microsoft Word documents to clean HTML

I often want to take an MS Word document that someone has sent me and convert it to HTML so that I can distribute it to others via the web rather than emailing it around. MS Word has built-in HTML conversion, but the resulting HTML is so ugly, confusing, and bloated that it is very hard to work with if I want to make edits or tweak formatting.

Here is my current favorite method for converting Microsoft Word documents to clean HTML using Gmail:

  1. I email the document to my Gmail account.
  2. When I view the email in Gmail there is a "View as HTML" link provided for the attachment, which I click to see the Word document displayed in my web browser.
  3. On the next screen I click the View link in the upper left, which brings up a menu of viewing options, and choose Plain HTML.
  4. With the document showing in my browser, I view the page source (Firefox for Mac = View -> Page Source) to see the underlying HTML, which I select and copy into a new file in a text editor.

The HTML produced by Gmail is very old school (it doesn't use CSS and probably half the tags it uses are deprecated), but it is very compact, clean, and consistent, so you can quickly convert it to more modern HTML using search and replace.
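
That search-and-replace pass can also be scripted. Below is a minimal Python sketch of the idea. The tag lists and the sample string are my own assumptions for illustration, not a record of what Gmail actually emits, so adjust them to whatever tags show up in the source you copied.

```python
import re

# Hypothetical mapping of old-style tags to modern equivalents.
# Edit this to match the tags that actually appear in your file.
TAG_RENAMES = {"b": "strong", "i": "em"}

# Tags to drop entirely while keeping the text inside them (assumed examples).
TAGS_TO_UNWRAP = ["font", "center"]


def modernize(html: str) -> str:
    """Rename or strip old-style tags using plain regex search and replace."""
    for old, new in TAG_RENAMES.items():
        # Matches <b>, </b>, or <b attr="..."> but not <br> or <body>.
        # Any attributes on the old tag are discarded.
        pattern = re.compile(r"<(/?)" + old + r"(\s[^>]*)?>", re.IGNORECASE)
        html = pattern.sub(lambda m: "<" + m.group(1) + new + ">", html)
    for tag in TAGS_TO_UNWRAP:
        # Remove the opening and closing tag, leaving the contents in place.
        html = re.sub(r"</?" + tag + r"(\s[^>]*)?>", "", html, flags=re.IGNORECASE)
    return html


# Made-up sample input, only to show the effect.
sample = '<center><font size="4"><b>Title</b></font></center><br><i>note</i>'
print(modernize(sample))
```

Regular expressions are not a real HTML parser, so this only works because the input is compact and consistent. For messier markup a proper parser would be the safer choice.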