Create clean HTML in Word

Summary

Learn how to use Microsoft's Office HTML Filter to remove the extra tags that Word generates and create squeaky-clean HTML documents.

Although you can create clean HTML code to produce content for a Web site using just about any text editor, Microsoft Word doesn't do a very good job of producing efficient HTML. On the other hand, Word is very good for collaboration and is just about as universally accepted for document creation as the pen.

So what do you do if you want to create clean HTML but don't want to abandon Word and the useful things it does bring to the table? You can use Microsoft's Office HTML Filter to remove the extra tags that Word generates and create squeaky-clean HTML documents.

What's wrong with Word?

Microsoft Word does a great job as a word processor, but it's not very useful for creating HTML documents that you can quickly plug into a Web site. When you save a Word document as HTML, Word adds page-formatting tags that can make the document very large. These page-formatting tags may also cause content management programs and Web sites to behave unexpectedly.

Microsoft added the special tags to Word's HTML with an eye toward backward compatibility. Microsoft wanted you to be able to save files in HTML complete with all of the tracking, comments, formatting, and other special Word features found in traditional DOC files. If you save a file in HTML and then reload it in Word, theoretically you don't lose anything at all.

Unfortunately, when you then move a standard Word-generated HTML file to a Web site, bad things can sometimes happen. Formatting tags included in the Word file can conflict with settings on a Web server, causing the document to display incorrectly. Additionally, a browser may misinterpret the tags and display the file incorrectly. The HTML file also contains versioning and authoring information that you may not want to have appearing on a Web site.
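
To get a feel for how much of this markup a given file carries, a short script can count it. The following is a minimal Python sketch, not part of Microsoft's filter: the patterns cover markup that Word-generated HTML commonly contains (Office XML namespaces, conditional comments, mso- style properties, o:/v:/w: elements, and authoring META tags), and the sample string, including the author name, is made up for illustration.

```python
import re

# Patterns for markup that Word-generated HTML commonly carries.
# This list is illustrative, not exhaustive.
WORD_MARKERS = {
    "office namespaces": re.compile(r'xmlns:\w+="urn:schemas-microsoft-com:[^"]*"', re.I),
    "conditional comments": re.compile(r"<!--\[if [^\]]*\]>", re.I),
    "mso- style properties": re.compile(r"\bmso-[\w-]+\s*:", re.I),
    "office elements": re.compile(r"<[ovw]:\w+", re.I),
    "authoring meta tags": re.compile(r'<meta\s+name="?(?:Generator|Originator|ProgId)"?', re.I),
}


def count_word_markup(html: str) -> dict:
    """Return how many times each kind of Word-specific markup appears."""
    return {name: len(pattern.findall(html)) for name, pattern in WORD_MARKERS.items()}


# Made-up sample in the style of Word's "Save As Web Page" output.
SAMPLE = '''<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word">
<head>
<meta name=Generator content="Microsoft Word 10">
<!--[if gte mso 9]><xml><o:DocumentProperties><o:Author>Jane</o:Author>
</o:DocumentProperties></xml><![endif]-->
<style>p.MsoNormal {mso-style-parent:""; margin:0in; mso-pagination:widow-orphan;}</style>
</head>
<body lang=EN-US><p class=MsoNormal>Hello<o:p></o:p></p></body></html>'''

if __name__ == "__main__":
    for name, count in count_word_markup(SAMPLE).items():
        print(f"{name}: {count}")
```

Running the same function on a file before and after filtering gives a rough measure of how much clutter was removed.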

To save a Word document in HTML, select Save As Web Page from the File menu. Using this article as an example, Figure A shows the clutter that Word adds to an HTML document.

Figure A

Microsoft Word adds its own formatting information to HTML files.

Nearly the first 100 lines of the HTML file contained nothing but formatting information; actual content didn't appear until line 93. This complete article, saved in Microsoft Word's default HTML format, consumed 18 KB of space. As you can see, it's both large and inefficient.

Obtaining and using the filter

Both Word 2002/XP and Word 2003 include an option to save Filtered HTML, but the filtered versions still include a lot of clutter. Word 2000 doesn't include a Filtered HTML option at all. That's where the Office 2000 HTML Filter 2.0 comes in. This freeware utility, which you can download from Microsoft's Download Center, strips the excess formatting tags from Word-generated HTML files.

The file you'll download, msohtmf2.exe, is small (only 256 KB), so it will download very quickly. Save the file to a temporary location on your workstation. You'll install the filter using this file.

When you start the installation, you'll notice that it installs just like any other Windows program you've ever installed. There are no gotchas along the way; just follow the on-screen prompts.

After the installation is done, restart Microsoft Word to begin using the filter. The File menu now includes an Export To submenu with a Compact HTML choice. Open a Word document and save the file by clicking File | Export To | Compact HTML. When the Export To HTML As window appears, give the document a file name and click Save.

As you can see in Figure B, the resulting HTML code is somewhat cleaner. Also, the file size is reduced dramatically. Using the CompactHTML feature, the file size for this document went from 18 KB to 12 KB.

Figure B

The CompactHTML settings create cleaner HTML code.

Cleaning things up even more

Even though CompactHTML is an improvement, you can strip even more information out of the document by using the Office 2000 HTML Filter's actual utility. To start it, click Start | Programs | Microsoft Office Tools | Microsoft Office HTML Filter 2.0. When you do, you'll see the utility window shown in Figure C.

Figure C

You can create cleaner code by using the filter interactively.

The filter is very easy to use. Click Add, select the file you want to convert, and click Apply. You can convert multiple files by continually clicking Add and adding files before clicking Apply.

By default, the filter doesn't clean the HTML any better than the CompactHTML settings in Word do. However, you can customize the filter by clicking the Options button. When you do, you'll see the screen shown in Figure D.

Figure D

You can control filter settings.

Options you can control here include:

  • Delete Backups After Processing - The filter creates a backup copy of your original file before conversion so that you can revert to it if the conversion is not to your satisfaction. Selecting this checkbox deletes that backup once processing is complete.
  • Delete Non-Essential Linked Files - Selecting this checkbox removes any references to linked files in the document.
  • Remove Microsoft Office Native Markup - You can select this checkbox to remove all of the Word-related tags from the document.
  • Remove LANG Attributes - If you select this checkbox, the filter removes all language-related attributes, such as the lang=EN-US in <body lang=EN-US>.
  • Remove Non-Essential META Tags - Selecting this switch removes meta tag information that could confuse search engines, such as the name of the program you used to create the document.
  • Use VML For Displaying Graphics - This switch removes static images in the document.
  • Remove Standard CSS - This switch removes any Cascading Style Sheet information.
  • Remove All STYLE Elements - If you select this switch, then the filter will remove all STYLE references that are used by Cascading Style Sheets.
  • Remove Standard @Rule Constructs - This checkbox controls whether or not the document will include @rule definitions such as @font-face.
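
To make a few of these options concrete, here is a rough Python approximation of what Remove LANG Attributes, Remove Non-Essential META Tags, and Remove All STYLE Elements appear to do, based only on the descriptions above. It is a sketch using regular expressions, not the filter's actual implementation, and the function names and sample markup are hypothetical.

```python
import re

# Matches a lang attribute inside a tag: lang=EN-US, lang="EN-US", or lang='EN-US'.
LANG_ATTR = re.compile(r"""\s+lang=(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I)
# Matches META tags that name the authoring program (an assumed subset).
AUTHORING_META = re.compile(r'<meta\s+name="?(?:Generator|Originator|ProgId)"?[^>]*>\s*', re.I)
# Matches whole STYLE elements, including their contents.
STYLE_ELEMENT = re.compile(r"<style\b[^>]*>.*?</style>\s*", re.I | re.S)


def remove_lang_attributes(html: str) -> str:
    """Drop lang attributes, touching only text that sits inside tags."""
    return re.sub(r"<[^>]+>", lambda m: LANG_ATTR.sub("", m.group(0)), html)


def remove_authoring_meta(html: str) -> str:
    """Drop META tags that identify the program used to create the document."""
    return AUTHORING_META.sub("", html)


def remove_style_elements(html: str) -> str:
    """Drop every STYLE element."""
    return STYLE_ELEMENT.sub("", html)


def clean(html: str) -> str:
    """Apply the three approximations in sequence."""
    return remove_style_elements(remove_authoring_meta(remove_lang_attributes(html)))


# Made-up sample in the style of Word-generated HTML.
SAMPLE = (
    '<html><head><meta name=Generator content="Microsoft Word 10">'
    "<style>p {mso-pagination:none;}</style></head>"
    '<body lang=EN-US><p lang="EN-US">Hi</p></body></html>'
)

if __name__ == "__main__":
    print(clean(SAMPLE))
```

Regular expressions are adequate for a demonstration like this, but a real HTML parser is the safer choice for documents you don't control.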

I've found the best results by selecting all of the checkboxes except for Delete Backups After Processing and Use VML For Displaying Graphics. You should experiment to see which settings work best for your situation. Using these settings, the filter produced the HTML for this article as shown in Figure E.

Figure E

Here's how the filter created HTML for this article.

As you can see, the HTML is much cleaner. It's smaller too: The final converted article is only 10 KB in size.
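
For a sense of scale, the three file sizes reported in this article can be turned into percentage savings. This small Python sketch uses only the 18 KB, 12 KB, and 10 KB figures quoted above; the labels are shorthand for the three save methods.

```python
def percent_reduction(before_kb: float, after_kb: float) -> float:
    """Percentage by which the file shrank relative to its original size."""
    return (before_kb - after_kb) / before_kb * 100


# Sizes reported in the article for the same document.
SIZES_KB = {
    "Word default HTML": 18,
    "CompactHTML export": 12,
    "Filter with custom options": 10,
}

if __name__ == "__main__":
    baseline = SIZES_KB["Word default HTML"]
    for label, size in SIZES_KB.items():
        print(f"{label}: {size} KB ({percent_reduction(baseline, size):.0f}% smaller)")
```

In other words, CompactHTML trims about a third of the file, and the customized filter removes a little under half.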

Who needs a GUI?

The Office 2000 HTML Filter lets you convert files from the command line as well as from the GUI. To use it, open a command prompt. You'll use the filter command to filter your HTML files.

You don't need to worry about knowing where Filter is located. The Office 2000 HTML Filter setup program installs Filter.exe to the \Windows directory, so it's already in your path.

To convert a file, type filter file1.htm file2.htm and press [Enter], where file1 is the name of the source file and file2 is the name of the target filtered file. Filter includes switches that you can use to control just how much information is removed from the source file. To get a complete list of switches and how to use them, type filter /? and press [Enter].
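
Because the command takes one source file and one target file, converting a whole folder means running it once per file. The following Python sketch shows one way to script that, assuming filter is on your path as described above. It uses only the filter source target form given in this article and no switches; the folder names are hypothetical, and treating a nonzero exit code as a failure is an assumption about how Filter.exe behaves.

```python
import subprocess
from pathlib import Path
from typing import List


def build_filter_command(source: Path, target: Path) -> List[str]:
    """Build the 'filter source target' command described in the article."""
    return ["filter", str(source), str(target)]


def plan_batch(folder: Path, out_folder: Path) -> List[List[str]]:
    """One command per .htm file in folder, writing to out_folder under the same name."""
    return [
        build_filter_command(path, out_folder / path.name)
        for path in sorted(folder.glob("*.htm"))
    ]


def run_batch(folder: Path, out_folder: Path) -> None:
    """Run the filter over every .htm file in folder."""
    out_folder.mkdir(parents=True, exist_ok=True)
    for command in plan_batch(folder, out_folder):
        # check=True assumes Filter.exe signals failure with a nonzero exit code.
        subprocess.run(command, check=True)


# Example (hypothetical folder names):
# run_batch(Path("word_html"), Path("filtered_html"))
```

Keeping the command-building step separate from the step that runs it lets you print and inspect the planned commands before converting anything.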

Office 2000 HTML Filter caveats

Don't let the Office 2000 in the title discourage you if you use Word XP. The Office 2000 HTML Filter 2.0 works just as well with Word XP-generated HTML as it does with Word 2000 HTML. The problem is that the installer for the Office 2000 HTML Filter won't allow the program to install unless you have Office 2000 on your system.

You can get around the limitation by first installing the filter on a computer that already has Office 2000 on it. Then, copy these files from the Office 2000 workstation to your workstation:

  • MSFilter.exe
  • MSFilter.dll
  • Filter.exe

The DLL file is best placed in your C:\Windows\System32 directory, but you can also place all of the files into an OfficeFilter directory. Just create a shortcut to the MSFilter.exe file and you're ready to go.
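
If you need to repeat this on several workstations, the copy step can be scripted. The Python sketch below copies the three files listed above into a single OfficeFilter directory and reports any it could not find. The source and destination paths are placeholders: where the files were gathered from the Office 2000 machine depends on your setup.

```python
import shutil
from pathlib import Path
from typing import List

# The three files named in the article.
FILTER_FILES = ["MSFilter.exe", "MSFilter.dll", "Filter.exe"]


def copy_filter_files(source_dir: Path, dest_dir: Path) -> List[str]:
    """Copy the filter files into dest_dir and return the names that were missing."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in FILTER_FILES:
        source = source_dir / name
        if source.is_file():
            shutil.copy2(source, dest_dir / name)
        else:
            missing.append(name)
    return missing


# Example (hypothetical paths):
# missing = copy_filter_files(Path(r"E:\from_office2000"), Path(r"C:\OfficeFilter"))
# if missing:
#     print("Not found:", ", ".join(missing))
```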
