I was chatting with Danielle Gehrmann (@danielletrans), a friend and Italian-to-English translator from just ‘across the ditch’ (the Tasman Sea) in Sydney, and the conversation turned to translating website content…
Danielle: Sometimes my clients want their web pages translated, but I find dealing with these sorts of files can often be quite awkward.
Paul: You’re right, Danielle. A lot of translators find it very messy dealing with HTML files. Many aren’t sure how to handle them or what tools they need to use.
Danielle: Most often my needs are really simple. I usually don’t want to use any special tools or deal with HTML tags. Anyway, customers don’t always want me to do a full ‘localisation job’. Sometimes I just need to deliver translated web pages to the client as a nicely formatted Word document – but I’d like to have all the images in place and to present it with the same layout as the original web page. I’ve tried so many different copying and pasting techniques in Word and even in PowerPoint, but can’t get them to transfer well enough. Isn’t there a simple way I can get web pages into Word format?
Paul: Word is definitely not a good choice if you are working on files which are to be published on the web in HTML format. But if you’re sure that your customer is not expecting to receive an HTML file and is happy to receive the translation as a Word document, then there are certainly some ways you can try converting web pages into documents.
Danielle: When you say “try”, do you mean that it’s actually not simple?
Paul: Think of converting a web page into a word processing document as a bit like doing a translation – there’s not always an exact one-to-one correspondence you can rely on; web pages and word processing documents are built on different principles. But there are some tricks we can make use of.
Danielle: OK, let’s hear them.
Paul: First of all you can just type the URL of the website directly into Word’s “File|Open” dialogue box. When you click OK, it might take a while for Word to download the web page and then convert it into Word format.
Danielle: Yes, but the pictures in the document might still be linked to the web. If Word tries to download those images again from the web every time I open the document that might not be very convenient.
Paul: Rather than opening the URL directly in Word, you can save the web page as a separate file from your browser and then open that file from within Word. The best way to do this is to use a special Microsoft file type called “MHT”, which is available in Internet Explorer. Web pages are often made up of dozens of separate files. The special MHT format collects all the various components of the page (all the words, pictures and formatting) and puts them all together into one single file. This format is really good if you want to email a complete web page to someone. Because Internet Explorer and Word are Microsoft products, they both ‘understand’ this format. Word will recognise this file format and will know how to convert it to its own format.
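An aside for the technically curious: an MHT file is, under the hood, a MIME multipart message (the same kind of packaging used for email attachments), which is why it can hold the page text and its pictures in one file. The Python sketch below is only an illustration, not something you need for the workflow described here. It uses Python’s standard email module to list what is bundled inside an archive; the function name list_mht_parts and the tiny SAMPLE archive (with placeholder bytes standing in for a picture) are made up for this example.

```python
import email

# A tiny hand-made MHT archive, used only for illustration.
# The image bytes are a placeholder, not a real picture.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----=_Part_0"\r\n'
    b"\r\n"
    b"------=_Part_0\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"Content-Location: http://example.com/page.html\r\n"
    b"\r\n"
    b'<html><body><p>Hello</p><img src="http://example.com/pic.gif"></body></html>\r\n'
    b"------=_Part_0\r\n"
    b"Content-Type: image/gif\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/pic.gif\r\n"
    b"\r\n"
    b"R0lGODlhAQABAAAAACw=\r\n"
    b"------=_Part_0--\r\n"
)


def list_mht_parts(raw_bytes):
    """Return (content type, original location, size in bytes) for each bundled part."""
    message = email.message_from_bytes(raw_bytes)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the outer container, keep only the real components
        payload = part.get_payload(decode=True) or b""
        location = part.get("Content-Location", "")
        parts.append((part.get_content_type(), location, len(payload)))
    return parts


# Show every component packed into the archive.
for content_type, location, size in list_mht_parts(SAMPLE):
    print(content_type, location, size)
```

To inspect a real archive you would read the saved file as bytes, for example with open("page.mht", "rb"), and pass its contents to the same function (the filename here is hypothetical).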
Danielle: Oh dear! I’ve got an Apple Macintosh, so I don’t use Internet Explorer – I use Firefox or Safari. So will this work for me?
Paul: Hmmm… Safari and Firefox on the Mac don’t recognise Microsoft’s special MHT file type, so we can’t use these browsers to save the file in this format. The Opera browser recognises MHT though. Let’s do a bit of experimenting and see what works.
Danielle: OK, I’m up for it. Here’s a sample web page we can use to play with: http://news.bbc.co.uk/2/hi/asia-pacific/8011846.stm
Paul: OK, I’ve opened it in Internet Explorer and saved it in MHT or “Web archive” format. I’m sending it to you as an attachment in an email. See if you can open it in Word on your Mac.
Danielle: Got it! It opens just fine. I can see that it pretty much has the same layout as the original web page. I could certainly work with that, overtyping the text with my translation or using Wordfast perhaps.
Here’s how the original web page looks in a browser:
Here’s how it looks after converting it to Word format:
Paul: Now before you start working on the file, you should go to “File|Save As”… and save it as a proper Word document. There may be a bit of manual tidying up to do as well. Click the “Show/Hide” formatting button (the ¶ symbol) on Word’s toolbar.
Danielle: OK. I can see some unusual formatting symbols in the document. What are they?
Paul: These little graphics represent some invisible programming elements which were in the original web page. You don’t need them in your Word document. Just select them with your mouse and delete them.
Danielle: OK. Now, I’ve just tried saving the web page in “Web archive” (MHT) format in Opera for Mac. I’ll open it in Word and see how it looks.
Paul: Did it work?
Danielle: Hmmm, no. I’ve got all the words, but all the images are missing. Still, it’s good to highlight the complications: converting files doesn’t always work first go, and that includes the one we’re demonstrating.
Paul: OK. Trick No. 1 for Mac users is to have a friend with Internet Explorer who will save the page in MHT format for you…
Danielle: Yes, or I guess you could ask the client to save the page for you in the right format. But I’d still like to be able to make the conversion myself without bothering the client with such technicalities, if it’s something that I can sort out myself.
Paul: OK, then let’s try a different method. We’ll try saving the web page as a PDF file first and then try converting the PDF to Word.
Danielle: Let’s try that. Do I need a web page to PDF converter?
Paul: Yes, but there are plenty of free ones and you can do it easily online. You can just Google “convert web page to PDF” to find one. Try this one: www.web2pdfconvert.com
Danielle: Got it. I just need to paste in the URL I want to convert. OK, it’s converting… It’s done and I’ve saved it to my hard disk. I’ve opened it as a PDF – and it looks good. What’s next?
Paul: Now we need to convert that PDF to a Word document. I can see that the web2pdf application you’ve just used also has a PDF-to-Word converter – there’s a link at the top of the page. Try it.
Danielle: Yes, I can see it. OK, I’ve uploaded the PDF file. It’s converting… It seems to be taking a long time…
Paul: Ha, ha! As I said, it’s like doing a language translation – it’s more complicated than most people would think!
Danielle: OK, it’s done. I’ve saved it to my hard drive and am opening it now in Word.
Paul: How does it look?
Danielle: The conversion is not bad, but it isn’t great. The headings in bold on the web page will need to be rebolded in the converted document. It’s not as good as the conversion from the MHT file that you saved in Internet Explorer – but maybe I’m expecting too much. I guess there might always be some tweaking needed.
Paul: Probably. Even the trick of using an MHT file doesn’t always work perfectly.
Danielle: Any more ideas?
Paul: I know you’ve tried copying and pasting the website directly into your Word document before – but with variable results… You can sometimes improve the success of this approach by not selecting the whole web page, but by copying smaller sections of the web page and pasting them one by one into your document. But this isn’t a perfect method either and will depend on how complex the HTML formatting is.
Danielle: Of course, I can also make a PDF and then try your method for manually converting a PDF to a Word document.
Paul: Yes. Well, let’s put what we’ve discovered online – maybe someone else knows of some other better ways worth trying too!
 You can find a useful discussion on translating HTML files for beginners here: http://www.your-translations.com/html-for-translator_1.php. The author covers how HTML works and what tools (including free ones) are available for translators (6 pages). A basic tutorial on HTML for translators is here: http://www.translatorscafe.com/cafe/html_tutorial.asp?pn=hg_toc.html
 For files which are to be published on the web in HTML format, better choices than Word range from a simple text editor such as Notepad to more sophisticated CAT tools. A comprehensive list of the various available CAT tools is given here: http://en.wikipedia.org/wiki/Computer-assisted_translation and here http://www.i18nguy.com/TranslationTools.html. If you really want to come to terms with CAT tools, you might want to consider buying The Translator’s Tool Box (e-book), by Jost Zetzsche.
 The University of Aberdeen has a web page with some other techniques for converting web pages to Word you might like to try: http://www.abdn.ac.uk/webpack/factsheet5.shtml