I have often observed that while many translators and project managers are skilled users of a number of sophisticated software tools, they sometimes lack some really simple skills in Word, for instance knowing how to find and replace tabs or paragraph and line markers…
“But why would we ever want to do that?” they might ask.
In this post we look at how such simple skills can be used to solve some awkward problems. As an illustration, we’ll look at how to drop a PDF file into Word (and from there into a TM tool, if required).
So – what’s the problem with PDF files?
Many translators are dismayed when they discover that the source text is in PDF format – and for good reason. Getting it into an editable format or getting it into a TM tool is not always straightforward. A quick search on Google will turn up a variety of different “PDF converters”. Some TM software tools will also convert PDFs into an editable format. However, in my experience, it is very rare that the converted text is without some mucky problems. Many translators just give up on trying to extract text from PDFs.
Alejandro Moreno-Ramos has the best possible solution:
However, if Alejandro’s method fails (and you really want editable text), then…
This is what you can do:
If you are able to select the text in a PDF with your mouse then you will be able to copy it and paste it directly into Word (if not, you should quickly abandon all hope!). Copying and pasting the text will not transfer the document properties (e.g. margins, columns etc), but you’ll get the text with most of its formatting properties (fonts, text size, bold and italics etc):
You’ve now got editable text… But whoa! In the illustration above, you can see that there is a paragraph marker at the end of every line! The text doesn’t wrap properly in the Word document.
This is not a disaster, though! It’s easy enough to get rid of the paragraph markers (as we’ll see) using simple find & replace. But this would make the whole document one huge paragraph. We need to retain one critical piece of information – where the real paragraphs start and end!
We need to get rid of the surplus paragraph markers (shown in the red circles below) – but we need to keep the ones marked in blue. These mark the end of the real paragraphs.
There are a few steps involved in doing this – but they are all very simple – the only skills required are to know how to copy, paste and find & replace! Here’s how to do it:
1 Get the text from the PDF into Word
Select the text in the PDF. Copy it and paste it into Word.
(Some care needs to be taken when selecting text in a PDF. You may find that the PDF won’t allow you to select paragraphs in the correct order. You may need to copy & paste several individual sections, one at a time, to ensure you get the text flowing in the right sequence.)
2 Make sure that you have Word’s “Show/Hide” button switched to “Show”.
Toggling this button to “Show” displays the document’s formatting marks (tabs, paragraph marks, picture anchors etc.). I’ve noted that many young translators (and some older ones too) try to work in Word with this button switched to “Hide”. The usual excuse is that seeing the formatting marks is aesthetically unappealing or distracting. My usual response is “Get over it!” (I don’t usually get a good reaction to that advice!) But my opinion is that working on a document with the formatting marks turned off is like groping around in a dark room – you unwittingly bump into, trip over and break things. Has your formatting ever gone unexpectedly haywire? Maybe you too like to keep this button in “Hide” mode? However, being able to see all the formatting marks helps you understand the structure of the document and lets you see if the original author has made any silly formatting mistakes. Walking blind into the client’s problems can spoil your day! Try it… It really doesn’t hurt (much)!
3 Identify where the real paragraphs end
Now, this requires a few minutes of manual work – it requires hitting the [Enter] key a few times on every page of the document to mark the end of each paragraph. If you really want editable text – it’s worth the small effort required.
Look for the spot where each paragraph ends, insert your cursor and then hit [Enter] to create an empty line. In many cases it is clear to the eye where each paragraph should end – but not always! So keep an eye on the original text. It only takes a few minutes to do this for an average-sized document.
Now you have double paragraph markers (aka “a blank line”) which indicate where the paragraphs are supposed to end:
4 Preserve these paragraph breaks
Our ultimate task is to get rid of all the surplus paragraph markers at the end of the lines. This is easy to do – we just replace them with spaces using Word’s Find & Replace function.
If we replace all the paragraph markers with spaces, we’ll lose the paragraph breaks we’ve just marked with a blank line! They’ll just turn into two consecutive spaces (there may be lots of other double spaces hiding in the document too!). So we need to temporarily mark the paragraph breaks with something else before we can get rid of the unnecessary markers at the end of each line. You can use pretty much any sequence of characters you like – you just need to be sure that whatever you use is unlikely to occur in the text. You might like to make up something like “@#$%” or some such. I always use “[para]” as a placeholder.
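Whether a placeholder is safe can be checked before you commit to it. The following is a minimal Python sketch, not a Word feature: it assumes the document text is available as a plain string, and the function name `first_safe_placeholder` and the list of candidates are made up purely for illustration.

```python
def first_safe_placeholder(text, candidates=("[para]", "@#$%", "[[para-break]]")):
    """Return the first candidate that does not occur anywhere in text."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("every candidate already occurs in the text")


# Hypothetical sample in which "[para]" is already used by the author.
sample = "Use [para] tags sparingly.\nThey confuse people."
print(first_safe_placeholder(sample))
# prints: @#$%
```

In Word itself, the equivalent check is simply to search for the placeholder first and confirm that nothing is found.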
The task now is to replace all instances of two consecutive paragraph markers with the temporary placeholder. Word uses the characters ^p to represent a paragraph marker (or ^p^p for two of them), so:
- Type “^p^p” into the Find what box; and
- Type “[para]” into the Replace with box; then
- Click Replace All:
This is how the document should change once the blank lines have gone:
5 Now get rid of all the redundant paragraph marks
We are now going to search for all the extra paragraph markers and replace them with spaces. (Look at the paragraph markers – if there is already a space in front of them, then you need to replace them with “nothing” – i.e. you leave the Replace with box blank.)
- Type ^p into the Find what box;
- Put your cursor into the Replace with box and hit the space bar. (If you don’t need spaces, use your mouse to select and delete any invisible spaces which might be lurking there); then
- Hit Replace All:
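The "space or nothing" decision in this step can also be made marker by marker rather than for the whole document. Here is a rough Python sketch of that idea, assuming plain text in which a newline stands in for Word's ^p; the function name `join_wrapped_lines` is hypothetical.

```python
import re


def join_wrapped_lines(text):
    """Replace each line break, plus any spaces before it, with one space."""
    # " *\n" also swallows spaces that already sit in front of the break,
    # so joining the lines never creates a double space at that point.
    return re.sub(r" *\n", " ", text)


print(join_wrapped_lines("first line \nsecond line\nthird line"))
# prints: first line second line third line
```

This is a variation on the step above, not a description of what Word's Replace All does: Word applies whatever is in the Replace with box to every match.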
Your document should now be a complete mess and look something like this (paragraph breaks highlighted):
6 Now reinstate the paragraph breaks
This is where the magic really happens and the mess instantly becomes a nicely formatted document. We now need to get rid of the temporary “[para]” placeholders and replace them with real paragraph breaks.
- Type “[para]” into the Find what box;
- Type “^p” into the Replace with box; then
- Hit Replace All:
A few final notes:
- Don’t try to do this with tables (the subject of a future post maybe).
- If you use a PDF-to-Word converter, then these same Find & Replace techniques can often be used to fix up poorly converted text.
- Rather than going through all these steps every time, they can be automated by recording a macro and putting a button to do the job on the toolbar. One click and the job is done! (Again this could be the subject of another post!)
- Qabiria.com have an excellent, detailed article on using PDF-to-Word converters here: http://bit.ly/9TqbGH
- The examples in this post were illustrated using Microsoft Word 2007 and Adobe Reader X.
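For readers who would rather script the clean-up than click through it, the three Find & Replace passes from steps 4 to 6 can be sketched in Python on plain text. This is only an illustration under stated assumptions, not a Word macro: a newline stands in for ^p, a blank line marks a real paragraph break, and the function name `unwrap_paragraphs` is invented for the example.

```python
PLACEHOLDER = "[para]"  # the same temporary marker used in the post


def unwrap_paragraphs(text, placeholder=PLACEHOLDER):
    """Mimic steps 4 to 6 on plain text, with a newline standing in for ^p."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    # Step 4: protect the real paragraph breaks (the blank lines).
    text = text.replace("\n\n", placeholder)
    # Step 5: turn every remaining line break into a space.
    text = text.replace("\n", " ")
    # Step 6: put real paragraph breaks back in place of the placeholder.
    return text.replace(placeholder, "\n")


pasted = "The quick brown\nfox jumps over\nthe lazy dog.\n\nA second\nparagraph."
print(unwrap_paragraphs(pasted))
# prints:
# The quick brown fox jumps over the lazy dog.
# A second paragraph.
```

The order of the passes matters for the same reason as in Word: if the single line breaks were replaced first, the blank lines marking the real paragraphs would be lost.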
A note on step 2: you can control which formatting marks are displayed when you switch the “Show” button on. Click Office Button|Word options|Display. Because translators usually work on documents that other people have created and formatted, I recommend that they select “Show all formatting marks” so that they can always see (and work around) formatting mistakes made by others.
A note on step 6: if you want to control paragraph spacing with a blank line, then type two paragraph markers (i.e. “^p^p”) into the Replace with box instead of one.