Part 11: begin the conversion of your Word or Writer manuscript to an ebook in minutes, with regular expressions.
Word processors like Microsoft Word and Open Office Writer are a boon for writers, allowing quick and powerful formatting. For instance, if there’s something I know I’ll have to check after I’ve finished writing a draft, I can mark it on the fly by filling the text background in neon yellow with one drag and two clicks of the mouse.
However, this boon becomes a curse for the publisher. Invisible and persistent formatting code can carry across from Word or Writer to other programs, and affect formatting added there. Think of it as building a new house on the partially demolished foundations of an existing home – there can only be structural problems down the track.
Issues with autoconverters
This is the major concern with programs that purport to convert a word-processed manuscript into an ebook. Ebook forums are littered with panicked posts by authors who have found dramatic changes in ebooks they have uploaded to Amazon: entire pages of italics or bold text, tiny headings and huge margins, to name a few, all caused by remnant formatting from the original manuscript.
However, simply dropping the document into a text editor like Notepad++ will strip all preexisting formatting from it. The kicker is, though, that we need to insert the proper HTML tags to set the formatting we are stripping out! If you think that all this seems a bit like an episode of Yes, Minister, then you are right. However, if you wish to properly code an ebook which will look consistent across a variety of devices, then there’s no other way around it. The good news is that the process can be quick and painless, and should only take an hour or two for even the longest book.
There is no way that I can hope to cover the different permutations of how your manuscript has been created, compared to mine. For the record, I’m using Open Office 3.4.1. However, by using my actions as a template, hopefully you can google responses and solutions specific to the word processor you are using.
Fixing bad habits with regular expressions
There are a multitude of lazy shortcuts that people take with word processors. For instance, rather than set a proper ‘style’ for indents, I tend to use tabbed indents (that is, hit the TAB key with my left little finger for each new paragraph). That was fine, until I needed to delete them prior to HTML encoding, and I suddenly realised that I might have to delete a couple of hundred indents manually instead of just altering one style. (In fact, using styles and changing these globally, instead of paragraph by paragraph, is a lot like the reason we use the head of an HTML document to control its body.)
Luckily, getting rid of these shortcuts and then inserting the correct HTML tags is made easy by using Writer’s find & replace tool alongside ‘regular expressions’. A regular expression is just shorthand code which represents various types of advanced search actions.
Note for the following instructions, italicised words = actions, i.e. don’t write ‘blank’ in the field – leave it empty.
1. You have your finished Writer manuscript. Now, save it under a different name to indicate that you’re modifying it as HTML.
2. Remember that, if something goes wrong, you can use CTRL Z to undo the action.
3. Deleting tabbed indents, spaces before paragraphs and spaces after paragraphs (whitespace): CTRL F → more options → regular expressions = tick → search for =
^[ \t]+|[ \t]+$ → replace with = blank. Now, you can either replace them one by one with ‘replace’, or do them all globally with ‘replace all’. Go on. Be a devil.
4. If you just wish to delete:
a) tabbed indents: find (regular expression)
\t → replace with = blank
b) leading whitespace: find (regular expression)
^[ \t]+ → replace with = blank
c) trailing whitespace: find (regular expression)
[ \t]+$ → replace with = blank
5. Replacing double spaces with single spaces: CTRL F → search for = space space → replace with = space. Note that, depending on how many blocks of extra spaces appear in your manuscript, you may need to perform this search multiple times to eliminate them all (i.e. until the ‘Search key not found’ dialogue box appears).
6. Remove any blank paragraphs: you want one between section breaks contained within the same chapter and one between the chapters themselves, and that’s it.
7. Leave any chapter and section headings uncoded. We’ll wrap these in header tags down the track.
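If you’d like to see what the searches in steps 3 and 5 actually do, here is a minimal Python sketch of the same idea. It is only an illustration, assuming the manuscript has been saved as plain text with one paragraph per line; the function names and the sample text are made up for this example and are not part of Writer.

```python
import re


def strip_edge_whitespace(text: str) -> str:
    """Delete tabs and spaces at the start and end of every line (step 3)."""
    # Same pattern as in step 3. re.MULTILINE makes ^ and $ match at the
    # start and end of each line rather than only at the ends of the text.
    return re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)


def collapse_double_spaces(text: str) -> str:
    """Replace double spaces with single spaces until none remain (step 5)."""
    # Mirrors re-running the search until nothing more is found.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


# Hypothetical sample: a tabbed indent, trailing spaces, and a double space.
sample = "\tChapter one began badly.  \n   It got  worse.\t\n"
cleaned = collapse_double_spaces(strip_edge_whitespace(sample))
print(repr(cleaned))
```

The loop copies the ‘search again until nothing is found’ routine from step 5. In Python, the pattern ` {2,}` (a space followed by {2,}) would collapse each run of spaces in a single pass instead.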
Before you perform any of the actions below, skip forward and read the ‘Automatically replacing characters with entity codes’ section from the next article, then return here. There is a choice you must make. You can either (1) replace all the entity codes via the automatic method and manually code the italic and bold tags once the manuscript is in Notepad++, or (2) automatically code the italic and bold tags with AltSearch and then manually replace the entity codes once the manuscript is in Notepad++.
Which should you do? If you have italic and bold text all over the place, choose method (2). If you don’t, choose method (1).
Tagging italics and bold text
All <em>italic</em> and <strong>bold</strong> text in an ebook needs to be wrapped in proper HTML tags, as indicated, and this is one area where I struggled to find a good shortcut. It appears that there is a known bug in Writer that prevents tag wrapping via regular expressions.
A solution to this is to download and install the Writer plugin AltSearch. AltSearch 1.3.2 worked with my Open Office 3.4.1 install. This alternate find and replace is similar to Writer’s native tool, but more powerful. Once you have installed it and restarted Writer, you will notice a green binocular symbol in the toolbar. Click on this → properties dropdown = italic → replace =
<em>$</em> → replace all.
To perform the same action for bold text, click on the binoculars → properties dropdown = bold → replace =
<strong>$</strong> → replace all.
If, for some reason, you can’t get AltSearch to work, you can always code the bold and italics by hand. In that case:
Italics: CTRL F → more options → format → typeface = ‘italic’. Find each instance and wrap it in <em>italic tags</em>.
Bold: CTRL F → more options → format → typeface = ‘bold’. Find each instance and wrap it in <strong>bold tags</strong>.
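Whichever route you take, the end result is the same: each stretch of italic or bold text ends up sitting between an opening and a closing tag. The Python sketch below illustrates that outcome only. It does not read a Writer file or use AltSearch; the Run type and the sample sentence are hypothetical stand-ins for text that has already been split up according to its formatting.

```python
from typing import Iterable, NamedTuple


class Run(NamedTuple):
    """A hypothetical stretch of text with a single formatting state."""
    text: str
    italic: bool = False
    bold: bool = False


def tag_run(run: Run) -> str:
    """Wrap one run in <em> and/or <strong> tags, as its formatting requires."""
    out = run.text
    if run.italic:
        out = f"<em>{out}</em>"
    if run.bold:
        out = f"<strong>{out}</strong>"
    return out


def tag_runs(runs: Iterable[Run]) -> str:
    """Join the tagged runs back into one line of text."""
    return "".join(tag_run(run) for run in runs)


# Made-up example sentence with one italic word and one bold word.
runs = [
    Run("She was "),
    Run("never", italic=True),
    Run(" going back. "),
    Run("Ever.", bold=True),
]
tagged = tag_runs(runs)
print(tagged)
```

Plain runs pass through untouched, so only the text that was actually formatted picks up tags.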
Remember that it is very easy to overuse italics in fiction, and I can think of few reasons for the use of bold text. In fact, I use the process of tagging this formatting to remove a lot of it. Trust that your readers will understand how a character is emphasising dialogue, without resorting to italics.
You’re now ready to move your entire manuscript across to Notepad++ to finish coding the HTML, which I’ll cover in the next article.
Proceed to Part 12: ‘Preparing the HTML file, part 1’, or return to the article index.
While I’ve endeavoured to provide you with accurate information, what is considered ‘accurate’ will change over time. If I’m wrong, or you’d like to ask a question or share your thoughts, I’d love to hear your take on things.