Preparing the manuscript for coding

Part 11: begin the conversion of your Word or Writer manuscript to an ebook in minutes, with regular expressions.
Word processors like Microsoft Word and Open Office Writer are a boon for writers, allowing quick and powerful formatting. For instance, if there’s something I know I’ll have to check after I’ve finished writing a draft, I can mark it on the fly by filling the text background in neon yellow with one drag and two clicks of the mouse.
However, this boon becomes a curse for the publisher. Invisible and persistent formatting code can carry across from Word or Writer to other programs, and affect formatting added there. Think of it as building a new house on the partially demolished foundations of an existing home – there can only be structural problems down the track.

Issues with autoconverters

This is the major concern with programs that purport to convert a word-processed manuscript into an ebook. Ebook forums are littered with panicked posts from authors who have found dramatic changes in the ebooks they uploaded to Amazon: entire pages of italics or bold text, tiny headings and huge margins, to name a few, all caused by remnant formatting from the original manuscript.
However, simply dropping the document into a text editor like Notepad++ will strip all preexisting formatting from it. The kicker is, though, that we need to insert the proper HTML tags to set the formatting we are stripping out! If you think that all this seems a bit like an episode of Yes, Minister, then you are right. However, if you wish to properly code an ebook which will look consistent across a variety of devices, then there’s no other way around it. The good news is that the process can be quick and painless, and should only take an hour or two for even the longest book.
There is no way that I can hope to cover the different permutations of how your manuscript has been created, compared to mine. For the record, I’m using Open Office 3.4.1. However, by using my actions as a template, hopefully you can google responses and solutions specific to the word processor you are using.

Fixing bad habits with regular expressions

There are a multitude of lazy shortcuts that people take with word processors. For instance, rather than set a proper ‘style’ for indents, I tend to use tabbed indents (that is, hit the TAB key with my left little finger for each new paragraph). That was fine, until I needed to delete them prior to HTML encoding, and I suddenly realised that I might have to delete a couple of hundred indents manually instead of just altering one style. (In fact, using styles and changing these globally, instead of paragraph by paragraph, is a lot like the reason we use the HTML head to control a HTML document’s body.)
Luckily, getting rid of these shortcuts and then inserting the correct HTML tags is made easy by using Writer’s find & replace tool alongside ‘regular expressions’. A regular expression is just shorthand code which represents various types of advanced search actions.
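To make ‘shorthand code’ a little more concrete, here is a minimal sketch in Python rather than Writer. The sample text is invented for illustration, and Python’s regular expressions are not identical to Writer’s, but the two symbols used here mean the same thing in both: ^ anchors the search to the start of a line (a paragraph, in Writer) and \t stands for a tab character.

```python
import re

# Invented sample: two paragraphs start with a tab, one does not.
manuscript = "\tIt was a dark night.\n\tThe rain fell.\nNo indent here.\n"

# "^" means "start of a line" (with re.MULTILINE), "\t" means "a tab".
# Together they find every tabbed indent in a single search.
tabbed_indents = re.findall(r"^\t", manuscript, flags=re.MULTILINE)

print(len(tabbed_indents))
```

One short pattern does the work of hunting down each indent by eye, which is the whole appeal.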

Regular expressions on Open Office 3.4.1

Note: in the following instructions, italicised words are actions rather than text to type, i.e. don’t write ‘blank’ in the field – leave it empty.
1. You have your finished Writer manuscript. Now, save it under a different name to indicate that you’re modifying it as HTML.
2. Remember that, if something goes wrong, you can use CTRL Z to undo the action.
3. Deleting tabbed indents, spaces before paragraphs and spaces after paragraphs (whitespace): CTRL F → more options → regular expressions = tick → search for = ^[ \t]+|[ \t]+$ → replace with = blank. Now, you can either replace them one by one with ‘replace’, or do them all globally with ‘replace all’. Go on. Be a devil.
4. If you just wish to delete:
a) tabbed indents: find (regular expression) \t → replace with = blank
b) leading whitespace: find (regular expression) ^[ \t]+ → replace with = blank
c) trailing whitespace: find (regular expression) [ \t]+$ → replace with = blank
5. Replacing double spaces with single spaces: CTRL F → search for = space space → replace with = space. Note that, depending on how many blocks of extra spaces appear in your manuscript, you may need to perform this search multiple times to eliminate them all (i.e. until the ‘Search key not found’ dialogue box appears).
6. Remove any blank paragraphs: you want one between section breaks contained within the same chapter and one between the chapters themselves, that’s it.
7. Leave any chapter and section headings uncoded. We’ll wrap these in header tags down the track.
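
If you’d like to see what the searches in steps 3 and 5 actually do, here is a rough Python sketch of the same clean-up. It is not part of the Writer workflow: it assumes the manuscript has been saved as plain text with one paragraph per line, and the function names are my own inventions for illustration. Note that Writer’s ^ and $ mark the start and end of a paragraph, whereas here (with re.MULTILINE) they mark the start and end of a line.

```python
import re


def strip_edge_whitespace(text: str) -> str:
    """Step 3: delete spaces and tabs at the start or end of every line."""
    return re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)


def collapse_double_spaces(text: str) -> str:
    """Step 5: swap two spaces for one, repeating until none are left."""
    while "  " in text:
        text = text.replace("  ", " ")
    return text


# Invented sample with a tabbed indent, trailing spaces and extra spaces.
sample = "\tFirst paragraph.  Two spaces here.   \n  Second   paragraph.\t\n"

cleaned = collapse_double_spaces(strip_edge_whitespace(sample))
print(repr(cleaned))
```

The loop in collapse_double_spaces is the code equivalent of clicking ‘replace all’ over and over until Writer reports that the search key was not found.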


Before you perform any of the actions below, skip forward and read the ‘Automatically replacing characters with entity codes’ section from the next article, then return here. There is a choice you must make. You can either (1) replace all the entity codes via the automatic method and manually code the italic and bold tags once the manuscript is in Notepad++, or (2) automatically code the italic and bold tags with AltSearch and then manually replace the entity codes once the manuscript is in Notepad++.
Which should you do? If you have italic and bold text all over the place, choose method (2). If you don’t, choose method (1).

Tagging italics and bold text

All <em>italic</em> and <strong>bold</strong> text in an ebook needs to be wrapped in proper HTML tags, as indicated, and this is one area where I struggled to find a good shortcut. It appears that there is a known bug in Writer that prevents tag wrapping via regular expressions.

AltSearch 1.3.2 on Open Office 3.4.1

A solution to this is to download and install the Writer plugin AltSearch. AltSearch 1.3.2 worked with my Open Office 3.4.1 install. This alternate find and replace is similar to Writer’s native tool, but more powerful. Once you have installed it and restarted Writer, you will notice a green binocular symbol in the toolbar. Click on this → properties dropdown = italic → replace = <em>$</em> → replace all.
To perform the same action for bold text, click on the binoculars → properties dropdown = bold → replace = <strong>$</strong> → replace all.
If, for some reason, you can’t get AltSearch to work, you can always code the bold and italics by hand. In that case:
Italics: CTRL F → more options → format → typeface = ‘italic’. Find each instance and wrap it in <em>italic tags</em>.
Bold: CTRL F → more options → format → typeface = ‘bold’. Find each instance and wrap it in <strong>bold tags</strong>.
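Whichever route you take, the end result is the same: each stretch of italic or bold text ends up between an opening and a closing tag. The Python sketch below shows that idea only. It is not how AltSearch works internally, and the Run type and function names are hypothetical, invented for this example. It assumes a paragraph can be described as a list of text runs, each flagged as italic, bold or neither.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """A stretch of text sharing the same formatting (hypothetical model)."""
    text: str
    italic: bool = False
    bold: bool = False


def wrap_run(run: Run) -> str:
    """Wrap one run in <em> and/or <strong> tags according to its flags."""
    html = run.text
    if run.italic:
        html = f"<em>{html}</em>"
    if run.bold:
        html = f"<strong>{html}</strong>"
    return html


def tag_paragraph(runs: list[Run]) -> str:
    """Join the tagged runs back into a single paragraph of text."""
    return "".join(wrap_run(run) for run in runs)


# Invented sample paragraph with one italic word and one bold word.
paragraph = [
    Run("She said it was "),
    Run("never", italic=True),
    Run(" going to happen. "),
    Run("Never.", bold=True),
]

print(tag_paragraph(paragraph))
```

Plain runs pass through untouched, so only the text you deliberately formatted picks up tags.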
Remember that it is very easy to overuse italics in fiction, and I can think of few reasons for the use of bold text. In fact, I use the process of tagging this formatting to remove a lot of it. Trust that your readers will understand how a character is emphasising dialogue, without resorting to italics.
You’re now ready to move your entire manuscript across to Notepad++ to finish coding the HTML, which I’ll cover in the next article.

Proceed to Part 12: ‘Preparing the HTML file, part 1’, or return to the article index.
While I’ve endeavoured to provide you with accurate information, what is considered ‘accurate’ will change over time. If I’m wrong, or you’d like to ask a question or share your thoughts, I’d love to hear your take on things.

About Rhys

Teacher, writer, editor, cook: a bit like that nursery rhyme, really.


  1. Hi Rhys,

    Very informative post. Question: if someone created their ebook in Google Docs, could they download it as a doc or docx, then open it in MS Word to tweak any mistakes found in Kindle Previewer? I know Sigil has a pretty good program, but it seems like the “find and replace” is going to work better for what I need to do to this (very long) ebook.

    Any pointers you can offer would be great!



    • G’day Lauren. I don’t see any problem with that. The ebook creation method I’ve outlined in this tutorial is the ‘long version’, where you’re stripping all word processor formatting, then adding it back via HTML tags. Note that this was written five years ago, so ebook conversion software will probably have improved in the meantime. At the time, people leaving Word and Writer formatting in their manuscripts were finding that this screwed with their ebook formatting when they uploaded it to Amazon. Perhaps it’s changed now, and Amazon have refined their ebook conversion software, which means you could fix your ms in Word and then just convert. But the ‘long version’ isn’t really that long when you learn to work with regular expressions in a word processor. So the long process is: (1) use regular expressions to wrap italics, bolded text, headings etc with HTML tags, (2) drop the whole thing into Notepad++ to strip out remaining word processor formatting and (3) use a converter to convert to the ebook format. Hope this helps!
