Words 2

Sunday, October 21st, 2012

A few months ago I posted a musing about text on computers, and about document editing, called Words. It dragged on for so long that I felt compassion on all y’all readers, and stopped half-way in. So here, at long last, is the concluding installment.

Be bold

At the end of Words, our intrepid author could write and edit “plain text” content using a computer. (For grammatical simplicity, I’m going to pretend our writer is named Paul West. For more about the Paul West who inspired this choice, see this Writer’s Almanac entry.)

Now let us suppose that Paul has tired of plain text, and is ready to introduce bold thoughts, big headings, small footnotes, and italian emphasis. (I suppose that’s a reasonable explanation for italics, at least for this posting.)

Of course, those options have been available to book, magazine, and broadsheet printers, with their metal type elements, for centuries. And almost immediately editors developed ways to mark up hand-written and typed text to indicate to the typesetter how the text should be formatted. (For some entertaining editor marks I found in my googling, see this page.) And computer program coders often used special characters as a way to *emphasize* words or to _highlight_ text.

So the original approach was to use a visible “mark-up” language, which turned out to be a robust approach, and a solid foundation. We are somewhat familiar with “HTML”, that odd but well-known abbreviation for the language in which world-wide-web pages are written (not to be confused with “http://”, the collection of characters that begins the uniform resource locator, or URL, for a web page). Well, the “ML” stands for “mark-up language.” (The “HT”, by the way, stands for “hypertext.”)

Don’t run off

Apparently one of the first recorded uses of a mark-up language hearkens back to the sixties, when the RUNOFF program formatted marked-up files on the Compatible Time-Sharing System (CTSS), the predecessor of the Multics operating system. Although Multics was written decades ago, it offered considerably stronger security than most systems commercially available even as late as 2002. (For a little more on Multics, see this Wikipedia entry, and for an even deeper dive into CTSS, check out this “Multicians” site.) Apparently, in one of the earliest versions of Multics, if your program had too many errors in it, the system printed an ASCII image of Mad Magazine’s Alfred E. Neuman on your output page. And for those of you interested in the recent cloud computing trend, take note that the Multics operating system introduced the concept of virtual machines before most of you were born. (Yes, I am making assumptions about the average age of my readers, but I feel like I’m on firm ground.)

Multics also sired another durable operating system, Bell Labs’ UNIX, which is still at the heart of Apple’s OS X, whose releases are famously named for members of the family Felidae. Unix birthed a series of RUNOFF-based formatting programs called roff, eroff, nroff, and troff, leading eventually (but not alphabetically) to groff, a GNU-licensed version still in use. These programs allowed our writer Paul to insert formatting commands between his characters and lines of text to indicate how the content should be formatted for printing.

Many excellent documents were written using the *roffs, including the under-appreciated Unix manual (“man”) pages. If you open a Mac Terminal window today and type “man groff” you’ll enjoy the meta-experience of reading the GNU roff manual page, as formatted by groff. And the LaTeX format is still widely used in many academic settings because of its reliability and flexibility.

No, that’s not cursing, it’s formatting

In practice, however, the use and adoption of mark-up languages by “non-professionals” was very limited, due to three significant shortcomings. First, the commands were obscurely terse, with arcane grammar and punctuation. To address this problem for non-geek users, programmers created “macro” scripts, which combined the necessary commands to rapidly format selected document types: a manuscript, a technical document, and, yes, a “man” page. Macros were slightly less daunting, but they were still often perplexing. Second, without additional fiddling, all of the documents of a given type looked alike. This was good for academic and professional consistency, but bad for creativity. And third, you couldn’t tell what the finished product would look like until you printed it out. This was partly due to the nature of the line- and character-based terminals used in the day, and partly because of the nature of a markup language. (For a first-hand example, look through your browser’s menu until you find “view source”, then take a look at the HTML source of a web page. For a meta homework assignment, find this sentence in the page source code. I just noticed that viewing HTML may be difficult on a newer browser. Apparently “view source” is considered pedantic. Sigh.)

Toward the mid-80s, graphics-capable computer systems finally became available (and “affordable”). These computers used hardware and software that could display and print bit-mapped content, which means that they could render fonts and graphics very similar to typeset content. This enabled the development of editing systems described as WYSIWYG (What You See Is What You Get) writing and editing programs. The Radio Shack TRS-80 I used about this time at Superior Steel included a very functional document editor, a spreadsheet, a database manager, and a terminal program. It was admittedly a little weak on drawing, since there were very few affordable printers that could produce graphical output (although we did have an 8-color, letter-sized plotter that I could create diagrams on. It was painful).

Don’t tell me, show me

Armed with these early document-creation programs, our writer Paul can now use special key strokes, menu options, and (eventually) a “mouse”, to select and format characters, words, lines, paragraphs, and sections. These programs became known as word processing programs, and some early versions such as WordPerfect even allowed the user to make the formatting codes visible. This gave great power to the editor, who could see exactly what formatting codes were being used. Today’s programs, such as MS Word, provide a huge set of editing and review functions, but they also hide the more complex codes from the user. This often results in a frustrating confusion of overlapping and unintended formatting commands, especially with the addition of multiple, obscure, “automatic” numbering and formatting options, which almost do what you want them to do.

Early developers of word processing software made up their own codes, resulting in document formats which were proprietary and non-interchangeable. Eventually, thanks to its inclusion with every Microsoft operating system, Word became the de facto standard, although there are still users who are devoted to other software like WordPerfect, and newer programs such as Apple’s Pages. My personal favorite is still FrameMaker. I have used FM for the creation of documents for over 20 years despite many of Adobe’s business decisions, which, although presumably necessary, have significantly reduced FrameMaker’s availability to me. FM is a professional-grade product which teams still use to create large documents, and while it is a little difficult to learn, it is almost totally free of surprises – it simply does exactly what you ask of it.

Returning to the evolution of document processing, I must acknowledge the Microsoft Office suite, which is built on the strength of Word and bolstered by Excel, PowerPoint, and Outlook. MS Office continues to dominate consumer-quality software, and is still included with most Windows-based computer systems. In fact, most computer users today utilize the MS Word editing format for document distribution, but this is not really a great option, for reasons which I will mention later. Note also that the most recent Office formats, such as those with the .docx extension, are actually zipped collections of eXtensible Mark-up Language (XML) files, which are, at least technically, human-readable. It’s not quite as accessible as WordPerfect’s old View Codes option, but it’s somethin’.
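
If you want to peek under the hood of a .docx file yourself, here is a minimal Python sketch. It assumes the file is an ordinary ZIP archive whose main text lives in a part named word/document.xml, which is the usual layout for files saved by Word. The function names are my own inventions for illustration, not part of any Microsoft tool.

```python
import sys
import zipfile


def list_docx_parts(path):
    """Return the names of the files packed inside a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_docx_part(path, part="word/document.xml"):
    """Return one part of the archive as text (assumed to be UTF-8 XML)."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part).decode("utf-8")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of any .docx file on the command line.
    for name in list_docx_parts(sys.argv[1]):
        print(name)
    # Show only the start of the body; the raw XML is long and dense.
    print(read_docx_part(sys.argv[1])[:500])
```

The XML you get back is readable in the same sense that a tax form is readable, but the formatting codes really are all there.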

Fortunately, the open-source movement, with the significant support of Sun Microsystems, and now Oracle America, has produced and maintained the Open Office suite of programs which provide most of the functionality needed by a consumer at no charge, including word processing, spreadsheet, drawing, and presentation software. Open Office programs use their own format (OpenDocument, which is an open standard rather than a proprietary one), but can also read and write most common document formats.


Two other document formats are worth mentioning before I finally summarize this tome and say goodnight.

The Rich Text format (extension .rtf) was developed by Microsoft to allow the interchange of formatted documents between otherwise incompatible programs (and even other versions of Word). RTF uses a small set of the most common embedded codes so other programs can export and import content without losing fonts and formats. On multiple occasions, I have encountered a Word document whose formatting was so entangled, the only option for fixing it was to export it as an RTF file, import it into FrameMaker, fix it using FM’s predictable formatting tools, then reverse the process to get the document back into Word.

The Portable Document Format (extension .pdf) is a document distribution format, rather than a document editing format. Created and still maintained by Adobe, it has become the most common document distribution format on the internet, largely because of the consistency of PDF document rendering. A PDF document looks and paginates the same across multiple platforms, unlike a Word document, whose appearance varies with local settings, and which might not even be readable by users of certain devices. In addition, for business use, the PDF document format avoids the unintended, but potentially embarrassing and costly, disclosure of private information which is often buried in a Word document.

Bring it on home

I suppose the main take-home for this whole discussion is that there are five main document formats modern writers should understand: plain text, proprietary editing, rich text, document distribution, and HTML. Here are the uses and advantages for each:

Plain text is the safest, least tampered-with, way of creating and sending content. It can be created by a wide variety of text editors, including MS Notepad, Apple TextEdit (although the default in TextEdit is RTF), and many command-line programs such as ed, emacs, and vi (short for visual editor). True plain text can be opened by any document editing program, and it can generally be pasted into any other program (e-mail, browser form) without screwing up your formatting. I download a copy of the Programmers File Editor (PFE) onto every Windows PC that I operate. Last updated in 1999, totally unsupported, but solid as a rock, it runs on any version of Windows, and it is guaranteed to remove all formatting, including some sneaky formatting that Windows retains between programs even when you select the “paste plain text” option. It also addresses the one cross-system “gotcha” that remains between Windows and Unix/Apple machines: what character signals the end of a line. (See “For More Information”, below.)
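
For the curious, the end-of-line “gotcha” is easy to demonstrate. Windows ends each line with two characters (carriage return plus line feed, written "\r\n"), Unix and OS X use a line feed alone ("\n"), and very old Macs used a carriage return alone ("\r"). Here is a minimal Python sketch of the kind of clean-up a text editor performs. The function name is my own, and this is only an illustration, not how PFE actually does it.

```python
def normalize_newlines(text, newline="\n"):
    """Convert Windows (CRLF) and old Mac (CR) line endings to one style."""
    # Handle CRLF first so the lone-CR pass does not convert it twice.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    if newline == "\n":
        return unified
    return unified.replace("\n", newline)


windows_text = "first line\r\nsecond line\r\n"
# repr() makes the normally invisible end-of-line characters visible.
print(repr(normalize_newlines(windows_text)))
print(repr(normalize_newlines("a\nb\n", newline="\r\n")))
```

Going the other way (Unix to Windows) is just a matter of passing "\r\n" as the second argument, as the last line shows.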

Proprietary editing formats are what most people use most often. This includes Microsoft Word, WordPerfect, Apple’s Pages, Open Office Text Document, and older programs such as PC-Write, AppleWorks, and Claris, to name just a very few. Besides usability and availability, the most important thing to understand about these document formats is that they are proprietary. This means that a recipient of your document must have a program capable of rendering the document. This is less of a problem for Word documents, since the format is generally well-understood, but you should not assume that every recipient will be able to open (and, especially, edit) your document sent in a proprietary format. (See RTF and PDF, below.)

Rich text is a reasonable format for exchanging documents between users where further editing might be expected. Many programs such as text editors and e-mail programs support the use of Rich Text, and it allows you to add specific fonts and font sizes, character formatting such as bold, underlined and italicized characters, and line formatting such as indenting. Bear in mind that some older e-mail systems don’t support rich text, and the formatting may be stripped off. And it does not support document-wide formatting tools such as Styles. (I could write a whole posting about the use of styles, and maybe I will, later.) But Rich Text is generally a reasonable choice for basic document sharing.

Portable Document Format may not be the only document distribution format, but it is certainly the de facto standard, and you should use it any time you want your document to be presented in an available, consistent, searchable manner to the maximum number of recipients. The most important fact about PDF is that Adobe Acrobat reading software has for years been freely available for all operating systems, and is already included on most computers. (I mentioned “searchable” in the list of PDF features, because a single jpeg or png file will also retain a constant image, but the text is neither selectable nor searchable. Images don’t work very well for multi-page documents, either.) I personally also like to have a copy of Adobe Acrobat Standard available at work, because it allows me to extract, add, delete, assemble, and even renumber whole PDF pages and documents. For some reason, I use that capability a lot.

HTML is the format used for web pages. I think everyone should know how to create and understand a basic web page. It is the best introduction to “programming” I know of, primarily because the tools are available on every computer system, and it allows novices to learn a little about how computers process information in a safe and friendly environment. Using a text editor and a browser, you can create a basic web page using a few simple commands such as html, head, body, p, img, h1, and h2. A great place to start is the HTML tutorial from W3 Schools. Sooner or later, understanding the basics of HTML will pay off for you. Trust me.
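
To show how little it takes, here is a small sketch that writes a bare-bones web page to disk. It is written in Python only to keep the examples in this posting in one language; you could just as easily type the page itself into a text editor and save it. The file name and the page text are made up for illustration, and the DOCTYPE line and title tag are additions beyond the handful of commands listed above.

```python
from pathlib import Path

# The page itself: nothing but plain text with visible mark-up.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>My first page</title>
</head>
<body>
<h1>Words</h1>
<h2>Be bold</h2>
<p>Plain text is fine, but headings make a page easier to scan.</p>
</body>
</html>
"""


def write_page(path="first_page.html"):
    """Save the sample page to disk and return where it was written."""
    target = Path(path)
    target.write_text(PAGE, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"Wrote {write_page()}; open it in any browser.")
```

Open the resulting file in a browser, then use “view source” to see the same mark-up staring back at you.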

For more information

OK, enough. I suspect it is a character flaw that makes this stuff so interesting to me. If you have read this far, you must suffer from the same genetic malfunction. If you have a sufficiently bad case of it, I have posted some raw notes from my “research” (the high-class name for browsing and googling) at this location.

Anyway, thanks for hanging with me this far. Until next time…