Operationalizing that digital strategy thing.

Taxonomy vs. Ontology vs. Controlled Vocabulary vs. Topic Maps


Sometimes you’ll hear the terms taxonomy and ontology bandied about interchangeably. A taxonomy is a hierarchical categorization of concepts based on relatedness. Ontologies model a little piece of existence or knowledge. For example, philosophers use ontology as a tool to describe states of being in the physical world (as opposed to discussions about metaphysics and the beyond).

So far they sound exactly the same: some kind of tool for describing things in their proper place and how they relate to each other. From a more modern perspective, taxonomies are used to help people find and retrieve information, and ontologies are used by computer programmers to reuse and transmit data.

So what’s a controlled vocabulary? A controlled vocabulary is usually a strict list of terms that describe some kind of subject matter. From a strict point of view, a controlled vocabulary is fundamentally identical to a taxonomy. They both provide a roadmap to working effectively with a body of knowledge. A thesaurus is a special kind of controlled vocabulary, one that allows you to define synonyms, antonyms, and other relationships. (In your ecommerce store, anyone who searches for a jacket is also shown blazers and coats, for instance.)

Meanwhile, keep an eye out for the Semantic Web and Topic Maps. Topic maps are very similar to other categorization schemas, and even have their own international standard (ISO 13250). We’re getting pretty long here (and believe me, I can go on all day!) so I’m going to stop now.

XML Primer (pt. 1)

Who here has heard of XML? Okay, just about everybody. If there ever were a candidate for most-hyped technology during the late ’90s and early 21st century, XML would certainly be first or second (with Java winning, or losing, by a nose either way).

Whenever I talk to developers, designers, technical writers, or other Web professionals about XML, I find that the number one question on their minds is “What’s the big deal?” Well, I want to show you what the big deal is all about: how XML can be used to make your web applications smarter, more versatile, and more powerful. I’ll try to stay away from all the grandstanding hoopla that has characterized much of the discussion of XML and give you the background and skills to make XML a part of your professional skill set.

So What Is XML?

So what is XML? Any time a group of people asks me this question, I always look for body language. A significant portion of the group leans forward in anticipation, wanting to learn more. The others either roll their eyes (anticipating only hype and half-formed theories) or cringe (anticipating a long, boring, dry history of markup languages).

So you can say that I’ve learned to keep it brief.

The essence of XML is all in its name: Extensible Markup Language.

·         Language. XML is a language very similar to SGML and HTML. It’s much more flexible than HTML, because it allows you to create your own custom tags. It’s more lightweight and simplified than SGML, which would, in many applications, be impractical on the Web. But it’s important to realize that XML is not just a language, it’s a meta-language: a language that allows us to create or define other languages. For example, with XML we can create other languages, such as RSS, MathML (a mathematical markup language), and even XSLT.

·         Extensible. It’s extensible because you get to define your own tags, what order they come in, and how they should be processed or displayed. Another way to think of extensibility is that XML allows all of us to extend our notion of what a document is: it can be a file that lives on a file server, or it can be a transient piece of data that flows between two computer systems (as in the case of Web Services).

·         Markup. The most recognizable feature of XML is its tags, or elements to be more accurate. In fact, the elements you’ll be creating will be very similar to the elements you’ve already been creating in your HTML documents. However, XML allows you to define your own set of tags.

Why Do We Need XML?

But why do we need XML? Because HTML is specifically designed to display documents in a Web browser, and that’s about it. It becomes cumbersome if you want to display documents on a mobile device or do anything remotely complicated, such as translating a document from German to English. HTML’s sole purpose is to allow anyone to quickly create Web documents that can be shared with other people. XML isn’t just good for the Web; it can be used in a variety of contexts, some of which may not have anything to do with humans interacting with content (for example, Web services use XML to send requests and responses back and forth).

Unfortunately, HTML mostly offers ways to tag documents so they can be displayed in a Web browser, and rarely (if ever) provides information about how the document is structured or what it means. Put in layman’s terms, HTML is a presentation language, whereas XML is a data-description language.

For example, if you were to go to any e-commerce Web site and download a product listing, you’d likely get something like this:

<html><head><title>ABC Products</title></head>

<body>

<h1>ABC Products</h1>

<h2>Product One</h2>

<p>Product One is an exciting new widget that
will simplify your life.</p>

<p><b>Cost: $19.95</b></p>

<p><b>Shipping: $2.95</b></p>

<h2>Product Two</h2>

<h3>Product Three</h3>

<p><i>Cost: $24.95</i></p>

<p>This is such a terrific widget that you will most certainly want to buy one for your home and another one for your office!

</body></html>

Take a good look at that code sample, simple as it is, from a computer’s perspective. A human can certainly parse this document and make some semantic leaps, but a computer can’t.

 

Semantics and Other Jargon

You’re going to be hearing a lot of talk about “semantics” and other linguistics terms in this article. It’s unavoidable, so bear with me. Semantics is the study of meaning in language.

Humans are way better at the semantic game than computers, because humans are really good at parsing out meaning. For example, if I asked you to list as many names for “female animals” as you could, you’d probably start with “lioness”, “tigress”, “ewe”, “doe” and so on. If you were presented with a list of these names and asked to provide a category that contained them all, it’s likely you’d say something like “female animals.”

Furthermore, if I asked you what a lioness was, you’d say, “female lion.” If I further asked you to list associated words, you might say “pride”, “hunt”, “savannah”, “Africa” and the like. From there you could make the leap to other wild cats, and then to house cats and maybe even dogs (cats and dogs are both pets, after all). With very little effort, you’d be able to build a stunning semantic landscape, as it were.

Needless to say, computers are really bad at this game, which is a shame, as many computing tasks require semantic skill. And that is why we need to give them as much help as we can.

For example, a human can probably deduce that the <h2> tag is being used to tag a product name within a product listing. Furthermore, a human might be able to guess that the first paragraph after an <h2> held the description, and that the next two paragraphs contained price and shipping information, in bold.

However, even a cursory glance at the rest of the document reveals some, well, very human errors. For example, the last product name is encapsulated in <h3> tags, not <h2> tags. This last product listing also displays a price before the description and the price is italicized instead of bolded.

A computer program (and even some humans) trying to decipher this wouldn’t be able to make the kinds of semantic leaps required to make sense of it. All it would be able to do is render it to a browser with the styles associated with each tag. That’s because HTML is chiefly a set of instructions for rendering a document inside of a Web browser, and not a way to structure a document to bring out its meaning.

If the above document were created in XML, it might look a little like this:

<productListing title="ABC Products">

<product>

<name>Product One</name>

<description>Product One is an exciting new
widget that will simplify your life.</description>

<cost>$19.95</cost>

<shipping>$2.95</shipping>

</product>

<product>

<name>Product Two</name>

</product>

</productListing>

Notice that this new document contains absolutely no information in it regarding display. What does a <product> tag look like in a browser? Beats me: we haven’t defined that yet. Later on in this series of articles, I’ll show you how to use technologies like CSS and XSLT to transform your XML into any format you like.

Essentially, XML allows you to separate information from presentation, which is one of its most powerful attributes. When we concentrate on structure, as we’ve done here, it becomes easier to keep our information correct, because we start thinking about our documents as containing certain key pieces of data. (If you have database programming experience, you’ll notice that XML essentially lets you treat your documents as databases.)

In theory, we should be able to look at any XML document and inherently understand what’s going on. In our example above, we know that a product listing contains products, and each product has a name, a description, a price, and a shipping cost. You could say, rightly, that each XML document is readable by both humans and software.
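To make the "readable by software" half of that claim concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The summarize function and the LISTING constant are names made up for this illustration, and the listing is the one from above, retyped with straight quotes around the attribute value.

```python
import xml.etree.ElementTree as ET

# The product listing from the article (straight quotes around the attribute).
LISTING = """<productListing title="ABC Products">
  <product>
    <name>Product One</name>
    <description>Product One is an exciting new widget that will simplify your life.</description>
    <cost>$19.95</cost>
    <shipping>$2.95</shipping>
  </product>
  <product>
    <name>Product Two</name>
  </product>
</productListing>"""


def summarize(xml_text):
    """Return the listing title and a (name, cost) pair for each product."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the child element is absent,
    # which is the case for the cost of Product Two.
    products = [(p.findtext("name"), p.findtext("cost"))
                for p in root.findall("product")]
    return root.get("title"), products


title, products = summarize(LISTING)
print(title)
for name, cost in products:
    print(name, cost)
```

Notice that the code never has to guess which heading level or bold tag holds the product name. It simply asks for the name element.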

Naturally, everyone makes mistakes, and it’s no different in XML. You might start sharing your XML documents with another developer or another company and somewhere along the line someone puts the description after the price for a given product. Normally, this wouldn’t be a big deal, but maybe in your case your Web application really needs the description to come after the product name every time.

What you need to ensure that everyone plays by the rules is a DTD (a document type definition, which is one kind of schema). A DTD basically provides instructions on how your particular XML document should be structured. With a DTD in place, anyone who creates product listings for your application would have to follow the rules. It’s like having a rulebook that lays down the law on what tags are legal where. We’ll get into DTDs and XML Schemas later on.

For now, though, let’s stay out of the weeds and continue with the basics.

A Closer Look at Our XML Example

From a casual observer’s viewpoint, any given XML document, like the one we looked at in the previous section, appears to be just a bunch of tags and letters. But there’s much more going on than that.

A Structural Viewpoint

Let’s examine our XML example from a structural standpoint. No, not the kind of structure we bring to a document by marking it up with XML tags, but structure on a more granular level. I want to examine what a typical XML file contains, character by character.

The simplest XML elements contain an opening tag, a closing tag, and some content in between.

The opening tag begins with a left angle bracket (<), has an element name that starts with a letter or underscore and can contain letters, digits, hyphens, underscores, and periods (but no spaces!), and is followed by a right angle bracket (>). Content, from the standpoint of XML, is usually parsed character data: in other words, plain text, other XML elements, and even things like XML entities. Following your content is your closing tag, which is spelled and capitalized like your opening tag but with one tiny change: a / right before the element name.

Here are a few examples of well-formed elements in XML:

<myElement>some content here</myElement>

<elements>

<myelement>one</myelement>

<myelement>two</myelement>

</elements>

Elements, Tags, or Nodes?

You’ll see me refer to XML elements, XML tags, and XML nodes at different points in this article. What’s the deal? Well, for the layman, these terms are interchangeable, but if you want to get technical (and who’d want to do that in a technical article?) they really are different:

·         An element consists of an opening tag, its attributes, any content, and a closing tag.

·         A tag—either opening or closing—is part of an element.

·         A node is how developers think of XML elements. Each element becomes a node in a hierarchical structure. It’s worth noting that even an attribute of an XML element or an XML comment can be considered a node, so the concept of a node can get a bit fuzzy around the edges.

Warning!

If you’re used to working with HTML, you’ve probably created many documents with missing end tags, used different capitalization on your opening and closing tags, or improperly nested your tags.

You won’t be able to get away with any of that in XML! In XML, the <myElement> tag is different from the <MYELEMENT> tag, and both are different from the <myELEMENT> tag. If your opening tag is <myELEMENT> and your closing tag is </Myelement> then your document won’t be well-formed.

If you use attributes on any elements, then attribute values must be single- or double-quoted. No more getting away with bare attribute values like you did in HTML! For example, the following is okay in HTML:

<h1 class=topHeader>

But in XML, you’d have to put quotes (either single or double) around the attribute value, like this:

<h1 class='topHeader'>

Also, if you improperly nest your elements (i.e., start a new element and then close the previous one before closing the new one) then your document won’t be well-formed. (I keep mentioning well-formedness and validity, and believe me, we’ll talk about both soon!) For example, in HTML, browsers will let you get away with this:

<b>Some text that is bolded, some that is <i>italicized</b></i>.

But in XML, this improper nesting of elements would cause the parser to issue an error.
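You can watch a parser enforce these rules with a few lines of Python. This is a small sketch using the standard library's xml.etree.ElementTree module; the is_well_formed helper and the sample strings are made up for this illustration, and the exact wording of the error messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


samples = {
    "properly nested": "<p><b>bold and <i>italic</i></b></p>",
    "improperly nested": "<p><b>bold and <i>italic</b></i></p>",
    "mismatched case": "<myELEMENT>text</Myelement>",
    "unquoted attribute": "<h1 class=topHeader>Title</h1>",
}

for label, text in samples.items():
    ok, message = is_well_formed(text)
    print(label, "->", "ok" if ok else message)
```

Only the first sample parses. The other three are rejected outright, which is exactly the strictness HTML authors tend to find surprising.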

Because XML allows you to create any language you want, the creators of XML had to institute a rule that is somewhat related to the proper nesting rule. Each XML document must contain a top-level element that contains all of the other elements within it. As you’ll see later on, almost every single piece of XML development you’ll do is facilitated by this one nice little rule.

Attributes

Did you notice the <productListing> opening tag in our example? Inside the tag was title="ABC Products". This is called an attribute.

You can think of attributes as adjectives or adverbs: they provide additional information about the element that may not make any sense as content. If you’ve worked with HTML, you’re familiar with such attributes as src (file source) on the <img> tag. In that case, the attribute tells the browser which image file to load, and it wouldn’t make sense to display that as content in a browser.

What should go in an attribute and what should go between the tags of an element? This is a subject of much debate, but don’t worry, there really are no wrong answers here. Remember, you’re the one defining your own language. A good rule of thumb is to use attributes to store data that doesn’t necessarily need to be displayed to a user of the information.

Let’s examine the issue a little more closely. Let’s say that you wanted to create an XML document to keep track of your DVD collection. Here’s a short snippet of what that might look like:

<dvdCollection>

<dvd>

<id>1</id>

<title>Raiders of the Lost Ark</title>

<release-year>1981</release-year>

<director>Steven Spielberg</director>

<actors>

<actor>Harrison Ford</actor>

<actor>Karen Allen</actor>

<actor>John Rhys-Davies</actor>

</actors>

</dvd>

….

</dvdCollection>

Anyone who reads this document probably doesn’t need to know what the <id> of a particular DVD in your collection is. We could probably safely set that as an attribute of the <dvd> element, like this:

<dvd id='1'>

In other places in our DVD listing, though, the information seems a little bare. For instance, we’re only displaying an actor’s name between the <actor> tags. One way to solve this is to add attributes:

<actor type="superstar" gender="male" age="50">Harrison Ford</actor>

In this case though, I’d have to revert to our rule of thumb—most users of the information would probably want to know at least some of this information. So we can convert some of these attributes to elements:

<actor type="superstar">

<name>Harrison Ford</name>

<gender>male</gender>

<age>50</age>

</actor>
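Whichever way you split things up, the choice shows in the code that reads the document. The following sketch uses Python's standard xml.etree.ElementTree module on a trimmed-down version of the DVD example; the variable names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed version of the DVD entry, with the id moved into an attribute.
DVD = """<dvd id="1">
  <title>Raiders of the Lost Ark</title>
  <actors>
    <actor type="superstar">
      <name>Harrison Ford</name>
      <gender>male</gender>
      <age>50</age>
    </actor>
  </actors>
</dvd>"""

dvd = ET.fromstring(DVD)
actor = dvd.find("actors/actor")

# Attributes are read by name from the element itself.
dvd_id = dvd.get("id")
actor_type = actor.get("type")

# Child elements are looked up by tag name, and their text is always a string.
actor_name = actor.findtext("name")
actor_age = int(actor.findtext("age"))

print(dvd_id, actor_type, actor_name, actor_age)
```

Either approach works. Attributes are a little terser to read, while child elements leave room to grow, since an element can later gain children or attributes of its own and an attribute value cannot.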

Empty Tags

In other cases, XML tags may be said to be empty—they contain no content whatsoever. These empty tags are similar to the <img> and <br> tags in HTML. In the case of <img>, it contains all of its information in the form of attributes. In the case of <br>, it normally doesn’t contain any attributes—it just signifies a line break.

Remember that in XML all opening tags must have a matching closing tag. You can do this in one of two ways when dealing with empty elements:

<myCustomEmptyElement></myCustomEmptyElement>

<myCustomEmptyElement/>

In the latter case, the / at the end of your empty tag tells the parser that the element opens and closes in a single tag, so no separate closing tag follows. It’s an efficient shorthand that you should use when you have empty elements.
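If you want to convince yourself that the two forms really are equivalent, a parser will tell you. Here is a tiny sketch with Python's standard xml.etree.ElementTree module; the describe helper is made up for this illustration.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<myCustomEmptyElement></myCustomEmptyElement>")
short_form = ET.fromstring("<myCustomEmptyElement/>")


def describe(element):
    """Return the tag name, text content, and number of children."""
    return element.tag, element.text, len(element)


print(describe(long_form))
print(describe(short_form))
```

Both calls report the same tag, no text, and no children, so code that consumes the document cannot tell which spelling the author used.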

Entities

I mentioned entities earlier. An entity is a handy little construct that, at its simplest, allows you to define special characters that you can insert into your documents. If you’ve worked with HTML, you know that the &lt; entity inserts a literal < character into a document, which is the quick and easy way to literally display the name of a tag in an HTML document.

XML, true to its extensible nature, allows you to create your own entities. Let’s say that your company’s copyright notice has to go on every single document. Instead of typing this notice over and over again, you could declare an entity called copyright_notice with the proper text, and then use it in your XML documents as &copyright_notice;. What a time saver!

We’ll cover entities in more detail later on.
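As a quick preview, here is a sketch of the copyright example. It assumes the entity is declared in the document's own DOCTYPE, which is one of the ways XML lets you do it, and it uses Python's standard xml.etree.ElementTree module to show the parser expanding the reference. The page, body, and footer element names and the wording of the notice are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity is declared once, then referenced by name.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE page [
  <!ENTITY copyright_notice "Copyright ABC Products. All rights reserved.">
]>
<page>
  <body>A tag looks like &lt;this&gt;.</body>
  <footer>&copyright_notice;</footer>
</page>"""

page = ET.fromstring(DOCUMENT)

# The built-in &lt; and &gt; entities and the custom entity are all
# replaced with their text by the time the program sees the content.
print(page.findtext("body"))
print(page.findtext("footer"))
```

Change the declaration once and every reference picks up the new text the next time the document is parsed.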

But It’s About More Than Structure

XML documents are more than just a sequence of elements. If you take another closer look at our product listing or DVD listing examples, you will notice two things:

1.        The documents are self-describing, as we’ve already discussed

2.       The documents are really a hierarchy of nested objects

Let’s elaborate on the first point very quickly. We’ve already said that most (if not all) XML documents are self-describing. This feature, combined with all that content encapsulated by opening and closing tags, takes XML documents far past the realm of mere data and into the vaunted halls of information.

What’s that, you say? Data can be a string of characters or numbers like 5551238888. This string can represent anything from a laptop’s serial number to a pharmacy’s prescription ID to a phone number in the United States.

The only way to turn this data into information (and therefore make it useful) is to add context to it: once you have context, you can be sure what the data represents. In short, <phone location='USA'>5551238888</phone> leaves no doubt that this seemingly arbitrary string of numbers is in fact a U.S. phone number.

When you take into account the second point, that all XML documents are really a hierarchy of objects, then all sorts of possibilities open up. Seen as a hierarchical tree, our product listing document looks a bit like this:

<productListing>

    <product>

        <name>

        <description>

        <cost>

        <shipping>

    <product>

….

Remember what I said about having one element that contains all the others? Well, that root element becomes the root of our hierarchical tree. You can think of that tree as a family tree, with the root element having various children (in this case, product elements), and each of those having various children (name, description and so on). In turn, each product element has various siblings (other product elements) and a parent (the root).

Because what we have is a tree, we should be able to travel up and down and side to side with relative ease. Most of what you’ll be doing with XML from a programmatic stance is properly creating and navigating XML structures. In programmatic circles, XML elements are usually termed nodes, because each element represents a node in some kind of system.
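Here is a minimal sketch of that kind of navigation in Python, using the standard xml.etree.ElementTree module on a shortened product listing. The outline and siblings helpers and the parent_of lookup are names made up for this illustration. ElementTree elements know their children but not their parent, so the sketch builds that lookup itself.

```python
import xml.etree.ElementTree as ET

LISTING = """<productListing>
  <product><name>Product One</name><cost>$19.95</cost></product>
  <product><name>Product Two</name></product>
</productListing>"""

root = ET.fromstring(LISTING)

# Build a child -> parent lookup once, so we can travel up as well as down.
parent_of = {child: parent for parent in root.iter() for child in parent}


def outline(element, depth=0):
    """Return the tree as a list of indented tag names, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def siblings(element):
    """Return the other children of this element's parent."""
    parent = parent_of.get(element)
    return [] if parent is None else [e for e in parent if e is not element]


print("\n".join(outline(root)))

first_product = root[0]
print(parent_of[first_product].tag)
print([s.findtext("name") for s in siblings(first_product)])
```

Going down is a loop over an element's children, going up is a dictionary lookup, and going sideways is a combination of the two.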

One final point about hierarchical trees. Before, we talked about transforming data into information by adding context. Well, when we start building hierarchies of information that indicate natural relationships (i.e., taxonomies) we’ve just taken the first giant leap toward turning information into knowledge. That statement in and of itself can spawn a whole other book, so I’ll just have to leave it at that and move on!

Formatting Issues

Earlier I made a point about XML allowing you to separate information from presentation. I also mentioned that you could use other technologies like CSS (Cascading Style Sheets) and XSLT (Extensible Stylesheet Language Transformations) to make your information displayable in different contexts.

In later articles I’ll go into plenty of detail on both CSS and XSLT, but I wanted to make a brief point here. Because we’ve taken the time to create XML documents, it means that our information is no longer locked up inside of proprietary formats like a word processor or spreadsheet. Furthermore, it no longer has to be “re-created” every time you want to have alternate displays of that information. All you have to do is create a style sheet or transformation to make your XML presentable in a given medium.

For example, if you had your information stored in a word processing program, it would contain all kinds of information regarding the way it should look for the printed page—lots of bolding, font sizes, and neat tables. Unfortunately, if that document also had to be posted to the Web as an HTML document, someone would have to convert it (either manually or via software), clean it up, and test it.

If someone else made changes to the original document, those changes wouldn’t cascade to the HTML version. If someone else wanted to take the same information and use it in a slide presentation, they might run the risk of using outdated information from the HTML version. Even if they did get the right information inside their presentation, you would still have to track three places where your information lived. It can get pretty messy after that.

If the same information were in XML, you could create three different XSLT files to transform the XML into HTML, a slide presentation, and a printer-friendly file format like PostScript. If you make changes to the XML file, then the other files would also automatically change once you passed the XML file through the process. (This notion, by the way, is an essential component of single-sourcing—i.e., having a “single source” for any given information that is reused elsewhere.)
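XSLT gets its own article later, but the single-sourcing idea can be sketched without it. The example below uses two plain Python functions in place of XSLT stylesheets, so it shows the principle rather than the real tool. The to_html and to_text function names and the two output formats are choices made for this illustration, and the listing is a shortened version of the earlier one.

```python
import xml.etree.ElementTree as ET
from html import escape

# One source document; only Product One carries a cost, as in the article.
LISTING = """<productListing title="ABC Products">
  <product><name>Product One</name><cost>$19.95</cost></product>
  <product><name>Product Two</name></product>
</productListing>"""


def to_html(root):
    """Render the listing as an HTML fragment."""
    parts = [f"<h1>{escape(root.get('title'))}</h1>"]
    for product in root.findall("product"):
        parts.append(f"<h2>{escape(product.findtext('name'))}</h2>")
        cost = product.findtext("cost")
        if cost is not None:
            parts.append(f"<p><b>Cost: {escape(cost)}</b></p>")
    return "\n".join(parts)


def to_text(root):
    """Render the same listing as plain text."""
    lines = [root.get("title").upper()]
    for product in root.findall("product"):
        cost = product.findtext("cost")
        suffix = f" ({cost})" if cost is not None else ""
        lines.append(f"- {product.findtext('name')}{suffix}")
    return "\n".join(lines)


root = ET.fromstring(LISTING)
print(to_html(root))
print(to_text(root))
```

Edit the XML, run both functions again, and every output reflects the change. Nobody has to remember to update a second or third copy by hand.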

As you can see, separating information from presentation makes your XML documents reusable, and can save a lot of hassle and headache in environments where lots of information needs to be stored, processed, handled, and exchanged. 

Well-Formedness and Validity

We’ve talked a little bit about XML, what it’s used for, how it looks, how to conceive of it, and how to transform it. One of the most powerful things about XML, of course, is that it allows you to define your own language.

However, this most powerful feature also exposes a great weakness of XML. If all of us start defining our own languages, then we run the risk of no one understanding anything anyone else says. So the creators of XML had to set down some rules for what was considered a “legal” XML document.

There are two levels of “legality” in XML:

·         Well-formed

·         Valid

A well-formed XML document follows these rules (I’m repeating myself a little bit, but I think it’s valuable to have all this information in one place):

·         An XML document must contain a top-level (or root) element that contains all other elements.

·         All elements must be properly nested.

·         All elements must be closed unless they are empty elements (in which case, they are closed with a special ending / character before the closing angle bracket).

·         Element names are case sensitive.

·         All attribute values must be quoted.

A valid XML document is both well-formed and follows all the rules set down in that document’s DTD (document type definition). A valid document, then, is nothing more than a well-formed document that adheres to its DTD.

The question then becomes, why have two levels of legality? A good question indeed.

For the most part, you will only care that your documents are well-formed. In fact, most XML parsers (software that checks your XML documents) are non-validating parsers, and that includes the ones found in Web browsers like Mozilla and Internet Explorer. The well-formedness level of legality allows you to create ad-hoc XML documents that can be quickly created, added to an application, and tested.

For other applications that are more mission-critical, you will want to use a DTD with your XML documents, and then run your documents through a validating parser.
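To get a feel for the difference between the two levels, here is a hand-rolled sketch in Python. It is not a DTD and not a validating parser: the standard library's xml.etree.ElementTree module only checks well-formedness, so the check_listing function below stands in for the kind of rule a DTD would express. The rule itself (children must appear in the order name, description, cost, shipping, and only name is required) is an assumption made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed rule for this sketch: children appear in this order; only name is required.
EXPECTED_ORDER = ["name", "description", "cost", "shipping"]


def check_listing(xml_text):
    """Return a list of rule violations. Raises ET.ParseError if not well-formed."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "productListing":
        problems.append(f"root element is <{root.tag}>, expected <productListing>")
    for number, product in enumerate(root.findall("product"), start=1):
        tags = [child.tag for child in product]
        unknown = [t for t in tags if t not in EXPECTED_ORDER]
        if unknown:
            problems.append(f"product {number}: unexpected elements {unknown}")
            continue
        if "name" not in tags:
            problems.append(f"product {number}: missing <name>")
        if tags != sorted(tags, key=EXPECTED_ORDER.index):
            problems.append(f"product {number}: elements out of order {tags}")
    return problems


good = ("<productListing><product><name>A</name>"
        "<description>d</description><cost>$1</cost></product></productListing>")
bad = ("<productListing><product><name>B</name>"
       "<cost>$1</cost><description>d</description></product></productListing>")

print(check_listing(good))
print(check_listing(bad))
```

Both documents are well-formed, so the parser accepts them without complaint. Only the extra check notices that the second one puts the description after the cost, which is the kind of mistake a DTD and a validating parser would catch for you.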

The bottom line? Well-formedness is mandatory, but validity is an extra, optional step.