BRIEF DTD TUTORIAL
This is a very brief introduction to DTDs that explains the basic notation.
An XML DTD (Document Type Definition) defines the formal grammar
of an XML-based markup language.
Basically, a DTD contains a list of the elements that can occur in the markup,
a list of the attributes of each element, the possible attribute values
or value types (it may declare default attribute values too) and
a content model that specifies the allowed nesting of elements.
This information can be used in several ways.
1. One can use a DTD to validate a document, i.e., to check whether the
document follows the formal rules defined in the DTD. In this way one can
detect errors (misspelled element names, attribute names or values,
wrongly nested elements, etc.) that would otherwise be difficult to notice.
2. One can use a DTD just to provide an accurate description of a markup
language. Here much depends on the markup language itself, as not all XML
applications can be accurately described using an XML DTD.
3. One can use a DTD to define character entities, specify default attribute
values and bind elements to XML namespaces.
ELEMENT TYPE DECLARATIONS
Elements used in a markup language are declared as follows
<!ELEMENT ElementName FormalContentModel>
where ElementName is the name of an element such as h1, par, table, ul, etc.
(note that each element must be declared only once; multiple
element type declarations with the same element name are not allowed)
and FormalContentModel is an expression that specifies its content
model. In an XML DTD the content model may specify which elements
can be children of the given element (and in what order they may appear)
and whether the element may contain character data.
There are several possible content models. They are described below.
1. EMPTY
This is the simplest content model. It says that the element is empty
and must not contain any character data or nested elements.
For example, the XHTML 'br' element, which is used for forced line breaks,
is an empty element. In a DTD it is declared as follows:
<!ELEMENT br EMPTY>
Usually empty elements are represented by empty-element tags like
<br/>
but
<br></br>
is also valid markup.
2. ANY
Another simple content model. It says that the element may
contain anything, including character data and any other
elements (that are declared in the DTD). It is written as
<!ELEMENT ElementName ANY>
This content model is rarely used as it is too general.
3. Mixed
The Mixed content model should be used when an element may contain
both character data and other elements.
The content model looks like
(#PCDATA | ChildName1 | ChildName2 | ... | ChildNameN)*
where the ChildNames are the names of the possible child elements.
If no child elements are allowed, this content model reduces to
(#PCDATA)
Example:
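Suppose that a 'group' element may contain text or 'subgroup' elements,
and a 'subgroup' element may contain only text, with no tags inside, like
<group>My Group
<subgroup>First Subgroup</subgroup>
<subgroup>Second Subgroup</subgroup>
</group>
In a DTD these elements can be declared as:
<!ELEMENT group (#PCDATA | subgroup)*>
<!ELEMENT subgroup (#PCDATA)>
Note that in an XML DTD the 'Mixed' content model does not
define the order of child elements, does not specify how many
times a child element may be repeated in the markup, and cannot
be combined with other content models.
For example, the following models are illegal:
(#PCDATA | em | strong | strong)*
(#PCDATA | em | strong)+
(#PCDATA | (em | strong))*
The first lists the same child name twice, the second uses '+' instead
of '*', and the third nests a choice inside the Mixed model.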
4. children
Unlike the 'Mixed' content model, this one applies to elements
that may contain only child elements and must not contain
any child text nodes. It specifies the list of child elements;
in addition it may impose restrictions on their order
and specify how many times a certain element may occur in
the content. This is achieved by combining sequences and choices.
A sequence is an ordered list of child elements that looks like
(FirstChild, SecondChild, ThirdChild)
A choice is a list of alternatives, exactly one of which may appear, like
(Child | AnotherChild | YetAnotherChild)
Sequences and choices can be combined to describe more
complex content models (note that in the Mixed content
model you can't do this).
The signs '?', '+' and '*' can be used to specify how many times
a part of the content model may be repeated ('?' means zero or one time,
'+' means one or more times, '*' means zero or more times).
They may appear after a sequence or choice and after any
element name inside a sequence or choice.
Examples:
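As a sketch (the element names 'article', 'title', 'para', 'list', 'item' and 'footer' are hypothetical, chosen only for illustration), the following document declares children content models that combine a sequence, a choice and occurrence signs in its internal DTD subset:

```xml
<?xml version="1.0"?>
<!DOCTYPE article [
  <!-- exactly one title, then one or more para or list elements
       in any order, then an optional footer -->
  <!ELEMENT article (title, (para | list)+, footer?)>
  <!-- a list holds one or more items and nothing else -->
  <!ELEMENT list    (item+)>
  <!ELEMENT title   (#PCDATA)>
  <!ELEMENT para    (#PCDATA)>
  <!ELEMENT item    (#PCDATA)>
  <!ELEMENT footer  (#PCDATA)>
]>
<article>
  <title>Content models</title>
  <para>First paragraph.</para>
  <list>
    <item>One</item>
    <item>Two</item>
  </list>
</article>
```

Moving 'title' after 'para', or leaving out both 'para' and 'list', would make this document invalid against the declarations above.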
DECLARING ATTRIBUTES
If an element has attributes, they must be declared in the DTD
as follows
<!ATTLIST ElementName AttributeName AttributeType DefaultDeclaration>
AttributeName is the full (qualified) name of the attribute, like
href, xml:lang, title. If an element has more than one attribute
list declaration, these lists are simply merged, and if
a certain attribute is declared several times, the first declaration
overrides all subsequent ones.
AttributeType is either the string type (CDATA), which means the attribute
value may be arbitrary, a tokenized type such as ID, IDREF, IDREFS,
NMTOKEN, NMTOKENS, or an enumerated type (a list of all possible attribute values).
DefaultDeclaration may specify whether the attribute is required and,
if the attribute is not required, it may specify a default attribute value.
STRING TYPE
The string type (CDATA) imposes no restriction on the attribute
value; it may carry arbitrary character data that does not
break the well-formedness of the document (for example, a literal '<'
must be escaped).
TOKENIZED TYPES
The most important tokenized types are the following:
NMTOKEN. Attributes of this type must have values that consist
of letters (not necessarily Latin), digits, or the characters '_', '-', '.', ':'
Example:
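For instance, assuming a hypothetical empty 'field' element with a 'name' attribute (both names are made up for this sketch):

```xml
<?xml version="1.0"?>
<!DOCTYPE field [
  <!ELEMENT field EMPTY>
  <!-- 'name' must be a single name token -->
  <!ATTLIST field name NMTOKEN #REQUIRED>
]>
<!-- valid: only letters, digits, '.', '-' and '_' are used;
     name="first name" would be invalid because of the space -->
<field name="2nd.first-name_x"/>
```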
NMTOKENS. Either a single NMTOKEN or a space-separated list of NMTOKENs
Example:
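A minimal sketch, assuming a hypothetical 'box' element whose 'class' attribute takes a list of name tokens:

```xml
<?xml version="1.0"?>
<!DOCTYPE box [
  <!ELEMENT box EMPTY>
  <!-- 'class' holds one or more space-separated name tokens -->
  <!ATTLIST box class NMTOKENS #IMPLIED>
]>
<box class="wide dark rounded"/>
```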
ID. The same as NMTOKEN, but the first character must be
a letter, '_' or ':'
In addition, ID type attribute values must be unique
(two ID type attributes that appear in a single
document are not allowed to carry the same value).
Example:
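As an illustration with hypothetical 'notes' and 'note' elements, where each 'note' carries an 'id' attribute of type ID:

```xml
<?xml version="1.0"?>
<!DOCTYPE notes [
  <!ELEMENT notes (note+)>
  <!ELEMENT note  (#PCDATA)>
  <!ATTLIST note id ID #REQUIRED>
]>
<notes>
  <note id="n1">First note</note>
  <note id="n2">Second note</note>
  <!-- a third note with id="n1" would be invalid (duplicate ID),
       and id="1n" would be invalid (it starts with a digit) -->
</notes>
```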
IDREF. Must contain a reference to an ID that exists in the document
(the value of some ID type attribute).
Example:
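A sketch along the same lines, with a hypothetical 'seealso' element whose 'ref' attribute points at the ID of a 'note':

```xml
<?xml version="1.0"?>
<!DOCTYPE notes [
  <!ELEMENT notes   (note | seealso)*>
  <!ELEMENT note    (#PCDATA)>
  <!ELEMENT seealso EMPTY>
  <!ATTLIST note    id  ID    #REQUIRED>
  <!ATTLIST seealso ref IDREF #REQUIRED>
]>
<notes>
  <note id="n1">First note</note>
  <!-- valid: "n1" is the ID of an element in this document;
       ref="n9" would be invalid because no such ID exists -->
  <seealso ref="n1"/>
</notes>
```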
IDREFS. Must contain a reference to an existing ID or a space-separated
list of such references.
Example:
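For instance, with hypothetical 'note' and 'seealso' elements, a single 'refs' attribute of type IDREFS can point at several IDs at once:

```xml
<?xml version="1.0"?>
<!DOCTYPE notes [
  <!ELEMENT notes   (note | seealso)*>
  <!ELEMENT note    (#PCDATA)>
  <!ELEMENT seealso EMPTY>
  <!ATTLIST note    id   ID     #REQUIRED>
  <!ATTLIST seealso refs IDREFS #REQUIRED>
]>
<notes>
  <note id="n1">First note</note>
  <note id="n2">Second note</note>
  <!-- every token in the list must match an existing ID -->
  <seealso refs="n1 n2"/>
</notes>
```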
ENUMERATED TYPE
Attributes of this type may take only one of a limited number of
predefined values.
Example:
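As a sketch, assuming a hypothetical 'para' element with an 'align' attribute restricted to three values:

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA)>
  <!-- 'align' may only be one of the three listed values -->
  <!ATTLIST para align (left | center | right) #IMPLIED>
]>
<para align="center">Centered text</para>
```

Writing align="justify" here would make the document invalid, since that value is not in the list.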
Note that each value must be an NMTOKEN.
For example, a declaration that lists a value containing a forward
slash is not allowed (the forward slash is not a name character,
so it breaks well-formedness).
DEFAULT ATTRIBUTE DECLARATION
There are several kinds of default declaration.
The most important are:
#IMPLIED
The keyword #IMPLIED specifies that the attribute can be omitted.
#REQUIRED
The keyword #REQUIRED means that the attribute value must be explicitly
specified in the markup.
Default
A quoted default value. The same as #IMPLIED, but if the attribute is
omitted, the XML parser must attach the attribute with its default value
to the element and pass it to the application.
Example:
An element written in the markup without the attribute is treated
as if the attribute had been specified with its default value.
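A minimal sketch of a default value, using a hypothetical 'para' element and 'align' attribute (neither name comes from the original text):

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA)>
  <!-- "left" is used whenever 'align' is not written in the markup -->
  <!ATTLIST para align (left | center | right) "left">
]>
<!-- no 'align' attribute is written here, so the parser reports
     this element as if it were <para align="left">Text</para> -->
<para>Text</para>
```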
#FIXED
The same as a default value, but in this case the default value
is the only possible attribute value.
Example:
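For instance, assuming a hypothetical 'doc' element whose 'version' attribute is fixed:

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA)>
  <!-- 'version' may be omitted, but it can never be anything but "1.0" -->
  <!ATTLIST doc version CDATA #FIXED "1.0">
]>
<!-- <doc> and <doc version="1.0"> are equivalent here;
     <doc version="2.0"> would be invalid -->
<doc>Text</doc>
```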
ATTRIBUTE VALUE NORMALIZATION
Note that the values of all attributes are normalized by the XML parser.
Basically this means that all tab, carriage return and line feed characters
are replaced with spaces, and if the attribute is not of CDATA type
(i.e., it is tokenized or enumerated) then multiple spaces in the
attribute value are replaced by a single space, while leading and
trailing spaces are stripped.
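The difference between the two cases can be sketched as follows; the 'item' element and its 'tags' and 'label' attributes are hypothetical names used only for this illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE item [
  <!ELEMENT item EMPTY>
  <!ATTLIST item
    tags  NMTOKENS #IMPLIED
    label CDATA    #IMPLIED>
]>
<!-- after normalization the parser reports
       tags  as "red green"        (line break replaced, spaces collapsed and trimmed)
       label as "  red   green  "  (CDATA: the spaces are kept as written) -->
<item tags="  red
  green  " label="  red   green  "/>
```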
CHARACTER ENTITIES
Custom character entities can be defined as follows
<!ENTITY EntityName "EntityValue">
and can then be referred to in an XML document as
&EntityName;
They can be used to define convenient notations
for frequently used constructions or hard-to-type
characters. If a character entity is declared several times,
the first declaration overrides later ones.
Example:
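As a sketch, assuming a hypothetical 'para' element and two made-up entity names, 'copy' and 'w3c':

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA)>
  <!-- a hard-to-type character (the copyright sign) -->
  <!ENTITY copy "&#169;">
  <!-- a frequently used phrase -->
  <!ENTITY w3c  "World Wide Web Consortium">
  <!-- ignored: 'w3c' has already been declared above -->
  <!ENTITY w3c  "something else">
]>
<para>Copyright &copy; &w3c;</para>
```

Here &w3c; expands to "World Wide Web Consortium", because the first declaration takes precedence over the later one.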
PARAMETER ENTITIES
Parameter entities can be used to introduce a convenient
notation for frequently used constructions within a DTD.
Parameter entities are used within the DTD (not in XML markup).
They are declared as follows
<!ENTITY % EntityName "EntityValue">
and can then be referred to in the DTD as
%EntityName;
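For example, a declaration that refers to a parameter entity
is equivalent to
the same declaration with the entity's replacement text written
in place of the reference.
If a parameter entity is declared several times,
the first declaration overrides later ones.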
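A minimal sketch with a hypothetical parameter entity 'inline' and hypothetical element names. It is written as a fragment of an external DTD file, because the XML specification does not allow parameter-entity references inside declarations in the internal subset:

```xml
<!-- fragment of an external DTD file -->
<!ENTITY % inline "em | strong | code">

<!-- with the reference expanded, this content model reads
     (#PCDATA | em | strong | code)* -->
<!ELEMENT para   (#PCDATA | %inline;)*>

<!ELEMENT em     (#PCDATA)>
<!ELEMENT strong (#PCDATA)>
<!ELEMENT code   (#PCDATA)>
```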
Parameter entities may be stored in external DTDs.
In this case they can be declared as follows:
<!ENTITY % EntityName SYSTEM "URI">
OR
<!ENTITY % EntityName PUBLIC "PublicID" "URI">
Note that non-validating XML parsers are not required to read external DTDs.
CONDITIONAL SECTIONS
Conditional sections are used to include certain sections of a DTD
or to ignore them. They look like
<![INCLUDE[
...
]]>
<![IGNORE[
...
]]>
Usually they are combined with parameter entities as follows
<!ENTITY % condition "INCLUDE">
<![%condition;[
...
]]>
This section is ignored if the value of the parameter entity 'condition'
is "IGNORE" and included if it is "INCLUDE". One can redefine the value of
a parameter entity in any preceding DTD subset (including the internal one).
In this way one can reconfigure a DTD by redefining certain parameter
entities.
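The reconfiguration described above might be sketched like this. The file name 'example.dtd', the parameter entity 'draft' and the 'para' and 'reviewer' names are all assumptions made for this illustration; the block shows two separate files, marked by comments:

```xml
<!-- file 1: example.dtd (external DTD) -->
<!ENTITY % draft "IGNORE">
<!ELEMENT para (#PCDATA)>
<![%draft;[
  <!-- declared only when 'draft' is "INCLUDE" -->
  <!ATTLIST para reviewer CDATA #IMPLIED>
]]>

<!-- file 2: a document that switches the section on -->
<!DOCTYPE para SYSTEM "example.dtd" [
  <!-- the internal subset is read first, so this declaration
       takes precedence over the one in example.dtd -->
  <!ENTITY % draft "INCLUDE">
]>
<para reviewer="someone">Text</para>
```

Without the redefinition in the internal subset, the conditional section would be ignored and the 'reviewer' attribute would be undeclared, so the document would not validate.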
PROCESSING INSTRUCTIONS
Processing instructions look like
<?target data?>
They are used to pass information to applications.
For example, the following instruction, included in the
XHTML 1.1 DTD, passes the title of the DTD to the W3C markup validator
<?doc type="doctype" role="title" { XHTML 1.1 } ?>
INTERNAL AND EXTERNAL DTDS
A Document Type Definition can be internal, external or
a combination of the two. An internal DTD is included in the document's
prolog, before the root element. It looks like
<!DOCTYPE RootElement [
...
]>
An external DTD is stored in a separate .dtd file, served as
application/xml-dtd, and can be linked to a document like
<!DOCTYPE RootElement SYSTEM "URI">
OR
<!DOCTYPE RootElement PUBLIC "PublicID" "URI">
Internal and external DTDs can be combined
<!DOCTYPE RootElement SYSTEM "URI" [
...
]>
OR
<!DOCTYPE RootElement PUBLIC "PublicID" "URI" [
...
]>
Note that XML parsers are NOT required to read external DTDs;
therefore information that may influence the rendering of an XML
document should be stored in the internal DTD subset (basically this
applies to definitions of character entities and default attribute values).
Note that any attribute, parameter entity and character entity
declarations specified in the internal DTD (or in external entities that
are included in the internal subset) override those specified in the
external DTD.
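As a sketch of this advice, the document below keeps an entity definition and a default attribute value in its internal subset. The file name 'example.dtd' is hypothetical and is assumed to declare the 'para' element; the 'nbsp' entity and 'align' attribute are likewise chosen only for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE para SYSTEM "example.dtd" [
  <!-- declared here so that even a parser that never reads
       example.dtd can resolve &nbsp; and supply the default;
       these declarations also take precedence over any
       declarations of the same names in example.dtd -->
  <!ENTITY nbsp "&#160;">
  <!ATTLIST para align (left | center | right) "left">
]>
<para>non&nbsp;breaking</para>
```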