What characters need to be escaped in XML documents, and where can I find a complete list of these special characters?
XML requires five special characters to be escaped in most contexts: <, >, &, ", and ', which have the corresponding entity references &lt;, &gt;, &amp;, &quot;, and &apos;. The ampersand (&) and less-than (<) characters must always be escaped except when they’re used as markup delimiters or within special sections like CDATA blocks. For a complete list of XML character entities and their numeric references, you can consult the Wikipedia page on XML and HTML character entity references or the official W3C XML specifications.
Contents
- The Five Essential XML Characters to Escape
- Context-Dependent Escaping Rules
- Complete Character Entity References
- Numeric Character References
- Best Practices for XML Escaping
- Tools and Resources
The Five Essential XML Characters to Escape
XML defines five fundamental characters that must be escaped when they appear in text content or attribute values, as these characters have special meaning in XML markup:
- Ampersand (&) - Must be escaped as &amp;
  - This character starts entity references and character references
  - According to the XML specification, the ampersand character must always be escaped when it doesn’t begin a valid entity reference
- Less-than (<) - Must be escaped as &lt;
  - This character starts element tags and other markup
  - The W3C XML specification clearly states that the less-than character is reserved for markup
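- Greater-than (>) - Should be escaped as &gt;
  - While not strictly required in most places, escaping this character is good practice for consistency; it must be escaped when it would otherwise complete the sequence ]]> in element content
  - The greater-than character ends start-tags and end-tags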
- Double-quote (") - Must be escaped as &quot; when in double-quoted attributes
  - Required when the attribute value is delimited by double quotes
  - This prevents the parser from interpreting the quote as the end of the attribute value
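- Single quote/Apostrophe (') - Must be escaped as &apos; when in single-quoted attributes
  - Required when the attribute value is delimited by single quotes
  - &apos; is one of the five entities predefined by XML 1.0; it was not defined in HTML 4, so it is safe in XML and XHTML but may not be recognized by older HTML parsers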
Important Note: These five characters are the only ones for which XML predefines entity references, and of them only & and < must be escaped everywhere in content. Other characters can be represented directly or through numeric character references if needed.
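As a minimal sketch of the five replacements listed above, the function below escapes a string by hand. The name `escape_xml` is a hypothetical helper introduced for illustration, not part of any library; in real code a built-in escaping function is usually preferable.

```python
def escape_xml(text: str) -> str:
    """Replace the five predefined XML characters with entity references."""
    # The ampersand must be handled first; otherwise the "&" introduced by
    # the later replacements would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    text = text.replace("'", "&apos;")
    return text


# prints: 3 &lt; 5 &amp; 7 &quot;quoted&quot; &apos;text&apos;
print(escape_xml("3 < 5 & 7 \"quoted\" 'text'"))
```

The order of the replacements matters: moving the ampersand step to the end would turn `&lt;` into `&amp;lt;`.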
Context-Dependent Escaping Rules
The escaping requirements for XML characters vary depending on where they appear in the document structure:
In Element Content
- & and < must always be escaped
- >, ", and ' can appear literally but should be escaped for consistency
- Example: 3 &lt; 5 &amp; 7 rather than 3 < 5 & 7
In Attribute Values
- & must always be escaped
- < must always be escaped (element tags aren’t allowed in attributes)
- " must be escaped if the attribute is delimited by double quotes
- ' must be escaped if the attribute is delimited by single quotes
- > can appear literally but should be escaped for consistency
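The difference between element content and attribute values can be illustrated with Python's standard `xml.sax.saxutils` module. This is a sketch under the assumption that the document is assembled as a string; the `note` element and `title` attribute are made-up names for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

text = "3 < 5 & 7"
title = 'He said "it\'s fine"'

# escape() handles &, < and >, which is sufficient for element content.
# quoteattr() escapes the value and wraps it in whichever quote style
# needs the least escaping, so no extra quotes are added around it.
xml = f"<note title={quoteattr(title)}>{escape(text)}</note>"
print(xml)

# Round-trip through a parser to confirm the original values come back.
note = ET.fromstring(xml)
assert note.text == text
assert note.get("title") == title
```

Because the value contains both quote styles, `quoteattr` chooses double quotes as the delimiter and escapes only the embedded double quotes as `&quot;`.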
Special Contexts Where Escaping is Not Required
Comments (<!-- comment -->):
- All five special characters can appear without escaping
- Example:
<!-- This is a comment with < & > " ' -->
Processing Instructions (<?target data?>):
- The five special characters can appear without escaping
- Exception: The instruction name cannot be “xml” (reserved for XML specification)
- Example:
<?xml-stylesheet type="text/css" href="style.css"?>
CDATA Sections (<![CDATA[...]]>):
- No escaping is required within CDATA sections
- The only restriction is that the sequence ]]> cannot appear
- Example:
<![CDATA[3 < 5 & 7 " ' content]]>
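To check these context rules against a real parser, the sketch below feeds a small document to Python's `xml.etree.ElementTree`. The element names (`root`, `escaped`, `raw`, `bad`) are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """<root>
  <!-- a comment with < & > " ' -->
  <escaped>3 &lt; 5 &amp; 7</escaped>
  <raw><![CDATA[3 < 5 & 7 " ' content]]></raw>
</root>"""

root = ET.fromstring(doc)
# Entity references are resolved and CDATA is delivered as plain text.
print(root.find("escaped").text)
print(root.find("raw").text)

# The same characters, unescaped in ordinary element content, are rejected.
try:
    ET.fromstring("<bad>3 < 5 & 7</bad>")
except ET.ParseError as err:
    print("not well-formed:", err)
```

The comment and the CDATA section are accepted as written, while the unescaped `<` and `&` in ordinary content cause a parse error.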
Complete Character Entity References
For comprehensive XML character escaping, you can utilize both named entity references and numeric character references.
Named Entity References
The XML specification defines five predefined named entities:
| Character | Entity Reference | Description |
|---|---|---|
| < | &lt; | Less-than sign |
| > | &gt; | Greater-than sign |
| & | &amp; | Ampersand |
| " | &quot; | Double quote |
| ' | &apos; | Single quote/apostrophe |
Extended Named Entities
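Beyond the five predefined entities, XML itself defines no other named entities. Additional names, such as those familiar from HTML, are only available when a DTD declares them, as the XHTML DTDs do. The most comprehensive list is available on Wikipedia’s List of XML and HTML character entity references, which includes:
- Mathematical symbols: &frac12; (½), &times; (×), &divide; (÷)
- Currency symbols: &euro; (€), &pound; (£), &yen; (¥)
- Punctuation: &ndash; (–), &mdash; (—), &hellip; (…)
- Special characters: &copy; (©), &reg; (®), &trade; (™)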
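The gap between the five predefined entities and the larger HTML set can be seen with a parser that reads no DTD. The sketch below uses Python's standard library and assumes a plain XML fragment with no DOCTYPE.

```python
import html
import xml.etree.ElementTree as ET

# &amp; is predefined in XML, so this parses without any DTD.
print(ET.fromstring("<p>Fish &amp; chips</p>").text)

# &copy; is an HTML entity name; with no DTD declaring it, the parser rejects it.
try:
    ET.fromstring("<p>&copy; 2024</p>")
except ET.ParseError as err:
    print("rejected:", err)

# A numeric character reference for the same character is always accepted.
print(ET.fromstring("<p>&#169; 2024</p>").text)

# html.unescape() knows the HTML names and can help when converting HTML text.
print(html.unescape("&copy; &euro; &mdash;"))
```

When content comes from HTML, decoding the named entities first and then escaping for XML avoids undefined-entity errors.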
Numeric Character References
When named entities aren’t available, you can use numeric character references to represent any Unicode character:
Decimal Format
Format: &#nnnn; where nnnn is the decimal Unicode code point
Hexadecimal Format
Format: &#xhhhh; where hhhh is the hexadecimal Unicode code point
Common Examples:
| Character | Decimal Reference | Hexadecimal Reference |
|---|---|---|
| & | &#38; | &#x26; |
| < | &#60; | &#x3C; |
| > | &#62; | &#x3E; |
| " | &#34; | &#x22; |
| ' | &#39; | &#x27; |
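Both reference formats can be derived from a character's Unicode code point. The helper `numeric_refs` below is a hypothetical name used only for this sketch.

```python
import xml.etree.ElementTree as ET


def numeric_refs(char: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"


for ch in "&<>\"'€":
    decimal_ref, hex_ref = numeric_refs(ch)
    print(repr(ch), decimal_ref, hex_ref)

# Both forms resolve to the same character when parsed.
decimal_ref, hex_ref = numeric_refs("€")
assert ET.fromstring(f"<c>{decimal_ref}</c>").text == "€"
assert ET.fromstring(f"<c>{hex_ref}</c>").text == "€"
```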
Complete Unicode Reference: For finding the numeric code for any character, you can use resources like Unicode Charts or online tools like Unicode Lookup.
Best Practices for XML Escaping
Programming Language Implementations
JavaScript:
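// Using regular expressions
function escapeXml(str) {
  // Map each special character to its predefined entity reference
  const entities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&apos;'
  };
  return str.replace(/[&<>"']/g, (char) => entities[char]);
}
// Using the DOM API (browser only). This escapes &, < and > but not quotes,
// so it suits element content rather than attribute values.
function escapeXmlWithDOM(str) {
  const div = document.createElement('div');
  div.textContent = str;
  return div.innerHTML;
}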
Java:
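// Using Apache Commons Text (successor to the Commons Lang StringEscapeUtils)
import org.apache.commons.text.StringEscapeUtils;
String escaped = StringEscapeUtils.escapeXml11("3 < 5 & 7 \" '");
// JDK only: DatatypeConverter.printString does not escape markup, so use an XMLStreamWriter
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
StringWriter out = new StringWriter();
XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
writer.writeCharacters("3 < 5 & 7 \" '"); // escapes &, < and > in text content
writer.flush();
String escapedText = out.toString();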
Python:
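# Using xml.sax.saxutils: escape() handles &, < and >
from xml.sax.saxutils import escape, quoteattr
escaped = escape("3 < 5 & 7")
# For full escaping including quotes, pass the extra entities explicitly
fully_escaped = escape("3 < 5 & 7 \" '", {'"': "&quot;", "'": "&apos;"})
# quoteattr() escapes the value and also wraps it in quotes for use as an attribute
attribute_value = quoteattr("3 < 5 & 7 \" '")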
Security Considerations
- Always Escape User-Generated Content: Never trust input from users - always escape it before including in XML
- Escape in the Right Context: Understand whether you’re escaping for attribute values or element content
- Validate After Escaping: Ensure your escaped XML is still valid and well-formed
- Be Aware of Double Escaping: Some frameworks might escape content twice - test for this
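Double escaping is easy to reproduce and to test for. The following sketch applies `xml.sax.saxutils.escape` twice to the same sample string and shows that a single unescape no longer restores the original.

```python
from xml.sax.saxutils import escape, unescape

raw = "Tom & Jerry <3"
once = escape(raw)
twice = escape(once)  # what happens when two layers both escape the value

print(once)   # Tom &amp; Jerry &lt;3
print(twice)  # Tom &amp;amp; Jerry &amp;lt;3

# One unescape restores the original only if the text was escaped once.
assert unescape(once) == raw
assert unescape(twice) == once
```

A round-trip assertion like this in a test suite is one way to catch a framework that escapes content a second time.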
Performance Considerations
- Pre-compile Escape Patterns: For repeated escaping operations, pre-compile regular expressions
- Use Built-in Functions: Prefer language-specific XML escaping functions when available
- Consider CDATA for Large Blocks: For large amounts of text that might contain many special characters, use CDATA sections instead of individual escaping, provided the text never contains the sequence ]]> or that sequence is split across sections
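Wrapping text in CDATA still has to cope with the one forbidden sequence, `]]>`. A common workaround, sketched here with a hypothetical helper named `to_cdata`, is to end the section after `]]` and start a new one before `>`.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two adjacent sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


sample = "if (a[b[0]]> 1 && x < y) { run(); }"
wrapped = to_cdata(sample)
print(wrapped)

# A parser joins the adjacent sections back into the original text.
assert ET.fromstring(f"<code>{wrapped}</code>").text == sample
```

The parser reports the adjacent sections as one run of character data, so consumers of the document see the original text unchanged.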
Tools and Resources
Online Tools
- FreeFormatter XML Escape Tool: https://www.freeformatter.com/xml-escape.html
  - Free online tool to escape or unescape XML documents
  - Handles both named and numeric character references
- LambdaTest XML Escape: https://www.lambdatest.com/free-online-tools/xml-escape
  - Converts plain XML content to escaped HTML
  - Shows both the original and escaped versions
- JSONFormatter XML Escape: https://jsonformatter.org/xml-escape
  - Online tool to escape ampersand, quote, and all special characters
Authoritative Specifications
- W3C XML 1.0 Specification: https://www.w3.org/TR/xml/
  - The definitive source for XML standards and requirements
- W3C Character Escapes in Markup: https://www.w3.org/International/questions/qa-escapes
  - Official guidelines on using character escapes in markup and CSS
- XML Entity Definitions for Characters: https://www.w3.org/2003/entities/2007doc/
  - Official entity definitions and character mappings
Comprehensive Character Reference Lists
- Wikipedia: List of XML and HTML character entity references: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
  - The most comprehensive list of named character entities
  - Includes both XML and HTML entities with Unicode mappings
- Microsoft: XML Character Entities and XAML: https://learn.microsoft.com/en-us/dotnet/desktop/xaml-services/xml-character-entities
  - Microsoft’s documentation on XML character entities
  - Includes practical examples and usage guidelines
- W3Resource: XML Reserved Markup Characters: https://www.w3resource.com/xml/reserved-markup-characters.php
  - Detailed chart of character references and their usage
  - Includes examples and best practices
Conclusion
XML escaping is essential for maintaining well-formed documents and preventing parsing errors. The five core characters to escape are &, <, >, ", and ', with corresponding entity references &amp;, &lt;, &gt;, &quot;, and &apos;. Remember that escaping requirements vary by context - these characters don’t need to be escaped in comments, processing instructions, or CDATA sections.
For comprehensive character escaping needs, use the complete entity reference lists available on Wikipedia and the official W3C specifications. When working with programming languages, prefer built-in escaping functions and always escape user-generated content to prevent security issues. By following these practices and utilizing the available resources, you can ensure your XML documents remain well-formed and interoperable across different systems and platforms.