Determining the charset encoding of an XML document may prove less easy than it seems, and when the XML document is served over HTTP things get a little more hairy.

Current Java libraries and utilities don't do this by default. A JAXP SAX parser may detect the charset encoding of an XML document by looking at the first bytes of the stream, as defined in Section 4.3.3 and Appendix F.1 of the XML 1.0 (Third Edition) specification. But Appendix F.1 is non-normative, and not all Java XML parsers implement it right now. For example, the JAXP SAX parser implementation in J2SE 1.4.2 does not handle Appendix F.1, and Xerces 2.6.2 has only just added support for it. Even so, this does not solve the whole problem: JAXP SAX parsers are not aware of the HTTP transport rules for charset encoding resolution defined by RFC 3023, because they operate on a byte stream or a char stream without any knowledge of the stream's transport protocol, HTTP in this case.

Mark Pilgrim did a very good job explaining how the charset encoding should be determined in "Determining the character encoding of a feed" and "XML on the Web Has Failed".

To isolate developers from this issue, ROME has a special character stream class, the XmlReader. The XmlReader class is a subclass of java.io.Reader that detects the charset encoding of XML documents read from files, input streams, URLs and input streams obtained over HTTP. Ideally this should be built into the JAXP SAX classes, most likely in the InputSource class.
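The core idea, choosing a decoder from the bytes themselves before any characters reach the parser, can be sketched with the standard library alone. The class below is a hypothetical illustration and not ROME's implementation: `BomSniffingReaders` and `open` are made-up names, and the sketch looks only at a byte order mark and falls back to UTF-8, whereas the XmlReader also considers the XML prolog and HTTP headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not ROME's XmlReader: picks a charset from the BOM only.
final class BomSniffingReaders {

    static Reader open(InputStream in) throws IOException {
        // Allow up to 3 bytes to be pushed back after peeking at them.
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream shorter than 3 bytes
            }
            n += r;
        }

        Charset charset = StandardCharsets.UTF_8; // assumed default when no BOM is found
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            bomLength = 3; // UTF-8 BOM
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Give back everything that was read beyond the BOM, so the BOM itself is skipped.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, charset);
    }
}
```

The returned Reader can be handed to anything that accepts a character stream; the caller never has to name a charset.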

Default Lenient Behavior

It is very common for sites, due to improper configuration or lack of knowledge, to declare an invalid charset encoding, that is, invalid according to the XML 1.0 specification and RFC 3023. One example is a mismatch between the implicit charset encoding in the HTTP Content-Type and the explicit charset encoding in the XML prolog.
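To make the mismatch case concrete, here is a small sketch that compares the two declarations. It is an illustration under stated assumptions, not ROME code: `EncodingMismatchCheck` and its methods are hypothetical names, the regular expressions are simplified, and the "implicit" HTTP encoding is modelled as US-ASCII for any text/* type served without a charset parameter.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ROME.
final class EncodingMismatchCheck {

    // charset parameter of a Content-Type header, e.g. text/xml; charset="ISO-8859-1"
    private static final Pattern CT_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // encoding pseudo-attribute of the XML declaration
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Explicit charset if present; otherwise the assumed implicit one for text/* types.
    static String httpEncoding(String contentType) {
        Matcher m = CT_CHARSET.matcher(contentType);
        if (m.find()) {
            return m.group(1).toUpperCase(Locale.ROOT);
        }
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/") ? "US-ASCII" : null;
    }

    // Encoding named in the XML prolog, or null if there is none.
    static String prologEncoding(String xmlStart) {
        Matcher m = XML_ENCODING.matcher(xmlStart);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    // True when both sides name an encoding and the names differ.
    static boolean mismatch(String contentType, String xmlStart) {
        String http = httpEncoding(contentType);
        String prolog = prologEncoding(xmlStart);
        return http != null && prolog != null && !http.equals(prolog);
    }
}
```

A feed served as `text/xml` with no charset parameter, whose prolog declares UTF-8, is reported as a mismatch by this sketch, which is the situation described above.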

Because of this, the ROME XmlReader uses lenient detection by default. This lenient detection works in the following order:

The XmlReader class has 2 constructors that allow strict (non-lenient) charset encoding detection to be performed. These constructors take a lenient boolean flag; the flag should be set to false for strict detection.

The Algorithms per XML 1.0 and RFC 3023 specifications

What follows is a detailed explanation of the algorithms the XmlReader uses to determine the charset encoding. These algorithms first appeared in Tucu's Weblog.

Raw XML charset encoding detection

Detection of the charset encoding of an XML document without external information (for example, reading an XML document from a file). Following Section 4.3.3 and Appendix F.1 of the XML 1.0 specification, the charset encoding of an XML document is determined as follows:

BOMEnc     : byte order mark. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
XMLGuessEnc: best guess using the byte representation of the first bytes of XML declaration
             ('<?xml...?>') if present. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
XMLEnc     : encoding in the XML declaration ('<?xml encoding="..."?>'). Possible values: anything or NULL

if BOMEnc is NULL
  if XMLGuessEnc is NULL or XMLEnc is NULL
    encoding is 'UTF-8'                                                                   [1.0]
  else
  if XMLEnc is 'UTF-16' and (XMLGuessEnc is 'UTF-16BE' or XMLGuessEnc is 'UTF-16LE')
    encoding is XMLGuessEnc                                                               [1.1]
  else
    encoding is XMLEnc                                                                    [1.2]
else
if BOMEnc is 'UTF-8'
  if XMLGuessEnc is not NULL and XMLGuessEnc is not 'UTF-8'
    ERROR, encoding mismatch                                                              [1.3]
  if XMLEnc is not NULL and XMLEnc is not 'UTF-8'
    ERROR, encoding mismatch                                                              [1.4]
  encoding is 'UTF-8'
else
if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
  if XMLGuessEnc is not NULL and XMLGuessEnc is not BOMEnc
    ERROR, encoding mismatch                                                              [1.5]
  if XMLEnc is not NULL and XMLEnc is not 'UTF-16' and XMLEnc is not BOMEnc
    ERROR, encoding mismatch                                                              [1.6]
  encoding is BOMEnc
else
  ERROR, cannot happen given BOMEnc possible values (see above)                           [1.7]

Byte Order Mark encoding and XML guessed encoding detection rules are clearly explained in the XML 1.0 Third Edition Appendix F.1. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.
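The decision table above translates almost line for line into Java. The following is a sketch rather than ROME's source: `RawXmlEncoding`, `detectBom` and `resolve` are hypothetical names, encoding names are assumed to be already normalized to upper case, and errors are reported with an unchecked exception for brevity. The bracketed tags in the comments refer to the branches of the pseudocode.

```java
// Hypothetical sketch of the raw XML detection rules; not ROME's source.
final class RawXmlEncoding {

    // Returns "UTF-8", "UTF-16BE", "UTF-16LE", or null when no such BOM is present.
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    // Any argument may be null; names are assumed to be upper case.
    static String resolve(String bomEnc, String xmlGuessEnc, String xmlEnc) {
        if (bomEnc == null) {
            if (xmlGuessEnc == null || xmlEnc == null) {
                return "UTF-8";                                    // [1.0]
            }
            if (xmlEnc.equals("UTF-16")
                    && (xmlGuessEnc.equals("UTF-16BE") || xmlGuessEnc.equals("UTF-16LE"))) {
                return xmlGuessEnc;                                // [1.1]
            }
            return xmlEnc;                                         // [1.2]
        }
        if (bomEnc.equals("UTF-8")) {
            if (xmlGuessEnc != null && !xmlGuessEnc.equals("UTF-8")) {
                throw mismatch(bomEnc, xmlGuessEnc);               // [1.3]
            }
            if (xmlEnc != null && !xmlEnc.equals("UTF-8")) {
                throw mismatch(bomEnc, xmlEnc);                    // [1.4]
            }
            return "UTF-8";
        }
        if (bomEnc.equals("UTF-16BE") || bomEnc.equals("UTF-16LE")) {
            if (xmlGuessEnc != null && !xmlGuessEnc.equals(bomEnc)) {
                throw mismatch(bomEnc, xmlGuessEnc);               // [1.5]
            }
            if (xmlEnc != null && !xmlEnc.equals("UTF-16") && !xmlEnc.equals(bomEnc)) {
                throw mismatch(bomEnc, xmlEnc);                    // [1.6]
            }
            return bomEnc;
        }
        throw new IllegalStateException("unexpected BOM encoding: " + bomEnc); // [1.7]
    }

    private static IllegalArgumentException mismatch(String bom, String other) {
        return new IllegalArgumentException("encoding mismatch: BOM " + bom + " vs " + other);
    }
}
```

For instance, a document with no BOM whose declaration names ISO-8859-1 resolves through branch [1.2], while a UTF-8 BOM combined with a declaration naming ISO-8859-1 is rejected through branch [1.4].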

XML over HTTP charset encoding detection

Detection of the charset encoding of an XML document with external information (provided by HTTP). Following Section 4.3.3, Appendix F.1 and Appendix F.2 of the XML 1.0 specification, plus RFC 3023, the charset encoding of an XML document served over HTTP is determined as follows:

ContentType: Content-Type HTTP header
CTMime     : MIME type defined in the ContentType
CTEnc      : charset encoding defined in the ContentType, NULL otherwise
BOMEnc     : byte order mark. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
XMLGuessEnc: best guess using the byte representation of the first bytes of XML declaration
             ('<?xml...?>') if present. Possible values: 'UTF-8', 'UTF-16BE', 'UTF-16LE' or NULL
XMLEnc     : encoding in the XML declaration ('<?xml encoding="..."?>'). Possible values: anything or NULL
APP-XML    : RFC 3023 defined 'application/*xml' types
TEXT-XML   : RFC 3023 defined 'text/*xml' types

if CTMime is APP-XML or CTMime is TEXT-XML
  if CTEnc is NULL
    if CTMime is APP-XML
      encoding is determined using the Raw XML charset encoding detection algorithm      [2.0]
    else
    if CTMime is TEXT-XML
      encoding is 'US-ASCII'                                                             [2.1]
  else
  if (CTEnc is 'UTF-16BE' or CTEnc is 'UTF-16LE') and BOMEnc is not NULL
    ERROR, RFC 3023 explicitly forbids this                                              [2.2]
  else
  if CTEnc is 'UTF-16'
    if BOMEnc is 'UTF-16BE' or BOMEnc is 'UTF-16LE'
      encoding is BOMEnc                                                                 [2.3]
    else
      ERROR, missing BOM or encoding mismatch                                            [2.4]
  else
    encoding is CTEnc                                                                    [2.5]
else
  ERROR, handling for other MIME types is undefined                                      [2.6]

Byte Order Mark encoding and XML guessed encoding detection rules are clearly explained in the XML 1.0 Third Edition Appendix F.1. Note that in this algorithm BOMEnc and XMLGuessEnc are restricted to UTF-8 and UTF-16* encodings.
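The HTTP rules can be sketched in Java in the same way. This is an illustration, not ROME's source: `HttpXmlEncoding` and its methods are hypothetical names, the MIME matching is one reading of the RFC 3023 type families, encoding names are assumed to be upper case, and the raw XML detection is passed in as a `Supplier` so that branch [2.0] can delegate to whatever implementation of the raw algorithm is available.

```java
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical sketch of the XML-over-HTTP detection rules; not ROME's source.
final class HttpXmlEncoding {

    // One reading of the RFC 3023 'application/*xml' family.
    static boolean isAppXml(String mime) {
        return mime.equals("application/xml")
                || mime.equals("application/xml-dtd")
                || mime.equals("application/xml-external-parsed-entity")
                || (mime.startsWith("application/") && mime.endsWith("+xml"));
    }

    // One reading of the RFC 3023 'text/*xml' family.
    static boolean isTextXml(String mime) {
        return mime.equals("text/xml")
                || mime.equals("text/xml-external-parsed-entity")
                || (mime.startsWith("text/") && mime.endsWith("+xml"));
    }

    // ctEnc and bomEnc may be null; rawDetection runs the raw XML algorithm on demand.
    static String resolve(String ctMime, String ctEnc, String bomEnc,
                          Supplier<String> rawDetection) {
        String mime = ctMime == null ? "" : ctMime.trim().toLowerCase(Locale.ROOT);
        boolean appXml = isAppXml(mime);
        boolean textXml = isTextXml(mime);
        if (!appXml && !textXml) {
            throw new IllegalArgumentException(
                    "handling for MIME type is undefined: " + ctMime);       // [2.6]
        }
        if (ctEnc == null) {
            if (appXml) {
                return rawDetection.get();                                   // [2.0]
            }
            return "US-ASCII";                                               // [2.1]
        }
        if ((ctEnc.equals("UTF-16BE") || ctEnc.equals("UTF-16LE")) && bomEnc != null) {
            throw new IllegalArgumentException(
                    "BOM not allowed with " + ctEnc + " per RFC 3023");      // [2.2]
        }
        if (ctEnc.equals("UTF-16")) {
            if ("UTF-16BE".equals(bomEnc) || "UTF-16LE".equals(bomEnc)) {
                return bomEnc;                                               // [2.3]
            }
            throw new IllegalArgumentException(
                    "missing BOM or encoding mismatch");                     // [2.4]
        }
        return ctEnc;                                                        // [2.5]
    }
}
```

Note that when the Content-Type carries a charset, the declaration inside the document is never consulted; only the BOM is checked for consistency.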