Stream Support and Encoding
DomTrip parses XML from InputStreams and serializes it to OutputStreams, detecting the character encoding automatically on input and applying it on output.
InputStream Parsing
Automatic Encoding Detection
DomTrip can automatically detect the character encoding of XML documents from InputStreams:
// Encoding detected from:
// 1. Byte Order Mark (BOM)
// 2. XML declaration
// 3. Content analysis
// 4. Default fallback (UTF-8)
String xmlContent = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
try (InputStream inputStream = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8))) {
Document doc = Document.of(inputStream);
String detectedEncoding = doc.encoding(); // "UTF-8", "UTF-16", etc.
}
Encoding Detection Process
The parser follows this detection process:
- BOM Detection: Checks for Byte Order Marks (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE)
- XML Declaration Reading: Parses the encoding attribute from <?xml encoding="..." ?>
- Fallback: Uses the fallback encoding passed to Document.of, or UTF-8 if none is given and no encoding is detected
// With fallback encoding (String)
String xml = "<root>content</root>";
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1));
Document doc = Document.of(inputStream, "ISO-8859-1");
// With fallback encoding (Charset - preferred)
InputStream inputStream2 = new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1));
Document doc2 = Document.of(inputStream2, StandardCharsets.ISO_8859_1);
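The detection order described above can be sketched with the JDK alone. This is an illustration of the idea, not DomTrip's actual implementation: the class EncodingSniffer, its detect method, and the 200-byte prefix limit are all hypothetical, and the declaration lookup only works for ASCII-compatible encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that mirrors the documented order: BOM, XML declaration, fallback. */
final class EncodingSniffer {

    // Matches encoding="..." or encoding='...' inside an XML declaration.
    private static final Pattern DECLARED_ENCODING = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9][A-Za-z0-9._:+-]*)[\"']");

    static Charset detect(byte[] bytes, Charset fallback) {
        // 1. Byte Order Mark. UTF-32LE is tested before UTF-16LE because its
        //    BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        //    The UTF-32 charsets ship with OpenJDK but are not StandardCharsets constants.
        if (startsWith(bytes, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(bytes, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(bytes, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(bytes, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. XML declaration. The declaration itself is ASCII, so decoding the
        //    first bytes as ISO-8859-1 is enough for ASCII-compatible encodings.
        int prefixLength = Math.min(bytes.length, 200);
        String prefix = new String(bytes, 0, prefixLength, StandardCharsets.ISO_8859_1);
        Matcher matcher = DECLARED_ENCODING.matcher(prefix);
        if (matcher.find() && Charset.isSupported(matcher.group(1))) {
            return Charset.forName(matcher.group(1));
        }

        // 3. Fallback supplied by the caller.
        return fallback;
    }

    private static boolean startsWith(byte[] bytes, int... expected) {
        if (bytes.length < expected.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if ((bytes[i] & 0xFF) != expected[i]) return false;
        }
        return true;
    }
}
```

In this sketch a BOM always wins over the declaration, and the fallback is consulted only when neither is present, which matches the order listed above.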
Supported Encodings
DomTrip supports all Java-supported character encodings:
- UTF-8 (default)
- UTF-16 (with BOM detection)
- UTF-32 (with BOM detection)
- ISO-8859-1
- Any encoding supported by Java's Charset class
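Since the supported set is whatever Java's Charset class provides, it can be useful to check a name against the running JVM before passing it on. The sketch below uses only java.nio.charset; the class CharsetLookup and its methods are hypothetical helpers, not part of DomTrip.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for checking encoding names against the running JVM. */
final class CharsetLookup {

    /** Returns true when the JVM can resolve the given encoding name or alias. */
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null and for syntactically illegal names such as "UTF 8".
            return false;
        }
    }

    /** Resolves an alias such as "latin1" to its canonical name. */
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }
}
```

For example, CharsetLookup.canonicalName("latin1") resolves to ISO-8859-1 because the JDK registers latin1 as an alias, while CharsetLookup.isUsable("UTF 8") is false because the name contains a space.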
XML Declaration Parsing
The parser extracts and applies XML declaration attributes:
String xml = "<?xml version=\"1.1\" encoding=\"UTF-8\" standalone=\"yes\"?><root/>";
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
Document doc = Document.of(inputStream);
// All attributes are parsed and applied
Assertions.assertEquals("1.1", doc.version());
Assertions.assertEquals("UTF-8", doc.encoding());
Assertions.assertTrue(doc.isStandalone());
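To make the three declaration attributes concrete, here is a minimal sketch that extracts them with regular expressions. It is an assumption-laden illustration rather than DomTrip's parser: the class XmlDeclaration and its parse method are hypothetical, and the patterns expect the declaration at the very start of the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that reads version, encoding and standalone from an XML declaration. */
final class XmlDeclaration {

    // The declaration must be the first thing in the document.
    private static final Pattern DECLARATION =
            Pattern.compile("^<\\?xml\\s+(.*?)\\?>", Pattern.DOTALL);

    // name="value" or name='value'; the closing quote must match the opening one.
    private static final Pattern PSEUDO_ATTRIBUTE =
            Pattern.compile("(\\w+)\\s*=\\s*([\"'])(.*?)\\2");

    /** Returns the declaration attributes in document order, or an empty map if none. */
    static Map<String, String> parse(String xml) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher declaration = DECLARATION.matcher(xml);
        if (declaration.find()) {
            Matcher attribute = PSEUDO_ATTRIBUTE.matcher(declaration.group(1));
            while (attribute.find()) {
                attributes.put(attribute.group(1), attribute.group(3));
            }
        }
        return attributes;
    }
}
```

A document without a declaration yields an empty map, which is the case where a caller would rely on defaults.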
OutputStream Serialization
Document Serialization
Serialize documents to OutputStreams with proper encoding:
Document doc = Document.of(xmlString);
// The calls below are alternatives: each one writes the complete document.
// Use document's encoding
OutputStream outputStream = new FileOutputStream("output.xml");
doc.toXml(outputStream);
// Specify encoding explicitly (String)
doc.toXml(outputStream, "UTF-16");
// Specify encoding explicitly (Charset - preferred)
doc.toXml(outputStream, StandardCharsets.UTF_16);
Serializer with Encoding
Use the Serializer class for more control:
Serializer serializer = new Serializer();
// Use document's encoding
serializer.serialize(doc, outputStream);
// Specify encoding (String)
serializer.serialize(doc, outputStream, "ISO-8859-1");
// Specify encoding (Charset - preferred)
serializer.serialize(doc, outputStream, StandardCharsets.ISO_8859_1);
Node Serialization
Individual nodes can also be serialized to OutputStreams:
Element element = doc.root();
// Serialize node with UTF-8
serializer.serialize(element, outputStream);
// Serialize node with specific encoding (String)
serializer.serialize(element, outputStream, "UTF-16");
// Serialize node with specific encoding (Charset - preferred)
serializer.serialize(element, outputStream, StandardCharsets.UTF_16);
Round-Trip Processing
DomTrip maintains perfect round-trip fidelity when processing streams:
// Parse from InputStream and serialize to OutputStream, closing both streams
try (InputStream inputStream = new FileInputStream("input.xml");
     OutputStream outputStream = new FileOutputStream("output.xml")) {
    Document doc = Document.of(inputStream);
    // Make modifications
    Editor editor = new Editor(doc);
    editor.addElement(doc.root(), "newElement", "content");
    // Serialize with the same encoding that was detected on input
    doc.toXml(outputStream);
}
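One way to check round-trip fidelity for an unmodified document is to compare the input bytes with the serialized bytes. The sketch below uses only the JDK; RoundTripCheck and its methods are hypothetical names, and the byte arrays would come from your own input and from a ByteArrayOutputStream passed to toXml.

```java
import java.util.Arrays;

/** Hypothetical helper for comparing original and serialized bytes. */
final class RoundTripCheck {

    /** Returns -1 when the arrays are identical, otherwise the first differing index. */
    static int firstDifference(byte[] original, byte[] written) {
        return Arrays.mismatch(original, written);
    }

    /** Produces a short human-readable summary of the comparison. */
    static String describe(byte[] original, byte[] written) {
        int index = firstDifference(original, written);
        return index < 0 ? "byte-identical" : "first difference at byte " + index;
    }
}
```

When the document has been edited, as in the example above, the reported offset shows where the output starts to diverge from the input.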
Encoding Consistency
Automatic Encoding Preservation
When parsing from InputStream, the document's encoding property is automatically set:
// Document with UTF-16 encoding
InputStream utf16Stream = new ByteArrayInputStream(
xmlString.getBytes(StandardCharsets.UTF_16));
Document doc = Document.of(utf16Stream);
assert doc.encoding().equals("UTF-16");
// Serialization uses the same encoding
OutputStream outputStream = new ByteArrayOutputStream();
doc.toXml(outputStream); // Automatically uses UTF-16
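The UTF-16 example works because Java's UTF-16 charset writes a big-endian Byte Order Mark when encoding, which gives BOM detection something to find. The sketch below shows the leading bytes for the three UTF-16 variants; Utf16BomDemo and hexPrefix are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of which UTF-16 charsets emit a BOM when encoding. */
final class Utf16BomDemo {

    /** Formats the first bytes of the array as space-separated hex pairs. */
    static String hexPrefix(byte[] bytes, int count) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < Math.min(count, bytes.length); i++) {
            if (i > 0) hex.append(' ');
            hex.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String xml = "<root/>";
        // UTF_16 prepends FE FF; the explicit-endian variants write no BOM.
        System.out.println(hexPrefix(xml.getBytes(StandardCharsets.UTF_16), 4));   // FE FF 00 3C
        System.out.println(hexPrefix(xml.getBytes(StandardCharsets.UTF_16BE), 4)); // 00 3C 00 72
        System.out.println(hexPrefix(xml.getBytes(StandardCharsets.UTF_16LE), 4)); // 3C 00 72 00
    }
}
```

Input encoded with UTF_16BE or UTF_16LE carries no BOM, so it depends on the later detection steps or on a fallback encoding.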
Encoding Override
You can override the encoding during serialization:
// Parse with one encoding
Document doc = Document.of(inputStream); // UTF-8 detected
// Serialize with different encoding (String)
doc.toXml(outputStream, "UTF-16");
// Serialize with different encoding (Charset - preferred)
doc.toXml(outputStream, StandardCharsets.UTF_16);
Special Characters and BOMs
BOM Handling
DomTrip automatically detects and handles Byte Order Marks:
// UTF-8 with BOM
byte[] bomBytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
byte[] xmlBytes = xmlString.getBytes(StandardCharsets.UTF_8);
byte[] xmlWithBom = new byte[bomBytes.length + xmlBytes.length];
System.arraycopy(bomBytes, 0, xmlWithBom, 0, bomBytes.length);
System.arraycopy(xmlBytes, 0, xmlWithBom, bomBytes.length, xmlBytes.length);
InputStream inputStream = new ByteArrayInputStream(xmlWithBom);
Document doc = Document.of(inputStream);
// BOM is detected and UTF-8 encoding is used
Special Characters
DomTrip properly handles special characters across different encodings:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
"<root><text>Special: àáâãäå èéêë</text></root>";
// Round-trip preserves special characters
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
Document doc = Document.of(inputStream);
OutputStream outputStream = new ByteArrayOutputStream();
doc.toXml(outputStream);
// Special characters are preserved
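Whether special characters survive depends on the target encoding being able to represent them. The following JDK-only sketch encodes and decodes a string with one charset to show the difference; SpecialCharacterDemo and roundTrip are hypothetical names, and the sample text uses Unicode escapes for à and é so the source compiles under any file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of how accented characters fare in different encodings. */
final class SpecialCharacterDemo {

    /** Encodes the text to bytes and decodes it again with the same charset. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "\u00E0\u00E9"; // the characters à and é
        // UTF-8 needs two bytes per accented character, ISO-8859-1 one.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 4
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 2
        // US-ASCII cannot represent them and substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));        // ??
    }
}
```

This is why overriding the output encoding deserves care: choosing a charset that cannot represent the document's characters loses data at the byte level.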
Error Handling
Common Exceptions
Stream operations can throw DomTripException when reading, parsing, or encoding fails:
try {
Document doc = Document.of(inputStream);
doc.toXml(outputStream);
} catch (DomTripException e) {
if (e.getCause() instanceof IOException) {
// Handle I/O errors
System.err.println("I/O error: " + e.getMessage());
} else {
// Handle parsing/encoding errors
System.err.println("XML error: " + e.getMessage());
}
}
Invalid Encoding
try {
serializer.serialize(doc, outputStream, "INVALID-ENCODING");
} catch (DomTripException e) {
System.err.println("Unsupported encoding: " + e.getMessage());
}
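An encoding name can fail in two different ways at the JDK level, and telling them apart makes error messages more useful. The sketch below classifies a name using java.nio.charset only; EncodingNameCheck and classify are hypothetical, and how DomTrip wraps these failures is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper that explains why an encoding name is rejected. */
final class EncodingNameCheck {

    static String classify(String name) {
        try {
            // forName accepts aliases in any letter case and returns the canonical charset.
            return "ok: " + Charset.forName(name).name();
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that no charset name may contain.
            return "malformed name";
        } catch (UnsupportedCharsetException e) {
            // The name is well formed, but this JVM has no such charset.
            return "unsupported on this JVM";
        }
    }
}
```

With this helper, "INVALID-ENCODING" is reported as unsupported because the name is syntactically legal, whereas "UTF 8" is reported as malformed.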
Charset vs String Encoding
Preferred: Charset Objects
DomTrip supports both String-based encoding names and Charset objects, but Charset objects are preferred:
// ✅ Preferred - Type-safe, no invalid encoding names
Document doc = Document.of(inputStream, StandardCharsets.UTF_8);
doc.toXml(outputStream, StandardCharsets.UTF_16);
// ⚠️ Acceptable but less safe - an invalid name only fails at runtime
Document doc2 = Document.of(inputStream, "UTF-8");
doc2.toXml(outputStream, "UTF-16");
Benefits of Charset Objects
- Type Safety: StandardCharsets constants are checked at compile time, so a misspelled constant does not compile
- Performance: No charset lookup by name at call time
- Clarity: Clear intent and better IDE support
- Error Prevention: Eliminates typos in encoding names
Best Practices
1. Use Try-With-Resources
// ✅ Proper resource management
try (InputStream inputStream = new FileInputStream("input.xml");
OutputStream outputStream = new FileOutputStream("output.xml")) {
Document doc = Document.of(inputStream);
doc.toXml(outputStream);
}
2. Prefer Charset Objects
// ✅ Type-safe Charset objects
Document doc = Document.of(inputStream, StandardCharsets.UTF_8);
doc.toXml(outputStream, StandardCharsets.UTF_16);
// ❌ String encoding names (error-prone)
Document doc2 = Document.of(inputStream, "UTF-8");
doc2.toXml(outputStream, "UTF-16");
3. Let DomTrip Detect Encoding
// ✅ Automatic detection
Document doc = Document.of(inputStream);
// ❌ Unnecessary manual specification (pass a fallback only if needed)
Document doc2 = Document.of(inputStream, StandardCharsets.UTF_8);
4. Preserve Original Encoding
// ✅ Maintain consistency
Document doc = Document.of(inputStream);
doc.toXml(outputStream); // Uses detected encoding
// ❌ Unnecessary encoding changes
doc.toXml(outputStream, StandardCharsets.UTF_16); // Only if intentional
5. Handle Large Files Efficiently
// ✅ NIO streams with try-with-resources (the whole document is still loaded into memory)
try (InputStream inputStream = Files.newInputStream(largePath);
OutputStream outputStream = Files.newOutputStream(outputPath)) {
Document doc = Document.of(inputStream);
// Process document...
doc.toXml(outputStream);
}
Performance Considerations
- Memory Usage: The entire InputStream is read into memory for encoding detection
- Encoding Detection: Multiple encoding attempts may impact performance for edge cases
- BOM Detection: Fast and occurs first to minimize encoding attempts
- Large Files: Because the input is read in full before parsing, memory use grows with file size; budget for this when processing very large XML files
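Because the whole input ends up in memory, one defensive option is to cap how much is read before handing the data to the parser. The sketch below is a JDK-only illustration; BoundedRead and readAtMost are hypothetical names, and the limit you choose is application-specific.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical guard that refuses inputs larger than a chosen limit. */
final class BoundedRead {

    /** Reads the whole stream, failing if it holds more than maxBytes bytes. */
    static byte[] readAtMost(InputStream in, int maxBytes) {
        if (maxBytes < 0 || maxBytes == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("maxBytes out of range: " + maxBytes);
        }
        try {
            // Ask for one byte beyond the limit so that oversize input is detectable.
            byte[] data = in.readNBytes(maxBytes + 1);
            if (data.length > maxBytes) {
                throw new IllegalStateException("input exceeds " + maxBytes + " bytes");
            }
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned array can be wrapped in a ByteArrayInputStream and passed to Document.of as in the earlier examples.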
Modern File-Based APIs
Path-based Loading (Recommended)
// Load with automatic encoding detection
Document doc = Document.of(path);
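String result = doc.toXml();
// Files.writeString encodes as UTF-8, which may not match the document's declared encoding
Files.writeString(outputPath, result);
// Or write through an OutputStream so the document's encoding is used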
try (OutputStream outputStream = Files.newOutputStream(outputPath)) {
doc.toXml(outputStream);
}
Stream-based Processing
try (InputStream inputStream = Files.newInputStream(path);
OutputStream outputStream = Files.newOutputStream(outputPath)) {
Document doc = Document.of(inputStream);
doc.toXml(outputStream);
}
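Writing through an OutputStream encodes the output in the document's encoding and does not require holding the serialized XML as a String in your own code. Parsing still reads the entire input into memory, so very large files remain memory-bound.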
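The encoding difference between writing a String and writing bytes can be seen with the JDK alone: Files.writeString encodes as UTF-8 unless a charset is passed. The sketch below writes the same text both ways to a temporary file; WriteStringDemo and writeAndReadBack are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo of the charset that Files.writeString applies. */
final class WriteStringDemo {

    /** Writes text to a temp file (UTF-8 when charset is null) and returns the bytes on disk. */
    static byte[] writeAndReadBack(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("write-string-demo", ".xml");
            try {
                if (charset == null) {
                    Files.writeString(file, text);          // defaults to UTF-8
                } else {
                    Files.writeString(file, text, charset); // explicit charset
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a document that declares ISO-8859-1 and contains é, the default call stores that character as two UTF-8 bytes, so the file no longer matches its own declaration; passing the matching charset, or writing through an OutputStream, avoids the mismatch.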