Lossless Parsing
DomTrip's core strength is its ability to parse XML documents while preserving the details of the original formatting: comments, whitespace, entity encoding, quote styles, CDATA sections, and processing instructions. This enables true round-trip editing, where unmodified sections are written back unchanged.
What Gets Preserved
1. Comments (Including Multi-line)
String xmlWithComments =
"""
<project>
<!-- Main project coordinates -->
<groupId>com.example</groupId>
<artifactId>my-app</artifactId> <!-- Application name -->
<!-- Version information -->
<version>1.0.0</version>
</project>
""";
Document doc = Document.of(xmlWithComments);
Editor editor = new Editor(doc);
// Comments are preserved in their exact positions
String result = editor.toXml();
Assertions.assertEquals(xmlWithComments, result);
2. Whitespace and Indentation
// 'dependency' is an Element already looked up from the document;
// it carries the whitespace that preceded it in the source
Element element = dependency;
String originalWhitespace = element.precedingWhitespace();
// Comment out - preserves the element's whitespace
Comment comment = editor.commentOutElement(element);
// The comment will have the same indentation as the original element
Assertions.assertEquals(originalWhitespace, comment.precedingWhitespace());
3. Entity Encoding
String xmlWithEntities = """
<message>Hello &amp; goodbye &lt;world&gt;</message>
""";
Document doc = Document.of(xmlWithEntities);
Editor editor = new Editor(doc);
Element message = doc.root();
// For your code - entities are decoded
String decoded = message.textContent(); // "Hello & goodbye <world>"
// For serialization - the API handles entity encoding automatically,
// so the output keeps the original encoded form
String result = editor.toXml(); // still contains "&amp;" and "&lt;world&gt;"
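To make the encoding step concrete, here is a minimal sketch of how decoded text could be escaped before it is written into element content. `EntityEncoder` and `encodeText` are hypothetical names used only for illustration; this is not DomTrip's implementation, and it covers only the three characters that matter in element text.

```java
/** Hypothetical helper showing how text could be escaped when new content is written. */
final class EntityEncoder {
    /** Escapes characters that are unsafe in element text; '&' must be handled first. */
    static String encodeText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints: Hello &amp; goodbye &lt;world&gt;
        System.out.println(encodeText("Hello & goodbye <world>"));
    }
}
```

Escaping `&` first matters: doing it last would re-escape the ampersands introduced by the other two replacements.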
4. Attribute Quote Styles
String xmlWithMixedQuotes =
"""
<dependency scope='test' optional="true" classifier='sources'/>
""";
Document doc = Document.of(xmlWithMixedQuotes);
Editor editor = new Editor(doc);
// Quote styles are preserved exactly
String result = editor.toXml();
Assertions.assertEquals(xmlWithMixedQuotes, result);
5. CDATA Sections
String xmlWithCData =
"""
<script>
<![CDATA[
function example() {
if (x < y && y > z) {
return "complex & special chars";
}
}
]]>
</script>
""";
Document doc = Document.of(xmlWithCData);
Editor editor = new Editor(doc);
// CDATA sections are preserved exactly
String result = editor.toXml();
Assertions.assertTrue(result.contains("<![CDATA["));
Assertions.assertTrue(result.contains("]]>"));
Assertions.assertTrue(result.contains("x < y && y > z"));
6. Processing Instructions
String xml =
"""
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<document>
<?custom-instruction data="value"?>
<content>text</content>
</document>
""";
Document doc = Document.of(xml);
Editor editor = new Editor(doc);
// Processing instructions with data are preserved exactly
String result = editor.toXml();
Assertions.assertTrue(result.contains("<?xml-stylesheet type=\"text/xsl\" href=\"style.xsl\"?>"));
Assertions.assertTrue(result.contains("<?custom-instruction data=\"value\"?>"));
How It Works
DomTrip achieves lossless parsing through several key techniques:
1. Dual Content Storage
Each text node stores both the decoded content (for programmatic access) and the raw content (for preservation):
// Internal representation
Text textNode = new Text(
    "decoded content: < & >", // For your code to use
    "raw content: &lt; &amp; &gt;" // For serialization
);
// You work with decoded content
String content = textNode.getTextContent(); // "decoded content: < & >"
// Serialization uses raw content to preserve entities
String xml = textNode.toXml(); // "raw content: &lt; &amp; &gt;"
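The dual-storage idea can be sketched in plain Java without any library. The `DualText` class and its `decode` helper below are hypothetical and are not DomTrip's real `Text` class; the sketch assumes only the five predefined XML entities need resolving and ignores numeric character references.

```java
/** Hypothetical sketch of dual content storage; not DomTrip's real Text class. */
final class DualText {
    private final String raw;     // text exactly as it appeared in the source
    private final String decoded; // same text with predefined entities resolved

    DualText(String raw) {
        this.raw = raw;
        this.decoded = decode(raw);
    }

    /** Decoded view for application code. */
    String textContent() {
        return decoded;
    }

    /** Raw view for serialization, character for character as parsed. */
    String toXml() {
        return raw;
    }

    /** Resolves the five predefined entities; &amp; goes last to avoid double decoding. */
    static String decode(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        DualText text = new DualText("a &lt; b &amp;&amp; c &gt; d");
        System.out.println(text.textContent()); // a < b && c > d
        System.out.println(text.toXml());       // a &lt; b &amp;&amp; c &gt; d
    }
}
```

Keeping the raw string rather than re-encoding on output is what lets unusual but valid spellings in the source survive unchanged.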
2. Attribute Metadata
Attributes store comprehensive formatting information:
public class Attribute {
    private String value; // The actual value
    private QuoteStyle quoteStyle; // SINGLE or DOUBLE
    private String whitespace; // Surrounding whitespace
    private String rawValue; // Original encoded value
}
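As a rough illustration of how such metadata could drive serialization, the sketch below rebuilds an attribute from its stored parts. `AttributeSketch` and its fields are hypothetical names, and the sketch assumes the stored whitespace is the whitespace that preceded the attribute; DomTrip's actual class may differ.

```java
/** Hypothetical sketch of attribute formatting metadata; names are illustrative only. */
final class AttributeSketch {
    enum QuoteStyle {
        SINGLE('\''), DOUBLE('"');

        final char ch;

        QuoteStyle(char ch) {
            this.ch = ch;
        }
    }

    final String name;
    final String rawValue;       // value as written, entities still encoded
    final QuoteStyle quoteStyle; // quote character seen in the source
    final String whitespace;     // whitespace that preceded the attribute (assumption)

    AttributeSketch(String name, String rawValue, QuoteStyle quoteStyle, String whitespace) {
        this.name = name;
        this.rawValue = rawValue;
        this.quoteStyle = quoteStyle;
        this.whitespace = whitespace;
    }

    /** Reassembles the attribute exactly as it was written. */
    String toXml() {
        return whitespace + name + "=" + quoteStyle.ch + rawValue + quoteStyle.ch;
    }

    public static void main(String[] args) {
        AttributeSketch scope = new AttributeSketch("scope", "test", QuoteStyle.SINGLE, " ");
        AttributeSketch optional = new AttributeSketch("optional", "true", QuoteStyle.DOUBLE, " ");
        // prints: <dependency scope='test' optional="true"/>
        System.out.println("<dependency" + scope.toXml() + optional.toXml() + "/>");
    }
}
```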
3. Whitespace Tracking
Every node tracks its surrounding whitespace:
public abstract class Node {
    protected String precedingWhitespace; // Whitespace before the node
    // There is no followingWhitespace field: whitespace after a node
    // is stored as the precedingWhitespace of the next node
}
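The preceding-whitespace model can be sketched as a list of children that each carry the whitespace in front of them. Everything here is hypothetical (`WhitespaceSketch`, its nested `Node`, and the `trailing` parameter that stands in for the whitespace before the parent's closing tag); it illustrates the idea rather than DomTrip's internals.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the preceding-whitespace model; not DomTrip's real Node class. */
final class WhitespaceSketch {
    static final class Node {
        final String precedingWhitespace; // whitespace between the previous node and this one
        final String markup;              // the node's own text, e.g. "<a/>"

        Node(String precedingWhitespace, String markup) {
            this.precedingWhitespace = precedingWhitespace;
            this.markup = markup;
        }
    }

    /** Writes each child after its own whitespace, then the whitespace before the close tag. */
    static String serialize(List<Node> children, String trailing) {
        StringBuilder out = new StringBuilder();
        for (Node child : children) {
            out.append(child.precedingWhitespace).append(child.markup);
        }
        return out.append(trailing).toString();
    }

    public static void main(String[] args) {
        List<Node> children = new ArrayList<>();
        children.add(new Node("\n  ", "<groupId>com.example</groupId>"));
        children.add(new Node("\n\n    ", "<version>1.0.0</version>"));
        System.out.println("<project>" + serialize(children, "\n") + "</project>");

        // Removing a node takes its leading whitespace with it; siblings keep theirs.
        children.remove(0);
        System.out.println("<project>" + serialize(children, "\n") + "</project>");
    }
}
```

Because each node owns only the whitespace in front of it, inserting or removing a node never requires rewriting the indentation of its neighbours.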
4. Modification Tracking
Nodes track whether they've been modified to determine serialization strategy:
// Unmodified nodes use original formatting
if (!node.isModified() && !node.getOriginalContent().isEmpty()) {
    return node.getOriginalContent();
}
// Modified nodes are rebuilt with preserved style
return buildFromScratch(node);
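A self-contained sketch of this strategy is shown below. `TrackedElement` is a hypothetical class, and its rebuild path is deliberately simplified: it emits canonical markup, whereas the text above says DomTrip rebuilds modified nodes with their style preserved.

```java
/** Hypothetical sketch of modification tracking; not DomTrip's real Element class. */
final class TrackedElement {
    private final String name;
    private final String originalContent; // exact source text; empty for nodes created in code
    private String text;
    private boolean modified;

    TrackedElement(String name, String text, String originalContent) {
        this.name = name;
        this.text = text;
        this.originalContent = originalContent;
    }

    void setText(String newText) {
        this.text = newText;
        this.modified = true; // from now on the cached source text is stale
    }

    String toXml() {
        if (!modified && !originalContent.isEmpty()) {
            return originalContent; // untouched: replay the source verbatim
        }
        return "<" + name + ">" + text + "</" + name + ">"; // simplified rebuild
    }

    public static void main(String[] args) {
        TrackedElement version = new TrackedElement("version", "1.0.0", "<version >1.0.0</version >");
        System.out.println(version.toXml()); // <version >1.0.0</version >
        version.setText("2.0.0");
        System.out.println(version.toXml()); // <version>2.0.0</version>
    }
}
```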
Round-Trip Verification
You can verify lossless parsing with this simple test:
// Sample document with a declaration, a comment, and nested elements
String complexXml =
"""
<?xml version="1.0" encoding="UTF-8"?>
<!-- Configuration file -->
<config>
<database>
<host>localhost</host>
<port>5432</port>
</database>
</config>
""";
// First pass: parse the string and serialize it back
Document doc = Document.of(complexXml);
Editor editor = new Editor(doc);
String result = editor.toXml();
// Load again to verify round-trip preservation
Document doc2 = Document.of(result);
Editor editor2 = new Editor(doc2);
String result2 = editor2.toXml();
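// The first pass should reproduce the input exactly,
// and a second pass should change nothing further
Assertions.assertEquals(complexXml, result);
Assertions.assertEquals(result, result2);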
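When a round-trip assertion fails on a large document, it helps to know where the two strings first diverge. The helper below is a small stdlib-only sketch for that purpose; `RoundTripDiff` and `firstDifference` are hypothetical names and are not part of DomTrip.

```java
/** Hypothetical test helper for locating the first round-trip mismatch. */
final class RoundTripDiff {
    /** Returns the index of the first differing character, or -1 if the strings are equal. */
    static int firstDifference(String expected, String actual) {
        int limit = Math.min(expected.length(), actual.length());
        for (int i = 0; i < limit; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        // Same prefix: either equal, or one string is longer than the other.
        return expected.length() == actual.length() ? -1 : limit;
    }

    public static void main(String[] args) {
        String original = "<a attr='x'/>";
        String normalized = "<a attr=\"x\"/>";
        System.out.println(firstDifference(original, original));   // -1
        System.out.println(firstDifference(original, normalized)); // 8
    }
}
```

In a test, the returned index can be included in the failure message so that a quote-style or whitespace change is easy to spot.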
Performance Considerations
Lossless parsing trades some memory and parse time for fidelity, because formatting metadata has to be collected and stored:
- Memory overhead: ~20-30% compared to traditional parsers
- Parse time: ~10-15% slower due to metadata collection
- Serialization: Faster for unmodified sections (the original text is reused), slower for modified sections (they are rebuilt)
Memory Usage Example
// Traditional parser memory usage
Document traditionalDoc = traditionalParser.parse(xml);
// Memory: ~1x base size
// DomTrip memory usage
Document domtripDoc = domtripParser.parse(xml);
// Memory: ~1.3x base size (includes formatting metadata)
Limitations
While DomTrip preserves almost everything, there are a few edge cases:
- DTD Internal Subsets: Complex DTD declarations may be simplified
- Exotic Encodings: Some rare character encodings may be normalized
- XML Declaration Order: Attribute order in XML declarations may be standardized
Best Practices
1. Use for Editing Scenarios
// ✅ Perfect for editing existing files
String existingConfigXml = createConfigXml();
Document doc = Document.of(existingConfigXml);
Editor editor = new Editor(doc);
Element root = editor.root();
editor.addElement(root, "newSetting", "value");
String result = editor.toXml();
// Result preserves all original formatting
2. Verify Round-Trip in Tests
// ✅ Always test round-trip preservation
@Test
void testConfigurationEditing() {
    String original = loadTestXml();
    Document doc = Document.of(original);
    Editor editor = new Editor(doc);
    Element root = editor.root();
    // Make changes...
    editor.addElement(root, "test", "value");
    // Verify only intended changes occurred
    String result = editor.toXml();
    assertThat(result).contains("<test>value</test>");
    assertThat(countLines(result)).isEqualTo(countLines(original) + 1);
}
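The test above relies on a `countLines` helper that is not shown. One possible version, assuming Java 11 or later, is sketched below; the `LineCounter` class name is hypothetical.

```java
/** Hypothetical helper backing the countLines call used in the test above. */
final class LineCounter {
    /** Counts lines, treating \n, \r and \r\n as line terminators. */
    static long countLines(String text) {
        return text.lines().count();
    }

    public static void main(String[] args) {
        System.out.println(countLines("<a>\n  <b/>\n</a>\n")); // 3
    }
}
```

The line-count check only holds when the new element lands on a line of its own, so it is a coarse guard rather than a full diff.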
3. Handle Large Files Carefully
// ✅ For large files, consider streaming or chunking
String xmlContent = createConfigXml();
int charCount = xmlContent.length(); // UTF-16 code units, not bytes on disk
if (charCount > 10_000_000) { // roughly 10 MB for mostly-ASCII XML
    // Consider alternative approaches for very large files
    System.out.println("Large file detected, consider streaming approach");
}
// For normal-sized files, DomTrip works efficiently
Document doc = Document.of(xmlContent);
Editor editor = new Editor(doc);
String result = editor.toXml();
Comparison with Other Libraries
| Feature | DomTrip | DOM4J | JDOM | Java DOM |
|---|---|---|---|---|
| Comment preservation | ✅ Perfect | ✅ Yes | ✅ Yes | ✅ Yes |
| Between-element whitespace | ✅ Exact | ⚠️ Partial | ✅ Yes* | ⚠️ Limited |
| In-element whitespace | ✅ Exact | ❌ Lost | ⚠️ Configurable** | ⚠️ Limited |
| Entity preservation | ✅ Perfect | ❌ Decoded | ❌ Decoded | ❌ Decoded |
| Quote style preservation | ✅ Perfect | ❌ Normalized | ❌ Normalized | ❌ Normalized |
| Attribute order preservation | ✅ Perfect | ❌ Lost | ❌ Lost | ❌ Lost |
| Processing instructions | ✅ Perfect | ✅ Yes | ✅ Yes | ✅ Yes |
| CDATA preservation | ✅ Perfect | ✅ Yes | ✅ Yes | ✅ Yes |
| Round-trip fidelity | ✅ 100% | ❌ ~70% | ⚠️ ~80%*** | ❌ ~75% |
* JDOM: Use Format.getRawFormat() to preserve original whitespace between elements
** JDOM: Configure with TextMode.PRESERVE to maintain text content whitespace
*** JDOM: Higher fidelity possible with careful configuration, but still loses some formatting details
Key Insight: While other libraries can preserve individual aspects of formatting, DomTrip is unique in preserving all formatting details simultaneously without requiring special configuration or losing any information during round-trip operations.