Lossless Parsing

DomTrip's core strength is parsing XML documents while preserving the original formatting: comments, whitespace, entity encoding, attribute quote styles, CDATA sections, and processing instructions. This enables true round-trip editing, in which unmodified sections are written back exactly as they were read.

What Gets Preserved

1. Comments (Including Multi-line)

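As a rough illustration of what comment preservation means, the sketch below cuts XML text into chunks so that every comment, including a multi-line one, is kept verbatim as its own chunk. `CommentScanner` is a hypothetical helper written only for this page and uses nothing but the JDK; it is not DomTrip's parser, and it ignores cases such as `<!--` appearing inside a CDATA section.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DomTrip: splits XML text into comment and non-comment chunks. */
final class CommentScanner {

    /**
     * Returns the input cut into chunks, where every comment ("<!-- ... -->")
     * is its own chunk. No character is dropped or altered.
     */
    static List<String> split(String xml) {
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (pos < xml.length()) {
            int start = xml.indexOf("<!--", pos);
            if (start < 0) {
                chunks.add(xml.substring(pos)); // no more comments: keep the tail
                break;
            }
            if (start > pos) {
                chunks.add(xml.substring(pos, start)); // markup before the comment
            }
            int end = xml.indexOf("-->", start + 4);
            int stop = (end < 0) ? xml.length() : end + 3; // unterminated comment runs to the end
            chunks.add(xml.substring(start, stop));
            pos = stop;
        }
        return chunks;
    }
}
```

Because each chunk is a substring of the input, joining the chunks with `String.join("", chunks)` reproduces the document, line breaks and indentation inside comments included.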

2. Whitespace and Indentation

String xmlWithWhitespace =
        """
    <project>

        <groupId>com.example</groupId>

        <artifactId>my-app</artifactId>

    </project>
    """;

Document doc = Document.of(xmlWithWhitespace);
Editor editor = new Editor(doc);

// Whitespace between elements is preserved exactly
String result = editor.toXml();

// All blank lines and spacing are maintained
Assertions.assertEquals(xmlWithWhitespace, result);

3. Entity Encoding

String xmlWithEntities = """
    <message>Hello &amp; goodbye &lt;world&gt;</message>
    """;

Document doc = Document.of(xmlWithEntities);
Editor editor = new Editor(doc);

Element message = doc.root();

// For your code - entities are decoded
String decoded = message.textContent(); // "Hello & goodbye <world>"

// For serialization - entities are preserved in the XML output
String result = editor.toXml();

// The original entity references survive the round trip
Assertions.assertTrue(result.contains("Hello &amp; goodbye &lt;world&gt;"));

4. Attribute Quote Styles

String xmlWithMixedQuotes =
        """
    <dependency scope='test' optional="true" classifier='sources'/>
    """;

Document doc = Document.of(xmlWithMixedQuotes);
Editor editor = new Editor(doc);

// Quote styles are preserved exactly
String result = editor.toXml();

Assertions.assertEquals(xmlWithMixedQuotes, result);

5. CDATA Sections

String xmlWithCData =
        """
    <script>
        <![CDATA[
        function example() {
            if (x < y && y > z) {
                return "complex & special chars";
            }
        }
        ]]>
    </script>
    """;

Document doc = Document.of(xmlWithCData);
Editor editor = new Editor(doc);

// CDATA sections are preserved exactly
String result = editor.toXml();

Assertions.assertTrue(result.contains("<![CDATA["));
Assertions.assertTrue(result.contains("]]>"));
Assertions.assertTrue(result.contains("x < y && y > z"));

6. Processing Instructions

String xml =
        """
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="style.xsl"?>
    <document>
        <?custom-instruction data="value"?>
        <content>text</content>
    </document>
    """;

Document doc = Document.of(xml);
Editor editor = new Editor(doc);

// Processing instructions with data are preserved exactly
String result = editor.toXml();

Assertions.assertTrue(result.contains("<?xml-stylesheet type=\"text/xsl\" href=\"style.xsl\"?>"));
Assertions.assertTrue(result.contains("<?custom-instruction data=\"value\"?>"));

How It Works

DomTrip achieves lossless parsing through several key techniques:

1. Dual Content Storage

Each text node stores both the decoded content (for programmatic access) and the raw content (for preservation):

// Internal representation (conceptual)
Text textNode = new Text(
    "a < b & c > d",             // Decoded: for your code to use
    "a &lt; b &amp; c &gt; d"    // Raw: for serialization
);

// You work with decoded content
String content = textNode.getTextContent(); // "a < b & c > d"

// Serialization uses raw content to preserve entities
String xml = textNode.toXml(); // "a &lt; b &amp; c &gt; d"
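To make the idea concrete, here is a minimal, self-contained sketch of dual storage that runs on the JDK alone. `DualText` and its methods are hypothetical names chosen for this example, not DomTrip's actual classes, and the decoder handles only the five predefined XML entities.

```java
/** Hypothetical sketch, not DomTrip's actual class: text kept in raw and decoded form. */
final class DualText {
    private final String raw;     // exactly as written in the source XML
    private final String decoded; // entity references resolved, for application code

    private DualText(String raw, String decoded) {
        this.raw = raw;
        this.decoded = decoded;
    }

    /** Builds a node from raw XML text and decodes it once, up front. */
    static DualText fromRaw(String raw) {
        return new DualText(raw, decode(raw));
    }

    /** Decoded view, for program logic. */
    String textContent() {
        return decoded;
    }

    /** Raw view, for serialization. */
    String toXml() {
        return raw;
    }

    /**
     * Decodes the five predefined entities. "&amp;" is replaced last so that
     * "&amp;lt;" decodes to "&lt;" instead of "<".
     */
    static String decode(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }
}
```

Application code reads `textContent()` and never sees entity syntax, while serialization emits `toXml()` and never has to guess which characters were originally escaped.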

2. Attribute Metadata

Attributes store comprehensive formatting information:

public class Attribute {
    private String value;           // The actual value
    private QuoteStyle quoteStyle;  // SINGLE or DOUBLE
    private String whitespace;      // Surrounding whitespace
    private String rawValue;        // Original encoded value
}
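The sketch below shows how metadata of this kind could drive serialization so that each attribute is written back with its original quote character. `AttributeSketch` is a hypothetical class for illustration only; it tracks just the whitespace before the attribute name, which is a simplification of the "surrounding whitespace" field listed above.

```java
/** Hypothetical sketch, not DomTrip's actual class: an attribute that remembers how it was written. */
final class AttributeSketch {

    enum QuoteStyle {
        SINGLE('\''), DOUBLE('"');

        final char quote;

        QuoteStyle(char quote) {
            this.quote = quote;
        }
    }

    private final String whitespace;     // whitespace that preceded the attribute name
    private final String name;
    private final String rawValue;       // value as written, entities still encoded
    private final QuoteStyle quoteStyle; // quote character used in the source

    AttributeSketch(String whitespace, String name, String rawValue, QuoteStyle quoteStyle) {
        this.whitespace = whitespace;
        this.name = name;
        this.rawValue = rawValue;
        this.quoteStyle = quoteStyle;
    }

    /** Re-emits the attribute with its original spacing and quote character. */
    String toXml() {
        return whitespace + name + "=" + quoteStyle.quote + rawValue + quoteStyle.quote;
    }
}
```

Serializing three such attributes in sequence can reproduce a tag like `<dependency scope='test' optional="true" classifier='sources'/>` with its mixed quotes intact.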

3. Whitespace Tracking

Every node tracks its surrounding whitespace:

public abstract class Node {
    protected String precedingWhitespace;  // Whitespace before the node
    // There is no followingWhitespace field: whitespace after a node is stored
    // as the precedingWhitespace of the next node
}
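A small sketch of this model, assuming each node carries only its preceding whitespace. `WhitespaceSketch` and its members are hypothetical names. One detail the model raises is where whitespace after the last child lives, since there is no next node to hold it; this sketch passes it in separately, and whether DomTrip stores it the same way is an assumption.

```java
import java.util.List;

/** Hypothetical sketch, not DomTrip's actual classes: serialization from preceding whitespace only. */
final class WhitespaceSketch {

    /** A node reduced to the two things this sketch needs. */
    static final class Node {
        final String precedingWhitespace; // whitespace between the previous node and this one
        final String markup;              // the node's own source text

        Node(String precedingWhitespace, String markup) {
            this.precedingWhitespace = precedingWhitespace;
            this.markup = markup;
        }
    }

    /**
     * Serializes the children of one element. The whitespace before the parent's
     * closing tag has no following node, so it is supplied by the caller.
     */
    static String serializeChildren(List<Node> children, String whitespaceBeforeClose) {
        StringBuilder out = new StringBuilder();
        for (Node child : children) {
            out.append(child.precedingWhitespace).append(child.markup);
        }
        return out.append(whitespaceBeforeClose).toString();
    }
}
```

Because every run of whitespace has exactly one owner, removing or inserting a node moves its whitespace along with it and nothing is emitted twice.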

4. Modification Tracking

Nodes track whether they've been modified to determine serialization strategy:

// Unmodified nodes use original formatting
if (!node.isModified() && !node.getOriginalContent().isEmpty()) {
    return node.getOriginalContent();
}

// Modified nodes are rebuilt with preserved style
return buildFromScratch(node);
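The same decision can be sketched as a complete class. `TrackedElement` is a hypothetical, JDK-only illustration of the technique. Its rebuild path is deliberately naive: it assumes the text is already escaped and makes no attempt to reuse the original style, which a real implementation would do.

```java
/** Hypothetical sketch, not DomTrip's actual class: an element with a modified flag. */
final class TrackedElement {
    private final String name;
    private final String originalContent; // source text captured at parse time
    private String text;
    private boolean modified;

    TrackedElement(String name, String text, String originalContent) {
        this.name = name;
        this.text = text;
        this.originalContent = originalContent;
    }

    /** Any mutation flips the flag, which invalidates the cached source text. */
    void setText(String newText) {
        this.text = newText;
        this.modified = true;
    }

    boolean isModified() {
        return modified;
    }

    String toXml() {
        // Untouched nodes are emitted verbatim, odd spacing included.
        if (!modified && !originalContent.isEmpty()) {
            return originalContent;
        }
        // Modified nodes are rebuilt; text is assumed to be already escaped.
        return "<" + name + ">" + text + "</" + name + ">";
    }
}
```

An element parsed from `<port >5432</port >` serializes unchanged until `setText` is called, after which only that element is regenerated.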

Round-Trip Verification

You can verify lossless parsing with this simple test:

// A sample document with an XML declaration, a comment, and nested elements
String complexXml =
        """
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Configuration file -->
    <config>
        <database>
            <host>localhost</host>
            <port>5432</port>
        </database>
    </config>
    """;

// Parse and serialize without making any changes
Document doc = Document.of(complexXml);
Editor editor = new Editor(doc);
String result = editor.toXml();

// The output should match the input exactly
Assertions.assertEquals(complexXml, result);

// Parse the output again to confirm the round trip is stable
Document doc2 = Document.of(result);
Editor editor2 = new Editor(doc2);
String result2 = editor2.toXml();

// Should be identical
Assertions.assertEquals(result, result2);
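When a round-trip check fails, knowing where the output first diverges is more useful than a bare inequality. The helper below is a hypothetical, JDK-only sketch: it takes the parse-and-serialize step as a function, so it makes no assumptions about any particular library.

```java
import java.util.function.UnaryOperator;

/** Hypothetical test helper, not part of DomTrip: reports where a round trip changed the text. */
final class RoundTripCheck {

    /** Returns the index of the first differing character, or -1 if the strings are equal. */
    static int firstDifference(String expected, String actual) {
        int common = Math.min(expected.length(), actual.length());
        for (int i = 0; i < common; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other, or they are identical.
        return expected.length() == actual.length() ? -1 : common;
    }

    /** Applies the round trip and fails with the offset of the first change, if any. */
    static void assertRoundTrip(String xml, UnaryOperator<String> roundTrip) {
        String result = roundTrip.apply(xml);
        int at = firstDifference(xml, result);
        if (at >= 0) {
            throw new AssertionError("Round trip changed the document at offset " + at);
        }
    }
}
```

With the API used on this page, the call would look like `RoundTripCheck.assertRoundTrip(xml, s -> new Editor(Document.of(s)).toXml());`.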

Performance Considerations

Lossless parsing costs extra memory and parse time, because formatting metadata must be collected and stored:

  • Memory overhead: ~20-30% compared to traditional parsers
  • Parse time: ~10-15% slower due to metadata collection
  • Serialization: faster for unmodified sections, which are written from their stored original text; slower for modified sections, which must be rebuilt

Memory Usage Example

// Illustrative pseudocode: traditionalParser and domtripParser are placeholders

// Traditional parser memory usage
Document traditionalDoc = traditionalParser.parse(xml);
// Memory: ~1x base size

// DomTrip memory usage
Document domtripDoc = domtripParser.parse(xml);
// Memory: ~1.3x base size (includes formatting metadata)

Limitations

While DomTrip preserves almost everything, there are a few edge cases:

  1. DTD Internal Subsets: Complex DTD declarations may be simplified
  2. Exotic Encodings: Some rare character encodings may be normalized
  3. XML Declaration Order: Attribute order in XML declarations may be standardized

Best Practices

1. Use for Editing Scenarios

// ✅ Perfect for editing existing files
String existingConfigXml = createConfigXml();
Document doc = Document.of(existingConfigXml);
Editor editor = new Editor(doc);

Element root = editor.root();
editor.addElement(root, "newSetting", "value");

String result = editor.toXml();
// Result preserves all original formatting

2. Verify Round-Trip in Tests

// ✅ Always test round-trip preservation
@Test
void testConfigurationEditing() {
    String original = loadTestXml();
    Document doc = Document.of(original);
    Editor editor = new Editor(doc);
    Element root = editor.root();

    // Make changes...
    editor.addElement(root, "test", "value");

    // Verify only intended changes occurred
    String result = editor.toXml();
    assertThat(result).contains("<test>value</test>");
    // countLines is a test helper, not part of DomTrip
    assertThat(countLines(result)).isEqualTo(countLines(original) + 1);
}
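The test above relies on a `countLines` helper that is not defined on this page. One possible implementation, offered as an assumption about what such a helper would do, counts newline characters and treats a final unterminated line as a line too:

```java
/** Hypothetical test helper: counts the lines in a string. */
final class LineCounter {

    /**
     * Counts '\n' characters, plus one if the text does not end with a newline.
     * "\r\n" line endings count once, because only '\n' is counted.
     */
    static int countLines(String text) {
        if (text.isEmpty()) {
            return 0;
        }
        int lines = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lines++;
            }
        }
        // A last line without a trailing newline still counts.
        if (text.charAt(text.length() - 1) != '\n') {
            lines++;
        }
        return lines;
    }
}
```

Comparing line counts is a coarse check. It catches an edit that reflows the whole file, but it would miss a change that alters a line without adding or removing one.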

3. Handle Large Files Carefully

// ✅ For large files, consider streaming or chunking
String xmlContent = createConfigXml();
// String.length() counts UTF-16 chars, so this only approximates the size in bytes
long fileSize = xmlContent.length();

if (fileSize > 10_000_000) { // roughly 10 MB
    // Consider alternative approaches for very large files
    System.out.println("Large file detected, consider streaming approach");
}

// For normal-sized files, DomTrip works efficiently
Document doc = Document.of(xmlContent);
Editor editor = new Editor(doc);
String result = editor.toXml();
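`String.length()` counts UTF-16 code units rather than bytes, so a size check on the string is only approximate for non-ASCII content. The sketch below measures bytes instead, either from the file on disk or from the UTF-8 encoding of the text. `SizeGuard` is a hypothetical helper, and the 10 MB default simply mirrors the threshold used above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of DomTrip: byte-accurate size checks before parsing. */
final class SizeGuard {
    static final long DEFAULT_LIMIT_BYTES = 10_000_000L;

    /** Size of the text once encoded as UTF-8. Allocates a copy, so prefer the Path overload for files. */
    static long utf8Size(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Checks an in-memory document against a byte limit. */
    static boolean exceeds(String xml, long limitBytes) {
        return utf8Size(xml) > limitBytes;
    }

    /** Checks a file against a byte limit without reading its content. */
    static boolean exceeds(Path file, long limitBytes) {
        try {
            return Files.size(file) > limitBytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Checking the file size before reading avoids loading a very large document into memory just to discover that it is too large.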

Comparison with Other Libraries

| Feature                      | DomTrip    | DOM4J         | JDOM              | Java DOM      |
|------------------------------|------------|---------------|-------------------|---------------|
| Comment preservation         | ✅ Perfect | ✅ Yes        | ✅ Yes            | ✅ Yes        |
| Between-element whitespace   | ✅ Exact   | ⚠️ Partial    | ✅ Yes*           | ⚠️ Limited    |
| In-element whitespace        | ✅ Exact   | ❌ Lost       | ⚠️ Configurable** | ⚠️ Limited    |
| Entity preservation          | ✅ Perfect | ❌ Decoded    | ❌ Decoded        | ❌ Decoded    |
| Quote style preservation     | ✅ Perfect | ❌ Normalized | ❌ Normalized     | ❌ Normalized |
| Attribute order preservation | ✅ Perfect | ❌ Lost       | ❌ Lost           | ❌ Lost       |
| Processing instructions      | ✅ Perfect | ✅ Yes        | ✅ Yes            | ✅ Yes        |
| CDATA preservation           | ✅ Perfect | ✅ Yes        | ✅ Yes            | ✅ Yes        |
| Round-trip fidelity          | ✅ 100%    | ❌ ~70%       | ⚠️ ~80%***        | ❌ ~75%       |

* JDOM: Use Format.getRawFormat() to preserve original whitespace between elements
** JDOM: Configure with Format.TextMode.PRESERVE to maintain text content whitespace
*** JDOM: Higher fidelity possible with careful configuration, but still loses some formatting details

Key Insight: While other libraries can preserve individual aspects of formatting, DomTrip is unique in preserving all formatting details simultaneously without requiring special configuration or losing any information during round-trip operations.