Dynamic repair of malformed XML (29-Jan-2010)

IP.com Prior Art Database Disclosure (Source: IPCOM)
Disclosure Number IPCOM000192702D dated 29-Jan-2010
Originally published in Prior Art Database
Disclosed by: IBM
Country: Undisclosed
Disclosure File: 4 pages / 62.0 KB / English (United States)

We propose a system which can automatically repair malformed XML documents. We propose that an intelligent heuristic is used to infer a schema from the well-formed data in the document. This schema is then used to generate one or many fixes to the malformed section(s) of the XML file. The file can then be further processed and some value gained from it. Where several different fixes each produce a valid document, all of the possible valid documents could be returned. The inferred schema is arrived at by analysing the tree structure of the tags in the well-formed parts of the document. Such an algorithm could log the frequency and type of sub-tags (and attributes) and use this to judge the probability of a tag existing in the bad block.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

XML documents are frequently produced where the data is malformed in some way. Such data corruption can occur for many reasons, for example: memory overwrites, failed method calls, or corrupted return values.

    According to the XML specification, a parser must report a fatal error when a document is not well formed, and must not continue normal processing.

    Without a schema these errors are not easy to locate and fix. Additionally, in a large file a single missing tag can lead to a near-complete loss of the data.

    We propose a system which can automatically repair malformed XML documents.

    We propose that an intelligent heuristic is used to infer a schema from the well-formed data in the document. This schema is then used to generate one or many fixes to the malformed section(s) of the XML file. The file can then be further processed and some value gained from it. Where several different fixes each produce a valid document, all of the possible valid documents could be returned.

    The inferred schema is arrived at by analysing the tree structure of the tags in the well-formed parts of the document.

    Such an algorithm could log the frequency and type of sub-tags (and attributes) and use this to judge the probability of a tag existing in the bad block.
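The frequency logging described above could be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `infer_child_stats` and `child_probability`, the sample `<library>` document, and the choice to estimate probability as "fraction of parent elements that contain the child tag at least once" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def infer_child_stats(root):
    """Log how often each child tag and attribute appears under each parent tag.

    Hypothetical helper: counts each child tag once per parent element,
    so the counts can be read as "how many <parent> elements had a <child>".
    """
    parent_counts = Counter()            # parent tag -> number of elements seen
    child_counts = defaultdict(Counter)  # parent tag -> child tag -> count
    attr_counts = defaultdict(Counter)   # parent tag -> attribute name -> count
    for elem in root.iter():
        parent_counts[elem.tag] += 1
        for tag in {child.tag for child in elem}:
            child_counts[elem.tag][tag] += 1
        for name in elem.attrib:
            attr_counts[elem.tag][name] += 1
    return parent_counts, child_counts, attr_counts


def child_probability(parent_counts, child_counts, parent, child):
    """Estimate the probability that a <parent> element contains a <child>."""
    total = parent_counts[parent]
    return child_counts[parent][child] / total if total else 0.0


# Stand-in for the well-formed part of a document (illustrative data only).
GOOD = """<library>
  <book id="1"><title>A</title><author>X</author></book>
  <book id="2"><title>B</title><author>Y</author></book>
  <book id="3"><title>C</title></book>
</library>"""

root = ET.fromstring(GOOD)
parents, children, attrs = infer_child_stats(root)
p_title = child_probability(parents, children, "book", "title")
p_author = child_probability(parents, children, "book", "author")
```

With this sample, every `book` has a `title` but only two of the three have an `author`, so a repair step could treat a missing `title` in the bad block as more likely to be an error than a missing `author`.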

    The inferred schema is then applied to the complete malformed document. The schema may allow many 'fixes' to the malformed section(s), and in this situation all possibilities should be returned. It may be possible to rank the fixes based on the similarity of the fixed section to known good sections. This method could also show possible fixes that do not appear in the document but are technically allowed by the schema, although these would probably be ranked lower.
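One way to picture the ranking step is to compare each candidate fix with the known good sections and sort by similarity. The sketch below is an assumption-laden illustration: the disclosure does not specify a similarity measure, so this example uses the sequence of child tag names as a "signature" and Python's `difflib.SequenceMatcher` ratio as the score. The names `tag_signature` and `rank_fixes` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_signature(elem):
    """Return the ordered list of direct child tag names of an element."""
    return [child.tag for child in elem]


def rank_fixes(candidates, good_sections):
    """Rank candidate fixes by their best similarity to any known good section.

    Both arguments are lists of well-formed XML fragments (strings).
    Returns (score, fragment) pairs, highest score first.
    """
    good_sigs = [tag_signature(ET.fromstring(s)) for s in good_sections]
    scored = []
    for text in candidates:
        sig = tag_signature(ET.fromstring(text))
        # Similarity measure chosen for illustration only.
        score = max(
            (SequenceMatcher(None, sig, g).ratio() for g in good_sigs),
            default=0.0,
        )
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


good = [
    "<book><title>A</title><author>X</author></book>",
    "<book><title>B</title><author>Y</author></book>",
]
candidates = [
    "<book><title>C</title></book>",
    "<book><title>C</title><author/></book>",
]
ranked = rank_fixes(candidates, good)
```

Here the second candidate has the same child-tag sequence as the good sections and is ranked first. A fix that the schema allows but that resembles no existing section would receive a lower score, which matches the expectation that such fixes rank lower.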

    The input to the program is a malformed XML document, that is, one that would cause an XML parser to throw an exception.

The XML document is then processed to determine possible fixes to the document.
1. Feed the document to an XML parser to generate a tree of elements. If the document parses, finish. Otherwise...
2. If it is malformed, locate the error(s) in the XML tree. This can be done by walking the XML tree breadth-first, parsing each subtree in turn. If a subtree is invalid then the walker descends the tree, searching for the error. Once the walker has descended so far that the remaining tree is valid, it knows the error lies at the level above. For each error, prune the tree by removing the malformed chunk of XML. Store this chunk separately and note where...
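Step 1, and a rough starting point for step 2, might look like the sketch below. It does not implement the breadth-first subtree walk described above; as a simpler stand-in it relies on the position that Python's `xml.etree.ElementTree` parser reports in `ParseError.position`. The function name `check_well_formed` and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Try to parse the document (step 1).

    Returns (root, None) if the document parses, otherwise
    (None, (line, column)) taken from the parser's error report.
    """
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # err.position is a (line, column) tuple supplied by the parser.
        return None, err.position


GOOD_DOC = "<library>\n  <book><title>A</title></book>\n</library>"

# The second <book> is missing its closing </title> tag.
BAD_DOC = (
    "<library>\n"
    "  <book><title>A</title></book>\n"
    "  <book><title>B</book>\n"
    "</library>"
)

good_root, good_error = check_well_formed(GOOD_DOC)
bad_root, bad_error = check_well_formed(BAD_DOC)
```

For `BAD_DOC` the parser stops on the line containing the unclosed `title`, which gives a repair step a place to start looking for the chunk to prune. A parser reports only the first fatal error, so a document with several malformed sections would need this check repeated after each chunk is removed.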
