HD-Diff is a tree-based algorithm to compute the differences between two documents. The algorithm was presented in a paper at the DocEng 2014 conference.
Unlike other tree-based differencing algorithms, HD-Diff can look into text nodes, split them when necessary, and produce a very fine-grained edit script. It is especially suited for tree-based text documents (e.g. office documents or WOM v3-based wiki articles) in which changes often happen to the text inside text nodes and not just to the overall document structure.
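To make the idea of splitting text nodes more concrete, here is a minimal sketch. It is not the HD-Diff implementation: it only finds the single longest common substring of two text values with a simple quadratic dynamic program and splits both values around it, so that the shared piece could be matched as a node of its own. The class and method names are hypothetical, and the approach does not have the linear average running time reported in the paper.

```java
import java.util.ArrayList;
import java.util.List;

class TextNodeSplitSketch {

    /** Returns {startInA, startInB, length} of the longest common substring. */
    static int[] longestCommonSubstring(String a, String b) {
        int bestLen = 0, bestA = 0, bestB = 0;
        // prev[j] = length of the common suffix of a[0..i-1) and b[0..j)
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    cur[j] = prev[j - 1] + 1;
                    if (cur[j] > bestLen) {
                        bestLen = cur[j];
                        bestA = i - bestLen;
                        bestB = j - bestLen;
                    }
                }
            }
            prev = cur;
        }
        return new int[] { bestA, bestB, bestLen };
    }

    /** Splits a text value into its non-empty pieces: before, shared, after. */
    static List<String> split(String text, int start, int len) {
        List<String> parts = new ArrayList<>();
        if (start > 0) parts.add(text.substring(0, start));
        if (len > 0) parts.add(text.substring(start, start + len));
        if (start + len < text.length()) parts.add(text.substring(start + len));
        return parts;
    }

    public static void main(String[] args) {
        String oldText = "The quick brown fox";
        String newText = "The quick red fox";
        int[] m = longestCommonSubstring(oldText, newText);
        // Both texts now share one identical piece; only the rest differs.
        System.out.println(split(oldText, m[0], m[2]));
        System.out.println(split(newText, m[1], m[2]));
    }
}
```

After the split, a tree differ that treats text nodes as opaque values would still see the shared piece as unchanged and report only the remaining piece as modified.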
The reference implementation of the generic HD-Diff algorithm is made available as part of the Sweble 2.0 release in the module hddiff. An adapter for WOM v3 documents is made available in the module hddiff-wom-adapter.
Additional information on the hddiff project can be found on GitHub, on our HD-Diff project page and in our paper.
Abstract: Detecting and understanding changes between document revisions is an important task. The acquired knowledge can be used to classify the nature of a new document revision or to support a human editor in the review process. While purely textual change detection algorithms offer fine-grained results, they do not understand the syntactic meaning of a change. By representing structured text documents as XML documents we can apply tree-to-tree correction algorithms to identify the syntactic nature of a change. Many algorithms for change detection in XML documents have been proposed but most of them focus on the intricacies of generic XML data and emphasize speed over the quality of the result. Structured text requires a change detection algorithm to pay close attention to the content in text nodes, however, recent algorithms treat text nodes as black boxes. We present an algorithm that combines the advantages of the purely textual approach with the advantages of tree-to-tree change detection by redistributing text from non-overlapping common substrings to the nodes of the trees. This allows us to not only spot changes in the structure but also in the text itself, thus achieving higher quality and a fine-grained result in linear time on average. The algorithm is evaluated by applying it to the corpus of structured text documents that can be found in the English Wikipedia.
Keywords: XML, WOM, structured text, change detection, tree matching, tree differencing, tree similarity, tree-to-tree correction, diff
Reference: Hannes Dohrn and Dirk Riehle. “Fine-grained Change Detection in Structured Text Documents.” In Proceedings of the 2014 ACM symposium on Document engineering (DocEng ’14). ACM, New York, NY, USA, 87-96. DOI=10.1145/2644866.2644880
The paper is available as a PDF file.
Two years after our first public release of the Google-sponsored Sweble 2.0 Alpha, we are happy to announce the release of Sweble 2.0!
The most important innovation in the alpha release was the introduction of the engine component which allowed full Mediawiki template expansion. Since then many other new features and bug fixes have been added to the software. Here are the highlights:
- In the post-processing phase Sweble normalizes and fixes the AST according to the rules found in the WHATWG HTML Spec, Section 12.2 of April 2012. This improves the quality of the resulting AST and guarantees that a rendered AST looks just like the HTML produced by Mediawiki when viewed in a modern browser.
- The Wiki Object Model v2 (WOM) has been replaced by a complete rewrite called WOM v3. The WOM v3 implements the org.w3c.dom Java interfaces and thus implements and extends the Document Object Model.
- The WOM v3 allows full round-trip support of Mediawiki articles. After parsing and converting an article to WOM v3, all formatting information from the original wiki markup is preserved. The original formatting can be restored even after alterations to the WOM tree, when the tree is converted back into wiki markup.
- Since the WOM v3 implements the org.w3c.dom interfaces it can be processed by standard Java facilities:
  - A WOM v3 tree can be serialized to XML and deserialized to a WOM v3 in-memory document using a javax.xml.transform.Transformer or a javax.xml.parsers.DocumentBuilder.
  - With a javax.xml.transform.Transformer one can also transform a WOM v3 document using an XSLT script.
- The module sweble-wom3-swc-adapter converts the AST produced by the sweble-wikitext-parser to a WOM v3 document and can restore wiki markup formatting to a WOM document.
- The module sweble-wom3-json-tools offers serialization of WOM v3 documents to and from JSON.
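Because WOM v3 builds on the org.w3c.dom interfaces, the two JDK facilities named in the list above can be tried out with any DOM document. The following sketch uses a plain DOM tree with placeholder element names (article, b) as a stand-in. Obtaining an actual WOM v3 document requires the Sweble modules and is not shown here. The class name, method names and the small stylesheet are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomRoundTripSketch {

    // Identity transform that additionally renames <b> to <strong>.
    static final String RENAME_B_XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"b\">"
        + "<strong><xsl:apply-templates select=\"@*|node()\"/></strong>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Deserializes XML text into an in-memory DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM document to XML text with an identity Transformer. */
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Applies an XSLT stylesheet to a DOM document and returns the result. */
    static String transform(Document doc, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder markup, not actual WOM v3 element names.
        Document doc = parse("<article><b>Hello</b> world</article>");
        System.out.println(serialize(doc));
        System.out.println(transform(doc, RENAME_B_XSLT));
    }
}
```

The same three calls should apply to any document type that implements org.w3c.dom.Document, which is the point of building the object model on the standard interfaces.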
Abstract: The organic growth of wikis requires constant attention by contributors who are willing to patrol the wiki and improve its content structure. However, most wikis still only offer textual editing and even wikis which offer WYSIWYG editing do not assist the user in restructuring the wiki. Therefore, “gardening” a wiki is a tedious and error-prone task. One of the main obstacles to assisted restructuring of wikis is the underlying content model which prohibits automatic transformations of the content. Most wikis use either a purely textual representation of content or rely on the representational HTML format. To allow rigorous definitions of transformations we use and extend a Wiki Object Model. With the Wiki Object Model installed we present a catalog of transformations and refactorings that helps users to easily and consistently evolve the content and structure of a wiki. Furthermore we propose XSLT as language for transformation specification and provide working examples of selected transformations to demonstrate that the Wiki Object Model and the transformation framework are well designed. We believe that our contribution significantly simplifies wiki “gardening” by introducing the means of effortless restructuring of articles and groups of articles. It furthermore provides an easily extensible foundation for wiki content transformations.
Keywords: Wiki, Wiki Markup, WM, Wiki Object Model, WOM, Transformation, Refactoring, XML, XSLT, Sweble.
Reference: Hannes Dohrn and Dirk Riehle. “Design and Implementation of Wiki Content Transformations and Refactorings.” In Proceedings of the 9th International Symposium on Open Collaboration (WikiSym + OpenSym 2013). ACM, 2013.
The paper is available as a PDF file.
The Sweble Project can now be found on GitHub and Ohloh.
The GitHub repositories mirror the primary repositories hosted on our servers. Commits pushed to our repositories will be pushed to GitHub after a short delay.
Please visit us on Ohloh and let us know if you’re using Sweble!
We released an early 2.0 (alpha) version of the Sweble Wikitext parser and related libraries in our Git repository and as Maven artifacts. The Sweble Wikitext parser aims to provide a Mediawiki-compliant Wikitext parser implementation in Java. This includes full Mediawiki template expansion but does not (yet) cover all of the parser functions and tag extensions.
We would like to thank Google, and in particular the Open Source Program Office of Chris DiBona, for sponsoring the development of our Wikitext parser.
Stay tuned for more Sweble components for Wikitext handling and domain-expert programming.
Sweble 1.1.0 fixes some bugs and introduces a couple of new features/modules. For a full list of changes please refer to the changes reports of the individual modules. The release can be found on Maven Central. Jars with dependencies will soon be available from our downloads page.
Fixed bugs (excerpt contains only bugs filed in our bug tracker):
- Cannot parse image block with nested internal link. Fixes 9.
- The LinkTargetParser is now decoding XML references and URL encoded entities (%xx) before checking titles for validity. Fixes 10.
- Tests fail under Windows due to encoding and path separator differences. Fixes 11.
- mvn license:check fails under Windows. Fixes 12.
- LazyRatsParser.java: type parameters of <T>T cannot be determined. Fixes 13.
- NPE on Spanish wikipedia dump. Fixes 14.
- Template expansion does not expand anonymous parameters correctly. Fixes 18.
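The LinkTargetParser fix in the list above comes down to decoding a link target before validating it. The sketch below shows why the order matters. It is not the Sweble code: the class, the decoding order (XML references first, then %xx sequences) and the set of rejected characters are assumptions made for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTargetDecodeSketch {

    private static final Pattern REFERENCE =
            Pattern.compile("&(#x[0-9a-fA-F]{1,6}|#[0-9]{1,7}|lt|gt|quot|amp);");

    /** Decodes numeric and a few named XML references in a single pass. */
    static String decodeXmlReferences(String s) {
        Matcher m = REFERENCE.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = resolve(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String resolve(String name, String original) {
        switch (name) {
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "amp": return "&";
            default:
                int cp = name.startsWith("#x")
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1));
                // Leave references to invalid code points untouched.
                return Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp)) : original;
        }
    }

    /** Decodes %xx sequences as UTF-8 bytes; malformed sequences are kept. */
    static String decodePercent(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit(in[i + 1], 16);
                int lo = Character.digit(in[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write(hi * 16 + lo);
                    i += 2;
                    continue;
                }
            }
            bytes.write(in[i]);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    static String decode(String rawTarget) {
        return decodePercent(decodeXmlReferences(rawTarget));
    }

    /** Hypothetical validity check: rejects a few characters not allowed in titles. */
    static boolean looksLikeValidTitle(String title) {
        return !title.isEmpty()
                && title.chars().noneMatch(c -> "#<>[]{}|".indexOf(c) >= 0);
    }

    public static void main(String[] args) {
        System.out.println(decode("Caf%C3%A9 &amp; Bar&#33;"));
        // The raw form hides the '|' and would pass the check; the decoded form does not.
        System.out.println(looksLikeValidTitle("Foo%7CBar"));
        System.out.println(looksLikeValidTitle(decode("Foo%7CBar")));
    }
}
```

Checking the raw string would let an encoded forbidden character slip through, which is the kind of problem that decoding before validation avoids.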
Notable new features/modules (excerpt):
- Added submodule ptk-json-tools: Library for serializing and deserializing ASTs to JSON and back.
- Added submodule ptk-xml-tools: Library for serializing and deserializing ASTs to XML and back.
- Added submodule swc-article-cruncher: A framework for processing Wikitext pages spreading the work over multiple processors.
- Added submodule swc-dumpreader: A library for reading Wikipedia XML dumps.
- Added submodule swc-example-basic: Example demonstrating parsing of an article and conversion to HTML.
- Added submodule swc-example-serialization: Example demonstrating the serialization and deserialization of ASTs to JSON, XML and native Java object streams.
- Added submodule swc-example-xpath: Example demonstrating XPath queries in ASTs.
We are finally deploying releases of Sweble and related software to Maven Central. This has many advantages for users of our software, among others:
- You don’t have to refer to our Maven repositories any more in your own poms (if you only use our releases; snapshots are still only available from our repositories).
- Releasing your own software on Maven Central becomes easier if you depend on Sweble.
For now only an updated version of our original 1.0.0 release of the Sweble software is available under the version number 1.0.01. However, we hope to provide a new release of the current development branch on Maven Central soon.
With this release of Sweble we’ve also started to auto-generate Maven sites for all of our software modules. These sites provide documentation of the individual projects and can be found in the Documentation menu of the Sweble Blog.
We will be presenting our paper on the design and implementation of the Sweble Wikitext Parser at the WikiSym 2011 conference! The conference will take place in Mountain View, CA in October.
For those of you who want to take a peek before the conference, we’ve put a pre-print version of the paper in the Sweble Wiki’s downloads section.
We still have some days left for fine-tuning the paper; if you have any suggestions for improvement, we would love to hear from you.
Wikipedia is a rich encyclopedia that is not only of great use to its contributors and readers but also to researchers and providers of third party software around Wikipedia. However, Wikipedia’s content is only available as Wikitext, the markup language in which articles on Wikipedia are written. Unfortunately, those parsers which convert Wikitext into a high-level representation like an abstract syntax tree (AST) define their own format for storing and providing access to this data structure. Further, the semantics of Wikitext are only defined implicitly in the MediaWiki software itself.
This situation makes it difficult to reason about the semantic content of an article or exchange and modify articles in a standardized and machine-accessible way. To remedy this situation we propose a markup language, called XWML, in which articles can be stored and an object model, called WOM, that defines how the contents of an article can be read and modified. Both are presented in a technical report published by the University of Erlangen, Dept. of Computer Science.
The technical report can be found in the Sweble Wiki’s downloads section.
The WOM Java interfaces and the XWML XML Schema definition are also available as files from our repository.