Design and Implementation of Wiki Content Transformations and Refactorings

Abstract: The organic growth of wikis requires constant attention by contributors who are willing to patrol the wiki and improve its content structure. However, most wikis still only offer textual editing, and even wikis that offer WYSIWYG editing do not assist the user in restructuring the wiki. Therefore, “gardening” a wiki is a tedious and error-prone task. One of the main obstacles to assisted restructuring of wikis is the underlying content model, which prohibits automatic transformations of the content. Most wikis use either a purely textual representation of content or rely on the representational HTML format. To allow rigorous definitions of transformations, we use and extend a Wiki Object Model. With the Wiki Object Model in place, we present a catalog of transformations and refactorings that helps users easily and consistently evolve the content and structure of a wiki. Furthermore, we propose XSLT as the language for transformation specification and provide working examples of selected transformations to demonstrate that the Wiki Object Model and the transformation framework are well designed. We believe that our contribution significantly simplifies wiki “gardening” by introducing the means for effortless restructuring of articles and groups of articles. It furthermore provides an easily extensible foundation for wiki content transformations.

Keywords: Wiki, Wiki Markup, WM, Wiki Object Model, WOM, Transformation, Refactoring, XML, XSLT, Sweble.

Reference: Hannes Dohrn and Dirk Riehle. “Design and Implementation of Wiki Content Transformations and Refactorings.” In Proceedings of the 9th International Symposium on Open Collaboration (WikiSym + OpenSym 2013). ACM, 2013.

The paper is available as a PDF file.
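
In the paper, transformations are specified as XSLT stylesheets that operate on the XML serialization of the Wiki Object Model. As a minimal sketch of how such a stylesheet could be applied outside of a wiki engine (the stylesheet and file names below are hypothetical and not taken from the paper), the standard JDK transformation API is sufficient:

    import java.io.File;

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class RefactorArticle {
        public static void main(String[] args) throws Exception {
            // Load a refactoring stylesheet (hypothetical example: demote all
            // top-level sections by one heading level).
            Transformer refactoring = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("demote-sections.xslt")));

            // Apply it to an article serialized in the WOM's XML format and
            // write the restructured article back out.
            refactoring.transform(
                    new StreamSource(new File("article.wom.xml")),
                    new StreamResult(new File("article.refactored.xml")));
        }
    }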


Sweble on GitHub and Ohloh

The Sweble Project can now be found on GitHub and Ohloh.

The GitHub repositories mirror the primary repositories hosted on our servers. Commits pushed to our repositories will be pushed to GitHub after a short delay.

Please visit us on Ohloh and let us know if you’re using Sweble!


Google-Sponsored Sweble 2.0 Alpha Released

We released an early 2.0 (alpha) version of the Sweble Wikitext parser and related libraries on our Git repository and as Maven artifacts. The Sweble Wikitext parser aims to provide a MediaWiki-compliant Wikitext parser implementation in Java. This includes full MediaWiki template expansion but does not yet cover all of the parser functions and tag extensions.
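
For a first impression of what using the parser looks like from Java, here is a minimal sketch along the lines of the swc-example-basic module; the class and method names are given from memory and may differ slightly in the alpha release:

    import org.sweble.wikitext.engine.PageId;
    import org.sweble.wikitext.engine.PageTitle;
    import org.sweble.wikitext.engine.WtEngineImpl;
    import org.sweble.wikitext.engine.config.WikiConfig;
    import org.sweble.wikitext.engine.nodes.EngProcessedPage;
    import org.sweble.wikitext.engine.utils.DefaultConfigEnWp;

    public class ParseExample {
        public static void main(String[] args) throws Exception {
            // NOTE: sketch only; names follow our memory of swc-example-basic.
            // Set up an English-Wikipedia-like configuration and the engine.
            WikiConfig config = DefaultConfigEnWp.generate();
            WtEngineImpl engine = new WtEngineImpl(config);

            // Identify the page we are about to parse.
            PageTitle title = PageTitle.make(config, "Example");
            PageId pageId = new PageId(title, -1);

            // Parse and post-process the Wikitext. Passing an expansion
            // callback instead of null should enable template expansion.
            String wikitext = "Hello ''wiki'' world!";
            EngProcessedPage page = engine.postprocess(pageId, wikitext, null);

            // The result wraps the abstract syntax tree of the article.
            System.out.println(page.getPage());
        }
    }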

We would like to thank Google, and in particular the Open Source Programs Office of Chris DiBona, for sponsoring the development of our Wikitext parser.


Stay tuned for more Sweble components for Wikitext handling and domain-expert programming.


Sweble 1.1.0 released

Sweble 1.1.0 fixes some bugs and introduces a couple of new features and modules. For a full list of changes, please refer to the changes reports of the individual modules. The release can be found on Maven Central. Jars with dependencies will soon be available from our downloads page.

Fixed bugs (the excerpt contains only bugs filed in our bug tracker):

  • Can not parse image block with nested internal link. Fixes 9.
  • The LinkTargetParser is now decoding XML references and URL encoded entities (%xx) before checking titles for validity. Fixes 10.
  • Tests fail under Windows due to encoding and path separator differences. Fixes 11.
  • mvn license:check fails under Windows. Fixes 12.
  • LazyRatsParser.java: type parameters of <T>T cannot be determined. Fixes 13.
  • NPE on Spanish wikipedia dump. Fixes 14.
  • Template expansion does not expand anonymous parameters correctly. Fixes 18.

Notable new features/modules (excerpt):

  • Added submodule ptk-json-tools: A library for serializing ASTs to JSON and back.
  • Added submodule ptk-xml-tools: A library for serializing ASTs to XML and back.
  • Added submodule swc-article-cruncher: A framework for processing Wikitext pages that spreads the work over multiple processors.
  • Added submodule swc-dumpreader: A library for reading Wikipedia XML dumps.
  • Added submodule swc-example-basic: An example demonstrating parsing of an article and conversion to HTML.
  • Added submodule swc-example-serialization: An example demonstrating the serialization and deserialization of ASTs to JSON, XML, and native Java object streams.
  • Added submodule swc-example-xpath: An example demonstrating XPath queries in ASTs (see the sketch below).
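
The swc-example-xpath module runs XPath queries directly on the AST. As a rough illustration of the same idea using only the JDK (the file name and element names below are assumptions, not the actual ptk-xml-tools output format), an AST that was first serialized to XML with ptk-xml-tools can be queried like this:

    import java.io.File;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class AstQueryExample {
        public static void main(String[] args) throws Exception {
            // Load an AST that was previously serialized to XML
            // (file name and element names are assumptions).
            Document ast = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("article-ast.xml"));

            // Find the targets of all internal links in the article.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList targets = (NodeList) xpath.evaluate(
                    "//internalLink/target", ast, XPathConstants.NODELIST);

            for (int i = 0; i < targets.getLength(); i++)
                System.out.println(targets.item(i).getTextContent());
        }
    }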

Sweble is available on Maven Central

We are finally deploying releases of Sweble and related software to Maven Central. This has many advantages for users of our software, among others:

  • You no longer have to refer to our Maven repositories in your own POMs (this applies only to our releases; snapshots are still available only from our repositories).
  • Releasing your own software on Maven Central becomes easier if you depend on Sweble.

For now, only an updated version of our original 1.0.0 release of the Sweble software is available, under the version number 1.0.0.1. However, we hope to provide a new release of the current development branch on Maven Central soon.

With version 1.0.0.1 of Sweble we’ve also started to auto-generate Maven sites for all of our software modules. These sites provide documentation of the individual projects and can be found in the Documentation menu of the Sweble Blog.


Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia

We will be presenting our paper on the design and implementation of the Sweble Wikitext Parser at the WikiSym 2011 conference! The conference will take place in Mountain View, CA in October.

For those of you who want to take a peek before the conference, we’ve put a pre-print version of the paper in the Sweble Wiki’s downloads section.

We still have some days left for fine-tuning the paper; if you have any suggestions for improvement, we would love to hear from you.


WOM: An object model for MediaWiki’s Wikitext

Wikipedia is a rich encyclopedia that is not only of great use to its contributors and readers but also to researchers and providers of third-party software around Wikipedia. However, Wikipedia’s content is only available as Wikitext, the markup language in which articles on Wikipedia are written. Unfortunately, the parsers that convert Wikitext into a high-level representation like an abstract syntax tree (AST) each define their own format for storing and providing access to this data structure. Further, the semantics of Wikitext are only defined implicitly in the MediaWiki software itself.

This situation makes it difficult to reason about the semantic content of an article or exchange and modify articles in a standardized and machine-accessible way. To remedy this situation we propose a markup language, called XWML, in which articles can be stored and an object model, called WOM, that defines how the contents of an article can be read and modified. Both are presented in a technical report published by the University of Erlangen, Dept. of Computer Science.
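
To give a rough idea of what storing an article as XML buys you (the fragment below is made up for illustration and does not follow the actual XWML schema), a stored article becomes accessible to standard XML tooling for reading and modification:

    import java.io.StringReader;
    import java.io.StringWriter;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.xml.sax.InputSource;

    public class XwmlSketch {
        public static void main(String[] args) throws Exception {
            // A made-up XWML-like fragment; the real schema is defined in the
            // technical report and in our repository.
            String xml = "<article title='Example'>"
                       + "<section><heading>Intro</heading>"
                       + "<paragraph>Hello <bold>wiki</bold> world.</paragraph>"
                       + "</section></article>";

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Modify the article in a machine-accessible way: rename a heading.
            Element heading = (Element) doc.getElementsByTagName("heading").item(0);
            heading.setTextContent("Introduction");

            // Serialize the modified article back to XML.
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            System.out.println(out);
        }
    }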

The technical report can be found in the Sweble Wiki’s downloads section.

The WOM Java interfaces and the XWML XML Schema definition are also available as files from our repository.


The Sweble Wikitext Parser Offer to the Wikipedia Community

Our offer to the Wikimedia Foundation and the Wikipedia (technical) community is this: Come up with a new and better Wikitext and use the Sweble Wikitext parser to convert old Wikipedia content to that new format. Naturally, the new Wikitext format should work well with visual editors, etc. We have spent more than one year full-time working on a parser that can handle the complexities of current Wikitext, and it does not make sense to us to create another one. You only need one bridge away from the place you don’t want to be any longer (the current “old” Wikitext) to get to a new and happier place.


Using CrystalBall, the Sweble Parser Demo

CrystalBall is our parser demo: you don’t have to get down to code to check out the parser. It is a simple and easy way to see how we interpret Wikitext.

The general Sweble Parser documentation is on the wiki, naturally. Here are a few examples, though, for the hurried among you. Please note that we have not invested in style sheets to make the HTML output look nice or resemble Wikipedia.org output (that is not our project goal).

Parsing the generic article (page) ASDF:

Some other articles:

The ultimate parser deathmatch Wikipedia article page (courtesy of Luca de Alfaro of WikiTrust fame):

And finally some XPath queries:

Have fun! And please let us know if your favorite article doesn’t do what you think it should do!


Announcing the Open Source Sweble Wikitext Parser v1.0

We are happy to announce the general availability of the first public release of the Sweble Wikitext parser, available from http://sweble.org.

The Sweble Wikitext parser

  • can parse all complex Wikitext, including tables and templates
  • produces a real abstract syntax tree (AST); a DOM will follow soon
  • is open source, made available under the Apache Software License 2.0
  • is written in Java, utilizing only permissively licensed libraries

You can find all relevant information and code at http://sweble.org – this also includes demos, in particular the CrystalBall demo, which lets you query a Wikipedia snapshot using XQuery. (The underlying storage mechanism is not particularly well-performing, so you may have to wait a little if load is high.)

The Sweble Wikitext parser intends to be a complete parser for Wikitext. That said, plenty of work remains to be done. Wikitext, as implemented through the MediaWiki engine, has ties to many components that aren’t strictly part of the language, most notably the parser functions, of which we have implemented only a subset.

At this stage, we are hoping for your help. You can help us by

  • playing with the CrystalBall demo and pointing out to us wiki pages that look particularly bad or faulty
  • simply using the parser in your projects and telling us what works and what doesn’t (bug reports!)
  • getting involved in the open source project by contributing code, documentation, and good humor

If you have questions, please don’t hesitate to use the sweble.org facilities or send email to the main implementor, Hannes Dohrn.

Brought to you by the Open Source Research Group at the University of Erlangen, http://osr.cs.fau.de

 
