Decarboni.se: An advanced web-based knowledge management architecture
The Decarboni.se knowledge hub is an evolution of the Global CCS Institute’s Drupal-based knowledge platform, which we started building in 2010. After about four years of solid growth, the old platform began to show its age and limitations. Following intensive prototyping and evaluation from late 2012 to early 2013, we decided to explore building a knowledge hub that retained the key features of the old knowledge platform, had a stronger search function and, more importantly, allowed us to quickly build great sites for other organisations and encourage them to share information more effectively.
A distributed architecture built around Solr and Drupal
Shortly after CNET made Solr an Apache project in 2006, some of my colleagues at the National Library of Australia discussed the idea of using Solr as a NoSQL database and building a search-oriented system to share the Library’s digital subscriptions. Even by today’s standards the idea is still very fresh, and despite a range of technical challenges we rolled out this system in late 2008. Although it took more than two years for the system to make it to production, the experience we gained from the project was tremendous.
Since we needed to retain the key features of the old knowledge platform and were already familiar with Drupal, it seemed logical that Drupal should still be part of the knowledge hub formula. In addition, the enthusiastic developers of the Drupal community had over time built an excellent framework that integrated Solr with Drupal. It was clear to us that we had gathered all the key ingredients and were ready to cook.
The overall idea of the knowledge hub architecture is to develop a suite of web systems that are interconnected by a Solr-powered API. The architecture comprises a knowledge “hub core” and a number of “terminals”. Complex back-end features, such as converting PDF documents to structured HTML pages, reside only on the hub core. Terminals then use specific Solr queries to pull content from the hub core and display it as if it were served from a local database. Imagine the knowledge hub architecture as a business that owns a large warehouse and a number of retail stores. Stock is shipped to the warehouse in bulk, and each retail store requests only the specialised goods it needs, on demand. This reduces shipping costs, wastage and store maintenance. Our customers will also find it much easier to go 'shopping' for information. Finally, each store can choose the layout, marketing and branding best suited to the goods it carries.
In our world, this means lower costs for sourcing, creating and administering content (knowledge assets), and lower costs for maintaining the technology. Terminal sites also have greater flexibility in their individual look and feel, information architecture and marketing strategy. The hub core provides a powerful search function to the terminals through the API, and every piece of content is interconnected by carefully designed taxonomies. All in all, this architecture enables us to spin off a terminal site with ease and fill it with quality content by simply crafting a few Solr queries. A standard site, for example, can be created in as little as an hour by one experienced web developer.
Figure 1 - An overview of the knowledge hub architecture
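To make the "terminal pulls from the hub core" idea a little more concrete, here is a minimal sketch in Python of how a terminal might assemble one of those Solr queries. It is an illustration only, not the Decarboni.se code: the hub core URL, the `content_type` field and the function name are all assumptions. The only things taken from Solr itself are the standard `q`, `fq`, `rows` and `wt` request parameters.

```python
from urllib.parse import urlencode

# Hypothetical address of the hub core's Solr select handler (an assumption).
HUB_CORE_SELECT = "http://hub.example.org/solr/knowledge/select"


def build_terminal_query(topic_terms, content_type=None, rows=10):
    """Build the URL a terminal could request to pull content from the hub core.

    topic_terms  -- phrases that define what this terminal is about
    content_type -- optional filter on a hypothetical 'content_type' field
    rows         -- how many results to ask for
    """
    # Quote each phrase and OR them together into one main query.
    quoted = ['"{}"'.format(term.replace('"', '\\"')) for term in topic_terms]
    params = [
        ("q", " OR ".join(quoted)),
        ("rows", str(rows)),
        ("wt", "json"),  # ask Solr for a JSON response
    ]
    if content_type:
        # A filter query narrows the results without affecting relevance ranking.
        params.append(("fq", 'content_type:"{}"'.format(content_type)))
    return HUB_CORE_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_terminal_query(["biomass", "carbon capture"], "publication", rows=5))
```

In this picture, a terminal site is little more than a theme plus a handful of such queries, which is why spinning one up can be quick.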
An automatic classification engine
Back in my days working at Australian Parliament House, I managed a massive database that houses almost all of the Parliament’s records. It was built by the boffins at RMIT and uses a technology that may have been of interest to more than one of the intelligence agencies. One of the key features of this database is auto-classification. The Parliamentary Library has, over the years, built a thesaurus for cataloguers to keep content in order. This model became less efficient as more and more information flowed into the Library and needed to be catalogued. The solution was to feed the computer program the thesaurus and let the machine “observe” how human cataloguers use it to classify content. It learns from this exercise and then catalogues content on its own. The program works so well that I even wonder whether it may take jobs away from its human counterparts.
Coming back to my warehouse analogy for the knowledge hub, we have built this massive depot and truckloads of fresh content arrive every day. The problem now is that our human warehouse workers simply cannot cope with the sheer volume of goods, and things are in disarray. We need robots; we need an auto-classifier. However, given our time and budget constraints, we were unlikely to be able to implement an auto-classification system from scratch. We considered Apache Mahout, but our database is too small for the technology. The focus came back to Solr.
By the time we started building the knowledge hub, we already had a reasonably mature “thesaurus” as a result of running the Institute’s knowledge platform for over three years. The idea was to extend this “thesaurus” (or let’s just call it the taxonomies) into a format that Solr can use to classify content automatically. Sorry, did I say classify? I actually meant predefined searches. The idea is that, for a user who is trying to find content related to a topic, it doesn’t really make a difference whether the content has already been classified by topic or whether the content items related to that topic are found in real time. In other words, if we can use a classification algorithm to categorise content, we may just as well use a search algorithm to find content using predefined queries on the fly. The only disadvantage of this approach is that it cannot learn from its human counterparts and improve its accuracy. Our solution is that, instead of the algorithm learning from humans, we "teach" it by carefully crafting Solr queries. For example, we can teach it that New South Wales is a state of Australia and that Al Gore is an environmental activist.
The result, after a whole year's hard work, is a Solr-powered, search-driven and distributed web system that encapsulates an advanced document processing workflow and an “automated content classifier”, and houses over 20,000 pages of knowledge on climate change solutions.
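Here is a rough sketch, in Python, of what "classification as predefined searches" can look like. Everything in it is illustrative: the taxonomy entries are made up around the two examples above, the function names are hypothetical, and the small in-memory matcher only stands in for Solr so the idea can be run without a search server. It is not how Solr itself matches text.

```python
import re

# Hypothetical taxonomy: each term is "taught" the phrases that imply it.
TAXONOMY = {
    "Australia": ["Australia", "New South Wales", "Victoria"],
    "Environmental activists": ["Al Gore"],
}


def to_solr_query(phrases):
    """Turn a term's phrases into one predefined query string (quoted phrases ORed)."""
    return " OR ".join('"{}"'.format(p) for p in phrases)


def matches(text, phrases):
    """Toy stand-in for running the predefined query: whole-phrase, case-insensitive."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(p.lower()) + r"\b", lowered) for p in phrases
    )


def classify(text):
    """Return every taxonomy term whose predefined search finds this text."""
    return [term for term, phrases in TAXONOMY.items() if matches(text, phrases)]


if __name__ == "__main__":
    print(to_solr_query(TAXONOMY["Australia"]))
    print(classify("A new capture project was announced in New South Wales."))
```

Nothing is stored against the document here. A topic page simply runs its query when it is viewed, so improving a query immediately "reclassifies" every page in the hub.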
Interesting facts about Decarboni.se
- The first code commit was made on the 18th of March 2013 and the system went live on the 17th of March 2014. The development took exactly one year.
- About 50% of the development work is done on machines powered by renewable energy and our hosting provider also uses renewable energy.
- Both the Decarboni.se hub core and terminals run on Ubuntu machines hosted by Rackspace. The complete suite of sites costs two-thirds less to run than our old knowledge platform and is about seven times faster. The hub core can serve over 200 pages per second without any noticeable slowdown.
- Decarboni.se can easily process 10,000 pages of PDF documents per month, publish them as full-text searchable HTML and classify them automatically.
- The first functional terminal site was built for the European Commission and it took us less than a day to deploy and configure the site for the client: http://ccsnetwork.eu
Figure 2 - Publications on biomass will be automatically organised under a definition page after they are uploaded to Decarboni.se
Figure 3 - Full-text search is available and users can see precisely which chapter/section contains information they are interested in
Figure 4 - Terms defined in Decarboni.se taxonomies will be automatically highlighted in the content and linked to the term definition page with other content relating to the term