Opening the door to semantic search in Magnolia

In the past couple of months, by pure coincidence, we have been in contact with two independent companies providing semantic search services. Less coincidentally, customers have been asking for semantic search features in Magnolia for months. Trending topic? Yeah, pretty much. While it can sound like a mishmash of buzzwords, “semantic search” can mean a lot of different things; generally, though, it provides different – or new – ways of looking at, and navigating, your content.

In this blog post, I’ll describe a few of the problems we have with regular searches in Magnolia, and how we’ve implemented a module that lets you use the above-mentioned services, bringing semantics into Magnolia!

To cut a long story short, let me show you part of what’s been going on during these past few months:

[Screenshot: a page with a list of “similar pages” links on the right]

See that list of links on the right? Well, that’s the result of a query to a service which returns “similar” or “related” pages to the one we’re looking at. How that service knows what’s relevant and related is a bit of magic linguistics voodoo to me, but I’ll describe what we’ve done in Magnolia to achieve this, and where this is going in the coming weeks.

External indexing

If you have fiddled around with searches in JCR, Jackrabbit and especially Magnolia, you have probably hit quite a few walls. Truth is, JCR search works well for finding documents (nodes) containing a given word, or where a given property has a given value. Anything beyond that quickly falls outside the realm of what would be portable to a different JCR backend. I’ve already gone down that road for the forum’s keyword search: if you look at the Forum-STK integration module, you’ll find that the JCR query itself is fairly standard, but we use rep:excerpt (which is Jackrabbit-specific, although it tends to be supported by other repositories as well), which in turn relies on the fact that we index the forum workspace with a customized configuration; finally, the whole feature relies on the fact that Jackrabbit renders those excerpts with a (configured) customized HighlightingExcerptProvider, so that it fits our HTML structure. Not pretty.
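To make that a bit more concrete, here’s roughly what such a Jackrabbit query looks like. This is a simplified illustration, not the forum module’s actual code; the node type and search term are made up for the example.

```java
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.Row;
import javax.jcr.query.RowIterator;

// Simplified illustration of a JCR full-text query using Jackrabbit's
// rep:excerpt pseudo-function; not the forum module's actual code.
public class ExcerptSearchExample {

    public static void search(Session session, String term) throws Exception {
        QueryManager qm = session.getWorkspace().getQueryManager();
        // Note: real code should escape the search term rather than concatenate it.
        Query query = qm.createQuery(
                "//element(*, nt:base)[jcr:contains(., '" + term + "')]/rep:excerpt(.)",
                Query.XPATH);
        RowIterator rows = query.execute().getRows();
        while (rows.hasNext()) {
            Row row = rows.nextRow();
            // Jackrabbit renders the excerpt as HTML, via the configured
            // (possibly customized) ExcerptProvider.
            System.out.println(row.getValue("rep:excerpt(.)").getString());
        }
    }
}
```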

Another reason this can be hairy is how data is aggregated on a page. Think about dynamic pages in Magnolia: there’s a good chance that what you see on a given page comes from a different data source than the page itself. Take a web shop, for example; the shop items’ descriptions and images will most likely come from nodes in the data workspace. The way Magnolia’s pages are indexed by Jackrabbit makes searching for such pages pretty much useless, and for each and every type of item you want to find in your search results, you’ll need to jump through hoops like those described above.

And as it stands with the forum module, we haven’t even touched the case where we want aggregated results: all we get from the forum’s search (or our hypothetical web shop search) is forum threads (or shop items). Do you have a shop and a forum on the same site? Or do you simply have editorial content on that site that you’d like to search as well? Well, good luck with that. Sure, it works; it’s not too complicated to execute two queries and aggregate their results, but how do you sort those results in any relevant way? (Not so ironically, you’ll find results if you search the forum for this topic.)

This is no secret for anyone who’s been working with search. Solutions like Solr exist to solve this precise problem, and they work quite well. Our own Federico even wrote a module to integrate Solr with Magnolia.

Now, keyword-based searches are one thing, but… do people really use search? Instead of telling someone to “use the search” to find a solution to their problem (or RTFM!), why not give them possible solutions straight away? This is where “semantic search” shines, in my opinion. Navigating a website through “similar articles”, “related articles” and “you might like” links can be a much richer experience than searching for, and via, a keyword or two (“what’s that term I’m looking for, again?”).

Enter the External Indexing module

We want to free ourselves from the limitations of the built-in JCR search capabilities; or rather, we want to take advantage of other, more specialized solutions that solve different problems. We therefore need a mechanism in Magnolia to index pages, documents and resources into such a system.

So here’s where this blog post is going: we used our forum as a test bed and proof of concept, and have now made the new “External Indexing” module available on the Magnolia Forge. For the impatient, the source is on Git and snapshot builds are available on our Maven repository.

Goals of the External Indexing Module

This module was written with a couple of simple goals in mind:

  • Decoupling from the underlying system. We wanted to be able to easily integrate with more systems, without necessarily having to rewrite everything from indexing to paragraphs.
  • Decoupling from the type of content being indexed. Ideally, we want to be able to index and search different types of content (web pages, documents from the DMS, forum threads, shop items, …) transparently.
  • Following the above, we also wanted to be able to aggregate independent sites; for example, our documentation site runs on a completely separate instance from our forum, but we want “suggestions” from both to appear on both (more on that later).

Concepts of the External Indexing Module

If you’re interested in integrating another system, improving the existing ones, understanding how to configure the module for your own setup, or are simply curious (yay you!), here are some of the key components used in the External Indexing module.

Services

The module offers several services:

  • IndexerService: this service is used via the IndexerEventListener described below. It is responsible for the actual indexing of content. There is currently an implementation for one of the services we used, as well as a “null” implementation, for cases where indexing happens automatically on the remote side. See below for a couple more details.
  • SimilarService: this service offers a simple API to retrieve content “similar” to a given node (see the sketch after this list). What “similar” means is up to the underlying system. This is what’s used to display “related pages” on our forum. There are currently implementations for both services we use, as well as a “caching” implementation, to avoid unnecessary traffic to the remote servers.
  • SearchService: similarly, this service offers keyword-based search results. There is currently no implementation, so here’s your cue to start contributing ;)
  • Others? I’m sure there are a couple more services the module could offer to complete its palette. Any suggestions?
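To give you an idea of the shape of these APIs, here’s a hypothetical sketch of what the SimilarService might look like. Names, signatures and the result type are my own simplification, not necessarily the module’s actual interface.

```java
import java.util.List;
import javax.jcr.Node;
import javax.jcr.RepositoryException;

// Hypothetical simplification of the SimilarService API; the module's
// actual interface may differ in names and signatures.
public interface SimilarService {

    /** A single "similar content" hit: a title to display and a URL to link to. */
    class SimilarResult {
        public final String title;
        public final String url;

        public SimilarResult(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    /** Returns up to maxResults items the underlying system considers similar to the given node. */
    List<SimilarResult> findSimilar(Node node, int maxResults) throws RepositoryException;
}
```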

Observation and EventAccumulators

JCR offers an observation mechanism, which we’re taking advantage of (once more!), in order to push our content to an external indexing system.

The module is configured via EventAccumulators, which allow filtering and grouping events before they are processed by the indexer. A typical page edit will trigger four or more events (a text change, various metadata changes); using the EventAccumulator, we can filter which events we’re interested in, and accumulate all events related to the same page into a single indexing operation. An IndexerEventListener instance is registered with JCR’s observation mechanism for each configured EventAccumulator.
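For illustration, here’s a minimal sketch of hooking a listener into JCR observation, along the lines of what the IndexerEventListener does. The accumulation logic is only hinted at, and all parameters (observed path, event types) are illustrative rather than the module’s actual configuration.

```java
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

// Minimal sketch of registering a listener with JCR observation; the real
// IndexerEventListener delegates filtering/grouping to an EventAccumulator.
public class IndexerEventListenerSketch implements EventListener {

    public static void register(Session session, EventListener listener) throws Exception {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(listener,
                Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/",    // watch the whole workspace...
                true,   // ...including all descendants
                null,   // no UUID filter
                null,   // no node type filter
                false); // also receive events from the local session
    }

    @Override
    public void onEvent(EventIterator events) {
        // In the real module, an EventAccumulator would group all events
        // belonging to the same page here, and trigger one indexing run per page.
        while (events.hasNext()) {
            events.nextEvent(); // filter / accumulate
        }
    }
}
```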

IndexGenerators

The IndexGenerator implementations are responsible for converting Magnolia content (i.e. the input to the IndexerService) into something the underlying system can understand; this could be as simple as a URL, a plain-text rendition of the page, or, for example, an object which we can send to a remote REST or SOAP API. Several IndexGenerator implementations are needed, depending on the input content type and the underlying service.
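As a rough sketch, the contract might be as simple as this (again, a hypothetical simplification rather than the module’s actual interface):

```java
import javax.jcr.Node;

// Hypothetical sketch of the IndexGenerator contract; the module's actual
// interface may look different.
public interface IndexGenerator<T> {

    /**
     * Converts a Magnolia node into whatever the indexing backend expects:
     * a URL, a plain-text rendition of the page, a request object, etc.
     */
    T generate(Node content) throws Exception;
}
```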

Side note about the WebsiteIndexGenerator for the SalsaDev service: since SalsaDev expects plain-text input, we had to jump through some hoops to generate it from a Magnolia page. We wanted a generic solution that works for all types of pages (so the idea of sub-templates was abandoned) and produces readable, meaningful results. After fiddling around with various solutions, and not finding a turn-key one, I came up with a combination of Tika and jsoup. Tika works great for converting HTML to plain text while keeping some level of semantics (i.e. a blank line after headings and between paragraphs, etc.), but it falls short when it comes to filtering: a plain-text version of a web page will hardly make sense, or be readable, if it contains all the navigation and other extraneous elements. jsoup shines in this area, with a jQuery-like API that lets you easily select and remove elements based on a selector. To filter an STK page, for example, you’d select #main and remove #breadcrumb and .text-meta (which is exactly what the class does by default). WebsiteIndexGenerator is configurable, such that you can specify a mainElementSelector and a comma-separated list of elementsToFilter.
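Here’s a condensed sketch of that jsoup + Tika combination. The selectors match the STK defaults mentioned above, but the code itself is my simplification of what WebsiteIndexGenerator does, not its actual source.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Simplified sketch of HTML-to-plain-text extraction with jsoup and Tika.
public class HtmlToPlainTextSketch {

    public static String extract(String html, String mainElementSelector,
                                 String... elementsToFilter) throws Exception {
        // 1. jsoup: keep only the main content element, drop navigation etc.
        Document doc = Jsoup.parse(html);
        Element main = doc.select(mainElementSelector).first();
        if (main == null) {
            return "";
        }
        for (String selector : elementsToFilter) {
            main.select(selector).remove();
        }

        // 2. Tika: convert the remaining HTML to readable plain text, keeping
        //    blank lines around headings and between paragraphs.
        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        new HtmlParser().parse(
                new ByteArrayInputStream(main.outerHtml().getBytes(StandardCharsets.UTF_8)),
                handler, new Metadata(), new ParseContext());
        return handler.toString();
    }
}
```

For an STK page, calling extract(html, "#main", "#breadcrumb", ".text-meta") would then return a clean plain-text rendition of just the main content.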

REST service, Servlet, and Paragraph

This is the final piece of the puzzle. We expose the above services as servlets or REST endpoints, and use them from Magnolia paragraphs. Currently, only the SimilarService is available, as a Servlet. There is a (currently untested) REST service, which should benefit from the upcoming REST API in Magnolia (at which point the Servlet implementation might be dropped altogether).
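As a rough idea of the servlet side, here’s a sketch that exposes similar-content results as JSON, building on the hypothetical SimilarService sketch from earlier. The parameter name and JSON shape are illustrative, and a real implementation would escape its output properly.

```java
import java.io.IOException;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Rough sketch of exposing similar-content results over HTTP; not the
// module's actual endpoint.
public abstract class SimilarServletSketch extends HttpServlet {

    /** Hypothetical hook: resolve the requested path to similar-content results. */
    protected abstract List<SimilarService.SimilarResult> lookup(String path, int max);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        List<SimilarService.SimilarResult> results = lookup(req.getParameter("path"), 10);
        resp.setContentType("application/json;charset=UTF-8");
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < results.size(); i++) {
            SimilarService.SimilarResult r = results.get(i);
            if (i > 0) json.append(',');
            // Note: a real implementation would JSON-escape these values.
            json.append("{\"title\":\"").append(r.title)
                .append("\",\"url\":\"").append(r.url).append("\"}");
        }
        resp.getWriter().print(json.append(']'));
    }
}
```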

There is currently one implementation of a Magnolia paragraph, which gets results from the Servlet via an Ajax call and displays them using a jQuery animation, as you can see on our forum. There is also a variation of that paragraph to be used on forum pages (because the “input” node is not the page itself).

Bonus: Abstract REST Client: since both APIs we were working with are REST-friendly, I abstracted a little client class, which could perhaps be useful in other contexts. What do you think?
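For the curious, a client along those lines can be surprisingly small. The sketch below is my own guesswork at a minimal version; the real class in the module is likely richer (authentication, POST bodies, error handling, and so on).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Guesswork sketch of a minimal REST client; not the module's actual class.
public class SimpleRestClient {

    private final String baseUrl;

    public SimpleRestClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Performs a GET against baseUrl + path and returns the raw response body. */
    public String get(String path) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(baseUrl + path).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        }
    }
}
```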

How we use it

An early form of the module is currently deployed on our forum. Since then, as said above, we’ve extracted it into its own module. In the coming weeks, we will deploy this new version on the forum, as well as on our documentation site. So here’s the trick: since we wanted to test-drive the two products mentioned earlier – on one hand to validate the concepts of the module, and on the other to see whether either provided “better” results – both sites are indexed by both services, and each site gets its results from one of them. The forum will keep on showing suggestions courtesy of the SalsaDev API, while the documentation site will suggest related pages provided by Canoo’s FindIT service.

There, I said it. Those are the two services we’ve been playing with. So far, I have no strong opinion on either: they both deliver pretty cool results, and both are fairly easy to use, each in its own way.

Canoo FindIT vs. SalsaDev

Here’s a quick comparison of both services, from a developer’s perspective. I can’t really say much about the quality of the results just yet, but so far they seem pretty good. It’s also fun to see the results get better after you notify these guys that some of them might be irrelevant, and their linguists apply their voodoo.

Don’t let the following deter you from trying both services; these details are hidden by the External Indexing Module, after all!

SalsaDev:
  • Input is plain text. This gives the developer some flexibility (you decide what gets indexed, and how). Getting plain-text output out of Magnolia’s various paragraphs wasn’t trivial, but works well now.
  • The client application needs to send and update its data (hence the use of observation).
  • An arbitrary ID is chosen when indexing documents; we use a simple <workspace>:<uuid> pattern.

Canoo FindIT:
  • Input is essentially a bunch of URLs; Canoo’s system fetches the data and parses it. Extremely simple, but less control for the developer (i.e. the system needs to be configured to parse your documents properly).
  • It’s all automatic and transparent. There’s even an “auto_index” feature, such that if a not-yet-indexed document requests documents similar to itself, it is added to the index transparently.
  • The index ID is the document’s URL. While this makes things easy, it might not play well with content reuse (say, if a page is visible via different URLs).

Another differentiator is that SalsaDev provides a documented REST API (including a Java .jar if you’re so inclined), whereas Canoo FindIT provides JavaScript snippets which are directly embeddable into your pages. (Instead of using those, we reproduced their functionality in a Java REST client class so we could cache results on the server; still, those snippets make it pretty easy to integrate their service into any site.)

What’s next ?

There are a couple of tasks we still need to take on; among other things, we need to handle the case where documents are deleted (and should thus be removed from the index).

There’s plenty of room for contribution, too: implementing the SearchService should happen at some point. The IndexerService could benefit from a delegating pattern, for cases where one site needs to be indexed into several indexers (see the sketch below). An integration with Solr (and/or with Federico’s module!) would perhaps help adoption of this module, too. Any other ideas that would fit this module would also be great feedback!
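That delegating pattern could be as simple as the following sketch; the IndexerService interface shown here is a hypothetical simplification of the module’s actual contract.

```java
import java.util.List;
import javax.jcr.Node;

// Hypothetical simplification of the IndexerService contract.
interface IndexerService {
    void index(Node content) throws Exception;
}

// Sketch of the delegating pattern suggested above: one IndexerService
// fanning out to several others, so a single site can feed multiple indexes.
public class DelegatingIndexerService implements IndexerService {

    private final List<IndexerService> delegates;

    public DelegatingIndexerService(List<IndexerService> delegates) {
        this.delegates = delegates;
    }

    @Override
    public void index(Node content) throws Exception {
        for (IndexerService delegate : delegates) {
            delegate.index(content);
        }
    }
}
```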

There’s talk of standardization of semantic tools, led by the IKS project and, as far as I understand, the Stanbol project. Hopefully those won’t diverge too much from the approaches taken by this module, so that adapting the two will be a breeze.

Now, my hope is that you’ll be able to give this module a try. Let us know what you think!

4 thoughts on “Opening the door to semantic search in Magnolia”

  1. Great stuff, Greg. These studies of how Magnolia gets integrated and extended are some of my favorite things to read; really fascinating both to see how the product’s extensible design pays off, and to learn about the other products and technologies we’ve taken advantage of.

  2. Pingback: Magnolia CMS and semantic technologies – follow-up on the SalsaDev webinar | Greg’s ramblings
