Magnolia CMS and semantic technologies – follow-up on the SalsaDev webinar

Back in August, I introduced the “External Indexing Module” for Magnolia CMS. Just last week, we had a webinar with salsaDev, where they introduced their technology stack, and I demonstrated how to use it with Magnolia and the External Indexing Module.

For those who missed it, it’s now available on our website. Unfortunately, we didn’t have time to address all questions asked by the audience. We usually send a follow-up email to the audience of webinars, but why not publish those unanswered questions as well ? Well, we decided to go ahead and just do it. It provides some interesting insights and different perspectives on the problem and the solution. (Questions were anonymized; some were slightly rewritten or merged when they touched on similar topics. The answers were written jointly by salsaDev and myself.)

What are other features of the salsaDev API and the Magnolia module ?

See http://doc.salsadev.com/ and http://apidoc.salsadev.com/ for salsaDev API features.

As for the External Indexing module, the currently implemented services are “Indexing” and “Similar”. Anything else will be added as the need arises, and we’re of course very open to any and all contributions. See the module’s javadoc for some details.

Is there a function to search for specific words ?

Magnolia CMS offers a search functionality out-of-the-box, based on the repository’s own indexing. The External Indexing module could also offer such a functionality – there is in fact an interface for this in the source code, but currently no underlying implementation. We’d love to get such additional features contributed !

The search process seems to be slow. What is the relation between search time and the length of article ?

Keep in mind that the demo server was running on my laptop, from my home network connection. Also keep in mind that there is a cache mechanism, which helps a lot, and which I explicitly disabled at the beginning of this demo. The first time “related pages” are searched for a given page, it might take a little while, but on subsequent hits it’s close to instantaneous, since it’s a simple Ajax query between your browser and the Magnolia server. Lastly, this potential delay is the reason this is done using an asynchronous request; the list of related pages is secondary to what the user really wanted to see in the first place (i.e. the page we’re looking at).

The length of the document has no impact on the search time. SalsaDev does not use the document, but its statistical representation which is of constant size.
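To illustrate the caching point above, here is a minimal sketch in Python. It is not the module's actual code: the class name, the `fetch` callback and the time-to-live value are all assumptions made for the example. It only shows why the first lookup for a page pays for the remote call while later lookups do not.

```python
import time


class RelatedPagesCache:
    """Hypothetical in-memory cache of "related pages" results, keyed by page path."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch        # slow call to the external service (assumed)
        self._ttl = ttl_seconds    # how long a result stays valid (assumed value)
        self._clock = clock        # injectable clock, handy for testing
        self._entries = {}         # page path -> (expiry time, results)

    def related(self, page_path):
        now = self._clock()
        entry = self._entries.get(page_path)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no remote call at all
        results = self._fetch(page_path)   # cache miss: pay for the remote call
        self._entries[page_path] = (now + self._ttl, results)
        return results
```

With a cache like this in front of the remote service, only the first visitor of a given page (per expiry period) waits for the external call.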

Does the External Indexing module affect the activation time of Magnolia, does it block the activation process until the indexing is complete ?

Not at all. The indexing happens asynchronously, using observation features. During the demo, I switched to checking the log file for a moment, after activating all the article pages. (trying to jinx the demo-effect, and making sure everything went fine indeed). If you’re familiar with log messages from Magnolia, you’ll recognize activation success messages, and you’ll see Salsa indexing messages showing up a few seconds after that; pages are already available on the public instances at that point.
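The general pattern behind this answer can be sketched as follows, in Python for brevity. This is not how the module is implemented internally; `AsyncIndexer`, `on_content_changed` and `index_fn` are hypothetical names. The sketch only demonstrates the idea: the change notification returns immediately, and a background worker sends content to the external index afterwards.

```python
import queue
import threading


class AsyncIndexer:
    """Hypothetical sketch: queue change events, index them on a background thread."""

    def __init__(self, index_fn):
        self._index_fn = index_fn      # slow call to the external index (assumed)
        self._events = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def on_content_changed(self, page_path):
        # Called from the (hypothetical) observation hook; never blocks the caller.
        self._events.put(page_path)

    def _run(self):
        while True:
            page_path = self._events.get()
            try:
                self._index_fn(page_path)
            except Exception as exc:
                # A failed indexing call must not stop the worker.
                print(f"indexing failed for {page_path}: {exc}")
            finally:
                self._events.task_done()

    def wait_until_idle(self):
        # Only useful for tests or demos: block until the queue is drained.
        self._events.join()
```

Because the activation only enqueues an event, its duration does not depend on how long the external service takes to respond.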

Can you explicitly exclude search results ?

In the case of salsaDev, the External Indexing module, and the Similar service, for example, I would take the opposite approach: not excluding results, but preventing content from being indexed in the first place. The EventAccumulator and IndexGenerator classes allow you to fine-tune what gets indexed and how. These can be configured and registered via the module configuration.

We have scenarios where user rights are relevant not only for displaying a page, but also at a more granular, component level (that would be “paragraphs” in Magnolia). That means users who are not allowed to read a paragraph should not receive search results based on the content of that paragraph. Could this be realized with a salsaDev/Magnolia CMS integration ?

The module works at page level. It could probably be customized or even improved in a way that would allow finer-grained indexing. I’m not entirely sure how to approach this though. Perhaps a simple enough solution would be to make sure that the “restricted” paragraphs are not indexed along with the rest of the page ?
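As a rough illustration of that last idea, here is a small Python sketch. The data shape (a page as a list of paragraph dictionaries) and the `is_restricted` predicate are assumptions for the example, not the module's API. It builds the text sent to the index from the unrestricted paragraphs only.

```python
def indexable_text(paragraphs, is_restricted):
    """Join the text of every paragraph that is not restricted.

    `paragraphs` is a hypothetical list of dicts with a "text" key;
    `is_restricted` is a caller-supplied predicate (assumed).
    """
    return "\n\n".join(
        paragraph["text"]
        for paragraph in paragraphs
        if not is_restricted(paragraph)
    )
```

The trade-off of this approach is that restricted content never influences the results, even for users who are allowed to read it.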

To search on salsaDev, you have to specify which language is used in the text, which may not necessarily be convenient when you process free text automatically. Do you consider integrating a language detection function ?

The language attribute of the search API is used to select the languages of retrieved documents. SalsaDev is capable of detecting the language of both queries and documents (at least in the latest versions of the API), as well as executing cross-language information retrieval, hence providing the end-user with a choice of languages.

What languages are supported ?

Theoretically all languages are supported. The salsaDev team covers 4 of them at the moment; they’ve tested and enjoyed great performance in French, English, German and Spanish. They are currently thinking about supporting other languages such as Russian, Chinese or Arabic and are always looking for collaboration opportunities with possible beta-users of such languages who could contribute to the process of adding a language and provide feedback on the results.

How is the similarity between documents calculated ?

There are multiple methods to compute the similarity between documents, from the most naive tf-idf computation to the most extreme graph-edge computation. In the particular instance of salsaDev and this webinar, the similarity measure is based on an LSA projection. LSA (Latent Semantic Analysis) is a method for extracting “signal” from loads of noise. For salsaDev, that means trying to find sense within a lot of words. Those senses are represented as discrete mathematical objects which can be compared. Think of this similarity measure as a naive way of asking someone in the street a simple question: “How much do these 2 things talk about the same thing?”
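To make the “naive” end of that spectrum concrete, here is a minimal tf-idf and cosine similarity sketch in Python. To be clear, this is the baseline mentioned above, not salsaDev's LSA-based method, and the whitespace tokenization and function names are simplifications chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two documents only score above zero here if they share actual words, which is precisely the limitation that a projection such as LSA is meant to address: it compares documents in a reduced “sense” space rather than word by word.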

Does salsaDev also deal with any kind of media within the content and weigh that into the relation calculation? If not, would that even be feasible or does it simply not matter ?

SalsaDev concentrates purely on raw text. It is, however, possible to attach media information to the indexed item as metadata and let it play a crucial role.

For indexing, does the salsaDev service take advantage of HTML5 microdata and schema.org schemas? If not, are there plans to do so in the future ?

SalsaDev’s core speciality, when it comes to semantics, lies on the “implicit” side of the balance. SalsaDev currently does not leverage microdata or schema.org schemas. That said, they do take advantage of existing components that do, and there are future plans to provide more functionality to end-users with these technologies. But salsaDev wants to make sure their product remains simple to use.

How does this integrate with my OWL ontologies ?

It does not automatically integrate, but can leverage both definition and content of the ontology to create its categorizer or NLP model.

Are there tools to migrate OWL to NLP for Magnolia ?

Not that we know of. On the other hand, salsaDev would not recommend migrating an existing OWL ontology to NLP; it does not make much sense, since each serves a slightly different purpose and they are best used in conjunction.

I assume most of us already have a CMS in place and working, is there a way to use components that can semantic-enable our CMS ?

The demo in this webinar showed an integration option for Magnolia CMS. It’s only the tip of the iceberg of what can be done, even within the realm of Magnolia. As Stéphane mentioned in his presentation, one of the nice things about salsaDev is its API. I can confirm it’s indeed fairly easy to use, so have a look at the features they propose, and the rest (pun intended) is up to your imagination !

To our knowledge there are currently working integrations of salsaDev’s SearchBox for Magnolia CMS, Drupal, Liferay and Exo. While salsaDev does not directly initiate CMS integration, they strongly support communities or open-source efforts.

What are the costs of such a solution ?

The API works on a freemium model. It’s basically free to get the solution up & running, run it in prototype mode, or in production with a relatively small number of queries. Then, for more usage-intensive customers, they offer add-ons which consist of an annual subscription to be able to run more queries against the system. For specific prices, please contact salsaDev.

What’s next ?

I would also like to provide a little insight on what’s left to be done for this module; a few things are “unfinished” and need some polishing:

If you would like to discuss any of this further, hop on to the forum or leave a comment here ! There’s been a couple of questions/reactions that hinted at possible additions to the module, and I’d love to help you contribute – so let me know about that too.

As a side-note, if anyone was wondering, I wrote a script to scrape contents off Wikipedia and generate sample pages for this demo – and used the occasion to create a GitHub repository for such useful (or not so useful) scripts. Feel free to share, clone, and send pull requests my way !

8 thoughts on “Magnolia CMS and semantic technologies – follow-up on the SalsaDev webinar”

  1. Hi Gregory,

    Kind of related

    Just seen this, so a bit behind – is the external indexing module intended to be pushed towards general release – as I’d like to use it as a basis for indexing into other services ?

    regards, Jon

    • Hi Jon,

      That’s a possibility, but it’ll highly depend on interest and contributions. (and time…)

      What other services are you looking to index into ? Would you like to contribute that ?

      Cheers,

      -greg

      • Nothing especially spectacular – I was after getting data out into a reporting datawarehouse.

        If I get a chance to do it, I’ll post it back to you guys,

        cheers, Jon

        • Oh nice, that’s a use-case I hadn’t thought of at all! (assuming it’s data you’ll store for later use/analysis, but not use information from the same system back in Magnolia). In all modesty, it’s true that the module offers a nice framework that ties together observation of arbitrary content, filtering it (i.e. extracting html if needed) and sending that to some external systems.

          If I can do anything to help, let me know – and I’ll try, within the limits of my time.

          Snapshots are available on our Maven repository. If it’d help, I could cut a new pre-release, but I think 1.0-pre-4 is pretty much exactly what’s in the sources at this moment.

          • Hi Greg,

            I’ve had a chance to make some progress on this and have got to a stage where I’m indexing data to an external ‘datawarehouse’ DB.
            In order to get there I’ve extracted out key areas of external indexing and removed the salsa/forum dependency elements which were getting in my way.

            this means I can do reporting via a 3rd party system (jasper is the current example), but I’d like to integrate it back into a mgnl5.0 app.

            I’ve got a 5.0 app running but would like to use the visualizationsforvaadin addon – but I’m struggling to get this working – have any of you tried to deploy this addon in 5.0 yet ?

            cheers, Jon

  2. Pingback: Learn More about Semantics in Content Management

  3. Just noticed on the 5.0 preview site that pulse is due to have stats and analysis in – got any more information about this – I don’t really want to re-invent the wheel

    Should have said before I work for Sceneric (jon.holmes@sceneric.com), so I’m more than happy to carry on the conversation down that channel.

    cheers, Jon

    • Hey Jon,

      Great news about this. I’d be happy to have a look at what you have regardless of which version it works with !
      I’m not very up-to-date with regards to 5.0. How about hopping on to the forum (or user-list) ?
