Sunday, February 9, 2014

Alfresco - indexing document metadata only - confirmation with Luke

Enterprise Alfresco 4.1.2 has two methods for controlling content and metadata indexing, and therefore search, across all content nodes.  Let's explore the two methods, experiment with changing the out-of-the-box behavior, and verify those changes.

NOTE: There may be a way to apply indexing configuration only to certain content mimetypes.  I am still verifying this and will update this blog in the future.

Method 1
Enterprise Alfresco 4.1.2 automatically indexes document content and metadata out of the box.  This is defined in the Alfresco Content Domain Model contentModel.xml by the following:
<aspect name="cm:indexControl">
    <title>Index Control</title>
    <properties>
        <property name="cm:isIndexed">
            <title>Is indexed</title>
            <type>d:boolean</type>
            <default>true</default>
        </property>
        <property name="cm:isContentIndexed">
            <title>Is content indexed</title>
            <type>d:boolean</type>
            <default>true</default>
        </property>
    </properties>
</aspect>
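
To check which defaults a given model file actually declares, the aspect definition can be read programmatically.  This is a minimal Python sketch and not part of Alfresco: the function name index_control_defaults is hypothetical, and it assumes XML shaped like the fragment above.  It strips any XML namespace from tag names so that it might also work against a full model file, which I have not verified.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def index_control_defaults(xml_text, aspect_name="cm:indexControl"):
    """Return {property name: default value} for the given aspect.

    Hypothetical helper: assumes <aspect>/<properties>/<property>/<default>
    nesting as in the fragment shown in this post.
    """
    root = ET.fromstring(xml_text)
    defaults = {}
    # iter() also visits the root itself, so a bare <aspect> fragment works.
    for aspect in root.iter():
        if _local(aspect.tag) != "aspect" or aspect.get("name") != aspect_name:
            continue
        for prop in aspect.iter():
            if _local(prop.tag) != "property":
                continue
            for child in prop:
                if _local(child.tag) == "default":
                    defaults[prop.get("name")] = (child.text or "").strip()
    return defaults


fragment = """
<aspect name="cm:indexControl">
    <properties>
        <property name="cm:isIndexed">
            <type>d:boolean</type>
            <default>true</default>
        </property>
        <property name="cm:isContentIndexed">
            <type>d:boolean</type>
            <default>true</default>
        </property>
    </properties>
</aspect>
"""
print(index_control_defaults(fragment))
# prints {'cm:isIndexed': 'true', 'cm:isContentIndexed': 'true'}
```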

You can control indexing for only certain object types (folder, content) by modifying the contentModel.xml data dictionary, by following these instructions.

Method 2
Solr is the default search and indexing subsystem for Alfresco 4.1.2.  Another way to alter indexing behavior is to change the solrcore.properties configuration file of each Solr core.  The two attributes below disable content indexing and metadata indexing, respectively; using both disables both:
alfresco.index.transformContent=false
alfresco.ignore.datatype.1=d:content
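
Since the same change has to be made in every core's solrcore.properties, it can help to script it.  The following is a minimal Python sketch under my own assumptions: set_properties is a hypothetical helper (not an Alfresco tool), it only handles simple key=value lines, and it works on the file's text so you decide where to read it from and write it back to.

```python
def set_properties(text, updates):
    """Return properties text with each key in `updates` set to its value.

    Existing keys are rewritten in place, missing keys are appended, and
    comment lines (starting with '#' or '!') are left untouched.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(("#", "!")) and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                out.append(f"{key}={remaining.pop(key)}")
                continue
        out.append(line)
    # Append any keys that were not already present in the file.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


original = "# sample core settings\nalfresco.index.transformContent=true\n"
print(set_properties(original, {
    "alfresco.index.transformContent": "false",
    "alfresco.ignore.datatype.1": "d:content",
}), end="")
# prints:
# # sample core settings
# alfresco.index.transformContent=false
# alfresco.ignore.datatype.1=d:content
```

To apply it for real, read each core's solrcore.properties, pass the text through set_properties, and write the result back; keep a backup copy of the original file first.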

Use Case: Disable content indexing only
If your application does not need full-text search of document content, then turning off content indexing can improve performance.  Method 2 controls indexing across all document types throughout the repository.  Using only the first attribute (alfresco.index.transformContent=false) disables content indexing for all documents that are introduced into the system.  This is described in Alfresco's documentation; even though the documentation states that this only works in 4.1.3 and above, it also works in 4.1.2.

Setting transformContent to false disables content indexing because Alfresco performs content indexing by first transforming a document into a plain-text (text/plain) document, then indexing that text document.  So if you turn off transformation of the document, the document's content cannot be indexed.

Verify indexing (or not indexing) using Luke
Since Solr uses the Lucene Java search library at its core for full-text indexing and search, we can inspect the index directly with Luke to verify what was indexed.  We'll be using the executable lukeall-0.9.9.1.jar because this is the version compatible with the Lucene 2.9.3 in Alfresco 4.1.2, and it has all necessary dependencies included.

Luke step-by-step
  • First, find the system:node-dbid of the node whose index entry you wish to look at, via the Alfresco Explorer Node Browser.  In our case, that's 534.
  • Start the Luke GUI by executing the luke jar:
    • cherryshoe@ubuntu:~/Downloads$ java -jar lukeall-0.9.9.1.jar
  • When the GUI opens, navigate to the index folder:
    • i.e.: /opt/alfresco-4.1.2/alf_data/solr/workspace/SpacesStore/index
  • Navigate/find your document.  For my example:
    • Navigated to 'Documents' tab.  Chose '@{http://www.alfresco.org/model/content/1.0}content.mimetype' as the term. Clicked 'Next Term' until the 'text/plain' value was selected.  Clicked 'Show All Docs'.
  • This brings you to the 'Search' tab.  Click the 'right arrow' until you find the latest set of documents.  Scroll to the right until you find the column 'dbid', and find the row with 534.   NOTE: Luke's document number has no correlation to the dbid!
  • Double-click the row; this brings you to the 'Documents' tab again.  Click on 'Reconstruct & Edit' to open up the document details window.
Findings for each index combination:
Using text/plain (.txt) files as examples.
  1. Default out-of-the-box Alfresco: content and metadata indexing enabled

    Both content and metadata are indexed.  Notice in the document details below that content, content._, content.encoding, content.mimetype, content.size, and content.transformationStatus are all there, and you can see the content itself in the tab on the right side (judy ed romano).
    Document Details Window

  2. Content and metadata indexing disabled

    Add the following to the workspace-SpacesStore solrcore.properties (and to any additional cores).
    alfresco.index.transformContent=false
    alfresco.ignore.datatype.1=d:content

    This causes neither the content nor the metadata to be indexed; the content won't even show up in Luke since it wasn't indexed.  You can double-check by going to the Alfresco Explorer Node Browser and doing a Lucene search for the mimetype in question; you won't find the document: @cm\:content.mimetype: "text/plain".

  3. Metadata indexing only

    Add the following to the workspace-SpacesStore solrcore.properties (and to any additional cores).
    alfresco.index.transformContent=false

    This causes only the metadata to be indexed.  Notice in the document details below that only content.encoding and content.mimetype are available; the content itself was not indexed.
    Document Details Window
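
The Node Browser check used in the findings depends on escaping the colon in the property prefix (cm\:), which is easy to get wrong when typing the query by hand.  Here is a minimal Python sketch that builds the same query string; property_query is a hypothetical helper of mine, and it assumes the value contains no double quotes rather than trying to escape them.

```python
def property_query(prefixed_property, value):
    """Build a Lucene property query like the mimetype check in this post.

    `prefixed_property` is a prefixed name such as 'cm:content.mimetype'.
    The colon after the prefix is backslash-escaped so that it is not
    read as the field separator.
    """
    prefix, sep, local = prefixed_property.partition(":")
    if not sep or not prefix or not local:
        raise ValueError("expected a prefixed name such as 'cm:content.mimetype'")
    if '"' in value:
        raise ValueError("values containing double quotes are not handled here")
    return f'@{prefix}\\:{local}: "{value}"'


print(property_query("cm:content.mimetype", "text/plain"))
# prints @cm\:content.mimetype: "text/plain"
```

Paste the printed string into the Node Browser with the query language set to lucene.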

2 comments:

  1. So Luke has to have filesystem access to the SOLR index files? Can it run remotely, or in text mode, e.g. in ssh terminal windows? Or would you have to run it over X-Windows in that situation?

  2. Hi David! Thank you for your comment.

    Yes, Luke needs access to the Lucene index files, whether they are directly on the box you're on or in a mounted folder accessible from where you are running Luke.  For example, I mounted the folder where the Lucene indexes are located onto my Windows machine, and Luke was able to access that remote folder.
    These two look like alternatives to Luke that have a command-line interface; I have not had a chance to play with them yet.
    Lucli (http://manpages.ubuntu.com/manpages/natty/man1/lucli.1.html)
    CLue (https://github.com/javasoze/clue). CLue's description sounds like what you are looking for! (Often times it is not feasible to inspect an index on a remote machine using a GUI. You can ssh into your production box and inspect your index using your favorite shell. Another important feature for Clue is the ability to interact with other Unix commands via piping, e.g. grep, more)
