Sunday, February 3, 2019

Elasticsearch nuances - default text field keyword length and Lucene tokenizer

The web application I work on has a reporting module where data is ETLed with Logstash and stored in Elasticsearch. In the reporting module you can specify multiple filters, e.g. a state filter, a program filter, etc.

Environment:
Elasticsearch 5.0.2
Windows 10

There were two issues that came up in recent months:

1. Looking at all field mappings for a particular index, you can see that fields of type "text" have a "keyword" sub-field with "ignore_above": 256, meaning values longer than 256 characters are not indexed in that sub-field.  This is the default dynamic mapping for "text" fields.  Performing the following GET retrieves the index field mappings -
  • curl http://localhost:9200/{index_name} 
  • i.e. curl http://localhost:9200/cherryshoe_idx 
returns a JSON that looks something like -
{
  "cherryshoe_idx": {
    "aliases": {},
    "mappings": {
      "logs": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "@version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "text_data_that_can_be_very_long": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "unique_id": {
            "type": "long"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1546610232085",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "cC1mdfLfSi68sZe6r-QNLA",
        "version": {
          "created": "5000299"
        },
        "provided_name": "cherryshoe_idx"
      }
    }
  }
}
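
To see the effect of "ignore_above", you can index a value longer than 256 characters and then search the keyword sub-field for it. This is a minimal sketch against the index above, assuming a bash-compatible shell (e.g. Git Bash on Windows); the document id 1 and the generated 300-character value are made up for illustration:

  # Build a value longer than 256 characters (300 'a' characters)
  LONG_VALUE=$(printf 'a%.0s' {1..300})

  # Index a test document containing the long value (document id 1 is made up)
  curl -XPUT "http://localhost:9200/cherryshoe_idx/logs/1?refresh=true" \
    -d "{\"text_data_that_can_be_very_long\": \"$LONG_VALUE\"}"

  # Searching the keyword sub-field for the same value returns no hits,
  # because "ignore_above": 256 skips values longer than 256 characters
  curl -XPOST "http://localhost:9200/cherryshoe_idx/logs/_search" \
    -d "{\"query\": {\"term\": {\"text_data_that_can_be_very_long.keyword\": \"$LONG_VALUE\"}}}"

A match query against the "text" field itself would still find the document, since only the keyword sub-field is subject to "ignore_above".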

PROBLEM and SOLUTION:
One of the filters used the "text_data_that_can_be_very_long" field to filter on; values longer than 256 characters were not indexed in the keyword sub-field because of the length restriction, so the filter sometimes missed records.  Because of this, an additional field was added to hold the filter value's "id" (text_data_that_can_be_very_long_id), the query was updated to filter on that "id" field instead, and the "ignore_above": 256 restriction was removed from "text_data_that_can_be_very_long" for data display purposes.  A sample query against the new "id" field is sketched after the updated mapping snippet below.

Updated field mapping JSON snippet:
  "text_data_that_can_be_very_long": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  },
  "text_data_that_can_be_very_long_id": {
    "type": "long"
  }
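
With the new field in place, the filter query can match on the numeric id instead of the long text value. A minimal sketch of such a query is below; the id value 12345 is made up for illustration:

  curl -XPOST "http://localhost:9200/cherryshoe_idx/logs/_search" -d'
  {
    "query": {
      "term": {
        "text_data_that_can_be_very_long_id": 12345
      }
    }
  }'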

2. As mentioned above, the report can specify multiple filters, one of them being a state filter.

PROBLEM and SOLUTION:
If a user chose only "Virginia", the report returned both "Virginia" and "West Virginia" records (it should only return "Virginia"). The Lucene query portion constructed used "state_name:Virginia", but this retrieves documents whose state_name attribute is "Virginia" as well as "West Virginia".

This is because of the way Elasticsearch tokenized the state_name value: "West Virginia" is broken into separate "West" and "Virginia" tokens, so "West Virginia" documents were incorrectly retrieved along with "Virginia" records.
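
The tokenization can be verified with the _analyze API. A quick sketch, assuming state_name uses the default standard analyzer:

  curl -XGET "http://localhost:9200/cherryshoe_idx/_analyze" -d'
  {
    "analyzer": "standard",
    "text": "West Virginia"
  }'

The response lists two tokens, "west" and "virginia" (the standard analyzer also lowercases), which is why "state_name:Virginia" matches "West Virginia" documents.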

The fix was to add ".keyword" after the state_name attribute (i.e. "state_name.keyword:Virginia"). The .keyword sub-field comes by default (meaning it didn't have to be specially defined) and indexes the entire value as a single token, in other words an exact match.
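
For reference, assuming the report builds a Lucene query string (as the state_name:Virginia fragment suggests), the corrected search might look something like this sketch using a query_string query:

  curl -XPOST "http://localhost:9200/cherryshoe_idx/logs/_search" -d'
  {
    "query": {
      "query_string": {
        "query": "state_name.keyword:Virginia"
      }
    }
  }'

Because state_name.keyword is indexed as a single un-analyzed token, only documents whose state_name is exactly "Virginia" are returned.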
