Friday, March 22, 2019

Elastic Stack Multiple Index Search Examples

I am working with the Elastic Stack (Elasticsearch, Logstash, and Kibana) on a report where data needed to be joined across two indexes, matching on the level_X_id through level_Y_id attributes, which exist in both indexes.  NOTE: Each index can contain multiple documents with a given level_X_id / level_Y_id combination, not just a single exactly-matching document.

Environment:
Elasticsearch 5.0.2
Logstash 5.0.2
Kibana 5.0.2

Here are multiple ways to do that:
1.  Search multiple indexes in a single _search API call.  This returns all documents from both indexes whose field value(s) match the query.

GET /cherryshoe_primary_idx,cherryshoe_secondary_idx/_search
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "analyze_wildcard": false,
          "query": "level_1_id:7268 AND level_2_id:7292"
        }
      }
    }
  }
}
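
If you are not in the Kibana Dev Tools console, the same query can be sent with curl (assuming Elasticsearch is listening on localhost:9200):

curl -X GET "http://localhost:9200/cherryshoe_primary_idx,cherryshoe_secondary_idx/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "analyze_wildcard": false,
          "query": "level_1_id:7268 AND level_2_id:7292"
        }
      }
    }
  }
}'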

2. Use a terms query with a terms lookup, using the primary index as the lookup source and returning only the secondary index's documents.  This returns all secondary-index documents whose field value(s) also exist in the referenced primary-index document.  The Elasticsearch documentation for the terms lookup was not very clear, and it took me a while to get the query to work, so I will explain it in detail below:
  • Retrieves cherryshoe_secondary_idx documents where both the cherryshoe_primary_idx and cherryshoe_secondary_idx documents have the matching "level_1_id" value of "3629".
  • query.terms.level_1_id refers to the cherryshoe_secondary_idx attribute that the looked-up values are matched against.
  • query.terms.level_1_id.index and query.terms.level_1_id.id identify the lookup document: the document in cherryshoe_primary_idx whose _id is 3629 supplies the values to match on.
  • query.terms.level_1_id.path value of level_1_id refers to the field inside that lookup document's JSON structure, i.e. "_source.level_1_id".  You can see this in Kibana -> Discover -> cherryshoe_primary_idx: expand one of the results and switch from the "Table" view to the "JSON" view; the "_source" JSON object holds all the index attributes.
  • query.terms.level_1_id.type refers to the document JSON structure's "_type".  You can see this in the same Kibana JSON view; the "_type" JSON attribute has the value "logs".

Single "terms":
GET cherryshoe_secondary_idx/_search
{
    "query" : {
        "terms" : {
            "level_1_id" : {
                "index" : "cherryshoe_primary_idx",
                "type" : "logs",
                "id" : 3629,
                "path" : "level_1_id"
            }
        }
    }
}

I thought I could simply add multiple terms lookups to the query for additional attributes, but defining multiple terms lookups does not return the results you would expect.  For example, the query below runs with valid syntax but returns no data.  I haven't been able to find documentation saying that multiple terms lookups cannot be combined in one query, which is interesting, because a single terms lookup works fine.

Multiple "terms":
GET cherryshoe_secondary_idx/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "level_1_id": {
              "index": "cherryshoe_primary_idx",
              "type": "logs",
              "id": 3629,
              "path": "level_1_id"
            }
          }
        },
        {
          "terms": {
            "level_2_id": {
              "index": "cherryshoe_primary_idx",
              "type": "logs",
              "id": 3719,
              "path": "level_2_id"
            }
          }
        }
      ]
    }
  }
}

3. Use the multi search API (_msearch), which allows you to execute several search requests within the same API call.  It returns records from each index according to the query criteria you specify for that respective index.  NOTE: each "index" header and "query" body must be on a single line, since the request body is newline-delimited JSON.

POST /_msearch
{"index": "cherryshoe_primary_idx" }
{"query":{"bool":{"must":{"query_string":{"analyze_wildcard":false,"query":"level_1_id:3629 AND tier_2_fa_id:level_2_id"}}}}}
{"index": "cherryshoe_secondary_idx" }
{"query":{"prefix":{"level_id":"_3629_3719_"}}}

Sunday, February 3, 2019

Elasticsearch nuances - default field text length and lucene tokenizer

The web application I work on has a reporting module where data is ETLed with Logstash and stored in Elasticsearch.  In the reporting module you can specify multiple filters, i.e. a state filter, a program filter, etc.

Environment:
Elasticsearch 5.0.2
Windows 10

There were two issues that came up in recent months:

1. Looking at all field mappings for a particular index, you can see that each field of type "text" has a "keyword" sub-field with "ignore_above": 256, meaning values longer than 256 characters are not indexed into that keyword sub-field.  This is the default dynamic mapping for string fields.  Perform the following GET to retrieve the index mappings -
  • curl http://localhost:9200/{index_name}
  • i.e. curl http://localhost:9200/cherryshoe_idx
returns a JSON that looks something like -
{
  "cherryshoe_idx": {
    "aliases": {},
    "mappings": {
      "logs": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "@version": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "text_data_that_can_be_very_long": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "unique_id": {
            "type": "long"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1546610232085",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "cC1mdfLfSi68sZe6r-QNLA",
        "version": {
          "created": "5000299"
        },
        "provided_name": "cherryshoe_idx"
      }
    }
  }
}

PROBLEM and SOLUTION:
One of the filters was using the "text_data_that_can_be_very_long" field to filter on; values longer than 256 characters were not indexed into the keyword sub-field, so exact-match filtering on them failed.  Because of this, an additional field was added for the "id" value of the filter (text_data_that_can_be_very_long_id), the query was updated to filter on this "id" field instead, and the "ignore_above": 256 restriction was removed from "text_data_that_can_be_very_long", which is kept for data display purposes.

Updated field mapping json snippet:
  "text_data_that_can_be_very_long": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  },
  "text_data_that_can_be_very_long_id": {
    "type": "long"
  }
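
Applying a mapping change like this to an index that already has data generally means creating a new index with the corrected mapping and reindexing into it, since existing field mappings mostly cannot be changed in place and already-indexed documents are not re-processed either way.  Below is a rough curl sketch of that route - cherryshoe_idx_v2 is a hypothetical new index name, Elasticsearch is assumed to be on localhost:9200, and these are not necessarily the exact steps used here:

# 1. Create the new index, declaring only the fields whose mapping changed
curl -X PUT "http://localhost:9200/cherryshoe_idx_v2" -H "Content-Type: application/json" -d '
{
  "mappings": {
    "logs": {
      "properties": {
        "text_data_that_can_be_very_long": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword" } }
        },
        "text_data_that_can_be_very_long_id": { "type": "long" }
      }
    }
  }
}'

# 2. Copy the existing documents into the new index
#    (reindex copies _source as-is; values for the new *_id field would still come from the Logstash ETL)
curl -X POST "http://localhost:9200/_reindex" -H "Content-Type: application/json" -d '
{
  "source": { "index": "cherryshoe_idx" },
  "dest": { "index": "cherryshoe_idx_v2" }
}'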

2. As I mentioned above, the report can specify multiple filters, one of them being the state filter -

Monday, January 21, 2019

MySQL Stored Procedure with parameter dynamic filtering and sorting

I recently worked on a web application where we had to retrieve user-specific dashboard data in real time with dynamic paging, dynamic column sorting (ASC or DESC), and dynamic data filtering (i.e. by year). For the normal use case there would not be much user-specific data, but a small subset of users would have a large amount of it. OUT count parameters were also needed in addition to the selected column data.  For these reasons, MySQL stored procedures were chosen to meet this real-time requirement.


Environment: 
MySQL 5.7.10.0 on Windows 10
MySQL 5.7.10.0 on CentOS 6.7
MySQL 5.7.10.0 on RHEL 7

A MySQL temp table was chosen (over a view) because:
  1. We needed to pass parameters for dynamic filtering and sorting into the SELECT statement that retrieves the user data.  That is possible with a temp table, whereas a view's SELECT statement cannot contain a variable or a parameter (a known limitation).
    • NOTE: If a view's SELECT statement could contain variables and parameters, a view would have been chosen, because data in a view is always current (it is dynamically generated), whereas data in a temp table reflects the state of the database at the time it was populated and is only created once per session.  We want the data to always be current, even within the same session for a user.
  2. Temp tables are created per session, so you can have a temp table with the "same name" across different sessions; MySQL maintains a separate "copy" for each.
The "skeleton" of the stored procedure strategy is below:

Thursday, January 10, 2019

Spring Boot curl example with multipart/form-data and JSON metadata in POST request

Below are two curl examples of multipart/form-data POST requests with their matching Spring Boot controller method signatures: one with request parameter metadata, one with JSON request body metadata.

Environment:
Spring Boot / Spring Boot Starter Batch Version 1.5.6.RELEASE
Oracle Java 8

1. multipart/form-data POST request with request parameter metadata

curl:
curl -k -H 'Content-Type: multipart/form-data' -F 'docFiles=@test1.txt;type=text/plain' 'https://localhost:8443/cherryshoe-app/document/file/upload/REVIEW?id=3554&docTypeId=1'

-k turns off curl's verification of the self-signed certificate
-F specifies multipart form data; the @ prefix reads the part's contents from a file

Matching Java Controller Method Signature:

@RequestMapping(value = "/document/file/upload", consumes = "multipart/form-data", produces = "application/json;charset=utf-8", method = RequestMethod.POST)

@ResponseBody
public DocumentUploadMetadata fileUpload(
    @RequestParam(value = "id", required = true) final Long reviewId,
    @RequestParam(value = "docTypeId", required = true) final Long docTypeId,
    @RequestParam("docFiles") List<MultipartFile> docFiles, Principal principal)
    throws IllegalArgumentException, NullPointerException, Exception {
}


2. multipart/form-data POST request with JSON request body metadata

curl:
curl -k -H 'Content-Type: multipart/form-data' -F 'docFiles=@test1.txt;type=text/plain' -F metadata='{"reviewId":3554,"docTypeId":1,"customFileName":"customFileName.txt","commentText":"This is a really long text, it may even be in HTML"};type=application/json' 'https://localhost:8443/cherryshoe-app/document/file/upload'

-k turns off curl's verification of the self-signed certificate
-F specifies multipart form data; the @ prefix reads the part's contents from a file

Matching Java Controller Method Signature:
@RequestMapping(value = "/document/file/upload", consumes = "multipart/form-data", produces = "application/json;charset=utf-8", method = RequestMethod.POST)

@ResponseBody
public DocumentUploadMetadata fileUpload(@RequestPart("docFiles") List<MultipartFile> docFiles,
    @RequestPart("metadata") DocumentUploadMetadata metadata, Principal principal)
    throws IllegalArgumentException, NullPointerException, Exception {
}
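
The JSON metadata can also be read from a file with the same @ syntax, which is handy when the payload gets long.  Here metadata.json is a hypothetical file holding the same JSON as above; from curl's side this sends an equivalent application/json part, which the @RequestPart signature above should handle the same way:

curl -k -H 'Content-Type: multipart/form-data' -F 'docFiles=@test1.txt;type=text/plain' -F 'metadata=@metadata.json;type=application/json' 'https://localhost:8443/cherryshoe-app/document/file/upload'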

This signature also works; I had to do this to make it work with the ng-file-upload library:

/**
 * Takes a list of document MultipartFile's, and JSON metadata.
 * 
 * NOTE: To make this work with ng-file-upload, the JSON metadata has to be
 * passed as a String and then deserialized manually in the controller.
 * 
 * @param docFiles
 * @param metadata
 * @param principal
 * @return
 * @throws IllegalArgumentException
 * @throws NullPointerException
 * @throws Exception
 */
@RequestMapping(value = "/document/file/upload", consumes = "multipart/form-data", produces = "application/json;charset=utf-8", method = RequestMethod.POST)
@ResponseBody
public DocumentUploadMetadata fileUpload(@RequestParam("docFiles") List<MultipartFile> docFiles,
        @RequestParam("metadata") String metadataJson, Principal principal)
        throws IllegalArgumentException, NullPointerException, Exception {
  // deserialize metadataJson JSON string manually
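  // e.g. with Jackson's ObjectMapper (sketch):
  // DocumentUploadMetadata metadata = new ObjectMapper().readValue(metadataJson, DocumentUploadMetadata.class);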
}

POJO for JSON metadata serialization/deserialization:

public class DocumentUploadMetadata {
    private Long reviewId;
    private Long docTypeId;
    private String customFileName;
    private String commentText;

    public DocumentUploadMetadata() {
        super();
    }
    
    // getters and setters are needed for jackson serialization/deserialization, add them for this to work
}


Helpful curl documentation: https://curl.haxx.se/docs/manual.html


Monday, September 10, 2018

Kibana: Create index patterns and set default index pattern through curl

I needed to programmatically create Kibana index patterns and set the default index pattern to support an automated deployment.  The two curl commands below were used in a bash script to do that.  To get the $KIBANA_HIDDEN_INDEX value, open <kibana_home>/config/kibana.yml and note down the "kibana.index" value.

Environment, this worked on both:
centos-release-6-8.el6.centos.12.3.x86_64
kibana-5.0.2-linux-x86_64

Windows 10 with Git Bash (Cygwin)
kibana-5.0.2-windows-x86
  1. Create kibana index pattern (template command, then a concrete example):

     curl -X POST -H "kbn-xsrf:true" -H "Content-Type: application/json" -d "{\"timeFieldName\":\"@timestamp\",\"title\":\"$INDEX_PATTERN\"}" http://localhost:5601/elasticsearch/$KIBANA_HIDDEN_INDEX/index-pattern/$INDEX_PATTERN?op_type=create

     curl -X POST -H "kbn-xsrf:true" -H "Content-Type: application/json" -d '{"timeFieldName":"@timestamp","title":"cherryshoe_idx"}' http://localhost:5601/elasticsearch/.local/index-pattern/cherryshoe_idx?op_type=create

  2. Set default index pattern:

     curl -X POST -H "Content-Type: application/json" -H "kbn-xsrf: true" -d "{\"value\":\"$INDEX_PATTERN\"}" http://localhost:5601/api/kibana/settings/defaultIndex

     curl -X POST -H "Content-Type: application/json" -H "kbn-xsrf: true" -d '{"value":"cherryshoe_idx"}' http://localhost:5601/api/kibana/settings/defaultIndex

Updated instructions for Kibana 6.4.2:
Windows 10 with Git Bash (Cygwin)
kibana-6.4.2-windows-x86_64

  1. Create kibana index pattern (template command, then a concrete example):

     jsonValue="{\"attributes\":{\"title\":\"$INDEX_PATTERN\",\"timeFieldName\":\"@timestamp\"}}"
     curl -X POST -H "kbn-xsrf:true" -H "Content-Type: application/json" -d "$jsonValue" http://localhost:5601/api/saved_objects/index-pattern/$INDEX_PATTERN

     jsonValue="{\"attributes\":{\"title\":\"cherryshoe_idx\",\"timeFieldName\":\"@timestamp\"}}"
     curl -X POST -H "kbn-xsrf:true" -H "Content-Type: application/json" -d "$jsonValue" http://localhost:5601/api/saved_objects/index-pattern/cherryshoe_idx

  2. Set default index pattern:

     defaultIndexValue="{\"changes\":{\"defaultIndex\":\"$INDEX_PATTERN\"}}"
     curl -X POST -H "kbn-xsrf:true" -H "Content-Type: application/json" -d "$defaultIndexValue" http://localhost:5601/api/kibana/settings

     defaultIndexValue="{\"changes\":{\"defaultIndex\":\"cherryshoe_idx\"}}"
     curl -X POST -H "kbn-xsrf:true" -H "Content-Type: application/json" -d "$defaultIndexValue" http://localhost:5601/api/kibana/settings

This article helped a lot.

Friday, August 17, 2018

Bamboo SSH Task - calling a remote script that switches user (su) fails silently - posted to Atlassian Community

I recently submitted this question to the Atlassian Community about a problem I am facing with the Bamboo SSH Task.

Problem:
I have defined an SSH Task where the commands in the SSH Task input field are:
source /home/myuser/.bash_profile
cd /opt/app/myproject/build/TEST
./deploy_something.sh

The deploy_something.sh bash script needs to switch user to be able to start/stop a service (i.e. logstash) that is owned by a different user (i.e. elasticsearch):
su - elasticsearch -s /bin/bash -c "$LOGSTASH_DIR/bin/logstash -f $LOGSTASH_PIPELINES &" > /dev/null 2>&1

This works great when run directly on the remote server, but when called from the Bamboo SSH Task it dies silently.

I've disabled the "requiretty" and "!visiblepw" configuration via visudo as discussed in https://community.atlassian.com/t5/Answers-Developer-Questions/Execute-sudo-on-remote-agent/qaq-p/523834, thinking it may be similar to a sudo issue. Does not solve the problem.

The Environment details are below:
Bamboo Server: centos-release-6-9.el6.12.3.x86_64
Remote Server: centos-release-6-8.el6.centos.12.3.x86_64

Anyone know a workaround or solution for this?

Linux - How to remotely call a script with parameter over ssh

Below is an example of how to remotely call a bash script with a parameter over ssh.  I've done this multiple times in the past, but decided to document it this time so that next time I can just reference this post.  This assumes that passwordless ssh to the remote server already works for the intended user.

The example here performs an automated database script deployment for the sprint branch currently being worked on.  Atlassian Bamboo is used for CI, but the DB script part was still being done manually.  This was a home-grown solution; the SQL for Bamboo add-on was not available and was not used.

Environment: 
Source and Target DB server both centos-release-6-8.el6.centos.12.3.x86_64

The Bamboo server was already configured with SSH Tasks that called scripts to download the source code for the sprint branch onto the source server.  So by the time the cherryshoe_db_copy_deploy.sh script is called, the database script files already reside on the source server.  We need the sprint number passed as a parameter because we have to know which sprint folder to copy.  This script then preps folders on the DB server (cherryshoe_db_prep.sh), copies the DB scripts over, and runs the scripts on the DB server (cherryshoe_db_deploy.sh).
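
The copy/deploy script takes the sprint number as its only argument, so it is kicked off like this (18 is just an example sprint number):

./cherryshoe_db_copy_deploy.sh 18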

cherryshoe_db_copy_deploy.sh
#!/bin/sh
usage () {
  echo "Usage (sprint number)"
}

SPRINT_NUMBER=$1
if [ -z "$1" ]
  then
   usage
   exit 1
fi

BUILD_HOME=/opt/app/cherryshoe/cherryshoe
BUILD_DIR=/opt/app/cherryshoe/Deploy
REMOTE_USER=cherryshoeuser
REMOTE_MACHINE=10.21.14.72
REMOTE_PATH_TO_SCRIPT=/opt/app/cherryshoe/build/database/automated_deployment/TEST

echo SPRINT_NUMBER $SPRINT_NUMBER
echo GIT_USER $GIT_USER
echo BUILD_HOME $BUILD_HOME
echo REMOTE_USER $REMOTE_USER
echo REMOTE_MACHINE $REMOTE_MACHINE
echo REMOTE_PATH_TO_SCRIPT $REMOTE_PATH_TO_SCRIPT
###########################################################

cd $BUILD_HOME

echo Call remote script to create sprint DB automated deployment folder if not exists
ssh $REMOTE_USER@$REMOTE_MACHINE "cd $REMOTE_PATH_TO_SCRIPT;./cherryshoe_db_prep.sh $SPRINT_NUMBER"

echo Copy current branch DB scripts to DB server
cd $BUILD_HOME
cd database/scripts/release_scripts/$SPRINT_NUMBER
scp *.sql $REMOTE_USER@$REMOTE_MACHINE:$REMOTE_PATH_TO_SCRIPT/$SPRINT_NUMBER
scp *.sh $REMOTE_USER@$REMOTE_MACHINE:$REMOTE_PATH_TO_SCRIPT/$SPRINT_NUMBER

echo Call remote script to perform deployment of db scripts
ssh $REMOTE_USER@$REMOTE_MACHINE "cd $REMOTE_PATH_TO_SCRIPT;./cherryshoe_db_deploy.sh $SPRINT_NUMBER"

echo DONE

cherryshoe_db_prep.sh
#!/bin/sh

if [ -z "$1" ]
  then
    echo "No sprint folder specified"
    exit 1
fi

SPRINT_FOLDER=$1
echo SPRINT_FOLDER: $SPRINT_FOLDER

DB_SCRIPT_DEPLOY_HOME=/opt/app/cherryshoe/build/database/automated_deployment/TEST

echo  Create sprint DB automated deployment folder if not exists.
cd $DB_SCRIPT_DEPLOY_HOME
mkdir -p $SPRINT_FOLDER

echo DONE

cherryshoe_db_deploy.sh
#!/bin/sh

if [ -z "$1" ]
  then
    echo "No sprint folder specified"
    exit 1
fi


SPRINT_FOLDER=$1
echo SPRINT_FOLDER: $SPRINT_FOLDER

DB_SCRIPT_DEPLOY_HOME=/opt/app/cherryshoe/build/database/automated_deployment/TEST
DB_USERNAME=cherryshoe
DB_PASSWORD=cherryshoe
DB_SCHEMA=cherryshoetest

echo changing to sprint folder: $DB_SCRIPT_DEPLOY_HOME/$SPRINT_FOLDER
cd $DB_SCRIPT_DEPLOY_HOME/$SPRINT_FOLDER

# match only files ending in .sh (escape the dot and anchor it to the end of the line)
BASHFILE=$(ls | grep "\.sh$")
echo BASHFILE: $BASHFILE

echo check if bash file exists
if [ -e "$BASHFILE" ]
then
  echo make sprint script file readable and executable
  chmod +rx *.sh

  echo executing bash file
  ./$BASHFILE $DB_SCHEMA $DB_USERNAME $DB_PASSWORD
else
  echo bash file does not exist, SKIPPING
fi

echo DONE



This article helped a lot.