Elasticsearch in Production

Igor Kupczyński

igor@kupczynski.info

1 Intro

1.1 Elasticsearch

Distributed document store and search server based on Apache Lucene

  • Full-text search
  • Stores JSON documents
  • Accessible via a REST API
  • Based on Lucene
  • Distributed:
    • High availability
    • Scales out by adding nodes to the cluster
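
A quick taste of the REST API — a minimal sketch, assuming a local node on the default port 9200; the jug index and the customer document match the terminology example in the next section:

$ curl -XPUT localhost:9200/jug/customers/1 -d '{
    "name": "Sunrise Inc.", "location": "Dallas, TX"
  }'
$ curl -XGET 'localhost:9200/jug/customers/_search?q=name:sunrise&pretty'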

1.2 Basic Terminology

logical-struct.png

$ curl -XGET localhost:9200/jug/customers/1?pretty
{
  "_index" : "jug",
  "_type" : "customers",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":{"name": "Sunrise Inc.", "location": "Dallas, TX"}
}
Elasticsearch   RDBMS
Node            Database server
Index           Database/namespace
Type            Table
Field           Column
Document        Row in a table

  • Shard — a part of an index stored on a single node

1.3 Inverted Index

  • Doc 1: A big brown fox
  • Doc 2: Brown windows attract dogs
  • Query: What does the fox say?

Doc   Terms
#1    [big, brown, fox]
#2    [attract, brown, dog, window]
Q     [fox, say]

Term      Docs
attract   #2
big       #1
brown     #1, #2
dog       #2
fox       #1
window    #2
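
The term lists above are produced by analysis, which the _analyze API shows directly. A sketch, assuming a local node and the built-in english analyzer (it lowercases, drops stopwords such as "a", and stems plurals):

$ curl -XGET 'localhost:9200/_analyze?analyzer=english&pretty' -d 'Brown windows attract dogs'

This returns the tokens brown, window, attract and dog — roughly the terms in the table above.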

1.4 Node types

  • Data — store shards
  • Client — load balancer: request routing, distributing queries, merging results
  • Master — lightweight, cluster-wide tasks
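
Node roles are controlled by two settings in elasticsearch.yml; a sketch of the three combinations, one per node type (by default both are true):

# dedicated master — lightweight cluster-wide tasks only
node.master: true
node.data: false

# data node — stores shards
node.master: false
node.data: true

# client node — pure load balancer
node.master: false
node.data: false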

1.5 Cluster health

  • A shard is either a primary or a replica
  • The cluster can survive as many node failures as we have replicas
  • Green — everything is fine
  • Yellow — some replicas are not allocated
  • Red — some primaries are not allocated
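
The colour is reported by the health API; a quick check, assuming a local node:

$ curl localhost:9200/_cluster/health?pretty

The status field carries green/yellow/red, and unassigned_shards tells you how many shards are not allocated.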

2 Our Solution

2.1 Stats

  • Migrating an in-house application to Elasticsearch
  • The project is still in progress; only some customers have been migrated
  • We index denormalized file version content and metadata, plus folder metadata
  • A few billion documents
  • Dozens of TB of data
  • 21 data nodes in GCE + 9 client/master nodes
  • Three production clusters

2.2 Architecture

arch.png

3 Distributed Aspects

3.1 CAP theorem

In the presence of network partitions, we can't have both consistency and availability.

Watch out for:

  • Split brain
  • Availability over consistency

3.2 Split brain

  • Two elected masters
  • The cluster goes into an inconsistent state — lost or conflicting updates
  • GitHub faced it
  • Solution: 3+ dedicated master nodes, minimum_master_nodes = n/2 + 1

elasticsearch.yml:

discovery: 
  zen: 
    minimum_master_nodes: 2
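
With the quorum in place, every node should agree on a single elected master; the _cat API gives a one-line answer (assuming a local node):

$ curl 'localhost:9200/_cat/master?v'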

3.3 Call me maybe

You should store your data in a real database and replicate it to Elasticsearch

— @aphyr #CraftConf

https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0

transitive-split-brain.png

3.4 Consistency vs Availability

Scenario:

  1. We have one replica
  2. Node #1 goes down → the cluster goes yellow
  3. Node #2 goes down → the cluster goes red, some shards are missing
  4. A search request sees partial data right now
  5. On an incoming update ES recreates the missing shard; the recreated shard is empty or contains only the one document from the request
  6. Node #2 comes back up and rejoins the cluster — two versions of a single shard, i.e. conflicting updates and data loss

The cluster was available for writes (and reads) the whole time, but the data is inconsistent at the end.

How to avoid it?

Do not let Elasticsearch recreate missing shards. Make it drop write requests and return partial results on reads.

Example config for a 4-data-node cluster (elasticsearch.yml):

gateway: 
  expected_data_nodes: 4
  recover_after_time: 48h

3.5 Rolling restart

  • You can restart a cluster without downtime or data loss
  • This can be used to upgrade cluster nodes one by one
  • Procedure: disable shard allocation, restart the node, re-enable allocation, wait for green, repeat — sketched below
  • It can sometimes take hours or even days
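
The key step is toggling shard allocation around each restart, so the cluster does not start rebuilding the restarted node's shards elsewhere. A sketch against the cluster settings API (ES 1.x setting names):

$ curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient": { "cluster.routing.allocation.enable": "none" }
  }'

# ... restart/upgrade the node and wait for it to rejoin ...

$ curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient": { "cluster.routing.allocation.enable": "all" }
  }'

Wait for green before moving on to the next node.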

4 Memory management

4.1 Basics

  • The more memory on the box the better
  • Give 50% of RAM to the JVM heap (see the sketch below)
  • Leave the other 50% to Lucene; it relies on off-heap storage and the OS filesystem cache
  • Keep the heap below 32GB so the JVM can use compressed object pointers
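
In ES 1.x the heap is usually set through the ES_HEAP_SIZE environment variable; a sketch for a 64GB box (the exact value is an assumption — the point is ~50% of RAM and below 32GB):

# picked up by bin/elasticsearch or the service scripts at start-up
export ES_HEAP_SIZE=30g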

4.2 mlockall

Use mlockall to avoid swapping and to make sure all the memory is locked at start up.

elasticsearch.yml

bootstrap:
  mlockall: true

But make sure the OS allows it; otherwise you will see a warning like this:

[2015-02-27 13:58:00,828][WARN ][common.jna               ]
Unable to lock JVM memory (ENOMEM). This can result in part
of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).
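
If the warning shows up, the usual fix is to raise RLIMIT_MEMLOCK for the user running Elasticsearch; a sketch for a typical Linux install (the elasticsearch user name is an assumption):

# /etc/security/limits.conf
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

Then verify on every node: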
curl -s $URL/_nodes/process?pretty | grep mlockall
        "mlockall" : true
        "mlockall" : true
        "mlockall" : true
        "mlockall" : true
        "mlockall" : true

4.3 Fielddata

  • Cache of all values for a field
  • Stored in-memory
  • Expensive to calculate
  • Used in aggregations, sorting, faceting
  • By default unbounded and never expires

Elasticsearch uses a circuit breaker to avoid OOM

CircuitBreakingException[Data too large, data for field [canonicalFolderPath]
would be larger than limit of [19285514649/17.9gb]]

4.4 Fielddata — Solutions

  • Redesign the queries
  • Bound the cache
    • But monitor evictions!
  • Add more memory or more nodes
  • Use doc values where possible (see the sketch below)

https://kupczynski.info/2015/04/06/fielddata.html
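
A sketch of the last two knobs. The fielddata cache is bounded with a static node setting (40% is only an illustrative value):

# elasticsearch.yml
indices.fielddata.cache.size: 40%

Doc values have to be declared when a field is first mapped, so the sketch creates a fresh index (jug-v2 is just an illustrative name; in ES 1.x doc values require not_analyzed string or numeric/date fields):

$ curl -XPUT localhost:9200/jug-v2 -d '{
    "mappings": {
      "customers": {
        "properties": {
          "location": { "type": "string", "index": "not_analyzed", "doc_values": true }
        }
      }
    }
  }'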

4.5 Garbage Collection

  • Concurrent Mark Sweep (CMS)
    • Stop-the-world pauses
    • Non-compacting
    • Not designed for very large heaps
  • Do not use G1GC as it has a history of issues with ES/Lucene
  • Do not tune it — monitor it!

Long stop-the-world pauses may result in nodes leaving the cluster.

[2015-04-10 15:42:20,057][WARN ][monitor.jvm ] [sjc-elasticsearch-data03-si]
[gc][old][708332][307] duration [7.3m], collections [37]/[7.3m], total [7.3m]/[25.6m],
memory [29.9gb]->[26.7gb]/[29.9gb], all_pools {[young] [532.5mb]->[341.3mb]/[532.5mb]}
{[survivor] [62.4mb]->[63.9mb]/[66.5mb]}{[old] [29.3gb]->[26.3gb]/[29.3gb]}

Causes of long GC:

  • Cache evictions
  • Small heap
  • Too many filters in a query

Watch out for OOMs!

5 Monitoring

5.1 Stats to Monitor

  • Cluster health
  • Heap usage + GC patterns
  • Load/CPU usage
  • Index growth
  • Query response time
  • Other
    • Cache evictions
    • Thread pools

5.2 Plugins to Help You

Kopf — cluster overview

kopf.png

HQ — basic diagnostics

hq.png

Bigdesk — node level monitoring

bigdesk.png

Nagios:

  • Cluster health
  • Nodes up/down
  • Check if search is working

Other:

  • Access log (include customer id/name)

5.3 APIs

  • _cluster/stats
  • _cat
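
A few handy calls (assuming a local node; ?v adds column headers to the _cat output):

$ curl 'localhost:9200/_cluster/stats?pretty'
$ curl 'localhost:9200/_cat/health?v'
$ curl 'localhost:9200/_cat/nodes?v'
$ curl 'localhost:9200/_cat/thread_pool?v'
$ curl 'localhost:9200/_cat/fielddata?v'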

6 Summary

6.1 What was covered

  • Basic concepts and building blocks
  • Pitfalls of distributed systems
  • Memory management
  • What to monitor

6.2 What is Missing?

  • OPS engagement
  • Data modeling
  • Sharding strategy
  • Sizing the cluster

6.3 Useful Resources

6.4 Thanks