Sunday, October 12, 2014

Elasticsearch experimentation with Enron emails.

ES (Elasticsearch) is a distributed, real-time search and analytics engine. I have been using it for more than 18 months. It is the de-facto distributed search solution built on Lucene, and it is fast and easy to develop with.
Its features can be broadly classified as:

  • Full text search: search by words with boolean operators (e.g. (java OR technology) AND (bangalore OR hyderabad)); see the query sketch after this list
  • Easy to build filters, facets, & aggregations
  • Fast response times with large data volumes (usually measured in milliseconds)
  • Ability to make geo-localized searches & a configurable, robust ranking system
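As a flavour of the query DSL behind the first bullet, here is a minimal sketch of such a boolean full-text query sent over the REST API with the Python requests library (the candidates index and its fields are hypothetical, purely for illustration):

import json
import requests

# Hypothetical index and fields, purely for illustration.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"skills": "java technology"}},       # java OR technology
                {"match": {"location": "bangalore hyderabad"}}   # bangalore OR hyderabad
            ]
        }
    }
}

resp = requests.post("http://localhost:9200/candidates/_search",
                     data=json.dumps(query))
print(resp.json()["hits"]["total"])

The match query ORs the words within a field by default, and the bool "must" clauses AND the two conditions together.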

It's an incredible tool for collecting analytics information and, in my experience, possibly the easiest to build with.

Once we have convinced ourselves that ES is the best option for a given use case, product managers/analysts usually come up with questions like "what is the node/RAM/disk requirement for indexing (say) 1TB of data?" and "what indexing strategy should we use?" (index schema, index sharding that could be time based or data based, etc.). Unfortunately there is no immediate correct answer: a lot of data points and query characteristics need to be collected before we can even attempt an answer, although the people asking these questions are usually uncomfortable with waiting and with lengthy answers.
Here are a few questions to ask before trying to come up with an answer:
  • required indexing rate (acceptable/desirable limits)
  • total # of docs and their payload characteristics
  • latency requirements for searches (really, per type of search)
  • Full text search requirements (highlighting, analyzers, internationalization)
  • Indexing schema requirements (are there any join requirements with nested elements?)
  • Indexing schema metadata requirements (especially for date fields and their precision requirements)
Although internet literature claims ES scales to insane numbers, it is all contextual (nature of the data) and situational (nature of the hardware). There is no option but to experiment with the real, required data to verify the claims and quantify the hardware/software requirements, as suggested by ES creator Shay Banon.
ES has a large number of indexing options and tuning parameters. Exercises like the one below should help find the holy-grail combination of ES tuning parameters for a given use case.

Here is an effort, or a set of sample guidelines, that could be used to carry out such experiments. The usual approach of generating data with random values has the disadvantage that the content will have fewer (or more) unique terms than the target domain, which may not give the desired results and, worse, could be misleading. Lucene is mainly used for full-text search and stores data in an inverted fashion, i.e. terms pointing to docs, so the nature of the terms plays a very important role in deciding index size and query latency. In Lucene a term is stored once per field, and searches work by locating terms and finding the documents associated with them.
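To make the terms-to-docs idea concrete, here is a toy sketch of an inverted index in plain Python; it only illustrates the concept, not how Lucene actually implements it:

from collections import defaultdict

docs = {
    1: "meeting moved to friday",
    2: "friday gas trading meeting",
}

# Inverted index: each term is stored once and maps to the ids of the
# documents containing it -- conceptually what Lucene keeps per field.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

print(inverted["friday"])  # {1, 2}: a search locates the term, then its docs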

There is a set of free, large corpora of data available that could be used. Here I have picked the Enron email data: it is pretty decent in size, and since we mostly restrict the index size to numbers of this order, it is easy to extrapolate, as ES indexes are sharded to scale horizontally. It is also nice to experiment with ES features on a publicly available mass collection of "real" emails that is not bound by privacy and legal restrictions.
Although I wrote all my framework code for ES in Java (making use of the Spring Data Elasticsearch wrapper), I used Python for this exercise. It is awesome for running deploy tool-chains (pants), data processing pipelines and other offline systems, and it is now my de-facto scripting language, replacing Groovy. It is also easy to share, being a concise and reader-friendly language.

Before the approach & the code, here are some highlights & observations about this data.

1. If we enable the source (which is required for the highlighting feature) while indexing, the storage cost increases by 76%:
http://localhost:9200/_cat/indices
enron-email_without_source 746.5mb  
enron-email 1.2gb
No change in search latency was observed when the source was stored.

The general assumption that a Lucene index is 20-30% of the actual content size may not hold, as it depends on which fields are indexed. Even with better compression the final figure here is unlikely to change drastically; it is ~50% in this case (with source not enabled).
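For reference, the source-less variant can be created by disabling _source in the mapping; a minimal sketch, assuming an ES 1.x cluster on localhost and an email document type whose field names are just examples:

import json
import requests

ES = "http://localhost:9200"

# Recreate the index with _source disabled for the 'email' type.
# Without _source, highlighting and retrieving the original document
# from ES are no longer possible.
requests.delete(ES + "/enron-email_without_source")
requests.put(ES + "/enron-email_without_source", data=json.dumps({
    "mappings": {
        "email": {
            "_source": {"enabled": False},
            "properties": {
                "subject": {"type": "string"},
                "body":    {"type": "string"}
            }
        }
    }
}))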


2. Payload size distribution. It is important to match the use case in terms of how the real domain documents will be pumped into ES.
Here again the ES aggregation feature comes in handy for finding the payload size distribution.
Payload range    Count
0KB to 1KB       124181
1KB to 5KB       237870
5KB to 10KB      42331
10KB to 1MB      303


The ES aggregation used to extract this information is sketched below. I introduced a size field to capture the size of each document, and the above is the result. I think aggregations are the killer feature of ES for building data analytics applications.
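A minimal sketch of that range aggregation, assuming the document size in bytes is stored in the size field added at index time and the bucket boundaries match the table above:

import json
import requests

agg = {
    "size": 0,  # only the aggregation buckets are needed, not the hits
    "aggs": {
        "payload_sizes": {
            "range": {
                "field": "size",  # document size in bytes, added at index time
                "ranges": [
                    {"to": 1024},
                    {"from": 1024, "to": 5120},
                    {"from": 5120, "to": 10240},
                    {"from": 10240, "to": 1048576}
                ]
            }
        }
    }
}

resp = requests.post("http://localhost:9200/enron-email/_search",
                     data=json.dumps(agg))
for bucket in resp.json()["aggregations"]["payload_sizes"]["buckets"]:
    print(bucket.get("from", 0), bucket.get("to"), bucket["doc_count"])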

3. There were in total 66,043 unique 'to' addresses and 21,021 unique 'from' addresses. The Enron mails were extracted from the email in-boxes (with complete folder structure) of 148 employees.
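These unique counts can be pulled with a cardinality aggregation (approximate, available from ES 1.1 onwards); a sketch assuming the to and from addresses are indexed as not_analyzed fields:

import json
import requests

agg = {
    "size": 0,
    "aggs": {
        "unique_to":   {"cardinality": {"field": "to"}},
        "unique_from": {"cardinality": {"field": "from"}}
    }
}

resp = requests.post("http://localhost:9200/enron-email/_search",
                     data=json.dumps(agg))
aggs = resp.json()["aggregations"]
print("unique to:", aggs["unique_to"]["value"])
print("unique from:", aggs["unique_from"]["value"])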



Transform the raw Enron data into JSON and push it into ES.
ES works with JSON, so the raw data needs to be transformed before going into ES. Download the data & extract the content into a directory.
Here is the Python script used (written on a Windows machine). The script expects as its first parameter the parent directory where the Enron mails have been extracted; it recursively parses the emails in EML format and indexes them into ES. It also creates a fresh index (deleting the previous one if it exists), making it easy to run repeated experiments.
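A compact sketch of that flow (not the exact script; the field names and the one-request-per-document indexing are my simplifications):

import email
import json
import os
import sys

import requests

ES = "http://localhost:9200"
INDEX = "enron-email"

def index_maildir(root_dir):
    """Walk the extracted Enron mail directory and index each message into ES."""
    # Start from a clean slate so repeated experiments are comparable.
    requests.delete(ES + "/" + INDEX)
    requests.put(ES + "/" + INDEX)

    doc_id = 0
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, errors="ignore") as f:   # raw mails can have odd encodings
                msg = email.message_from_file(f)
            body = msg.get_payload() if not msg.is_multipart() else ""  # simplification
            doc = {
                "from": msg.get("From"),
                "to": msg.get("To"),
                "subject": msg.get("Subject"),
                "date": msg.get("Date"),
                "body": body,
                "size": os.path.getsize(path),  # used for the size aggregation above
            }
            doc_id += 1
            requests.put("%s/%s/email/%d" % (ES, INDEX, doc_id),
                         data=json.dumps(doc))

if __name__ == "__main__":
    index_maildir(sys.argv[1])

For loading the full corpus quickly, the bulk API and a relaxed refresh interval during indexing are the usual tweaks.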


Search and test the generated data.
Here is a Python script that shows how we can do searches and use other powerful analytics features like aggregations.
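As a flavour of what that script does, a minimal full-text search with highlighting against the indexed mails might look like this (the query terms are just an example); note that highlighting needs the source-enabled index from observation 1:

import json
import requests

query = {
    "query": {"match": {"body": "gas trading"}},
    "highlight": {"fields": {"body": {}}},
    "size": 5
}

resp = requests.post("http://localhost:9200/enron-email/_search",
                     data=json.dumps(query))
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["subject"])
    print(hit.get("highlight", {}).get("body", []))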



Although I dismissed the idea of using ES as primary storage (storing the content along with the index) rather than secondary storage at the very beginning, I now think it is worth revisiting given the availability of the snapshot feature.

Hope this helps someone trying out ES for the first time.