Tuesday, April 26, 2016

Cloud Design Patterns

Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications (Microsoft patterns & practices)

Design patterns serve as great communication tools during design conversations with various stakeholders. Most of these ideas we already know, but they are packaged in a succinct manner, and their value is limited if the majority of stakeholders are unaware of the terminology. Examples like the GoF patterns and the enterprise patterns from Martin Fowler served a great purpose in capturing the imagination of developers and improving design communication.
I believe this documentation from Microsoft has the same level of maturity and is very well written, with excellent diagrams. It is the best I have seen in this space.
This 236-page PDF document on Cloud Design Patterns is divided into 8 focus areas, each consisting of a list of design patterns. Also note that some of the design patterns fall into multiple categories, but the intention remains the same.

Most of the samples are in C#, and I intend to re-create them using open-source implementations from the Java ecosystem (best of breed, of course). I will publish all the samples on GitHub as I keep reading these pattern implementations and best practices. I believe this will serve as a great learning tool. I will update this reference list as I complete the coding for each pattern.

Availability


Data Management


Design and Implementation


Messaging


Management and Monitoring


Performance and Scalability


Resiliency


Security


And a big thanks to the Microsoft engineers for writing this.

Sunday, August 02, 2015

Java-8 For Python Developers


Abstract

I wrote this up for a session on Java programming semantics for Python programmers. The theme was to convey that Java (especially Java 8) is not as verbose as we (Python programmers) think :)

The Java 8 release is touted as the biggest release since the language's inception in terms of paradigm shift, with its support for functional programming. Given the backward-compatibility burden, the Java engineering team has made a brilliant effort at integrating functional programming seamlessly, most importantly by adding power to all the existing old single-method classes and enabling Java collections with stream classes. Boilerplate code is going to be reduced drastically. This write-up on "Java 8 for Python developers" is for developers who are well versed in Python and familiar with basic Java syntax. The content should provide enough orientation to navigate the Java ecosystem's libraries, tools and programming semantics.

As this is the most significant release in terms of adding support for functional programming, many of the previous assumptions about Java's verbosity deserve a re-look. Java is a platform that supports multiple languages, and its ecosystem is surely bigger than Python's, so here the focus is only on language semantics. It is really hard for any language to stay popular for a long time. I guess Java has re-invented itself well and is giving a tough fight to other languages despite its age and baggage.

In this post I will go through an overview, and in the next series we will see how new Java 8 features like lambdas result in less boilerplate code (method references, concurrency), how new annotations improve decorator-style coding, and how many of the popular external libraries (Guava, Apache Commons, GS Collections, Joda-Time) can be dropped (they are largely obsolete, as most of these features are now available in the JDK itself). The improved JavaScript engine that is part of the core JDK helps to reuse existing JavaScript code and can make code closer to its Python counterpart in most cases.

Topics

------------
History
------------
Java:
Created by Canadian James Gosling in 1996, influenced by Oak/C++. It is backed by a corporation (Sun/Oracle) and is regarded as a #1 language/platform for developing all kinds of applications.

Python:
Created by Dutchman Guido van Rossum in 1994, influenced by ABC. It is a purely community-driven project, although its creator is called the "Benevolent Dictator For Life", and it is regarded as one of the most developer-friendly languages.

Both are object oriented (classes, encapsulation, inheritance, polymorphism), but Python is the more scripting/functional friendly language. Python is more object oriented in the sense that it does not have primitives and treats everything, including functions, as objects. Java still deals with low-level primitives and hence is more performant.

---------------------
General Mapping
---------------------


Python                      Java
.py                         .java
.pyc                        .class
python.exe                  java.exe + javac.exe
pypi                        Maven central repo
PyCharm                     IntelliJ IDEA
pip                         maven
requirements.txt            pom.xml / build.gradle
PYTHONPATH                  CLASSPATH
CPython                     HotSpot/JIT
Jython/IronPython/PyPy      JRockit/Oracle JDK/Dalvik



Java is a statement-oriented language: you write a statement, and when it executes, it has an effect. REPLs, by contrast, are expression-oriented: you write an expression, and the REPL shows the result, like a calculator. Hence a REPL was never super useful with Java. (I remember hearing this from Scala's creator Martin Odersky.)


A REPL (Read-Eval-Print Loop) will definitely be greatly missed by Python developers. But if you use jython or groovysh, they serve the purpose and give the same feel of interactive development.
The REPL coming with Java 9 (JShell) looks impressive.

For ex:
D:\jython2.7.0\bin>jython
Jython 2.7.0 (default:9987c746f838, Apr 29 2015, 02:25:11)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_25
Type "help", "copyright", "credits" or "license" for more information.
>>> 1+1
2
>>> from java.util import Arrays as A
>>> numbers = A.asList(3,4,2,5,1)
>>> print numbers
[3, 4, 2, 5, 1]
>>> from java.util import Collections as C
>>> C.sort(numbers)
>>> print numbers
[1, 2, 3, 4, 5]
>>>

In general, Python has a Linux orientation, and code quality is often better than the corresponding Java code because of the people involved. It is a favorite with sysadmins and the scientific community. Lots of awesome libraries exist in Python compared to other languages. It is a great glue language, i.e. performance-intensive parts can easily be moved to native code (C++), and Python has the best interface mechanisms to include them. As I heard from Jessica, a great Python evangelist, on Python's future: the focus is to get Python working equally well on "Windows" and "mobile".

-------------------------
Language semantics
-------------------------
In Java, all variable names (along with their types) must be explicitly declared. Attempting to assign an object of the wrong type to a variable triggers a compile-time error. That is what it means to say that Java is a statically typed language. In Java you have to declare variables with their types and use '{' '}' for scope. Python does not explicitly type parameters (Python 3 allows annotations optionally), and apparently Google is moving its legacy code to Python 3 mainly because of types. If we look carefully, type checking is actually winning; people are dying to type. "You don't have to type" is not really a cool thing anymore.

In Python if a name is assigned to an object of one type, it may later be assigned to an object of a different type. That’s what it means to say that Python is a dynamically typed language.

In Java there are access specifiers (private, public, protected) for encapsulation; there is nothing like that in Python, which instead follows a convention of prefixing names with '_' to tell developers to stay away from them.
Because of private variables there is a lot of verbose getter/setter code for all the fields in a class. Python solves this problem pretty well by letting you add it on an on-demand basis (get_attr). But sealing fields/methods with 'final' is a great tool to communicate design.
In Java getters/setters are required because public fields give no opportunity to go back and change things later. To be fair, there are builder patterns and libraries like Lombok and Immutables that make this type-safe and faster to implement as well.
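To make that concrete, here is a minimal hand-written builder sketch for an immutable class; the Person class and its fields are made up for illustration, and Lombok's @Builder or Immutables can generate the equivalent code.

// A minimal, hand-written builder sketch for an immutable value class.
// The Person class and its fields are hypothetical, for illustration only.
public final class Person {
    private final String name;   // required
    private final int age;       // optional, defaults to 0

    private Person(Builder builder) {
        this.name = builder.name;
        this.age = builder.age;
    }

    public String getName() { return name; }
    public int getAge() { return age; }

    public static class Builder {
        private final String name;
        private int age;

        public Builder(String name) { this.name = name; }
        public Builder age(int age) { this.age = age; return this; }
        public Person build() { return new Person(this); }
    }

    public static void main(String[] args) {
        Person p = new Person.Builder("Guido").age(60).build();
        System.out.println(p.getName() + " " + p.getAge());
    }
}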

Java uses method overloading heavily. Python does not have method overloading; with default argument values it is not necessary.
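A quick sketch of how Python's default arguments are typically emulated in Java with overloading; the method names here are made up:

// Emulating Python's default arguments with overloading:
// greet(name) delegates to greet(name, greeting) with a default value.
public class Greetings {
    static String greet(String name) {
        return greet(name, "Hello");          // default greeting
    }
    static String greet(String name, String greeting) {
        return greeting + ", " + name + "!";
    }
    public static void main(String[] args) {
        System.out.println(greet("World"));            // Hello, World!
        System.out.println(greet("World", "Namaste")); // Namaste, World!
    }
}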

In Java there is support for the 'switch' statement; it is a great replacement for huge if/else chains or the dictionary dispatch used in Python to declare and work with multiple alternate flows.
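For example, a switch over a String (supported since Java 7) covers what would typically be a dict lookup in Python; the mapping below is made up:

// A switch on a String (Java 7+) in place of a Python dict-based dispatch.
public class SwitchDemo {
    static int toNumber(String word) {
        switch (word) {
            case "one":   return 1;
            case "two":   return 2;
            case "three": return 3;
            default:      return -1;   // fall-back, like dict.get(key, -1)
        }
    }
    public static void main(String[] args) {
        System.out.println(toNumber("two"));   // 2
        System.out.println(toNumber("nine"));  // -1
    }
}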

-------------------------------------
Basic Data structures and usage
--------------------------------------

Some of the basic data types in Python are numbers, strings and collections. The collections are natively supported, which makes them a breeze to work with.
Java has a huge ecosystem for its collections design (interfaces, classes and abstract classes) and a separate set of classes to support primitives (long, int, double ...).

List
numbers = [1, 2, 3, 4, 5 ]
An ArrayList is usually the replacement for a mutable list:
List<Integer> numbers = new ArrayList<Integer>() {{ add(1); add(2); add(3); add(4); add(5); }};

Immutable List
tuple = (1, 2, 3, 4, 5 )
There is no direct equivalent of a tuple, but there are immutable list implementations like
Arrays.asList() or Collections.unmodifiableList(); mutation is stopped at runtime.
List<Integer> tuple = Arrays.asList(1, 2, 3, 4, 5);
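A short sketch of what "stopped at runtime" means: adding to such a list throws an UnsupportedOperationException.

import java.util.Arrays;
import java.util.List;

public class ImmutableListDemo {
    public static void main(String[] args) {
        List<Integer> tuple = Arrays.asList(1, 2, 3, 4, 5);
        try {
            tuple.add(6);   // fixed-size list backed by an array
        } catch (UnsupportedOperationException e) {
            System.out.println("mutation rejected at runtime: " + e);
        }
    }
}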

Map
Called a dictionary in Python:
numberToText = { 1:"one",  2:"two", 3:"three" }
HashMap is the equivalent of the dict, the jack of all trades:

Map<Integer, String> numberToText = new HashMap<Integer, String>() {{ put(1, "one"); put(2, "two"); put(3, "three"); }};

Similarly set() translates to HashSet<>()

"Java has huge of set of data structure classes in JDK and other external libraries dealing with concurrency and performance probably unmatched compared to any language including python"

But dealing with maps (dictionaries) could be the single feature most missed by Python developers. The dictionary is probably the most important and most used structure; Python makes it a breeze, but Java definitely makes it painful.
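Java 8 does close some of the gap with new default methods on Map; a small sketch (the keys and values are made up):

import java.util.HashMap;
import java.util.Map;

public class MapDemo {
    public static void main(String[] args) {
        Map<String, Integer> wordCounts = new HashMap<>();

        // Python: word_counts.get("java", 0)
        int count = wordCounts.getOrDefault("java", 0);

        // Python: word_counts["java"] = word_counts.get("java", 0) + 1
        wordCounts.merge("java", 1, Integer::sum);

        // Python: word_counts.setdefault("python", 10)
        wordCounts.computeIfAbsent("python", key -> 10);

        System.out.println(count);       // 0
        System.out.println(wordCounts);  // e.g. {java=1, python=10}
    }
}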

String
Like in Python, strings are immutable in java. In python String values can be wrapped in either single or double quotes. To differentiate between vanilla ASCII strings and Unicode strings, Python uses the u prefix to denote the latter.

Multiline strings will be greatly missed by Python programmers in Java; it is painful and ugly to define large strings in Java.
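The closest Java 8 gets is concatenation or String.join; a small sketch of both, in place of a Python triple-quoted string (the SQL text is just an example):

// Java has no triple-quoted strings; large text is usually built by
// concatenation or with String.join (Java 8).
public class MultilineDemo {
    public static void main(String[] args) {
        String concatenated = "SELECT id, name\n"
                            + "FROM users\n"
                            + "WHERE active = 1";

        String joined = String.join("\n",
                "SELECT id, name",
                "FROM users",
                "WHERE active = 1");

        System.out.println(concatenated.equals(joined));  // true
    }
}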

Here is a list of features that are missing in Java (or how they could be achieved in Java).

------------------------
comprehension (List ...)
------------------------

>>> numbers = [1,2,3,4,5]
>>> print numbers
[1, 2, 3, 4, 5]
>>> even_numbers = [n for n in numbers if n%2==0]
>>> print even_numbers
[2, 4]

We could divide the above into the following blocks:

|final_data| = [ |conversion|   |enumeration|   |predicates| ]
Java doesn't have anything like a native list comprehension, but with lambdas it comes close to a one-liner, although it is nowhere near as terse as its Python counterpart.
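Here is the Java 8 streams one-liner for the even-number comprehension above (imports included to make it runnable):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ComprehensionDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // Python: even_numbers = [n for n in numbers if n % 2 == 0]
        List<Integer> evenNumbers = numbers.stream()
                                           .filter(n -> n % 2 == 0)
                                           .collect(Collectors.toList());

        System.out.println(evenNumbers);  // [2, 4]
    }
}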





[TODOs]

--------------------------------------------------
Returning multiple values from a function
--------------------------------------------------

------------------------------------------------
sequential substitution of actual/formal parameters
-------------------------------------------------

-------------------------------------------
Defining functions inside a function
-------------------------------------------

-------------------------------------------------
@staticmethod, @classmethod, decorator
--------------------------------------------------

-----------------------
Meta programming
------------------------

-------------
Generators
-------------


References:
https://dzone.com/storage/assets/4018-rc193-010d-python_2.pdf
Oracle has comprehensive documentation available, better than most languages.
Java collections overview:



Sunday, October 12, 2014

Elasticsearch experimentation with Enron emails

ES (Elasticsearch) is a distributed, real-time search and analytics engine. I have been using this product for more than 18 months. It is the de-facto distributed search solution built on Lucene. It is fast and easy to develop with.
Features could be broadly classified as

  • Full text search, search by words with boolean operators (Ex: (java OR technology) AND (bangalore OR hyderabad))
  • Easy to build filters, facets, & aggregations
  • Fast response times with large data volume (usually measured in millis)
  • Ability to make geo-localized searches & configurable robust ranking system

It is an incredible tool for collecting analytics information & in my experience possibly the easiest to build with.

Once we have convinced ourselves that ES is the best option for a given use case, product managers/analysts usually come up with questions like "tell me the node/RAM/disk requirement for indexing (say 1TB of) data" and "what indexing strategy should we use?" (index schema, index sharding that could be time based or data based, etc.). Unfortunately there is no immediate correct answer; there are lots of data points & query characteristics that need to be collected before we can even attempt an answer, although the people asking these questions are usually uncomfortable waiting for lengthy answers.
Here are a few guideline questions to ask before trying to come up with an answer:
  • required indexing rate (acceptable/desirable limits)
  • total # of docs with payload characteristics
  • latency requirements for searches (well, for each type of search)
  • full text search requirements (highlighting, analyzers, internationalization)
  • indexing schema requirements (is there any join requirement with nested elements?)
  • indexing schema metadata requirements (especially for date fields & their precision requirements)
Although internet literature claims ES scales to insane numbers, it is all contextual (nature of data) and situational (nature of hardware). There is no option but to experiment with the real, required data, assert the claims and quantify the hardware/software requirements, as suggested by the creator of ES, Shay Banon.
There are too many indexing options and tuning parameters in ES. Exercises like the one below should help to find the holy-grail combination of ES tuning parameters for a given use case.

Here is an effort, or sample guidelines, that could be used to carry out such experiments. The usual approach of creating data with random values has the disadvantage that the data will have more or less unique content than is typical for the domain; it may not give the desired results &, worse, it could be misleading. Lucene is mainly used for full text search and stores the data in a reverse fashion, i.e. terms to docs. The nature of the terms plays a very important role in deciding the index size and query latency. In Lucene a term is stored once per field, and searches work by locating terms and finding the documents associated with them.

There is a set of free, large corpora of data available that could be used. Here I have picked the Enron email data, as it is pretty decent in size; given that we mostly restrict the index size to these numbers, it is easy to extrapolate, as ES indexes are sharded to scale horizontally. It is also nice to experiment with ES features on this publicly available mass collection of "real" emails that is not bound by privacy and legal restrictions.
Although I wrote all my framework code that deals with ES in Java (making use of the Spring Data Elasticsearch wrapper), I used Python for the current exercise. It is awesome for running deploy tool-chains (pants), data processing pipelines and other offline systems. It is now my de-facto scripting language, replacing Groovy. It is easy to share, as it is a concise and reader-friendly language.

Before the approach & the code, here are some highlights & observations about this data.

1. If we enable the source (which is required for the highlighting feature) while indexing, the storage cost increases by 76%.
http://localhost:9200/_cat/indices
enron-email_without_source 746.5mb  
enron-email 1.2gb
It was observed that there was no change in search latency when storing the source.

The general assumption that a Lucene index is 20-30% of the actual content size may not be correct, as it depends on the indexed fields. Even with better compression it is unlikely the final figure would change drastically here; it is ~50% here (considering the source is not enabled).


2. Payload size distribution. It is important to match the use case on how the real domain documents get pumped into ES.
Here again the ES aggregation feature comes in handy for finding the payload size distribution.
Payload range     Count
0KB to 1KB        124181
1KB to 5KB        237870
5KB to 10KB        42331
10KB to 1MB           303


The awesome ES aggregation used to extract this information is shown below. I introduced a size field to capture the size of each document & the above is the result. I guess this is the most killer feature in ES for building data analytics applications.
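A minimal sketch of such a range aggregation, posted over HTTP with plain JDK classes; the index name, the "size" field and the bucket boundaries are assumptions mirroring the table above.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class PayloadSizeAggregation {
    public static void main(String[] args) throws Exception {
        // Range aggregation on a numeric "size" field (bytes); index name is assumed.
        String body = "{"
                + "\"size\": 0,"
                + "\"aggs\": {\"payload_sizes\": {\"range\": {"
                + "  \"field\": \"size\","
                + "  \"ranges\": ["
                + "    {\"to\": 1024},"
                + "    {\"from\": 1024, \"to\": 5120},"
                + "    {\"from\": 5120, \"to\": 10240},"
                + "    {\"from\": 10240}"
                + "  ]}}}"
                + "}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:9200/enron-email/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner sc = new Scanner(conn.getInputStream(), "UTF-8")) {
            System.out.println(sc.useDelimiter("\\A").next());  // bucket doc counts
        }
    }
}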

3. In total there were 66043 unique 'to' addresses & 21021 unique 'from' addresses. The Enron mails were extracted from the email in-boxes (with complete folder structure) of 148 employees.



Transform the raw enron data into JSON and push into ES.
ES works with JSON & the raw data needs to be transformed before going into ES. Download the data & extract the content into a directory.
Here is the python script used (on a Windows machine). The script expects the parent directory where the Enron mails have been extracted, recursively parses the emails in EML format and indexes them into ES. It also creates a fresh index (deleting the previous one if it exists), making it easy to run repeated experiments.
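For comparison, here is a rough Java 8 sketch of the same approach, walking a directory of .eml files and POSTing one minimal JSON document per file to ES; the index/type names and the field layout are assumptions, and a real version should use a proper MIME parser and JSON library.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class EnronIndexer {
    // Index/type names are assumptions, for illustration only.
    private static final String ES_URL = "http://localhost:9200/enron-email/email/";

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);  // parent directory of the extracted mails
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .forEach(EnronIndexer::indexFile);
        }
    }

    private static void indexFile(Path file) {
        try {
            String raw = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            // Naive JSON escaping; use a JSON library (and a MIME parser) in real code.
            String escaped = raw.replace("\\", "\\\\").replace("\"", "\\\"")
                                .replace("\r", "").replace("\n", "\\n")
                                .replace("\t", "\\t");
            String doc = "{\"path\":\"" + file.getFileName() + "\","
                       + "\"size\":" + raw.length() + ","      // bytes (1 byte per char here)
                       + "\"body\":\"" + escaped + "\"}";

            HttpURLConnection conn = (HttpURLConnection) new URL(ES_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(doc.getBytes(StandardCharsets.UTF_8));
            }
            conn.getResponseCode();  // force the request; ignore the response body here
            conn.disconnect();
        } catch (IOException e) {
            System.err.println("Failed to index " + file + ": " + e.getMessage());
        }
    }
}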


Search and test the generated data.
Here is the python script that shows how we can do searches and use other powerful analytics features like aggregations.



Although I quashed the idea of using ES (storing the content along with the index) as primary storage rather than secondary storage at the beginning, I now think it is worth revisiting with the availability of the snapshot feature.

Hope this helps someone starting out with ES.

Sunday, July 13, 2014

About my current server side software stack

Here are my thoughts on working with NoSQL technologies over the past year and more, while building a platform in the archival/e-discovery space where the cost of data storage matters a lot. Clearly it is very difficult to claim to be an expert on every possible technology trend, especially in NoSQL, where technologies and use cases overlap in solving a problem domain & it is not easy to pick one over another. There is also a cost to maintaining multiple data sources (code base & relationships). Here I am documenting an overview of the technologies that we picked, and I plan to document the major issues and best parts separately for each part of the stack in the next set of write-ups, which can help developers make informed technical decisions, especially with my favorites, "Elasticsearch" and "Hazelcast", with the Spring Data wrapper libraries.

I guess the paradigm shift is "think about the queries (business) and horizontal scaling & work backwards", rather than the data-first approach (storage, integrity) where queries are made to work on the model. Data storage is driven by query usage patterns (functional requirements) and the cost of building and maintaining it. Latency and storage costs have moved from non-functional to basic functional requirements; distributed computing has become a basic necessity and not a luxury. In the NoSQL context data is still the king, but it is at the mercy of query access patterns and storage cost. At the end of the day most software is about storing & retrieving useful data in a cost-effective way. Performance/transactions should not be the reason to abandon an RDBMS; distribution of data (because of scale) and ease of development are the reasons. Shortening development cycles, lowering costs and enabling new ideas should be the goal for any technology professional & non-relational data stores increasingly fit the bill.

There is no single silver bullet for the data storage requirements of any serious project (or for anything, for that matter); it has to be polyglot persistence & the trend looks irreversible. The important point here is that each of these stores provides its scalability & performance by offering a limited data model, compromising on one of the CAP properties (Consistency, Availability, Partition tolerance). Although everyone claims to do almost anything (name a feature and the NoSQL evangelists come out with a post saying "Look ma! I can also do this"), it is important for us to make informed decisions after doing POCs, making sure these products are not stretching themselves too much to achieve the goal (latency, storage efficiency). It was key to pick the storage nodes & the technology behind them considering their sweet spots & after evaluating them against their counterparts.

Based on use cases we made following decisions.
Search by key - Document store (MongoDB): serves most of the RDBMS-style requirements. Ex: a blog
Search by term or value - Full text search (Elasticsearch): serves the ability to look for text and then the document. Ex: conversations, documents
Search and store by order - Column family (Cassandra): serves constant writes for data usually associated with time series. Ex: twitter
Search for large binary objects one at a time - Object store provided by cloud vendors, Ceph
Search/process for results in millis - In-memory data grid solution (Hazelcast)
Although all are scalable solutions, Cassandra stands out as the best for linear scalability, offering both read (for range-based queries) and write efficiency irrespective of data size, as a column family keeps columns that fit together & works in append-only mode, making writes faster than reads & independent of the current size of the data.
The future plan is to use graphs.
Search by relationship - Graph DB (Neo4J): users and their relationships. Ex: a social graph

Although there are numerous links suggesting selection criteria, I found the following interesting.
http://blog.nahurst.com/visual-guide-to-nosql-systems - explains, following CAP theory, what you will be giving up among the three (Consistency, Availability & Partition tolerance) with the most prominent solutions in this space.
http://highscalability.com/blog/2011/6/20/35-use-cases-for-choosing-your-next-nosql-database.html -  Documenting use cases
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - Easy to follow well documented introductions.

KAFKA 
A distributed pub-sub messaging system designed for throughput. It is maintained/used at LinkedIn & more importantly written in Scala, which is my favorite language these days. Kafka is log centric, ordered, immutable and sequence based (offsets), with replication/persistence by default. It concentrates on being a durable, scalable message storage system & does that pretty well.
As the author defines it, "Kafka is a system that supports long retention of ordered data" - mainly used as a pipeline pushing data to downstream systems in a re-playable way.

Buzzwords:
topic - category of data.
partition - log of records (data maintained for a duration: week, month ...)
log sequence number (offset) - acts as the state of the system.
consumer/producer - for pulling and pushing the messages (a small producer sketch follows below).
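Here is a minimal producer sketch using the newer Kafka Java client (org.apache.kafka.clients); the broker address, topic name and payload are assumptions, and this client post-dates the version we used, so treat it as illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EmailEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // broker address: assumption
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // topic "email-events" is a made-up example; each record is appended
            // to a partition log and identified by its offset
            producer.send(new ProducerRecord<>("email-events", "msg-1",
                    "{\"from\":\"a@example.com\",\"subject\":\"hello\"}"));
        }
    }
}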

Biggest success story:
LinkedIn claims to process 175 TB of in-flight data, with ~1.5 ms latency, at 7 million writes and 35 million reads/second. This is what I heard from the Kafka author himself when I recently listened to a webinar hosted by the O'Reilly publishers.

Uses:
Zookeeper

Issues:
No easy way to trap re-balancing events. The Scala code is difficult to work with, especially alongside existing Java code.

STORM

Storm is like a pipeline into which you push individual events that then get processed in a distributed fashion. It makes it easy to process massive streams of data in a scalable way, and provides mechanisms for doing things like guaranteeing that the data will be processed.

There is always some confusion about its relationship with Hadoop. The Storm creator clearly says they are complementary & comes up with the following rule of thumb for adoption:
"Any time you need to look at data historically (already saved), use batch processing (i.e. Hadoop), as it is cheap & scalable; whenever you need to look at the data once, as it comes in, in real time, use Storm for that."

The ability of Storm to run multiple copies of a bolt's/spout's code in parallel (depending on the parallelism setting) is the crux of the solution. It helps to scale to insane numbers by taking care of cluster communication, fault-tolerant fail-over, scalable & durable messaging with ordering, and distributing topologies across cluster nodes. Each worker can be assigned processes, memory & threads (tasks) depending on its requirements, making it ideal for the horizontal-scaling needs of any growing company.

Buzzwords:
Spout - a source of streams in a topology, making it the starting point.
Bolt - any piece of code, which does arbitrary processing on the incoming tuples.
tuple - the basic unit of data that gets emitted in each step.
topology - the layout of the communication, which is static and defined upfront; a deployable workflow that ties spouts and bolts together. It can also be described as a graph consisting of spouts (which produce tuples) and bolts (which transform tuples).
Nimbus - similar to the Hadoop job tracker, a service that tracks all the workers executing the bolts.
supervisor - a worker (a collection of executors).

Spouts are designed and intended to poll; we can't push to them. So we have Kafka to hold the messages, and the spout is a consumer that triggers the indexing process. A minimal topology sketch follows below.
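Here is a minimal topology sketch against the 0.9-era backtype.storm API, run in local mode; the spout emits made-up sentences rather than consuming from Kafka, and all names are illustrative.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SentenceTopology {

    // Spout: the source of the stream; Storm polls it via nextTuple().
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"storm processes streams", "kafka feeds storm"};
        private int index = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(sentences[index]));
            index = (index + 1) % sentences.length;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: arbitrary processing on incoming tuples; here it just splits words.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Topology: static wiring of spouts and bolts, with parallelism hints.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("splitter", new SplitBolt(), 2).shuffleGrouping("sentences");

        Config conf = new Config();
        new LocalCluster().submitTopology("demo", conf, builder.createTopology());
    }
}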

Built on:
LMAX Disruptor : intra-worker communication in Storm (inter-thread on the same Storm node)
ZeroMQ or Netty: Inter-worker communication (node-to-node across the network)
Zookeeper: cluster management of Storm  (automatic fail-over, automatic balancing of workers across the cluster)

Biggest success story:
Twitter claims to index four hundred million tweets within 50 milliseconds with the help of storm.

The issues:
Inter-topology communication: nothing built into Storm (we used Kafka). The UI isn't mature. Storm is not well suited to handling OOM kinds of errors. It is tough to manage state across the bolts (errors, progress).

Samza & Spark are also tries to solve similar kind of problems that storm is trying to solve. Here are some notes in comparison with spark which apparently is doing great & looks to be has the largest momentum.

Storm Vs Spark
Both frameworks are used to parallelize computations over massive amounts of data. With Storm, you move data to code. With Spark, you move code to data (data parallelism). Spark is based on the idea that, when the existing data volume is huge, it is cheaper to move the processing to the data (similar to Hadoop map/reduce, except that memory is used aggressively to avoid I/O, which makes it efficient for iterative algorithms).
Storm, however, is good at dynamically processing numerous generated/collected small data items (such as calculating an aggregation function or analytics in real time on a Twitter stream), also called task parallelism. Spark applies to a corpus of existing data (like Hadoop) which has been imported into the Spark cluster, provides fast scanning capabilities due to in-memory management, and minimizes the global number of I/Os for iterative algorithms.
Storm is like a pipeline into which you push individual events that then get processed in a distributed fashion.
Spark, instead, follows a model where events are collected and then processed at short time intervals (a few seconds) in a batch manner.
Spark's approach to fault tolerance is that instead of persisting or checkpointing intermediate results, Spark remembers the sequence of operations which led to a certain data set. So when a node fails, Spark reconstructs the data set based on that stored information.

LOGSTASH 
Provides a centralized logging server that listens for connections from Java application servers, accepting streams of logs when they connect & filtering, modifying, and routing those streams to the appropriate outputs. It is one of the cornerstone products offered by the Elasticsearch family.

ELASTICSEARCH

A distributed (through shards) and highly available (through replicas) search engine. It is document oriented and dynamic, and provides reliable, asynchronous write-behind for long-term persistency, with (near) real-time search support.

Buzzwords:
Shard : a single Lucene instance
Replica : a copy of the primary shard, for fail-over and performance
Index Type : is like a table

Index can be sharded with a configurable number of shards & each shard can have one or more replicas.

Used the native Java ES APIs instead of HTTP - the Spring Data wrapper.
Increased the refresh interval to a higher number (index.engine.robin.refresh_interval; default is 1 second) - see the settings-update sketch below.
Increased the indexing buffer size (indices.memory.index_buffer_size); it defaults to 10%.
Increased the number of dirty operations that trigger an automatic flush (so the translog won't get really big, even though it is FS based) by setting index.translog.flush_threshold (default is 5000).
Increased the memory allocated to the elasticsearch node (default is 1G).
Decreased the replica count (even to 0), and increased it later for HA and search performance. Replicas can be changed at runtime, but not the number of shards.
Increased the number of machines so that fewer shards get allocated per machine.
Used scan mode when searching over all results.
Planning to use SSDs for the translog & check the performance.
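Here is a minimal sketch of the refresh-interval tweak applied over HTTP; the index name and the 30s value are assumptions, and I am using the standard index.refresh_interval setting key, which varies across ES versions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TuneRefreshInterval {
    public static void main(String[] args) throws Exception {
        // Bulk-loading trick: raise (or disable) the refresh interval during indexing,
        // then set it back. Index name and value are assumptions.
        String body = "{\"index\": {\"refresh_interval\": \"30s\"}}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:9200/enron-email/_settings").openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}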

Lucene (the search engine behind Elasticsearch) requires these to be in RAM, so we should pay attention to "deleted docs", "norms", "terms index", "field cache" and "doc values".
In Lucene a term is stored once per field, and searches work by locating terms and finding documents. So initially the index size will decrease and will stabilize after most of the terms are indexed. So don't worry about the initial numbers; usually the index translates into 20-30% of the content.

CASSANDRA

A column-oriented way of modeling data that relies on eventual consistency.

Buzzwords:
Column: the basic element, a tuple composed of a timestamp, a column name and a column value. The timestamp is set by the client, and this has an architectural impact on clients' clock synchronization.
SuperColumn: a little more complex; you can imagine it as a column which can store a dynamic list of Columns.
ColumnFamily: a set of columns.
KeySpace: a set of ColumnFamilies.
A Row in the Cassandra sense is a list of Columns or SuperColumns identified by a row key.

Cassandra requires only N (fewer than the 3 replicas) write acknowledgements to be checked before it returns a value. Therefore a write can succeed even with failures in replica nodes, which raises availability. Histories of failed write operations are recorded on a different node, and the operation can be retried at a later date (this is called "hinted handoff"). Since the success of replicated writes is not guaranteed, data consistency is checked at the reading stage. Generally, if there are multiple replicas, they are collected into one result when reading. However, Cassandra keeps in mind that not all replicas may match: it reads the data from all three replicas, checks whether it is identical, and restores the latest data if it is not (this is called "read repair").

HAZELCAST
With compute nodes doing the processing, an in-memory data grid solution is great for building applications that require real-time latency.

MONGODB
is mmap’d linked lists of BSON documents with B-tree indexing.

OSGI - JBoss Fuse
OSGi can help in managing "ClassNotFoundExceptions" by ensuring that code dependencies are satisfied before allowing the code to execute. It also helps with running multiple versions of jars and with hot deployments with life-cycle management.
  • OSGi verifies that the set of dependencies is consistent with respect to required versions and other constraints.
  • Package applications as logically independent JAR files that can be deployed gracefully, without affecting or managing others.
  • Manage a new level of code visibility in a bundle (JAR): public classes can be made unavailable, which is not possible with Java access specifier semantics (private, protected, public, package), making deployment and upgrades easy.
  • Extensibility mechanism for plugins (Eclipse, NetBeans).
  • Helps to build a dynamic service model in your application, where services can be registered and discovered both declaratively and programmatically during execution.
  • Shell scripting for the JVM, to investigate issues.
  • The JVM footprint for special-purpose installs can be greatly reduced (with positive implications for disk, memory, and security).
  • Although the benefits look great, with other systems (containers) sharing the same jars (Storm, Hazelcast, etc.), I don't think it is worth spending effort on OSGi.
It is always better to streamline library versions and manage them in a consistent manner. It is a nightmare with OSGi containers & we ended up solving jar issues most of the time. I think OSGi is not worth the effort when we have different containers with shared jars.

References:
http://wiki.apache.org/cassandra/ArticlesAndPresentations
http://blogs.atlassian.com/2013/09/do-you-know-cassandra/


Thursday, January 17, 2013

Faceted Navigation for mobile applications and NoSQL


Recently I developed an Android-based application as part of a proof-of-concept initiative. It was a search application based on pre-defined static data that gets shipped along with the application. Using Lucene with Android was pretty simple and straightforward. Having worked in the information-retrieval space for the past 4 years, and having worked on one of the best-known Swing-based UI applications (sitebuilder.yahoo.com), it did not take long to come up with a decent working application. One of the challenges was to move data stored in a database into Lucene so as to make it full-text searchable and to build facets.
There were 2 points that were a revelation for me after this exercise: "the power of facet-based navigation in building user interfaces" and "using Lucene as a data store and a viable NoSQL solution".

Interfaces for mobile (or the web, for that matter) that require keyword text entry aren't suited for browsing. Facet-based navigation is pretty effective when the exact keyword is not known. Amazon is a pioneer in this space & their web site is a great example of faceted navigation; their cross-sell and up-sell features rely directly on faceting. Keyword entry is tedious; using hierarchical faceted metadata for content selection results in an awesome experience, especially for a mobile application. With the ever-increasing power and storage of mobile devices (or desktop clients, for that matter), we should use them effectively to give a better user experience. Let's say someone is looking for a job with keywords such as "java architect technologist": giving auto-completion with total results (like java(200), architect(10), location(20)...) improves the overall search experience. We should actively look into our existing web/mobile applications for places where faceting can be effectively utilized.

Lucene makes creating facets a breeze. The TaxonomyWriter & TaxonomyReader classes are well designed and help to build categories/dictionaries easily (see the sketch below). All existing applications can be refactored to make use of this feature rather than using GROUP BY SQL queries. It is also pretty straightforward to extract categories from full text rather than column-oriented data.
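Here is a minimal sketch of indexing and counting a facet with the taxonomy-based API; it is written against a newer Lucene facet API than the one I used at the time (roughly Lucene 5.x style), so constructors differ in older versions, and the field and dimension names are made up.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class FacetDemo {
    public static void main(String[] args) throws Exception {
        Directory indexDir = new RAMDirectory();
        Directory taxoDir = new RAMDirectory();
        FacetsConfig config = new FacetsConfig();

        // Index two documents, each tagged with a "location" facet dimension.
        IndexWriter writer = new IndexWriter(indexDir,
                new IndexWriterConfig(new StandardAnalyzer()));
        DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

        for (String city : new String[] {"bangalore", "hyderabad"}) {
            Document doc = new Document();
            doc.add(new TextField("title", "java architect", Store.YES));
            doc.add(new FacetField("location", city));
            writer.addDocument(config.build(taxoWriter, doc));
        }
        writer.close();
        taxoWriter.close();

        // Count facet values for the "location" dimension across all documents.
        DirectoryReader reader = DirectoryReader.open(indexDir);
        TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
        IndexSearcher searcher = new IndexSearcher(reader);

        FacetsCollector collector = new FacetsCollector();
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, collector);
        Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, collector);
        FacetResult result = facets.getTopChildren(10, "location");
        System.out.println(result);  // e.g. location: bangalore (1), hyderabad (1)
    }
}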
Developing an Android application is a tedious job. Since not all Java APIs are supported, it is a painful exercise to migrate, & it is also really time-consuming to debug an Android application with the simulator, although we can appreciate the effort from the Eclipse plugin team. I used Groovy swing builders to develop a quick mock-up before doing it with Android; I love JVM scripting languages for that. One of the main reasons to learn them, even if we may not use them in our daily jobs, is the ability to come up with quick working prototypes. Here is the script that I created in Groovy; it prints the violations in the code at each package level. It would be easier to build such things in GWT as well, so we can refactor all of this upfront.
The problem that got solved with the application was storing the data & querying any information using Lucene (both with the API and Lucene query syntax) in sub-second time. It can deal with huge data & is schema-less, the main characteristics of NoSQL. Although Lucene is not generally counted as a NoSQL solution, it actually is one.
When someone asks me about NoSQL, I can now safely say I have used one :) and am buzzword compliant. But this is a much bigger topic; recently I heard a session from Martin Fowler where he argued very well for the need for both forms of data storage, SQL and NoSQL, & used the term "Polyglot Persistence". The challenge for technologists is to identify the best contextually applicable use cases.
I started looking more into the subject & here are my notes. Although I am not in a position to strongly recommend one over the other, readers will definitely appreciate the value NoSQL brings to the table compared with SQL, and can make technically informed decisions.

Schemalessness is one of the main reasons for interest in NoSQL databases - but we can build schema-less designs in a relational database as well. There is a masterpiece discussion on AskTom where "Generic Data Models" with dynamic capabilities are regarded as an anti-pattern, defeating the strong typing and query performance of a relational database. I completely agree with that sentiment. It is in these types of cases that we should generally be looking for a non-SQL solution.

Generally, the data is significantly more important than the applications which use it - content is the king. The main reason is that data outlives the applications that create it; the applications that use the data usually change over time. The schema and strong typing are very important and in the long run will help the organization. We should be doubly cautious before deciding to put the data in non-SQL-based storage, & SQL is a safe option from a manageability and staffing perspective for utility projects.
After all, "misuse is the greatest cause of failure" & we need to understand the merits/demerits before dismissing either.

High performance from NoSQL is not free; it comes at a cost (high-performance look-ups at the price of no, or no easy, support for relational queries through "joins", and for reliability through "transactions", "consistency" and referential integrity) that is inherently addressed by SQL products. NoSQL solutions need to bend too much to match an RDBMS at its own game.
When data no longer fits on a single RDBMS server, or when a single machine can no longer handle the query load, some strategy for storing/sharding and replication is required. Doing this with an RDBMS (rows and columns with indexes) is not easy or viable from cost and scalability perspectives. The primary reason to integrate in-memory-based solutions with a NoSQL DB is to reduce the cost per GB of data. The need for systems that can run across multiple machines is going to be an absolute requirement & NoSQL shines there.

Here is quick overview of options that we have.

Key-Value Stores  
USP   - Caches/Simple domain with fast read access
Use case   - Massively concurrent systems
Examples  - Redis/Memcached/Tokyo Cabinet/Coherence

Column-Family Stores
USP - Write on a big scale for reading and writing
Use case  - Massively concurrent systems
Examples  - Cassandra/Google BigTable/Apache HBase

Document-Oriented Databases
USP - Contents are documents with changing schema
Use case  - Dynamic Data Structure
Examples  - MongoDB/CouchDB

Graph Databases
USP  -  interconnected data with nodes and relationship
Use case   - Social Graphs with many to many relationships/Recommendation engines/Access Control Lists
Examples   - Neo4J/AllegroGraph/OrientDB

Irrespective of the data source chosen, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability & Partition tolerance (CAP). The rules change with the data and the context, but the rule below remains the same for improving performance.

"Look for opportunity to turn scan into a seek (cache) and use event based concurrency (asynchronous and parallel computation)"
  • Read Cache with In memory data grid solutions - Coherence, Hazelcast, CDN
  • Make computations asynchronus & parallel wherever possible. - JMS, Threads (event based)
  • Move computation and not data. - Map Reduce 
  • Events in efficient machine parseable forms for effecient serialization - protobuff
  • Route new data through new partitions
  • Avoiding shared mutable state is the secret weapon to winning concurrency battles

I got interested in graph databases, especially Neo4J and the problem it tries to solve. The Spring Data library samples were interesting. I highly recommend trying out Spring Data; the SpringSource guys are awesome at creating abstractions that make developers' lives easier, as usual.


Here is an extract from the "Neo4j in Action" book that explains the use case and the technology behind it. A social graph is used as the sample: friends with many-to-many relationships. We can easily relate to the differentiation it brings.

"To find all friends at depth 5, MySQL will perform a Cartesian product on the 't_user_friend' table 5 times, resulting in 50,000^5 records, out of which all but 1,000 are discarded. Neo4j, on the other hand, will simply visit nodes in the database, and when there are no more nodes to visit, it will stop the traversal. That is why Neo4j can keep constant performance as long as the number of nodes returned remains the same, as opposed to the significant degradation in performance when using MySQL queries.
Each join creates a Cartesian product of all potential combinations of rows, and then filters out those that do not match the where clause. With one million users, the Cartesian product of 5 joins (the equivalent of a query at depth 5) contains a huge number of rows – billions of billions of billions – too many zeros to be readable. Filtering out all the records that don't match the query is too expensive – such that the SQL query at depth 5 never finishes in a reasonable time.
The secret is in the data structure – the localized nature of graphs makes this type of traversal very fast. Imagine yourself cheering your team at a football stadium. If someone asks you how many people are sitting five meters around you, you will get up and count them. If the stadium is half empty, you will count the people around you as fast as you can count. If the stadium is packed, you will still do it in a similar time! Yes, it may be slightly slower, but only because you have to count more people because of the higher density. We can say that, irrespective of how many people are in the stadium, you will be able to count the people around you at a predictable speed – as you're only interested in people near you, you won't be worried about packed seats at the other end of the stadium, for example.
This is exactly how the Neo4j engine works in our example – it counts nodes connected to the starting node, at a predictable speed. Even when the number of nodes in the whole graph increases (given similar node density), the performance remains predictably fast. If you apply the same football analogy to relational database queries, we would count all the people in the stadium and then remove those that are not around us – not the most efficient strategy given the interconnectivity of the data."



References:
Here is a pretty neat effort to support unit testing of NoSQL, and a great way to learn as well: https://github.com/lordofthejars/nosql-unit#readme
Identifying the use case
http://highscalability.com/blog/2011/6/20/35-use-cases-for-choosing-your-next-nosql-database.html



