Saturday, March 03, 2012

Introduction to Hadoop

My notes on Hadoop - I never got an opportunity to directly work with Hadoop but currently looking into in house POC application that analyzes logs, Hadoop remains attractive for solving huge data that was majorly required by big companies (& mainly social software companies that deal with big data) but likely to expand into other companies as the data emitted by the world is growing exponentially. As I see from job description and salary perspective hadoop engineer is at the top & rightly so.
Hadoop has passed the hype cycle & is stable now. Now Hadoop is with Plateau of Productivity.

My notes here on hadoop should make readers hadoop buzz words compliant & give better idea of over-all picture.

What is Hadoop?
A cost-effective, scalable, distributed and flexible big data processing framework written in java based on Map-Reduce algorithm. MapReduce is a algorithm proposed by Google that breaks complex tasks down into smaller elements which can be executed in parallel.
Hadoop at core consists of two parts, one a special file system called Hadoop Distributed File System (HDFS) and Map-Reduce Framework that defines two phases to process the data a map and reduce.

Why Hadoop?
Traditional storage with RDBMS is costly and unsuitable in follwing scenarios
  • Too much of data that too in unstructured format (100s of Terabytes, Peta bytes)
  • Large # of applications to use a single filesystem namespace in distributed execution space
  • In-expensive storage of large data, but can be still easily queried.
  • Big data with high volume & velocity
Relational databases are designed with good principles of data durability, isolation and independence but the design is centralized and tends to get disk IO bound for high write workloads. The use of locking at different levels (row, page, table level) for consistency/isolation and the need to flush transaction state to disk introduces scalability challenges requiring users to scale by deploying heavy machines (vertical scaling).

Many tech vendors like EMC,Y!, Twitter, IBM,Amazon,Oracle & even Microsoft has got Hadoop-oriented "big data" strategy. Hadoop has proven to work with many of the big companies & huge investment in hadoop and its related technology made by them makes it viable option for handling big data in years to come.


Hadoop Use Cases:
This wiki-page lists out 100s of companies and their usages alphabetically, 
Here is the list that I picked from there.
  • Log files processing, 
  • Machine learning
  • User experience optimization/Prediction of behaviours based on the patterns and build the recommender system for behavioral targeting with pattern discovery/analysis.
  • Billions of lines of GPS data to create TrafficSpeeds for accurate traffic speed forecast
(proximity operators, hub and spoke search terms, customized relevance ranking, stemming, white and black lists, Data mining, Analytics and machine learning) Data from various wide verity of sources Ex: sensors, cameras, feeds,streaming, logs, user interactions
Where Hadoop doesn't make sense
You cannot directly use Hadoop as a substitute for a database that takes query & returns result in milliseconds, so if there is requirement to have sub-second interactive reporting from your data, or using the data in multi-step, updates/insertions, complex transactions, an RDBMS solution may still be your best bet.
By design Hadoop is suited for batch index building,and is not proper for incremental index building and search. 
Hadoop eco system
There are software add on components developed on hadoop to make life simpler, VMWare recently announced that  Spring also will have support for Hadoop under their "Spring Data" umbrella.
  • HBase - Column-store database (of the order of terabytes) based on Google's big table
  • Pig         - Yahoo owned DSL for Data-flow or routing data
  • Hive - Facebook owned DSL for routing data but based on SQL
  • ZooKeeper - Distributed consensus engine with concurrent access
All the big companies (Twitter,Amazon, Yahoo, Facebook) have something to offer over the Hadoop which is a good thing.
Hadoop Core components:
At core hadoop can be grouped into following 
  • HDFS - Hadoop Distributed File System, is responsible for storing huge data on the cluster.
  • Hadoop Daemons -  A set of services offering to work with the data. 
  • Hadoop HDFS API -  APIs to communicate with the various nodes (services) from applications.
Here is brief write up on each of the components:
HDFS:
This is a distributed file system designed to run on commodity(low cost) hardware & highly fault tolerent.
It provides the high throughput with large data sets (files). HDFS supports write-once-read-many semantics on files
In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64Mb or 128Mb in size.
HDFS Vs NAS   
In HDFS Data Blocks are distributed across local drives of all machines in a cluster. Whereas in NAS data is stored on dedicated hardware.
HDFS is designed to work with MapReduce System, since computation are moved to data. NAS is not suitable for MapReduce since data is stored seperately from the computations.
HDFS runs on a cluster of machines and provides redundancy using a replication protocal. Whereas NAS is provided by a single machine therefore does not provide data redundancy.
Hadoop Daemon services or modules:
Hadoop is comprised of five separate daemons. Each of these daemon run in its own JVM.
      Master nodes:
    • NameNode - This daemon stores and maintains the metadata for HDFS.
    • Secondary NameNode - Performs housekeeping functions for the NameNode.
    • JobTracker - Manages MapReduce jobs, distributes individual tasks to machines running the Task Tracker.
      Slave nodes
    • DataNode     – Stores actual HDFS data blocks.
    • TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.
NameNode :
NameNode is heart of HDFS file system. It keeps the directory hierachy information of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself but just metadata. "NameNode" is a Single Point of Failure for the HDFS Cluster & makes all decisions regarding replication of blocks.Any hadoop user applications will have to talk to NameNode through Hadoop HDFS API to locate a file or to add/copy/move/delete a file.
Data Node :
A DataNode stores data in the Hadoop File System HDFS. DataNode instances can talk to each other, this is mostly during replicating data.
JobTracker :
A daemon service for submitting and tracking jobs(a processing unit) in Hadoop & is single point of failure for the Hadoop MapReduce service.
As per wiki:
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
Task Tracker:
Task Tracker   is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) & monitors these task instances(Task instances are the actual MapReduce jobs), from a JobTracker. 
Miscellaneous:  
MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation & hence can be zero as well.
Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers.
Speculative execution is a way of coping with individual Machine performance.
Some Criticisms:
"Hadoop brings a tons of data, but until you know what to ask about it, it’s pretty much garbage in, garbage out." - There are limited use cases for this especially for generic programmer to fully invest on this.
"Does querying huge data sets win over the advanced algorithms applied over limited data" - I am skeptical about querying huge data
"While most of Hadoop is built using Java, a larger and growing portion is being rewritten in C and C++" - I thought Google Map-Reduce must be better & converting some components to C++ is not a good sign for Java
"Configuration parameters are pretty huge" - that's a design smell I guess it shouldn't that complex.

Wednesday, February 08, 2012

Implementing equals and hash

All about equals() and hashCode()

Java does not provide direct support for associative arrays -- arrays that can take any object as an index. In Java we have Object class has two methods for making inferences about an object's identity: equals() and hashCode()
HashMap helps to lookup based on object as index, other hash-based data structures such as HashSet, LinkedHashSet, HashMap, Hashtable, WeakHashMap does the same thing.

There are two approaches to defining equality and hash value: identity-based, which is the default provided by Object, and state-based, which requires overriding both equals() and hashCode(). If an object's hash value can change when its state changes, be sure you don't allow its state to change while it is being used as a hash key.

There are some restrictions placed on the behavior of equals() and hashCode(), as explained in JavaDoc of Object class.
In particular, the equals() method must exhibit the following properties:
Symmetry: For two references, a and b, a.equals(b) if and only if b.equals(a)
Reflexivity: For all non-null references, a.equals(a)
Transitivity: If a.equals(b) and b.equals(c), then a.equals(c)
Consistency with hashCode(): Two equal objects must have the same hashCode() value

Here are some rules to follow:
    - if a class overrides equals, it must override hashCode
    - equals and hashCode must use the same set of fields
    - if two objects are equal, then their hashCode values must be equal as well
    - Consistency with the equals() contract is a fundamental requirement to   every implementation of equals() . Not only the hash-based collections rely on reflexivity, symmetry, and transitivity, but everybody who calls equals()  will expect exactly this behavior. Failure to comply to the equals()  contract leads to subtle bugs that are difficult to track down, because they are conceptual problems.
    - The value that's likely to change or unique should be first to compare (like Id etc...)
    - Make sure hashCode() of the key objects that you put into the collection never changes while the object is in the collection or make it immutable
    - if the object is immutable, then hashCode is a candidate for caching and  lazy initialization
Caching of hashCode value is useless with modren JVMs.The last one in the above proved to be wrong after my experiments, JVM seems to be intelligent enough to make it fast.

Morale of the story : If you think you can optimize some obvious stuff,don't do that & JVM might be already doing that.

Generally its assumed that hashCode provides a unique identifier for an object. But actually it does not.
According to The java.lang.Object documentation it should be perfectly ok to always return 0 for the hashCode(). The positive effect of implementing hashCode() to return unique numbers with the use prime number for unique objects, is that it might increase performance. The downside is that the behavior of hashCode() must be consistent with equals(). For object a and b, if a.equals(b) is true, than a.hashCode() == b.hashCode() must be true. But if a.equals(b) returns false, a.hashCode() == b.hashCode() may still be true. Implementing hashCode() as 'return 0' meets these criteria, but it will be extremely inefficient in Hash based collection such as a HashSet or HashMap.

Recently we had one problem with the old code, where in the equals was utilizing the hashCode() method in the equals(), but since some of the string values were getting the same hashCode(), the hashmap was giving unexpected results.
assert 2627 == "RU".hashCode()
assert 2627 == "S6".hashCode()
So its always better to be careful about this as it can manifest itself after a long time

I was looking into code generated by 3 dominant IDEs in java, I thought its interesting to share some of the information. Its safest way to use IDEs to generate code from code review & consistency purpose. I don't see anything wrong with generated code here.


For me IntelliJ is doing better here (anyway nothing wrong with others).
- provides a option to mark certain fields non-null and hence null check can be avoided which I think is a good thing to do.
- makes use of instanceof others uses getClass(), All classes in the hierarchy either allow slice comparison and use instanceof or they disallow it and use getClass() . The use of 'instanceof'    if need be, it can match any supertype, and not just one class &  it renders an explict check for "that == null" redundant, since it does the check for null already - "null instanceof [type]" always returns false. (Effective Java)
- code looks compact compared to others 

There are excellent helper classes EqualsBuilder and HashCodeBuilder from the Apache Commons Langlibrary. can make life simpler, but I think code generated from IDEs are good enough


Reference:
Effective Java - Joshua Bloch chapter about equals() and hashCode()
Java theory and practice: Hashing it out

Friday, September 30, 2011

Backward and Forward compatibility with Java serialization protocol:-

Java offers a nice protocol for storing and communicating the persisted information. In a distributed deployment environment the server code has to support both the new and old information when its anticipated the control over the deployment of different versions of the servers are not available. The code has to deal with both backward as well as forward compatibility. i.e. VERSION N should be able to read both VERSION N-1 & VERSION N+1 object.

Maintaining the backward compatibility is easy. i.e. Version N code can read Version N-1 code by maintaining the version attribute. The version attribute in the read() API will make sure that the reading of new information is skipped. This is quite straight forward to achieve through “java.io.Externalizable” interface provided java where we have the full control of the context information on what we need to save. But it becomes tricky when we try to read the information created by future versions. The problem with forward compatibility is that the Java de-serialization it has to know what the new stuff that has been added & also should know where the information has been added.

The solution explained below tries to solve the problem in a pure java with less overhead & suitable for systems that have already established using Java serialization with  ”java.io.Externalizable”  interface.

Here the top down approach is followed for the explanation, it starts with how the usage of API should look like VERSION 0,1 & 2 then show how that can be achieved through code.
The challenge here is to read VERSION-1 (N) data from VERSION-0 (N-1) class. The current solution targets supporting only N+1 version & not the future versions which is the only important part of 24 X 7 support systems in the event of release back out.
Let's take an example of class "Test" having 2 attribute in VERSION-0, VERSION-1 introduces a new attribute called "versionData1" & VERSION-2 introduces the one more new attribute "versionData2". To simulate the case the data is inserted in between the data. In real-time scenario we will have lots of classes with readExternal()/writeExternal() methods spread across all over the code, but usually controlled with single mother object.
SKIP_START & SKIP_END are the marker interfaces used to identify the new information and ObjectInputWrapper is the new class used to move the cursor to skip the details when we read new information from the old class.
"Test" is Externalizable class having 3 attributes. We will include a new attribute in the second version and see how a "Test" object with VERSION -0 (N) reads VERSION -1 (N-1) object.

This plugin architecture is quite scalable. ObjectInputWrapper can be injected with different serialization protocols like Google proto buffer, Oracle Coherence's POF, etc... using a using factory pattern like ObjectInputOutputProviderFactory that provides different implementation including custom logic to efficiently store & retrieve the information

  VERSION = 0;

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
       out.writeInt(VERSION);
       out.writeInt(number);
       out.writeObject(name);
    }
    @Override
    public void readExternal(ObjectInput _in) throws IOException, ClassNotFoundException {
        ObjectInputWrapper in = new ObjectInputWrapper(_in);
        VERSION = in.readInt();
        number = in.readInt();
        name = (String) in.readObject();
    }
"Test{VERSION=0, number=10, name=praveen}"
   
VERSION = 1

This class has new String attribute called versionData1 with the value "versionData1"
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
       out.writeInt(VERSION);
       out.writeInt(number);
       // Added for version1 in between
       out.writeObject(new SKIP_START());
       out.writeObject(version1Data);
       out.writeObject(new SKIP_END());
       out.writeObject(name);
    }
A marker interface class has been included to identify the new information that the older version can safely ignore.

    @Override
    public void readExternal(ObjectInput _in) throws IOException, ClassNotFoundException {
        ObjectInputWrapper in = new ObjectInputWrapper(_in);
        VERSION = in.readInt();
        number = in.readInt();
        if(VERSION>=1) {
             in.readObject(); // we can safely ignore skip signals
            version1Data = (String)in.readObject();
            in.readObject();
        }
        name = (String) in.readObject();
    }

"Test{VERSION=1, number=10, versionData1=versionData1, name=praveen}"

 VERSION = 2

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
       out.writeInt(VERSION);
       out.writeInt(number);
       out.writeObject(version1Data);
       // Added for version2 data in between
       out.writeObject(new SKIP_START());
       out.writeObject(version2Data);
       out.writeObject(new SKIP_END());
       out.writeObject(name);
    }
    @Override
    public void readExternal(ObjectInput _in) throws IOException, ClassNotFoundException {
        ObjectInputWrapper in = new ObjectInputWrapper(_in);
        VERSION = in.readInt();
        number = in.readInt();
        if(VERSION>=1) {
           if(VERSION==1){
                  in.readObject();
           }
           version1Data = (String)in.readObject();
           if(VERSION==1)
                   in.readObject();
           }
        }
        if(VERSION>=2) {
            in.readObject();
            version2Data = (String)in.readObject();
            in.readObject();
        }
          name = (String) in.readObject();
    }
"Test{VERSION=2, number=10, versionData1=versionData1, versionData2=versionData2, name=praveen}"
The skip data version can safely be ignored in future versions. It will be removed in write first and also from read on subsequent versions.

The 2 key classes used to implement are explained here.


The ObjectInputWrapper class will have read method implementations for each basic data type. The sample below shows just two methods reading and readObject for illustration. For real implementations, we need to create final ObjectInputReaderTemplate classes for each basic data type and reuse them in the read methods of ObjectInputWrapper  & they can be cached to avoid lots of object creations. The profiling of  this code showed that these light weight are pretty cheap to handle, JVM will make sure that there is overhead in terms of memory or the cpu by in lining them automatically.




This class makes use of generics and anonymous inner classes to move the cursor to skip the new information that version N was un-aware while reading.




 One sample code having all the classes published in github

Saturday, August 20, 2011


Caching Exceptions in server applications -
In web applications it’s common that  response page of certain types can be cached for given http request, similar approach could be used to cache the exceptions in the server side applications as well. If the reason for any exception known upfront that it will be applicable for given amount of time, there is a opportunity to cache these to reduce the load on resources. This could be huge gain in cases where a particular piece of code is doing heavy operations like cpu intensive tasks, making remote calls, Database/File manipulation etc..
Here is a solution making use of Java generics.
For Ex:
public static int mockHeavyOperation(String someData) throws FileNotFoundException,Exception{
        Thread.sleep(3000);//simulating some heavy operation
        if(true) throw new FileNotFoundException();
        return 10;
    }
Can be converted into
public static int mockHeavyOperationWithCacheExceptionHandler(String someData) throws FileNotFoundException{
        return new ExceptionCacheTemplate<Integer,FileNotFoundException>(){
                @Override
                public Integer handle() throws Exception {
                    Thread.sleep(3000); //simulating some heavy operation
                        if(true) throw new FileNotFoundException("Checked Exception");
                        return 10;
                    }
           }.runIn(new ExceptionCacheTemplate.ExceptionKey(someData,40000));
    }
As we can see from above code extra 4 lines are doing the trick.
If we run the first operation 10 times it takes 30000 millis where as 2nd example takes only 3000 millis, that's a huge gain I suppose.
1. The input information that resulted in exception can be stored as part of Exception Key class. It can be any class including String as shown above as long as it implements equals/hashCode(). runIn method works directly with String as well. If any Exception re-appears within the interval 40000 as shown above the Exception will be thrown from cache.
2. Only the exceptions that are included as part of the  generic exceptions,
3. A custom map can be provided to manage the expiring of cached exceptions, Default implementation will be managing through the value provided while creating cache key.
"public Map getCache()" can be overridden to provide different implementation
For example Google's Guava library provides awesome fluent interface to manage expiry of keys.
Map<Key,Graph> graphs = new MapMaker()
       .concurrencyLevel(4)
       .weakKeys()
       .maximumSize(10000)
       .expireAfterWrite(10, TimeUnit.MINUTES)
       .makeComputingMap(
           new Function() {
             public Graph apply(Key key) {
               return createExpensiveGraph(key);
             }
           });

It can be handled at the framework level making it completely transparent to the applications.
ExceptionTamplate is available in github

Sunday, July 17, 2011

Notes on TDD and Unit testing in general.
Most of the benefits of TDD are unquantifiable, Its tough to correlate better design, better ability to refactor the code to TDD practice. It requires a real practice to experience and appreciate the productivity benefits. When we are working with legacy code and is not easily testable, advocating TDD there is counter productive and refactoring such code to suit TDD is a hard sell & taking up such task is usually a thankless exercise.For green field projects TDD should be applied without a second thought, to my mind its not debatable anymore.

Here are some notes that I collected can be used to sell TDD during the discussions. For me TDD as effective tool of "COMMUNICATION" is the single most important point that should to enough to employ TDD. The lesser defects as result of clear executable communication should make all the stake holders namely Developers, Managers, Testers, Clients and finally the real users of software happy.
Tests prove that your code actually works resulting in fewer bugs, if we catch all the scenarios. Testing before the code is a design activity.These tests can’t replace system and acceptance testing,but they do supplement it & also fewer bugs that make it to QA.
We can improve the design without breaking it. Having unit tests in place, we can do powerful refactorings that can untangle the most challenging of system psychoses.Refactoring not only becomes cheaper, it changes the developer mindset to strive for better quality. When refactoring becomes cheaper, the quality of software continuously improves.Fixing PMD and FindBugs errors should not require management approval.
Unit tests are a way to make programmers have documentation as they hate to write a MSWord document & is a project management technique. So when we can’t remember how to use a class APIs, read the unit tests to find out.MS Word documents are not meant to be compiled and deployed, if we can avoid as much as possible, its good for everyone.Unit tests reduces the communication pain points. Its a shared language.You know what your code needs to do. Then you make it do it. Even if you don’t have a working system, you can see your code actually run and actually work. You get that great “I’ve done it!” feeling. Developers can avoid nagging questions "Have you written it", "Have you tested it" & finally "Have you really tested it". If tests are written and published, developer can always go & check from web tool.
Just try test-first if you want to be high on endorphins, proud about your work, and motivated to do more.They demonstrate concrete progress. You don’t have to wait a month for all the pieces of the system to come together. You can show progress even without a working system. Not only can you say you've written the code, you can actually demonstrate success. Of course, this is another distinction that traditional programming teaches us to ignore. 
“Done” doesn’t mean you’ve written the code and checked it in.“Done” means the code actually runs in the system without bugs. Running unit tests is a step closer to the latter.Unit tests are a form of sample code. We all encounter library functions and classes we don’t know how to use and one of the first places we go is the sample code. Sample code is documentation. But we don’t usually have samples for internal code. So we’re left shifting through the source or through the rest of the system. 

Test-first forces you to plan before you code. Writing the test first forces you to think through your design and what it must accomplish before you write the code which results in better code. This not only keeps you focused, it makes for better designs.
Test-first reduces the cost of bugs. Bugs detected earlier are easier to fix. Bugs detected later are usually the result of many changes, and we don’t know which one caused the bug. So first we have to hunt for and find the bug. Then we have to refresh our memories on how the code is supposed to work, because we haven’t seen it for months. Then finally we understand enough to propose a solution. 
Anything that reduces the time between when we code the bug and when we detect it seems like a obvious win. We consider ourselves lucky to find out about bugs within a few days, before the code is shipped to QA or to customers. But how about catching them within a few minutes? That’s what test-first accomplishes with the bugs it catches.It’s even better than code inspections. Code inspections, they say, are better than testing, because using them to detect and fix bugs is cheaper than testing. After the code ships, it’s much more expensive to fix the bugs. The earlier we can detect and fix bugs, the easier and cheaper and better. That’s the advantage of having code reviews.Code inspections catch more bugs within a few days, rather than a few months, but It virtually eliminates coder’s block. Ever wonder what statement to write next? Like writer’s block, coder’s block can be a real problem. But test-first systematizes the structured part of coding, allowing you to concentrate on the creative part. You may get stuck on how to test the next bit or how to make the test pass, but you’ll never be left puzzling over where to go next. In fact, usually you’re left with the opposite problem: You know you need to take a break before you burn out, but you’re on a roll and don’t want to stop.Failed tests make better designs. Testing a piece of code forces you to define what that code is responsible for. If you can do this easily, that means the code’s responsibility is well-defined and therefore that it has high cohesion. And if you can unit-test your code, that means you can bind it as easily to the test as to the rest of the system. Therefore, it has loose coupling to the pieces around it.High cohesion and loose coupling is the definition of good, maintainable design. Code that is easy to unit-test is also easy to maintain.It’s faster than writing code without tests! Or to put it another way, skipping unit tests is faster, unless you actually need the code to work. 
Most of the effort we spend on code, we spend fixing it after we’ve checked it in to the source-code repository. But test-first eliminates much of that waste by allowing us to get more of it right to start with and by making bugs easier to fix.
One of the real value proposition of unit tests is that we get a low-level regression-test suite,we can go back at any time and see not only what broke but where the bug is. Arguably with many frameworks around it’s a low-effort way to catch bugs before the build goes off to QA. Whenever a bug comes it will make life lot easier without the need of using a debugger many a times.Test-first catches some bugs within a few minutes instead of a few days. It is even cheaper than code inspections, code reviews. 



Bookmark and Share