Saturday, December 29, 2012

Java.next() -> Scala or Groovy?

The future is already here—it’s just not evenly distributed. —William Gibson

Java.next() is going to be Java followed by Groovy and Scala. The dominance of the Java language is likely to continue, but there are more and more use cases that compel us to use other JVM languages (with their functional programming capabilities) for developing newer platforms. Java 7's new features like improved type inference, multi-catch, the try-with-resources statement, strings in switch and the new Garbage-First collector can still fight with the newer languages, and Java 8 is likely to close the gap further with its support for closures. Hence Java.next() will be multiple languages led by Java, an approach popularly known as polyglot programming.
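To make the Java 7 point concrete, here is a small, hypothetical snippet (the file name is made up) showing the diamond operator, multi-catch, try-with-resources and strings in switch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Java7Features {

    public static void main(String[] args) {
        // Diamond operator: the compiler infers the generic type on the right-hand side
        List<String> lines = new ArrayList<>();

        // try-with-resources: the reader is closed automatically, even on exceptions
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException | RuntimeException e) {
            // Multi-catch: one handler for several unrelated exception types
            System.err.println("Failed to read file: " + e.getMessage());
        }

        // Strings in switch
        String env = args.length > 0 ? args[0] : "dev";
        switch (env) {
            case "prod":
                System.out.println("Production settings");
                break;
            case "dev":
            default:
                System.out.println("Development settings");
                break;
        }
    }
}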

Project teams will have heterogeneous levels of competence (great, good, average and bad developers), technical superiority is not always the deciding factor, and the cost of maintenance always exceeds the cost of construction. Java has proved to be the best choice even though it may not be technically superior, and there is no reason to believe that this will change dramatically in the coming days.
I am personally sold on the productivity benefits of JVM languages (Scala/Groovy/JRuby) and their functional programming abilities. Having developed several internal projects during the day job and a few personal projects in my free time, given a chance I always prefer to code in a JVM language rather than in Java.

The feeling that Java is getting old and it is time to look for new ideas is getting stronger day by day. It is an irony of history that .NET started as truly multi-language on a single platform while Java claimed one language running on multiple platforms; the current trend is that lots of JVM languages are gaining mind-share while .NET is still heavily tilted towards C#.
The question of what java.next() is seems strange in itself, because everything that has come to replace Java as a language compiles to bytecode and runs on the rock-solid JVM. I have been following JVM languages for quite some time. I picked Groovy and Scala as contenders, only because I can code in these languages and their following seems to be higher than that of any other JVM language.
I also picked the parameters that suit a Java developer's thinking.
1. Can I get my things done faster and hence spend less time in the office? (Here "my things" are general programming requirements as I perceive them.)
2. How easily can I pick them up, and can I convince both technically literate and illiterate stakeholders?
3. How easily can I get jobs, and will it offer a better salary compared to Java?

Thought leaders' opinions about the languages
Generally the understanding is that Scala has the ability to disrupt the status quo rather than incrementally improving it like Groovy. Scala has done, and is doing, a good job of reducing the cost of abstraction: you can transliterate Java into Scala and end up with bytecode that is almost exactly the same. Scala has some original good ideas and a well thought-out type system.
It does look like the Java programming intellectuals are moving towards Scala.
Here are some examples.
- The Spring creator (Rod Johnson) endorses Typesafe, and his dislike for Groovy is quite evident, as I heard in a podcast
- The Java creator (James Gosling) prefers Scala over other JVM languages
- The original Groovy creator (James Strachan) says Scala is the better option in the long run to replace Java, and so does the JRuby creator (Charles Oliver Nutter)
But I think there may be some vested interest in some of these folks taking an extreme stand, so we need to take them with a pinch of salt.

Winner : Scala

Job Trends
"Where the rubber meets the road" - The moment of truth.
Making more money with enjoyable effort (I don't say least effort) is the general end goal with technologies. Job opportunities are the best barometer: they are the real reason for anyone to invest in a technology, and probably the single most important parameter for believing the hype around it.
Since web development is the dominant kind of work, Grails' popularity and active SpringSource support are driving the results.

Winner: Groovy

Learning curve
Compared to Groovy, Scala has been harder for me to learn (still learning, of course). The ability to write scary programs in Scala is a big NO for enterprise development; we are always going to have a mixture of great, good, average and bad developers.
Although Martin Odersky (the author of Scala) has tried hard, and brilliantly, to prove otherwise, from my experience I say it is simply harder for Java developers. In Groovy I can close my eyes and type Java code whenever I get stuck on some Groovy concept while learning; that can't happen with Scala.

Winner : Groovy

Multi-core programming
For the past 30 years, computer performance has been driven by Moore's Law (the number of transistors on integrated circuits doubles approximately every two years). We don't get the same increase in clock frequency as we used to; instead, we get more cores. CPUs aren't getting faster anymore, because it is physically difficult to make them so. Programming languages that handle this better are likely to succeed more in the coming days.
Scala supports more multi-core-friendly language constructs and libraries than the others (immutability and functional programming are baked into the language), although there is nothing inherently magical about Scala (which runs on the JVM anyway) compared to Java. GPars from Groovy is also fair. Scala folks say that if we stick our heads in the sand we will be left behind; this is the argument Scala evangelists most like to promote. I do think there is huge merit in it, seeing what libraries like Akka are able to achieve, although the famous TestNG creator thinks otherwise.
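Java can of course do multi-core work too, just with more ceremony and with immutability left as a convention rather than enforced by the language. As a rough illustration (not tied to Akka or GPars), here is a plain Java 7 fork/join sketch that sums an array in parallel:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a slice of an array, splitting the work until slices are small enough.
public class ParallelSum extends RecursiveTask<Long> {

    private static final int THRESHOLD = 10_000;
    private final long[] values;   // treated as immutable by convention only
    private final int from;
    private final int to;

    public ParallelSum(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += values[i];
            }
            return sum;
        }
        int mid = (from + to) / 2;
        ParallelSum left = new ParallelSum(values, from, mid);
        ParallelSum right = new ParallelSum(values, mid, to);
        left.fork();                              // run the left half asynchronously
        return right.compute() + left.join();     // compute the right half, then wait for the left
    }

    public static void main(String[] args) {
        long[] values = new long[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        long sum = new ForkJoinPool().invoke(new ParallelSum(values, 0, values.length));
        System.out.println("sum = " + sum);
    }
}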

Winner: Scala

Web Applications
Grails and Play framework are the showpieces of Groovy and Scala respectively. They beat the other Java web frameworks at page-oriented MVC architecture hands down. The productivity gains are too big to overlook, and the approach is both new and intelligent.
More and more work is getting done in the web browser, mainly through JavaScript, and this trend is likely to continue, with lots of great libraries pushing the limits of what JavaScript can do (jQuery, AngularJS, Backbone...).
Both frameworks are way ahead of the others in supporting new JavaScript libraries, WebSockets and so on. JSON/XML rendering from controllers feels like innate behaviour in Play and Grails, unlike the others, and both support MVC and stateless architectures much better than their counterparts.
I think the scope for component-oriented single-page web applications is pretty huge. GWT is still the best option for Java programmers, though relying on code generators is not always good. CoffeeScript (a better way to write JavaScript) also looks nice.

Coming to performance, most web applications are I/O bound, so the practical performance impact of Groovy may be largely offset by the (developer) productivity gains it offers. The Play framework folks thought otherwise and moved from Groovy-based templates to Scala ones, but I still feel that for most applications the first assertion stands, and Groovy offers awesome productivity benefits.
The plugin ecosystem of Grails is miles ahead of Play's, which is definitely a big plus.

Winner : Groovy


Mobile platform
It is difficult to imagine any new application that runs only on the web. Although web apps will work on mobile too, with new frameworks offering fluent mobile-centric CSS/script support, I believe HTML5's inherent advantages in distribution, monetization, platform power and network effects aren't enough to match the rich user experience of native mobile apps.
I firmly believe that both will have their own space in the days to come and we will have to live with building two kinds of applications. The Java Android SDK is the viable option for Java programmers (although there are some options to go the other way).

Winner : Java

Writing DSLs
"Say What You Mean, Mean What You Say" - bridging the gap between what the user and a programming language. Ability to defining Domain Specific Language is very important feature. There are lots of use cases where it can be used to simplify most of the tasks. (new rules, Data entry, API usage, Testing, build systems, deployment etc...). Writing fluent API and XML are definitely verbose of way of defining new DSL. Resorting into using ANTLR, JavaCC is also overkill in most of the cases with huge learning curve/maintenance overhead that they bring in.
Ability to modify the existing class at runtime (meta programming) and language ability for operator overloading flexible syntax plays very important role.All functions can be used infix (obj.method(arg) or obj method arg) makes the writing of DSL a joy. I created fluent interface library myself and a great fan of functional libraries like Guava,  and functional java. But we are bending too much to use functional programming with java and isn't natural.
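To show what that bending looks like, here is the kind of fluent-interface DSL one ends up hand-rolling in plain Java (the Rule and RuleBuilder names are hypothetical, purely for illustration); it reads reasonably well, but only because every keyword is simulated with a method call:

// A hypothetical rule DSL built as a fluent interface in plain Java.
public class Rule {

    private final String field;
    private final String operator;
    private final Object value;

    private Rule(String field, String operator, Object value) {
        this.field = field;
        this.operator = operator;
        this.value = value;
    }

    public static RuleBuilder when(String field) {
        return new RuleBuilder(field);
    }

    public static class RuleBuilder {
        private final String field;

        private RuleBuilder(String field) {
            this.field = field;
        }

        public Rule isGreaterThan(Object value) {
            return new Rule(field, ">", value);
        }

        public Rule equalTo(Object value) {
            return new Rule(field, "=", value);
        }
    }

    @Override
    public String toString() {
        return field + " " + operator + " " + value;
    }

    public static void main(String[] args) {
        // Reads almost like a sentence, but only because every word is a method
        Rule rule = Rule.when("orderTotal").isGreaterThan(1000);
        System.out.println(rule);   // orderTotal > 1000
    }
}

In Groovy or Scala the same rule can be expressed with optional parentheses, operator overloading or infix calls, without all the builder scaffolding.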

Winner : Groovy

Both Scala and Groovy win hands down over Java.

Database operations
Most development involves a large chunk of code that pulls data from various data sources (mainly relational) and updates it. JPA (and its originator, Hibernate) tried to solve the impedance mismatch between object-oriented code and relational data. After evaluating some of the Scala frameworks I have come to the conclusion that JPA patched the issue rather than solving it. Functional programming is the best way to deal with row-oriented data, and success lies not in hiding the SQL details from the developer but in making them easy and transparent without losing type safety.
It's a thumbs up for Slick and several other Scala frameworks that make JDBC operations simpler from the Scala world.
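For a sense of what these libraries are shrinking, here is roughly the plain-JDBC boilerplate they replace (the table, columns and JDBC URL below are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class OrderDao {

    // Fetch customer names with orders above a threshold; every row must be mapped by hand.
    public List<String> customersAbove(double minTotal) throws SQLException {
        String sql = "SELECT customer_name FROM orders WHERE total > ?";
        List<String> names = new ArrayList<>();
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setDouble(1, minTotal);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("customer_name"));
                }
            }
        }
        return names;
    }
}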

Winner : Scala

Big Data
In general the appetite for insight into the details is increasing. Data is flowing in at ever larger velocity, and it is becoming difficult to assimilate it into insight. When we deal with larger data we also need software that monitors and heals itself. These are pretty big challenges.
We need software that exploits the hardware and the new multi-core CPUs better, and here the functional programming paradigm and an immutability-friendly programming language play a very important role.

Winner : Scala


Here are some threads I found interesting that argue against Scala. You can also see spirited pushback from Scala enthusiasts in the responses.
Switching back to Java:
Scala feels like EJB-2
Yammer moves back to java from scala

Saturday, March 03, 2012

Introduction to Hadoop

My notes on Hadoop - I never got an opportunity to work directly with Hadoop, but I am currently looking into an in-house POC application that analyzes logs. Hadoop remains attractive for crunching the huge data sets that were mainly the concern of big companies (especially social software companies dealing with big data), but it is likely to spread into other companies as the data emitted by the world grows exponentially. From a job description and salary perspective, the Hadoop engineer is at the top, and rightly so.
Hadoop has passed through the hype cycle and is stable now; it has reached the Plateau of Productivity.

My notes here should make readers Hadoop-buzzword compliant and give a better idea of the overall picture.

What is Hadoop?
A cost-effective, scalable, distributed and flexible big-data processing framework written in Java, based on MapReduce. MapReduce is a programming model proposed by Google that breaks complex tasks down into smaller elements which can be executed in parallel.
Hadoop at its core consists of two parts: a special file system called the Hadoop Distributed File System (HDFS), and a MapReduce framework that defines two phases for processing the data, a map and a reduce.
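The canonical word-count example gives a feel for the two phases. This is a minimal sketch against the org.apache.hadoop.mapreduce API; class and variable names are my own:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum all the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}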

Why Hadoop?
Traditional storage with an RDBMS is costly and unsuitable in the following scenarios:
  • Too much data, much of it unstructured (hundreds of terabytes, petabytes)
  • A large number of applications sharing a single filesystem namespace in a distributed execution space
  • Inexpensive storage of large data that can still be queried easily
  • Big data with high volume & velocity
Relational databases are designed with good principles of data durability, isolation and independence but the design is centralized and tends to get disk IO bound for high write workloads. The use of locking at different levels (row, page, table level) for consistency/isolation and the need to flush transaction state to disk introduces scalability challenges requiring users to scale by deploying heavy machines (vertical scaling).

Many tech vendors like EMC, Yahoo!, Twitter, IBM, Amazon, Oracle and even Microsoft have a Hadoop-oriented "big data" strategy. Hadoop has proven itself at many of the big companies, and the huge investment they have made in Hadoop and its related technologies makes it a viable option for handling big data in the years to come.


Hadoop Use Cases:
This wiki page lists hundreds of companies and their usages alphabetically.
Here is the list that I picked from there.
  • Log file processing
  • Machine learning
  • User experience optimization and behaviour prediction based on patterns, building recommender systems for behavioural targeting with pattern discovery/analysis
  • Billions of lines of GPS data to create TrafficSpeeds for accurate traffic speed forecasts
  • Proximity operators, hub-and-spoke search terms, customized relevance ranking, stemming, white and black lists, data mining, analytics and machine learning on data from a wide variety of sources (e.g. sensors, cameras, feeds, streaming, logs, user interactions)
Where Hadoop doesn't make sense
You cannot use Hadoop as a substitute for a database that takes a query and returns a result in milliseconds. If you need sub-second interactive reporting on your data, or you need multi-step updates/insertions and complex transactions, an RDBMS solution may still be your best bet.
By design Hadoop is suited for batch index building, and it is not well suited for incremental index building and search.
Hadoop ecosystem
There are add-on software components built on top of Hadoop to make life simpler; VMware recently announced that Spring will also support Hadoop under the "Spring Data" umbrella.
  • HBase - Column-oriented database (on the order of terabytes) based on Google's Bigtable
  • Pig - Yahoo-originated DSL for data-flow style processing
  • Hive - Facebook-originated SQL-like DSL for querying data
  • ZooKeeper - Distributed coordination and consensus service with concurrent access
All the big companies (Twitter, Amazon, Yahoo, Facebook) have something to offer on top of Hadoop, which is a good thing.
Hadoop Core components:
At its core Hadoop can be grouped into the following:
  • HDFS - Hadoop Distributed File System, responsible for storing huge data on the cluster.
  • Hadoop Daemons - A set of services that work with the data.
  • Hadoop HDFS API - APIs that applications use to communicate with the various nodes (services).
Here is a brief write-up on each of the components:
HDFS:
This is a distributed file system designed to run on commodity (low-cost) hardware and to be highly fault tolerant.
It provides high-throughput access to large data sets (files). HDFS supports write-once-read-many semantics on files.
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size.
HDFS vs NAS
In HDFS, data blocks are distributed across the local drives of all machines in the cluster, whereas in NAS data is stored on dedicated hardware.
HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce, because the data is stored separately from the computation.
HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is served by a single machine and therefore does not provide data redundancy.
Hadoop Daemon services or modules:
Hadoop is comprised of five separate daemons, each of which runs in its own JVM.
      Master nodes:
    • NameNode - This daemon stores and maintains the metadata for HDFS.
    • Secondary NameNode - Performs housekeeping functions for the NameNode.
    • JobTracker - Manages MapReduce jobs, distributes individual tasks to machines running the Task Tracker.
      Slave nodes:
    • DataNode     – Stores actual HDFS data blocks.
    • TaskTracker - Responsible for instantiating and monitoring individual Map and Reduce tasks.
NameNode:
The NameNode is the heart of the HDFS file system. It keeps the directory hierarchy information for all files in the file system and tracks where across the cluster the file data is kept; it does not store the data of those files itself, just the metadata. The NameNode is a single point of failure for the HDFS cluster and makes all decisions regarding replication of blocks. Any Hadoop user application has to talk to the NameNode through the Hadoop HDFS API to locate a file or to add/copy/move/delete a file.
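As a rough sketch of what that looks like from application code (the NameNode address and file path below are made up), the client asks the NameNode where the blocks live and then streams the data from the DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address below is hypothetical
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // The NameNode resolves the path to block locations; block data is read from DataNodes
        try (FSDataInputStream in = fs.open(new Path("/logs/access.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}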
DataNode:
A DataNode stores data in the Hadoop file system (HDFS). DataNode instances can talk to each other, which happens mostly when replicating data.
JobTracker:
A daemon service for submitting and tracking jobs (units of processing) in Hadoop; it is a single point of failure for the Hadoop MapReduce service.
As per wiki:
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
TaskTracker:
The TaskTracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker and monitors the task instances (the actual MapReduce work) that it runs.
Miscellaneous:  
The MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation, and a job can even have zero reducers.
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper node, which helps reduce the amount of data that needs to be transferred across to the reducers.
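For instance, a combiner is wired into the job driver alongside the mapper and reducer. A minimal sketch, reusing the hypothetical WordCount classes from the earlier example (input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The combiner pre-aggregates counts on each mapper node, cutting shuffle traffic
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}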
Speculative execution is a way of coping with slow individual machines.
Some Criticisms:
"Hadoop brings a tons of data, but until you know what to ask about it, it’s pretty much garbage in, garbage out." - There are limited use cases for this especially for generic programmer to fully invest on this.
"Does querying huge data sets win over the advanced algorithms applied over limited data" - I am skeptical about querying huge data
"While most of Hadoop is built using Java, a larger and growing portion is being rewritten in C and C++" - I thought Google Map-Reduce must be better & converting some components to C++ is not a good sign for Java
"Configuration parameters are pretty huge" - that's a design smell I guess it shouldn't that complex.

Wednesday, February 08, 2012

Implementing equals and hash

All about equals() and hashCode()

Java does not provide direct support for associative arrays -- arrays that can take any object as an index. Instead, the Object class has two methods for making inferences about an object's identity: equals() and hashCode().
HashMap lets you look up values with an object as the key; other hash-based data structures such as HashSet, LinkedHashSet, Hashtable and WeakHashMap work the same way.

There are two approaches to defining equality and hash value: identity-based, which is the default provided by Object, and state-based, which requires overriding both equals() and hashCode(). If an object's hash value can change when its state changes, be sure you don't allow its state to change while it is being used as a hash key.

There are some restrictions placed on the behavior of equals() and hashCode(), as explained in JavaDoc of Object class.
In particular, the equals() method must exhibit the following properties:
Symmetry: For two references, a and b, a.equals(b) if and only if b.equals(a)
Reflexivity: For all non-null references, a.equals(a)
Transitivity: If a.equals(b) and b.equals(c), then a.equals(c)
Consistency with hashCode(): Two equal objects must have the same hashCode() value

Here are some rules to follow (a sketch that puts them together appears below):
    - if a class overrides equals, it must override hashCode
    - equals and hashCode must use the same set of fields
    - if two objects are equal, then their hashCode values must be equal as well
    - Consistency with the equals() contract is a fundamental requirement of every implementation of equals(). Not only do the hash-based collections rely on reflexivity, symmetry and transitivity, but everybody who calls equals() will expect exactly this behavior. Failure to comply with the equals() contract leads to subtle bugs that are difficult to track down, because they are conceptual problems.
    - The field most likely to differ, or the one that is unique (like an id), should be compared first
    - Make sure the hashCode() of key objects you put into a collection never changes while the object is in the collection, or make it immutable
    - if the object is immutable, then hashCode is a candidate for caching and lazy initialization
Caching the hashCode value turned out to be useless with modern JVMs: the last rule above proved unnecessary in my experiments, as the JVM seems intelligent enough to make it fast anyway.

Moral of the story: if you think you can optimize some obvious stuff, don't; the JVM might already be doing it.
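Putting the rules above together, here is a minimal sketch for a hypothetical Employee class with an id and a name:

import java.util.Objects;

public final class Employee {

    private final long id;      // most discriminating field, compared first
    private final String name;

    public Employee(long id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;               // reflexivity, and a cheap fast path
        }
        if (!(obj instanceof Employee)) {
            return false;              // also covers obj == null
        }
        Employee other = (Employee) obj;
        return id == other.id && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        // Uses exactly the same fields as equals(), so equal objects hash alike
        return Objects.hash(id, name);
    }
}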

Generally it is assumed that hashCode provides a unique identifier for an object, but it actually does not.
According to the java.lang.Object documentation it is perfectly OK to always return 0 from hashCode(). The positive effect of implementing hashCode() to return well-distributed numbers (typically using a prime) for distinct objects is that it may increase performance. The constraint is that the behavior of hashCode() must be consistent with equals(): for objects a and b, if a.equals(b) is true, then a.hashCode() == b.hashCode() must be true; but if a.equals(b) returns false, a.hashCode() == b.hashCode() may still be true. Implementing hashCode() as 'return 0' meets these criteria, but it will be extremely inefficient in hash-based collections such as HashSet or HashMap.

Recently we had a problem with some old code where equals() was relying on the hashCode() method, but since some of the string values produced the same hashCode(), the HashMap was giving unexpected results.
assert 2627 == "RU".hashCode()
assert 2627 == "S6".hashCode()
So it is always better to be careful about this, as it can manifest itself only after a long time.
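A minimal reconstruction of that kind of bug (the Code class is hypothetical): because the String hash codes of "RU" and "S6" collide, an equals() built on hashCode() treats distinct keys as equal and the map silently overwrites an entry.

import java.util.HashMap;
import java.util.Map;

public class HashCodeEqualsBug {

    // A key class whose equals() wrongly relies on hashCode() of its value
    static final class Code {
        private final String value;

        Code(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof Code)) {
                return false;
            }
            // BUG: equal hash codes do not imply equal values ("RU" and "S6" collide)
            return value.hashCode() == ((Code) obj).value.hashCode();
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }
    }

    public static void main(String[] args) {
        Map<Code, String> countries = new HashMap<>();
        countries.put(new Code("RU"), "Russia");
        countries.put(new Code("S6"), "Something else");   // silently replaces the "RU" entry
        System.out.println(countries.size());              // prints 1, not 2
    }
}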

I was looking into the code generated by the three dominant Java IDEs and thought it would be interesting to share some of the findings. Using an IDE to generate this code is the safest way, from a code review and consistency perspective, and I don't see anything wrong with the generated code here.


For me IntelliJ is doing better here (though there is nothing wrong with the others).
- It provides an option to mark certain fields non-null, so the null check can be avoided, which I think is a good thing.
- It uses instanceof where the others use getClass(). All classes in a hierarchy should either allow slice comparison and use instanceof, or disallow it and use getClass(). The advantage of instanceof is that, if need be, it can match any supertype and not just one class, and it renders an explicit "that == null" check redundant, since "null instanceof [type]" always returns false. (Effective Java)
- The generated code looks more compact than the others'.

There are excellent helper classes, EqualsBuilder and HashCodeBuilder, in the Apache Commons Lang library that can make life simpler, but I think the code generated by IDEs is good enough.
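For reference, a sketch of the Commons Lang style, reusing the hypothetical Employee fields from the earlier example (the package name assumes Commons Lang 2.x; in Lang 3 it is org.apache.commons.lang3.builder):

import org.apache.commons.lang.builder.EqualsBuilder;
import org.apache.commons.lang.builder.HashCodeBuilder;

public final class Employee {

    private final long id;
    private final String name;

    public Employee(long id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof Employee)) {
            return false;
        }
        Employee other = (Employee) obj;
        return new EqualsBuilder()
                .append(id, other.id)
                .append(name, other.name)
                .isEquals();
    }

    @Override
    public int hashCode() {
        // The two seeds are arbitrary non-zero odd numbers
        return new HashCodeBuilder(17, 37)
                .append(id)
                .append(name)
                .toHashCode();
    }
}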


Reference:
Effective Java - Joshua Bloch chapter about equals() and hashCode()
Java theory and practice: Hashing it out