Thursday, June 03, 2010

Using Groovy to search for a name for my newborn son.
I had an unsorted list of the names of the god Vishnu from which I wanted to pick a name, so I used Groovy to sort them alphabetically.
Groovy Code in Java style - Click to see the code
Groovy with DSL MarkUpBuilder -Click to see the code
Unsorted 1000 Names
Sorted 1000 Names with groovy
Groovy is indeed Java++: it adds nicer syntax, closures and fluent interfaces, making coding a joy without much noise.
Here is the result from the Groovy program: the number of names starting with each letter.

a (113), b (54), c (20), d (57), e (6), f (0), g (25), h (13), i (3), j (22), k (58), l (10), m (64), n (33), o (4), p (81), q (0), r (14), s (243), t (19), u (8), v (119), w (0), x (0), y (22), z (0)
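For comparison, the sort-and-group step can be sketched in plain Java. This is only an illustration, not the Groovy program from the post: the class name NameSorter, the method groupByInitial and the five sample names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NameSorter {

    // Groups names by their lower-cased first letter. The TreeMap keeps the
    // letters in alphabetical order; each group is sorted case-insensitively.
    static Map<Character, List<String>> groupByInitial(List<String> names) {
        Map<Character, List<String>> groups = new TreeMap<>();
        for (String name : names) {
            if (name == null || name.trim().isEmpty()) {
                continue; // skip blank entries
            }
            String trimmed = name.trim();
            char initial = Character.toLowerCase(trimmed.charAt(0));
            groups.computeIfAbsent(initial, k -> new ArrayList<>()).add(trimmed);
        }
        for (List<String> group : groups.values()) {
            group.sort(String.CASE_INSENSITIVE_ORDER);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the real list had about 1000 names.
        List<String> names = Arrays.asList("Vishnu", "Achyuta", "Keshava", "Ananta", "Vamana");
        for (Map.Entry<Character, List<String>> e : groupByInitial(names).entrySet()) {
            // Prints the letter, the count and the sorted names of that group.
            System.out.println(e.getKey() + " (" + e.getValue().size() + ") " + e.getValue());
        }
    }
}
```

The closure-based Groovy version is shorter, which is the point of the post; the Java version mainly shows what the program has to do.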

Sunday, September 20, 2009

Google App Engine & Cloud computing: Making the world flat for developers everywhere.
The playing field for developers all over the world has been levelled by the advent of cloud computing & web services. It's pretty much free for any developer to develop & deploy an application over the Internet; there is no start-up cost. Scaling up can happen incrementally, based on business requirements, without upfront investment, & I think that's revolutionary.
Here are my thoughts on Google App Engine, after deploying my first sample application.

Salient features of Google App Engine
  • Google started supporting Java on App Engine in April 2009
  • Support for Google ID and sign-in
  • Automatic persistence with JDO or JPA (a standards-based approach that implements standard Java APIs on top of App Engine where possible, so instead of using the underlying App Engine datastore API, developers can program against Java Data Objects or the Java Persistence API)
  • Local development, remote deployment
  • Scalability and free use - pretty cheap
  • Monitoring - nice overview
  • Eclipse plug-in - I did not face any problems while deploying.
  • Limited support for JDK classes - when I tried converting my Swing application to a GWT one, I found that many of the classes I used, such as java.util.Timer, were not supported
  • Native threads can't be spawned
Google launched App Engine with support for Python in April 2008. This was Google's first entry into on-demand application development and deployment. Developers were able to build and easily deploy apps using Python, and I think it has been successful in that. The economic impact is that cloud computing disrupts the data center world by slashing the capital and skills required to deploy a web application. I am betting on the success of this model.
About the current application: Java Typing Tutor
I created this sample application within an hour with GWT & published it. Initially I tried porting one of my Swing applications, "Java Typing Tutor", which I created long back after reading Steve Yegge's blog post "Programming's Dirtiest Little Secret". BTW, I don't believe that a fast typist is necessarily a better programmer, but it was an interesting write-up. Whenever I get time I do intend to bring this application on par with my feature-rich Swing application.
GWT - Although I feel very comfortable coding with this framework (maybe because I spent a lot of time coding Swing apps), its API limitations are quite annoying (java.util.Timer doesn't work, etc.). Initially I had the impression that converting a Swing application to a GWT application is straightforward & I even had the ambition to write a converter (paint() for GWT widgets), but I have now realized that it's not possible. The programming style & approach are quite different. I once used the wingS framework to convert one of my Swing applications to the web; it was straightforward because of its deep integration with Swing, but apparently that project has since died. Developing a web application definitely requires a different mindset than developing a Swing application.
Finally, some happy notes for Swing programmers:
Wicket & GWT are great saviours for struggling Swing developers: with Swing GUIs losing their relevance, these two frameworks provide a great pathway into web development.
Overall I feel GWT & Wicket are great frameworks for Swing developers to develop web applications in terms of productivity. Rails/Grails (or any request/response framework) guys can never match component developers in terms of productivity :-)

The advantage of GWT clients (or any fat client) is that they can tap into the resources of the client PC they are running on (such as memory to store state) and thus scale much better for large numbers of users. GWT produces something more akin to applets: independent applications that happen to be hosted on a web page.
Wicket, on the other hand, still assumes that you want to build at least part of your application the 'old fashioned' way, so that it will work without JavaScript, turn up in search engines, can be bookmarked and so on.
So component developers can satisfy both worlds.


Sunday, September 13, 2009

Elements of Java Coding Styles

Java Coding Styles: Here are some of the tricks which I have found useful as a Java developer & have been using a lot. Again, these are personal preferences; take what you like & discard what you don't. At the end of the day the software profession is all about getting "good working software, quickly and at low cost" that can be sustained for a long time & solves the business problem at hand (with the YAGNI caveat). I want to keep updating this whenever I see interesting ones.

Prefer smaller methods
They look beautiful & are easy to understand, reuse, change & test.

'As I did 20 years ago, I still fervently believe that the only way to make software secure, reliable, and fast is to make it small.' - Andrew Tanenbaum

Dr. Venkat gives a wonderful explanation of how to convince your fellow developers to write short methods.

Use JUnit test cases, assert statements and method/variable names as documentation; JavaDoc comments are often obsolete and useless.

The purpose of the assert statement is to give you a way to catch program errors early. Using assertions to state things you know (or think you know) about your program can improve readability & they serve well as comments. They are intended to be cheap to write; just drop them into your code any time you think of them. They are a great communication tool for a programmer & can get rid of the dumb English stuff in the code with // & /** */.

"There is nothing as useless as doing efficiently, which should not be done at all." - Peter Drucker.

I hate to hear someone telling me to add Javadoc comments, especially for private methods. Method and variable names should be expressive enough to reveal the intent. Fluent interfaces & design patterns can help a lot in naming variables, methods & classes. Over the years my variable & function names have become more verbose. It's so much easier to understand code from years ago when it reads like a sentence.

A big caveat with respect to assertions:

Assertions should be used to state things that should be true at various points in your program, as documentation, & definitely NOT for error checking or any active code, because assertions are disabled ("turned off") by default. Assertions were introduced in Java 1.4 & can be turned on with the -enableassertions (or -ea) flag on the java command line.
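A small sketch of that split, under the assumption of a hypothetical averagePercent method: argument checking is active code and uses an exception, while the assert only documents what we believe to be true and disappears unless the JVM is started with -ea.

```java
public class AssertDemo {

    // Hypothetical example: returns the average of a non-empty array of percentages (0..100).
    static double averagePercent(int[] percents) {
        // Error checking must work even when assertions are off, so it throws an exception.
        if (percents == null || percents.length == 0) {
            throw new IllegalArgumentException("percents must be non-empty");
        }
        long sum = 0;
        for (int p : percents) {
            sum += p;
        }
        double average = (double) sum / percents.length;
        // Documentation-style assertion: only evaluated when the JVM runs with -ea.
        assert average >= 0 && average <= 100 : "average out of range: " + average;
        return average;
    }

    public static void main(String[] args) {
        System.out.println(averagePercent(new int[] {50, 100}));
    }
}
```

Running it as java -ea AssertDemo turns the assertion on; without the flag the assert line costs nothing and checks nothing.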

& for JavaDoc,

Code that isn't fully documented is unfinished and potentially useless. Javadoc comments on methods and classes should normally indicate what the method or class

It used to be necessary to document the types of the keys and values in a Map, as well as the Map's purpose, but with Java 5 generics the types no longer need a comment. Prior to generics I always used to write List /*String*/ names = new ArrayList();

Some Java 5 features are smart; look for newer libraries that exploit these features well

Use basic infrastructure libraries like Google Collections (I prefer this over Apache Commons Collections), SLF4J, Guice etc. as much as possible. We learn to appreciate the value of the new Java 5 features by looking into the source code of these libraries. Old coding styles have to be abandoned in favour of the new features.
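As a minimal illustration of types replacing comments (the class, method and sample values here are hypothetical), the declaration below says on its own that the map goes from a department name to a list of employee names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenericsDemo {

    // The generic types document keys and values; the variable name documents the purpose.
    static Map<String, List<String>> employeesByDepartment() {
        Map<String, List<String>> byDepartment = new HashMap<String, List<String>>();
        List<String> engineers = new ArrayList<String>();
        engineers.add("Suresh");
        byDepartment.put("engineering", engineers);
        return byDepartment;
    }

    public static void main(String[] args) {
        // No casts are needed when reading the values back.
        for (String name : employeesByDepartment().get("engineering")) {
            System.out.println(name);
        }
    }
}
```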

Commenting the code with "if(false)"

Many times we want to swap in a different implementation for testing or comment out code temporarily. If the method is very long, it's painful to comment out the whole block. Let us say we have a method like this.

public void sendMail() {
    // Temporary short-circuit: nothing below this line runs.
    if (true) return;

    // Lot of code for adding the event to the database
    ...

    // Transforming text with XSL etc.
    ...
}

The line if (true) return; has the same effect as commenting out the whole body :) A bare return; would not compile here, because the compiler rejects the code after it as unreachable, whereas the if (true) form is accepted.

Use double brace initialization (for non-performance-intensive code); if possible, use Google Collections in all cases.

As a Swing programmer I have been using this for quite a long time, & I have seen quite a number of experienced Java programmers who aren't aware of it. It is really useful while testing. It's concise, requires less typing & is more readable, but it comes at a cost: every use creates an extra anonymous class. Google Collections provides the same convenience without that cost.

Map<Integer, String> map = new HashMap<Integer, String>() {{
    put(1, "one");
    put(2, "two");
    put(3, "three");
    put(4, "four");
    put(5, "five");
}};

Use protected & final where ever possible

Using final does not make much sense from a performance point of view, as modern JVMs are intelligent enough to optimize without it, but I still feel it has a lot of value in making the intent clear & in keeping multi-threading simple with immutable state.

Usage of "private" is over-rated & I think it's better to make fields protected, which makes setting up objects simpler.

Person p = new Person() {{
    age = 30;
    name = "Suresh";
}};

where age & name are protected variables.

The more you use inner classes, the more frustrated you will get with Java's lack of "closures", & you will eventually move to Groovy or Scala (like me :-))

Use Null Object Pattern & throw IllegalArgumentException wherever applicable

The Null Object provides intelligent do-nothing behaviour that helps to avoid problems with null references (NPEs), the so-called "billion dollar mistake".
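A minimal sketch of both ideas together, assuming a hypothetical Logger interface and MailService class: callers who want no logging pass a NullLogger, and an actual null is rejected up front with IllegalArgumentException instead of surfacing later as an NPE.

```java
public class NullObjectDemo {

    interface Logger {
        void log(String message);
    }

    // Real implementation.
    static final class ConsoleLogger implements Logger {
        public void log(String message) {
            System.out.println(message);
        }
    }

    // Null Object: same interface, intentionally does nothing.
    static final class NullLogger implements Logger {
        public void log(String message) {
            // do nothing
        }
    }

    static final class MailService {
        private final Logger logger;

        MailService(Logger logger) {
            if (logger == null) {
                // Fail fast at the boundary rather than with an NPE somewhere deep inside.
                throw new IllegalArgumentException("logger must not be null; pass a NullLogger instead");
            }
            this.logger = logger;
        }

        // Pretends to send mail and returns the number of recipients handled.
        int send(String... recipients) {
            for (String recipient : recipients) {
                logger.log("sending to " + recipient); // no null check needed here
            }
            return recipients.length;
        }
    }

    public static void main(String[] args) {
        new MailService(new ConsoleLogger()).send("a@example.com");
        new MailService(new NullLogger()).send("b@example.com"); // silent
    }
}
```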

Prefer Unchecked Exception over Checked Exception

90% of the time an unchecked exception makes more sense than a checked exception. This is one of the most significant attributes of successful libraries like Spring & Hibernate. Use checked exceptions for recoverable errors (business exceptions) & unchecked ones for the rest.
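A sketch of that rule of thumb, using a hypothetical withdraw method: running out of funds is a business condition the caller can recover from, so it is checked; a negative amount is a programming error, so it is unchecked.

```java
public class ExceptionStyleDemo {

    // Checked: a recoverable business condition the caller is expected to handle.
    static class InsufficientFundsException extends Exception {
        InsufficientFundsException(String message) {
            super(message);
        }
    }

    static int withdraw(int balance, int amount) throws InsufficientFundsException {
        if (amount < 0) {
            // Unchecked: a bug in the calling code, nothing sensible to recover from.
            throw new IllegalArgumentException("amount must not be negative");
        }
        if (amount > balance) {
            throw new InsufficientFundsException("balance " + balance + " is less than " + amount);
        }
        return balance - amount;
    }

    public static void main(String[] args) {
        try {
            System.out.println(withdraw(100, 30));
            withdraw(10, 30);
        } catch (InsufficientFundsException e) {
            // The compiler forces the caller to deal with the business case.
            System.out.println("Declined: " + e.getMessage());
        }
    }
}
```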

Understand Soft, Weak and Phantom references in Java

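A WeakReference is a reference that is not strong enough to prevent the garbage collector (GC) from reclaiming the object it points to. It is useful for caches and canonicalizing maps (see WeakHashMap).

A SoftReference behaves like a weak reference, except that the GC clears softly reachable objects only in response to memory demand, and always before it throws an OutOfMemoryError. That makes soft references suitable for memory-sensitive caches.

Phantom references are enqueued on their ReferenceQueue once the GC determines that the object is only phantom reachable, and their get() method always returns null to prevent resurrecting the object. So phantom references are good for finding out when an object is on its way out of memory & for doing cleanup afterwards.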
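The part of this behaviour that does not depend on GC timing can be shown in a few lines. The sketch below (class and method names are made up for the example) only observes the three reference types while the object is still strongly reachable; when the GC actually clears or enqueues them is not deterministic, so that is deliberately left out.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class ReferenceDemo {

    // Returns {nothingEnqueuedYet, phantomHidesReferent, softSeesReferent, weakSeesReferent}.
    static boolean[] observe(Object referent) {
        ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
        WeakReference<Object> weak = new WeakReference<Object>(referent);
        SoftReference<Object> soft = new SoftReference<Object>(referent);
        PhantomReference<Object> phantom = new PhantomReference<Object>(referent, queue);

        // The object is still strongly reachable, so the phantom reference is not enqueued.
        boolean nothingEnqueuedYet = queue.poll() == null;
        // get() on a phantom reference returns null even while the object is alive.
        boolean phantomHidesReferent = phantom.get() == null;
        // Soft and weak references still hand out the object while it is strongly reachable.
        boolean softSeesReferent = soft.get() == referent;
        boolean weakSeesReferent = weak.get() == referent;
        return new boolean[] {nothingEnqueuedYet, phantomHidesReferent, softSeesReferent, weakSeesReferent};
    }

    public static void main(String[] args) {
        Object cached = new Object(); // strong reference held by the caller
        for (boolean observation : observe(cached)) {
            System.out.println(observation);
        }
    }
}
```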

__________________

Interesting quotes from Rock star programmers that I collected.

"Write lots of code. Have fun with it!" — Joshua Bloch

"Learn to use your tools. And I don't mean just enough to get by. I mean really learn how to use your tools." — Tor Norbye

"Don't use line numbers. Don't put your entire application in one method." — Chet Haase

"Don't be overwhelmed by the language or the platform." — Raghavan Srinivas (In the same line Neal Ford says "When you were hired by your current employer, you may think it's because of your winning personality, your dazzling smile, or your encyclopedic knowledge of Java. But it's not. You were hired for your ability to sit and concentrate for long periods of time to solve problems")

"Millions of people have been employed because someone at Sun Microsystems invented Java." - Masood Mortazavi. Master it & you will never regret for that

"There will always be opportunities for great engineers, but as I said earlier, I think the number of these opportunities will shrink as other, less technical personnel play larger roles in the software-development process, using more productive, higher-level tools and frameworks than we have used in the past." - Ben Galbraith

"Google makes finding information easier than ever, but nothing beats interacting with an expert."- Ben Galbraith. So always associate with people from whom you think you can learn.

From Pragmatic Thinking & Learning

There is no expertise without experience. It takes something on the order of ten years/10,000 hours of practice to be expert in a field. Deliberate, thoughtful practice is what makes the difference, not just going through the motions. Practice doesn't make perfect, but it does make permanent: neuroplasticity will cause your brain to re-wire itself according to what you do. You may not become what you dream, or what you aspire to be, but you will become what you do. Unfortunately there is no substitute for hard work.

References:

http://java.sun.com/developer/technicalArticles/Interviews/studentdevs/index.html?intcmp=2225

http://www.agiledeveloper.com/blog/PermaLink.aspx?guid=8a745e85-2a34-4d9c-8c25-ca371530e281 - How to convince your fellow developer to write short methods?

http://weblog.raganwald.com/2007/04/rails-style-creators-in-java-or-how-i.html - Rails style initializers

http://www.refactory.org/s/double_brace_initialisation/view/latest

http://c2.com/cgi/wiki?DoubleBraceInitialization

http://bwinterberg.blogspot.com/2009/09/introduction-to-google-collections.html

http://thestrangeloop.com/sessions/ghost-virtual-machine-reference-references

http://juixe.com/techknow/index.php/2009/11/18/favorite-programming-quotes-2009/

http://www.readwriteweb.com/archives/top_10_software_engineer_traits.php



Friday, June 26, 2009

ALL ABOUT UNIT TESTING

"First about Boring Theory"

"A unit is the smallest testable part of an application." In computer programming, unit testing is a software verification and validation method by which the programmer gains confidence that individual units of source code are fit for use. The primary goal of unit testing is to take the smallest piece of testable software in the application, isolate it from the remainder of the code, and determine whether it behaves exactly as you expect. Each unit is tested separately before the units are integrated into modules to test the interfaces between modules. Unit testing has proven its value in that a large percentage of defects are identified during its use.

"In Java, unit test cases mean JUnit test cases; the single most important benefit of Spring & Guice (or any dependency injection framework) is to make unit testing easier"

JUnit is the de facto framework for unit testing in the Java world. JUnit is a simple library; although there are mock objects, test code generators, behavioural test design & many other tools based on dynamic languages, JUnit remains a viable option when testing libraries or APIs where a developer is the end user. Bob Lee (author of Guice) stresses the point that the single most important benefit of a dependency injection framework, or of interface-driven design for that matter, is easier testability. Google's great applications like Gmail, AdSense & Calendar are a testimony to this fact.

"Developers don't like writing unit test cases; Management needs to understand the technical debt associated with un-availability of test cases"

Let's face it, developers don't like writing unit tests or documentation. Kent Beck says software, like golf, is both a long and a short game. JUnit is an example of a long game project: lots of users, stable revenue, where the key goal is just to stay ahead of the needs of the users. So it's hard to sell writing JUnit cases for small projects that don't have a long life. It's clearly avoidable overhead in such cases (most web applications). It may not make economic sense to write extensive test cases for short-lived & small applications.

In some cases developers hate to be embarrassed & look stupid when someone finds a mistake; in others, highly technical guys think they don't need to test their solid code. The first case can be handled by management, as it's purely a competence issue which can be sorted out through training & other means. In the second case it's hard to convince these guys, as they are very much correct in their own way. It's an attitude problem, & the best way may be to deploy someone else to write the unit test cases. A "high level of code quality" is great, yet most software lives on and on and people expect to add or modify features in that software or debug it, so it's an economic requirement that the superstars prove with unit test cases that their code works. How high will the maintenance costs be without unit tests? How much more risk does that add?

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?" -- Brian Kernighan

"Unit testing seems to a lot of managers and developers like pure overhead, but professionally responsible developers know that it is one of the keys to quality." - Neal Ford

"Solid test cases with 100% coverage provide the courage to refactor the code & reduce the testing effort in future"

We need to educate ourselves that it makes economic sense to have solid test cases, especially for pure library (API) providers: the cost of separate testing & testers can be (almost) completely avoided, as the applications will test the APIs anyway & the testers need not duplicate that effort. Unfortunately these long-sighted approaches are difficult to sell to management, and it becomes a thankless job when unit-tested solid code can't be told apart from code that merely works. Developing APIs is a marathon & not a 100 metres race; stamina & perseverance play a very important role. In most cases an API, once committed, can't be taken back; I guess the deprecated JDK APIs must be haunting their initial designers. One of the benefits of JUnit test cases is that the API developers get first-hand experience of what it is like to use their own API.

As time goes on, there will be cases where the code works a bit less well, with some minor bugs and some dirty quick fixes (or hacks). Since you don't want to touch that code, you'll put fixes, enhancements & workarounds in other parts of the code, slightly but constantly degrading the quality of your design. You won't even upgrade a library you depend on, since you can't easily run regression tests against it. In a shorter time than you expect, that well designed and implemented project will turn into a nightmare. So it's not just about changing code, it's about a changing environment: RDBMS vendor, JDK version, library versions, OS versions…

"JUnit can be used to write end-to-end functional tests & unit test cases need to be reviewed"

By definition, a test is not a unit test if:

  • It talks to the database
  • It communicates across the network
  • It touches the file system
  • It can't run at the same time as any of your other unit tests
  • You have to do special things to your environment (such as editing config files) to run it.

If we go by the above principle, 90% of our JUnit test cases don't pass these rules. Although POJO-driven frameworks like Spring try to solve this, I don't think we can use JUnit in its pure form. It's OK to use JUnit for functional and integration test cases as well.

"Test Before Code, or perhaps Test Before Design." – That means unit test cases need to be reviewed before the design or coding. These reviews should probably be more thorough than the code reviews themselves.

"JUnit test cases can serve as a great documentation tool"

Writing documentation through Javadoc is probably a bad idea. Using verbose class names, method names & JUnit test cases is a more scalable & efficient way of documenting classes. Communicating through the code is the best way of communicating.
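To make the rules above concrete, here is a sketch of a test that stays a pure unit test: the dependency is an interface, and the test passes in an in-memory fake, so no database, network or file system is touched. All names (UserRepository, GreetingService, InMemoryUserRepository) are hypothetical, and a hand-rolled check method stands in for JUnit so that the example runs without JUnit on the classpath; in a real JUnit 4 test the two checks would be assertEquals calls inside an @Test method.

```java
import java.util.HashMap;
import java.util.Map;

public class GreetingServiceTest {

    // The dependency is an interface so that a test can swap in a fake.
    interface UserRepository {
        String findName(int userId);
    }

    static final class GreetingService {
        private final UserRepository repository;

        GreetingService(UserRepository repository) {
            this.repository = repository; // constructor injection
        }

        String greet(int userId) {
            String name = repository.findName(userId);
            return name == null ? "Hello, guest" : "Hello, " + name;
        }
    }

    // In-memory fake: no database, network or file system involved.
    static final class InMemoryUserRepository implements UserRepository {
        private final Map<Integer, String> names = new HashMap<Integer, String>();

        void add(int userId, String name) {
            names.put(userId, name);
        }

        public String findName(int userId) {
            return names.get(userId);
        }
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        InMemoryUserRepository repository = new InMemoryUserRepository();
        repository.add(1, "Suresh");
        GreetingService service = new GreetingService(repository);

        check("Hello, Suresh".equals(service.greet(1)), "known user should be greeted by name");
        check("Hello, guest".equals(service.greet(2)), "unknown user should be greeted as guest");
        System.out.println("all checks passed");
    }
}
```

A production UserRepository backed by a database would be exercised separately in an integration test, which is where the five rules stop applying.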

"JUnit test cases have to be efficient & succinct"

Manual testing hurts in terms of both economics & manageability. But at the end of the day, if the test cases don't capture the correct scenarios or, worse, if we have repetitive tests, JUnit really doesn't help. The best of the people involved in software development need to do this type of unit testing. The garbage in, garbage out rule is perfectly applicable here.

I am done with all my arguments to sell unit testing. Do you guys buy this argument? :-)

My next topic on unit testing would be on patterns and anti-patterns while writing test cases.

Resource:

http://www.artima.com/weblogs/viewpost.jsp?thread=126923

http://www.theserverside.com/news/thread.tss?thread_id=51615

http://www.junit.org

http://c2.com/cgi/wiki?WhoIsUsingJunit

http://c2.com/cgi/wiki?FunctionalTest

http://www.davenicolette.net/articles/functional_tdd.html

http://www.logigear.com/newsletter/api_vs_unit.asp

http://www.exubero.com/junit/antipatterns.html

http://www.infoq.com/news/2009/06/test-or-not

http://www.agitar.com/solutions/why_unit_testing.html




Friday, April 24, 2009

Search Domain Basics: Search is probably the most pervasive technology domain of this century. Here I have tried to cover some basic concepts & some software implementation details with Java as the focus. I thought this information could help newcomers to this domain & provide enough starting pointers to dig into more details.

"Content is King" - Content is what drives the web & "search" is the engine for that. All services providing content have to employ search techniques to fetch the required information with the minimum possible user input in the fastest possible way. Google & Yahoo, which have become synonymous with search, build all their applications around search one way or another. Apparently unstructured content forms the major portion of the content available in the universe. Unstructured data (or unstructured information) refers to (usually) computerized information that either does not have a data model or has one that is not easily usable by a computer program; in simple words, any data that is not represented in terms of column names in an RDBMS table schema. Parallel computing, data sharding, schema definition, the scale of the data & the nature of content acquisition make unstructured content unsuitable for a 100% RDBMS solution, although full text indexing does exist in the RDBMS world.

Raw data with context is called 'information', & information search and retrieval is all about locating relevant material in a collection of raw data in the fastest way, taking the minimum possible input from the users. The ability to aid and assist a user in finding relevant information is the primary goal of information engineers & information retrieval (IR) libraries.
Major parts of a search engine:
  • fetching/loading the documents - downloading content (lists of pages) that has been referenced.
  • analysis - analyzing the database to assign priority scores to pages (PageRank) and to prioritize fetching.
  • indexing - combines content from the fetcher, incoming information from the built-in data source and link analysis scores into a data structure that is quickly reachable, usually using cache services.
  • searching - returns the set of content that matches a query, ranking pages using the index.
  • database - keeping track of the documents along with various context information that helps to improve the ranking.
To scale to billions of documents, all of these must be distributable, i.e., each must be able to run in parallel on multiple machines. As we cannot afford to have the failure of any single component cause a major hiccup, a search solution must be able to scale easily by throwing more hardware into the pool, without massive reconfiguration, and things should largely fix themselves without human intervention. This is only possible with a stateless implementation of the software services.

Search Engine with Java for unstructured content: Studying the APIs of an information retrieval library gives better insight into what services a search engine is capable of providing, & I am taking Lucene as the sample for that.
The Lucene library has become the de facto IR library in the Java world. Lucene has now been ported to almost all the major languages, showcasing the popularity & capability of this small library.
Lucene is a search and retrieval library providing keyword and fielded search. It can use boolean AND, OR and NOT to formulate complex queries & supports fuzzy matching, which is useful when searching text created by optical character recognition. Unstructured content far exceeds structured content in the web world. Lucene mainly deals with unstructured content & can also search structured content effectively with field tags. Lucene provides the minimum required information retrieval functionality; we can call it a "search kernel" library that provides full-text search and indexing. Instead of an out-of-the-box application, Lucene offers a usable API for programmers and operates at a lower level. There are off-the-shelf projects (Compass, Nutch, Solr...) providing monitoring, transactions & other utilities over Lucene. Commercial enterprise search offerings come from vendors such as Autonomy, Google, Oracle & FAST (MS).
Lucene does not search file by file. The search space is analysed first and translated into a normalised representation, the index. Lucene uses an inverted index: every word appears in the index only once and points to the documents that contain it, which means the index is a compressed representation of the search space. Lucene itself only indexes plain text; however, a variety of free open source document parsers are available for document types such as RTF, PDF, HTML, XML, Word etc. Depending on the nature of the text content, various analysers are on offer. For example, text can be analysed with a whitespace analyzer, which breaks the text down into tokens separated by white space. The terms in the index can be normalised by applying stemming and lemmatisation algorithms. To keep the response time short, generating and optimising the index is a separate step, done ahead of the search.
Lucene beats the RDBMS in full text search in terms of processing speed, a manageable index footprint (now about 25 percent of the source documents’ size), easy incremental updates, support for index partitions, price & flexibility (index methodology, deployment options & schema evolution). The RDBMS way of searching with a WHERE clause & LIKE '%...%' is not only unscalable but inefficient. Although RDBMSs like Oracle include full text indexing capabilities, these have not been as popular as independent solutions such as Lucene, & it's also not easy to implement parallel processing (map/reduce) involving multiple machines & terabytes of data with them. The dynamic nature of search & its understanding of context are bringing huge changes to the search domain, and navigating a hierarchy is becoming old fashioned. With the technical problems pretty much solved in this domain, the key thing to remember now is that "search methodology is much more important than the underlying technology".
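The idea of an inverted index can be sketched in a few lines of plain Java. This is a toy for illustration only: it is not Lucene's API or its on-disk format, and the class name, the tiny stop word list and the simple split-on-non-alphanumerics analysis are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {

    // A tiny, made-up stop word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "and", "the", "to", "of", "is"));

    // term -> ids of the documents containing it; each term is stored only once.
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();

    // Analysis: lower-case, split on anything that is not a letter or digit, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Boolean AND search: ids of the documents containing every term of the query.
    Set<Integer> searchAll(String query) {
        Set<Integer> result = null;
        for (String term : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<Integer>(docs);
            } else {
                result.retainAll(docs); // intersect with the postings of the next term
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Search and retrieval of documents");
        index.add(3, "The library of Alexandria");
        System.out.println(index.searchAll("search library"));
    }
}
```

A query never scans the documents themselves; it only looks up each term and intersects the sets of document ids, which is why the lookup cost depends on the number of query terms and matches rather than on the size of the collection.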

Search Terminologies:
Proximity search: A search where users can specify that the documents returned should have the words near each other.
Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself. Involves parallel computing.
Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.
Stemming: The ability of a search to include the "stem" of words. For example, stemming allows a user to enter "running" and also get back results for the stem word "run."
Lemmatisation: The process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech.
Noise or stop words: Conjunctions, prepositions, articles and other words such as AND, TO and A that appear often in documents yet on their own carry little meaning.
Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.
Index: Normalized representation of words.
Semantic search: A process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.
Web search: Content is public & generic. Uses keywords & links (relevancy) based on some kind of historic traffic.
Enterprise search: Also covers private documents that are domain specific. The content returned should be of the highest quality & not necessarily the most popular. Information & metadata need to be secure, with role-based access to the content. It has to support security (realms, roles), SLAs and many other requirements. Public web search engines such as Google & Yahoo do not address these needs.
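Of the terms above, proximity search is the easiest to show in a few lines. The sketch below is a naive illustration that scans the token positions of a single text; the class and method names are made up, and a real engine would answer this from position information stored in the index instead.

```java
import java.util.Locale;

public class ProximityDemo {

    // True if both words occur in the text at most maxDistance token positions apart.
    static boolean within(String text, String first, String second, int maxDistance) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        String a = first.toLowerCase(Locale.ROOT);
        String b = second.toLowerCase(Locale.ROOT);
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(a)) {
                continue;
            }
            for (int j = 0; j < tokens.length; j++) {
                if (tokens[j].equals(b) && Math.abs(i - j) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Full text search with the Lucene library";
        // "search" is token 2 and "lucene" is token 5, so the distance is 3.
        System.out.println(within(text, "search", "Lucene", 3));
        System.out.println(within(text, "search", "Lucene", 2));
    }
}
```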

For now I am interested in researching the following topic in the search & retrieval field, so I will keep updating this blog with my research findings.
"Automatic annotation/summary addition for content": Lengthy documents are boring to read. It would be great if someone, or a computer, could automatically create the gist of the content. Automatic creation of annotations is a tough task, especially for non-domain-specific topics, but it can be predictable in domain-specific cases. For example, it might be easier to extract the information from copies of legal judgments automatically (at least in routine cases that hardly require special knowledge from legal experts to annotate), or to annotate the documents automatically & pass them through a workflow for review. I see this as an interesting area.

Summary:
The IR & search domain is a pretty complex subject requiring mastery of algorithms & data structures. I hope that I have been able to bring together the information related to search, taking Lucene as the sample search engine library.

References:

Lucene Book : http://www.manning.com/hatcher3/

Powered by Lucene : http://wiki.apache.org/lucene-java/PoweredBy/

Google Search : http://en.wikipedia.org/wiki/Google_search

Useful wrapper libraries over Lucene.

http://lucene.apache.org/solr/features.html - Solr Features

http://www.compass-project.org/overview.html - Compass Features
