Friday, June 26, 2009

All about Unit Testing

"First about Boring Theory"

"A unit is the smallest testable part of an application." In computer programming, unit testing is a software verification and validation method by which the programmer gains confidence that individual units of source code are fit for use. The primary goal of unit testing is to take the smallest piece of testable software in the application, isolate it from the remainder of the code, and determine whether it behaves exactly as you expect. Each unit is tested separately before the units are integrated into modules, at which point the interfaces between modules are tested. Unit testing has proven its value in that a large percentage of defects are identified during its use.

"In Java, unit test cases mean JUnit test cases; the single most important benefit of Spring & Guice (or any dependency injection framework) is to make unit testing easier"

JUnit is the de facto framework for unit testing in the Java world. JUnit is a simple library. Although there are mock objects, test code generators, behavioral test design and many other tools based on dynamic languages, JUnit remains a viable option for testing libraries or APIs where the developer is the end user. Bob Lee (author of Guice) stresses that the single most important benefit of a dependency injection framework, or of interface-driven design for that matter, is easier testability. Great Google applications such as Gmail, Google AdSense and Calendar are a testimony to this fact.

"Developers don't like writing unit test cases; management needs to understand the technical debt associated with the absence of test cases"

Let's face it, developers don't like writing unit tests or documentation. Kent Beck says that software, like golf, is both a long and a short game. JUnit is an example of a long game project: lots of users, stable revenue, where the key goal is just to stay ahead of the needs of the users. So it's hard to sell writing JUnit test cases for small projects that don't have a long life. In such cases (most web applications) it is clearly an avoidable overhead. It may not make economic sense to write extensive test cases for short-lived, small applications.

In some cases developers hate to be embarrassed and look stupid when someone finds a mistake; in others, highly technical people think they don't need to test their solid code. The first case can be handled by management, as it is purely a competence issue that can be sorted out through training. The second case is harder, because these people are largely correct in their assertions, in their own way. A "high level of code quality" is great, yet most software lives on and on, and people expect to add or modify features in that software or debug it. How much will maintenance cost without unit tests? How much more risk does that add?

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?" -- Brian Kernighan

"Solid test cases with 100% coverage provide the courage to refactor the code & reduce the testing effort in future"

We need to educate ourselves that it makes economic sense to have solid test cases, especially for providers of pure libraries (APIs): the cost of separate testing and testers can be almost completely avoided, since applications will exercise the APIs anyway, and we avoid duplicating the testers' effort on the APIs. Unfortunately, these far-sighted approaches are difficult to sell to management, and it becomes a thankless job when solid, unit-tested code can't be told apart from code that merely works. Developing APIs is a marathon, not a 100-metre race. Stamina and perseverance play a very important role. In most cases a committed API can't be taken back; I guess the deprecated JDK APIs must be haunting and humiliating their initial designers. One of the benefits of JUnit test cases is that API developers get first-hand experience of what the developers using the API will go through.

As time goes on, there will be cases where the code works a bit less well: some minor bugs appear, and some dirty quick fixes (or hacks) happen. Since you don't want to touch that code, you'll put fixes/enhancements/workarounds in other parts of the code, slightly but constantly degrading the quality of your design. You won't even upgrade a library you depend on, since you can't easily run regression tests over it. In a shorter time than you expect, that well designed and implemented project will turn into a nightmare. So it's not just about changing code, it's about a changing environment: RDBMS vendor, JDK version, library versions, OS versions…

"JUnit can be used to write end-to-end functional tests & unit test cases need to be reviewed"

By definition, a test is not a unit test if:

  • It talks to the database
  • It communicates across the network
  • It touches the file system
  • It can't run at the same time as any of your other unit tests
  • You have to do special things to your environment (such as editing config files) to run it.
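
One way to stay within the rules above is to hide the database behind an interface and hand the unit an in-memory fake during the test. The following is only a minimal sketch in plain Java: UserRepository, GreetingService and InMemoryUserRepository are hypothetical names made up for illustration, and in a real project the checks would live in a JUnit test method rather than in hand-written code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: the only thing the unit knows about storage.
interface UserRepository {
    String findName(int id); // returns null when the id is unknown
}

// Unit under test: depends on the interface, never on a real database.
class GreetingService {
    private final UserRepository repository;

    GreetingService(UserRepository repository) {
        this.repository = repository; // dependency is injected, not created here
    }

    String greet(int id) {
        String name = repository.findName(id);
        return name == null ? "Hello, guest" : "Hello, " + name;
    }
}

// In-memory fake used only by tests: no network, file system or database.
class InMemoryUserRepository implements UserRepository {
    private final Map<Integer, String> names = new HashMap<>();

    InMemoryUserRepository with(int id, String name) {
        names.put(id, name);
        return this;
    }

    @Override
    public String findName(int id) {
        return names.get(id);
    }
}
```

A test can then build `new GreetingService(new InMemoryUserRepository().with(1, "Asha"))` and check `greet(1)` and `greet(2)` without touching any environment setup, so it can run alongside every other unit test.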

If we go by the above principle, 90% of our JUnit test cases don't pass these rules. Although POJO-driven frameworks like Spring try to solve this, I don't think we can use JUnit in its pure form. It's OK to use JUnit for functional and integration test cases as well.

"Test Before Code, or perhaps Test Before Design": that means unit test cases need to be reviewed before design or coding. These reviews should probably be more thorough than the code reviews themselves.

"JUnit test cases can serve as a great documentation tool"

Writing documentation through Javadocs is probably a bad idea. Verbose class names, method names and JUnit test cases are a more scalable and efficient way of documenting classes. Communicating through the code is the best way of communicating.
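
As a rough illustration of tests that read like documentation, here is a small sketch in which each method name states one fact about using java.util.ArrayDeque as a stack. The class name StackBehaviourTest and the check helper are hypothetical; with JUnit the methods would be ordinary test methods and check would be one of JUnit's assertions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical test class: the method names are the documentation.
class StackBehaviourTest {

    static void popReturnsTheMostRecentlyPushedElement() {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("first");
        stack.push("second");
        check("second".equals(stack.pop()));
    }

    static void peekOnEmptyStackReturnsNullInsteadOfThrowing() {
        Deque<String> stack = new ArrayDeque<>();
        check(stack.peek() == null);
    }

    // Stand-in for a JUnit assertion so the sketch runs without any library.
    private static void check(boolean condition) {
        if (!condition) {
            throw new AssertionError("documented behaviour no longer holds");
        }
    }
}
```

A reader who only skims the method names still learns how the class behaves, and unlike a comment, the statement fails loudly when it stops being true.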

"JUnit test cases have to be efficient & succinct"

Manual testing hurts both economically and in terms of manageability. But at the end of the day, if test cases don't capture the correct scenarios, or worse, if we have repetitive tests, JUnit really doesn't help. The best people involved in software development need to do this type of unit testing. The garbage in, garbage out rule applies perfectly here.

I am done with all my arguments to sell unit testing. Do you guys buy them? :-)

My next topic on unit testing will be patterns and anti-patterns in writing test cases.

Resources:

http://www.artima.com/weblogs/viewpost.jsp?thread=126923

http://www.theserverside.com/news/thread.tss?thread_id=51615

http://www.junit.org

http://c2.com/cgi/wiki?WhoIsUsingJunit

http://c2.com/cgi/wiki?FunctionalTest

http://www.davenicolette.net/articles/functional_tdd.html

http://www.logigear.com/newsletter/api_vs_unit.asp

http://www.exubero.com/junit/antipatterns.html

http://www.infoq.com/news/2009/06/test-or-not

Friday, April 24, 2009

Search Domain Basics: Search is probably the most pervasive technology domain of this century. Here I have tried to cover some basic concepts and some software implementation details, with Java as the focus. I hope this information helps newcomers to this domain and provides enough starting pointers to dig into more detail.

"Content is King": content is what drives the web, and search is the engine for it. All services that provide content have to employ search techniques to fetch the required information with the minimum possible user input in the fastest possible way. Google and Yahoo, which have become synonymous with search, build all their applications around search in one way or another. Unstructured content apparently forms the major portion of the content available. Unstructured data (or unstructured information) refers to (usually) computerized information that either does not have a data model or has one that is not easily usable by a computer program; in simple words, any data that is not represented in terms of column names in an RDBMS table schema. Parallel computing, data sharding, schema definition, the scale of the data and the nature of content acquisition make unstructured content unsuitable for a 100% RDBMS solution, although full-text indexing does exist in the RDBMS world.

Raw data with context is called 'information'. Information search and retrieval is all about locating relevant material in a collection of raw data as fast as possible, with the minimum possible input from users. Helping a user find relevant information is the primary goal of information engineers and information retrieval (IR) libraries.
Major parts of a search engine:
  • Fetching/loading the documents: downloading content (lists of pages) that has been referenced.
  • Analysis: analyzing the database to assign priority scores to pages (PageRank) and to prioritize fetching.
  • Indexing: combining content from the fetcher, incoming information from the built-in data source, and link analysis scores into a data structure that can be reached quickly, usually using cache services.
  • Searching: returning a set of content, ranking pages against a query using an index.
  • Database: keeping track of documents along with various context information that helps improve the ranking.
To scale to billions of documents, all of these must be distributable, i.e., each must be able to run in parallel on multiple machines. Since we cannot afford to have the failure of any single component cause a major hiccup, a search solution must be able to scale simply by throwing more hardware into the pool, without massive reconfiguration, and things should largely fix themselves without human intervention. This is only possible with a stateless implementation of the software services.

Search Engine with Java for unstructured content: Studying the APIs of an information retrieval library gives better insight into what services a search engine can provide, and I am taking Lucene as the sample for that.
The Lucene library has become the de facto IR library in the Java world, and it has been ported to almost all the major languages, showcasing the popularity and capability of this small library.
Lucene is a search and retrieval library providing keyword and fielded search. It can use boolean AND, OR, and NOT to formulate complex queries, and can use fuzzy matching, which is useful when searching text created by optical character recognition. Unstructured content far exceeds structured content in the web world. Lucene mainly deals with unstructured content, but can also search structured content effectively using field tags. Lucene provides the minimum required information retrieval functionality. We can call it a "search kernel" library that provides full-text search and indexing functionality. Instead of an out-of-the-box application, Lucene offers a usable API for programmers and operates at a lower level. There are off-the-shelf libraries (Compass, Nutch, Solr...) that provide monitoring and transaction utilities over Lucene. Commercial enterprise search offerings come from vendors such as Autonomy, Google, Oracle and FAST (MS).
Lucene does not search file by file. The search space is analysed first and translated into a normalised representation: the index. Lucene uses an inverted index (sometimes called a reverse index). All words in the index are unique, which means the index is a compressed representation of the search space. Lucene only supports plain text. However, a variety of free open source document parsers are available for document types such as RTF, PDF, HTML, XML, Word etc. Depending on the nature of the text content, various analysers are on offer. For example, text can be analysed with a whitespace analyzer, which breaks the text down into tokens separated by white space. To keep the response time short, the processes of generating and optimising the index are separated. The index is normalised by applying stemming and lemmatisation algorithms.
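
To make the idea of an inverted index concrete, here is a toy sketch in plain Java. It is not Lucene's implementation or API: TinyInvertedIndex and its methods are hypothetical names, the tokenizer just splits on white space, and there is no ranking. It only shows the core trick of mapping each unique word to the documents that contain it, so a query never scans the documents themselves.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: term -> sorted ids of the documents containing it.
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Whitespace tokenization plus lower-casing; each term is stored once.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Single-term lookup; returns a copy so callers cannot alter the index.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : new TreeSet<>(hits);
    }

    // Boolean AND: intersect the posting lists of both terms.
    SortedSet<Integer> searchAll(String first, String second) {
        SortedSet<Integer> result = search(first);
        result.retainAll(search(second));
        return result;
    }
}
```

After adding a few documents, `search("lucene")` is a single map lookup and `searchAll("lucene", "index")` is an intersection of two small sets, which is why query time depends on the size of the posting lists rather than on the size of the documents.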
Lucene beats the RDBMS in full-text search in terms of processing speed, a manageable index footprint (now about 25 percent of the source documents' size), easy incremental updates, support for index partitions, price and flexibility (index methodology, deployment options and schema evolution). The RDBMS way of searching with a WHERE clause and LIKE with % wildcards is not only unscalable but inefficient. Although RDBMSs like Oracle include full-text indexing capabilities, they have not been as popular as independent solutions such as Lucene, and it is not easy to implement parallel processing (map/reduce) involving multiple machines and terabytes of data with them. The dynamic, context-aware nature of search is bringing huge changes to the search domain, and navigating a hierarchy is becoming old-fashioned. With the technical problems in this domain largely solved, the key thing to remember is "search methodology is much more important than the underlying technology".

Search Terminologies:
Proximity search: A search where users can specify that the documents returned should have the words near each other.
Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself. Involves parallel computing.
Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.
Stemming: The ability for a search to include the "stem" of words. For example, stemming allows a user to enter "running" and also get back results for the stem word "run."
Lemmatisation: The process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
Noise or stop words: Conjunctions, prepositions, articles and other words such as AND, TO and A that appear often in documents yet on their own carry little meaning.
Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.
Index: A normalized representation of the words in the search space.
Semantic search: A process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.
Web search: Content is public and generic. Uses keywords and links (relevancy) based on some kind of historic traffic.
Enterprise search: Also covers private documents that are domain specific. Content should be of the highest quality, not necessarily popular. Information/metadata needs to be secure, with role-based access to the content. It has to support security (realms, roles), SLAs and many other requirements. The public Google & Yahoo web search engines do not provide enterprise search.
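
Two of the terms above, stop words and stemming, are easy to sketch in a few lines of Java. This is a deliberately crude illustration and not how Lucene's analyzers work: NaiveAnalyzer is a hypothetical class, the stop word list is a tiny made-up sample, and the "stemmer" only strips a trailing "ing" or "s", so it will mangle plenty of real words (a real stemmer such as Porter's is far more careful).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Crude analyzer: lower-case, split on non-letters, drop stop words, stem.
class NaiveAnalyzer {
    // Tiny illustrative stop word list; real lists are much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "to", "of"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    // Suffix stripping only: "running" -> "run", "runs" -> "run".
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            String stem = word.substring(0, word.length() - 3);
            int n = stem.length();
            // Collapse a doubled final letter: "runn" -> "run".
            if (n >= 2 && stem.charAt(n - 1) == stem.charAt(n - 2)) {
                stem = stem.substring(0, n - 1);
            }
            return stem;
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

Because the stemmer looks at one word at a time with no context, it shows exactly the limitation mentioned under lemmatisation: it cannot tell a noun from a verb, it can only chop suffixes.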

For now I am interested in researching the following topic in the search and retrieval field, so I will keep updating this blog with my research findings.
"Automatic annotation/summary addition for content": Lengthy documents are boring to read. It would be great if a person or a computer could automatically create the gist of the content. Automatic annotation is a tough task, especially for non-domain-specific topics, but can be more predictable in domain-specific cases. For example, it might be easier to extract information automatically from copies of court judgments (at least in routine cases that hardly require special knowledge from legal experts to annotate), or to annotate the documents automatically and route them through a review workflow. I see this as an interesting area.

Summary:
IR and search is a pretty complex subject requiring mastery of algorithms and data structures. I hope I have been able to bring together the information related to search, taking Lucene as the sample search engine library.

References:

Lucene Book: http://www.manning.com/hatcher3/

Powered by Lucene: http://wiki.apache.org/lucene-java/PoweredBy/

Google Search: http://en.wikipedia.org/wiki/Google_search

Useful wrapper libraries over Lucene.

http://lucene.apache.org/solr/features.html - Solr Features

http://www.compass-project.org/overview.html - Compass Features

Friday, April 17, 2009

Some statistics about the Java source code of popular open source libraries. Currently I am learning a big system that has several million lines of source code. Just for fun, I wrote a simple utility to extract information about Java source code, and ran it on the "src" directories of many open source libraries as well. This Java utility takes a source code directory as input and traverses all the Java code inside that directory recursively. It collects the total number of active lines of code (excluding comments), the package count...
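
The original utility is not shown here, so the following is only a sketch of how such a counter could look, not the code that produced the figures below. SourceStats is a hypothetical name, a "package" is assumed to be any directory that directly contains a .java file, and the comment handling is naive (a line that starts with a comment marker is skipped entirely, and comment markers inside string literals are not recognised).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Sketch of a source counter: files, active lines and package directories.
class SourceStats {
    int files;
    int lines;
    final Set<Path> packages = new HashSet<>();

    // Walks srcDir recursively and accumulates statistics for every .java file.
    static SourceStats scan(Path srcDir) throws IOException {
        List<Path> javaFiles = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(srcDir)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".java"))
                 .forEach(javaFiles::add);
        }
        SourceStats stats = new SourceStats();
        for (Path file : javaFiles) {
            stats.files++;
            stats.packages.add(file.getParent()); // assumption: one directory = one package
            stats.lines += countActiveLines(Files.readAllLines(file)); // expects UTF-8 sources
        }
        return stats;
    }

    // Counts lines that are neither blank nor (naively) part of a comment.
    static int countActiveLines(List<String> source) {
        int count = 0;
        boolean inBlockComment = false;
        for (String raw : source) {
            String line = raw.trim();
            if (inBlockComment) {
                if (line.contains("*/")) {
                    inBlockComment = false;
                }
                continue;
            }
            if (line.isEmpty() || line.startsWith("//")) {
                continue;
            }
            if (line.startsWith("/*")) {
                inBlockComment = !line.contains("*/");
                continue;
            }
            count++;
        }
        return count;
    }
}
```

Calling `SourceStats.scan(Paths.get("src"))` and dividing `lines` by `files` with integer division would give an average per file in the same style as the listing that follows.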

& here is the result.

********* JDK1.5 ********
Total # Lines = 850918
Total # of Files = 6556
Avg # Lines per file = 129
Total # of packages = 368
******** Hadoop 0.18.3 ******
Total # Lines = 129742
Total # of Files = 926
Avg # Lines per file = 140
Total # of packages = 66
******* Lucene 2.4.1 *********
Total # Lines = 69606
Total # of Files = 528
Avg # Lines per file = 131
Total # of packages = 16
******** Struts *********
Total # Lines = 63707
Total # of Files = 1040
Avg # Lines per file = 61
Total # of packages = 120
********* iText 2.1.5 ***********
Total # Lines = 96830
Total # of Files = 544
Avg # Lines per file = 177
Total # of packages = 55
******* Tapestry 5.1.0.3 ******
Total # Lines = 96395
Total # of Files = 1937
Avg # Lines per file = 49
Total # of packages = 103
It's not at all surprising that one of the most prolific Java coders (Howard Lewis Ship, the author of Tapestry) comes out best in the Java world when it comes to modularity.
********* iBatis *********
Total # Lines = 14132
Total # of Files = 202
Avg # Lines per file = 69
Total # of packages = 45
***** Hibernate 3.3.1.GA *******
Total # Lines = 173698
Total # of Files = 2102
Avg # Lines per file = 82
Total # of packages = 292

I guess these figures can be considered a baseline when reviewing the modularity of any library. Do let me know if anyone is interested in the code, or in having an ANT target to analyze the source code between releases.

Tuesday, March 10, 2009

Interview with one of the best API designers in the world - Joshua Bloch.



Adding some other interesting, wise quotes on API design that I have read...
When you design user interfaces, it's a good idea to keep two principles in mind:
Users don't have the manual, and if they did, they wouldn't read it. In fact, users can't read anything, and if they could, they wouldn't want to.
The same rule applies to API designers as well:
Developers don't have the Javadocs, and if they did, they wouldn't read them. In fact, developers can't read anything; all they do is press "." after an object reference in the IDE and wait for something to select, and if they could read, they wouldn't want to.
Learnability, efficiency, memorability, errors and satisfaction remain the core of good interface design.
APIs should emerge from the needs of real applications, and they should make common tasks super easy, as the demand for quality, validated designs far exceeds our capacity to create them. In 1996 it wasn't clear that we could create a sufficiently fast language without primitive types and arrays. It wasn't clear how much boilerplate code would be required by anonymous callback classes or checked exceptions. So Java couldn't resist including primitive types, excluding closures in favour of anonymous classes, and over-using checked exceptions.
Grady Booch:
"The great thing about objects is that they can be replaced." The great thing about Spring is that it helps you replace them. Flexibility is much more important than re-use.
Joshua Bloch says,
  • Public APIs are forever - one chance to get it right
  • Good code is modular–each module has an API
  • Thinking in terms of APIs improves code quality
  • Easy to use, even without documentation
  • Don’t let implementation details “leak” into API
  • Make classes and members as private as possible 
  • Make variables "final" wherever possible
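
A few of these points (keep members private, make variables final, don't leak implementation details) can be shown in one small class. This is just my own illustrative sketch, not an example taken from Joshua Bloch: Playlist is a hypothetical class, and the defensive copy is one common way to keep the internal list from escaping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical value class: final class, private final fields, no setters.
final class Playlist {
    private final String name;
    private final List<String> tracks; // how tracks are stored is not part of the API

    Playlist(String name, List<String> tracks) {
        this.name = name;
        // Defensive copy: later changes to the caller's list cannot affect us.
        this.tracks = Collections.unmodifiableList(new ArrayList<>(tracks));
    }

    String name() {
        return name;
    }

    // Exposes the List interface only; callers get a read-only view.
    List<String> tracks() {
        return tracks;
    }
}
```

Since callers only ever see a read-only List, the class could later switch to an array or another collection internally without breaking anyone, which matters when a public API is forever.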
Now some economics:
Economic reality suggests that buying more memory can be easier and cheaper than paying someone to debug code, and that in the long run it pays more to have understandable slow code than super-fast cryptic code.
I once heard an architect at an IBM conference (I don't remember the name) explain the choice of Java over C++ like this:
"The most compelling reason for adopting Java over C++ is automatic memory management. It protects the application from mediocre programmers. It eliminates many embarrassing memory leaks and crashes that occur randomly in production." He went on to say, "So as a result we are trading inexplicable crashes for slow performance (automatic memory management and database-centric storage). This makes sure that the application at least works."
"Rushing is at the root of all lack of quality" - Peter Calthorpe, architect

Monday, March 09, 2009

javaiq.in - A playground for learning Java RIAs. I am planning to expose my experimental applications built with GWT, JavaFX and Flex. I will also be creating applications that mix multiple open services from various vendors (Google, Yahoo, Amazon, eBay...) to create combined value. I guess this is an area with huge scope.
I created this site and made it public two months back. It was the result of experimenting with new techniques through sample applications and code snippets; I have always felt that the best way to sell any new technique is with a real application. It's great to have useful apps while learning new technologies and frameworks, and I hate writing throwaway examples. I have been following web frameworks for the past few years and have spent a large amount of time validating and learning them. I am a Swing programmer and was always trying out samples from the Internet and books, learning (or stealing) the good code snippets that I liked, and I thought of exposing them as useful apps. Having spent a considerable amount of time and energy on Swing, it gave me a queasy feeling in my stomach to see all my Swing programming heroes (like Chet, Romain Guy...) either go to Adobe/Google or move to the JavaFX/Flex way. I have also lost faith in Swing. I don't expect to develop new pure Swing apps any more, but I guess I was able to grasp the GWT, Echo and Wicket frameworks much better than developers of pure web MVC frameworks (Struts, Spring MVC...) who had no exposure to Swing. I guess that's an advantage I can still leverage.
Chasing the entrepreneurship dream - "When you are not into inheriting money and because you do not belong to a rich family and when you aren’t an individual blessed with the talent of a sportsperson or an actor, the next best thing to do is to become an entrepreneur" - Anand Morzaria
A man is a success if he gets up in the morning and gets to bed at night, and in between he does what he wants to do - Bob Dylan
Experts now say it has become relatively cheap to start a new web application or a startup. Moore's law has made hardware cheap; open source has made software free; the web has made marketing and distribution free; and more powerful programming languages and techniques are making development teams smaller and more powerful. But in real life it is not that simple, especially if you are fighting a lone battle.
Creating and managing applications is a pretty time-consuming process. I went with cheap shared Java hosting, which was unable to run my application based on the Stripes web framework. No one was there to help me out by providing the Tomcat logs, so I had to settle for a trimmed-down version with a few JSPs. Running Java apps properly is a pretty costly affair, and that is a big roadblock for anybody who wants to develop and expose web applications in Java today. Sun has to fix this problem (shared hosting) if it wants Java to replace PHP/Ruby on Rails in small-scale applications. (I will write separate notes on Java hosting.) Enthusiasm for coding is very hard to maintain: I actually wrote most of the applications in a few days, but it was difficult to keep up the momentum. Perseverance and relentless resourcefulness (as Paul Graham calls it) are difficult to achieve without full-time dedication. Anyway, it was a good exercise to learn all these limitations.
If I look back, I seriously doubt that the countless hours I spent browsing blogs and learning tens of Java web frameworks have had a significant impact on my day-to-day work or helped me think better. In fact, my guess is that I wasted more than 30% of that time gleaning through useless or marketing literature. Hopefully from now on I will be able to channel all my energy into building better context for my day-to-day work rather than learning every Java web framework under the sun. BTW, the authors of the web frameworks I have been following for the past year (Wicket, AppFuse, Stripes, Tapestry...), all heroes of mine, are apparently looking for jobs themselves and are investing time in Groovy, Scala and Clojure. :-( Not really a happy situation. I guess innovation with pure Java in this space has reached a dead end. Now I want to learn and invest more in my current work so that I can become better at what I am doing currently; that's my priority #1, and I will resist experimenting with whatever blog writers say about the latest technologies. One of the troubles I had in disclosing this site was the public embarrassment of people knowing that I created this half-baked application, which I hesitated to own. But now it's quite a decent set of useful applications that started as my weekend project, and I have good ideas to make it still better. This will always be my low-priority work, and I will also make sure to get the best out of what I am currently working on. Oh! Now I can call myself CTO of javaiq. Nice feeling. :-)