Praveen Manvi's Technical Diary

Friday, April 24, 2009

Search Domain Basics: Search is probably the most pervasive technology domain of this century. Here I have tried to cover some basic concepts and some software implementation details, with Java as the focus. I hope this information helps newcomers to the domain and gives them enough starting pointers to dig into the details.

"Content is King": content is what drives the web, and "search" is the engine for it. Every service that provides content has to employ search techniques to fetch the required information as fast as possible with the minimum possible user input. Google and Yahoo, which have become synonymous with search, relate almost all of their applications to search in one way or another. Unstructured content apparently forms the major portion of the content available in the universe. Unstructured data (or unstructured information) refers to (usually) computerized information that either does not have a data model or has one that is not easily usable by a computer program; in simple words, any data that is not represented as columns in an RDBMS table schema. The demands that unstructured content places on parallel computing, data sharding, schema definition, scale of data and content acquisition make it unsuitable for a 100% RDBMS solution, although full-text indexing does exist in the RDBMS world.

Raw data with context is called 'information', and information search and retrieval is all about locating relevant material in a collection of raw data as fast as possible while taking the minimum possible input from users. Helping a user find relevant information is the primary goal of information engineers and of information retrieval (IR) libraries.

Major parts of a search engine:

fetching/loading – downloading the content (lists of pages) that has been referenced.

analysis – analyzing the database to assign priority scores to pages (e.g. PageRank) and to prioritize fetching.

indexing – combining content from the fetcher, incoming information from the built-in data source and link analysis scores into a data structure that can be accessed quickly, usually with the help of cache services.

searching – ranking pages against a query using the index and returning the matching set of content.

database – keeping track of the documents along with various context information that helps improve the ranking.

To scale to billions of documents, all of these parts must be distributable, i.e., each must be able to run in parallel on multiple machines. Since we cannot afford to have the failure of any single component cause a major hiccup, a search solution must be able to scale easily by throwing more hardware into the pool, without massive reconfiguration, and things should largely fix themselves without human intervention. In practice this calls for a stateless implementation of the software services.

Search Engine with Java for unstructured content: Studying the API of an information retrieval library gives good insight into what a search engine is capable of providing, and I am taking Lucene as the sample for that.

The Lucene library has become the de facto IR library in the Java world. It has since been ported to almost all the major languages, which showcases the popularity and capability of this small library.

Lucene is a search and retrieval library providing keyword and fielded search. It can combine terms with the boolean operators AND, OR and NOT to formulate complex queries, and it supports fuzzy matching, which is useful when searching text created by optical character recognition. Unstructured content far exceeds structured content in the web world. Lucene mainly deals with unstructured content, but it can also search structured content effectively through fields.

Lucene provides the minimum required information retrieval functionality. We can call it a "search kernel": a library that provides full-text indexing and search. Instead of an out-of-the-box application, Lucene offers a usable API for programmers and operates at a lower level. There are off-the-shelf projects (Compass, Nutch, Solr...) that build on Lucene and add features such as monitoring and transaction utilities. Commercial enterprise search offerings are available from vendors such as Autonomy, Google, Oracle and FAST (Microsoft).

Lucene does not search file by file. The search space is analysed first and translated into a normalised representation, the index. Lucene uses an inverted index, which maps each word to the documents that contain it. Every word appears in the index only once, which makes the index a compressed representation of the search space. Lucene itself indexes only plain text. However, a variety of free open source document parsers are available for document types such as RTF, PDF, HTML, XML and Word. Depending on the nature of the text, various analyzers are on offer. For example, text can be analysed with a whitespace analyzer, which breaks the text into tokens separated by white space. To keep response times short, generating the index and optimising it are separate steps. The index terms can be normalised by applying stemming or lemmatisation algorithms.

Lucene beats the RDBMS at full-text search in terms of processing speed, a manageable index footprint (now about 25 percent of the source documents' size), easy incremental updates, support for index partitions, price and flexibility (index methodology, deployment options and schema evolution). Searching the RDBMS way, with a WHERE clause and LIKE '%...%', is not only unscalable but also inefficient. Although RDBMSs like Oracle include full-text indexing capabilities, these have not been as popular as independent solutions such as Lucene, and it is not easy to implement parallel processing (map/reduce) involving multiple machines and terabytes of data with them. The dynamic, context-aware nature of search is bringing huge changes to this domain, and navigating a hierarchy is becoming old-fashioned. With the technical problems largely solved, the key thing to remember is that "search methodology is much more important than the underlying technology".
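To make the idea of an index that maps words to documents a little more concrete, here is a minimal in-memory sketch. It is not Lucene's API or its on-disk format; the class name InvertedIndexSketch and its methods are made up for illustration, and the tokenizer simply splits on white space and lower-cases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not how Lucene stores its index.
class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits on white space and lower-cases each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Analyses the document once, at indexing time.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Boolean AND: returns the documents that contain every query term.
    SortedSet<Integer> searchAll(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Content is king");
        index.add(2, "Search is the engine for content");
        System.out.println(index.searchAll("content is"));
    }
}
```

A query never touches the original documents: it only looks up each term in the map and intersects the posting sets, which is why the lookup stays fast as the collection grows.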

Search Terminologies:

Proximity search: A search where users specify that the documents returned should have the words near each other.

Concept Search: A search for documents related conceptually to a word, rather than specifically containing the word itself. It involves parallel computing.

Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.

Stemming: The ability for a search to include the "stem" of words. For example, stemming allows a user to enter "running" and get back results also for the stem word "run."

Lemmatisation: The process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

Noise or Stop words: Conjunctions, prepositions, articles and other words such as AND, TO and A that appear often in documents yet carry little meaning on their own.

Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.

Index: A normalized representation of the words in the search space.

Semantic Search: A process used to improve online searching by using data from semantic networks to disambiguate queries and web text in order to generate more relevant results.

Web Search: The content is public and generic. Uses keywords and links (relevancy) based on some kind of historic traffic.

Enterprise Search: Also covers private, domain-specific documents. The content returned should be of the highest quality, not necessarily the most popular. Information and metadata need to be secured with role-based access to the content. It has to support security (realms, roles), SLAs and many other requirements. Public web search services such as Google and Yahoo do not cover this private content.
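Two of the terms above, stop words and stemming, are easy to illustrate in a few lines. The sketch below is a deliberately crude analyzer: the stop word list is a tiny made-up sample, and the suffix stripping is a toy rule, not the Porter stemmer or any analyzer that ships with Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: drops stop words and applies a naive suffix-stripping "stemmer".
class SimpleAnalyzerSketch {
    // A tiny sample list; real stop word lists are longer and language specific.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "to", "of", "in"));

    // Toy stemming rule: strips "ing" (undoing a doubled consonant) or a plural "s".
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            String base = word.substring(0, word.length() - 3);
            int n = base.length();
            if (n >= 2 && base.charAt(n - 1) == base.charAt(n - 2)) {
                base = base.substring(0, n - 1); // "runn" -> "run"
            }
            return base;
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lower-cases, splits on non-alphanumeric characters, removes stop words, stems the rest.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            terms.add(stem(raw));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Running to the documents"));
    }
}
```

Because the same analysis is applied to both the documents and the query, a search for "running" and a search for "run" end up looking for the same index term.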

As of now I am interested in researching the following topic in the search and retrieval field, so I will keep updating this blog with my findings.

"Automatic annotation/summary addition for content": Lengthy documents are boring to read. It would be great if someone, or a computer, could automatically create the gist of the content. Automatic creation of annotations is a tough task, especially for topics that are not domain-specific, but it can be predictable in domain-specific cases. For example, it might be easier to extract the information from copies of court judgments automatically (at least in routine cases that hardly require special knowledge from legal experts to annotate), or to annotate the documents automatically and route them through a review workflow. I see this as an interesting area.

Summary:

IR and search form a pretty complex subject that requires mastery of algorithms and data structures. I hope I have been able to pull together the information related to search, taking Lucene as the sample search engine library.

References:

Lucene Book : http://www.manning.com/hatcher3/

Google Search : http://en.wikipedia.org/wiki/Google_search

Useful wrapper libraries over Lucene.

http://lucene.apache.org/solr/features.html - Solr Features

http://www.compass-project.org/overview.html - Compass Features

Friday, April 17, 2009

Some statistics about the Java source code of popular open source libraries. Currently I am learning a big system that has multiple millions of lines of source code. Just for fun, I wrote a simple utility to extract information about Java source code, and I ran it on the "src" directories of many open source libraries as well. This Java utility takes a source code directory as input and recursively traverses all the Java code inside it. It collects the total number of active lines of code (excluding comments), the package count...
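A utility of this kind can be sketched roughly as follows. This is not the original tool: the class name SourceStatsSketch and the counting rules are assumptions (blank lines, // lines and /* ... */ blocks are skipped, and a line that mixes code with a block comment is handled only approximately).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch of a source statistics collector; counting rules are simplified.
class SourceStatsSketch {
    int files;
    int lines;
    final Set<String> packages = new TreeSet<>();

    // Counts lines that are neither blank nor (by a simple heuristic) comments.
    static int countActiveLines(List<String> source) {
        int count = 0;
        boolean inBlockComment = false;
        for (String raw : source) {
            String line = raw.trim();
            if (inBlockComment) {
                if (line.contains("*/")) {
                    inBlockComment = false;
                }
                continue;
            }
            if (line.isEmpty() || line.startsWith("//")) {
                continue;
            }
            if (line.startsWith("/*")) {
                if (!line.contains("*/")) {
                    inBlockComment = true;
                }
                continue;
            }
            count++;
        }
        return count;
    }

    // Returns the declared package name, or "" for the default package.
    static String packageOf(List<String> source) {
        for (String raw : source) {
            String line = raw.trim();
            if (line.startsWith("package ") && line.endsWith(";")) {
                return line.substring("package ".length(), line.length() - 1).trim();
            }
        }
        return "";
    }

    // Recursively visits every .java file under root.
    void scan(Path root) throws IOException {
        List<Path> javaFiles;
        try (Stream<Path> walk = Files.walk(root)) {
            javaFiles = walk.filter(p -> p.toString().endsWith(".java"))
                            .collect(Collectors.toList());
        }
        for (Path file : javaFiles) {
            // ISO-8859-1 maps every byte to a character, so decoding never fails on odd bytes.
            List<String> source = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
            files++;
            lines += countActiveLines(source);
            packages.add(packageOf(source));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SourceStatsSketch <src directory>");
            return;
        }
        SourceStatsSketch stats = new SourceStatsSketch();
        stats.scan(Paths.get(args[0]));
        System.out.println("Total # Lines = " + stats.lines);
        System.out.println("Total # of Files = " + stats.files);
        System.out.println("Avg # Lines per file = " + (stats.files == 0 ? 0 : stats.lines / stats.files));
        System.out.println("Total # of packages = " + stats.packages.size());
    }
}
```

The average is computed with integer division, so it is rounded down.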

Here are the results.

********* JDK1.5 ********

Total # Lines = 850918

Total # of Files = 6556

Avg # Lines per file = 129

Total # of packages = 368

******** Hadoop 0.18.3 ******

Total # Lines = 129742

Total # of Files = 926

Avg # Lines per file = 140

Total # of packages = 66

******* Lucene 2.4.1 *********

Total # Lines = 69606

Total # of Files = 528

Avg # Lines per file = 131

Total # of packages = 16

******** Struts *********

Total # Lines = 63707

Total # of Files = 1040

Avg # Lines per file = 61

Total # of packages = 120

********* iText 2.1.5 ***********

Total # Lines = 96830

Total # of Files = 544

Avg # Lines per file = 177

Total # of packages = 55

******* Tapestry 5.1.0.3 ******

Total # Lines = 96395

Total # of Files = 1937

Avg # Lines per file = 49

Total # of packages = 103

It's not at all surprising that Tapestry, written by one of the most prolific Java coders ("Howard"), comes out best in the Java world when it comes to modularity.

********* iBatis *********

Total # Lines = 14132

Total # of Files = 202

Avg # Lines per file = 69

Total # of packages = 45

***** Hibernate 3.3.1.GA *******

Total # Lines = 173698

Total # of Files = 2102

Avg # Lines per file = 82

Total # of packages = 292

I guess these figures can be used as a baseline when reviewing the modularity of any library. Do let me know if anyone is interested in the code, or in an Ant target to analyze the source code between releases.

Tuesday, March 10, 2009

Interview with one of the best API designers in the world - Joshua Bloch.

Adding some other interesting, wise quotes on API design that I have read...

When you design user interfaces, it's a good idea to keep two principles in mind:

Users don't have the manual, and if they did, they wouldn't read it. In fact, users can't read anything, and if they could, they wouldn't want to.

The same rule applies to API designers as well:

Developers don't have the Javadocs, and if they did, they wouldn't read them. In fact, developers can't read anything; all they do is press "." after an object reference in the IDE and wait for something to select, and if they could read, they wouldn't want to.

Learnability, efficiency, memorability, errors and satisfaction remain the core of good interface design.

APIs should emerge from the needs of real applications, and they should make common tasks super easy, as the demand for quality, validated designs far exceeds our capacity to create them. In 1996 it wasn't clear that we could create a sufficiently fast language without primitive types and arrays, and it wasn't clear how much boilerplate code anonymous callback classes or checked exceptions would require. So Java couldn't resist including primitive types, excluding closures in favour of anonymous classes, and over-using checked exceptions.

Grady Booch->

"Great thing about objects is that they can be replaced". The great thing about Spring is that it helps you replace them. Flexibility is much more important than re-use.

Joshua Bloch says,

Public APIs are forever - one chance to get it right
Good code is modular–each module has an API
Thinking in terms of APIs improves code quality
Easy to use, even without documentation
Don’t let implementation details “leak” into API
Make classes and members as private as possible
Make variables "final" wherever possible
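The last three points in the list can be shown in one small class. This is only a sketch; the class SearchRequestSketch and its fields are invented for the illustration and do not come from Bloch's talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Illustrative only: private final state, no setters, and no internal list leaking out.
final class SearchRequestSketch {
    private final String query;
    private final List<String> fields;

    SearchRequestSketch(String query, List<String> fields) {
        this.query = Objects.requireNonNull(query, "query");
        // Defensive copy: later changes to the caller's list cannot affect this object.
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
    }

    String query() {
        return query;
    }

    // Returns a read-only view, so the implementation detail (an ArrayList) does not leak.
    List<String> fields() {
        return fields;
    }

    public static void main(String[] args) {
        List<String> fields = new ArrayList<>();
        fields.add("title");
        SearchRequestSketch request = new SearchRequestSketch("lucene", fields);
        fields.add("body"); // does not change the request
        System.out.println(request.query() + " " + request.fields());
    }
}
```

Since nothing can change after construction, such an object can be shared freely between callers without extra documentation about who is allowed to modify it.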

Now some economics:

Economic reality suggests that buying more memory can be easier and cheaper than paying someone to debug code, and that in the long run understandable slow code pays more than super-fast cryptic code.

I once heard an architect at an IBM conference (I don't remember the name) explain the choice of Java over C++ like this:

"The most compelling reason for adopting Java over C++ is automatic memory management. It protects the application from mediocre programmers. It eliminates many of the embarrassing memory leaks and crashes that occur randomly in production." He went on to say, "So as a result we are trading inexplicable crashes for slow performance (automatic memory management and database-centric storage). This makes sure that the application at least works anyway."

"Rushing is at the root of all lack of quality" - Peter Calthorpe, architect

Reference:

Design Slides
Defensive Programming

Spring is Good

Monday, March 09, 2009

javaiq.in - A playground for learning Java RIAs. I am planning to expose my experimental applications built with GWT, JavaFX and Flex. I will also be creating applications that mix multiple open services from various vendors (Google, Yahoo, Amazon, eBay...) to create combined value. I guess this is an area with huge scope.

I created this site and made it public 2 months back. It is the result of my experiments with new techniques through sample applications and code snippets. I have always felt that the best way to sell any new technique is with a real application. It's great to build useful apps while learning new technologies and frameworks; I hate to write throwaway examples. I have been following web frameworks for the past few years and have spent a large amount of time learning and validating them. I am a Swing programmer and was always trying out samples from the Internet and books, learning (or stealing) the good code snippets that I liked, and I thought of exposing them as useful apps. Having spent a considerable amount of time and energy on Swing, I had a queasy feeling in my stomach seeing all my Swing programming heroes (like Chet, Romain Guy...) either go to Adobe/Google or move the JavaFX/Flex way. I have also lost faith in Swing. I don't expect to develop new pure Swing apps any more, but I guess I was able to grasp the GWT, Echo and Wicket frameworks much better than developers of pure web MVC frameworks (Struts, Spring MVC...) who had no exposure to Swing. I guess that's an advantage I can still leverage.

Chasing entrepreneurship dream- "When you are not into inheriting money and because you do not belong to a rich family and when you aren’t an individual blessed with the talent of a sportsperson or an actor, the next best thing to do is to become an entrepreneur" - Anand Morzaria

A man is a success if he gets up in the morning and gets to bed at night, and in between he does what he wants to do - Bob Dylan

Experts now say that it has become relatively cheap to start a new web application or a startup. Moore's law has made hardware cheap; open source has made software free; the web has made marketing and distribution free; and more powerful programming languages and techniques are making development teams smaller and more powerful. But in actual life it is not that simple, especially if you are fighting a lone battle.

Creating and managing applications is a pretty time-consuming process. I went with cheap shared Java hosting, which was unable to run my application based on the Stripes web framework. No one was there to help me out by providing the Tomcat logs, and I had to settle for a trimmed version with a few JSPs. Running Java apps properly is a pretty costly affair, and that is a big roadblock for anybody who wants to develop and expose web applications in Java today. Sun has to fix this problem (shared hosting) if it wants Java to replace PHP and Ruby on Rails in small-scale applications. (I will write separate notes on Java hosting.) Enthusiasm to code is very hard to maintain; I actually wrote most of the applications in a few days, but it was difficult to keep the momentum. Perseverance, or relentless resourcefulness (as Paul Graham calls it), is difficult to achieve without full-time dedication. Anyway, it was a good exercise to learn all these limitations.

If I look back, I seriously doubt that the countless hours I spent browsing blogs and learning tens of Java web frameworks have had a significant impact on my day-to-day work or on how I think. In fact, my guess is that I wasted more than 30% of that time gleaning through useless marketing literature. Hopefully from now on I will be able to channel all my energy into building better context for my day-to-day work rather than learning every Java web framework under the Sun. BTW, the authors of my web framework heroes (Wicket, AppFuse, Stripes, Tapestry...), whom I have been following for the past year, are apparently looking for jobs themselves and are investing time in Groovy, Scala and Clojure. :-( Not really a happy situation. I guess innovation with pure Java in this space has reached a dead end. Now I want to learn and invest more in my current work so that I can become better at what I am doing currently. That's my priority #1, and I will resist experimenting with whatever blog writers say about the latest technologies. One of the troubles I had with disclosing this site was the public embarrassment of people knowing that I had created this half-baked application, which I hesitated to own. But now it's a quite decent set of useful applications that started as my weekend project, and I have good ideas to make it still better. It will always be low-priority work for me, and I will make sure to get the best out of what I am currently working on. Oh! Now I can call myself the CTO of javaiq. Nice feeling. :-)

Friday, March 06, 2009

Popular techie words: We hear the words below a lot in discussions and blogs. I thought it would be good to have their (easier) definitions at hand, to explain if anyone asks "what's that?", as I also use them for buzzword compliance.

Technical Debt:
Technical Debt is a wonderful metaphor developed by Ward Cunningham to help us think about this problem. In this metaphor, doing things the quick and dirty way sets us up with a technical debt, which is similar to a financial debt. Like a financial debt, the technical debt incurs interest payments, which come in the form of the extra effort that we have to do in future development because of the quick and dirty design choice. We can choose to continue paying the interest, or we can pay down the principal by refactoring the quick and dirty design into the better design. Although it costs to pay down the principal, we gain by reduced interest payments in the future.
The metaphor also explains why it may be sensible to do the quick and dirty approach. Just as a business incurs some debt to take advantage of a market opportunity developers may incur technical debt to hit an important deadline. The all too common problem is that development organizations let their debt get out of control and spend most of their future development effort paying crippling interest payments.
Reference:
http://www.c2.com/cgi/wiki?TechnicalDebt
http://martinfowler.com/bliki/TechnicalDebt.html
http://www.youtube.com/watch?v=pqeJFYwnkjE

MapReduce:
MapReduce is a hierarchical scatter/gather operation.
MapReduce is a library that lets you adopt a particular, stylized way of programming that's easy to split among a bunch of machines. The basic idea is that you divide the job into two parts: a Map, and a Reduce. Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines - so all the pieces run at the same time. Reduce takes the results from the sub-parts and combines them back together to get a single answer.
Reference:
http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
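As a rough single-machine analogy of the two steps, here is a word count written as a map function and a reduce function. It is only a sketch: it runs inside one JVM on a parallel stream instead of a cluster, it is not the Hadoop API, and the class name WordCountSketch is made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: word count expressed as independent "map" steps and a combining "reduce" step.
class WordCountSketch {
    // Map: turns one chunk of text into partial word counts, independently of all other chunks.
    static Map<String, Integer> map(String chunk) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : chunk.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                partial.merge(word, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Reduce: combines two partial results into one without modifying either input.
    static Map<String, Integer> reduce(Map<String, Integer> left, Map<String, Integer> right) {
        Map<String, Integer> merged = new TreeMap<>(left);
        right.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        return merged;
    }

    // Chunks are mapped in parallel; the partial results are then reduced to a single answer.
    static Map<String, Integer> wordCount(List<String> chunks) {
        return chunks.parallelStream()
                     .map(WordCountSketch::map)
                     .reduce(new TreeMap<>(), WordCountSketch::reduce);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("content is king", "search is the engine")));
    }
}
```

The property that matters is that map works on each chunk in isolation and reduce is associative, so the pieces can run on different machines and be combined in any grouping.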

Cloud Computing:
With cloud computing, everything is web-based instead of desktop-based; it becomes possible to access all programs and documents from any computer that's connected to the Internet. Cloud computing makes this easier than ever before.

Wikipedia defines cloud computing as "a style of computing in which resources are provided as a service over the internet". A cloud computing user need not worry about managing a machine or service at the physical level (machine and location). Amazon SimpleDB is an example of this, as it handles the operating system and database maintenance functions, SLAs and operational issues.

Sharding:
Sharding or horizontal partitioning is about splitting up data sets. If data doesn't fit on one machine then split it up into pieces, each piece is called a shard.
Sharding is used when you have too much data to fit in one single relational database.

Reference:
http://highscalability.com/sharding-hibernate-way
http://highscalability.com/unorthodox-approach-database-design-coming-shard
http://lethargy.org/~jesus/archives/95-Partitioning-vs.-Federation-vs.-Sharding.html
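The simplest way to decide which shard a record lives on is to hash its key. The sketch below shows only that routing step; the class name ShardRouterSketch is made up, and real systems usually add a lookup directory or consistent hashing so that shards can be added without moving most of the data.

```java
// Illustrative only: routes a key to one of N shards by hashing it.
class ShardRouterSketch {
    private final int shardCount;

    ShardRouterSketch(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // floorMod keeps the result in [0, shardCount) even when hashCode() is negative.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ShardRouterSketch router = new ShardRouterSketch(4);
        for (String userId : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.println(userId + " -> shard " + router.shardFor(userId));
        }
    }
}
```

The same key always maps to the same shard, so reads and writes for one user go to one database, while different users spread across all of them.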

Refactoring:
Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior. Its heart is a series of small behavior-preserving transformations.

Reference:
http://www.refactoring.com/
http://en.wikipedia.org/wiki/Code_refactoring
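One such small transformation is "Extract Method". The sketch below keeps the version before and the version after side by side so that the unchanged behavior can be checked; the invoice example and all names in it are made up.

```java
// Illustrative only: the same calculation before and after an "Extract Method" refactoring.
class RefactoringSketch {
    // Before: one method mixes summing the prices with the tax calculation.
    static long totalBefore(long[] pricesInCents, int taxPercent) {
        long sum = 0;
        for (long price : pricesInCents) {
            sum += price;
        }
        return sum + sum * taxPercent / 100;
    }

    // After: each step has a name, and the external behavior is unchanged.
    static long totalAfter(long[] pricesInCents, int taxPercent) {
        long subtotal = subtotal(pricesInCents);
        return subtotal + tax(subtotal, taxPercent);
    }

    static long subtotal(long[] pricesInCents) {
        long sum = 0;
        for (long price : pricesInCents) {
            sum += price;
        }
        return sum;
    }

    static long tax(long amountInCents, int taxPercent) {
        return amountInCents * taxPercent / 100;
    }

    public static void main(String[] args) {
        long[] prices = {1000, 250};
        System.out.println(totalBefore(prices, 10) == totalAfter(prices, 10));
    }
}
```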

Code Smell:
In computer programming, code smell is any symptom in the source code of a program that possibly indicates a deeper problem.
Reference:
http://en.wikipedia.org/wiki/Code_smell

Saturday, February 28, 2009

DHH - The best talk I heard on startups

I was quite impressed by the speech "The secret to making money online" by DHH (creator of the Ruby on Rails framework) when I watched the video again today.

There are just 3 steps:
1. Create a great product that people like and find useful.
2. Put a price on the product and ask people to pay for it or for its usage.
3. Make profits!

Such a ridiculously simple, grand old rule! He goes on to say that you don't need to be a f***ing genius to achieve this.

He also came up with a nice probability analysis showing that the strike rate of success with this simple model is much higher than with the Yahoo, Google, Facebook and YouTube way (setting up billboards/hoardings) of attracting users with free services and using web page real estate to show ads and make money. He also suggests going slow on implementing the idea, as "Finding a good cause is incredibly hard & time consuming" - Craig Newmark.

Simply superb.

I have been following many startup blogs, including Paul Graham's, but I was never as impressed as this before: such simple words, yet they tell you the hard truth :-)

Thursday, November 20, 2008

A nice way to show what you know, with Wordle

Tuesday, November 11, 2008

Do we really need a web service api for serving static data when we already have a good query and mining language? (SQL, XML)

Google, Amazon and eBay provide static data by publishing it in their documentation:
- Location information (City, Country...)
- Categories
- Currency data
- Hierarchy information

These are the advantages that I see in the above approach:
- Uses local CPU power and saves unnecessary traffic to the web service
- Supports online and offline scenarios
- Supports the programming logic flow: since the information is available at compile time, it's easy to code against
- Solves the performance problem: whatever the improvements, network calls are always costly compared to local calls
- The data model is essentially flat, so it is easy to import, save and navigate

But I guess they should be better exposed as XML rather than as HTML

The problems I see with this approach:
- We need a stringent way of updating the local cache of static codes (usually stored in SQL, XML or a text file with delimiters)
- Static information that is easy to represent in code (say, fewer than 10 values that are unlikely to change) is represented as enums, so we end up with 2 rules: some data is represented with static enums and some with static codes
- There will be different rules for representing the data, and we will be sacrificing strong typing, which I think is not an issue, as the benefits to both parties (client and server) exceed the pain of String-based APIs.
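The two representations mentioned above can sit side by side, as in the following sketch. Everything in it is hypothetical: the enum values, the "code|name" file format and the class name StaticCodesSketch are assumptions made for the illustration, not the format any of these vendors publish.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: small stable sets as enums, larger published lists as a local String-keyed cache.
class StaticCodesSketch {
    // A handful of values that are unlikely to change: strongly typed.
    enum PhoneType { MOBILE, OFFICE, HOME }

    // Parses lines such as "IN|India" from a locally cached, delimited text file.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> codes = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = trimmed.split("\\|", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad line: " + line);
            }
            codes.put(parts[0].trim(), parts[1].trim());
        }
        return Collections.unmodifiableMap(codes);
    }

    public static void main(String[] args) {
        Map<String, String> countries = parse(Arrays.asList("# code|name", "IN|India", "US|United States"));
        System.out.println(countries.get("IN") + " " + PhoneType.MOBILE);
    }
}
```

Lookups against the map are local calls, which is the performance argument made earlier; the price is that a mistyped code such as "INN" is only detected at run time.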

Judging from mature, real-world SOAP-based web services, it's clear that we don't need to expose an API such as StaticCodeService() to get this static information. The demand for quality, correct designs far exceeds our capacity to create them. Web services should emerge from the needs of real applications, and they should make common tasks ridiculously easy.

In Summary,
It's not worth wasting SOAP web service API calls for getting static information.

Thursday, November 06, 2008

Why Fluent Interface? Here are some thoughts on using fluent interfaces when designing APIs, and on their impact.
Definition: In software engineering, a fluent interface (a term first coined by Eric Evans and Martin Fowler) is an object-oriented construct that defines a behavior capable of relaying the instruction 'context' of a subsequent call.

First philosophy,

The success of a conversation, communication or integration depends on the quality of the signals between the two entities involved: the signal must be clean and well understood. Communication is always based on context, shared context. Context also influences interpretation. For example, "use a fork" is interpreted differently depending on the context (Unix, dining table, Java or an Ant build file). A well-understood shared context improves the signal-to-noise ratio, which makes communication effective, expressive, easy to understand and easy to work with.
Domain Specific Languages (DSLs) have an implicit context. The context is never mentioned; once it is established, repeating it again and again is noise. The end goal for the parties involved in communication should be to reduce the noise to zero.
There are 2 types of DSL: internal, based on an existing language, and external, a new language with its own parser and full-fledged grammar. An external DSL is evidently tough to implement but very effective (SQL, HTML). An internal DSL, on the other hand, is easier to implement, especially with dynamic languages.

That's all about the boring philosophy...

The benefits,

No need to document the APIs.
- They are all self-explanatory
- Examples should do all the explaining, if any is needed
- Testing and adoption become easier

Less effort is required to use them, resulting in better economics.
- There is much less chance of making mistakes
- All the extra noise that does not add domain concepts is hidden
- The user has to write fewer lines of code. LOC translates to the number of bugs most of the time: the less code, the fewer bugs

A correctly written fluent API gives its author the satisfaction of a well-written novel, and gives the consumer the same reading experience.

The market is always right: all the new libraries are now coming out with fluent APIs.
Big software houses are putting more effort into making code less noisy. The trends from Microsoft (C# 3, IronRuby, F#...) and Sun (JRuby and other dynamic language support) point in that direction.

Experts think fluent APIs are cool.
- Martin Fowler is writing a book on DSLs
- Well-proven frameworks like JUnit are coming up with fluent API alternatives
- Google Collections, JMock, FEST, Guice... countless popular APIs are based on fluent APIs
- Ruby on Rails Active Record is the best example of how a DSL can simplify the job, and it has successfully forced people to think differently
- Market signals show that the best brains are talking more and more about DSLs and functional languages, and that these are definitely the way to go in developing software
- Joshua Bloch (Google Java architect) talks in his new Effective Java book about the Builder pattern for building immutable Java objects fluently
Buying more memory can be easier and cheaper than paying someone to understand and debug code. Well-written code (read: a fluent interface) results in good economics.

Now for some real-world code using fluent APIs.

Nothing markets a technique or technology better than showcasing real working code.

New Java Date and Time API (JSR 310):
Period thePeriod = Periods.periodBuilder().years(8).months(3).build();
This is how the experts feel code should be written, and it definitely looks better than the java.util.Calendar and java.util.Date APIs.

Fluent way of handling XML marshalling un-marshalling:

Here is a sample showing how this XML can be produced with fluent APIs:

<contacts>
  <contact>
    <name>praveenm</name>
    <phone type="mobile">98862342333</phone>
    <phone type="office">080-2344234233</phone>
    <email>pm@aol.com</email>
  </contact>
</contacts>

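Groovy Sample:
import groovy.xml.MarkupBuilder

// A MarkupBuilder created without arguments writes the markup to standard output
def mkp = new MarkupBuilder()
mkp.contacts {
    contact {
        name("praveenm")
        phone(type: "mobile", "98862342333")
        phone(type: "office", "080-2344234233")
        email("pm@aol.com")
    }
}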
Ruby Sample:
require 'builder'
x = Builder::XmlMarkup.new(:target => $stdout, :indent => 2)
x.contacts {
  x.contact {
    x.name('praveenm')
    x.phone '98862342333', :type => 'mobile'
    x.phone '080-2344234233', :type => 'office'
    x.email 'pm@aol.com'
  }
}

Google Collections:
An excellent example of how the JDK collections can be simplified with fluent APIs
public static final ImmutableSet<Integer> FAVORITE_NUMBERS
    = ImmutableSet.of(2, 9, 8, 15, 16, 50);

FEST Example: DSL-oriented API for functional Swing GUI testing

dialog.comboBox("domain").select("Users");
dialog.textBox("username").enterText("alex.ruiz");
dialog.button("ok").click();
dialog.optionPane().requireErrorMessage()
.requireMessage("Please enter your password");

JaxB common:

USAddress address = new USAddress()
.setName(name)
.setStreet(street)
.setCity(city)
.setState(state)
.setZip(new BigDecimal(zip));

I guess this is not a good example of a fluent API.

Guice:
@Override
protected void configure() {
    binder().bind(IUserService.class).to(UserServiceMockImpl.class);
    binder().bind(AuditInfo.class).to(DummyAuditInfo.class);
}

Hibernate:
List cats = session.createCriteria(Cat.class).setMaxResults(50).list();

DesignGridLayout:
A Fluent Layout manager

layout.row().label(label("Last Name")).add(lastNameField, 2).add(label("First Name")).add(firstNameField, 2);
layout.row().label(label("Phone"))
.add(phoneField, 2).add(label("Email")).add(emailField, 2);
layout.row().label(label("Address 1")).add(address1Field);
layout.row().label(label("Address 2")).add(address2Field);

Frameworks like Grails, JMock and JPA utilize fluent APIs. We also have some examples in the standard JDK itself, like StringBuffer, StringBuilder, ProcessBuilder etc.

These are the commonly seen techniques in Fluent APIs with Java.
- Method Chaining
- Nested Interfaces
- Builder Pattern
- static imports
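The first and third techniques in the list, method chaining and the Builder pattern, usually appear together. The sketch below is loosely modelled on the period example shown earlier, but it is not the JSR API: the class PeriodSketch, its builder and its toString format are made up for the illustration.

```java
// Illustrative only: an immutable value object built through a chained builder.
final class PeriodSketch {
    private final int years;
    private final int months;
    private final int days;

    private PeriodSketch(Builder builder) {
        this.years = builder.years;
        this.months = builder.months;
        this.days = builder.days;
    }

    static Builder builder() {
        return new Builder();
    }

    int years() { return years; }
    int months() { return months; }
    int days() { return days; }

    @Override
    public String toString() {
        return "P" + years + "Y" + months + "M" + days + "D";
    }

    static final class Builder {
        private int years;
        private int months;
        private int days;

        private Builder() {
        }

        // Each setter returns the builder itself, which is what makes chaining possible.
        Builder years(int value) { this.years = value; return this; }
        Builder months(int value) { this.months = value; return this; }
        Builder days(int value) { this.days = value; return this; }

        // The finished object has only final fields, so it cannot change after build().
        PeriodSketch build() {
            return new PeriodSketch(this);
        }
    }

    public static void main(String[] args) {
        PeriodSketch period = PeriodSketch.builder().years(8).months(3).build();
        System.out.println(period);
    }
}
```

Note how much code sits behind the single readable line in main; the extra effort falls on the API author rather than on the caller.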

And finally the problems with fluent APIs,

It's definitely not a 'wow' technique. People have been using it for quite a long time, and now it simply has a fancy name, 'Fluent API'. So let's not treat it as something on the scale of OOP, OOAD or MDA. It's a simple programming technique for making code read better.

- Writing a DSL is like writing a good essay, and programmers are not essayists.
- Java is not well suited to DSLs: it has no closures or open classes, it is quite verbose & it is full of badly written APIs.
- DSLs are very difficult to get right. A fluent API pays off for a heavily used API; otherwise the investment in developing it may not be as effective as intended. So not every API can be made fluent.
- It is difficult to track down null return values that occur somewhere in the chain.
- It is difficult to handle exceptions, especially when dealing with existing APIs that throw checked exceptions.
- One of my 'Java friends' was not ready to believe that some of the fluent samples I showed him were actually Java code :-), so there is also a negative impact on readability.
- More code has to be written to make code fluent.
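
To make the null-in-the-chain problem concrete, here is a small sketch. Registry, Account, find() and require() are hypothetical names used only for this illustration. When one link returns null, the NullPointerException surfaces on the following call and says nothing about which link failed; one possible mitigation is to fail fast inside the fluent method with a descriptive message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class Account {
    private final String owner;

    Account(String owner) {
        this.owner = owner;
    }

    String owner() {
        return owner;
    }
}

final class Registry {
    private final Map<String, Account> accounts = new HashMap<>();

    Registry add(String id, String owner) {
        accounts.put(id, new Account(owner));
        return this;
    }

    // Plain lookup: returns null for unknown ids, so a chained call fails later.
    Account find(String id) {
        return accounts.get(id);
    }

    // Fail-fast variant: the error message names the link that actually failed.
    Account require(String id) {
        return Objects.requireNonNull(accounts.get(id), "no account with id " + id);
    }
}

class NullInChainDemo {
    public static void main(String[] args) {
        Registry registry = new Registry().add("a1", "praveenm");
        System.out.println(registry.require("a1").owner());
        try {
            registry.require("missing").owner();
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```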

Summary:
I strongly believe that DSLs, both internal & external, help in developing better software that is easy to learn, extend & use, and hard to misuse.

That's it. I hope I was able to sell fluent APIs to the new guys.

References:
Domain Specific Language Book by Martin Fowler
Domain Specific Language by Martin Fowler
DSL Boundary by Martin Fowler

Friday, October 17, 2008

I am currently working on building a fluent interface over SOAP toolkits. I have already discussed the value that rich APIs can add. One of the problems with fluent APIs is that we have to get rid of checked exceptions to keep them easy to use, but there are cases where fault containment is necessary. I think the better approach would be to generate an unchecked exception counterpart for each of the fault codes & so provide easier error handling.

For example (I have modified a sample shown by Martin Fowler), let us assume that the newOrder process talks to 2 other web services & that 2 SOAP fault errors are possible.

import static test.Customer.newOrder;

private void orderNew() {
    newOrder()
        .with(6, "TAL")
        .with(5, "HPK").skippable()
        .withDollar(35.d).withAccountId(23423L).withLicense("AS900980")
        .with(3, "LGV")
        .priorityRush();
}

@Test
public void test() {
    try {
        orderNew();
    } catch (GenericException exp) {
        switch (exp.getErrorCode()) {
            case SOAPFaultsCodes.E1001:
                // handle the exception, possibly giving a more useful message or some other recovery action
                break;
            case SOAPFaultsCodes.E7003:
                // handle the second fault
                break;
            default:
                throw new RuntimeException("Unhandled Error", exp);
        }
    }
}

Here again, GenericException is an abstract exception extending RuntimeException. It is implemented by all the SOAP fault exceptions corresponding to the error codes & is thrown by the rich APIs.
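
The real GenericException is not shown in this post, so the following is only a sketch of the shape it might take. The constructor, the getErrorCode() signature, the callService() stand-in and the fault messages are all assumptions made for illustration.

```java
// Assumed shape of the abstract base class; the real one is not listed here.
abstract class GenericException extends RuntimeException {
    GenericException(String message) {
        super(message);
    }

    abstract int getErrorCode();
}

// One concrete, unchecked exception per SOAP fault code (normally generated).
final class E1001Exception extends GenericException {
    E1001Exception(String message) {
        super(message);
    }

    @Override
    int getErrorCode() {
        return 1001;
    }
}

final class E7003Exception extends GenericException {
    E7003Exception(String message) {
        super(message);
    }

    @Override
    int getErrorCode() {
        return 7003;
    }
}

class FaultHandlingDemo {
    // Hypothetical stand-in for a fluent call that fails with a SOAP fault.
    static void callService(int faultCode) {
        if (faultCode == 1001) throw new E1001Exception("illustrative fault 1001");
        if (faultCode == 7003) throw new E7003Exception("illustrative fault 7003");
    }

    // No 'throws' clause is needed, yet each fault can still be contained.
    static String handle(int faultCode) {
        try {
            callService(faultCode);
            return "ok";
        } catch (GenericException e) {
            switch (e.getErrorCode()) {
                case 1001:
                    return "recovered: " + e.getMessage();
                case 7003:
                    return "recovered: " + e.getMessage();
                default:
                    throw e;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(0));
        System.out.println(handle(1001));
    }
}
```

Because every fault type shares one unchecked base class, the fluent chain stays free of throws clauses while callers can still catch either the base type or one specific fault.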

I guess with JDK 7 we can have a String-based switch() & an upgraded catch block.
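
For what it's worth, both features did land in Java 7 (strings in switch and multi-catch). Here is a minimal sketch of each; the class name, method names and fault descriptions are made up for illustration.

```java
class Jdk7FeaturesDemo {
    // String-based switch (Java 7 and later).
    static String describe(String faultCode) {
        switch (faultCode) {
            case "E1001":
                return "first fault";
            case "E7003":
                return "second fault";
            default:
                return "unknown fault";
        }
    }

    // Multi-catch (Java 7 and later): one handler for several exception types.
    static String parse(String text) {
        try {
            return String.valueOf(Integer.parseInt(text.trim()));
        } catch (NumberFormatException | NullPointerException e) {
            return "invalid";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("E1001"));
        System.out.println(parse(" 42 "));
        System.out.println(parse(null));
    }
}
```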

Well, for unit testing exceptions we can leverage the annotation support in JUnit 4 to assert on them:
@Test(expected = E7069Exception.class)

I wanted to generate all these exceptions automatically from the SOAP faults document. For that I wrote this Groovy script (showcasing the usage of multiline strings, closures & file I/O).

constantsFile=""
converFileLineIntoException = {
sarray = it.split(" ")
exception = sarray[0]+"Exception"
errorCode = exception.substring(1,sarray[0].length())
constantsFile+="\n public static final int E$errorCode = $errorCode;"
className =
"""package com.yahoo.sm.ws.builders.exception;
// Generated code
public class $exception extends GenericException {

private String description;
private String shortDescription;

public $exception(String description, String shortDescription){
this.description=description;
this.shortDescription=shortDescription;
}

public int errorCode(){
return $errorCode;
}

public String getDescription(){
return this.description;
}

public String shortDescription(){
return this.shortDescription;
}
}
"""
new File(sarray[0]+"Exception.java").write(className)
}

def lines = new File("soapFaultList.txt").eachLine(converFileLineIntoException)
soapFaults = """
package com.yahoo.sm.ws.builders.exception;
public interface SOAPFaultsCodes {
$constantsFile
}"""
new File("SOAPFaultsCodes.java").write(soapFaults)
println "--- Gr8 I am done "

Groovy, JUnit4 & static imports just rock :-)

Sunday, October 12, 2008

Dependency Injection with Guice, JUnit 4 & Get-Set problem:
I was evaluating Guice and JUnit 4, and while trying them out I wrote a sample application that uses both. I also tried out a possible solution to the data conversion problem.
In a typical web application we also make heavy use of get/set calls to convert from the UI representation of an object to the back end implementation object. Many a time (in my experience, most of the time) these are heavy parallel structures required either by the framework (classic Struts, for example, forces the UI object to extend ActionFormBean & supports only limited data types) or by the data type representation in the different UI models. With the advent of domain driven design we have richer domain objects, which usually contain a lot of other data/behaviour that should not or need not be exposed to the UI layer.
The problem with this conversion code is that it not only looks dumb and consumes a lot of source lines, it is also error-prone, & since there is no contract with the back end we are forced to unit test it because we cannot rely on compile time checks. I tried a possible solution for this 3 years back in a classic Struts based web application & was successful in reducing the number of bugs.

The sample application is a user management system. (Please note that the code has been written to show the usage of Guice, JUnit & type safe data conversion & shouldn't be mistaken for a real-world design; there are no exceptions, validation etc...) It follows the typical MVC pattern used in web applications.

Model Layer

User.java represents a User domain object. As we can see, it also contains information that is populated from the context (logged in user, date, etc..., usually through HttpSession or EJBContext) in the form of an AuditInfo object. Here the fullName field (which doesn't make sense to the UI) holds firstName & lastName as 'firstName,lastName' in a single field. This is done to show how easily the back end object can be refactored to accommodate client requirements. The same can even be applied to data types. Since extracting an interface from a class is supported by all the IDEs, there is no coding effort required here.

\src\model\User.java


package model;

import ui.IUser;

/**
 * Domain object for a user. It implements IUser so that the UI layer can work with it directly.
 * @author praveenm
 */
public class User implements IUser, java.io.Serializable {

    private static final String DELIMETER = ",";

    private String userName;
    private String eMail;
    // Named 'experience' (not 'age') so that the class satisfies the IUser contract
    private int experience;
    // Stored as "firstName,lastName"; starts as "," so that the getters never see null
    private String fullName = DELIMETER;

    private AuditInfo auditInfo = new AuditInfo();

    public User() {
    }

    public String getUserName() {
        return userName;
    }

    public void setUserName(String userName) {
        this.userName = userName;
    }

    public String getEMail() {
        return eMail;
    }

    public void setEMail(String eMail) {
        this.eMail = eMail;
    }

    public int getExperience() {
        return experience;
    }

    public void setExperience(int experience) {
        this.experience = experience;
    }

    public String getFirstName() {
        // A limit of -1 keeps trailing empty strings, so both parts always exist
        return fullName.split(DELIMETER, -1)[0];
    }

    public void setFirstName(String firstName) {
        fullName = firstName + DELIMETER + getLastName();
    }

    public String getLastName() {
        return fullName.split(DELIMETER, -1)[1];
    }

    public void setLastName(String lastName) {
        fullName = getFirstName() + DELIMETER + lastName;
    }

    public void setFullName(String str) {
        fullName = str;
    }

    public AuditInfo getAuditInfo() {
        return auditInfo;
    }

    public void setAuditInfo(AuditInfo auditInfo) {
        this.auditInfo = auditInfo;
    }
}

IUserService - Exposing the functionality of user management service

\src\model\IUserService.java


package model;

import ui.IUser;
import java.util.List;

/**
 * A simple service representing the general user management activities.
 * @author praveenm
 */
public interface IUserService {

    User get(String userName);
    boolean saveOrUpdate(User user);
    boolean delete(String userName);

    boolean saveOrUpdate(IUser user, AuditInfo info);
    List<IUser> getUsers();
}

UserServiceMockImpl.java - Implements the service by saving the content in a file using object serialization.

\src\model\AuditInfo.java


package model;

import java.io.Serializable;
import java.util.Date;

/**
 * Audit data that is normally filled in from the request context.
 * @author praveenm
 */
public class AuditInfo implements Serializable {
    private String updatedBy;
    private String createdBy;
    private Date updatedDate;
    private Date createdDate;
    private boolean isAdmin;

    public String getUpdatedBy() {
        return updatedBy;
    }

    public void setUpdatedBy(String updatedBy) {
        this.updatedBy = updatedBy;
    }

    public String getCreatedBy() {
        return createdBy;
    }

    public void setCreatedBy(String createdBy) {
        this.createdBy = createdBy;
    }

    public Date getUpdatedDate() {
        return updatedDate;
    }

    public void setUpdatedDate(Date updatedDate) {
        this.updatedDate = updatedDate;
    }

    public Date getCreatedDate() {
        return createdDate;
    }

    public void setCreatedDate(Date createdDate) {
        this.createdDate = createdDate;
    }

    @Override
    public String toString() {
        return "updatedBy=[" + updatedBy + "] " +
                "createdBy=[" + createdBy + "] ";
    }

    public boolean isIsAdmin() {
        return isAdmin;
    }

    // Protected: only subclasses (such as the test doubles) may grant admin rights
    protected void setIsAdmin(boolean isAdmin) {
        this.isAdmin = isAdmin;
    }
}

UI Layer

IUser.java - Now this is the trick: IUser declares what the UI layer requires (which is always a subset of the back end object) & is implemented by the UI model object (and, as seen above, by the back end User).


package ui;

/**
 * The contract shared by the UI bean and the back end object.
 * @author praveenm
 */
public interface IUser extends java.io.Serializable {

    int getExperience();
    String getEMail();
    String getFirstName();
    String getLastName();
    String getUserName();
    void setExperience(int experience);
    void setEMail(String eMail);
    void setFirstName(String firstName);
    void setLastName(String lastName);
    void setUserName(String userName);
}

UserForm.java - A UI bean honouring the JavaBeans spec & representing the HTML form on the screen; it can extend classes like ActionFormBean.

\src\ui\UserForm.java


package ui;

import java.io.Serializable;

/**
 * Plain bean backing the user form on the screen.
 * @author praveenm
 */
public class UserForm implements IUser, Serializable {
    private String userName;
    private String eMail;
    private int experience;
    private String firstName;
    private String lastName;

    public String getUserName() {
        return userName;
    }

    public void setUserName(String userName) {
        this.userName = userName;
    }

    public String getEMail() {
        return eMail;
    }

    public void setEMail(String eMail) {
        this.eMail = eMail;
    }

    public int getExperience() {
        return experience;
    }

    public void setExperience(int experience) {
        this.experience = experience;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }
}

controller

UserController.java - Controls the logic flow b/w the UI & the back end. This is the place where we introduce dependency injection. In a typical web application the audit information is captured in the HttpSession object. Since in the test environment we will not have references to HttpRequest, HttpResponse etc..., we inject that data with Guice. We also inject the service implementation, which can then be changed without disturbing the other layers.

\src\controller\UserController.java


package controller;

import ui.IUser;
import com.google.inject.Inject;
import java.util.List;
import model.AuditInfo;
import model.IUserService;

/**
 * Sits between the UI and the user service; both collaborators are injected by Guice.
 * @author praveenm
 */
public class UserController {

    private final IUserService service;
    private AuditInfo info;

    // Constructor injection for the mandatory service
    @Inject
    public UserController(IUserService _service) {
        service = _service;
        System.out.println("UserController Instantiated ");
    }

    // Setter (method) injection for the audit data
    @Inject
    public void setAuditInfo(AuditInfo info) {
        this.info = info;
    }

    public boolean save(final IUser user) {
        return service.saveOrUpdate(user, info);
    }

    public IUser get(String userName) {
        return service.get(userName);
    }

    public boolean delete(String userName) {
        if (!info.isIsAdmin()) {
            throw new IllegalAccessError("You don't have the permission to delete");
        }
        System.out.println("users" + service.getUsers());
        return service.delete(userName);
    }

    public List<IUser> users() {
        return service.getUsers();
    }
}

Unit test layer

MockBinder.java - As I am obsessed with fluent interface, generics & type safety, it's really enjoyable to wire up dependencies using Guice APIs rather than through XML.

\test\MockBinder.java


package test;

import com.google.inject.AbstractModule;
import java.util.Date;
import model.AuditInfo;
import model.IUserService;
import model.UserServiceMockImpl;

/**
 * Guice module that wires the mock service and a dummy AuditInfo for the tests.
 * @author praveenm
 */
public class MockBinder extends AbstractModule {

    final boolean isAdmin;

    public MockBinder() {
        isAdmin = false;
    }

    public MockBinder(boolean isAdmin) {
        this.isAdmin = isAdmin;
    }

    @Override
    protected void configure() {
        binder().bind(IUserService.class).to(UserServiceMockImpl.class).asEagerSingleton();
        if (!isAdmin) {
            binder().bind(AuditInfo.class).to(DummyAuditInfo2.class);
        } else {
            binder().bind(AuditInfo.class).to(DummyAuditInfo.class);
        }
    }
}

// Audit data for an admin user
class DummyAuditInfo extends AuditInfo {

    public DummyAuditInfo() {
        super.setIsAdmin(true);
        super.setCreatedBy("Praveen M");
        super.setCreatedDate(new Date());
        super.setUpdatedBy("Praveen");
        super.setUpdatedDate(new Date());
    }
}

// Audit data for a non-admin user
class DummyAuditInfo2 extends AuditInfo {

    public DummyAuditInfo2() {
        super.setIsAdmin(false);
        super.setCreatedBy("someone");
        super.setCreatedDate(new Date());
        super.setUpdatedBy("someone");
        super.setUpdatedDate(new Date());
    }
}

UserTest.java - & now, finally, we have the unit testing code: a JUnit4 test case showcasing the usage of the application.

test\UserTest.java


package test;

import com.google.inject.Guice;
import controller.UserController;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;
import ui.IUser;
import ui.UserForm;

/**
 * The power & simplicity of JUnit4 test cases.
 * @author praveenm
 */
public class UserTest {

    private static UserController controller;

    @BeforeClass
    public static void bootStrap() {
        // Admin bindings, so that delete() is permitted
        controller = Guice.createInjector(new MockBinder(true)).getInstance(UserController.class);
    }

    @Test
    public void save() {
        IUser form = new UserForm();
        form.setExperience(9);
        form.setEMail("praveen.manvi@yahoo.com");
        form.setFirstName("praveen");
        form.setUserName("pmanvi");
        form.setLastName("Manvi");
        Assert.assertTrue(controller.save(form));

        form.setExperience(10);
        form.setEMail("praveen.manvi@yahoo.com");
        form.setFirstName("praveen");
        form.setUserName("pmanvi123");
        form.setLastName("m");
        Assert.assertTrue(controller.save(form));
    }

    @Test
    public void get() {
        IUser user = controller.get("pmanvi");
        // JUnit does not guarantee test order, so the user may not have been saved yet
        if (user != null) {
            Assert.assertEquals("praveen", user.getFirstName());
        }
    }

    @Test
    public void delete() {
        // if(controller.get("pmanvi123")!=null)
        Assert.assertTrue(controller.delete("pmanvi123"));
    }

    @Test(expected = IllegalAccessError.class)
    public void unauthorizedDelete() {
        // Local, non-admin controller so that the shared admin controller stays untouched
        UserController nonAdmin = Guice.createInjector(new MockBinder(false)).getInstance(UserController.class);
        nonAdmin.delete("pmanvi123");
    }
}

------- Final Summary-------

Guice simplifies testing a lot by doing everything in Java. If you just want dependency injection, Guice beats all other frameworks (Spring...) hands down in ease of use.

JUnit4 is much easier to use with annotations; we no longer have to extend any class or name our methods test...(). I really liked the way we can test exceptions.

By adding an interface shared by both the back end & UI objects we can get rid of the get/set noise.
Although it makes (or forces) the back end to be aware of the UI & other kinds of clients, in some cases (like when we use ORMs instead of JDBCTemplate) the one-time conversion logic shifts from the UI to the back end & it might put more development load on the persistence layer team. My guess is that these objections can be ignored because of the benefit of cleaner & less error-prone code. In a distributed environment all the extra fields can be set to null to reduce the payload, but anyway the number of rows matters much more than the number of columns in deciding the data size.
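
The idea can be boiled down to the following sketch. It is a trimmed-down, self-contained variant of the classes above with only two properties, and InMemoryUserService is a hypothetical stand-in for the real service, so treat it as an illustration rather than the application's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Trimmed-down contract shared by the UI bean and the back end.
interface IUser {
    String getUserName();
    String getEMail();
}

// UI side: a plain bean that happens to satisfy the contract.
class UserForm implements IUser {
    private final String userName;
    private final String eMail;

    UserForm(String userName, String eMail) {
        this.userName = userName;
        this.eMail = eMail;
    }

    public String getUserName() {
        return userName;
    }

    public String getEMail() {
        return eMail;
    }
}

// Back end side: hypothetical service that depends only on the interface.
class InMemoryUserService {
    private final List<String> saved = new ArrayList<>();

    boolean saveOrUpdate(IUser user) {
        // No UserForm-to-User copying here; the compiler enforces the contract.
        return saved.add(user.getUserName() + " <" + user.getEMail() + ">");
    }

    List<String> saved() {
        return saved;
    }
}

class SharedInterfaceDemo {
    public static void main(String[] args) {
        InMemoryUserService service = new InMemoryUserService();
        service.saveOrUpdate(new UserForm("pmanvi", "praveen.manvi@yahoo.com"));
        System.out.println(service.saved());
    }
}
```

If a property is renamed or removed in IUser, both sides stop compiling, which is exactly the compile time check that hand-written copy code lacks.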

I hope this sample application helps newcomers to understand/appreciate the value of the Guice & JUnit4 libraries.

References:
Guice Dependency Injection Framework from Google

JUnit 4 - with Java5 features & is quite different from older versions.