Friday, April 24, 2009

Search Domain Basics: Search is probably the most pervasive technology domain of this century. Here I have tried to cover some basic concepts & some software implementation details, with Java as the focus. I hope this information helps newcomers to this domain & provides enough starting pointers to dig into more detail.

"Content is King" - Content is what that drives the web & "search" is the engine for that. All the services providing the content has to employ search techniques to fetch the required information with minimum possible user inputs in fastest possible way. Google/Yahoo which have become synonyms with search is implementing all the applications in relation with search one way or the other. Apparently unstructured content forms the major portion of the content available in the universe. Unstructured data (or unstructured information) refers to (usually) computerized information that either does not have a data model or has one that is not easily usable by a computer program or in simple words any data that is not represented in terms of column names in RDBMS table schema. Parallel computing, Data sharding, Schema definition, Scale of data & nature of the content acquisition related with un-structured content makes it unsuitable to be solved from 100% RDBMS solution, although full text indexing does exist in the RDBMS world.

Raw data with context is called 'information', & information search & retrieval is all about locating relevant material from a collection of raw data as fast as possible while taking the minimum possible input from the user. The ability to aid & assist a user in finding relevant information is the primary goal of information engineers & information retrieval (IR) libraries.
Major parts of a search engine (a rough sketch of these as Java interfaces follows the list):
fetching/loading – downloading the content (lists of pages) that has been referenced.
analysis – analyzing the database to assign a priority score to each page (e.g. PageRank) & to prioritize fetching.
indexing – combining content from the fetcher, incoming information from the built-in data sources & link-analysis scores into a data structure that can be queried quickly, usually with the help of cache services.
searching – returning the set of content, ranking pages against a query using the index.
database – keeping track of the fetched documents along with various context information that helps improve the ranking.
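
Just to make the decomposition concrete, here is a rough sketch of these five parts as Java interfaces. The names (Fetcher, LinkAnalyzer, Indexer, Searcher, DocumentStore) are illustrative & not taken from any particular library.

import java.util.List;

// Hypothetical interfaces for the five parts listed above.
interface Fetcher {
    byte[] fetch(String url) throws java.io.IOException; // download a referenced page
}
interface LinkAnalyzer {
    double priorityScore(String url); // e.g. a PageRank-style score used to prioritize fetching
}
interface Indexer {
    void index(String url, String content, double score); // fold content & link scores into the index
}
interface Searcher {
    List<String> search(String query, int topN); // rank pages against a query using the index
}
interface DocumentStore {
    void record(String url, byte[] raw); // track fetched documents & their context information
}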
To scale to billions of documents, all of these parts must be distributable, i.e., each must be able to run in parallel on multiple machines. Scaling up should happen simply by throwing more hardware into the pool, without massive reconfiguration. Since we cannot afford to have the failure of any single component cause a major hiccup, things should also largely fix themselves without human intervention. This is only practical with a stateless implementation of the software services.

Search engine with Java for unstructured content: Studying an IR library's APIs gives better insight into what services a search engine is capable of providing, & I am taking Lucene as the sample for that.
The Lucene library has become the de facto IR library in the Java world; by now Lucene has been ported to almost all the major languages, showcasing the popularity & capability of this small library.
Lucene is a search & retrieval library providing keyword & fielded search. It can use Boolean AND, OR & NOT to formulate complex queries & can use fuzzy matching, which is useful when searching text created by optical character recognition. Unstructured content far exceeds structured content in the web world; Lucene mainly deals with unstructured content, but it can effectively search structured content as well through field tags.

Lucene provides the minimum required information-retrieval functionality. We can call it a "search kernel" library that provides full-text search & indexing functionality. Instead of an out-of-the-box application, Lucene offers a usable API for programmers & operates at a lower level. There are off-the-shelf libraries (Compass, Nutch, Solr...) providing monitoring & transaction utilities over Lucene, & there are commercial enterprise-search offerings from vendors such as Autonomy, Google, Oracle & FAST (MS).

Lucene does not search file by file. The search space is analysed first & translated into a normalised representation: the index. Lucene uses an inverted index; all words in the index are unique, which means the index is a compressed representation of the search space. Lucene itself only indexes plain text, but a variety of free open-source document parsers are available for document types such as RTF, PDF, HTML, XML, Word etc. Depending on the nature of the text content, various analysers are on offer; for example, text can be analysed with a whitespace analyzer, which breaks the text into tokens separated by whitespace. To keep response times short, the processes of generating & optimising the index are kept separate. The index gets normalized by applying stemming & lemmatisation algorithms.

Lucene beats the RDBMS at full-text search in terms of processing speed, a manageable index footprint (now about 25 percent of the source documents' size), easy incremental updates, support for index partitions, price & flexibility (index methodology, deployment options & schema evolution). The RDBMS way of searching with a WHERE clause & LIKE '%...%' is neither scalable nor efficient; although RDBMSs like Oracle include full-text indexing capabilities, they have not been as popular as independent solutions such as Lucene, & it is also not easy to implement parallel processing (map/reduce) involving multiple machines & terabytes of data. It is the dynamic, context-understanding nature of search that is bringing huge changes to the search domain, with navigating a hierarchy becoming old-fashioned. With the technical problems in this domain pretty much solved, the key thing to remember now is that "search methodology is much more important than the underlying technology".
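
To make this concrete, here is a minimal indexing & searching sketch against the Lucene 2.4-era API. The in-memory RAMDirectory, the field names & the sample text are my own illustrative choices, not a definitive recipe.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index, for illustration only

        // Indexing: analyse the text & add it to the inverted index.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Lucene overview",
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("content", "Lucene is a full-text search & indexing library",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.optimize(); // index generation & optimisation are separate steps
        writer.close();

        // Searching: parse a query & rank documents against the index.
        IndexSearcher searcher = new IndexSearcher(dir);
        Query query = new QueryParser("content", new StandardAnalyzer())
                .parse("search AND library");
        TopDocs hits = searcher.search(query, null, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("title"));
        }
        searcher.close();
    }
}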

Search terminology:
Proximity search: A search where users can specify that the documents returned should have the words near each other.
Concept search: A search for documents conceptually related to a word, rather than specifically containing the word itself. Often involves parallel computing.
Boolean search: A search allowing the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR.
Stemming: The ability for a search to include the "stem" of words. For example, stemming allows a user to enter "running" and get back results also for the stem word "run."
Lemmatisation: The process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatisation is closely related to stemming; the difference is that a stemmer operates on a single word without knowledge of the context, & therefore cannot discriminate between words which have different meanings depending on part of speech.
Noise or stop words: Conjunctions, prepositions, articles & other words such as AND, TO and A that appear often in documents yet alone may contain little meaning.
Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves don't appear in documents.
Index: Normalized representation of the words in the search space.
Semantic search: A process used to improve online searching by using data from semantic networks to disambiguate queries & web text in order to generate more relevant results.
Web search: Content is public & generic. Uses keywords & links (relevancy), based on some kind of historic traffic.
Enterprise search: Also covers private, domain-specific documents; the content served should be of the highest quality, not necessarily the most popular. Information/metadata needs to be secure, with role-based access to the content. It has to support security (realms, roles), SLAs & many other requirements. The public Google & Yahoo web-search engines do not provide enterprise search.
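
Several of the terms above map directly onto Lucene's query syntax. The following sketch (again against the Lucene 2.4-era QueryParser; the field name "content" is an assumption) shows Boolean, proximity & fuzzy queries being parsed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QuerySyntaxDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());

        // Boolean search: inclusion/exclusion with AND, OR & NOT
        Query booleanQ = parser.parse("lucene AND (search OR retrieval) NOT database");

        // Proximity search: "w1 w2"~N requires the words within N positions of each other
        Query proximityQ = parser.parse("\"information retrieval\"~5");

        // Fuzzy matching: a trailing ~ tolerates small spelling differences (handy for OCR'd text)
        Query fuzzyQ = parser.parse("serach~");

        System.out.println(booleanQ);
        System.out.println(proximityQ);
        System.out.println(fuzzyQ);
    }
}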

As of now I am interested in researching topics in the search & retrieval field, so I will keep updating this blog with my research findings.
"Automatic annotation/Summary addition for content":Lengthy documents/text are boring to read. It will be great if someone or computer can automatically creates the gist of the content. Automatic creation of annotation is tough task especially for non-domain specific topics but can be predictable in domain specific cases. For example it might be easier to extract the information from judgments copy automatically (at least in routine cases that hardly requires special knowledge from legal experts to annotate) or through workflow for review with automatically annotating the content/documents. I see this as an interesting area.

Summary:
The IR & search domain is a pretty complex subject requiring mastery of algorithms & data structures. I hope I have been able to assimilate the information related to search, taking Lucene as the sample search-engine library.

References:

Lucene in Action (Manning): http://www.manning.com/hatcher3/

Powered by Lucene: http://wiki.apache.org/lucene-java/PoweredBy/

Google Search: http://en.wikipedia.org/wiki/Google_search

Useful wrapper libraries over Lucene:

Solr features: http://lucene.apache.org/solr/features.html

Compass features: http://www.compass-project.org/overview.html

Friday, April 17, 2009

Some statistics about the Java source code of popular open-source libraries. Currently I am studying a big system that has multiple millions of lines of source code. Just for fun, I wrote a simple utility to extract information about Java source code, & I ran it on the "src" directories of many open-source libraries as well. The utility takes a source-code directory as input & traverses all the Java code recursively inside that directory, collecting the total number of active lines of code excluding comments, the package count...
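
The original utility is not posted here, so below is a minimal reconstruction of what such a counter could look like; the class name SourceStats & the simple comment-skipping heuristic are my own assumptions.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SourceStats {
    private static int lines, files;
    private static Set<String> packages = new HashSet<String>();

    public static void main(String[] args) throws IOException {
        walk(new File(args[0])); // e.g. java SourceStats /path/to/src
        System.out.println("Total # Lines = " + lines);
        System.out.println("Total # of Files = " + files);
        System.out.println("Avg # Lines per file = " + (files == 0 ? 0 : lines / files));
        System.out.println("Total # of packages = " + packages.size());
    }

    // Recurse into directories; treat every directory containing a .java file as a package.
    private static void walk(File f) throws IOException {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children == null) return;
            for (File child : children) walk(child);
        } else if (f.getName().endsWith(".java")) {
            files++;
            packages.add(f.getParent());
            countActiveLines(f);
        }
    }

    // Count non-blank lines, skipping // line comments & /* ... */ block comments.
    private static void countActiveLines(File f) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(f));
        try {
            boolean inBlockComment = false;
            for (String line; (line = in.readLine()) != null; ) {
                String t = line.trim();
                if (inBlockComment) {
                    if (t.contains("*/")) inBlockComment = false;
                    continue;
                }
                if (t.length() == 0 || t.startsWith("//")) continue;
                if (t.startsWith("/*")) {
                    if (!t.contains("*/")) inBlockComment = true;
                    continue;
                }
                lines++;
            }
        } finally {
            in.close();
        }
    }
}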

& here are the results.

********* JDK1.5 ********
Total # Lines = 850918
Total # of Files = 6556
Avg # Lines per file = 129
Total # of packages = 368
******** Hadoop 0.18.3 ******
Total # Lines = 129742
Total # of Files = 926
Avg # Lines per file = 140
Total # of packages = 66
******* Lucene 2.4.1 *********
Total # Lines = 69606
Total # of Files = 528
Avg # Lines per file = 131
Total # of packages = 16
******** Struts *********
Total # Lines = 63707
Total # of Files = 1040
Avg # Lines per file = 61
Total # of packages = 120
********* iText 2.1.5 ***********
Total # Lines = 96830
Total # of Files = 544
Avg # Lines per file = 177
Total # of packages = 55
******* Tapestry 5.1.0.3 ******
Total # Lines = 96395
Total # of Files = 1937
Avg # Lines per file = 49
Total # of packages = 103
It's not at all surprising that Tapestry, written by one of the most prolific Java coders ("Howard"), comes out best in the Java world when it comes to modularity.
********* iBatis *********
Total # Lines = 14132
Total # of Files = 202
Avg # Lines per file = 69
Total # of packages = 45
***** Hibernate 3.3.1.GA *******
Total # Lines = 173698
Total # of Files = 2102
Avg # Lines per file = 82
Total # of packages = 292

I guess these figures can be considered a baseline while reviewing the modularity of any library. Do let me know if anyone is interested in the code; it also has an ANT target to analyze the source code between releases.