Thursday, January 17, 2013

Faceted Navigation for mobile applications and NoSQL


Recently I developed an Android-based application as part of a proof-of-concept initiative. It was a search application built on pre-defined static data that ships along with the application. Using Lucene with Android was pretty simple and straightforward. Having worked in the information retrieval space for the past 4 years, and having worked on one of the best known Swing-based UI applications (sitebuilder.yahoo.com), it did not take long to come up with a decent working application. One of the challenges was moving data stored in a database into Lucene, so as to make it full-text searchable and to build facets.
Two points were a revelation for me after this exercise: the power of facet-based navigation in building user interfaces, and the viability of Lucene as a data store and NoSQL solution.

Interfaces for mobile (or web, for that matter) that require keyword text entry aren't suited for browsing. Facet-based navigation is very effective when the exact keyword is not known. Amazon is a pioneer in this space, and their web site is a great example of faceted navigation; their cross-sell and up-sell features rely directly on faceting. Keyword entry is tedious, and using hierarchical faceted metadata for content selection results in an awesome experience, especially for a mobile application. With the ever-increasing processing and storage power of mobile devices (or desktop clients, for that matter), we should use them effectively to give a better user experience. Say someone is looking for a job with keywords such as "java architect technologist": giving auto-completion with total result counts (like java(200), architect(10), location(20)...) improves the overall search experience. We should actively look into our existing web/mobile applications for places where faceting can be used effectively.

Lucene makes creating facets a breeze. The TaxonomyWriter and TaxonomyReader classes are well designed and help build categories/dictionaries easily. Existing applications can be refactored to use this feature rather than GROUP BY SQL queries. It is also pretty straightforward to extract categories from full text, not just from column-oriented data.
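To give a flavor of the API, here is a minimal sketch of indexing documents with categories and counting facets. Note that the exact class names in the facet module (FacetsConfig, FastTaxonomyFacetCounts and friends) differ between Lucene versions, so treat this as illustrative rather than version-exact; the "skill" dimension and sample data are made up for the example.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.facet.FacetField;
    import org.apache.lucene.facet.Facets;
    import org.apache.lucene.facet.FacetsCollector;
    import org.apache.lucene.facet.FacetsConfig;
    import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
    import org.apache.lucene.facet.taxonomy.TaxonomyReader;
    import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
    import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class FacetSketch {
        public static void main(String[] args) throws Exception {
            Directory indexDir = new RAMDirectory();   // main index
            Directory taxoDir = new RAMDirectory();    // taxonomy (category) index
            FacetsConfig config = new FacetsConfig();

            // Index a few job postings, each tagged with a "skill" category.
            IndexWriter writer = new IndexWriter(indexDir,
                    new IndexWriterConfig(new StandardAnalyzer()));
            DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
            String[][] jobs = { {"java architect", "java"},
                                {"java developer", "java"},
                                {"solution architect", "architect"} };
            for (String[] job : jobs) {
                Document doc = new Document();
                doc.add(new TextField("title", job[0], Field.Store.YES));
                doc.add(new FacetField("skill", job[1]));
                writer.addDocument(config.build(taxoWriter, doc));
            }
            writer.close();
            taxoWriter.close();

            // Count facets: this is what drives "java(200), architect(10)" style UIs.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));
            TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
            FacetsCollector fc = new FacetsCollector();
            FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
            Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
            System.out.println(facets.getTopChildren(10, "skill"));
        }
    }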
Developing an Android application is a tedious job. Since not all of the Java APIs are supported, it is a painful exercise to migrate, and it is also greatly time-consuming to debug an Android application with the simulator, although we can appreciate the effort from the Eclipse plugin team. I used Groovy swing builders to develop a quick mock-up before doing it in Android. I love JVM scripting languages for that. One of the main reasons to learn them, even if we may not use them in our daily jobs, is the ability to come up with quick working prototypes. Here is the script I created in Groovy; it prints the code violations from each package level. It would be easy to build such things in GWT as well, so we can refactor all of these up front.
The problem the application solved was storing the data and querying any information through Lucene (both with the API and with Lucene query syntax) in sub-second time. It can deal with huge data and is schema-less, the main characteristics of NoSQL. Although Lucene is not generally counted as a NoSQL solution, it actually is one.
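To illustrate the schema-less point, here is a small sketch: two documents in the same index with completely different fields, queried with Lucene query syntax. The field names are made up for illustration, and constructor signatures vary slightly across Lucene versions.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class SchemalessSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));

            // Two "rows" with completely different fields -- no upfront schema.
            Document job = new Document();
            job.add(new TextField("title", "java architect", Field.Store.YES));
            writer.addDocument(job);

            Document person = new Document();
            person.add(new TextField("name", "jane", Field.Store.YES));
            person.add(new TextField("city", "bangalore", Field.Store.YES));
            writer.addDocument(person);
            writer.close();

            // Query with Lucene syntax; the API form (TermQuery etc.) works equally well.
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            Query q = new QueryParser("title", analyzer)
                    .parse("title:architect OR city:bangalore");
            TopDocs hits = searcher.search(q, 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }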
When someone asks me about NoSQL, I can safely say I have used one :) and am buzzword compliant. But this is a much bigger topic: recently I listened to a session by Martin Fowler where he argued very well for the need for both forms of data storage, SQL and NoSQL types, and used the term "Polyglot Persistence". The challenge for technologists is to identify the best contextually applicable use cases.
I started looking more into the subject, and here are my notes. Although I am not in a position to strongly recommend one over the other, readers will hopefully appreciate the value each (NoSQL versus SQL) brings to the table and can make a technically informed decision.

Schemalessness is one of the main reasons for interest in NoSQL databases, but we can produce schema-less designs in a relational database as well. There is a masterpiece discussion on asktom where "Generic Data Models" with dynamic capabilities (for example, a single generic table of entity/attribute/value rows that can hold anything) are regarded as an anti-pattern, defeating the strong typing and query performance of a relational database. I completely agree with that sentiment. These are the types of cases where we should generally be looking at a NoSQL solution instead.

Generally, the data is significantly more important than the applications that use it; content is king. The main reason is that data outlives the applications that create it, while the applications that use the data usually change over time. The schema and strong typing are very important and in the long run will help the organization. We should be doubly cautious before deciding to put data in non-SQL-based storage, and SQL is a safe option from a manageability and staffing perspective for utility projects.
After all, "misuse is the greatest cause of failure", and we need to understand the merits and demerits of each approach before dismissing it.

High performance from NoSQL is not free; it comes at a cost: high-performance look-ups at the price of no (or no easy) support for relational queries through "joins", or for reliability through "transactions", "consistency", and referential integrity, all of which SQL products inherently provide. NoSQL solutions need to bend too much to match an RDBMS on those fronts.
When data no longer fits on a single RDBMS server, or when a single machine can no longer handle the query load, some strategy for sharding and replication is required. Doing this with an RDBMS (rows and columns with indexes) is neither easy nor viable from a cost and scalability standpoint. The primary reason to integrate in-memory solutions with a NoSQL DB is to reduce the cost per GB of data. Systems that can run across multiple machines are going to be an absolute requirement, and NoSQL shines there.

Here is a quick overview of the options that we have.

Key-Value Stores  
USP   - Caches/Simple domain with fast read access
Use case   - Massively concurrent systems
Examples  - Redis/Memcached/Tokyo Cabinet/Coherence

Column-Family Stores
USP - Reading and writing on a big scale
Use case  - Massively concurrent systems
Examples  - Cassandra/Google BigTable/Apache HBase

Document-Oriented Databases
USP - Contents are documents with changing schema
Use case  - Dynamic Data Structure
Examples  - MongoDB/CouchDB

Graph Databases
USP  -  Interconnected data with nodes and relationships
Use case   - Social graphs with many-to-many relationships/Recommendation engines/Access Control Lists
Examples   - Neo4J/AllegroGraph/OrientDB

Irrespective of the data store chosen, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance (CAP). The rules change with the data and the context, but the rule below remains the same for improving performance.

"Look for opportunity to turn scan into a seek (cache) and use event based concurrency (asynchronous and parallel computation)"
  • Read Cache with In memory data grid solutions - Coherence, Hazelcast, CDN
  • Make computations asynchronus & parallel wherever possible. - JMS, Threads (event based)
  • Move computation and not data. - Map Reduce 
  • Events in efficient machine parseable forms for effecient serialization - protobuff
  • Route new data through new partitions
  • Avoiding shared mutable state is the secret weapon to winning concurrency battles
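To make the first two bullets concrete, here is a small sketch using only JDK concurrency utilities: a memoizing read cache turns repeated scans into seeks, and an ExecutorService fans the lookups out in parallel. The expensiveLookup method is a hypothetical stand-in for whatever slow scan your application performs.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SeekNotScan {
        // Read cache: the first request pays for the scan, later ones are a seek.
        private static final ConcurrentMap<String, String> CACHE =
                new ConcurrentHashMap<String, String>();

        static String cachedLookup(String key) {
            String value = CACHE.get(key);
            if (value == null) {
                value = expensiveLookup(key);       // the slow "scan"
                String prev = CACHE.putIfAbsent(key, value);
                if (prev != null) value = prev;     // another thread won the race
            }
            return value;
        }

        // Hypothetical placeholder for a slow database or index scan.
        static String expensiveLookup(String key) {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            return "value-of-" + key;
        }

        public static void main(String[] args) throws Exception {
            // Asynchronous and parallel: submit all lookups, then collect results,
            // instead of performing them one by one on the caller's thread.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<String> keys = Arrays.asList("java", "architect", "location", "java");
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String key : keys) {
                futures.add(pool.submit(new Callable<String>() {
                    public String call() { return cachedLookup(key); }
                }));
            }
            for (Future<String> f : futures) {
                System.out.println(f.get());
            }
            pool.shutdown();
        }
    }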

I got interested in graph databases, especially Neo4j and the problem it tries to solve. The Spring Data library samples were interesting. I highly recommend trying out Spring Data; the SpringSource folks are, as usual, awesome at creating abstractions that make developers' lives easier.


Here is an extract from the "Neo4j in Action" book that explains the use case and the technology behind it. A social graph is used as the sample: friends with many-to-many relationships. We can easily relate to the differentiation it brings.

To find all friends at depth 5, MySQL will perform a Cartesian product on the 't_user_friend' table 5 times, resulting in 50,000⁵ records, out of which all but 1,000 are discarded. Neo4j, on the other hand, will simply visit nodes in the database, and when there are no more nodes to visit, it will stop the traversal. That is why Neo4j can keep constant performance as long as the number of nodes returned remains the same, as opposed to the significant degradation in performance of the MySQL queries.
Each join creates a Cartesian product of all potential combinations of rows, and then filters out those that do not match the where clause. With one million users, the Cartesian product of 5 joins (the equivalent of a query at depth 5) contains a huge number of rows, billions of billions of billions, far too many zeros to be readable. Filtering out all the records that don't match the query is so expensive that the SQL query at depth 5 never finishes in a reasonable time.
The secret is in the data structure: the localized nature of graphs makes it very fast for this type of traversal. Imagine yourself cheering your team at a football stadium. If someone asks you how many people are sitting five meters around you, you will get up and count them. If the stadium is half empty, you will count the people around you as fast as you can count. If the stadium is packed, you will still do it in a similar time! Yes, it may be slightly slower, but only because you have to count more people because of the higher density. We can say that, irrespective of how many people are in the stadium, you will be able to count the people around you at a predictable speed; as you're only interested in people near you, you won't be worried about the packed seats at the other end of the stadium, for example.
This is exactly how the Neo4j engine works in our example: it counts nodes connected to the starting node, at a predictable speed. Even when the number of nodes in the whole graph increases (given similar node density), the performance can remain predictably fast. If you apply the same football analogy to relational database queries, we would count all the people in the stadium and then remove those that are not around us; not the most efficient strategy given the interconnectivity of the data.
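For a flavor of what such a traversal looks like in code, here is a hedged sketch against Neo4j's embedded Java traversal API. The method names are from the 2.x era and vary between versions, and the FRIEND relationship type is made up for this example, so treat it as a sketch rather than a definitive implementation.

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.traversal.Evaluators;
    import org.neo4j.graphdb.traversal.TraversalDescription;

    public class FriendsAtDepth {
        // Hypothetical relationship type for the social graph example.
        enum RelTypes implements RelationshipType { FRIEND }

        static int countFriendsAtDepth(GraphDatabaseService db, Node start, int depth) {
            int count = 0;
            try (Transaction tx = db.beginTx()) {
                // Visit only the nodes reachable from 'start'; the rest of the
                // graph is never touched -- the "count the people around you
                // in the stadium" behavior from the extract above.
                TraversalDescription friendsAtDepth = db.traversalDescription()
                        .breadthFirst()
                        .relationships(RelTypes.FRIEND, Direction.BOTH)
                        .evaluator(Evaluators.atDepth(depth));
                for (Node friend : friendsAtDepth.traverse(start).nodes()) {
                    count++;
                }
                tx.success();
            }
            return count;
        }
    }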



References:
Here is a pretty neat effort to support unit testing of NoSQL stores, and a great way to learn as well: https://github.com/lordofthejars/nosql-unit#readme
Identifying the use case: http://highscalability.com/blog/2011/6/20/35-use-cases-for-choosing-your-next-nosql-database.html



