The Server-Side Pad

by Fabien Tiburce, Best practices and personal experiences with enterprise software

Archive for the ‘Semantic Web’ Category

Relational Databases Under Fire

with 4 comments

There is a certain irony to this post.  It’s a bit like a car salesman trying to sell you a bicycle.  My career so far has largely revolved around relational databases.  That is slowly changing, however, as new storage mechanisms and models emerge and demonstrate that they are better suited to certain requirements.  I discuss a number of them here.

1. Distributed file systems.  DFS, out of the box, scale well beyond the capabilities of relational databases.  Hadoop provides an open-source distributed file system (HDFS) inspired by Google’s GFS paper.  Hadoop also implements MapReduce, a distributed computing layer on top of the file system (see the word-count sketch after this list).

2. Enterprise search servers.  The biggest eye-opener in recent years (which we implemented for a public library’s “social” catalogue) has to be Solr.  Solr is based on Lucene and also integrates with Hadoop.  Already in widespread use, this product is poised to gain further adoption as more organizations seek to expose their data (including social data) to the world through search.  The speed and features of Solr alone sell search servers better than I ever could and quite simply leave relational databases in the dust (a sample query follows this list).

3. RDF stores.  While relational databases are governed by an overarching schema and excel at one-to-many relationships, RDF stores are capable of storing disparate data and excel at many-to-many relationships.  Open-source products include Jena and Sesame (a small Jena sketch follows this list).  Unfortunately, at the present time, the performance of RDF stores falls well short of relational databases for one-to-many data (the most typical case in enterprise databases), making their widespread enterprise adoption a long shot.

4. Web databases, such as Google’s recently (and very quietly) announced Fusion Tables.  While functionally and programmatically limited compared to other stores, the Google product focuses on rapid correlation and visualization of data.  A product to watch.
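To make item 1 concrete, here is a minimal word-count sketch against Hadoop’s Java MapReduce API, the canonical introductory example rather than anything from a production system; the input and output paths are simply whatever HDFS directories you point it at.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper emits (word, 1) for every word on every line of the input files.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // The reducer sums the counts for each word; Hadoop handles the shuffle in between.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of text files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The appeal is that the same two small functions run unchanged whether the input is a few megabytes on one node or terabytes spread across a cluster; Hadoop takes care of the distribution, the shuffle and the failover.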
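For item 2, a sketch of how little it takes to query Solr over plain HTTP, assuming a local instance with the stock example configuration (port 8983, the /solr/select handler, JSON response writer enabled); the query and field name are placeholders for whatever your schema defines.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SolrQueryDemo {
  public static void main(String[] args) throws Exception {
    // Full-text query against Solr's standard select handler, asking for JSON back.
    String q = URLEncoder.encode("title:semantic", "UTF-8"); // field name depends on your schema
    URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&rows=10&wt=json");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // raw JSON: numFound plus the matching documents
    }
    in.close();
  }
}

Faceting, highlighting and spell-checking are exposed the same way, as extra parameters on the same request.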
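For item 3, a minimal Jena sketch of what “storing disparate data” looks like in practice: a handful of loose triples, with no table or schema declared up front. The URI and the person are invented for illustration; the VCARD vocabulary ships with Jena.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.VCARD;

public class RdfStoreDemo {
  public static void main(String[] args) {
    // An RDF store holds (subject, predicate, object) triples rather than rows in tables.
    Model model = ModelFactory.createDefaultModel();
    Resource person = model.createResource("http://example.org/people/jdoe");
    person.addProperty(VCARD.FN, "Jane Doe");
    person.addProperty(VCARD.N,
        model.createResource()
             .addProperty(VCARD.Given, "Jane")
             .addProperty(VCARD.Family, "Doe"));
    model.write(System.out, "N-TRIPLE"); // a few triples, no schema required up front
  }
}

Add a triple about the same URI from another source and it simply merges in; that is the many-to-many flexibility (and, today, the performance cost) described above.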

Seismic shift in data storage?  Not quite.  But an evolution is certainly under way.  Relational databases are in widespread use.  They are highly capable of storing data and data relationships, scale reasonably well and are, for the most part, economical.  Relational databases are not going away.  But the once-dominant technology is being challenged by other models that are more capable, more efficient and/or more economical at handling certain tasks.  By evaluating these technologies against your organization’s needs, you may find surprising answers, and surprising ROI.


Written by Compliantia

June 12, 2009 at 9:04 am

Semantic Technologies will Rise from the Limitations of Relational Databases and the Help of Distributed File Systems

with 4 comments

As an architect of large enterprise systems, I look to the Semantic Web with envy and anticipation.  And yet, the more I look into the potential of semantic technologies, the more I realize semantics are victims of the success of the very technologies they are trying to replace.  The semantic web is a network of global relations.  Semantic content is not bound by a single database schema; it represents globally linked data.  However, as an expert in database modelling and database-backed systems, I am forced to concede that, for the purposes of each enterprise, a relational database governed by rules (a schema) mostly internal to the organization and serving a certain functional purpose is often all that’s needed.  Semantics are, to a large extent, a solution in need of a problem.

And yet I am a strong believer in a semantic future, though not for reasons pertaining to semantics per se.  While actual numbers vary by database vendor, installation and infrastructure, relational databases are inherently limited in how much data they can store, query and aggregate efficiently.  Millions of rows, yes; billions, no.  The world’s largest web properties don’t use relational databases for primary storage, they use distributed file systems.  Inspired by Google’s GFS and MapReduce papers, Hadoop is a free, open-source distributed file system and distributed computing framework.  It currently supports clusters of around 2,000 nodes (servers) and, coupled with MapReduce, allows complete abstraction of hardware across a large array of servers, assured failover and distributed computing.  While 2,000 servers seems like a lot, even for a large enterprise, I am amazed how many enterprise clients and partners are dealing with ever-increasing datasets that challenge what relational databases were designed for.

Why does this matter?  When dealing with millions of files and billions of “facts” on a distributed file system, semantic technologies start making a lot of sense.  In fact, dealing with universally marked-up, loose content is precisely what semantic technologies were engineered to address.  And so I am hopeful.  Not that semantic technologies will prevail because of some inherent advantage, but because the future points to gigantic datasets of disparate origins, ill-suited conceptually and technically to be handled by relational databases.  It’s not that semantic technologies are better, it’s that they are better suited for the times ahead.

Written by Compliantia

June 3, 2009 at 10:04 pm

Fee-Based APIs Are Coming (It’s a Good Thing!)

leave a comment »

While Google has captured an overwhelming share of the search market by combining relevance with simplicity and speed, capitalizing on Google’s data to build business applications hasn’t been easy.  To this day, while you can buy a license for Google apps, maps and other offerings, the terms of use of the core search engine remain restrictive for B2B use.  In no uncertain terms, the terms of use state “The implementation of the Service on your Property must be generally accessible to users without charge and must not require a fee-based subscription or other fee-based restricted access”.    While this doesn’t rule out commercial ventures per se, it does rule out fee-based systems.  Ad-based systems are inappropriate for most B2B applications delivering the type of value-adding service that a corporate client typically expects to pay for, without ads and other distractions.

Why would a login-protected SaaS business application want to search Google?  The web is the largest collection of human knowledge ever assembled.  It’s also slowly being re-engineered, semantically, into a giant global database.  Thus opportunities abound for businesses to systematically mine the web and provide value-adding services on top of web-sourced data.  So why isn’t Google opening up its API to B2B use?  Google may be a search engine by function, but it’s an advertising company by revenue.  Google doesn’t make money crawling the web; its revenue is primarily generated by Sponsored Links.  Since ads don’t mesh well with API-sourced data (typically returned in a non-human-readable format such as XML or JSON), Google doesn’t have much to gain by giving it away.

This post would end on a rather pessimistic note if it weren’t for the wonders of competition.   Being a distant second in the search market, and no longer the centre of attention, Yahoo has been quietly but relentlessly pushing the envelope lately.  They supported microformats long before Google did.  They also announced fee-based use of their BOSS Search API starting this year.  This is great news for two reasons.  Firstly, the fee eliminates the restriction to ad-based systems.  Secondly, the fee comes with assurances: response time guarantees, continued investment and support, as well as no usage limits.
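As a sketch of what building on such an API looks like from a B2B back-end, here is a minimal Java client.  The endpoint below follows the general shape of Yahoo’s BOSS web search URLs as I recall them, but treat the URL, the parameters and the application ID as placeholders to be checked against the current BOSS documentation rather than as the documented contract.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BossSearchDemo {
  public static void main(String[] args) throws Exception {
    String appId = "YOUR_APP_ID";  // issued when you sign up (and pay) for the service
    String query = URLEncoder.encode("enterprise microformats", "UTF-8");
    // Placeholder endpoint in the general shape of the BOSS web search API.
    URL url = new URL("http://boss.yahooapis.com/ysearch/web/v1/" + query
        + "?appid=" + appId + "&format=json&count=10");

    StringBuilder body = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) {
      body.append(line);
    }
    in.close();

    // Crude illustration of "adding value": pull result titles out of the JSON payload
    // (a real application would use a proper JSON parser and do something smarter with them).
    Matcher m = Pattern.compile("\"title\":\"(.*?)\"").matcher(body);
    while (m.find()) {
      System.out.println(m.group(1));
    }
  }
}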

Search engines and semantics are increasingly the “glue” of the Internet, a global repository of information which is starting to look more and more like a database (albeit one with no overarching schema).  Fee-based APIs enable an ecosystem of value-adding niche B2B players to mine, transform and add value to web-sourced data.  I hope other web properties follow Yahoo’s lead and open their API, for a fee, to B2B use.

Written by Compliantia

May 28, 2009 at 9:32 pm

Helping Machines Read, A Simple Microformat Case Study

with 2 comments

I recently made Betterdot’s Contact Us page both human- and machine-readable by adding hCard microformat markup to the underlying XHTML.  This notion of “machine-readable” content is arguably abstract and somewhat obscure, however.  What do we mean?  What do machines see?  Perhaps a picture (or three) is worth the proverbial 1,000 words.

When a human reader, using a web browser, looks at the page, he or she sees this:

Contact page, as seen by human readers


Without semantic markups such as the hCard microformat markup, a machine (for example a Google bot crawling the Betterdot site for indexing) sees this:

Contact page as seen by machines (no microformat markup)


With semantic markups such as the hCard microformat markup, the same machine or bot sees this:

Contact page, as seen by machines with microformat markup


In layman’s terms, microformats help machines “read” data marked up with microformat tags on the page.  While “reading” falls short of true semantic “understanding”, microformats are certainly a step in the right direction.
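For readers curious about what changed under the hood, here is a minimal hCard sketch of the kind of markup involved.  The person, address and phone number are placeholders rather than Betterdot’s actual details, but the class names (vcard, fn, org, adr, tel, email) are the standard hCard hooks a crawler looks for.

<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Betterdot Systems</span>
  <div class="adr">
    <span class="street-address">123 Example Street</span>,
    <span class="locality">Toronto</span>, <span class="region">ON</span>
    <span class="postal-code">M5V 0A0</span>
  </div>
  <span class="tel">+1-416-555-0100</span>
  <a class="email" href="mailto:info@example.com">info@example.com</a>
</div>

To a human visitor the page looks exactly as before; the class attributes simply tell a parser which words are a name, which are an organization and which are a phone number.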

Written by Compliantia

May 19, 2009 at 10:12 pm

The Road to the Semantic Web is Paved with Microformats

with 4 comments

Google recently and quietly announced something huge: “rich snippets”.   Rich snippets are smart previews, displayed right on a search results page.   While Google has long relied on snippets to attach a bit of information to each link (thus letting the user know what he or she might expect on each page represented by a link), rich snippets go a step further: they extract key characteristics of the page, be it the rating in a review or a person’s contact information.    Google doesn’t have to guess it; it knows it.  Google’s rich snippets are powered by microformats and RDFa, two semantic standards that are rapidly gaining adoption.   Google’s implementation allows semantically marked-up web content (such as reviews and contact information) to be exposed, aggregated and averaged in a Google search results page.  In short, after years in the lab, the web is at last, albeit quietly, becoming semantic!

Microformats are not a substitute for the semantic web; they are a stepping stone, and a very important one.  They demonstrate the feasibility and value of adding semantic meaning to web page content.   They do so using existing browsers and standards.  They do so today, in the field, not in the lab.  By making web pages understandable to both humans (also known as readers…)  and machines, using current technologies, current browsers and minimal effort, microformats allow web content to be reliably understood and aggregated by search engines.   The future is bright.  Google could, for example, calculate an average review score for a book from a list of semantically compliant sites (see the hReview sketch below).  Google could also uniquely identify a user as a single human being across sites.   The semantic web, a web of meaning, is finally taking shape.
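As an illustration of the review case, here is a minimal hReview sketch.  The book, reviewer and rating are invented, and the class names follow the hReview draft (hreview, item, reviewer, dtreviewed, rating, description) that rich snippets and similar parsers key on.

<div class="hreview">
  <span class="item"><span class="fn">Example Book Title</span></span>
  reviewed by <span class="reviewer vcard"><span class="fn">Jane Doe</span></span>
  on <abbr class="dtreviewed" title="2009-05-15">May 15, 2009</abbr>:
  rated <span class="rating">4</span> out of 5.
  <blockquote class="description">A short, human-readable review goes here.</blockquote>
</div>

Aggregate a few thousand of these from different sites and an average rating per book falls out almost for free, which is exactly the kind of correlation described above.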

I am convinced the semantic web is going to change the way we publish content, exchange, correlate and aggregate information, both in the public domain and the enterprise.   It’s an exciting time for web professionals who can look forward to building companies and next generation systems that leverage semantic data.


In Toronto and interested in the semantic web?  Join us at the Toronto Semantic Web group on LinkedIn.

Written by Compliantia

May 15, 2009 at 5:14 pm

Evangelists, The Semantic Web Needs You!

with 2 comments



First, a confession.  What started as a curiosity has turned into a bit of an obsession…  Artificial intelligence, natural language processing, data interchange and global ontologies are all, directly or indirectly, facets of the semantic web.   There is enough in there to excite the geek in me for three lifetimes, and therein lies the problem… Let me take a step back.

In broad terms, the semantic web refers to a global web of unequivocal meaning that can be used and queried by machines, programs and, ultimately, user-facing applications.  In equally broad terms, this amounts to turning loose data (words on a page, with no meaning other than their proximity to other words, which can be counted, similarities inferred, etc.) into information (meaning, purpose and interoperability).  Microformats aside, words like ISBN or UPC on most web sites are just that: words.   They mean nothing, they are not tied to the same universal concept, and the words that precede or follow them (usually an actual ISBN or UPC code) are not linked to the same resource.  The web was built for people, not machines.  People scan a page and quickly understand the purpose of the page and the meaning of captions, buttons and other elements on it.   The semantic web, on the other hand, refers to making that collection (the web is the largest collection of human knowledge ever assembled) understandable to machines.

While user-generated tags and meta-data exist, these alone are generally insufficient to be used predictably and reliably by computer programs.  XML is widely used around the web, but XML schemas (the XML contracts which govern the structure and content of XML documents) are often attached to a single document, a single service or a single organization.  This gets to the root of the problem: without the semantic web, there doesn’t exist a single, universally accepted way of specifying a person, a UPC code, a financial service or a purchasable item.  The fact that product “A” on site X and product “A” on site Y are the same product is established by humans (by comparing brands, labels, model numbers and pictures); it cannot be conclusively and reliably determined by a computer program (a small illustration follows below).  Lastly, while search engines have bridged this gap somewhat, short of a complete artificial intelligence system, the information on the web will remain in unstructured data form until technologies like the semantic web become prevalent.  In conclusion, the semantic web, a term coined by Sir Tim Berners-Lee and spearheaded by the W3C, seeks to attach meaning to page content so this content can be consumed, queried and inter-related by machines.  From the largest collection of text in the world, the internet would be elevated to the largest collection of inter-related, meaningful information in the world.
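To make the product “A” example concrete, here is a small Jena sketch (URIs and titles invented for illustration) in which two independently published descriptions are declared, in RDF, to denote one and the same product.  Once that single owl:sameAs triple exists, a program can merge everything the two sites say about the item, something no purely internal relational schema can do across organizational boundaries.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC;
import com.hp.hpl.jena.vocabulary.OWL;

public class ProductIdentityDemo {
  public static void main(String[] args) {
    Model model = ModelFactory.createDefaultModel();

    // Two sites describe "the same" product under their own URIs.
    Resource productOnSiteX = model.createResource("http://site-x.example.com/products/12345")
        .addProperty(DC.title, "Acme Widget, Model Z");
    Resource productOnSiteY = model.createResource("http://site-y.example.com/catalog/acme-widget-z")
        .addProperty(DC.description, "The popular Model Z widget from Acme.");

    // The semantic-web way of saying the two URIs denote one and the same thing.
    productOnSiteX.addProperty(OWL.sameAs, productOnSiteY);

    model.write(System.out, "N-TRIPLE");
  }
}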

The semantic web is generally believed to be the next version of the web.  Whereas Web 1.0 was about basic publishing and Web 2.0 is social, Web 3.0 is expected to be semantic.   Yet for all the promise, its ascension remains clouded with doubts and hindered by real-world impediments.   The semantic web is a technology of the future that, until now, has remained in the future.  On paper, all the required building blocks are here.  Standards (W3C recommendations) have been published; parsers, query engines and core technologies are available, and so are global open-source ontologies.  What’s missing?

The “social web” is largely being promoted and evangelized by a combination of online marketing and user-experience professionals. Evangelists are tremendously important in spreading the word and encouraging adoption.  On the Toronto scene, Web 2.0 evangelists like David Crow, Matthew Milan and Saul Colt come to mind.  And yet the semantic web community hasn’t really reached out to Web 2.0 professionals in general. The conversation mostly revolves around the back-end, infrastructure and core technologies. The semantic web talks about schemas, objects and relationships.  It talks about machine languages and parsers.  It does not directly address the user experience (although its ultimate goal is just that).   To succeed, the semantic web needs to leave the lab and the research department.  It needs to make itself palatable to early adopters and would-be evangelists.  It needs a business plan, promoters and supporters.   It needs to reach out to, inform and excite the Web 2.0 community.  Why bother?  While the first iteration of a semantic ecosystem will most likely focus on the “back-end” (much as back-end-centred Web 1.0 was followed by user-centred Web 2.0), it will likely be followed by a second iteration of user-centred services, heavily focused on the user experience and powered by semantic web data.  While the web does a lot today, imagine the capabilities of a Web 4.0 front-end powered by a semantic web back-end.  The potential is mind-boggling.  Let’s go semantic, if you catch my meaning 😉

Resources:

W3C semantic web homepage: http://www.w3.org/2001/sw/

Wikipedia on semantic web: http://en.wikipedia.org/wiki/Semantic_Web

Sample concept from open-source ontology for semantic web (in human readable format): http://sw.opencyc.org/concept/Mx4rvVi1AJwpEbGdrcN5Y29ycA

Open-source (created by HP, Java-based) semantic web toolkit: http://jena.sourceforge.net

toronto_semantic

In Toronto and interested in the semantic web?  Join us at the Toronto Semantic Web group on LinkedIn.

Written by Compliantia

May 7, 2009 at 6:05 pm