Semantic web?

Problem

Metadata at the document level is insufficient to clearly define the underlying ideas or concepts in a document. Keywords tagged onto a document by its authors do not fully reflect the content. Since keywords sometimes play a role in search engine ranking strategies, a pertinent document might not be ranked highly enough for a user to discover it. Search engines are also unable to distinguish one idea from another effectively.

By integrating metadata with the actual data, navigation between ideas becomes easier and automatable. By delimiting ideas within documents, those ideas can be reassembled into new documents with greater ease. Finally, since ideas can be extracted from a document while maintaining their essence, only the relevant content can be displayed to a user, unlike traditional search, which might only point to a document containing the idea somewhere within its content.

Key definitions

Idea:

An idea is a formulated thought or opinion. An idea can be contained within one or more documents.

Document:

A document is a grouping of ideas.

Object:

Something material that may be perceived by the senses.

Sample Applications

Unstructured Data

Text documents

Video documents

Extra metadata in video documents could allow embedding information about each actor present in a video frame. It would then be possible to search for all sequences featuring a particular actor.

Also, the context of a particular scene could be identified and linked with relevant content. For example, if one scene was shot near the Eiffel Tower, one could then look at the history behind the setting that was chosen for the film.

By linking content to the dialog, references from previous episodes of a TV show could be made accessible for the viewer to review important details pertaining to a scene. This could also make subtitles more precise, because each line of dialog could be tagged to a particular actor.

Extra advertising revenue could be generated by allowing the viewer to click on an item of interest in a particular scene of a movie or TV show. The user could then be redirected to the manufacturer’s or a reseller’s site to get more information about the product.

Audio documents

Music

Extra metadata could be used to identify multiple ideas and objects within a track. For instance, if multiple discrete channels are used for recording, the musician and the instrument used could easily be identified and linked with other relevant content, such as spinoff projects by the artist, the manufacturer of the instrument, other artists using the instrument, and so on. Lyrics could also be integrated into the audio document, and references within the song to other ideas could be identified, so one could easily find all war-related songs. One could also analyze the period when the song was composed to relate its content to other trends of the time (cultural, political, etc.).

Call Recording

Again using discrete channel recording, one could easily tag each interlocutor as well as the context in which the discussion took place. If multiple ideas are exchanged during a call, each could be tagged by delimiting the moments in the conversation where that idea is referenced. For example, if one were negotiating the purchase of a car and several options were discussed, the conversation could be broken down into multiple scenarios for easy browsing and reference afterwards. A digital signature could also be stored as part of each interlocutor’s track.

Monitoring service quality could be enhanced through the addition of metadata, as a manager could make sure certain topics are covered during a conversation with a customer. The customer sales representative (CSR) could use the same information at the same time to check whether all topics were covered during the conversation, providing consistent service.

Another purpose could be to match call trends with sales data to determine the key influencers for a sale within a call using data mining techniques.

Call costs could also be better attributed by allowing a call to be sliced into multiple topics.

Audio learning

Using delimitation at the word level, one could navigate through the different contexts in which a word can be used when learning a new language. One could also link image content to concepts within a conversation to enhance the learning experience.

Structured Data

Content Authoring

While authoring content with extra metadata, the process of delimiting contexts, ideas and objects should be as simple as possible.

Common principles

Document authoring software could allow highlighting, and therefore tagging, of contexts, ideas and objects. Once the content has been delimited appropriately, the authoring software could issue a search against a service that returns relevant content to link to. The author would then select one or more sources of information that are relevant in that context. This way, smarter and more educated links can be established between documents easily.

This same search facility could be used to insert new content into the document. The user would first search for the desired content through a service mechanism, then insert only a reference to this content in the document. Once that reference is inserted, the author would have several options for how the external content is integrated. The author could keep the content only as a link to the external reference. Another possibility would be to insert the actual content by expanding the reference while keeping a live link to the original source, meaning that if the source content changes, the referencing document’s content changes as well. The author could also disconnect the content from the source to keep an unsynchronized copy of the information.
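
As a rough illustration, these three integration modes could be modeled like the Python sketch below. The names (LinkMode, ContentReference, the fetch callback) are illustrative only, not part of any defined format:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class LinkMode(Enum):
    LINK_ONLY = auto()   # keep only a pointer to the external source
    LIVE_COPY = auto()   # embed the content, refresh it when the source changes
    SNAPSHOT = auto()    # embed the content, disconnected from the source


@dataclass
class ContentReference:
    source_uri: str                     # where the referenced content lives
    mode: LinkMode
    cached_body: Optional[str] = None   # local copy for LIVE_COPY / SNAPSHOT

    def resolve(self, fetch: Callable[[str], str]) -> str:
        """Return the body according to the chosen integration mode."""
        if self.mode is LinkMode.LINK_ONLY:
            return fetch(self.source_uri)              # always follow the link
        if self.mode is LinkMode.LIVE_COPY:
            self.cached_body = fetch(self.source_uri)  # re-sync with the source
            return self.cached_body
        return self.cached_body or ""                  # SNAPSHOT: frozen copy
```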

By allowing this, the notion of what constitutes a document becomes increasingly difficult to pin down. A document is no longer a collection of authored content, but a seamless assembly of both authored and referenced content. With a finer grain for identifying authored content and more descriptive relationships between elements, proper credit can be attributed to the author of the content. A royalty concept can easily be added as a service for document authors: when premium content is requested for consumption or reference, the author of the referencing document would pay for only the referenced or consumed content. This could also serve as the basis for evaluating a content’s worth, based on how many times it has been referenced or consumed.

Another important aspect applicable to all types of documents is that context, idea and object delimitations must allow for overlapping. Delimitation could be done by the author of the original document as well as by the author of another document that references the original. Allowing external authors to delimit content entails an approval mechanism, where content delimited by someone other than the author can be approved or denied. This capability can be embedded in the original document as an operation (see the Security section for more on this concept). This mechanism would also reinforce intellectual property across the mesh of documents.

Since a document can be observed through different perspectives, links to external content must be aware of this notion of context. As relationships with other documents are context-based, one could link the same idea in a document to multiple different external concepts simply by switching the perspective, or context, of the relationship.

Audio and Video Documents

Delimitation of content in audio and video documents must have the following characteristics (a minimal sketch follows the list):

  • Time based
    • The metadata must be associated with a lapse of time where the context, idea or object is in effect.
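
A minimal sketch of such a time-based annotation, assuming metadata is stored as (start, end) spans over the media timeline; all field names here are illustrative:

```python
from dataclasses import dataclass


@dataclass
class TimedAnnotation:
    kind: str        # "context", "idea" or "object"
    label: str       # e.g. an actor's name or a topic
    start_s: float   # second at which the annotation takes effect
    end_s: float     # second at which it stops applying

    def active_at(self, t: float) -> bool:
        """True if the annotated context/idea/object is in effect at time t."""
        return self.start_s <= t <= self.end_s


# Overlapping annotations are allowed, as required earlier in the text:
scene = [
    TimedAnnotation("object", "Eiffel Tower", 120.0, 185.5),
    TimedAnnotation("idea", "opening negotiation", 150.0, 240.0),
]
print([a.label for a in scene if a.active_at(160.0)])  # both annotations match
```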

Consuming Content

Crawling Content

A search engine could index documents in a way similar to current search engines. Since documents now contain richer information about their context and content, they could be indexed and ranked in the following ways (a crawl sketch follows the list):

  • Search engine crawl starts with a document and records each topic within the document
  • Full text indexing could also be done on the textual content of a document
  • Search engine could then explore the references made in the document (context). This would lead to the discovery of new documents and new concepts in other documents.
  • The depth of the document could also be recorded to help in the ranking algorithm. A document that is referenced multiple times should have higher relevance in the search.
  • The content object within a document is indexed independently of the container document, but a reference to the container is kept
  • Attributes such as the author name or the content creation date are indexed and tagged to the content
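
Here is a rough sketch of that crawl loop. The document model (topics, references, full text, attributes) and the index.add interface are assumptions made for illustration:

```python
from collections import deque


def crawl(start_doc, get_document, index):
    """Breadth-first crawl following context references between documents.

    `get_document` fetches a document by id; `index` is any object with
    an `add(...)` method. `depth` is recorded for use by the ranking step.
    """
    seen = set()
    queue = deque([(start_doc, 0)])
    while queue:
        doc_id, depth = queue.popleft()
        if doc_id in seen:
            continue
        seen.add(doc_id)
        doc = get_document(doc_id)
        # Index each topic and the full text, keeping a link to the container.
        for topic in doc["topics"]:
            index.add(kind="topic", value=topic, container=doc_id, depth=depth)
        index.add(kind="fulltext", value=doc["text"], container=doc_id, depth=depth)
        # Author, creation date and similar attributes are indexed alongside.
        index.add(kind="attributes", value=doc["attributes"], container=doc_id, depth=depth)
        # Follow context references to discover new documents and concepts.
        for ref in doc["references"]:
            queue.append((ref, depth + 1))
```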

Unstructured Querying

Searching for a document would require the user to enter their search criteria in one of the following ways:

  • Keywords
    • Topics
    • Object name
    • Author
  • Reference to an existing concept, idea or document while consuming content (i.e. I’m looking for a concept similar to this one or to this document).

The search engine would then have to look within its index for concepts that match the user’s criteria. Those could be “hard” criteria, such as dates or an author name, or “soft” criteria, such as similarities.

Once the search engine has found the relevant content, it returns a list containing only that content. During this process, the search engine fetches the content from the original sources and exposes it in the search results. Only the ideas pertaining to the search are returned: if a document contains the idea sought along with other irrelevant ideas, only the proper content is returned to the user. This can be a nice efficiency boost when searching for content. Currently, only the document containing the content is returned, and the user often has to search within the document to find the relevant part.
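
As an illustration, a fragment-level search could look like the sketch below. Here index.match and fetch_fragment are assumed interfaces: the first yields (container, span) hits, the second retrieves just that span from the original source rather than the whole document:

```python
def search_ideas(index, criteria, fetch_fragment):
    """Return only the delimited ideas matching `criteria`, not whole documents."""
    results = []
    for container_id, span in index.match(criteria):
        results.append({
            "container": container_id,  # kept so the user can still drill down
            "excerpt": fetch_fragment(container_id, span),  # only the idea itself
        })
    return results
```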


Structured Querying

A programmatic layer should be provided to allow application developers to query all content as if it were part of a single repository. To allow this, the indexing engine must be accessible as a single service through which standard query concepts can be applied, much as you would with a relational database management system. Having access to content in a standard format could allow for easy development of standard document management functionality, such as archiving and workflow. The querying engine must be able to work in a distributed manner, meaning that parts of a query could be forwarded to one or multiple engines. As results are prepared, they would be aggregated and returned to the caller.
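
A possible shape for that distributed querying layer, assuming each engine exposes the same query call and the coordinator merges the partial results (a sketch only):

```python
from concurrent.futures import ThreadPoolExecutor


def distributed_query(engines, criteria):
    """Fan a query out to several indexing engines and merge the results.

    `engines` are assumed objects exposing a `query(criteria)` method.
    """
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda e: e.query(criteria), engines))
    merged = []
    for partial in partials:
        merged.extend(partial)  # aggregation; a ranking pass could re-sort here
    return merged
```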

Security

While the ability to easily explore a document’s environment is core to this concept, security is equally important.

Document Control

Controlling what can be done with a document is a crucial aspect of ensuring proper security. One key concept in achieving this is the notion of an operation. An operation defines a certain activity that can be performed against any securable object. Examples of operations could be View, Modify, Delete, Exchange, Print or Link. Operations are applied against securable objects in a document; a securable object could be a concept, an idea, a relation with another document, or the whole document itself. The information regarding all security aspects for all users is kept only in the master document. Since it’s assumed that a user will receive only the information he has access to, along with the operations he’s allowed to perform, we can achieve a level of control that was previously unachievable.
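
One hypothetical encoding of this model, with operations granted per user on each securable object and kept only in the master document (names are illustrative):

```python
from enum import Enum, auto


class Operation(Enum):
    VIEW = auto()
    MODIFY = auto()
    DELETE = auto()
    EXCHANGE = auto()
    PRINT = auto()
    LINK = auto()


class MasterDocument:
    def __init__(self):
        # (user, securable object id) -> set of permitted operations
        self._acl: dict = {}

    def grant(self, user: str, obj_id: str, op: Operation) -> None:
        self._acl.setdefault((user, obj_id), set()).add(op)

    def allowed(self, user: str, obj_id: str, op: Operation) -> bool:
        """Check whether `user` may perform `op` on the securable object."""
        return op in self._acl.get((user, obj_id), set())
```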

Document Exchange

A metadata-enhanced document is by design an entity that can live by itself, without reliance on any external security infrastructure (authentication, authorization). This would allow secure document interchange between two parties without any special mechanism: the document only needs to be sent to its destination, and no secure channel is necessary during the exchange. When a document is packaged for offline usage, the following things should happen:

  • Permitted operations on the securable document aspects for a particular user are added to a sub-package
  • Authorizations required for these operations are added to the sub-package
  • Content allowed for this particular user is added to the sub-package
  • The resulting sub-package is encrypted with the key of the user with rights to this sub-package
  • The process is repeated for each user having access to a piece of content in the package
  • A hash key is generated for each sub-package and added to the final compressed file
  • All the sub-packages are then compressed into their final file format

Once the compressed file is created, software aware of the file format can be used to send only the proper part of the compressed file to the recipient, ensuring that only the proper information is sent. In a server scenario, a user could authenticate using the same key that was used to encrypt the sub-package, and the server could seamlessly return only the proper section.
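
To make the packaging steps concrete, here is a sketch using symmetric Fernet encryption as a stand-in for “the key of the user” (the text does not specify a scheme; the package layout and names are assumptions):

```python
import hashlib
import json
import zipfile

from cryptography.fernet import Fernet  # third-party: pip install cryptography


def package_document(per_user_payloads: dict, user_keys: dict, out_path: str) -> None:
    """Encrypt one sub-package per user, hash each, then compress everything.

    `per_user_payloads` carries each user's permitted operations, required
    authorizations and allowed content; `user_keys` holds Fernet keys
    (e.g. from Fernet.generate_key() or an equivalent key exchange).
    """
    hashes = {}
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for user, payload in per_user_payloads.items():
            blob = Fernet(user_keys[user]).encrypt(json.dumps(payload).encode())
            hashes[user] = hashlib.sha256(blob).hexdigest()  # per-package hash
            zf.writestr(f"subpackages/{user}.bin", blob)
        zf.writestr("hashes.json", json.dumps(hashes))  # integrity manifest
```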

Exportability

One important aspect of a metadata-enhanced document is the notion of exportability. The browsing application must offer the user the option of exporting the document along with its contextual information, and must allow the user to select which contextual information to include. The export could be filtered using a combination of the following criteria (a sketch follows the list):

  • Content types
    • Audio
    • Video
    • Images
  • Recursion depth
    • Since documents are linked together by contexts, objects and ideas, the export must allow the user to select the number of levels of relations to include. For instance, if Document 1 is linked to Document 2 and Document 2 is linked to Document 3, the user could choose to export Document 1 and the referenced Document 2 while excluding Document 3.
  • Topics
    • Since a document can contain several topics, one could choose to export only certain topics while excluding others
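
An illustrative export filter combining the three criteria above might look like this; the document graph shape (fragments, links, topics) is an assumption:

```python
def export(doc, content_types, max_depth, topics, depth=0):
    """Recursively collect a document and its references, filtered."""
    kept = [f for f in doc["fragments"]
            if f["type"] in content_types and f["topic"] in topics]
    result = {"id": doc["id"], "fragments": kept, "references": []}
    if depth < max_depth:
        for linked in doc["links"]:  # linked documents, e.g. Document 2
            result["references"].append(
                export(linked, content_types, max_depth, topics, depth + 1))
    return result

# With max_depth=1, Document 1 exports with Document 2 but without Document 3,
# matching the recursion-depth example above.
```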


Peer to Peer Topology

One crucial aspect of any system should be resilience. To achieve this, a number of techniques are available. One technology that is hard to beat on both availability and capacity is the peer-to-peer topology. If you consider the peer-to-peer network as a computing resource, you can achieve massive scalability that traditional monolithic systems can’t match.

In this context, the peer-to-peer computing grid can have multiple purposes and advantages over competing technologies.

Storage


As you can see in the diagram above, the grid can be used to ensure redundancy of the information. As each member of the storage grid is aware of the other nodes, it can track modified content and replicate it as needed so that an appropriate number of copies of the same content is kept in the grid. Each content fragment within the grid should be able to announce the level of redundancy it requires, or policies could be set at the grid level for predefined types of content.

In storage, resiliency is one thing; performance is another. If a piece of content is in high demand, the grid should be able to adapt to the increase. One way to achieve this is by spreading the content across as many nodes as necessary to achieve acceptable throughput and response time for the user.

Once the content has been pushed to the other nodes, they can now be used to serve the content. Different strategies could be used when fetching the content from multiple nodes in the grid.

The client could retrieve the desired fragment using one of the following strategies, depending on network conditions and the type of content served (a sketch follows the list).

  • The request for the fragment could be distributed across the nodes serving the content
  • Download the whole fragment from the node with the best throughput, since the overhead of distributing the request through the grid can be prohibitive for some hardware configurations
  • If the fragment is very small, the client could choose to download it from the node with the fastest response time
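
A sketch of how a client might choose between these strategies, assuming it can measure each serving node’s throughput and response time (the threshold and heuristic are illustrative):

```python
def pick_strategy(fragment_size: int, nodes: list) -> str:
    """Choose a retrieval strategy for one fragment.

    `nodes` are assumed dicts with "id", "throughput_mbps" and
    "response_time_ms" keys gathered from network measurements.
    """
    SMALL = 64 * 1024  # assumed cutoff for a "very small" fragment
    if fragment_size < SMALL:
        best = min(nodes, key=lambda n: n["response_time_ms"])
        return f"whole fragment from {best['id']} (fastest response)"
    if len(nodes) < 2 or distribution_overhead_too_high(nodes):
        best = max(nodes, key=lambda n: n["throughput_mbps"])
        return f"whole fragment from {best['id']} (best throughput)"
    return "split the request across all serving nodes"


def distribution_overhead_too_high(nodes: list) -> bool:
    # Placeholder heuristic; a real client would consider hardware limits too.
    return any(n["throughput_mbps"] < 1.0 for n in nodes)
```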

The reverse should also be possible: if a highly demanded topic loses popularity in the grid, unnecessary copies should be discarded, keeping only the number needed for redundancy.


When a document is retrieved, several requests would be issued against the grid as each fragment could potentially be retrieved from multiple nodes.

The previous principles could address a multitude of problems with traditional clusters. Here are some of the problems this solution attempts to solve:

In the diagram above, all the computing nodes contend for a centralized pool of storage. Traditionally, a storage area network is used to achieve this, usually built from an expensive set of devices with hard capacity limits. As the number of computing nodes increases, a few problems arise. The storage resources cannot easily adapt to a sudden change in content access; this often entails reconfiguring the storage array to increase either the capacity or the number of hard drives required to deliver acceptable response time or throughput. As concurrency between the nodes increases, so does the amount of information exchanged between them to coordinate transactions. This negatively impacts the performance of the cluster, which spends more time managing the work to be done than actually doing the computing task at hand. This architecture has a practical processing scalability limit due to centralized storage and inter-node communication.

An alternative architecture, usually called the “shared nothing” cluster, takes a different approach to avoid the scalability issues mentioned above. As each computing node is independent from the others, the storage contention and inter-node communication problems are effectively avoided. However, this architecture still presents a considerable drawback. As its name indicates, the computing nodes don’t share anything: if you want to query for a particular piece of content, you either have to know which node it’s located on or query every node. If you need content from more than one node, you need to write code that does the aggregation manually. Another issue is that if you need to rebalance data across the nodes for scalability reasons, this is usually a manual process as well.

With the proposed architecture we can eliminate or mitigate the issues explained above. The peer-to-peer computing grid has the following advantages:

  • Can take advantage of the cheapest storage media and computing hardware available regardless of their built-in availability features
  • Provides a single query interface for multiple computing nodes
  • Minimizes inter-node communication overhead, as data is dynamically load balanced across nodes, avoiding contention between multiple nodes on the same piece of data
  • Allows for live addition/deletion/upgrade of computing nodes

As the rate of updates against a certain piece of content increases, the hot content is invalidated on all the nodes of the cluster except a limited set. The excluded nodes only receive a notification that the piece of information is no longer up to date, along with the addresses of the nodes now responsible for the data. The remaining nodes take over sole responsibility for updating this content. Since the number of nodes involved in a given transaction is limited, potentially to only two, the amount of chatter required between cluster nodes to coordinate a transaction on a specific content fragment is minimized. As the responsible nodes now process updates for this piece of information at a higher priority than other requests, their existing load could be redistributed amongst the remaining cluster nodes if needed. Once the update rate decreases, the data can be redistributed across as many nodes as necessary to accommodate the required throughput and response time.
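
The handshake could be sketched as follows; the node interface (invalidate, prioritize_updates, address) is assumed, and only the flow follows the description above:

```python
def concentrate_writes(fragment_id, all_nodes, writers=2):
    """Leave `writers` nodes responsible for a hot fragment; notify the rest.

    `all_nodes` is assumed to be ordered by current load, so the first
    entries are the least-loaded candidates to take over the updates.
    """
    owners = all_nodes[:writers]  # potentially only two nodes, per the text
    for node in all_nodes[writers:]:
        # Excluded nodes learn the data is stale and where it now lives.
        node.invalidate(fragment_id, redirect_to=[o.address for o in owners])
    for node in owners:
        node.prioritize_updates(fragment_id)  # updates now come first
    return owners
```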

Versioning and Archiving

The storage subsystem must allow for multiple versions of a content fragment to be stored across the cluster. Typical backup and recovery hardware and software should not be necessary in this environment. To achieve this, the following methods are used:

  • Policies are defined to regroup content types
  • Storage policies are defined on top of the content types. The policies can include the following information
    • Retention period

As content is accessed in the cluster, each request is tracked. Periodically, the cluster designates a node to analyze the access patterns of the content on the nodes. The content is then classified using data mining algorithms. Once the classification is complete, the nodes in the cluster are notified of the changes in the content they host. Nodes can then delete, pull and fetch content from and to other nodes to balance the cluster.

Regular server applications could be balanced in the same way using application virtualization technologies.

As the number of nodes increases in the cluster, nodes will regroup together based on different criteria:

  • Content similarity
  • Consumer proximity
    • As content is sometimes localized to a particular area (department, enterprise, city, country, etc.), local content will be stored near its consumers for faster access and better use of network links.
    • As access patterns change (for instance, worldwide news is sometimes consumed first in certain parts of the world due to time zones), content will move to accommodate the demand.

Node Architecture


Hardware

The goal of this project is to leverage existing and future hardware as much as possible.

Operating System

The implementation of the project should be operating system independent. The operating system must be able to support the core functions required by the virtualization layer.

Indexing Services

The role of the indexing services is to build and maintain the index for the content stored on that node. The index is populated by calling the CrawlContent method on the content type.

Content Services

The role of the content services layer is to provide the common functions that apply to all content types on a node.

Virtualization

The virtualization layer provides the necessary emulation to provide each content fragment with its own networking stack.

IPv6 Address

Each piece of content can be addressed using a standard IPv6 address. This allows the content to remain easily retrievable as it moves between nodes.
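
For illustration, a stable IPv6 address could be derived from a hash of the fragment identifier inside an assumed unique-local prefix; this particular derivation is not prescribed anywhere above, it is just one way to keep the address stable as the fragment moves:

```python
import hashlib
import ipaddress


def content_address(fragment_id: str) -> ipaddress.IPv6Address:
    """Derive a stable address in the assumed fd00::/8 unique-local range."""
    digest = hashlib.sha256(fragment_id.encode()).digest()
    # First byte pins the fd00::/8 prefix; the rest comes from the hash.
    return ipaddress.IPv6Address(bytes([0xFD]) + digest[:15])


print(content_address("document-42/idea-7"))  # same address on every node
```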

Content Methods

The content methods are functions specific for a particular content type. Here’s a sample list of the functions that could be implemented as content methods:

  • ReadContent
  • DeleteContent
  • WriteContent
  • ConvertContent
  • GetContentExtract
  • SetTransactionIsolation
  • IndexContent

It is important to note that all these functions are content-type specific. As format and functionality differ from one content type to another, the API accessing it must be adapted. Content methods are also version specific: multiple versions of content methods must be able to coexist at one time in order to work with content created by older versions of authoring software.
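
One way to sketch this versioned dispatch is a registry keyed by (content type, version), so that old and new method sets coexist; the registry shape and the PlainTextV1 example are assumptions:

```python
_registry: dict = {}  # (content type, version) -> method-set class


def content_methods(content_type: str, version: int):
    """Class decorator registering a method set for one type/version pair."""
    def register(cls):
        _registry[(content_type, version)] = cls
        return cls
    return register


@content_methods("text/plain", 1)
class PlainTextV1:
    def ReadContent(self, blob: bytes) -> str:
        return blob.decode("utf-8")

    def IndexContent(self, blob: bytes) -> list:
        return self.ReadContent(blob).split()


def methods_for(content_type: str, version: int):
    # Dispatch on both keys, so content authored with older software
    # keeps resolving to the method versions that understand it.
    return _registry[(content_type, version)]()
```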

Another point worth mentioning is that these methods are primarily aimed at supporting the cluster functionality. They will be transparent to the end user, as the storage pool made available by the cluster will be implemented as a regular file system.

Content

This will be the content or file as seen by the end user.

Client Architecture

Path Resolution Service

The goal of the Path Resolution Service is to resolve a file path to an IP address. To achieve this, the following concepts are necessary.
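
As a starting point, the service could be sketched as a simple lookup table mapping file paths to the IPv6 addresses of the fragments backing them (illustrative only):

```python
import ipaddress


class PathResolutionService:
    def __init__(self):
        self._table: dict = {}  # file path -> IPv6 address string

    def register(self, path: str, address: str) -> None:
        ipaddress.IPv6Address(address)  # validate before storing
        self._table[path] = address

    def resolve(self, path: str) -> ipaddress.IPv6Address:
        """Resolve a file path to the address of its content fragment."""
        return ipaddress.IPv6Address(self._table[path])


prs = PathResolutionService()
prs.register("/docs/report.txt", "fd00::1")
print(prs.resolve("/docs/report.txt"))  # fd00::1
```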
