Processor Scheduling
July 15, 2008
As we know, different types of processors perform better on certain types of workloads. It would be nice to see a processor scheduler that could take advantage of this fact. Here’s my take on how this could work.
The first step toward enabling this functionality would be to determine on what type of processor the instructions would execute most efficiently. Think of this as a loose guideline for the scheduler. To achieve this, the compiler or a code profiling tool could examine the instruction set and rank the processor types in order of preference for each instruction. For instance, the processor ranking for instruction x could be GPU, CPU, APU. Once the scheduler has this information in hand, it can then optimize processor usage based on various sets of conditions and policies. Here are a few examples of such controls:
1) Processor availability: Is the preferred processor currently free to execute?
2) Processor consumption: Is it efficient, power-wise, to execute this instruction set on the preferred processor?
3) Instruction set priority/deadline: Can this instruction set be executed on a non-optimal processor because it’s not running at a high priority or because there’s no execution deadline?
4) Instruction set parallelization: Can this instruction set leverage the parallelization features of the processors available?
5) Instruction set execution location: Can this instruction set be executed on a remote processor without adverse effects?
In order to achieve this, the compiler would supply the binary of an executable with the specific instruction sets targeting each processor type. Only the processor scheduler would determine, at runtime, on which type of processor a particular bit of code executes.
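To make the idea a bit more concrete, here’s a rough Python sketch of how a scheduler might apply the first three controls listed above to a processor ranking. Everything in it (the Processor and Task classes, pick_processor, the power budget, and the rule that high-priority work only accepts its first choice) is my own assumption for illustration, not an existing scheduler API.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    kind: str      # e.g. "GPU", "CPU", "APU"
    free: bool     # control 1: is the processor available right now?
    watts: float   # control 2: rough power cost of running on it


@dataclass
class Task:
    ranking: list                 # preferred processor kinds, best first
    high_priority: bool = False   # control 3: priority/deadline


def pick_processor(task, processors, power_budget_watts):
    """Return the processor kind to run on, or None if the task should wait."""
    by_kind = {p.kind: p for p in processors}
    for position, kind in enumerate(task.ranking):
        proc = by_kind.get(kind)
        if proc is None or not proc.free:
            continue
        if proc.watts > power_budget_watts:
            continue
        # Assumed policy: high-priority work waits rather than run on a
        # non-optimal processor; everything else takes the best one available.
        if task.high_priority and position > 0:
            return None
        return kind
    return None
```

With a busy GPU and a free CPU, a normal task ranked GPU, CPU, APU would fall back to the CPU, while a high-priority one would wait for the GPU.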
This concept could work in a way that is similar to execution plans in database engines. As more statistics are gathered from the execution of the various pieces of code on the system, the scheduler could make smarter decisions as to how the code should be executed. The ranking of the processors could then be adjusted in the processor ranking manifest of the executable.
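As a sketch of that feedback loop, the manifest could keep the observed run times per processor type and re-sort the ranking from them. The RankingManifest class and the choice of sorting by mean run time are assumptions on my part; a real scheduler would likely weigh a lot more than that.

```python
from collections import defaultdict


class RankingManifest:
    """Hypothetical per-executable manifest holding a processor ranking."""

    def __init__(self, initial_ranking):
        self.ranking = list(initial_ranking)   # the compiler's initial guess
        self._samples = defaultdict(list)      # processor kind -> run times (s)

    def record(self, kind, seconds):
        """Store one observed execution time for a processor kind."""
        self._samples[kind].append(seconds)

    def rerank(self):
        """Put measured kinds first, fastest mean first; keep the rest in order."""
        measured = [k for k in self.ranking if self._samples.get(k)]
        unmeasured = [k for k in self.ranking if not self._samples.get(k)]
        measured.sort(key=lambda k: sum(self._samples[k]) / len(self._samples[k]))
        self.ranking = measured + unmeasured
        return self.ranking
```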
I think this could greatly enhance processor usage in all classes of computers by offering alternate execution paths that were not dynamically available before. Right now, to take advantage of the available cycles of a specific type of processor, one has to rewrite the application to target it. Compilers and code profiling tools are good candidates for those tasks as they already provide an abstraction layer over the targeted architecture. For example, the Intel compiler is able to target specific capabilities of the processor without the developer necessarily having to change a line of code. Programming languages should stay what they are: a high-level way to instruct the computer and all its resources to achieve a particular task. I think that targeting specific capabilities of an architecture in code is wandering away from the problem at hand.
Fast Computation of Database Operations using Graphics Processors
February 27, 2008
Here’s an interesting research paper I found that seems to support my previous blog post:
GPU Accelerated Databases
February 27, 2008
I was reading about General Purpose GPU programming on http://www.gpgpu.org and I started to think about how GPUs could be leveraged in database technology. One use that came to my mind immediately was for geospatial and geometrical data. I’m far from being an expert in that matter, but I would think that one could offload most of the calculations to a GPU. Both raster and vector maps can benefit from the use of a GPU since it can handle both bitmaps and vector data like a real champion.
Another use that I think might work is indexing data. If you represent data using geometrical patterns, it would be conceivable to use a GPU to perform pattern matching very efficiently due to the highly parallel nature of those processing units. If you combine those patterns with set theory, you could define patterns that encompass the actual data. By combining geometrical patterns, those indices would be able to determine the data that is to be included in queries involving a wide range of aggregations and computations.
I’d be curious to see whether, with the CLR integration in SQL Server, one could call DirectX libraries to offload some work to a GPU.
Feel free to comment on the post. I’m learning and thinking out loud here! I’ll probably be posting more thoughts about this in the coming posts. I’m already thinking about applications for data mining and OLAP data…
Application Virtualization for My Mom
February 25, 2008
We see application virtualization software emerging more and more in the enterprise with products like Microsoft Application Virtualization and Citrix Application Streaming. I think if there’s a scenario where application virtualization would be well received, it would be for home use of software. This would have a few advantages for the users.
First of all, the stability of their PC would most likely increase as each application operates in a sandboxed environment, which can avoid a lot of headaches. No more complicated setup and configuration for end users either. Just click on the icon and there it is. Installing an application seems like no big deal for IT pros, but for users at home, it’s a risky operation. Just deploy a stable base operating system image and the user is set.
Another upside to this model is related to SaaS: users could pay only for their actual usage of the software. For instance, when I’m at home, I’m not using Microsoft Office 100% of the time; I might only need it a few hours per week. I have a hard time paying a few hundred dollars for my copy of Office given how little I use it at home. I’d much rather pay $100 for a bank of 10 hours of use per month.
Hopefully software vendors will get their act together as they did with full OS virtualization. As this article points out, we have a bit of a way to go: http://www.news.com/Microsoft-Streaming-Office-infringes-license/2100-1012_3-6229776.html
When you have something like application virtualization, or a desktop or application published via Citrix or Terminal Services, why would you want to build web versions of applications like what Google is trying to do with Google Docs?
Combining Boinc and Facebook to Find Usama Bin Laden
February 21, 2008
What would happen if you were to combine Boinc and Facebook? Probably the most powerful data mining application ever created.
The pure computing power available through the workload distribution of Boinc and the wealth of information present on Facebook would produce one scary application. The kind of application Big Brother would like to have… Maybe he already has! One of the most time-consuming tasks in gathering intelligence is identifying people and their relations with each other. That’s one of the main features of Facebook. If you combine this with the fact that people add to this knowledge by tagging people in pictures, and with the rest of the information that is available, things get very interesting.
Boinc offers 1.06 PetaFLOP/sec, which is twice what BlueGene/L, the fastest supercomputer on the planet, delivers. The more users get involved with Boinc, the more its capacity increases. Why would the government pay to build such a large and resilient supercomputer? I would give tax credits to people who contribute to the national computing resource, because it has value. I’m surprised there isn’t a market for that yet! Just as a nation has oil and gold reserves, computing resources are now a way to rank a nation’s power.
If I were a hacker, the first thing I would hack is Boinc! I would use it to break Facebook and then use its crunching power to process the data to find Usama Bin Laden and get the 5 million dollars! :-p
File duplicates finder
February 4, 2008
A month ago I built a small application, along with a set of Microsoft SQL Server Reporting Services reports, that inventories files on a specific path on a file system or in SharePoint and stores each file’s hash and location in a table. With this you can identify duplicates between a file server and SharePoint sites as well as get storage statistics.
I’ll be publishing the solution on CodePlex at the following URL:
The project was developed in .NET 3.5 using technologies such as Windows Communication Foundation and Parallel FX for multi-threaded file hashing.
Please note that this is an early version and has not been fully tested. I ran both SharePoint and file system scans on about 130,000 files without apparent issues.
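For readers who just want the gist of the technique, here’s a minimal Python sketch of hash-based duplicate detection with multi-threaded hashing. It is not the .NET code from the project: the function names, the use of SHA-256, and keeping the results in memory rather than in a SQL Server table are all assumptions made for the illustration, and it only covers the file system case.

```python
import hashlib
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def find_duplicates(root, workers=4):
    """Return {hash: [paths]} for every hash shared by more than one file."""
    paths = [os.path.join(folder, name)
             for folder, _dirs, files in os.walk(root)
             for name in files]
    by_hash = defaultdict(list)
    # Hash the files on several threads, then group the paths by hash.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, file_hash in pool.map(hash_file, paths):
            by_hash[file_hash].append(path)
    return {h: sorted(p) for h, p in by_hash.items() if len(p) > 1}
```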
Enjoy!
What Windows 7 should be
February 2, 2008
Here’s my take on some of the features Windows 7 should have:
Distributed processing
Now that we’re getting decent bandwidth in LANs and WANs, it would be nice if Microsoft could develop a layer in the operating system that would work similarly to VMware VMotion but at the thread level. This means that if the local computing device, be it a Windows Mobile phone, laptop, desktop or server, is overloaded by another thread’s execution, the load could be offloaded on the fly to another node with available processing capacity. Once a new device is connected to a network, be it a private network or a public one like a Wi-Fi access point, the user could decide whether it participates in the processing. Applications could be built in such a way that they could detect if the computer is part of a larger computing pool and enable extra functionality for the end users. The distribution of thread execution could vary based on the bandwidth available and the job at hand. This could also be controlled via policies; for instance, engineering workstations could have execution priority over a gaming application in a corporate network.
It would be interesting to see libraries such as Parallel FX from Microsoft leverage distributed processing over multiple computers in a transparent fashion. They’re already abstracting job distribution over multiple local CPUs, and I don’t see any reason why that couldn’t be extended to remote processing of tasks as well.
It would be cool to have the operating system capable of routing certain specialized tasks to specific processors such as GPUs. Multiple PCs with graphics cards could be used transparently as a rendering farm for thin clients, for instance.
I know distributed processing is not a new thing in academic and sometimes corporate settings, but imagine if a general purpose operating system such as Windows could bring this mainstream. A pool of 100 million PCs sharing their workload to achieve what was unthinkable before. PC virtualization is nice, but thread execution virtualization is the future!
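To illustrate what “transparent” could mean here, this small Python sketch writes the job submission code against the abstract Executor interface from concurrent.futures instead of a concrete thread pool. The run_jobs and word_count names are made up for the example. A remote executor is purely hypothetical: the standard library doesn’t ship one, but anything implementing the same interface could be passed in without touching the calling code.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def word_count(text):
    """A stand-in for real work: count the words in one document."""
    return len(text.split())


def run_jobs(executor: Executor, documents):
    """Submit one job per document; the caller decides where the jobs run."""
    return list(executor.map(word_count, documents))


if __name__ == "__main__":
    # Local threads today; a (hypothetical) remote executor could be swapped in.
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(run_jobs(pool, ["one two", "three four five"]))
```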
Distributed storage
Wouldn’t it be nice to have access to a single pool of storage that is almost infinite in size and always available? If you look at one of my previous posts, Semantic Web, you can find in there an idea of how storage could evolve into something more collaborative and intelligent. Think of it as a giant hard drive with built-in redundancy and load balancing functionality.
Wrap up
If you combine those two technologies, imagine how that would change the world of computing. Near-unlimited storage and processing available to each of us. This would also help in the area of green computing as we wouldn’t have to build PCs with as much processing capability anymore, since we could always leverage the computing capability of our peers in a given network. You could plug in a PC that boots from a distributed storage pool and start up a data mining application that runs multiple threads over multiple local and remote CPUs. Neato!
Green computing
February 1, 2008
What should be our primary objective? To combine processing, build more energy efficient hardware or change user behaviors? If networking technology is the main barrier for centralized computing, should we focus our efforts in that direction instead of looking at cramming more transistors in handheld devices with poor user experiences and short battery life?
I think it’s a little bit like in the automotive industry. Which one is the “greener” option? Public transportation or more fuel efficient cars?
Just food for thought here!
Artificial intelligence
February 1, 2008
While watching Terminator: The Sarah Connor Chronicles, I began to wonder about certain things regarding artificial intelligence. After doing some quick research on the subject, I noticed that a lot of the approaches to AI involve data analysis. This means the application is “learning” through data that it can capture itself or data that is submitted by humans in a controlled fashion. That might be the most elegant, end-all-be-all solution to AI, but what if there was another alternative?
I think one approach that could be used to achieve maybe not exactly AI but really smart applications is to build software that could learn from other software. This means that if an application needs to add two numbers together, it could look around (locally, on the network, on the Internet) for an application that knows how to achieve this. Once that is found, it could take part of the binary or interface with the application that achieves the desired result and integrate it. I don’t need to know how to build a CPU to use one.
To obtain such a result, a few approaches could be used, such as:
- Automated reverse engineering directly from the binaries
- A semantic description of what a method or function does
Right now applications can’t express what they can or cannot do. Web services are a step forward in this direction, but there is a major concept missing. In web services, you know the method name and its inputs and outputs. For instance, you could call a method hotPotato that takes integers and you would have no way of knowing what that method does without looking at its code or reverse engineering it by submitting inputs and analyzing outputs (which is error prone). Once we figure out a way to enhance the expressiveness of an application, we will be able to build applications that can understand other pieces of code. It would be important that such a language could be applied to existing pieces of code, as it could bring a level of efficiency never seen before in application development.
Think about it: how much code is truly unique in the world? If a system allowed you to find pieces of code automatically by leveraging the code base currently available worldwide, wouldn’t that revolutionize development forever?
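Here’s a toy Python sketch of what such a description could look like in practice: a function declares what it does in a machine-readable way, and another piece of code looks it up by capability instead of by name. The describes decorator, the CAPABILITIES registry and the tiny vocabulary (“add”, “integer”) are all invented for this example; a real system would need a much richer and shared vocabulary.

```python
CAPABILITIES = {}   # (operation, input types, output type) -> function


def describes(operation, inputs, output):
    """Attach a machine-readable description to a function and register it."""
    def decorator(func):
        CAPABILITIES[(operation, tuple(inputs), output)] = func
        return func
    return decorator


@describes(operation="add", inputs=["integer", "integer"], output="integer")
def hotPotato(a, b):
    # The name says nothing useful; the description above does.
    return a + b


def find_capability(operation, inputs, output):
    """Look a function up by what it does rather than by what it is called."""
    return CAPABILITIES.get((operation, tuple(inputs), output))
```

An application needing to add two integers would ask for ("add", integer, integer) and get hotPotato back without ever knowing its name.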
Code learning code, watch out Skynet!
Semantic web?
February 1, 2008
Problem
Metadata at the document level is insufficient to clearly define the underlying ideas or concepts in a document. Keywords tagged to a document by the authors do not fully reflect the content. Since keywords sometimes play a role in the search engine ranking strategy, a pertinent document might not be ranked appropriately for a user to discover it. Also search engines are not able to distinguish one idea from another effectively.
By integrating metadata with the actual data, idea navigation is facilitated and can be automated. Also, by delimiting ideas within documents, documents can be reconstructed from those concepts in another context with greater ease. Finally, since ideas can be extracted from documents while maintaining their essence, only relevant content needs to be displayed to a user, unlike traditional search, which might only point to a document containing the idea somewhere within its content.
Key definitions
Idea:
An idea is a formulated thought or opinion. An idea can be contained within one or more documents.
Document:
A document is a grouping of ideas.
Object:
Something material that may be perceived by the senses
Sample Applications
Unstructured Data
Text documents
Video documents
Extra metadata in video documents could allow embedding information about each actor present in a video frame. It would then be possible to search for all sequences with a particular actor in it.
Also, the context of a particular scene could be identified and linked with relevant content. For example, if one scene was shot near the Eiffel Tower, one could then look at the history behind the setting that was chosen for the film.
By linking content to the dialogs, references from previous episodes in a TV show could then be accessible for the viewer to review important details pertaining to a scene. This could allow more precision in the subtitles because a certain dialog could be tagged to a particular actor.
Extra advertising revenues could be generated by allowing the viewer to click on an item of their liking in a particular scene of a movie or TV show. The user could then be redirected to the manufacturer’s or a reseller’s site for more information regarding the product.
Audio document
Music
Extra metadata could be used to identify multiple ideas and objects within a track. For instance, if multiple discrete channels are used for recording, the musician and the instrument used could easily be identified and linked with other relevant content such as spinoff projects by the artist, the manufacturer of the instrument, other artists using the instrument and so on. Lyrics could also be integrated into the audio document, and references within the song to other ideas could be identified. So one could easily find all war-related songs. You could also analyze the period when the song was composed to relate the song content to other trends at the time (cultural, political, etc.).
Call Recording
Again using discrete channel recording, one could easily tag the interlocutor as well as the context in which the discussion took place. If multiple ideas are exchanged during an audio exchange, those could be tagged by delimiting the different moments in the conversation where each idea is referenced. For example, if one were negotiating the purchase of a car and several options were discussed, one could break the conversation down into multiple scenarios for easy browsing and reference after the conversation took place. A digital signature could also be stored as part of the information of the interlocutor track.
Monitoring service quality could be enhanced through the addition of metadata as a manager could make sure certain topics are covered during a conversation with a customer. Also the customer sales representative (CSR) could use this at the same time to check whether all topics were covered during a conversation to provide consistent service.
Another purpose could be to match call trends with sales data to determine the key influencers for a sale within a call using data mining techniques.
The cost of a call could also be better attributed by allowing the call to be sliced into multiple topics.
Audio learning
Using delimitation at the word level, one could navigate through the different contexts in which a word can be used when learning a new language. One could also link concepts within a conversation to image content to enhance the learning experience.
Structured Data
Content Authoring
While authoring content with extra metadata, the process of delimiting contexts, ideas and objects should be as simple as possible.
Common principles
Document authoring software could allow highlighting and therefore tagging of contexts, ideas and objects. Once the content has been delimited appropriately, the authoring software could issue a search against a service that could return relevant content that could be linked to it. The author would then select one or more sources of information that are relevant in that context. This way, smarter and more educated links can be established between documents easily.


This same search facility could be used to insert new content in the document. The user would first search for the desired content through a service mechanism, then insert only a reference to this content in the document. Once that reference is inserted, the author could have several options as to how the external content is integrated within the document. The author could keep the content only as a link to the external reference. Another possibility would be to insert the actual content by expanding the reference while still keeping a live link to the original source, meaning that if the source content changes, the referencing document’s content will change as well. The author could also disconnect the content from the source to keep an unsynchronized copy of the information.
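The three options described above (plain link, live expansion, disconnected copy) could be modeled along these lines. This is only a sketch: the Reference class, the mode names and the in-memory SOURCES dictionary standing in for the content service are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the external content service (assumption for this sketch).
SOURCES = {"src-1": "Original paragraph."}


@dataclass
class Reference:
    source_id: str
    mode: str                        # "link", "live" or "detached"
    snapshot: Optional[str] = None   # only used by detached copies

    def render(self):
        if self.mode == "link":
            return f"[see {self.source_id}]"   # pointer only, no content
        if self.mode == "live":
            return SOURCES[self.source_id]     # always follows the source
        return self.snapshot                   # frozen, unsynchronized copy


def detach(ref):
    """Disconnect a reference from its source, keeping the current content."""
    return Reference(ref.source_id, "detached", SOURCES[ref.source_id])
```

If the source changes after detach is called, a live reference renders the new text while the detached one keeps the old text.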


By allowing this, the notion of what is a document becomes increasingly difficult to determine. A document is no longer a collection of authored content, but a seamless assembly of both authored and referenced content. With a finer grain to identify authored content and more descriptive relationships between elements, proper credit can be attributed to the author of the content. The royalty concept can easily be added as a service for document authors. When premium content is requested for consumption or reference, the author of the document would pay only for the referenced or consumed content. This could also be the basis for evaluating a content’s worth based on how many times it’s been referenced or consumed.
Another important aspect applicable to all types of documents is the fact that the delimitations of contexts, ideas and objects must be allowed to overlap. The delimitation could be done by the author of the original document as well as by the author of another document that is referencing the original document. Allowing external authors to delimit content entails an approval mechanism where content delimited by someone who is not the author could be approved or denied. This capability can be embedded in the original document as an operation. See the security section for more information on the concept. This mechanism would also reinforce the concept of intellectual property across the mesh of documents.
Since a document can be observed through different perspectives, links to external content must be aware of this notion of context. As relationships with other documents are context based, one could link the same idea in a document to multiple different external concepts by simply switching the perspective or context of the relationship.
Audio and Video Documents
Delimitation of content in audio and video documents must have the following characteristics:
- Time based
  - The metadata must be associated with a lapse of time where the context, idea or object is in effect.
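A time-based tag could be as simple as a label with a start and an end time. The sketch below (the Tag class and the two helper functions are hypothetical names) shows how that would support the “find all sequences with a particular actor” search mentioned earlier, and how overlapping tags can coexist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    kind: str      # "context", "idea" or "object"
    label: str
    start: float   # seconds from the start of the recording
    end: float     # seconds; the tag is in effect for start <= t < end


def sequences_with(tags, label):
    """Return the (start, end) ranges during which a label is in effect."""
    return sorted((t.start, t.end) for t in tags if t.label == label)


def active_at(tags, second):
    """Return every label in effect at a given moment; tags may overlap."""
    return [t.label for t in tags if t.start <= second < t.end]
```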

Consuming Content
Crawling Content
A search engine could index documents in a way that is similar to current search engines. Since the document now contains richer information about their context and content, documents could be indexed and ranked in the following ways:
- Search engine crawl starts with a document and records each topic within the document
- Full text indexing could also be done on the textual content of a document
- Search engine could then explore the references made in the document (context). This would lead to the discovery of new documents and new concepts in other documents.
- The depth of the document could also be recorded to help in the ranking algorithm. A document that is referenced multiple times should have higher relevance in the search.
- The content object within a document is indexed independently of the container document, but a reference to the container is kept
- Attributes such as the author name or the content creation date are indexed and tagged to the content
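The crawl steps above can be sketched in a few lines of Python. The document layout (a dictionary with "topics" and "references"), the crawl and rank functions, and ranking purely by the number of inbound references are simplifying assumptions on my part; full text indexing and attributes are left out.

```python
from collections import defaultdict, deque


def crawl(start, documents):
    """documents maps a document id to {"topics": [...], "references": [...]}."""
    topic_index = defaultdict(set)   # topic -> ids of documents containing it
    inbound = defaultdict(int)       # document id -> times it is referenced
    depth = {start: 0}               # document id -> distance from the start
    queue = deque([start])
    while queue:
        doc_id = queue.popleft()
        doc = documents[doc_id]
        for topic in doc["topics"]:
            topic_index[topic].add(doc_id)
        for target in doc["references"]:
            inbound[target] += 1
            # Following references is how new documents get discovered.
            if target in documents and target not in depth:
                depth[target] = depth[doc_id] + 1
                queue.append(target)
    return topic_index, inbound, depth


def rank(topic, topic_index, inbound):
    """Most referenced documents first; ties broken by document id."""
    return sorted(topic_index[topic], key=lambda d: (-inbound[d], d))
```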

Unstructured Querying
Searching for a document would require the user to enter his search criteria in one of the following possible ways:
- Keywords
- Topics
- Object name
- Author
- Reference to an existing concept, idea or document while consuming content (i.e. I’m looking for a concept similar to this one or to this document).
The search engine would then have to look within its index for concepts that match the user’s criteria. Those could be “hard” criteria such as dates or author name, or “soft” criteria such as similarities.
Once the search engine has found the relevant content, it returns a list of only the relevant content. What happens during this process is that the search engine fetches the content from the original sources and exposes it in the search results. Only the ideas pertaining to the search are returned. This means that if a document contains the idea sought but also contains other irrelevant ideas, only the proper content is returned to the user. This can be a nice efficiency boost while searching for content. Currently, when content is searched, only the document containing the content is returned. The user often has to search within the document to find the relevant content.

Structured Querying
A programmatic layer should be provided to allow application developers to query all content as if it were part of a single repository. To allow this, the indexing engine must be accessible as a single service through which standard query concepts could be applied, much in the same way you would with a relational database management system. Having access to content in a standard format could allow for easy development of standard document management functionality, such as archiving and workflow. The querying engine must be allowed to work in a distributed manner, meaning that part of a query could be forwarded to one or multiple engines. As results are prepared
Security
While the ability to easily explore the environment of a document is core to this concept, security is an equally important aspect.
Document Control
Controlling what can be done with the document is a crucial aspect to ensure proper security. One key concept in achieving this is the notion of operation. An operation defines a certain activity that can be performed against any securable object. Examples of operations could be View, Modify, Delete, Exchange, Print or Link. Operations are applied against securable objects in a document. A securable object could be a concept, an idea, a relation with another document or the whole document itself. The information regarding all security aspects for all users is only kept in the master document. Since it’s assumed that the user will receive only the information he has access to along with the operations he’s allowed to perform, we can achieve a level of control that was unachievable before.
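As a rough illustration of operations applied to securable objects, here’s a small Python sketch. The MasterDocument class, its method names and the "idea:pricing" style of object ids are hypothetical; the point is only that the master document holds all the grants and that each user receives just their own slice.

```python
class MasterDocument:
    """Holds the security information for all users (sketch only)."""

    def __init__(self):
        self._grants = {}   # (user, securable object) -> set of operations

    def grant(self, user, securable, *operations):
        """Allow a user to perform operations on one securable object."""
        self._grants.setdefault((user, securable), set()).update(operations)

    def allowed(self, user, securable, operation):
        """Check one operation such as View, Modify, Delete, Print or Link."""
        return operation in self._grants.get((user, securable), set())

    def view_for(self, user):
        """What a user receives: only their objects and permitted operations."""
        return {obj: sorted(ops)
                for (grantee, obj), ops in self._grants.items()
                if grantee == user}
```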
Document Exchange
A metadata enhanced document is by design an entity that can live by itself without reliance on any external security infrastructure (authentication, authorization). This would allow for secure document interchange between two parties without any special mechanism. The document would only need to be sent to its destination. No secure channels should be necessary during the exchange. While a document is packaged for offline usage, the following things should happen:
- Permitted operations on the securable document aspects for a particular user are added to a sub-package
- Authorizations required for these operations are added to the sub-package
- Content allowed for this particular user is added to the sub-package
- The resulting sub-package is encrypted with the key of the user with rights to this sub-package
- The process is repeated for each user having access to a piece of content in the package
- A hash is generated for each sub-package and added to the final compressed file
- All the sub-packages are then compressed into their final file format
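The packaging steps above could look roughly like the Python sketch below. Several things are assumptions made for the illustration: the package function and its parameters, JSON as the sub-package format, SHA-256 for the hash and a zip archive as the final file. The authorization step is left out, and the encryption routine is passed in as a parameter because the standard library has no cipher; the placeholder_cipher shown here is NOT real encryption and only exists so the sketch can run.

```python
import hashlib
import io
import json
import zipfile


def package(document, grants, keys, encrypt):
    """Build one archive holding an encrypted sub-package per user.

    document: {object id: content}
    grants:   {user: {object id: [permitted operations]}}
    keys:     {user: key bytes}
    encrypt:  callable (key, plaintext bytes) -> ciphertext bytes
    """
    buffer = io.BytesIO()
    hashes = {}
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for user, operations in sorted(grants.items()):
            # Only the operations and content this user is entitled to.
            sub_package = {
                "operations": operations,
                "content": {obj: document[obj] for obj in operations},
            }
            plaintext = json.dumps(sub_package, sort_keys=True).encode("utf-8")
            ciphertext = encrypt(keys[user], plaintext)
            hashes[user] = hashlib.sha256(ciphertext).hexdigest()
            archive.writestr(f"{user}.bin", ciphertext)
        archive.writestr("hashes.json", json.dumps(hashes, sort_keys=True))
    return buffer.getvalue()


def placeholder_cipher(key, data):
    # NOT real encryption: a repeating-key XOR used only so the sketch runs.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
```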
Once the compressed file is created, software aware of the file format can be used to send only the proper part of the compressed file to the recipient. This would ensure that only the proper information is sent to the recipient. In a server scenario, a user could authenticate using the same key that was used to encrypt the sub-package and the server could return only the proper section seamlessly.
Exportability
One important aspect of a metadata enhanced document is the notion of exportability. The browsing application must offer the user the option of exporting the document along with its contextual information. The export must allow the user to select the contextual information to be included with the export. This could be filtered with a combination of the following:
- Content types
  - Audio
  - Video
  - Images
- Recursion depth
  - Since documents are linked together by context, objects and ideas, the export must allow the user to select the number of immediate relations that are allowed. For instance, Document 1 is linked to Document 2 and Document 2 is linked to Document 3. In this example the user could choose to export Document 1 and the referenced Document 2 while excluding Document 3.
- Topics
  - Since a document can contain several topics, one could choose to only export certain topics while excluding others
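The recursion depth filter is basically a depth-limited walk over the links between documents. Here’s a small Python sketch of it; the export_set function and the links dictionary are assumptions for the example, and the content type and topic filters are left out.

```python
from collections import deque


def export_set(start, links, max_depth):
    """Return the documents reachable from start within max_depth relations.

    links maps a document to the list of documents it references.
    """
    selected = {start}
    queue = deque([(start, 0)])
    while queue:
        doc, depth = queue.popleft()
        if depth == max_depth:
            continue   # do not follow relations beyond the chosen depth
        for target in links.get(doc, []):
            if target not in selected:
                selected.add(target)
                queue.append((target, depth + 1))
    return selected
```

With Document 1 linked to Document 2 and Document 2 linked to Document 3, a depth of 1 exports Documents 1 and 2 and leaves Document 3 out.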

Peer to Peer Topology
One crucial aspect of any system should be resilience. To achieve this, a number of techniques are available. One technology that is hard to beat on both availability and capacity is the peer-to-peer topology. If you consider the peer-to-peer network as a computing resource, you can achieve massive scalability that can’t be achieved using traditional monolithic systems.
In this context, the peer-to-peer computing grid can have multiple purposes and advantages over competing technologies.
Storage

As you can see in the diagram above, the grid can be used to ensure redundancy of the information. As each member of the storage grid is aware of the other nodes, it could track modified content and replicate it as needed to the other nodes in the grid to ensure an appropriate number of copies of the same content is kept in the grid. Each content fragment within the grid should be able to announce the level of redundancy it requires, or policies could be set at the grid level for predefined types of content.
In storage, resiliency is one thing and performance is another. If a piece of content is in high demand, the grid should be able to adapt to this increase. One way to achieve this is by spreading content across as many nodes as necessary to achieve acceptable throughput and response time for the user.

Once the content has been pushed to the other nodes, they can now be used to serve the content. Different strategies could be used when fetching the content from multiple nodes in the grid.
The client could retrieve the desired fragment by using one of the following strategies depending on network conditions and the type of content served.
- The request for the fragment could be distributed across the nodes serving the content
- Download the whole fragment from the node with the best throughput because sometimes the overhead of distributing the request through the grid could be prohibitive for some hardware configurations
- If the fragment is very small, the client could choose to download the file from the node with the fastest response time
The reverse concept should also be possible. If a highly demanded topic loses popularity in the grid, unnecessary copies should be discarded. Only the number needed for redundancy should be kept.
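The number of copies to keep could be computed from two things: the redundancy the fragment asks for and the current demand. The Python sketch below assumes each node can serve a fixed number of requests per second, which is a simplification I’m making for the example, as are the function names.

```python
import math


def target_copies(min_redundancy, requests_per_second, node_capacity_rps):
    """Copies to keep: enough for redundancy and enough for current demand."""
    for_demand = math.ceil(requests_per_second / node_capacity_rps)
    return max(min_redundancy, for_demand)


def rebalance(current_copies, min_redundancy, requests_per_second, node_capacity_rps):
    """Positive result: copies to add. Negative result: copies to discard."""
    wanted = target_copies(min_redundancy, requests_per_second, node_capacity_rps)
    return wanted - current_copies
```

For example, a fragment that requires 3 copies and receives 950 requests per second on nodes that each handle 100 would be spread over 10 nodes; once demand drops to 50 requests per second, 7 of those copies could be discarded.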

When a document is retrieved, several requests would be issued against the grid as each fragment could potentially be retrieved from multiple nodes.
The previous principles could address a multitude of problems currently happening with traditional clusters. Here are some of the problems this solution attempts to solve:

In the diagram above, all the computing nodes fight for a centralized pool of storage. Traditionally, a storage area network is used to achieve this. They are usually built with an expensive set of devices with hard capacity limitations. As the number of computing nodes increases, a few problems arise. The storage resources cannot easily adapt to a sudden change in content access. This often entails reconfiguration of the storage array to increase either the capacity or the number of hard drives required to deliver acceptable response time or throughput. As concurrency increases between the nodes, the amount of information exchanged between the nodes to coordinate transactions increases. This negatively impacts the performance of the cluster, which spends more time managing the work to be done than actually doing the computing task at hand. This architecture has a practical processing scalability limit due to centralized storage and inter-node communication.
An alternative architecture, usually called “shared nothing” clusters, takes a different approach to avoid the scalability issues mentioned above. As each computing node is independent from the others, you effectively avoid the storage contention issue and the inter-node communication problems. However, this architecture still presents a considerable drawback. As its name indicates, the computing nodes don’t share anything. This means that if you want to query for a particular piece of content, you either have to know on which node it’s located or query every node. If you need content from more than one node, you need to write code that will do the aggregation manually. Another issue with this architecture is that if you need to rebalance data across the nodes for scalability reasons, this is usually a manual process as well.

With the proposed architecture we can eliminate or mitigate the issues explained above. The peer-to-peer computing grid has the following advantages:
- Can take advantage of the cheapest storage media and computing hardware available regardless of their built-in availability features
- Provides a single query interface for multiple computing nodes
- Minimizes inter-node communication overhead, as data is dynamically load balanced across nodes, avoiding contention between multiple nodes on the same piece of data
- Allows for live addition/deletion/upgrade of computing nodes

As the rate of updates against a certain piece of content increases, the hot content will get invalidated on all the nodes of the cluster with the exception of a limited set of nodes. The invalidated nodes only receive the notification that a certain piece of information is no longer up to date, as well as the addresses of the nodes now responsible for the data. The remaining nodes take over sole responsibility for updating this piece of content. As the number of nodes involved in a given transaction is limited to potentially only two nodes, we minimize the amount of chatter required between the nodes of the cluster to coordinate a transaction on a specific content fragment. As these nodes are now processing the updates for this piece of information with a higher priority than other requests, their existing load could be redistributed amongst the remaining cluster nodes if needed. Once the update rate decreases, the data can be redistributed across as many nodes as necessary to accommodate the required throughput and response time.
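The hand-off described above could look roughly like the following sketch. The `Fragment` structure, the update-rate threshold and the rule of keeping the least-loaded holders as owners are assumptions made for illustration; the text only states that a limited set of nodes, potentially two, keeps responsibility for the content.

```python
from dataclasses import dataclass, field

# Hypothetical values: the threshold that marks content as "hot" and the
# number of nodes that keep responsibility for it.
HOT_UPDATES_PER_MINUTE = 100
OWNER_COUNT = 2


@dataclass
class Fragment:
    fragment_id: str
    holders: set[str]  # nodes holding a valid copy
    # invalidated node -> addresses of the nodes now responsible
    redirects: dict[str, set[str]] = field(default_factory=dict)


def restrict_to_owners(fragment: Fragment, node_load: dict[str, float]) -> set[str]:
    """Keep the least-loaded holders as owners and invalidate the others."""
    ranked = sorted(fragment.holders, key=lambda n: (node_load.get(n, 0.0), n))
    owners = set(ranked[:OWNER_COUNT])
    for node in fragment.holders - owners:
        # The invalidated node only learns that its copy is stale and
        # where the up-to-date copy now lives.
        fragment.redirects[node] = set(owners)
    fragment.holders = owners
    return owners


def on_update_rate(fragment: Fragment, updates_per_minute: float,
                   node_load: dict[str, float]) -> set[str]:
    """Shrink the holder set when the fragment becomes hot."""
    hot = updates_per_minute >= HOT_UPDATES_PER_MINUTE
    if hot and len(fragment.holders) > OWNER_COUNT:
        restrict_to_owners(fragment, node_load)
    return fragment.holders
```

Redistributing the data once the update rate drops again is left out of this sketch.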
Versioning and Archiving
The storage subsystem must allow for multiple versions of a content fragment to be stored across the cluster. Typical backup and recovery hardware and software should not be necessary in this environment. To achieve this, the following methods are used.
- Policies are defined to regroup content types
- Storage policies are defined on top of the content types. The policies can include the following information:
  - Retention period

As content is accessed in the cluster, each request is tracked. Periodically, the cluster will designate a node to analyze the access patterns of the content on the nodes. The content will then be classified using a data mining algorithm. Once the classification is completed, the nodes in the cluster are notified of the changes in the content they host. Nodes can then delete content, or push and fetch it to and from other nodes, to balance the cluster.
Regular server applications could be balanced in the same way using application virtualization technologies.
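As a rough illustration of the tracking and classification cycle, the sketch below counts requests per fragment and maps each fragment to a target number of copies. Simple fixed thresholds stand in for the data mining algorithm, and the class names, labels, thresholds and replica multipliers are all hypothetical.

```python
from collections import Counter


class AccessTracker:
    """Counts requests per content fragment between two analysis runs."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, fragment_id: str) -> None:
        self.counts[fragment_id] += 1

    def drain(self) -> Counter:
        """Hand the counts to the analyzing node and start a new period."""
        counts, self.counts = self.counts, Counter()
        return counts


def classify(counts: Counter, hot_threshold: int = 1000,
             warm_threshold: int = 100) -> dict[str, str]:
    """Label each fragment by request volume (stand-in for data mining)."""
    labels = {}
    for fragment_id, hits in counts.items():
        if hits >= hot_threshold:
            labels[fragment_id] = "hot"
        elif hits >= warm_threshold:
            labels[fragment_id] = "warm"
        else:
            labels[fragment_id] = "cold"
    return labels


def target_replicas(label: str, redundancy_minimum: int = 2) -> int:
    """Cold content keeps only the copies needed for redundancy."""
    multiplier = {"hot": 4, "warm": 2, "cold": 1}[label]
    return redundancy_minimum * multiplier
```

Each node would then compare the target count with the copies that currently exist and delete or fetch content accordingly.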

As the number of nodes increases in the cluster, nodes will regroup together based on different criteria:
- Content similarity
- Consumer proximity
  - Content is sometimes localized to a particular area (department, enterprise, city, country, etc.). Local content will be stored near its consumers for faster access and to make the most of network links.
  - As access patterns change, the content will move to accommodate the demand. For instance, worldwide news is sometimes consumed first in some parts of the world due to time zones.
Node Architecture


Hardware
The goal of this project is to leverage existing and future hardware as much as possible.
Operating System
The implementation of the project should be operating system independent. The operating system must be able to support the core functions required by the virtualization layer.
Indexing Services
The role of the indexing services is to build and maintain the index for the content stored on that node. The index is populated by calling the CrawlContent method on the content type.
Content Services
The role of the content services layer is to provide the common functions that apply to all content types on a node.
Virtualization
The virtualization layer provides the emulation needed to give each content fragment its own networking stack.
IPv6 Address
Each piece of content can be addressed using a standard IPv6 address. This allows the content to be retrieved easily as it moves between nodes.
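How an address gets assigned to a piece of content is not specified here. One possible approach, shown purely as an assumption, is to derive the host part of the address from a hash of a content identifier inside a prefix reserved for content. The prefix below is the IPv6 documentation range used as a placeholder, and `content_address` is a hypothetical helper.

```python
import hashlib
import ipaddress

# Placeholder prefix taken from the IPv6 documentation range (2001:db8::/32).
CONTENT_PREFIX = ipaddress.IPv6Network("2001:db8:0:1::/64")


def content_address(content_id: str,
                    prefix: ipaddress.IPv6Network = CONTENT_PREFIX
                    ) -> ipaddress.IPv6Address:
    """Derive a stable address for a piece of content inside the prefix."""
    host_bits = 128 - prefix.prefixlen
    digest = hashlib.sha256(content_id.encode("utf-8")).digest()
    # Keep only as many hash bits as the host part of the prefix allows.
    suffix = int.from_bytes(digest, "big") % (1 << host_bits)
    return prefix.network_address + suffix
```

Because the address depends only on the identifier, it stays the same when the content moves to another node; only the route to it has to change.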
Content Methods
The content methods are functions specific for a particular content type. Here’s a sample list of the functions that could be implemented as content methods:
- ReadContent
- DeleteContent
- WriteContent
- ConvertContent
- GetContentExtract
- SetTransactionIsolation
- IndexContent
It is important to note that all those functions are specific to a content type. As format and functionality differ from one content type to another, the API accessing the content must be adapted. Content methods are also version specific. Multiple versions of content methods must be able to coexist at the same time in order to work with content created with older versions of authoring software.
Another point worth mentioning is that those methods are primarily intended as support services for the cluster functionality. They will be transparent to an end user, as the storage pool made available by the cluster will be implemented as a regular file system.
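One way to let several versions of a content method coexist is a registry keyed by content type, version and method name. The sketch below is an assumption about how this could be organized, not a defined API: the decorator, the `resolve` helper and the two sample `GetContentExtract` implementations are hypothetical.

```python
from typing import Callable

ContentMethod = Callable[[bytes], str]

# (content type, version, method name) -> implementation
_registry: dict[tuple[str, str, str], ContentMethod] = {}


def content_method(content_type: str, version: str, name: str):
    """Register an implementation for one content type and version."""
    def decorator(func: ContentMethod) -> ContentMethod:
        _registry[(content_type, version, name)] = func
        return func
    return decorator


def resolve(content_type: str, version: str, name: str) -> ContentMethod:
    """Find the implementation matching the version the content was written with."""
    try:
        return _registry[(content_type, version, name)]
    except KeyError:
        raise LookupError(f"no {name} for {content_type} {version}") from None


@content_method("text/plain", "1.0", "GetContentExtract")
def extract_v1(data: bytes) -> str:
    # Hypothetical older format: ASCII only.
    return data.decode("ascii", errors="replace")[:20]


@content_method("text/plain", "2.0", "GetContentExtract")
def extract_v2(data: bytes) -> str:
    # Hypothetical newer format: UTF-8.
    return data.decode("utf-8", errors="replace")[:20]
```

A node would call `resolve` with the type and version recorded for a fragment, so content written by older authoring software keeps working after newer method versions are installed.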
Content
This will be the content or file as seen by the end user.
Client Architecture


Path Resolution Service
The goal of the Path Resolution Service is to resolve a file path to an IP address. To achieve this, the following concepts are necessary.
