Distributed Universal Memory for Windows

*Disclaimer* This is only an idea I’ve been toying with. It doesn’t represent in any way, shape or form future Microsoft plans in regards to memory/storage management. This page will evolve over time as the idea is being refined and fleshed out.

**Last Updated 2017-03-23**

The general ideal behind Distributed Universal Memory is to have a common memory management API that would achieve the following:
  • Abstract the application from the memory medium required to maintain application state, whether it’s volatile or permanent
  • Allow the application to express memory behavior requirements and not worry about the storage medium to achieve this
  • Support legacy constructs for backward compatibility
  • Enable new capabilities for legacy applications without code change
  • Give modern applications a simple surface to persist data
  • Enables scale out applications to use potentially a single address space
  • Could potentially move towards a more microservice based approach instead of the current monolithic code base
  • Could easily leverage advances in hardware development such as disaggregation of compute and memory, usage of specialized hardware such FPGAs or GPUs to accelerate certain memory handling operations
  • Could be ported/backported to further increase the reach/integration capabilities. This memory management subsystem to could be cleanly decoupled from the underlying operating system.
  • Allow the data to be optimally placed for performance, availability and ultimately cost

Availability Management

  • Process memory can be replicated either systematically or on demand
  • This would allow existing process memory to be migrated from one operating system instance to another transparently.
  • This could offer higher resiliency to process execution in the event of an host failure
  • This could also allow some OS components to be updated while higher level processes keep going. (i.e. redirected memory IO)
Performance Management
  • Required medium to achieve performance could be selected automatically using different mechanisms (MRU/LRU, machine learning, etc.)
  • Memory performance can be expressed explicitly by the application
    • By expressing its need, it would be easier to characterize/model/size the required system to support the application
    • Modern applications could easily specify how each piece of data it interacts with should be performing
  • Could provide multiple copies of the same data element for compute locality purposes. i.e. Distributed read cache
    • This distributed read-cache could be shared between client and server processes if desired. This would enable to have a single cache mechanism independently of the client/process accessing it.
Capacity Management
  • Can adjust capacity management techniques depending on performance and availability requirements
  • For instance, if data is rarely used by the application, several data reduction techniques could be applied such as deduplication, compression and/or erasure coding
  • If data access time doesn’t require redundancy/locality/tolerates time for RDMA, it could be spread evenly across the Distributed Universal Memory Fabric

High Level Cluster View


Here’s an high level diagram of what it might look like:

Let’s go over some of the main components.

Data Access Manager

The Data Access Manager is the primary interface layer to access data. The legacy API would sit on top of this layer in order to properly abstract the underlying subsystems in play.

  • Transport Manager
    • This subsystem is responsible to push/pull the data on the remote host. All inter-node data transfers would occur over RDMA to minimize the overhead of copying data back and forth between nodes.
  • Addressing Manager
    • This would be responsible to give a universal memory address for the data that’s independent of storage medium and cluster nodes.

Data Availability Manager

This component would be responsible to ensure the proper level of data availability and resiliency are enforced as per defined policies in the system. It would be made of the following subsystems:

  • Availability Service Level Manager
    • The Availability Service Level Manager’s responsibility to to ensure the overall availability of data. For instance, it would act as the orchestrator responsible to trigger the replication manager to ensure the data is meeting its availability objective.
  • Replication Manager
    • The Replication Manager is responsible to enforce the right level of data redundancy across local and remote memory/storage devices. For instance, if 3 copies of the data must be maintained for the data of a particular process/service/file/etc. across 3 different failure domains, the Replication Manager is responsible of ensuring this is the case as per the policy defined for the application/data.
  • Data History Manager
    • This subsystem ensure that the appropriate point in time copies of the data are maintained. Those data copies could be maintained in the system itself by using the appropriate storage medium or they could be handed of to a thrid party process if necessary (i.e. standard backup solution). The API would provide a standard way for data recovery operations.

Data Capacity Manager

The Data Capacity Manager is responsible to ensure enough capacity of the appropriate memory/storage type is available for applciations and also for applying the right capacity optimization techniques to optimize the physical storage capacity available. The following methods could be used:

  • Compression
  • Deduplication
  • Erasure Coding

Data Performance Manager

The Data Performance Manager is responsible to ensure that each application can access each piece of data at the appropriate performance level. This is accomplished using the following subsystems:

  • Latency Manager
    • This is responsible to place the data on the right medium to ensure that each data element can be accessed at the right latency level. This can be determined either by pre-defined policy or by heuristic/machine learning to detect data access pattern beyond LRU/MRU methods.
    • The Latency Manager could also monitor if a local process tends to access data that’s mostly remote. If that’s the case, instead of generally incurring the network access penalty, the process could simply be moved to the remote host for better performance through data locality.
  • Service Level Manager
    • The Service Level Manager is responsible to manage the various applications expectations in regards to performance.
    • The Service Level Manager could optimize data persistence in order to meet its objective. For example, if the local non-volatile storage response time is unacceptable, it could choose to persist the data remotely and then trigger the Replication Manager to bring a copy of the data back locally.
  • Data Variation Manager
    • A subsystem could be conceived to persist a tranformed state of the data. For example, if there’s an aggregation on a dataset, it could be persisted and linked to the original data. If the original data changes the dependent aggregation variations could either be invalidated or updated as needed.

Data Security Manager

  • Access Control Manager
    • This would create hard security boundary between processes and ensure only authorized access is being granted, independently of the storage mechanism/medium.
  • Encryption Manager
    • This would be responsible for the encryption of the data if required as per a defined security policy.
  • Auditing Manager
    • This would audit data access as per a specific security policy. The events could be forwarded to a centralized logging solution for further analysis and event correlation.
    • Data accesses could be logged in an highly optimized graph database to allow:
      • Build a map of what data is accessed by processes
      • Build a temporal map of how the processes access data
  • Malware Prevention Manager
    • Data access patterns can be detected in-line by this subsystem. For instance, it could notice that a process is trying to access credit card number data based on things like regex for instance. Third-party anti-virus solutions would also be able to extend the functionality at that layer.

Legacy Construct Emulator

The goal of the Legacy Construct Emulator to is to provide to legacy/existing applications the same storage constructs they are using at this point in time to ensure backward compatibility. Here are a few examples of constructs that would be emulated under the Distributed Universal Memory model:

  • Block Emulator
    • To emulate the simplest construct to simulator the higher level construct of the disk emulator
  • Disk Emulator
    • Based on the on the block emulator, simulates the communication interface of a disk device
  • File Emulator
    • For the file emulator, it could work in a couple of ways.
      • If the application only needs to have a file handle to perform IO and is fairly agnostic of the underlying file system, the application could simply get a file handle it can perform IO on.
      • Otherwise, it could get that through the file system that’s layered on top of a volume that makes use of the disk emulator.
  • Volatile Memory Emulator
    • The goal would be to provide the necessary construct to the OS/application to store it’s state data that’s might be typically stored in RAM.

One of the key thing to note here is that even though all those legacy constructs are provided, the Distributed Universal memory model has the flexibility to persist the data as it sees fit. For instance, even though the application might think it’s persisting data to volatile memory, the data might be persisted to an NVMe device in practice. Same principle would apply for file data; a file block might actually be persisted to RAM (similar a block cache) that’s then being replicated to multiple nodes synchronously to ensure availability, all of this potentially without the application being aware of it.

Metrics Manager

The metrics manager is to capture/log/forward all data points in the system. Here’s an idea:

  • Availability Metrics
    • Replication latency for synchronous replication
    • Asynchronous backlog size
  • Capacity Metrics
    • Capacity used/free
    • Deduplication and compression ratios
    • Capacity optimization strategy overhead
  • Performance Metrics
    • Latency
    • Throughput (IOPS, Bytes/second, etc.)
    • Bandwidth consumed
    • IO Type Ratio (Read/Write)
    • Latency penalty due to SLA per application/process
  • Reliability Metrics
    • Device error rate
    • Operation error rate
  • Security Metrics
    • Encryption overhead

High Level Memory Allocation Process

More details coming soon.

Potential Applications

  • Application high availability
    • You could decide to synchronously replicate a process memory to another host and simply start the application binary on the failover host in the event where the primary host fails
  • Bring server cached data closer to the client
    • One could maintain a distributed coherent cache between servers and client computers
  • Move processes closer to data
    • Instead of having a process try to access data accross the network, why not move the process to where the data is?
  • User State Mobility
    • User State Migration
      • A user state could move freely between a laptop, a desktop and a server (VDI or session host) depending on what the user requires.
    • Remote Desktop Service Session Live Migration
      • As the user session state memory is essentially virtualized from the host executing the session, it can be freely moved from one host to another to allow zero impact RDS Session Host maintenance.
  • Decouple OS maintenance/upgrades from the application
    • For instance, when the OS needs to be patched, one could simply move the process memory and execution to another host. This would avoid penalties such as buffer cache rebuilds in SQL Server for instance which can trigger a high number of IOPS on a disk subsystem in order to repopulate the cache based on popular data. For systems with an large amount of memory, this can be fairly problematic.
  • Have memory/storage that spans to the cloud transparently
    • Under this model it would be fairly straightforward to implement a cloud tier for cold data
  • Option to preserve application state on application upgrades/patches
    • One could swap the binaries to run the process while maintaining process state in memory
  • Provide object storage
    • One could layer object storage service on top of this to support Amazon S3/Azure Storage semantics. This could be implemented on top of the native API if desired.
  • Provide distributed cache
    • One could layer distributed cache mechanisms such as Redis using the native Distributed Universal Memory API to facilitate porting of applications to this new mechanism
  • Facilitate application scale out
    • For instance, one could envision a SQL Server instance to be scaled out using this mechanism by spreading worker threads across multiple hosts that share a common coordinated address space.
  • More to come…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s