AWS Backup Service – should you jump in?

I’ve heard before that if you see a title ending in a question mark, the answer is most likely going to be “no”.

Unfortunately, I’m not going to break any traditions here.

Background: AWS announced today that they have added support for EC2 instance backups to their backup service.  AWS Backup now covers Amazon EBS volumes, Amazon Relational Database Service (RDS) databases, Amazon DynamoDB tables, Amazon Elastic File System (EFS), Amazon EC2 instances, and AWS Storage Gateway volumes.

But if you’re expecting this to be like the backup/recovery solutions you’ve run on-prem, you’re in for a rude awakening.  It’s not. This has the scent of a minimum viable product, which I’m sure they’ll build on and make compelling one day, but that day isn’t today.

It’s VERY important that you read the fine print on how backups occur for the different services covered and, more importantly, how you restore and at what granularity you’re restoring.

First- from a fundamental architectural perspective- backups are called “recovery points”.  That’s an important distinction. We’ve long seen the term “recovery point” live alongside “snapshot”, and in fact, EBS “backups” are just that: snapshots.

So for EC2 and EBS-related backups (oops, “recovery points”), you’re simply restoring a snapshot into a NEW resource. Want to restore a single file or directory in an EC2 instance, or in a filesystem on your EBS volume? Nope. All or nothing. Or you can restore from the recovery point into a new resource and copy the needed data back into your live instance. I’m sorry, that’s just not up to today’s expectations in backup and recovery.
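To make that concrete, here’s a rough boto3 sketch of what the “restore” really amounts to (all IDs below are placeholders): you build a brand-new volume from the snapshot behind the recovery point, attach it, and copy the data back yourself.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The "restore" is simply a new EBS volume created from the snapshot
# behind the recovery point -- not an in-place, file-level recovery.
new_volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",    # placeholder recovery point/snapshot ID
    AvailabilityZone="us-east-1a",
)

# Wait until the new volume is ready, then attach it to the live instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=new_volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",       # placeholder instance ID
    Device="/dev/sdf",
)
# From here it's on you: mount the new device and copy your files back by hand.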

What about EFS? Well, glad you asked. This is NOT a snapshot. There is ZERO CONSISTENCY in the “recovery point” for EFS, because the backup in this case doesn’t snapshot anything- it iterates through the files, and if you change a file DURING a backup, there is a 100% chance that your “recovery point” won’t be a “point” at all- so you could break dependencies in your data. Yet they still call the copy of this data a “recovery point”. Give them props for doing incremental-forever here, but most file backup solutions (when paired with enterprise NAS systems, or even just Windows) know how to stun the filesystem and stream their backups from the stunned version rather than the volatile “live” filesystem.  Also, if you want to do a partial recovery, you cannot do it in place- it goes to a new directory at the root of your EFS.

The BIGGEST piece missing from the AWS Backup Service is something we’ve learned to take for granted from B/R solutions: a CATALOG.  You need to know what you want to restore AND where to find it in order to recover it. With EFS, this can get REALLY dicey. It’s really easy to choose the wrong data- perhaps it’s a good thing they don’t allow you to restore in place yet!

Look, I applaud AWS for paying some attention to data protection here. This does shine a light on the fact that the AWS data storage architecture lends itself to many data silos that require a single pane of glass (SPOG) to manage effectively and compliantly. However, there is a (very short) list of OEM B/R and data management vendors that can do this effectively, not just within AWS but across clouds, and still give you the content-aware granularity you need to execute your complex data retention and compliance strategies and keep you out of trouble.

With so many organizations rushing to the cloud, make sure that you’re paying adequate attention to your data protection and compliance as you go. You’ll find that while the cloud providers are absolutely amazing at providing a platform for application innovation and transformation, data governance, archive, and protection are not necessarily getting the same level of attention from them- it’s up to YOU to protect that data and your business.


Use Google Cloud Functions (Python) to modify a GKE cluster

I wanted to create a way to easily “turn on” and “turn off” a GKE cluster, via an HTTP link that I could bookmark and hit, even from my iPhone. With GKE, if you set your node pool size to zero, you’re not incurring any charge, since Google doesn’t charge you for the master nodes. So I wanted to easily set the pool size up and down.  Sure, I could issue a “gcloud container” command, or set up Ansible to do it (which I will do since I want to automate more stuff), but I also wanted to get my feet wet with Cloud Functions and the GCP APIs.

In Google Cloud Functions, you simply write your functional code in the main file (main.py) AND include the correct dependencies in the requirements.txt file (for Python).  Each dependency is listed using the same package name you’d use in a “pip install”.  The package for managing GKE is “google-cloud-container”.

Now, one of the great things about using Cloud Functions is that authorization for all APIs within your project “just happens”.  You don’t need to figure out OAuth2 or use API keys.  You just need to write the code. If you’re going to use this Python code outside of Cloud Functions, you’d need to add some code for that and set an environment variable pointing to the JSON key file of the appropriate service account for your project.
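As a minimal sketch of that outside-of-Cloud-Functions case (the key file path below is a placeholder; GOOGLE_APPLICATION_CREDENTIALS is the standard variable the Google client libraries look for):

import os
import google.cloud.container

# Outside Cloud Functions, point the client library at a service account key file.
# (Inside Cloud Functions this isn't needed; credentials are picked up automatically.)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

client = google.cloud.container.ClusterManagerClient()
# ...then make the same set_node_pool_size() call shown in the function below.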

Here’s sample code to change your GKE Cluster node pool size.

import google.cloud.container

def startk8s(request):
    """HTTP-triggered Cloud Function: scale the node pool back up to 3 nodes."""
    client = google.cloud.container.ClusterManagerClient()
    projectID = '<your-project-id>'
    zone = 'us-east1-d'             # your zone, obviously
    clusterID = '<your-cluster-name>'
    nodePoolID = 'default-pool'     # or your pool name
    client.set_node_pool_size(projectID, zone, clusterID, nodePoolID, 3)
    return "200"

You need to set the “Function to execute” (the entry point) to the name of the function you want triggered- in this case, startk8s.

Notice the import statement- “google.cloud.container”.  Now, you can’t exactly “pip install” into a Cloud Function- it’s not your Python instance! That’s where the requirements.txt file comes in.  (There’s an equivalent for Node.js- package.json- since you can’t npm install either.) Here’s the sample requirements.txt file:

# Function dependencies, for example:
# package>=version
google-cloud-container

Note that the package version seems to be optional.  My code works without it.

You can test the cloud function by clicking on the “testing” sub-menu.
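Once deployed, hitting the function’s HTTP trigger URL is all it takes- which is exactly what I wanted to bookmark. Here’s a quick sketch of calling it from Python (the URL below is just a placeholder for whatever your function’s trigger page shows):

import requests   # third-party: pip install requests

# Placeholder trigger URL -- copy the real one from your function's "Trigger" tab.
FUNCTION_URL = "https://us-east1-<your-project-id>.cloudfunctions.net/startk8s"

resp = requests.get(FUNCTION_URL, timeout=60)
print(resp.status_code, resp.text)    # expect the "200" body returned by startk8s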


#SFD15: Datrium impresses

If I had to choose the SFD15 presenter with the most impressive total solution, it would be Datrium.  We saw some really cool tech all around this week, but Datrium showed me something that I have not seen from many vendors lately: a deep and true focus on the end-user experience and on value extraction from technology.

Datrium is a hyperconverged offering that fits the “newer” definition of HCI, in that the compute nodes and storage nodes scale separately. There’s been an appropriate loosening of the HCI term of late, with folks applying it based on the user experience rather than a definition specified by a single vendor in the space. Datrium takes this further in my opinion by reaching for a “whole solution” approach that attempts to provide an entire IT lifecycle experience – primary workloads, local data protection, and cloud data protection – on top of the same HCI approach most solutions only offer in the on-premises gear.

From the physical perspective, Datrium’s compute nodes are stateless and have local media that acts very much like a cache (they call it “Primary storage”, but this media doesn’t accept writes directly from the compute layer).  They perform some very advanced management of this cache layer, including global dedupe and rapid, location-aware data movement across nodes (i.e., when you move a workload), so I’ll compromise and call it a “super-cache”. Its main purpose is to keep required data on local flash media, so yeah, it’s a cache. A Datrium cluster can scale to 128 nodes, which is plenty for its market space, since a system that size tested out at 12.3M 4k IOPS with 10 data nodes underneath.

The storage layer is scale-out, uses erasure coding, and internally leverages the log-structured file system approach that came out of UC Berkeley in the early ’90s. That does mean that as it fills past 80% or so, writes will cost more. While some other new storage solutions can boast extremely high capacity utilization rates, this is something we’ve had to work with for a long time with most enterprise storage solutions.  In other words: not thrilled about that, but used to it.

Some techies I talk to care about the data plane architecture in a hyperconverged solution. There are solutions that place a purpose-built VM in the hypervisor that exposes the scale-out storage cluster and performs all data management operations, so the data plane runs through that VM. Datrium (for one) does NOT do that. There is a VIB that sits below the hypervisor, which should appease those who don’t like the VM-in-the-data-plane model. There is global deduplication, encryption, cloning, lots of no-penalty snapshots- basically all the features that are table stakes at this point.  Datrium performs these functions up on the compute nodes, in the VIBs.  There is also a global search across all nodes, for restores and other admin functionality. Today, the restore granularity is at the virtual disk/VM level. More on that later.

The user, of course, doesn’t really see or care about any of this. There is a robust GUI with a ton of telemetry available about workload and system performance, and it’s super-easy from a provisioning and ongoing management perspective.

What really caught my attention was their cloud integration. Currently they are AWS-only, and for good reason. Their approach is to create a tight coupling to the cloud being used, following that cloud’s best practices to manage that particular implementation. So the devs at Datrium leverage Lambda and CloudWatch to create, modify, monitor, and self-heal the cloud instance of Datrium (which of course runs in EC2 against EBS and S3). It even applies the IAM roles to the EC2 nodes for you, so that you’re not creating a specific user in AWS- which is best practice, as this method auto-rotates the credentials required to allow access.  It creates all the networking required for the on-prem instances to replicate/communicate with the VPC. It also creates the service endpoints for the VPC to talk to S3. They REALLY thought it through. Once up, a Lambda function runs periodically to make sure things are where they are supposed to be, and fixes them if they’re not. They don’t use CloudFormation, and when asked, they had really good answers why. The average mid-size enterprise user would NEVER (well, hardly ever) have the expertise to do much more than fire up some instances from AMIs in a marketplace, and they’d still be responsible for all the networking, etc.

So I believe that Datrium has thought through not just the technology but HOW it’s used in practice, and gives users (and partners) a deliverable best practice in HCI up front. This is the promise of HCI: the optimal combination of leading technologies with the ease of use that allows the sub-large-enterprise market to extract maximum value from them.

Datrium does have some work ahead of it: they still need to add the ability to restore single files from within virtual guest disks, and after that they need to be able to extract that data for single-record management later, perhaps archiving those records (and being able to search on them) in S3/Glacier, etc.  Once they provide that, they no longer need another technology partner to provide that functionality. Also, the solution doesn’t yet deal with unstructured data (outside of a VM) natively on the storage.

Some folks won’t like that they are AWS only at the moment; I understand this choice as they’re looking to provide the “whole solution” approach and leave no administrative functions to the user. Hopefully they get to Azure soon, and perhaps GCP, but the robust AWS functionality Datrium manages may overcome any AWS objections.

In sum, Datrium has approached the HCI problem from a user experience approach, rather than creating/integrating some technology, polishing a front end to make it look good, and automating important admin features. Someone there got the message that outcomes are what matters, not just the technology, and made sure that message was woven into the fundamental architecture design of the product. Kudos.


NetApp gets the OpEx model right

Ever since the Dot-Com Boom, enterprise storage vendors have had “Capacity on Demand” programs that promised a pay-as-you-use consumption model for storage. Most of these programs met with very limited success, as the realities of the back-end financial models meant that the customers didn’t get the financial and operational flexibility to match the marketing terms.

The main cause of the strain was the requirement for some sort of leasing instrument to implement the program, meaning that there was always some baseline minimum consumption commitment, as well as some late-stage penalty payment if the customer failed to use as much storage as was estimated at the beginning of the agreement. This wasn’t “pay-as-you-use” so much as “just-pay-us-no-matter-what”.

NetApp has recently taken a novel approach to this problem by eliminating the need for equipment title to change from NetApp to the financial entity backing the agreement. With the new NetApp OnDemand, NetApp retains title to the equipment and simply delivers what’s needed.

An even more interesting feature of this program is that the customer pays NOT for storage, but for capacity within three distinct performance service levels, each defined by a guaranteed amount of IOPS/TB, and each with a $/GB/month rate associated with it.

To determine how much of each service level is needed for a given customer, NetApp will perform a free “Service Design Workshop” that uses the NetApp OnCommand Insight (OCI) tool to examine each workload and show what its IO density (IOPS/TB) is. From there, NetApp simply delivers storage designed to meet those workloads (along with consideration for growth, after consulting with the customer). They include the necessary software tools to monitor the service levels (Workflow Automation, OnCommand Unified Manager, and OCI), as well as Premium support and all of the ONTAP features that are available in their Flash and Premium bundles.
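As a rough illustration of the arithmetic (the tier names, IOPS/TB guarantees, and $/GB/month rates below are invented for the example, not NetApp’s actual pricing), classifying a workload is just a matter of measuring its IO density and pricing its capacity at the matching tier:

# Hypothetical service levels: (name, guaranteed IOPS/TB, $/GB/month) -- illustrative only.
service_levels = [
    ("Standard",  128, 0.08),
    ("Premium",   512, 0.15),
    ("Extreme",  2048, 0.25),
]

workloads = [
    # (name, capacity in TB, peak IOPS) -- as measured by a tool like OCI
    ("oracle-prod", 20, 18000),
    ("file-shares", 100, 4000),
]

for name, capacity_tb, iops in workloads:
    density = iops / capacity_tb                      # IO density in IOPS/TB
    # Pick the cheapest tier whose guarantee covers the measured density
    # (assumes the density fits within the top tier).
    tier, _, rate = next(t for t in service_levels if t[1] >= density)
    monthly_cost = capacity_tb * 1024 * rate          # capacity in GB times $/GB/month
    print(f"{name}: {density:.0f} IOPS/TB -> {tier} tier, ~${monthly_cost:,.0f}/month")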

Customers can start as low as $2k/month, and go up AND DOWN with their usage, paying only for what they use from a storage perspective AFTER efficiencies such as dedupe, compression, and compaction are taken into account. More importantly, the agreement can be month-to-month or annual; the shorter the agreement duration, of course, the higher the rate. This is America, after all.

The equipment can sit on the customer premises or in a co-location facility- even a near-cloud facility such as Equinix- making the NetApp Private Storage economics a true match for the cloud compute that will attach to it.

A great use case for NetApp OnDemand is with enterprise data management software, such as Commvault, which can be sold as a subscription as well as by capacity. Since the software is now completely an OpEx, the target storage can be sold with the same financial model- allowing the customer to have a full enterprise data management solution with the economics of SaaS. Further, there would be no need to over-buy storage for large target environments; it would grow automatically as a function of use. The same would be true of any software sold on subscription, making an integrated solution easier to budget for, as there is no need to cross the CapEx/OpEx boundary within the project.

This new consumption methodology creates all sorts of new project options. The cloud revolution is forcing companies such as NetApp to rethink how traditional offerings can be re-spun to fit the new ways of thinking in the front offices of enterprises. In my opinion, NetApp has gotten something very right here.

NetApp + SolidFire…or SolidFire + NetApp?

So what just happened?

First- we just saw AMAZING execution of an acquisition.  No BS.  No wavering.  NetApp just GOT IT DONE, months ahead of schedule.  This is right in line with George Kurian’s reputation for excellent execution.  It mitigates any doubt, any haziness, and gets everyone moving toward their strategic goals.  When viewed against other tech mergers currently in motion, it gives customers and partners comfort to know that they’re not in limbo and can make decisions with confidence.  (Of course, it’s a relatively small, all-cash deal- not a merger of behemoths.)

Second- NetApp just got YOUNGER.  Not younger in age, but younger in technical thought.  SolidFire’s foundational architecture is based on scalable, commodity-hardware cloud storage, with extreme competency in OpenStack.  The technology is completely different from OnTAP, and provides a platform for service providers that is extremely hard to match.   OnTAP’s foundational architecture is based on purpose-built appliances that perform scalable enterprise data services, which now extend to hybrid cloud deployments.  Two different markets.  SolidFire’s platform went to market in 2010, 19 years after OnTAP was invented- and both were built to solve the problems of their day in the most efficient, scalable, and manageable way.

Third – NetApp either just made themselves more attractive to buyers, or LESS attractive, depending on how you look at it.

One could claim they’re more attractive now as their stock price is still relatively depressed, and they’re set up to attack the only storage markets that will exist in 5-10 years, those being the Enterprise/Hybrid Cloud market and the Service Provider/SaaS market.  Anyone still focusing on SMB/MSE storage in 5-10 years will find nothing but the remnants of a market that has moved all of its data and applications to the cloud.

Alternatively, one could suggest a wait-and-see approach to the SolidFire acquisition, as well as to the other major changes NetApp has made to its portfolio over the last year (AFF, AltaVault, cloud integration endeavors, as well as all the things it STOPPED doing). [Side note: with 16TB SSDs coming, look for AFF to give competitors like Pure and XtremIO some trouble.]

So let’s discuss what ISN’T going to happen.

There is NO WAY that NetApp is going to shove SolidFire into the OnTAP platform.  Anyone who is putting that out there hasn’t done their homework to understand the foundational architectures of these two VERY DIFFERENT technologies.  Also, what would possibly be gained by doing so?   In contrast, Spinnaker had technology that could let OnTAP escape from its two-controller, bifurcated storage boundaries.  The plan from the beginning was to use the SpinFS goodness to create a non-disruptive, no-boundaries platform for scalable and holistic enterprise storage, with all the data services that entailed.

What could (and should) happen is for NetApp to add some Data Fabric goodness to the SolidFire product- perhaps this concept is what is confusing the self-described technorati in the web rags.  NetApp re-wrote and opened up the LRSE (SnapMirror) technology so that it could move data among multiple platforms, so this wouldn’t be a deep integration, but rather an “edge” integration; the same is being worked into the AltaVault and StorageGRID platforms to create a holistic and flexible data ecosystem that can meet any need conceivable.

While SolidFire could absolutely be used for enterprise storage, its natural market is the service provider who needs to simply plug and grow (or pull and shrink).  Perhaps there’s a feature or two that the NetApp and SolidFire development teams could share over coffee (I’ve heard that the FAS and FlashRay teams had such an exchange that resulted in a major improvement for AFF), but that can only be a good thing.  However, the integration of the two platforms isn’t in anyone’s interest, and everyone I’ve spoken to at NetApp, both on and off the record, is adamant that NetApp isn’t going to “OnTAP” the SolidFire platform.

SolidFire will likely continue to operate as a separate entity for quite a while, as the sales groups covering service providers are already distinct from the enterprise/commercial sales groups at NetApp.  Since OnTAP knowledge can’t simply be leveraged when dealing with SolidFire, I would expect that existing NetApp channel partners won’t be encouraged to start pushing the SolidFire platform until they’ve demonstrated both SolidFire and OpenStack chops.  I would also expect the reverse to be true; while many of SolidFire’s partners are already NetApp partners, it’s unknown how many have Clustered OnTAP knowledge.

I don’t see this acquisition as a monumental event with immediately demonstrable external impact on the industry or on either company.  The benefits will become evident 12-18 months out and position NetApp for long-term success, vis-à-vis “flash in the pan” storage companies that will find their runway much shorter than expected in the 3-4 year timeframe.  As usual, NetApp took the long view.  Those who see this as a “Hail Mary” to rescue NetApp from a “failed” flash play aren’t understanding the market dynamics at work.  We won’t be able to measure the success of the SolidFire acquisition for a good 3-4 years; not because of any integration that’s required (like the Spinnaker deal), but because the bet is on how the market is changing and where it will be at that point- with this acquisition, NetApp is betting it will be the best-positioned to meet those needs.


Parse.com – R.I.P. 2016 (technically 2017)

Today we witnessed a major event in the evolution of cloud services. 

In 2013, Facebook purchased a cloud API and data management service provider, Parse.com. This popular service served as the data repository and authentication/persistence management backend for over 600,000 applications. Parse.com provided a robust and predictably affordable set of functionalities that allowed the developers of these mobile and web applications to create sustainable business models without needing to invest in robust datacenter infrastructures. These developers built Parse’s API calls directly into their application source code, and this allowed for extremely rapid development and deployment of complex apps to a hungry mobile user base.

Today, less than three short years later, Facebook announced that Parse.com would be shuttered, giving its customers less than a year to move out.

From the outside, it’s hard to understand this decision. Facebook recently announced that it had crossed the $1B quarterly profit mark for the first time, so it’s not reasonable to assume that the Parse group was bleeding the company dry. Certainly the internal economics of the service aren’t well known, so it’s possible that the service wasn’t making Facebook any money, or enough money. But no change in pricing was attempted, and the announcement was rather sudden.

No matter the internal (and hidden) reason, this development provides active evidence of an extreme threat to those enterprises that choose to utilize cloud services not just for hosting of generic application workloads and data storage, but for specific offerings such as database services, analytics, authentication, or messaging- things that can’t be easily moved or ported once internal applications reference these services’ specific APIs.

Why is this threat extreme? 

Note that Facebook is making LOTS of money and STILL chose to shutter this service. Now point your gaze at Amazon or Microsoft and see the litany of cloud services they are offering. Amazon isn’t just EC2 and S3 anymore- you’ve got Redshift and RDS among dozens of other API-based offerings that customers can simply tap into at will. It’s a given that EACH of these individual services requires a group of people to continue development and provide customer support, so each brings with it an ongoing and expensive overhead.

However, it’s NOT a given that each (or any) of these other individual services will provide the requisite profits to Amazon (or Microsoft, IBM, etc) that would prevent the service provider from simply changing their minds and focusing their efforts on more profitable services, leaving the users of the unprofitable service in the lurch. There’s also the very real dynamic of M&A, where the service provider can purchase a technology that would render the existing service (and its expensive overhead) redundant. 

While it’s relatively simple to migrate OS-based server instances and disk/object-based data from one cloud provider to another (there are several tools and cloud offerings that can automate this), it’s another thing entirely to re-write internal applications that directly reference the APIs of these cloud-based data services, and to replicate those services’ functionality. Certainly there are well-documented design patterns that can abstract the API calls themselves; however, migrating to a similar service under a pending shutdown (as Parse.com customers face today) requires the customer to hunt down another service that provides almost identical functionality, and if that’s not possible, the customer will have to get (perhaps back) into the infrastructure game.

Regardless of how the situation is resolved, it forces the developer (and CIO) to re-think the entire business model of the application, as a service shuttering such as this can easily turn the economics of a business endeavor upside-down. This event should serve as a wake-up call for developers thinking of using such services, and force them to architect their apps up-front, utilizing multiple cloud data services simultaneously through API abstraction. Of course, this changes the economics up-front as well. 
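A minimal sketch of that kind of abstraction, with entirely hypothetical class names (this isn’t any particular vendor’s SDK): the application talks to a thin interface, and swapping providers means writing one new adapter instead of re-writing the business logic.

from abc import ABC, abstractmethod


class DataStore(ABC):
    """Thin interface so app code never calls a provider's API directly."""

    @abstractmethod
    def save(self, collection: str, record: dict) -> str:
        ...

    @abstractmethod
    def get(self, collection: str, record_id: str) -> dict:
        ...


class InMemoryStore(DataStore):
    """Stand-in adapter; a real one would wrap a hosted backend's REST API,
    and its replacement would wrap whatever service (or database) you migrate to."""

    def __init__(self):
        self._records = {}

    def save(self, collection, record):
        record_id = f"{collection}:{len(self._records)}"
        self._records[record_id] = record
        return record_id

    def get(self, collection, record_id):
        return self._records[record_id]


def register_user(store: DataStore, email: str) -> str:
    # Business logic depends only on the DataStore interface, not the provider.
    return store.save("users", {"email": email})


print(register_user(InMemoryStore(), "dev@example.com"))   # users:0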

So for all you enterprise developers building your company’s apps and thinking about not just using services and storage in the cloud, but possibly porting your internal SQL and other databases to the service-based data services provided by the likes of Amazon, buyer beware. You’ve just been given a very recent, real-world example of what can happen when you not only outsource your IT infrastructure, but your very business MODEL, to the cloud. Perhaps there are some things better left to internal resources.