AWS Backup Service – should you jump in?

I’ve heard before that if you see a title ending in a question mark, the answer is most likely going to be “no”.

Unfortunately, I’m not going to break any traditions here.

Background: AWS announced today that they’ve added support for EC2 backups in their backup service.  AWS Backup now has backup options for Amazon EBS volumes, Amazon Relational Database Service (RDS) databases, Amazon DynamoDB tables, Amazon Elastic File System (EFS), Amazon EC2 instances, and AWS Storage Gateway volumes.

But if you’re expecting this to be like the backup/recovery solutions you’ve run on-prem, you’re in for a rude awakening.  It’s not. This has the scent of a minimum viable product, which I’m sure they’ll build on and make compelling one day, but that day isn’t today.

It’s VERY important that you read the fine print on how the backups occur on the different services covered, and more importantly how you restore, and at what granularity you’re restoring.

First- from a fundamental architectural perspective- backups are called “recovery points”.  That’s an important distinction. We’ve seen the recovery point term co-habitate with “snapshot”. In fact, EBS “backups” are just that- snapshots.

So for EC2 and EBS-related backups (oops, “Recovery Points”), you’re simply restoring a snapshot into a NEW resource. Want to restore a single file or directory in an EC2 instance or in a filesystem on your EBS volume? Nope. All or nothing. Or, you’ll restore from the RP into a new resource, and copy the needed data back into your live instance. I’m sorry, that’s just not up to today’s expectations in backup and recovery.
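
For illustration, here’s roughly what kicking off one of those restores looks like with boto3. This is just a sketch; the ARNs and metadata keys below are placeholders you’d pull from your own backup vault and the AWS docs for the resource type you’re restoring:

# Rough sketch of restoring an EBS "recovery point" with boto3's AWS Backup client.
# The ARNs and metadata are placeholders -- the point is that start_restore_job
# always materializes a NEW volume; there is no file-level or in-place restore.
import boto3

backup = boto3.client("backup")

response = backup.start_restore_job(
    RecoveryPointArn="arn:aws:ec2:us-east-1::snapshot/snap-0123456789abcdef0",  # placeholder
    IamRoleArn="arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole",  # placeholder
    Metadata={"encrypted": "false"},  # restore parameters vary by resource type
    ResourceType="EBS",
)
print(response["RestoreJobId"])  # poll this job; the result is a brand-new volume

Once that job finishes, you still have to attach the new volume somewhere and copy whatever you actually wanted back into your live instance yourself.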

What about EFS? Well, glad you asked. This is NOT a snapshot. There is ZERO CONSISTENCY in the “recovery point” for EFS: the backup in this case doesn’t snapshot anything, it iterates through the files, and if you change a file DURING a backup, there is a 100% chance that your “recovery point” won’t be a “point” at all, so you could break dependencies in your data. Yet they still call the copy of this data a “recovery point”. Give them props for doing incremental-forever here, but most file backup solutions (when paired with enterprise NAS systems or even just Windows) know how to quiesce the filesystem and stream their backups from that frozen view, rather than from the volatile “live” filesystem.  Also, if you want to do a partial recovery, you cannot do it in place; it goes to a new directory located at the root of your EFS.

The BIGGEST piece missing from the AWS Backup Service is something we’ve learned to take for granted from B/R solutions: a CATALOG.  You need to know what you want to restore AND where to find it in order to recover it. With EFS, this can get REALLY dicey. It’s really easy to choose the wrong data; perhaps it’s a good thing they don’t allow you to restore in place yet!

Look, I applaud AWS for paying some attention to data protection here. This does shine a light on the fact that AWS data storage architecture lends itself to many data silos that require a single pane of glass (SPOG) to manage effectively and compliantly. However, there is a (very short) list of OEM B/R and data management vendors that can do this effectively, not just within AWS but across clouds, and still give you the content-aware granularity you need to execute your complex data retention and compliance strategies and keep you out of trouble.

So many organizations are rushing to the cloud; make sure that you’re paying adequate attention to your data protection and compliance as you go. You’ll find that while the cloud providers are absolutely amazing at providing a platform for application innovation and transformation, data governance, archive, and protection are not necessarily getting the same level of attention from them. It’s up to YOU to protect that data and your business.

 

 

Short thoughts on Project Nautilus (VMWare Fusion tech preview)

I’m going to be installing this puppy tonight I think. After reading the VMware blog on it here I do have concerns/questions about some of the things it brings to the table.

First, containers are going to be running in their own “PodVM” or pod, which is going to create all sorts of confusion when they bring Kubernetes to the table (as the article says they are going to do), as in K8s “pods” refer to one or a group of containers that are instantiated together and run as an application on a single host.  So in that case, a pod would be a group of containers that run in their own…pods.  I strongly suggest that the really smart folks at VMware find a different name for this construct, even though all the cool ones may already be taken. (“LiteVM”, maybe? “MiniVM”? or just “space”?)

Second- they’ve done something interesting with networking here. In Docker, if you want your container to talk to the network, you need to map the container’s ports to the localhost (hostport:containerport), and this needs to be explicitly stated when you start your container.
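
For reference, here’s what that explicit mapping looks like via the Docker SDK for Python; a quick sketch, equivalent to `docker run -p 8080:80 nginx` on the CLI, with nginx used only as an example image:

# Sketch: publish container port 80/tcp on host port 8080 using the Docker SDK
# for Python (pip install docker). Equivalent to `docker run -p 8080:80 nginx`.
import docker

client = docker.from_env()

# Without this explicit ports mapping, nothing on the host can reach the
# container via localhost.
container = client.containers.run(
    "nginx",                 # example image
    detach=True,
    ports={"80/tcp": 8080},  # containerport -> hostport
)
print(container.name)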

With Nautilus, when you start your container, it gets automatically added to a VMnet, so out of the box you’ll get an IP on the NAT’d network and your local machine can reach the container WITHOUT any explicit exposure or mapping of ports; everything looks like it’s open on that IP address, which is no longer the localhost.  If you add it to a bridged network, the LAN will give the container an IP via DHCP, and any listening ports will be available.   (If I’m wrong here, I’ll correct this ASAP.)

Now, one of the things I REALLY like about apps being deployed on K8s is that you’re FORCED to explicitly state what ports the container will be allowed to communicate on. This dramatically reduces the attack surface and forces the developers and engineers to be much more aware of how their apps are using network resources. I’m sure (hoping) there will be other ways of locking down the containers that get IPs from the VMnets, but as it looks like they won’t be locked down by default, I’m fearful that the quick and dirty way will lead to less security.

I’m looking forward to playing with this, in particular seeing how it works with things like PVCs, and other pipeline, testing, and integration toolsets.  I know it’s just the desktop version and it’s VERY new, but I have a hunch that at least some of the lessons learned are going to end up in Pacific.

[Off-Topic] Automatically set your iPad to “Do Not Disturb” when you open your Kindle (or other reader) app

One of the things I dislike about reading on my iPad is that there are so many distractions that can break your concentration, like iMessages, Facebook Messenger, Twitter notifications, etc. Now sure…you can swipe down and just tap the crescent moon and turn on Do Not Disturb.

But I forget to do that. Every time. Why can’t it just turn on DND automatically when it knows I’m reading??

Well…it can. And, it’s REALLY EASY. Here’s a step-by-step showing you how to do this using the Shortcuts app. I’ve written these instructions for beginners, so if you know what you’re doing you’ll fly through this quickly; I apologize for the very specific instructions.

1) Open up Shortcuts and click on “Automation” on the bottom center of the screen. Click on “Create Personal Automation.”

2) In the “New Automation” screen, choose “Open App” and choose Kindle (or your reader); this automation will trigger when you open the app.

3) Choose “Add Action” so we can tell Shortcuts what we want done.

4) In the search bar, type “Do Not Disturb”, and you’ll see it listed towards the bottom. Click the “Do Not Disturb” in the results.

5) At the bottom of the following screen, turn OFF the “Ask before running” so that you don’t have to acknowledge to Shortcuts that you actually want this done every time you bring up Kindle.

6) Click “Done” and that’s it! Make sure DND is OFF, launch Kindle, and you’ll get a pull-down notification saying your automation has run. Check DND, it should be on!

Note- If you want to disable DND….that’s on you. 😉 Pull down from the top right of the screen and tap the moon.

VMWare Workstation – DUP! packet issue resolved…sort of

I was getting VERY frustrated with some networking issues with my virtual guests in VMW 15.5 (and prior), on Windows 10.  See below:

dup-packets.png

If you look, you’ll see that for every ping request I’m sending to my gateway (or ANY other IP address outside the Windows host), I’m getting FOUR RESPONSES. This also manifests itself in *very* slow downloads for packages or updates I’m installing on the VMs.  And, it’s just wrong, so it needed fixing.

Note that the standard Google answer to this issue is to stop and disable the Routing and Remote Access Service.  The first time this happened, this solved the problem! There were a ton of other ‘solutions’ out there, but none really understood the problem: Windows was creating some sort of packet amplification. (When I have time I’m going to reinstall pcap and dig into this.)

But then….months later….

It came back.  I hadn’t re-enabled routing and remote access. I hadn’t made any networking changes inside the host or on my network.   I HAD done some other stuff, such as enabling the Windows Subsystem for Linux and installing Ubuntu for bash scripting purposes.  You know…messing around. Some of this could’ve re-written the bindings and orders of networks/protocols/services etc., but if so, it wasn’t reflected anywhere in the basic or advanced network settings. VERY frustrating!

I deleted a TON of stuff I’d installed that I no longer needed (which had to be done anyway, but I was saving that for New Year’s). I re-installed the VMware bridge protocol. I repaired VMware Workstation. I REMOVED and re-installed VMware Workstation.

Here’s what finally RE-solved the problem:

  • I RE-ENABLED RRAS (!)
  • I went into the properties of “Incoming Connections” in Network Adapter Settings and UNCHECKED IPv4, leaving IPv6 checked. (I’m not sure if this matters; try it without this step first.)
  • I RE-DISABLED RRAS (!)

And…here’s the result.

non-dup-packets.png

I can only surmise that the act of STOPPING RRAS reconfigures the network stack so that it no longer amplifies packets. And you can’t stop a service unless it’s already started, right?

Makes complete sense.

NOT.

But, all’s well that ends.

VMWare Workstation 15 REST API – Control power state for multiple machines via Python

Or..How to easily power up/suspend your entire K8s cluster at once in VMWare Workstation 15

In VMWare Workstation 15, VMWare introduced the REST API, which allows all sorts of automation.  I was playing around with it, and wrote a quick Python script to fire up (or suspend) a bunch of machines that I listed in an array up top in the initialization section. In this case, I want to control the state of a 4-node Kubernetes cluster, as it was just annoying me to click on the play/suspend 4 times (I have other associated virtual machines as well, which only added to the annoyance.)

Your REST API exe (vmrest.exe) MUST be running if you’re going to try this. If you haven’t set that up yet, stop here and follow these instructions. You’ll notice that vmrest.exe normally runs as an interactive user-mode application, but I’ve now set up the executable to run as a service on my Windows 10 machine using NSSM; I’ll have a separate blog entry to show how that’s done.

Some notes on the script:

  • Script Variables – ip/host:port (you need the port, as vmrest.exe gives you an ephemeral port number to hit), machine list, and authCode
  • Regarding the authCode: WITH vmrest.exe running, go to “https://ip_of_vmw:port” to get the REST API explorer page (shown below). Click “Authorization” up top, and you’ll get to log in. Use the credentials you used to set up the VMW REST API via these instructions.

Screen Shot 2019-12-24 at 3.05.04 PM.png

Then do a “Try it out!” on any GET method that doesn’t require variables, and your Auth Code will appear in the Curl section in the “Authorization” header. Grab that code; you’ll use it going forward.
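
That header is just standard HTTP Basic auth, so (assuming I’m reading the explorer output right) the “code” is simply the base64 of username:password, which you can also generate yourself:

# Sketch: the Authorization header is standard HTTP Basic auth, so the auth
# code is just base64("username:password"). The credentials below are
# placeholders for whatever you configured when setting up vmrest.exe.
import base64

username = "your-vmrest-username"   # placeholder
password = "your-vmrest-password"   # placeholder

authCode = base64.b64encode(f"{username}:{password}".encode()).decode()
print(authCode)   # goes into the "Authorization: Basic <authCode>" header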

curl.png

Here’s the script, with relatively obvious documentation. Since more than likely your SSL for the vmrest.exe API server will use a self-signed, untrusted certificate, you’re probably going to need to ignore any SSL errors that will occur. That’s what the “InsecureRequestWarning” stuff is all about: we disable the warnings.  My understanding is that the disabled state is reset with every request made, so we need to re-disable it before every REST call.

I’ve posted this code on GitHub HERE.

#!/usr/bin/env python3
import sys

import requests
import urllib3
from urllib3.exceptions import InsecureRequestWarning

'''Variable Initiation'''

ip_addr = 'your-ip-or-hostname:Port'  # change ip:port to what the VMW REST API is showing
machine_list = ['k8s-master', 'k8s-worker1', 'k8s-worker2', 'k8s-worker3']
authCode = 'yourAuthCode'

'''Section to handle the script arg'''

acceptable_actions = ['on', 'off', 'shutdown', 'suspend', 'pause', 'unpause']

try:
    sys.argv[1]
except IndexError:
    # No action given on the command line; default to powering on
    action = "on"
else:
    if sys.argv[1] in acceptable_actions:
        action = sys.argv[1]
    else:
        print("ERROR: Action must be: on, off, shutdown, suspend, pause, or unpause")
        sys.exit(1)

'''Section to get the list of all VMs'''

urllib3.disable_warnings(category=InsecureRequestWarning)

resp = requests.get(url='https://' + ip_addr + '/api/vms',
                    headers={'Accept': 'application/vnd.vmware.vmw.rest-v1+json',
                             'Authorization': 'Basic ' + authCode},
                    verify=False)

if resp.status_code != 200:
    # something fell down
    print("Status Code " + str(resp.status_code) + ": Something bad happened")
    sys.exit(1)

result_json = resp.json()

'''Go through the entire list and, if the VM is in machine_list, act!'''

for todo_item in result_json:
    current_id = todo_item['id']
    current_path = todo_item['path']

    for machine in machine_list:
        if current_path.find(machine) > -1:
            print(machine + ': ' + current_id)
            urllib3.disable_warnings(category=InsecureRequestWarning)
            current_url = 'https://' + ip_addr + '/api/vms/' + current_id + '/power'
            resp = requests.put(current_url,
                                data=action,
                                headers={'Content-Type': 'application/vnd.vmware.vmw.rest-v1+json',
                                         'Accept': 'application/vnd.vmware.vmw.rest-v1+json',
                                         'Authorization': 'Basic ' + authCode},
                                verify=False)
            print(resp.text)
            # Better exception handling should be written here, of course.


**12/27/19 NOTE!** – I’ve noticed what I believe to be a bug in VMW 15.5 where if you control power state via the REST API, you lose the ability to control the VM via the built-in VMWare console in the app.  The VMs behave fine (assuming everything else is working), but for some reason the VMW app doesn’t attach the console process correctly.  If you want to follow this issue I’ve submitted to the community here.

Use Google Cloud Functions (Python) to modify a GKE cluster

I wanted to create a way to easily “turn on” and “turn off” a GKE cluster, via an HTTP link that I could bookmark and hit, even from my iPhone. With GKE, if you set your node pool size to zero, you’re not incurring any charge, since Google doesn’t hit you on the master nodes. So I wanted to easily set the pool size up and down.  Sure, I could issue a “gcloud container” command, or set up Ansible to do it (which I will do, since I want to automate more stuff), but I also wanted to get my feet wet with Cloud Functions and GCP APIs.

In Google Cloud Functions, you simply write your functional code in the main file (main.py) AND include the correct dependencies in the requirements.txt file (for Python).  That dependency is represented by the same name as the module you’d use in a “pip install”.  The module for managing GKE is “google-cloud-container”.

Now one of the great things about using Cloud Functions is that authorization for all APIs within your project “just happens”.  You don’t need to figure out OAuth2 or use API keys.  You just need to write the code. If you’re going to use this Python code outside of Cloud Functions, you’d need to add some code for that and set an environment variable to point to the secret JSON file for the appropriate service account for your project.

Here’s sample code to change your GKE Cluster node pool size.

import google.cloud.container

def startk8s(request):
    client = google.cloud.container.ClusterManagerClient()
    projectID = '<your-project-id>'
    zone = 'us-east1-d'          # your zone, obviously
    clusterID = '<your-cluster-name>'
    nodePoolID = 'default-pool'  # or your pool name
    client.set_node_pool_size(projectID, zone, clusterID, nodePoolID, 3)
    return "200"
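
And since the whole point was to be able to turn the cluster off too, the “off” version is the same thing with a pool size of zero. Here’s a sketch, using a hypothetical stopk8s function name and the same placeholder IDs:

import google.cloud.container

def stopk8s(request):
    # Same pattern as startk8s above, but scale the pool to zero nodes so the
    # cluster stops incurring node charges. All names are placeholders.
    client = google.cloud.container.ClusterManagerClient()
    projectID = '<your-project-id>'
    zone = 'us-east1-d'
    clusterID = '<your-cluster-name>'
    nodePoolID = 'default-pool'
    client.set_node_pool_size(projectID, zone, clusterID, nodePoolID, 0)
    return "200"

You’d deploy that as a second function (or key off a query parameter in a single function) and bookmark both trigger URLs.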

You need to set the name of the Function you want triggered:

execute

Notice the import statement- “google.cloud.container”.  Now you can’t exactly “pip install” into a Cloud Function; it’s not your Python instance! That’s where the requirements.txt file comes in.  (There’s a version of that for node.js – package.json, since you can’t npm install either). Here’s the sample requirements.txt file:

# Function dependencies, for example:
# package>=version
google-cloud-container

Note that the package version seems to be optional.  My code works without it.

You can test the cloud function by clicking on the “testing” sub-menu.

 

Live Blog (a little late) – NetApp Insight Keynote Day 2

Sorry I’m late, I was….ermm….detained. ;) Going to live blog this from the green room backstage!

8:44a

Did I just hear “NetApp Kubernetes Services”???? Who the hell is this NetApp??

With all this automation, Anthony Lye’s group is answering the question: “This Data Fabric thing sounds all wonderful, but HOW do I utilize it without a Ph.D. in NetApp and the various cloud providers?”

So the room I’m in has some cool people in it at the moment: Henri Richard, Dave Hitz, Joel Reich, Kim Weller, and a bunch of other really smart folks!

Wait- FREEMIUM MODEL??? Somebody actually talked to the AppDev folks.

OK time for Cloud Insights. James Holden on stage.

8:52a

Cloud Insights is GA! Again, free trial!

Lots of focus on using performance and capacity data to save money

“All that power and nothing to install.” – Anthony Lye

8:56a

Time to talk about Hybrid Cloud

Brad Anderson , SVP and GM Cloud Infrastructure Group

Hybrid Cloud Infrastructure – If it talks like a cloud, and walks like a cloud…(then it’s not a cloud because they neither walk nor talk.)

Seamless access to all the clouds and pay-as-you-grow.

“Last year it was just a promise, today hundreds of customers are enjoying the benefit of hybrid cloud computing. ”

Consultel Cloud – from Australia! Why are so many of the cool NetApp customers from down under? Dave Hitz says to me that companies in Australia are very forward leaning in regards to technology.

These guys are leveraging Netapp HCI to provide agile cloud services to their base, with great success. They “shatter customer expectations”.

100% Netapp HCI across the globe. Got common tasks done 68% faster. Using VMWare. They looked at other solutions, they already had SolidFire experience so that probably helped.

50% cost savings over former storage platform (but…weren’t they Solidfire before this? Maybe something else too?)

So Netapp has made cloud apps a TON easier – and letting them run wherever you want. This has been the dream that the marketing folks have been talking about for years, made real.

9:30a – Joel Reich and my friend Kim Weller up there to talk about the future of Hybrid Cloud.

In the future most data will be generated at the edge, processed in the cloud.

Data Pipeline – Joel Reich, a self-proclaimed “experienced manager” will use Kim’s checklist

Snapmirror from Netapp HCI to the cloud.

Octavian looking like DOC OC! He has a “mobile data center” on his BACK. Running NetApp Select! MQTT protocol to Netapp Select (for connected devices)

Netapp automating the administration for setting up a FabricPool. You don’t have to be an NCIE to do this. Nice.

FlexCache is back and it’s better! Solves a major problem for distributed read access of datasets.

Netapp Data Availability Services – now this is something a TON of users will find valuable.

9:51 – Here’s what I was waiting for – MAX DATA.

“It makes everything faster”.

Collab with Intel – Optane persistent memory.

Will change the way your datacenter looks.

11X – MongoDB accelerated 11X vs same system without it.

NO application rewrites! In the future they will make your legacy hardware faster.

In the future will work in the cloud.

Looking forward to more specifics here. Wanted to see a demo. But we’ll see it soon enough.