Slide 2 – Intro, Background, Approach
What makes me qualified to give this presentation? For the last 8 years I’ve been migrating applications to both AWS and Azure. In addition to moving from on-prem to cloud, I’ve also been containerizing applications using Docker and Kubernetes as well as setting up DevSecOps teams and practices.
Let's recap the problem statement. We have a legacy application that we’re planning on moving to the cloud. The questions we’re trying to answer are:
Does it even make sense to migrate?
What are the benefits, what are the pitfalls?
Can we put controls in place to minimize risk?
What are those controls?
And ultimately, how do we measure success?
My approach for the risks - which I’ve split into people, process, and technology - is to present the questions I’d be asking the project team. Mentioned within those questions are some of the artifacts and controls we can put in place to mitigate those risks.
I’ll follow that with a list of benefits and finally some KPIs - Key Performance Indicators.
Slide 3 - Risks - People
We’re starting with the most important risk area - people, specifically internal staff.
Do all internal stakeholders understand the “why”? Have the benefits of a migration been explained to the project team? Do THEY know why we’re doing this?
Are staff convinced that this is the right move? Is there support and enthusiasm for the implementation? Has internal staff been involved in the decision making process? Does everyone feel a sense of ownership because of that participation? In an ideal situation, we want a transformational mentality where the majority of the team feels that achieving the goal is the reward.
Is there a RACI matrix? Is it clear who’s Responsible, Accountable, Consulted, and Informed? Having that documented helps avoid confusion over roles and responsibilities.
Have pulse checks been conducted? Have we analyzed what those survey results are telling us? I like using SurveyGizmo / Alchemer. I ask questions like: Are we moving in the right direction? Is there a better way of doing things? What don’t you like? What do you like? In my experience, the more input the team has, the better the outcome.
Does staff have the competency to complete all tasks? Do they have the experience? Do Network Engineers, Information Security Officers, Architects, and DevOps staff have the requisite skills and qualifications?
Is there a skills matrix to help identify resourcing gaps? Is there a hiring strategy? Have we found the best people for the job?
What about learning? Has staff received training? The right training? Does the infrastructure team know about the AWS Well-Architected Framework? Do the architects and development staff understand the differences between CloudFormation, CloudFront, CloudTrail, and CloudWatch? Are staff up to date with their cloud training? Have they attended relevant conferences? Have they watched cloud-migration-specific YouTube or LinkedIn Learning videos?
Are succession plans in place? Do we have an up-to-date knowledge base? Change makes many staff nervous. Uncertainty breeds fear. There’s a heightened risk of attrition – especially among application SMEs, network engineering, and operations staff. Some may start looking for other jobs. Are we prepared for the departure of staff with institutional knowledge? Are employee retention strategies in place?
Slide 4 - Risks - Process
Do we have a robust risk management methodology? Do we have risk rankings? Do we understand vulnerabilities, threats, and exposure? Do we currently follow the Identify -> Evaluate -> Prioritize -> Mitigate process? Have we thought about impacts to the business ecosystem - which covers social, technological, legal, cultural, environmental, political, and economic aspects? For example, the flow of information and directives for a few of the federal projects I’ve worked on has varied depending on the administration.
Are we really agile? Is there a cadence? Will there be an MVP (minimum viable product) every 2 weeks – or however long the cadence is? Are we measuring velocity? Do we understand why the velocity might be changing from sprint to sprint?
What does the end-to-end SDLC process look like? Where are requirements being stored? Are we tracking tasks in a system like Jira? Is there clear traceability from requirements to code to deployments? What’s our Continuous Integration and Continuous Deployment process - especially in the cloud? Have we thought about using GitLab over Bamboo or Jenkins? Is there separation of duties – will the developer writing the code also be deploying it to production? What controls are in place (perhaps peer reviews and pull requests)? Are we using testing frameworks? Have we automated tests using software like Selenium? Do we conduct performance testing to verify application behavior under normal conditions? Do we understand the upper limit of the system by conducting load testing? Are stress tests conducted to observe what happens under high load for a long period of time - perhaps a DDoS attack simulation?
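To make the Selenium point concrete, here’s a minimal smoke-test sketch in Python. The URL and element IDs are placeholders I’ve made up for illustration, not anything from this project.

```python
# Hypothetical smoke test for the migrated app's login page.
# URL and element IDs are placeholders; adjust to the real page under test.
from selenium import webdriver
from selenium.webdriver.common.by import By

def test_login_page_loads():
    driver = webdriver.Chrome()  # assumes a local Chrome/chromedriver setup
    try:
        driver.get("https://app.example.com/login")  # placeholder URL
        assert "Login" in driver.title
        driver.find_element(By.ID, "username").send_keys("qa-user")
        driver.find_element(By.ID, "password").send_keys("not-a-real-password")
        driver.find_element(By.ID, "submit").click()
        assert driver.find_element(By.ID, "dashboard").is_displayed()
    finally:
        driver.quit()
```

Run the same test against the legacy and migrated environments so regressions show up in the pipeline rather than in production.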
Do we have all our legacy documentation: Requirements, Architecture (solutions and enterprise), Entity Relationship Diagrams, User documentation, SOPs, Operations Manual? Are they being updated regularly? When was the last time they were updated?
Are we QAing our processes? Are we auditing the QA? I’ve seen issues where QA reviewers drop the ball – no one’s flawless.
How effective is our Incident Management process? Do we have a call tree system in place? How are communications handled for both the legacy and the migrated application? What’s our current level of support? Do we have, or have we purchased, a corresponding or better support plan with AWS? Is ChatOps a part of our strategy?
Do we have a BCP - Business Continuity Plan? How are we going to keep things operational? Do we need High Availability? What about Disaster Recovery? How do we restore data? What is our RTO - Recovery Time Objective - which tells us how quickly a restore needs to complete? What is our RPO - Recovery Point Objective - which tells us how much data we can afford to lose? For example, an RPO of one hour means backups or replication must capture data at least every hour. Are SLAs - Service Level Agreements - in place for each component, especially if different groups are working on different services?
The quote's from Patrick Ness.
Slide 5 - Risks - Technology
Is there a single point of failure? For each service and component have we determined whether redundancy is needed? What if all of AWS is down or perhaps just their EC2 service? What would our financial losses look like? Do we need a multi-cloud strategy? Perhaps a cold standby on Azure or Google Cloud with data being replicated to both on-prem and the other cloud providers?
Have we really tested our failovers? We should set a primary node to unhealthy and test what happens.
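As one way to run that drill - assuming a Multi-AZ RDS database, which is my assumption rather than something in the current design - a forced failover can be triggered during an agreed test window:

```python
# Controlled failover drill for a Multi-AZ RDS instance (placeholder identifier).
# Run only in a test window, then watch error rates and reconnect times.
import boto3

rds = boto3.client("rds")
rds.reboot_db_instance(
    DBInstanceIdentifier="legacy-app-db",  # placeholder instance name
    ForceFailover=True,                    # promotes the standby in another AZ
)
```

The point isn’t the API call; it’s confirming that the application actually recovers within the RTO we’ve committed to.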
Have we provisioned virtual private clouds, subnets, gateways, and security groups correctly? Are we using APIs to reduce the risk of database exposure?
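As a sketch of what “provisioned correctly” can look like at the security group level - the IDs and the PostgreSQL port are placeholders, not our actual design - the database tier should only accept traffic from the application tier’s security group:

```python
import boto3

ec2 = boto3.client("ec2")

vpc_id = "vpc-0123456789abcdef0"     # placeholder VPC
app_sg_id = "sg-0aaaaaaaaaaaaaaaa"   # placeholder app-tier security group

# Database tier security group with no public ingress at all.
db_sg = ec2.create_security_group(
    GroupName="legacy-db-sg",
    Description="Database tier - only the app tier may connect",
    VpcId=vpc_id,
)
ec2.authorize_security_group_ingress(
    GroupId=db_sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,  # assuming PostgreSQL; adjust for the real engine
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": app_sg_id}],  # no 0.0.0.0/0 rule
    }],
)
```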
Are we using Web Application Firewalls? Are we protected against ransomware? Are we backing up to physical storage and sending data to Iron Mountain? How quickly can we get a new account created and configured – especially if our existing account is taken over by a hacker? It’s easier to spin up an environment if we have CloudFormation templates or use something like Terraform which lets us deploy Infrastructure as Code.
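Here’s a rough sketch of what rebuilding from a template might look like with boto3, assuming the CloudFormation template already lives in S3 (the stack name, bucket, and parameter are placeholders):

```python
import boto3

cfn = boto3.client("cloudformation")

# Recreate the environment from a versioned template (placeholder URL).
cfn.create_stack(
    StackName="legacy-app-rebuild",
    TemplateURL="https://s3.amazonaws.com/example-bucket/legacy-app.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "recovery"}],
)
# Block until the stack finishes before cutting traffic over.
cfn.get_waiter("stack_create_complete").wait(StackName="legacy-app-rebuild")
```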
Are we doing tabletop exercises? Is everyone clear on their roles in case of an emergency? What is our privacy policy? Is data encrypted both in transit and at rest - including data in backups, read replicas, and snapshots? Do we have compliance checklists? 508 compliance is important - especially for federal sites. Is the site accessible for people with disabilities? Are we looking at color contrast, text alternatives for components, headings and labels, and focus order? How do we protect PII – personally identifiable information? Do we have a data governance plan? When I’m hosting federal portals and websites, I need an Authority to Operate, and depending on the information sensitivity level - let’s assume moderate (serious adverse effect) - approximately 236 controls need to be audited. See NIST 800-53.
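For the encryption-at-rest question, one small, checkable control is default encryption on the buckets that hold backups and exports. This is a sketch with a placeholder bucket name and a hypothetical KMS key alias:

```python
import boto3

s3 = boto3.client("s3")

# Enforce KMS encryption by default on a backup bucket (names are placeholders).
s3.put_bucket_encryption(
    Bucket="legacy-app-backups",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/legacy-app-data",  # hypothetical key alias
            }
        }]
    },
)
```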
How are we handling monitoring and logging? Do we have a unified solution like Datadog or New Relic? Do we use Site24x7? Are we making it easy for our operations staff to monitor both on-prem and AWS instances?
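Whatever the tooling, the basic building block on the AWS side is an alarm wired to a notification. A minimal sketch - the instance ID and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page operations when average CPU stays above 80% for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="legacy-app-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```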
Licensing is something I’ve seen teams overlook. What are the licensing implications, especially when we’re using open source software? If you fork the code or make a derivative work, do you have to release it for free? Do we know where to go for support? Are there Discord or Slack channels? Are there forums to chat with the open source developers? Have we reviewed the AMIs (Amazon Machine Images) we’re using? Licensing fees are often built into the per-hour instance cost. What are the dependencies and pitfalls if we need to upgrade a specific library within that AMI? Do we understand end-of-life and support dates for every component we’re using? CentOS 6, for example, stopped getting updates in November of last year, and many teams haven’t upgraded to 7 or 8 yet.
There are horror stories of leaving instances up and racking up charges – only to find out a month later about owing 30K or more. Do we understand the cost estimates? Especially with multiple environments – INT, UAT, STG, PRD? Do we have strategies in place to cut costs (shutting down unneeded servers after hours; auto-scaling down)? Have we considered long-term EC2 contracts and pricing? Reserved instances offer about a 60% discount over on-demand instances.
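The after-hours shutdown is easy to automate. Here’s a sketch of a nightly job (it could run as a scheduled Lambda) that stops running instances carrying an agreed tag - the tag convention here is my assumption, not an existing standard:

```python
import boto3

ec2 = boto3.client("ec2")

# Find running non-production instances tagged for office-hours-only operation.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Schedule", "Values": ["office-hours"]},  # assumed tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
```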
Slide 6 - Benefits
Why IS everyone moving to the cloud? I see 4 major benefits.
First is scale and elasticity. With a good architecture, the application will automatically scale up and down based on usage without the need for any operations staff. Let’s say an application looks up policies. Whether you’re querying just one policy or 83 million, a service like AWS Elastic Beanstalk lets you handle that type of load without a whimper. With the use of CDNs – content delivery networks – you can serve static content across the country with lower latency, so users won’t wait as long for content to load. There are caching techniques available as well to speed up some of the more popular API calls.
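As a sketch of the “no operations staff needed” point: with an Auto Scaling group in place (the group name here is a placeholder), a single target-tracking policy keeps capacity in line with demand:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU across the group near 50% by adding or removing instances.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="policy-lookup-asg",  # placeholder group name
    PolicyName="target-50-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```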
From a cost savings standpoint, we no longer have to worry about data center upkeep. Transfer the risk and cost of maintaining a data center to AWS - who have this down to a science. Cost containment is easily implemented: if a cost threshold is hit, we can send alerts and even start scaling down if necessary.
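A budget alert is a few lines of setup. This sketch assumes a $5,000 monthly limit and a placeholder email address - both are examples, not agreed numbers:

```python
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
budgets = boto3.client("budgets")

# Email the team when actual monthly spend crosses 80% of a $5,000 limit.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "legacy-app-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},  # example amount
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [
            {"SubscriptionType": "EMAIL", "Address": "cloud-team@example.com"}  # placeholder
        ],
    }],
)
```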
Innovation becomes easier. It takes less than 15 minutes to provision a LAMP stack. You don’t have to wait for your Operations team to find the resources or spin up a new physical server to add to your VM cluster. In AWS you literally click a few buttons and your instances are online. You spend more time working on what really matters.
Some ease-of-management examples: a single configuration file can spin up all your instances; a billing interface shows which services are consuming the most budget; and one-click backups (zero-click if you set up AWS Backup) provide disaster recovery peace of mind.
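For the backup point, here’s a sketch of a nightly AWS Backup plan - the schedule, retention, and vault are placeholders; resources would be attached afterwards via a tag-based backup selection:

```python
import boto3

backup = boto3.client("backup")

# Nightly backups at 5 AM UTC, retained for 35 days (all values are examples).
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "legacy-app-nightly",
        "Rules": [{
            "RuleName": "nightly",
            "TargetBackupVaultName": "Default",
            "ScheduleExpression": "cron(0 5 * * ? *)",
            "Lifecycle": {"DeleteAfterDays": 35},
        }],
    }
)
```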
Slide 7 - Measures of Success
I think the most important measure of success is CSAT - customer satisfaction. How satisfied is the customer? Are they happier? Is the application faster and easier to use? Are there fewer disruptions? Better uptime? Better load speeds? Has the overall user experience changed in any way? Are there metrics on Net Promoter Score (NPS) – how likely the application or brand is to be recommended to a friend or colleague? Be sure to keep an eye on those app ratings and reviews.
What's the business impact? Is there increased usage of the application? Are there higher engagement and re-engagement percentages?
We need to understand baseline costs. For on-prem, what are the costs related to hardware, replacement, networking, wiring, electricity, cooling, storage, and insurance? Include labor costs. Compare that with monthly AWS billing, staffing, and third party tools and consultants.
When it comes to system design, we should be thinking about lower error rates, higher availability, lower latency, better response times, higher throughput, and fewer time-outs.
Infrastructure teams will be happy about optimizing and fine-tuning servers based on CPU, memory, and I/O reads/writes. Right-sizing our instances should be a priority.
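A quick way to gather right-sizing evidence is pulling a couple of weeks of utilization per instance. This is a sketch only - the instance ID is a placeholder, and memory metrics would need the CloudWatch agent:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Two weeks of hourly average CPU for one instance (placeholder ID).
# Consistently low peaks suggest a smaller instance type.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)
peak = max((point["Average"] for point in stats["Datapoints"]), default=0)
print(f"Hourly average CPU peaked at {peak:.1f}% over the last 14 days")
```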
That concludes the deck. Any questions?