How do you escape from Good Enough Island?
Most people start with a fresh look at their strategy, architecture, and software portfolio. They do a whiteboard exercise and design a cloud-centered, standards-based, open source, SaaS, PaaS, IaaS, NaaS, …everything-as-a-service environment. That’s the fun part.
Then they map the “as is” to the “to be” and do a gap analysis. Then they craft a roadmap and budget. Then they jump off a bridge-as-a-service. The challenge in getting to Google performance levels is not in strategy or architecture, but in untangling the interconnections among all the systems your enterprise relies on today. Everything is connected to everything else.
So you focus on creating a set of clean interfaces for every application. You implement a service-oriented architecture using an enterprise service bus from IBM or TIBCO. That works pretty well.
So you look for opportunities to carve out whole sections of your infrastructure and put them in the cloud, and you rewrite big chunks in Java. And that works …well, not so well? Better than you thought? Not as well as the legacy system?
If the answer is “not well” or “not as well as the system it replaces,” then you have a problem. Of course, you have a problem anyway: while you are migrating pieces to the cloud, everything else has to continue working. It is sometimes difficult to figure out where to go, but it is always difficult to get there.
That’s where people get stuck. The tyranny of the present can make implementing a mid- or long-range plan difficult. So you are left with three options: start from scratch, continue skipping down the cloud roadmap, or expend more resources improving your current systems.
So you, like your peers at every other company, and all the generations that came before you, do the only logical thing: you do all three at the same time. So here you are, while Google marches on. It’s really not as bad as that, but the challenges remain. What do you migrate? When do you migrate? How do you ensure that the new is better than the old? How do you keep the old performing well for as long as possible? Everyone faces the same decisions. After all, that’s why they hired a smart person like you to figure it all out. There is one overarching rule to consider no matter which course of trade-offs you decide to follow:
What cannot be measured, cannot be improved.
Without monitoring, measuring, tracking, isolating, and analyzing performance issues, decisions about what to migrate, what to retain, and how to improve response times are just guesses. That is where enterprise application performance monitoring (APM) software comes in. Enterprise APM software has a different design center than language-based performance monitoring tools. To succeed in the enterprise, APM software must help you do at least ten things:
- Measure the real experience of all end users
- Track complete end-to-end transactions across all hops
- Isolate performance problems down to the required level of detail
- Report information in a way that can be used by different constituencies
- Monitor new, legacy, and commercial application software
- Monitor browser, rich client, Citrix, and mobile-based applications
- Maintain the business context of all transaction data
- Dynamically map and monitor heterogeneous and ever-changing topologies
- See through middleware to understand all aspects of performance
- Provide a single version of the truth on which everyone can agree
Let’s discuss each of these in more detail.
1. Measure the real experience of all end users

While Google sets people’s expectations about performance, your employees, partners, customers, and prospects also measure your success on a daily basis. As a result, you need to understand the real performance they experience at their desks, at home, or on the move.

One challenge with most performance management solutions is that they measure only averages or samples of user experience. The reason is simple. Capturing information about every user is data intensive and, in the case of SaaS solutions, communications intensive. Capturing, indexing, correlating, and analyzing all this information is a Google-size task. It is far easier to rely on statistical analysis. That can be a problem.

Enterprises are, by nature, large and distributed. There may be employees and customers around the globe, and the average performance of an application is meaningless in these environments. While it is important to understand performance on a grand scale, it is the experience of each individual user that matters. If response time in New York is half a second and response time in New Delhi is five and a half seconds, then you have achieved your service level agreement (SLA) objective of a three-second application response time. Tell that to the people in New Delhi. Even 99% SLA compliance against a 95% requirement doesn’t matter if the users affected are your most important customers.

Samples and averages also fail to capture the business impact of intermittent problems. Every senior IT executive has lived through a middle-of-the-night performance problem that starts, stops, and then starts again. There are two problems here. First, it is hard to find the fault and fix it. The second is equally important, but more subtle: relying on statistical analysis alone increases the probability that a user discovers the problem before your IT staff does, which automatically raises the temperature of the incident whether it deserves it or not.
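To see how a blended average hides this, here is a minimal, self-contained Java sketch. The regional timings are hypothetical, echoing the New York and New Delhi numbers above; no real monitoring API is implied.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal sketch: why a global average hides per-region pain.
// The sample values below are hypothetical.
public class RegionalResponseTimes {
    public static void main(String[] args) {
        Map<String, double[]> samplesByRegion = new LinkedHashMap<>();
        samplesByRegion.put("New York",  new double[] {0.4, 0.5, 0.5, 0.6});
        samplesByRegion.put("New Delhi", new double[] {5.2, 5.5, 5.6, 5.7});

        double slaSeconds = 3.0;
        double total = 0;
        int count = 0;
        for (Map.Entry<String, double[]> e : samplesByRegion.entrySet()) {
            double regionAvg = Arrays.stream(e.getValue()).average().orElse(0);
            total += Arrays.stream(e.getValue()).sum();
            count += e.getValue().length;
            System.out.printf("%-9s avg = %.1fs  SLA met: %b%n",
                    e.getKey(), regionAvg, regionAvg <= slaSeconds);
        }
        // The blended average says the three-second SLA is met...
        System.out.printf("Global    avg = %.1fs  SLA met: %b%n",
                total / count, total / count <= slaSeconds);
        // ...while every user in New Delhi waits more than five seconds.
    }
}
```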
2. Track complete end-to-end transactions across all hops

Your e-commerce or CRM system is slowing down. Users are experiencing unacceptable wait times. Some transactions terminate unexpectedly and your help desk is starting to complain. Where do you begin?

In a networked world, Murphy’s Law has two corollaries. If anything can go wrong, it will. At the worst possible moment. For the least obvious reason. That is why it is so important to monitor all transactions across their complete journey. Sampling or averaging lacks the fidelity to capture performance in all its complexity.

Take the obvious case. You have a service level agreement in place that guarantees a two-second response time for your e-commerce site. In a ten-minute period, there are 1,000 transactions. Nine hundred of them take one second each and 100 take six seconds apiece. Have you met the terms of your SLA, and what is the average response time over the period? The answers: “Yes” and “1.5 seconds.”
Why are people so unhappy?
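Here is that arithmetic worked out in a short Java sketch. The percentile calculation uses one common convention and is illustrative, not any particular product’s method.

```java
import java.util.Arrays;

// A worked version of the SLA example above: 900 one-second
// transactions and 100 six-second ones in a ten-minute window.
public class SlaArithmetic {
    public static void main(String[] args) {
        double[] latencies = new double[1000];
        Arrays.fill(latencies, 0, 900, 1.0);    // 900 fast transactions
        Arrays.fill(latencies, 900, 1000, 6.0); // 100 slow transactions

        double avg = Arrays.stream(latencies).average().orElse(0);
        Arrays.sort(latencies);
        double p95 = latencies[(int) Math.ceil(0.95 * latencies.length) - 1];

        // avg = 1.5s: the two-second SLA looks satisfied on paper.
        System.out.printf("average = %.1fs%n", avg);
        // p95 = 6.0s: one user in ten waited three times the SLA.
        System.out.printf("95th percentile = %.1fs%n", p95);
    }
}
```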
Tracking every transaction enables you to understand performance in all its complexity, but only if you can see the complete transaction. In our e-commerce example, groups of transactions may differ. Some simply go from the web server to the application server and then query a database. Some take the extra step of connecting to the CRM system or billing software. Some check an inventory or ERP system. Transactions can be complex, touching multiple applications, systems, middleware, and databases. Understanding performance and resolving problems means you need to see everything.
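One common way to make the complete journey visible is to tag every hop with a shared correlation ID. The sketch below illustrates the idea; the class, method, and tier names are made up for the example, not a real product API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// A minimal sketch of hop-by-hop transaction tracking: every tier a
// transaction touches records a span tagged with one shared
// correlation ID, so the full path can be reassembled later.
public class TransactionTrace {
    record Span(String correlationId, String tier, long startMs, long endMs) {}

    private final String correlationId = UUID.randomUUID().toString();
    private final List<Span> spans = new ArrayList<>();

    void record(String tier, long startMs, long endMs) {
        spans.add(new Span(correlationId, tier, startMs, endMs));
    }

    void print() {
        System.out.println("Transaction " + correlationId + ":");
        for (Span s : spans) {
            System.out.printf("  %-12s %4d ms%n", s.tier(), s.endMs() - s.startMs());
        }
    }

    public static void main(String[] args) {
        TransactionTrace t = new TransactionTrace();
        // One e-commerce request that fans out across four tiers.
        t.record("web server", 0, 40);
        t.record("app server", 40, 180);
        t.record("CRM", 180, 1900);     // the slow hop stands out
        t.record("database", 1900, 1960);
        t.print();
    }
}
```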
3. Isolate performance problems down to the required level of detail
When problems arise, how do you resolve them? Most large companies have many forms of performance monitoring, log aggregation, and testing software. The network team has its tools, the database team theirs, developers have several of their own, and the same is true throughout the data center. When problems arise, each team member reverts to a comfortable toolkit.
Dozens and sometimes hundreds of IT people rush to their cubicles and begin analyzing the problem. Development tools, log aggregators, network analyzers, security scanners, transaction monitors, and many more all go into operation. Whew, it’s not my fault!
What is missing is a common methodology to monitor, track, isolate, and analyze the problem as a team. This is one area where APM tools in general, and Enterprise APM software in particular, shine.
The typical process is to view the affected topology, identify which users and transactions are having problems, isolate the components that are impacted, and drill into them to identify the line of code, SQL statement, or other technology at fault.
In the enterprise, this approach is more difficult for two reasons. First, there are more technologies that may be at fault. In our e-commerce example, a simple transaction may touch a few components; a complex one, dozens or more. Second, what level of isolation granularity is enough? In some instances, it may just be a network segment that is unavailable; in others, there may be contention on the enterprise service bus; and, of course, the application code itself may be at fault.
Diving deep to find the problem is only as good as the technologies that can be seen and the level at which they can be monitored.
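As a rough illustration of that isolation step, the sketch below ranks hypothetical component timings by their share of a slow transaction’s total response time, pointing the team at the right place to drill next. The components and numbers are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A sketch of the isolation step: given where a slow transaction
// spent its time, rank components by their share of the total so
// the team drills into the right tier first.
public class IsolateBottleneck {
    public static void main(String[] args) {
        Map<String, Long> msByComponent = new LinkedHashMap<>();
        msByComponent.put("network segment", 35L);
        msByComponent.put("service bus",     110L);
        msByComponent.put("app code",        240L);
        msByComponent.put("SQL: ORDER join", 4100L);

        long total = msByComponent.values().stream()
                .mapToLong(Long::longValue).sum();
        msByComponent.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.printf("%-16s %5d ms (%4.1f%%)%n",
                        e.getKey(), e.getValue(),
                        100.0 * e.getValue() / total));
        // The output makes the next drill-down obvious: the SQL
        // statement accounts for most of the response time.
    }
}
```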
4. Report information in a way that can be used by different constituencies
Since everyone cherishes their own troubleshooting tool, it is difficult to get them to use something else. The focus should instead be on ensuring that everyone shares the same data, receives it in a timely fashion, and can make sense of what they are seeing.
This implies a hierarchy of reporting tools: real-time data flows, dashboards, and periodic reports.
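One way to picture that hierarchy is a fan-out from a single stream of measurements, as in this illustrative sketch. The three sinks are stand-ins for real integrations, not an actual reporting API.

```java
import java.util.List;
import java.util.function.Consumer;

// A sketch of "one data set, many constituencies": the same
// measurement fans out to a real-time feed, a dashboard aggregate,
// and a periodic report, so every team reads identical numbers.
public class ReportingFanout {
    record Measurement(String transaction, double seconds) {}

    public static void main(String[] args) {
        Consumer<Measurement> realTimeFeed =
                m -> System.out.println("[live]      " + m);
        Consumer<Measurement> dashboard =
                m -> System.out.println("[dashboard] rolled into 5-min window: " + m);
        Consumer<Measurement> weeklyReport =
                m -> System.out.println("[report]    queued for weekly summary: " + m);

        List<Consumer<Measurement>> sinks =
                List.of(realTimeFeed, dashboard, weeklyReport);
        Measurement m = new Measurement("checkout", 2.7);
        sinks.forEach(sink -> sink.accept(m)); // same datum, three audiences
    }
}
```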
5. Monitor new, legacy, and commercial application software
In the mythical world we described in Section 1, every application is new, hosted elsewhere, and managed by someone else. In the real world, business processes are implemented with some combination of new, legacy, and commercial applications. Understanding performance across these technologies is of key importance.
This can be a challenge for APM software tools that use bytecode insertion as the sole means to monitor applications. That approach is ill-suited to commercial applications and to more traditional development languages like C and C++.
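For readers unfamiliar with the technique, this minimal sketch shows how bytecode insertion hooks into the JVM through the standard java.lang.instrument agent interface; a real monitor would rewrite classfileBuffer with a bytecode library such as ASM. The point is the blind spot: only JVM classes ever reach this hook.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// A sketch of why bytecode insertion is JVM-bound: a Java agent's
// premain hook can intercept every class the JVM loads, and
// nothing else.
public class MonitoringAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Only classes loaded by the JVM ever reach this point.
                // C, C++, and packaged binaries never pass through here,
                // which is the blind spot the text describes.
                System.out.println("could instrument: " + className);
                return null; // null means: leave the bytecode unmodified
            }
        });
    }
}
// Launched with: java -javaagent:monitoring-agent.jar -jar app.jar
// (the agent jar needs a Premain-Class entry in its manifest)
```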
6. Monitor browser, rich client, Citrix, and mobile-based applications
Since enterprise IT must support new, legacy, and commercial applications, it must also manage a variety of endpoints. Most APM tools capture the user experience on browsers and mobile devices.
What about rich clients, Citrix, and terminals? They will ultimately be replaced by browsers, right? Maybe, but not tomorrow, or next year. The last rich client will probably linger long after the last mainframe is retired. Understanding enterprise performance will require support for multiple endpoint technologies for many years to come.
7. Maintain the business context of all transaction data
Transactions are more than bytes traversing a network. To understand performance, it is important to capture the business context of each action. Who initiated the transaction? Was it a person or a program? What was its business purpose? Is it a complete transaction or a segment of another one? This can be a challenge for APM tools that rely on network “sniffing.” Monitoring network flows from one system to another is important, but the value is in the details.
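A minimal sketch of what that business context might look like as a data structure; the field names are illustrative, not any product’s schema.

```java
import java.time.Instant;
import java.util.Optional;

// A sketch of carrying business context alongside the wire-level
// data: who started the transaction, whether the initiator was
// human, its business purpose, and a link to a parent transaction
// when it is only a segment of a larger one.
public class BusinessContext {
    record TransactionContext(
            String initiator,            // user ID or calling system
            boolean humanInitiated,      // a person, or a program?
            String businessPurpose,      // e.g. "renew policy"
            Optional<String> parentTxId, // present if this is a segment
            Instant startedAt) {}

    public static void main(String[] args) {
        TransactionContext ctx = new TransactionContext(
                "jsmith", true, "renew policy",
                Optional.empty(), Instant.now());
        System.out.println(ctx);
    }
}
```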
8. Dynamically map and monitor heterogeneous and ever-changing topologies
Your network is dynamic. New servers are being added and old ones are being retired. Software is constantly being installed and updated to new versions. Network connections between systems can change daily. Managing all this manually is a non-starter. It is important for performance management tools to automatically detect the topology and capture changes as they take place.
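The core of automatic change detection can be as simple as diffing discovery snapshots, as in this sketch; the host names are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

// A sketch of automatic change detection: diff two discovery
// snapshots to surface servers that appeared or were retired,
// instead of maintaining the topology map by hand.
public class TopologyDiff {
    public static void main(String[] args) {
        Set<String> yesterday = Set.of("web-01", "app-01", "app-02", "db-01");
        Set<String> today     = Set.of("web-01", "app-02", "app-03", "db-01");

        Set<String> added = new HashSet<>(today);
        added.removeAll(yesterday);       // app-03 came online
        Set<String> retired = new HashSet<>(yesterday);
        retired.removeAll(today);         // app-01 was decommissioned

        System.out.println("added:   " + added);
        System.out.println("retired: " + retired);
    }
}
```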
9. See through middleware to understand all aspects of performance
TIBCO, IBM Integration Bus, Oracle Tuxedo, and other middleware components are critical to enterprise IT organizations. With new, legacy, and commercial applications in place, it is important to have a central place to manage integrations and workflows. Service-oriented architectures and other approaches to governance mean that middleware is here to stay. This can cause a problem for some performance monitoring tools. It is not acceptable simply to detect middleware components and treat them as a black box.
Enterprise APM software understands the complete journey of a transaction, not just to the middleware layer, but through it to another application or the end-user.
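One common way a monitor sees through a bus is to stamp a correlation ID into the message headers on the producing side and read it back on the consuming side, so the two halves of the journey join into one transaction. This sketch fakes the bus with an in-memory map standing in for TIBCO, IIB, and the like.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// A sketch of "seeing through" a message bus: a correlation ID
// travels in the message headers across the hop, so spans recorded
// on both sides link up instead of ending at a black box.
public class ThroughTheBus {
    public static void main(String[] args) {
        // Producer side: outgoing message with a monitoring header.
        Map<String, String> headers = new HashMap<>();
        String correlationId = UUID.randomUUID().toString();
        headers.put("X-Correlation-Id", correlationId);
        headers.put("payload", "orderCreated:4711");

        // ...the message crosses the bus with headers intact...

        // Consumer side: the same ID links this hop to the first one.
        String idOnArrival = headers.get("X-Correlation-Id");
        System.out.println("producer span id: " + correlationId);
        System.out.println("consumer span id: " + idOnArrival);
        System.out.println("joined: " + correlationId.equals(idOnArrival));
    }
}
```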
10. Provide a single version of the truth on which everyone can agree
Every piece of software has a design center. It can be created for development, operations, network, database, or many other IT groups. Most APM products are designed for developers or network professionals. These are key players in any IT organization. The challenge is that they each view the world slightly differently and represent only part of the organization as a whole.
Perhaps the most important requirement of Enterprise APM is to provide a set of data that all constituencies can rely on. This is not to say that the data and the associated dashboards, deep-dive, and analytics capabilities need to be all things to all people.
Each IT group may have its own monitoring and diagnostic software that is uniquely suited to its role in the organization. Still, there must be one tool, one set of data, and one version of the truth that everyone can rely on. Its design center is to unite disparate IT groups and provide an objective view of performance as a whole.
About the Author
Yossi Shirizli is the Director of R&D at Correlsense, with over 16 years of hands-on managerial experience in software development. He joined the Correlsense core team as a Server Team Leader and managed and developed the server of the company’s flagship product through its evolving generations, from the first prototype to its big-data, enterprise-scale version. He now leads all R&D activities across a group of three teams.