🔍 How does the K8stimator work?
The K8stimator only needs an API token with access to "Read metrics (GET requests to Metrics API v2)" and the tenant URL. The program will fetch all PGIs (process group instances) that existed during the specified timeframe in Kubernetes or OpenShift environments and mimic the DPS algorithm stated in our documentation for Pod-hours and GiB-hours.
It runs two queries per iteration: one for the Pod-hours and one for the GiB-hours. The Pod-hours are calculated from all pods that existed in the environment during the specified timeframe; the GiB-hours also need the pods that existed in the environment, but only those where a OneAgent technology existed, basically the pods that were instrumented.
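As an illustration, here is a minimal sketch of those two calls against the Metrics API v2. The endpoint, the Api-Token header and the builtin:tech.generic.mem.workingSetSize metric appear in this document; the tenant placeholders and the pod metric selector are assumptions, not the program's exact queries:

```python
import requests

TENANT = "https://<your-tenant>.live.dynatrace.com"  # tenant URL (placeholder)
TOKEN = "<token-with-metrics.read-scope>"            # "Read metrics" scope only

def query_metrics(metric_selector, time_from, time_to, resolution="1h"):
    """One GET request against the Metrics API v2."""
    resp = requests.get(
        f"{TENANT}/api/v2/metrics/query",
        headers={"Authorization": f"Api-Token {TOKEN}"},
        params={
            "metricSelector": metric_selector,
            "from": time_from,
            "to": time_to,
            "resolution": resolution,
        },
    )
    resp.raise_for_status()
    return resp.json()

# Query 1 - Pod-hours: all pods that existed in the timeframe.
# (Selector is illustrative, not necessarily the one the program uses.)
pods = query_metrics("builtin:kubernetes.pods", "now-1w", "now")

# Query 2 - GiB-hours: memory working set size, restricted to instrumented
# pods (PGIs where a OneAgent technology exists).
mem = query_metrics("builtin:tech.generic.mem.workingSetSize", "now-1w", "now")
```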
The resolution defines the data points, i.e. the existence of a pod within the specified time, basically answering the question: how many times did PGI X live in this timeframe at the specified resolution? For the GiB-hour query, the average size of the memory is given, and this will be rounded up to the next 0.25 GiB, as stated in the algorithm used in our DPS documentation.
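For example, a pod averaging 1.1 GiB of memory counts as 1.25 GiB. A minimal sketch of the rounding (the byte-to-GiB conversion assumes the metric is reported in bytes):

```python
import math

def round_up_quarter_gib(avg_bytes):
    """Round an average memory working set size up to the next 0.25 GiB."""
    gib = avg_bytes / 1024 ** 3
    return math.ceil(gib / 0.25) * 0.25

round_up_quarter_gib(1.1 * 1024 ** 3)  # -> 1.25
```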
Each iteration calculates the sum of Pod-hours and GiB-hours for its timeframe. It also counts the short-living pods to verify the accuracy of the resolution, since the recommended resolution is 1 hour instead of the 15 minutes stated in the documentation. This is done to increase performance (reducing the payload by a factor of 4) and to avoid metrics limitations without impacting the accuracy of the estimation. Normally, with a resolution of 1h in big environments, the share of short-living pods is not bigger than 1%, meaning only 1% of the pods existed a single time (had one data point) during the specified resolution. With this approach it is possible to increase the resolution knowing how big the impact on the accuracy is.

It is good to know that the only impact is a higher consumption, caused by counting more data points (GiB-hours) for short-living pods than there should be, which in the end will not result in unpredicted consumption for the customer. Still, the goal is to be as accurate as possible, and if there is an inaccuracy, it should at least err towards higher consumption and not the other way around.
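A minimal sketch of these per-iteration sums, assuming the query results have already been parsed into per-pod lists of data points (`round_up_quarter_gib` is the helper sketched above):

```python
def iteration_totals(pod_series, mem_series, hours_per_point=1.0):
    """Sum Pod-hours and GiB-hours for one iteration and measure the share
    of short-living pods (series with a single data point)."""
    # Each data point means "this pod existed during this interval".
    pod_hours = sum(len(points) for points in pod_series) * hours_per_point
    # GiB-hours: per instrumented pod, rounded average size times its hours.
    gib_hours = sum(
        len(points) * round_up_quarter_gib(avg_bytes) * hours_per_point
        for points, avg_bytes in mem_series
    )
    short_living = sum(1 for points in pod_series if len(points) == 1)
    return pod_hours, gib_hours, short_living / max(len(pod_series), 1)
```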
At the end of all the iterations, the total amount of Pod-hours and GiB-hours is summed up. A daily average is also calculated and a yearly estimation is derived, assuming of course that the dynamics of the environment stay linear throughout the whole year.
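The extrapolation itself is plain linear arithmetic, roughly:

```python
def yearly_estimate(total_pod_hours, total_gib_hours, days_queried):
    """Daily average of the summed totals, projected linearly over a year."""
    return (
        total_pod_hours / days_queried * 365,  # Pod-hours per year
        total_gib_hours / days_queried * 365,  # GiB-hours per year
    )
```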
⚠️ I saw a warning in the iterations! What do I do?
A warning means that the data is capped due to the amount of data fetched. This will lead to inaccurate results. It can be caused by the length of the iteration and/or the resolution. Also, the older the data is, the higher the aggregation on the metrics is, and the more trouble the server has fetching data at a higher resolution.
Try to adapt the "from timeframe" to be as close to today as possible. If we are in March 2024, I'd recommend using the date 2024-02-01 and iterating 4 to 8 times for 7 days per iteration with a resolution of 1 hour, meaning you'll fetch the data for one or two months.
Be careful that "from_timeframe" + "iterations" × "days per iteration" does not surpass today.
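A minimal sketch of that sanity check (the function name and error handling are illustrative):

```python
from datetime import date, timedelta

def check_timeframe(from_date, iterations, days_per_iteration):
    """Reject configurations whose last iteration would end in the future."""
    end = from_date + timedelta(days=iterations * days_per_iteration)
    if end > date.today():
        raise ValueError(f"timeframe ends in the future: {end}")
    return end

check_timeframe(date(2024, 2, 1), 4, 7)  # ends 2024-02-29 -> fine in March 2024
```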
❌ I get an error 503 Service Unavailable within the warning of the query! What do I do?
If you get an error like the following:
:warning: There was an error: Query [builtin:tech.generic.mem.workingSetSize] status : code:503 reason:Service Unavailable Content:b'{"error": {"code": 503,"message": "We've run into technical difficulties. Don't worry though - your data is secure and we've logged this issue. Please try again in 5 minutes."}}'
Most likely the cluster has a lot of Kubernetes entities (in the millions or billions) and the query is timing out in the backend. There is a workaround though: take the last 7 or 14 days and make an estimate like the following:
Resolution: 1h
From timeframe: today - 14 days -> if today is 15 April 2024, then you enter: 2024-04-01
Iterations: 14
Days per iteration: 1
With the above settings, you query the server at a resolution that is acceptable and accurate, you get the last 14 days, and from this the program will estimate the average consumption and the yearly consumption. If you need assistance, please don't hesitate to contact Sergio.
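As a quick check of the "today - 14 days" arithmetic:

```python
from datetime import date, timedelta

# If today is 15 April 2024, the "from timeframe" becomes 2024-04-01:
(date(2024, 4, 15) - timedelta(days=14)).isoformat()  # -> '2024-04-01'
```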
Warning: do not calculate iterations of 1 day with a resolution of 1 day, since the estimation will not be accurate and will result in higher consumption.
🔁 Why do we need an iterative approach?
In order to calculate and fetch data from big Kubernetes or OpenShift environments with thousands, hundreds of thousands or even millions of pods, we need to split the timeframe with an iterative approach. Especially because, to be sure that the fetched PGIs were actually pods where the OneAgent existed, we need to merge the PGIs to their services, which by nature is a one-to-many relationship. We do this with an API :parent operation. With this assertion, the queried entities easily explode into the millions; hence a customizable iterative approach that can be adapted to the size and dynamics of any environment is the most appropriate.
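A minimal sketch of that timeframe splitting (names are illustrative; each yielded window would then be fed to the two queries per iteration described above):

```python
from datetime import date, timedelta

def iteration_windows(from_date, iterations, days_per_iteration):
    """Split the overall timeframe into per-iteration query windows."""
    for i in range(iterations):
        start = from_date + timedelta(days=i * days_per_iteration)
        yield start, start + timedelta(days=days_per_iteration)

# Four 7-day windows covering February 2024:
list(iteration_windows(date(2024, 2, 1), 4, 7))
```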
🌐 Does it work with Managed Clusters?
Yes, this program is currently running in a ReplicaSet in a GKE cluster with no connection to our corporate VPN, so the Mission Control connection details are hidden. When running inside the VPN, a Solution Engineer can establish a connection to the Managed Cluster, enter the connection details (which are hidden if not inside the VPN) and then make the same query. An alternative way is to run the container as a standalone Docker container within our corporate VPN. For more guidance contact Sergio.
🐛 I opened a new tab and the Estimator is still running, is this a bug?
No, this is by design, for various reasons. Querying the Dynatrace clusters for this type of data can take multiple seconds or even minutes per query. If the user is not patient (which we are not by default), we tend to send a new request. This behaviour would overload the application, making it fail for all other users and queries as well. The program is distributed across various pods on the backend. The NGINX proxy uses sticky sessions to link the user session to the pod running the query and calculating the consumption. This way the user can see the progress of the query and we avoid an overload of requests. The application is multithreaded and can manage multiple requests at the same time. In case you need to run another query at the same time, just open another browser or an incognito window.
ℹ️ More information about Kubernetes Observability and Application Observability consumption
💻 What's the techstack of the K8stimator?
The K8stimator was born as a simple Python script that turned into a somewhat full-fledged Python app, including Flask web technology and caching for asynchronous management of data. If you want to contribute, that's great! Any help is welcome.
Take a look at the source code; here is the GitHub repository:
https://github.com/sergiohinojosa/dynatrace-kubernetes-licensing-estimation