
Merge branch 'develop' of https://github.com/kubecost/cost-model into develop

Ajay Tripathy 5 years ago
Parent
Commit
b0d5ed190f

+ 3 - 1
PROMETHEUS.md

@@ -57,8 +57,10 @@ sum(node_total_hourly_cost) * 730
 | node_gpu_hourly_cost | Hourly cost per GPU on this node  |
 | node_ram_hourly_cost   | Hourly cost per GB of memory on this node                       |
 | node_total_hourly_cost   | Total node cost per hour                       |
+| kubecost_load_balancer_cost   | Hourly cost of a load balancer                 |
+| kubecost_cluster_management_cost | Hourly management fee per cluster                 |
+| pv_hourly_cost   | Hourly cost per GB on a persistent volume                 |
 | container_cpu_allocation   | Average number of CPUs requested/used over last 1m                      |
 | container_gpu_allocation   | Average number of GPUs requested over last 1m                      |
 | container_memory_allocation_bytes   | Average bytes of RAM requested/used over last 1m                 |
 | pod_pvc_allocation   | Bytes provisioned for a PVC attached to a pod                      |
-| pv_hourly_cost   | Hourly cost per GP on a persistent volume                 |

+ 2 - 0
README.md

@@ -25,6 +25,8 @@ Here is a summary of features enabled by this cost model:
 You can deploy Kubecost on any Kubernetes 1.8+ cluster in a matter of minutes, if not seconds. 
 Visit the Kubecost docs for [recommended install options](https://docs.kubecost.com/install). Compared to building from source, installing from Helm is faster and includes all necessary dependencies. 
 
+> If you want to deploy the cost model in Prometheus exporter-only mode, check out [kubecost-exporter.md](kubecost-exporter.md).
+
 ## Contributing
 
 We :heart: pull requests! See [`CONTRIBUTING.md`](CONTRIBUTING.md) for information on building the project from source

+ 117 - 0
kubecost-exporter.md

@@ -0,0 +1,117 @@
+# Running Kubecost as a Prometheus metric exporter
+
+Running Kubecost as a Prometheus metric exporter allows you to export various cost metrics to Prometheus without setting up any other Kubecost dependencies. Doing so lets you write PromQL queries to calculate the cost and efficiency of any Kubernetes concept, e.g. namespace, service, label, deployment, etc. You can also calculate the cost of different Kubernetes resources, e.g. nodes, PVs, LoadBalancers, and more. Finally, you can do other interesting things like create custom alerts via AlertManager and custom dashboards via Grafana. 
+
+## Installing
+
+> Note: all deployments of Kubecost function as a Prometheus metric exporter. We strongly recommend installing with Helm to take advantage of Kubecost’s full potential. [View recommended install](http://docs.kubecost.com/install).
+
+If you would prefer to not use the recommended install option and just deploy the Kubecost open source cost model as a metric exporter, you can follow these steps:
+
+
+1. Apply the combined YAML:
+
+    ```
+    kubectl apply -f https://raw.githubusercontent.com/kubecost/cost-model/develop/kubernetes/exporter/exporter.yaml --namespace cost-model
+    ```
+
+    > If you want to use a namespace other than `cost-model`, you will have to edit the `ClusterRoleBinding` after applying the YAML to change `subjects[0].namespace`. You can do this with `kubectl edit clusterrolebinding cost-model`.
+
+2. To verify that metrics are available:
+
+    ```
+    kubectl port-forward --namespace cost-model service/cost-model 9003
+    ```
+
+    Visit [http://localhost:9003/metrics](http://localhost:9003/metrics) to see exported metrics
+
+Add the Kubecost scrape config to Prometheus ([more info](https://prometheus.io/docs/introduction/first_steps/#configuring-prometheus)):
+```
+- job_name: cost-model
+  scrape_interval: 1m
+  scrape_timeout: 10s
+  metrics_path: /metrics
+  scheme: http
+  static_configs:
+    - targets: ['cost-model.cost-model.:9003']
+```
+
+Done! Kubecost is now exporting cost metrics. See the following sections for different metrics available and query examples.
+
+## Available Prometheus Metrics 
+
+| Metric       | Description                                                                                            |
+| ------------ | ------------------------------------------------------------------------------------------------------ |
+| node_cpu_hourly_cost | Hourly cost per vCPU on this node  |
+| node_gpu_hourly_cost | Hourly cost per GPU on this node  |
+| node_ram_hourly_cost   | Hourly cost per GB of memory on this node                       |
+| node_total_hourly_cost   | Total node cost per hour                       |
+| kubecost_load_balancer_cost   | Hourly cost of a load balancer                 |
+| kubecost_cluster_management_cost | Hourly management fee per cluster                 |
+| container_cpu_allocation   | Average number of CPUs requested over last 1m                      |
+| container_memory_allocation_bytes   | Average bytes of RAM requested over last 1m                 |
+
+By default, all cost metrics are based on public billing APIs. See the Limitations section below about reflecting your precise billing information. Supported platforms are AWS, Azure, and GCP. For on-prem clusters, prices are based on configurable defaults. 
+
+More metrics are available in the recommended install path and are described in [PROMETHEUS.md](PROMETHEUS.md).
+
+## Dashboard examples
+
+Here’s an example dashboard using Kubecost Prometheus metrics: 
+
+![sample dashboard](https://grafana.com/api/dashboards/8670/images/5480/image)
+
+You can find other example dashboards at https://grafana.com/orgs/kubecost
+
+## Example Queries
+
+Once Kubecost’s cost model is running in your cluster and you have added it to your Prometheus scrape configuration, you can run useful Prometheus queries like these:
+
+#### Monthly cost of all nodes
+
+```
+sum(node_total_hourly_cost) * 730
+```
+
+#### Hourly cost of all load balancers broken down by namespace
+
+```
+sum(kubecost_load_balancer_cost) by (namespace)
+```
+
+#### Monthly rate of each namespace’s CPU request
+
+```
+sum(container_cpu_allocation * on (node) group_left node_cpu_hourly_cost) by (namespace) * 730
+```
+
+#### Historical memory request spend for all `fluentd` pods in the `kube-system` namespace
+
+```
+avg_over_time(container_memory_allocation_bytes{namespace="kube-system",pod=~"fluentd.*"}[1d])
+  * on (pod,node) group_left
+avg(count_over_time(container_memory_allocation_bytes{namespace="kube-system"}[1d:1m])/60) by (pod,node)
+  * on (node) group_left
+avg(avg_over_time(node_ram_hourly_cost[1d] )) by (node)
+```
+
+
+## Setting Cost Alerts
+
+Custom cost alerts can be implemented with a set of Prometheus queries and can be used for alerting with AlertManager or Grafana alerts. Below are example alerting rules. 
+
+#### Determine in real-time if the monthly cost of all nodes is > $1000
+
+```
+sum(node_total_hourly_cost) * 730 > 1000
+```
+
+## Limitations
+
+Running Kubecost in exporter-only mode by definition limits functionality. The following limitations of this install method are addressed by the [recommended install path](http://docs.kubecost.com/install).
+
+- Persistent volume metrics are not available (coming soon!)
+- For large clusters, these Prometheus queries might not scale well over large time windows. We recommend using [Kubecost APIs](https://github.com/kubecost/docs/blob/master/apis.md) for these scenarios.
+- Allocation metrics, like `container_cpu_allocation`, only contain _requests_ and do not take usage into account.
+- Related to the previous point, efficiency metrics are not available.
+- Public billing costs are used by default. The standard Kubecost install with a cloud integration gives you accurate pricing based on your bill.
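
Note on the "Monthly rate of each namespace’s CPU request" query in kubecost-exporter.md: the `* on (node) group_left` join can be hard to read at first. The Go sketch below restates what that query computes, under a few assumptions. The type `cpuAllocation`, the function `monthlyCPUCostByNamespace`, and the sample prices are hypothetical and exist only for illustration; they are not part of cost-model. Each allocation sample is priced with its node's hourly CPU cost, summed per namespace, and scaled by 730 hours per month. Samples whose node has no price are dropped, as PromQL does with unmatched series.

```go
package main

import "fmt"

// hoursPerMonth mirrors the 730 factor used in the example queries
// (8760 hours per year divided by 12 months).
const hoursPerMonth = 730

// cpuAllocation is a hypothetical stand-in for one container_cpu_allocation
// sample, reduced to the labels the query uses.
type cpuAllocation struct {
	Namespace string
	Node      string
	CPUs      float64
}

// monthlyCPUCostByNamespace joins each allocation to its node's hourly CPU
// price, sums the hourly cost per namespace, then converts to a monthly rate.
func monthlyCPUCostByNamespace(allocs []cpuAllocation, nodeCPUHourlyCost map[string]float64) map[string]float64 {
	hourly := make(map[string]float64)
	for _, a := range allocs {
		price, ok := nodeCPUHourlyCost[a.Node]
		if !ok {
			continue // no matching node price: the sample is dropped
		}
		hourly[a.Namespace] += a.CPUs * price
	}

	monthly := make(map[string]float64, len(hourly))
	for ns, cost := range hourly {
		monthly[ns] = cost * hoursPerMonth
	}
	return monthly
}

func main() {
	allocs := []cpuAllocation{
		{"default", "node-a", 2},
		{"default", "node-b", 1},
		{"kube-system", "node-a", 0.5},
	}
	prices := map[string]float64{"node-a": 0.5, "node-b": 0.25}

	costs := monthlyCPUCostByNamespace(allocs, prices)
	fmt.Println(costs["default"], costs["kube-system"])
}
```

With these made-up inputs, `default` accrues 1.25 per hour and `kube-system` 0.25 per hour before the monthly scaling is applied.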

+ 1 - 0
pkg/cloud/azureprovider.go

@@ -243,6 +243,7 @@ type AzureServiceKey struct {
 	ServiceKey     *AzureAppKey `json:"serviceKey"`
 }
 
+
 // Validity check on service key
 func (ask *AzureServiceKey) IsValid() bool {
 	return ask.SubscriptionID != "" &&

+ 2 - 1
pkg/cloud/provider.go

@@ -18,7 +18,8 @@ import (
 )
 
 const authSecretPath = "/var/secrets/service-key.json"
-const storageConfigSecretPath = "/var/secrets/azure-storage-config.json"
+const storageConfigSecretPath = "/var/azure-storage-config/azure-storage-config.json"
+
 
 var createTableStatements = []string{
 	`CREATE TABLE IF NOT EXISTS names (

+ 8 - 23
pkg/costmodel/clusters/clustermap.go

@@ -1,8 +1,8 @@
 package clusters
 
 import (
+	"context"
 	"fmt"
-	"math/rand"
 	"strings"
 	"sync"
 	"time"
@@ -10,6 +10,7 @@ import (
 	"github.com/kubecost/cost-model/pkg/log"
 	"github.com/kubecost/cost-model/pkg/prom"
 	"github.com/kubecost/cost-model/pkg/thanos"
+	"github.com/kubecost/cost-model/pkg/util/retry"
 
 	prometheus "github.com/prometheus/client_golang/api"
 )
@@ -120,33 +121,17 @@ func (pcm *PrometheusClusterMap) loadClusters() (map[string]*ClusterInfo, error)
 	}
 
 	// Execute Query
-	tryQuery := func() ([]*prom.QueryResult, prometheus.Warnings, error) {
+	tryQuery := func() (interface{}, error) {
 		ctx := prom.NewContext(pcm.client)
-		return ctx.QuerySync(clusterInfoQuery(offset))
+		r, _, e := ctx.QuerySync(clusterInfoQuery(offset))
+		return r, e
 	}
 
-	var qr []*prom.QueryResult
-	var err error
-
 	// Retry on failure
-	delay := LoadRetryDelay
-	for r := LoadRetries; r > 0; r-- {
-		qr, _, err = tryQuery()
-
-		// non-error breaks out of loop
-		if err == nil {
-			break
-		}
-
-		// wait the delay
-		time.Sleep(delay)
+	result, err := retry.Retry(context.Background(), tryQuery, uint(LoadRetries), LoadRetryDelay)
 
-		// add some random backoff
-		jitter := time.Duration(rand.Int63n(int64(delay)))
-		delay = delay + jitter/2
-	}
-
-	if err != nil {
+	qr, ok := result.([]*prom.QueryResult)
+	if !ok || err != nil {
 		return nil, err
 	}
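
Note on the hunk above: with the combined check `!ok || err != nil`, a result of an unexpected type together with a nil error would return `nil, nil`, so the caller would see no error. The sketch below shows one way to keep the two cases apart. It is an assumption about how this could be handled, not the project's code; `asStrings` is a hypothetical helper and `[]string` stands in for `[]*prom.QueryResult`.

```go
package main

import (
	"errors"
	"fmt"
)

// asStrings is a hypothetical helper: it converts the untyped result of a
// retried call back to a concrete slice type and reports a failed type
// assertion as an error instead of returning (nil, nil).
func asStrings(result interface{}, err error) ([]string, error) {
	if err != nil {
		return nil, err // the call itself failed
	}
	values, ok := result.([]string)
	if !ok {
		return nil, fmt.Errorf("unexpected result type %T", result)
	}
	return values, nil
}

func main() {
	fmt.Println(asStrings([]string{"cluster-one"}, nil))    // expected type
	fmt.Println(asStrings(42, nil))                         // wrong type, now an error
	fmt.Println(asStrings(nil, errors.New("query failed"))) // original error is kept
}
```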
 

+ 1 - 1
pkg/env/costmodelenv.go

@@ -69,7 +69,7 @@ const (
 // GetAppVersion returns the environment variable value for AppVersionEnvVar which represents
 // the cost-model application version
 func GetAppVersion() string {
-	return Get(AppVersionEnvVar, "1.74.0")
+	return Get(AppVersionEnvVar, "1.75.1")
 }
 
 // IsEmitNamespaceAnnotationsMetric returns true if cost-model is configured to emit the kube_namespace_annotations metric

+ 15 - 6
pkg/kubecost/allocation.go

@@ -316,7 +316,7 @@ func (a *Allocation) add(that *Allocation, isShared, isAccumulating bool) {
 
 		aggTotalCost := a.TotalCost + that.TotalCost
 		if aggTotalCost > 0 {
-			a.TotalEfficiency = (a.TotalEfficiency*a.TotalCost + that.TotalEfficiency*that.TotalCost) / aggTotalCost
+			a.TotalEfficiency = (a.TotalEfficiency*(a.TotalCost-a.ExternalCost) + that.TotalEfficiency*(that.TotalCost-that.ExternalCost)) / (aggTotalCost - a.ExternalCost - that.ExternalCost)
 		} else {
 			aggTotalCost = 0.0
 		}
@@ -733,13 +733,22 @@ func (as *AllocationSet) AggregateBy(properties Properties, options *AllocationA
 	// exact key match, given each external allocation's properties, and
 	// aggregate if an exact match is found.
 	for _, alloc := range externalSet.allocations {
-		key, err := alloc.generateKey(properties)
-		if err != nil {
-			continue
+		skip := false
+		for _, ff := range options.FilterFuncs {
+			if !ff(alloc) {
+				skip = true
+				break
+			}
 		}
+		if !skip {
+			key, err := alloc.generateKey(properties)
+			if err != nil {
+				continue
+			}
 
-		alloc.Name = key
-		aggSet.Insert(alloc)
+			alloc.Name = key
+			aggSet.Insert(alloc)
+		}
 	}
 
 	// (9) Combine all idle allocations into a single "__idle__" allocation
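
Note on the `TotalEfficiency` change in the first hunk: the new expression weights each allocation's efficiency by its cost excluding external cost. The sketch below works through that arithmetic on a pared-down struct. The names `alloc` and `combinedEfficiency` are hypothetical and not part of `pkg/kubecost`. One deliberate difference, which is this example's assumption rather than the commit's behavior: the hunk guards on `aggTotalCost > 0`, while the sketch guards on the non-external weight itself, since that is the value used as the divisor.

```go
package main

import "fmt"

// alloc is a hypothetical, reduced stand-in for kubecost.Allocation.
type alloc struct {
	TotalCost       float64
	ExternalCost    float64
	TotalEfficiency float64
}

// combinedEfficiency returns the average of the two efficiencies, weighted
// by each allocation's cost with external cost removed.
func combinedEfficiency(a, b alloc) float64 {
	weightA := a.TotalCost - a.ExternalCost
	weightB := b.TotalCost - b.ExternalCost
	if weightA+weightB <= 0 {
		return 0 // assumption: no non-external cost means no efficiency to report
	}
	return (a.TotalEfficiency*weightA + b.TotalEfficiency*weightB) / (weightA + weightB)
}

func main() {
	// b consists only of external cost, so it should not dilute a's efficiency.
	a := alloc{TotalCost: 10, ExternalCost: 2, TotalEfficiency: 0.5}
	b := alloc{TotalCost: 4, ExternalCost: 4, TotalEfficiency: 0}
	fmt.Println(combinedEfficiency(a, b))
}
```

With the previous formula, which weighted by `TotalCost` alone, the same inputs would give (0.5 × 10 + 0 × 4) / 14 ≈ 0.357, so the external-only allocation would pull the combined efficiency down.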

+ 44 - 0
pkg/util/retry/retry.go

@@ -0,0 +1,44 @@
+package retry
+
+import (
+	"context"
+	"fmt"
+	"math/rand"
+	"time"
+)
+
+// RetryCancellationErr is the error type that's returned if the retry is cancelled
+var RetryCancellationErr error = fmt.Errorf("RetryCancellationErr")
+
+// IsRetryCancelledError returns true if the error was a cancellation
+func IsRetryCancelledError(err error) bool {
+	return err != nil && err.Error() == "RetryCancellationErr"
+}
+
+// Retry runs f until it returns a nil error, the attempts are exhausted, or ctx is cancelled.
+func Retry(ctx context.Context, f func() (interface{}, error), attempts uint, delay time.Duration) (interface{}, error) {
+	var result interface{}
+	var err error
+
+	d := delay
+	for r := attempts; r > 0; r-- {
+		select {
+		case <-ctx.Done():
+			return nil, RetryCancellationErr
+		default:
+		}
+
+		result, err = f()
+
+		if err == nil {
+			break
+		}
+
+		time.Sleep(d)
+
+		jitter := time.Duration(rand.Int63n(int64(d)))
+		d = d + jitter/2
+	}
+
+	return result, err
+}
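
Note on `Retry`: as written, the context is checked only before each attempt, so a cancellation that arrives during `time.Sleep` is noticed only once the sleep ends, and the function also sleeps after the final failed attempt. The sketch below shows one possible variant that waits on a timer and the context together and compares errors with `errors.Is`. It is an illustration under assumptions, not the package's implementation: `retrySketch`, `errCancelled`, and `demo` are hypothetical names, and the jitter from the original is left out to keep the example deterministic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errCancelled is a hypothetical sentinel for this sketch; it is not the
// RetryCancellationErr value from the diff.
var errCancelled = errors.New("retry cancelled")

// retrySketch calls f until it succeeds, attempts run out, or ctx is done.
// The wait between attempts also watches ctx, so cancellation is not delayed.
func retrySketch(ctx context.Context, f func() (interface{}, error), attempts uint, delay time.Duration) (interface{}, error) {
	var result interface{}
	var err error
	for i := uint(0); i < attempts; i++ {
		if ctx.Err() != nil {
			return nil, errCancelled
		}
		result, err = f()
		if err == nil || i == attempts-1 {
			break // success, or no further attempt to wait for
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return nil, errCancelled
		case <-timer.C:
		}
	}
	return result, err
}

// demo runs one retry that succeeds on the third call and one retry on an
// already-cancelled context, then reports what happened.
func demo() (result interface{}, calls int, cancelled bool) {
	f := func() (interface{}, error) {
		calls++
		if calls < 3 {
			return nil, fmt.Errorf("attempt %d failed", calls)
		}
		return "ok", nil
	}
	result, _ = retrySketch(context.Background(), f, 5, time.Millisecond)

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancelled before the first attempt, so f is not called again
	_, err := retrySketch(ctx, f, 5, time.Millisecond)
	cancelled = errors.Is(err, errCancelled)
	return result, calls, cancelled
}

func main() {
	fmt.Println(demo())
}
```

Comparing against a sentinel with `errors.Is` keeps working if a caller wraps the error with `%w`, which a comparison of `err.Error()` strings does not.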

+ 121 - 0
pkg/util/retry/retry_test.go

@@ -0,0 +1,121 @@
+package retry
+
+import (
+	"context"
+	"fmt"
+	"sync/atomic"
+	"testing"
+	"time"
+)
+
+type Obj struct {
+	Name string
+}
+
+func TestPtrSliceRetry(t *testing.T) {
+	const Expected uint64 = 3
+
+	var count uint64 = 0
+
+	f := func() (interface{}, error) {
+		c := atomic.AddUint64(&count, 1)
+		fmt.Println("Try:", c)
+
+		if c == Expected {
+			return []*Obj{
+				{"A"},
+				{"B"},
+				{"C"},
+			}, nil
+		}
+
+		return nil, fmt.Errorf("Failed: %d", c)
+	}
+
+	result, err := Retry(context.Background(), f, 5, time.Second)
+	objs, ok := result.([]*Obj)
+	if err != nil || !ok {
+		t.Fatalf("Failed to correctly cast back to slice type")
+	}
+
+	t.Logf("Length: %d\n", len(objs))
+}
+
+func TestSuccessRetry(t *testing.T) {
+	const Expected uint64 = 3
+
+	var count uint64 = 0
+
+	f := func() (interface{}, error) {
+		c := atomic.AddUint64(&count, 1)
+		fmt.Println("Try:", c)
+
+		if c == Expected {
+			return struct{}{}, nil
+		}
+
+		return nil, fmt.Errorf("Failed: %d", c)
+	}
+
+	_, err := Retry(context.Background(), f, 5, time.Second)
+	if err != nil {
+		t.Fatalf("Unexpected error: %s", err)
+	}
+}
+
+func TestFailRetry(t *testing.T) {
+	const Expected uint64 = 5
+
+	expectedError := fmt.Sprintf("Failed: %d", Expected)
+	var count uint64 = 0
+
+	f := func() (interface{}, error) {
+		c := atomic.AddUint64(&count, 1)
+		fmt.Println("Try:", c)
+		return nil, fmt.Errorf("Failed: %d", c)
+	}
+
+	_, err := Retry(context.Background(), f, 5, time.Second)
+	if count != Expected {
+		t.Fatalf("Expected Count: %d, Actual: %d", Expected, count)
+	}
+
+	if err.Error() != expectedError {
+		t.Fatalf("Expected error: %s, Actual error: %s", expectedError, err.Error())
+	}
+}
+
+func TestCancelRetry(t *testing.T) {
+	const Expected uint64 = 5
+
+	var count uint64 = 0
+
+	f := func() (interface{}, error) {
+		c := atomic.AddUint64(&count, 1)
+		fmt.Println("Try:", c)
+		return nil, fmt.Errorf("Failed: %d", c)
+	}
+
+	wait := make(chan error)
+	ctx, cancel := context.WithCancel(context.Background())
+
+	// execute retry in a goroutine
+	go func() {
+		_, err := Retry(ctx, f, 5, time.Second)
+
+		wait <- err
+	}()
+
+	// cancel after 2 seconds
+	go func() {
+		time.Sleep(time.Second * 2)
+		cancel()
+	}()
+
+	// wait for error result
+	e := <-wait
+
+	if !IsRetryCancelledError(e) {
+		t.Fatalf("Expected CancellationError, got: %s", e)
+	}
+}