Skip to content

Commit

Permalink
Add AC hit rate metrics with prometheus labels
Browse files Browse the repository at this point in the history
This is a draft, not ready to be merged.

Summary of what is added by this commit:

 - Prometheus counter for cache hit ratio of only AC requests.
 - Support for prometheus labels based on custom HTTP and gRPC headers.

Cache hit ratio for CAS entries is easily misinterpreted. Example:

  A typical action cache hit often involves 3 or more HTTP requests:

    GET AC 200
    GET CAS 200 (.o file)
    GET CAS 200 (.d file)
    ...

  But a cache miss for the same action is typically a single HTTP request:

    GET AC 404

The ratio between all HTTP GET 200 vs HTTP GET 404 above does not represent
the cache hit ratio experienced by the user for actions. The ratio of only
AC requests is easier to reason about, especially when AC requests checks
existence of CAS dependencies.

The number of AC hits and misses can be directly compared against numbers
printed in the end of each build by bazel client. And against other
prometheus counters produced by remote execution systems for executed actions.

An understanding about the reason for cache misses is necessary to improve the
cache hit ratio. It could be that the system has been configured in a way
that prevent artifacts from being reused between different OS. Or that the
cache is only populated by CI jobs on master, potentially resulting in cache
misses for other users, etc. It becomes easier to notice such patterns, if
cache hit ratio could be calculated for different categories of builds.
Such categories can be set as custom headers via bazel flags
--remote_header=branch=master and applied as prometheus labels. Mapping of
headers to prometheus labels are controlled in bazel-remote's config file.

The ratio between cache uploads and cache misses is also relevant, as an
view about which categories are not uploading their result. The ratio of cache
uploads can also indicate if much is uploaded but seldom requested. E.g. does it
make sense to populate central caches from interactive builds or only from CI?

Categories and custom headers, could also be set for an overview about:
 - Bazel versions using a cache instance?
 - How much separate organizations are using a cache instance?
 - From which network traffic originates?
 - Which products are built using the cache?
 - If the traffic comes via proxy adding its own headers?
 - Distinguish dummy requests for monitoring the cache, from real requests?
 - ...
  • Loading branch information
ulrfa committed Sep 16, 2020
1 parent 0a6dd55 commit 25e244e
Show file tree
Hide file tree
Showing 15 changed files with 335 additions and 38 deletions.
1 change: 1 addition & 0 deletions BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ go_library(
"//config:go_default_library",
"//server:go_default_library",
"//utils/idle:go_default_library",
"//utils/metrics:go_default_library",
"//utils/rlimit:go_default_library",
"@com_github_abbot_go_http_auth//:go_default_library",
"@com_github_grpc_ecosystem_go_grpc_prometheus//:go_default_library",
Expand Down
17 changes: 17 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -223,6 +223,23 @@ host: localhost

# If true, enable experimental remote asset API support:
#experimental_remote_asset_api: true

# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
# categories:
# os:
# - rhel7
# - rhel8
# - ubuntu16-04
# - ubuntu18-04
# branch:
# - master
# user:
# - ci

```

## Docker
Expand Down
7 changes: 6 additions & 1 deletion cache/cache.go
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@ const (
// used for HTTP when running with the --disable_http_ac_validation
// commandline flag.
RAW

UNKNOWN
)

func (e EntryKind) String() string {
Expand All @@ -30,7 +32,10 @@ func (e EntryKind) String() string {
if e == CAS {
return "cas"
}
return "raw"
if e == RAW {
return "raw"
}
return "unknown"
}

// Logger is designed to be satisfied by log.Logger.
Expand Down
8 changes: 8 additions & 0 deletions config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,11 @@ type HTTPBackendConfig struct {
BaseURL string `yaml:"url"`
}

// Metrics stores configuration for prometheus metrics.
type Metrics struct {
Categories map[string][]string `yaml:"categories"`
}

// Config holds the top-level configuration for bazel-remote.
type Config struct {
Host string `yaml:"host"`
Expand All @@ -55,6 +60,7 @@ type Config struct {
DisableGRPCACDepsCheck bool `yaml:"disable_grpc_ac_deps_check"`
EnableACKeyInstanceMangling bool `yaml:"enable_ac_key_instance_mangling"`
EnableEndpointMetrics bool `yaml:"enable_endpoint_metrics"`
Metrics *Metrics `yaml:"metrics"`
ExperimentalRemoteAssetAPI bool `yaml:"experimental_remote_asset_api"`
HTTPReadTimeout time.Duration `yaml:"http_read_timeout"`
HTTPWriteTimeout time.Duration `yaml:"http_write_timeout"`
Expand All @@ -73,6 +79,7 @@ func New(dir string, maxSize int, host string, port int, grpcPort int,
disableGRPCACDepsCheck bool,
enableACKeyInstanceMangling bool,
enableEndpointMetrics bool,
metrics *Metrics,
experimentalRemoteAssetAPI bool,
httpReadTimeout time.Duration,
httpWriteTimeout time.Duration) (*Config, error) {
Expand All @@ -95,6 +102,7 @@ func New(dir string, maxSize int, host string, port int, grpcPort int,
DisableGRPCACDepsCheck: disableGRPCACDepsCheck,
EnableACKeyInstanceMangling: enableACKeyInstanceMangling,
EnableEndpointMetrics: enableEndpointMetrics,
Metrics: metrics,
ExperimentalRemoteAssetAPI: experimentalRemoteAssetAPI,
HTTPReadTimeout: httpReadTimeout,
HTTPWriteTimeout: httpWriteTimeout,
Expand Down
8 changes: 5 additions & 3 deletions main.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (
"github.com/buchgr/bazel-remote/config"
"github.com/buchgr/bazel-remote/server"
"github.com/buchgr/bazel-remote/utils/idle"
"github.com/buchgr/bazel-remote/utils/metrics"
"github.com/buchgr/bazel-remote/utils/rlimit"

grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
Expand Down Expand Up @@ -283,6 +284,7 @@ func main() {
ctx.Bool("disable_grpc_ac_deps_check"),
ctx.Bool("enable_ac_key_instance_mangling"),
ctx.Bool("enable_endpoint_metrics"),
nil,
ctx.Bool("experimental_remote_asset_api"),
ctx.Duration("http_read_timeout"),
ctx.Duration("http_write_timeout"),
Expand Down Expand Up @@ -311,6 +313,7 @@ func main() {

accessLogger := log.New(os.Stdout, "", logFlags)
errorLogger := log.New(os.Stderr, "", logFlags)
metrics := metrics.NewMetrics(c.Metrics)

var proxyCache cache.Proxy
if c.GoogleCloudStorage != nil {
Expand Down Expand Up @@ -344,8 +347,7 @@ func main() {
}

validateAC := !c.DisableHTTPACValidation
h := server.NewHTTPCache(diskCache, accessLogger, errorLogger, validateAC, c.EnableACKeyInstanceMangling, gitCommit)

h := server.NewHTTPCache(diskCache, accessLogger, errorLogger, metrics, validateAC, c.EnableACKeyInstanceMangling, gitCommit)
var htpasswdSecrets auth.SecretProvider
cacheHandler := h.CacheHandler
if c.HtpasswdFile != "" {
Expand Down Expand Up @@ -444,7 +446,7 @@ func main() {
validateAC,
c.EnableACKeyInstanceMangling,
enableRemoteAssetAPI,
diskCache, accessLogger, errorLogger)
diskCache, accessLogger, errorLogger, metrics)
if err3 != nil {
log.Fatal(err3)
}
Expand Down
1 change: 1 addition & 0 deletions server/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ go_library(
"//cache:go_default_library",
"//cache/disk:go_default_library",
"//utils/idle:go_default_library",
"//utils/metrics:go_default_library",
"@com_github_abbot_go_http_auth//:go_default_library",
"@com_github_bazelbuild_remote_apis//build/bazel/remote/asset/v1:go_default_library",
"@com_github_bazelbuild_remote_apis//build/bazel/remote/execution/v2:go_default_library",
Expand Down
10 changes: 6 additions & 4 deletions server/grpc.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ import (

"github.com/buchgr/bazel-remote/cache"
"github.com/buchgr/bazel-remote/cache/disk"

"github.com/buchgr/bazel-remote/utils/metrics"
_ "github.com/mostynb/go-grpc-compression/snappy" // Register snappy
_ "github.com/mostynb/go-grpc-compression/zstd" // and zstd support.
)
Expand All @@ -39,6 +39,7 @@ type grpcServer struct {
errorLogger cache.Logger
depsCheck bool
mangleACKeys bool
metrics metrics.Metrics
}

// ListenAndServeGRPC creates a new gRPC server and listens on the given
Expand All @@ -48,27 +49,28 @@ func ListenAndServeGRPC(addr string, opts []grpc.ServerOption,
validateACDeps bool,
mangleACKeys bool,
enableRemoteAssetAPI bool,
c *disk.Cache, a cache.Logger, e cache.Logger) error {
c *disk.Cache, a cache.Logger, e cache.Logger, m metrics.Metrics) error {

listener, err := net.Listen("tcp", addr)
if err != nil {
return err
}

return serveGRPC(listener, opts, validateACDeps, mangleACKeys, enableRemoteAssetAPI, c, a, e)
return serveGRPC(listener, opts, validateACDeps, mangleACKeys, enableRemoteAssetAPI, c, a, e, m)
}

func serveGRPC(l net.Listener, opts []grpc.ServerOption,
validateACDepsCheck bool,
mangleACKeys bool,
enableRemoteAssetAPI bool,
c *disk.Cache, a cache.Logger, e cache.Logger) error {
c *disk.Cache, a cache.Logger, e cache.Logger, m metrics.Metrics) error {

srv := grpc.NewServer(opts...)
s := &grpcServer{
cache: c, accessLogger: a, errorLogger: e,
depsCheck: validateACDepsCheck,
mangleACKeys: mangleACKeys,
metrics: m,
}
pb.RegisterActionCacheServer(srv, s)
pb.RegisterCapabilitiesServer(srv, s)
Expand Down
12 changes: 12 additions & 0 deletions server/grpc_ac.go
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,12 @@ import (
pb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
"github.com/golang/protobuf/proto"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/metadata"
"google.golang.org/grpc/peer"
"google.golang.org/grpc/status"

"github.com/buchgr/bazel-remote/cache"
"github.com/buchgr/bazel-remote/utils/metrics"
)

var (
Expand Down Expand Up @@ -63,6 +65,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,
}
if rdr == nil || sizeBytes <= 0 {
s.accessLogger.Printf("%s %s %s", logPrefix, req.ActionDigest.Hash, "NOT FOUND")
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.NOT_FOUND, ctx)
return nil, status.Error(codes.NotFound,
fmt.Sprintf("%s not found in AC", req.ActionDigest.Hash))
}
Expand All @@ -82,6 +85,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,
}

s.accessLogger.Printf("%s %s OK", logPrefix, req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.OK, ctx)
return result, nil
}

Expand All @@ -93,6 +97,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,

if result == nil {
s.accessLogger.Printf("%s %s NOT FOUND", logPrefix, req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.NOT_FOUND, ctx)
return nil, status.Error(codes.NotFound,
fmt.Sprintf("%s not found in AC", req.ActionDigest.Hash))
}
Expand Down Expand Up @@ -129,6 +134,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,
}

s.accessLogger.Printf("GRPC AC GET %s OK", req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.OK, ctx)

return result, nil
}
Expand Down Expand Up @@ -290,6 +296,7 @@ func (s *grpcServer) UpdateActionResult(ctx context.Context,
}

s.accessLogger.Printf("GRPC AC PUT %s OK", req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_PUT, metrics.OK, ctx)

// Trivia: the RE API wants us to return the ActionResult from the
// request, in order to follow this standard method style guide:
Expand Down Expand Up @@ -331,3 +338,8 @@ func addWorkerMetadataGRPC(ctx context.Context, ar *pb.ActionResult) {

ar.ExecutionMetadata.Worker = worker
}

func (s *grpcServer) incAcRequestMetrics(method metrics.Method, status metrics.Status, ctx context.Context) {
headers, _ := metadata.FromIncomingContext(ctx)
s.metrics.IncomingRequestCompleted(metrics.AC, method, status, headers, metrics.GRPC)
}
3 changes: 2 additions & 1 deletion server/grpc_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ func TestMain(m *testing.M) {

accessLogger := testutils.NewSilentLogger()
errorLogger := testutils.NewSilentLogger()
metrics := testutils.NewMetricsStub()

listener = bufconn.Listen(bufSize)

Expand All @@ -87,7 +88,7 @@ func TestMain(m *testing.M) {
validateAC,
mangleACKeys,
enableRemoteAssetAPI,
diskCache, accessLogger, errorLogger)
diskCache, accessLogger, errorLogger, metrics)
if err2 != nil {
fmt.Println(err2)
os.Exit(1)
Expand Down
Loading

0 comments on commit 25e244e

Please sign in to comment.