Hands-on with G123 scalable ELK stack deployment on Kubernetes (AWS EKS)
Bo | Posted on 06-15
Troubleshooting and fixing issues by analyzing log data is a major topic in our multi-cluster (cloud) system design. Unlike databases or log files in a single-environment setup, each service/application in such a multi-cluster design runs in its own environment, which means they do not share resources in the same way. As a result, it is always painful for the SRE team to get notified about, or locate the root cause of, issues involving cross-service dependencies. In G123, we have hundreds of online services (including backend services and batch processes) running on multiple cloud providers (AWS and Alicloud). A centralized logging approach is therefore essential for log aggregation, processing, storage, and analysis across our engineering and SRE teams.
At G123, we chose ELK as our centralized logging solution. The ELK stack (which stands for Elasticsearch, Logstash, and Kibana) is one of the most popular open-source platforms for log collection, storage, and analytics, and has been embraced by companies such as Netflix, LinkedIn, and Twitter.
Elasticsearch: Open-source, full-text search and analysis engine, based on the Apache Lucene search engine.
Logstash: Server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms and sends to a “stash” like Elasticsearch.
Kibana: Visualizes Elasticsearch data and navigates the Elastic Stack. Do anything from tracking query load to understanding the way requests flow through applications.
ELK deployment on container-based platforms is constantly evolving. Recently, Elastic introduced a Kubernetes operator based deployment solution for the ELK stack (ECK). In this blog, we will walk through our architecture and illustrate how we deploy a fully scalable ELK stack on the G123 Kubernetes cluster using this operator. We will also go over how we gather logs across different G123 services, including the Kubernetes system itself and AWS services.
Log Sources
In our target list, we have three categories of logs:
Kubernetes system log: generated by the Kubernetes cluster itself, e.g., from Kubernetes system pods (kube-proxy, coredns, etc.)
Application log: generated by the application or services deployed by developers.
AWS service log: generated by each AWS service we are using.
Architecture
For the Kubernetes system log and application log, we use Filebeat (from the Beats family released by elastic.co) to collect them from Kubernetes nodes and send them to Logstash, where we further process and enrich them before shipping them to the Elasticsearch cluster and eventually visualizing them in the Kibana UI.
When choosing the node-level collection agent, we compared Filebeat and Logstash:
Management: centralized management is provided by Logstash.
Performance: Filebeat consumes only minimal memory, while Logstash consumes considerably more memory and storage.
Transport and traffic management: Filebeat has built-in delivery reliability, while Logstash is typically deployed with Redis for enhanced reliability.
Based on performance and reliability, we chose Filebeat over Logstash to gather Kubernetes system logs from each cluster node and ship them to Logstash for further processing, before centralizing everything in the Elasticsearch cluster. On the other hand, we use Logstash to collect logs directly from AWS services. The figure below illustrates the overall architecture of our deployment, in which there are two types of log collection agents, Logstash and Filebeat, each responsible for collecting different sorts of log documents.
There are three types of nodes (pods) in the Elasticsearch cluster:
Data node (pod): stores data and executes data-related operations such as search and aggregations.
Master node (pod): in charge of cluster-wide management and configuration.
Ingest node (pod): pre-processes documents before indexing.
In this blog, we'll demonstrate an Elasticsearch cluster with 3 master nodes and 3 data nodes; in our deployment, each data node also plays the role of an ingest node.
Prerequisite
Let's take a look at the prerequisites before we begin the deployment.
Kubernetes cluster
All we need is one operational Kubernetes cluster. For our deployment, we use an AWS EKS cluster with 5 m5.xlarge spot nodes labeled as dedicated to the ELK workload.
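As a reference, the node labeling can be defined in the node group itself. The fragment below is a minimal sketch of an eksctl managed node group, assuming eksctl is used to provision the cluster; the cluster name, region, and sizes are illustrative. The only part the rest of this deployment relies on is the dedicated: elk label, which the Elasticsearch and Kibana nodeSelector keys on later.

# Illustrative eksctl ClusterConfig fragment: a spot node group dedicated to ELK
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: g123-cluster          # hypothetical cluster name
  region: ap-northeast-1      # hypothetical region
managedNodeGroups:
  - name: elk-nodes
    instanceTypes: ["m5.xlarge"]
    spot: true                # spot instances, as in our setup
    desiredCapacity: 5
    labels:
      dedicated: elk          # matched by the nodeSelector in later manifests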
Deployment
Deploy Kubernetes Operator
The first step is to download the Kubernetes operator manifest from the official repo and apply it to our cluster.
$ kubectl apply -f https://download.elastic.co/downloads/eck/1.1.0/all-in-one.yaml
customresourcedefinition.apiextensions.k8s.io/apmservers.apm.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/elasticsearches.elasticsearch.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/kibanas.kibana.k8s.elastic.co created
clusterrole.rbac.authorization.k8s.io/elastic-operator created
clusterrolebinding.rbac.authorization.k8s.io/elastic-operator created
namespace/elastic-system created
statefulset.apps/elastic-operator created
serviceaccount/elastic-operator created
validatingwebhookconfiguration.admissionregistration.k8s.io/elastic-webhook.k8s.elastic.co created
service/elastic-webhook-server created
secret/elastic-webhook-server-cert created
Deploy Elasticsearch Cluster
After that, we can create the Elasticsearch cluster, with 3 master nodes and 3 data nodes configured (each data node also takes the ingest role); the deployment manifest is saved as elasticsearch.yaml.
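The following is a minimal sketch of what such a manifest looks like under the ECK operator. The cluster name, Elasticsearch version, storage size, and resource figures are illustrative; the node topology, the xpack.ml.enabled setting, and the dedicated=elk nodeSelector correspond to the customizations described below.

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elastic-cluster            # hypothetical cluster name
  namespace: elastic-system
spec:
  version: 7.13.0                  # illustrative version
  nodeSets:
  - name: master
    count: 3
    config:
      node.master: true
      node.data: false
      node.ingest: false
      xpack.ml.enabled: true       # requires SSE4.2-capable CPUs
    podTemplate:
      spec:
        nodeSelector:
          dedicated: elk           # matches --node-labels=dedicated=elk
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: 1
              memory: 2Gi
            limits:
              memory: 2Gi
  - name: data
    count: 3
    config:
      node.master: false
      node.data: true
      node.ingest: true            # data nodes also act as ingest nodes
      xpack.ml.enabled: true
    podTemplate:
      spec:
        nodeSelector:
          dedicated: elk
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: 2
              memory: 8Gi
            limits:
              memory: 8Gi
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi         # illustrative size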
In the above deployment file, we have a few customizations:
Setting xpack.ml.enabled: true lets us utilize the machine learning APIs. However, if the CPU does not support SSE4.2, this must be disabled. (SSE4.2 is supported on Intel Core i7 ("Nehalem"), Intel Atom (Silvermont core), AMD Bulldozer, AMD Jaguar, and later processors.)
The nodeSelector part follows our dedicated node labeling: --node-labels=dedicated=elk
Pod resource specifications are set according to expected usage.
Then apply the yaml file to the cluster:
$ kubectl apply -f elasticsearch.yaml
After this, we can perform a health check on the services and pods within the cluster.
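For example, the cluster-level health reported by the operator and the status of the Elasticsearch pods can be checked as follows (the resource name and the sample output line are illustrative, based on the sketch above):

# Cluster-level health as reported by the ECK operator
$ kubectl get elasticsearch -n elastic-system
NAME              HEALTH   NODES   VERSION   PHASE   AGE
elastic-cluster   green    6       7.13.0    Ready   30m

# Per-pod status of the master and data node sets
$ kubectl get pods -n elastic-system -l elasticsearch.k8s.elastic.co/cluster-name=elastic-cluster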
Here are the explanations of Elasticsearch cluster health statuses from the official documentation:
green: All shards are assigned.
yellow: All primary shards are assigned, but one or more replica shards are unassigned. If a node in the cluster fails, some data could be unavailable until that node is repaired.
red: One or more primary shards are unassigned, so some data is unavailable. This can occur briefly during cluster startup as primary shards are assigned.
Deploy Kibana
Once the Elasticsearch cluster is up and running, we can deploy the Kibana application for data visualization. Below is the deployment file (kibana.yaml).
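A minimal sketch of such a Kibana resource under ECK is shown here; the Elasticsearch reference name, version, and resource figures are assumptions, while the resource name is chosen to match the kibana-cluster-kb-http service seen later.

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-cluster             # ECK derives the service name kibana-cluster-kb-http from this
  namespace: elastic-system
spec:
  version: 7.13.0                  # illustrative; should match the Elasticsearch version
  count: 1
  elasticsearchRef:
    name: elastic-cluster          # hypothetical Elasticsearch resource name
  podTemplate:
    spec:
      nodeSelector:
        dedicated: elk
      containers:
      - name: kibana
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            memory: 1Gi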
In our case, a horizontal pod autoscaler is also used to monitor the Kibana pod; it scales the deployment out to multiple instances when the resource consumption threshold (measured by memory usage) is reached.
The HorizontalPodAutoscaler.yaml file for our Kibana deployment:
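A sketch of such an autoscaler, assuming the ECK-generated Kibana Deployment is named kibana-cluster-kb and that 80% average memory utilization is the scaling threshold (the replica bounds and threshold are assumptions; a metrics server must be available in the cluster):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: kibana-hpa
  namespace: elastic-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kibana-cluster-kb        # Deployment created by ECK for the Kibana resource
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80     # illustrative threshold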
After applying the above two yaml files, the service will be running in the cluster.
$ kubectl get svc -n elastic-system | grep kibana
NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kibana-cluster-kb-http   ClusterIP   172.20.201.58   <none>        5601/TCP   7h
Now we can check the Kibana service and access it from the deploy machine. First, forward the service to the deploy machine using port-forward.
$ kubectl port-forward service/kibana-cluster-kb-http 5601 -n elastic-system
Forwarding from 127.0.0.1:5601 -> 5601
Forwarding from [::1]:5601 -> 5601
Visit http://localhost:5601 in the local browser. The Kibana UI will show up after we enter the username elastic and the password generated during the Elasticsearch deployment process (Step 2).
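The password for the built-in elastic user is stored by the operator in a Kubernetes secret named <elasticsearch-name>-es-elastic-user. Assuming the hypothetical cluster name elastic-cluster from the sketch above, it can be read like this:

$ kubectl get secret elastic-cluster-es-elastic-user \
    -n elastic-system \
    -o go-template='{{.data.elastic | base64decode}}'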
Up to this point, we have a running Elasticsearch cluster and a Kibana application. In the next section, we'll deploy the log collection part: Logstash and Filebeat.
Deploy Logstash
The first step of the Logstash deployment is creating a ConfigMap in Kubernetes to store the configuration (logstash_cm.yaml).
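A minimal sketch of such a ConfigMap is shown below: a Beats input listening on port 5044 and an Elasticsearch output pointing at the ECK-created service. The service name, credential wiring, and index pattern are assumptions, and disabling certificate verification is only a shortcut for the sketch.

apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: elastic-system
data:
  logstash.conf: |
    input {
      beats {
        port => 5044                                         # Filebeat ships logs here
      }
    }
    output {
      elasticsearch {
        hosts => ["https://elastic-cluster-es-http:9200"]    # ECK-created service (assumed name)
        user => "elastic"
        password => "${ELASTIC_PASSWORD}"                    # injected from the ECK secret
        ssl => true
        ssl_certificate_verification => false                # sketch shortcut; mount the CA in production
        index => "filebeat-%{+YYYY.MM.dd}"
      }
    }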
Then we create the deployment file (logstash_deployment.yaml) for the Logstash service. The ConfigMap is mounted as a volume inside the Logstash pod.
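A sketch of the corresponding Deployment follows; the image version, replica count, and secret name are assumptions, while the mount path is Logstash's default pipeline directory.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: elastic-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      nodeSelector:
        dedicated: elk
      containers:
      - name: logstash
        image: docker.elastic.co/logstash/logstash:7.13.0
        ports:
        - containerPort: 5044                          # Beats input port
        env:
        - name: ELASTIC_PASSWORD                       # referenced in the pipeline above
          valueFrom:
            secretKeyRef:
              name: elastic-cluster-es-elastic-user    # ECK-generated secret (assumed name)
              key: elastic
        volumeMounts:
        - name: pipeline
          mountPath: /usr/share/logstash/pipeline      # default pipeline directory
      volumes:
      - name: pipeline
        configMap:
          name: logstash-config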
The last step is creating a Logstash service manifest (logstash_svc.yaml). This generates an external IP (an internal load balancer hostname) for cross-cluster service discovery, so that Filebeat running on other cluster nodes can find this Logstash.
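A sketch of such a Service is shown here. The aws-load-balancer-internal annotation asks AWS for a VPC-internal load balancer, which is consistent with the internal-*.amazonaws.com hostname used in the Filebeat configuration later; the exact annotations in our cluster may differ.

apiVersion: v1
kind: Service
metadata:
  name: logstash-service
  namespace: elastic-system
  annotations:
    # Request an internal (VPC-only) load balancer so the hostname
    # resolves from other nodes and clusters inside our network.
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: logstash
  ports:
  - name: beats
    port: 5044
    targetPort: 5044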
After applying all three yaml files above in Kubernetes, we can get the Logstash service information; the external IP address will be used in the Filebeat configuration in the next step.
$ kubectl get svc -n elastic-system | grep logstash-service
Below is the output from our deployment.
We're not going to cover every kind of AWS log collection practice in this blog. Below is one example explaining how we collect AWS Redshift audit logs that were saved to an AWS S3 bucket; the configuration for saving audit logs to S3 is described in this manual.
In our Logstash configuration file (logstash_cm_aws.yaml), we use the S3 input plugin to monitor bucket changes in S3 and extract the log documents into the Elasticsearch cluster through Logstash.
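A sketch of what logstash_cm_aws.yaml could contain is shown below, using the Logstash S3 input plugin. The bucket name, region, prefix, polling interval, and index pattern are all illustrative; bucket access is assumed to come from the node's IAM role rather than embedded keys.

apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config-aws
  namespace: elastic-system
data:
  logstash.conf: |
    input {
      s3 {
        bucket   => "redshift-audit-logs"                    # hypothetical bucket name
        region   => "ap-northeast-1"                         # hypothetical region
        prefix   => "AWSLogs/"                               # path where Redshift writes audit logs
        interval => 60                                       # poll the bucket every 60 seconds
        codec    => "plain"
        add_field => { "log_source" => "redshift-audit" }
      }
    }
    output {
      elasticsearch {
        hosts => ["https://elastic-cluster-es-http:9200"]    # assumed ECK service name
        user => "elastic"
        password => "${ELASTIC_PASSWORD}"
        ssl => true
        ssl_certificate_verification => false                # sketch shortcut
        index => "redshift-audit-%{+YYYY.MM.dd}"
      }
    }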
After deploying the above two yaml files, we can view and check the Redshift logs.
Deploy Filebeat
We use the following Filebeat deployment file (filebeat.yaml) to collect logs from our Kubernetes cluster. In the output.logstash: section, we use the Logstash service external IP (hostname) generated in the previous step.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
        - add_tags:
            tags: [internal-k8s]

    output.logstash:
      hosts: ["internal-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.amazonaws.com:5044"]

    setup.template.name: "filebeat"
    setup.template.pattern: "filebeat-*"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Exists
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:7.13.0
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          runAsUser: 0
          # If using Red Hat OpenShift uncomment this:
          #privileged: true
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0640
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
      - name: data
        hostPath:
          # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["apps"]
  resources:
    - replicasets
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
---
After running kubectl apply -f filebeat.yaml in our cluster, we can see the Filebeat DaemonSet deployed on all our Kubernetes nodes (there are 2 more nodes in the cluster besides the ELK nodes).
$ kubectl get daemonsets -n kube-system | grep filebeat
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
filebeat   8         8         8       8            8           kubernetes.io/os=linux   7h
And in the Kibana UI, we can see the Kubernetes system logs are correctly collected.
Conclusion
In this blog, we introduced the architecture of our centralized logging system and the procedure for deploying a fully functional ELK stack on a Kubernetes cluster (AWS EKS in our case). We also covered the detailed steps for setting up Logstash and Filebeat, as well as collecting logs from an AWS service.