This demonstration has the purpose of showing how to develop and implement a webscrapping solution using Containers. The application can deploy webscrapping containers into either EKS or ECS, each container will scrape its own URL defined by Controller Application.
You have webscraping code that you want to run in thousands of ephemeral containers using thousands of unique IPs to scrape websites for potentially malicious code. The life of each container is the duration of the webscraping job. You are working with the constraint that you must use your own pool of elastic IP address in your VPC and not randomly assigned public IPs. Ideally you want to use AWS Fargate or Lambda services execpt the issue you've hit with is that neither of them support the use of Elastic IPs.
So you want an alternative solution using EC2 based EKS (which does support Elastic IPs) but want the user experience of that service to be similar to Fargate - when it comes to automation of management overhead.
This solution showcases that by demonstrating automation in the following areas:
- Kubernetes cluster management is automated in two ways:
- Amazon EKS automates the control plane management including scaling and HA
- Karpenter automates worker node management including scaling and HA
- Container management is automated by Kubernetes including scaling and HA
- Karpenter itself is deployed as a Kubernets app and hence it's management is managed by Kubernetes (scaling and HA)
- The residual management overhead that's left for the user is minimal, mostly involving design time concerns like choices for instance types and sizes & observability of infra metrics
The user experience for deploying an application remains the same between this soltution and Fargate-EKS. Users would have to build a application container and submit it to the Kubernetes engine to run.
- kubectl
- awscli
- Pre Configured AWS auth (Access Key and Secret Key)
- Terraform 1.3.3 >
- Docker and Docker Compose
- Helm 3 > (Optional)
This solution can deploy web-scrapping container apps either on EKS and on ECS.
The EKS Architecture consists in the following components:
- VPC, with Public and Private subnets, each private subnet has the default route towards the Nat Gateway.
- EKS Control Plane, Kubernetes API Server components managed by AWS
- EKS Managed Node Group, those Nodes are needed to deploy the first Kubernetes add-ons, such as CoreDNS, Karpenter, Kube-Proxy and so on.
- Karpenter Pod, add-on deployed into the EKS cluster that will be responsible to scale the nodes up and down.
- S3 Bucket, we are using this bucket to save the scrapped content from the websites.
- DynamoDB Table, the DynamoDB table has the purpose of store the IP that we used for do each scrapping and the URL that it scrapped.
- Controller App, this application is reponsible for creating the scrapping pods into Kubernetes, one pod per URL.
- Scrape App, this is the application responsible for scrapping the URLs defined by the controller app, one pod per URL.
[TBD]
This demonstration consists in 3 applications.
- Controller App that is reponsible for provisioning the scrapping containers into EKS or ECS.
- Scrape App is the application responsible for scrape the URLs defined into the Controller App. Each container will run until scrape completition and it will be scaled down.
- Dashboard App that is responsible for creating a Dashboard of how many URLs were scrapped and show how many IPs we used.
In the EKS cluster the Controller App will apply one scrape application manifest per URL to be scraped (One Pod per URL). In the first moment the pods will remain in Pending state. That will trigger Kapenter to provision the Nodes to hadle those Pods (Defined into the provision.yaml), since we need Public IPs, Karpenter will provision those Nodes into Public Subnets.
During the creation each Node has an User Data that is responsible to assignee an Elastic IP that is available from your pool of elastic IPs. If you don't have any Elastic IPs available, it will use the Node own Public IP (Since we are launching it into public Subnets).
After the Pods complete their Task, Karpenter will notice that the nodes are idle and will remove it, releasing the Elastic IPs again for use. So with EKS + Karpenter and some automations we can have the Nodes available in less than 120 seconds, and the amount of Containers that can run in each Node, will depend on how you have configured Karpenter Provisioner and how much CPU and Memory you have defined for you scrape app. That being said, Karpenter can provision more or less Nodes depending on how many it need to handle the workload, each Node has his own Public/Elastic IP, so all the Containers that run in the same Node will use that IP.
In ECS, the Controller App calls the AWS API to provision the task into the ECS cluster (One task per URL), we are using Fargate host mode for those tasks, and because of that we don't have any servers to manage. Each Fargate task is deployed into a Public Subnet, doing that each Task will have its own Public IP. After the scrape completes, the Fargate task will be scaled down. With that solution we cannot use Elastic Ips, since Fargate doens't supports it.
For deploying the solution we are gonna need to execute few steps, everything in this repository is automated to make it easier to reproduce in any AWS account.
We are using EKS Blueprints for the EKS componentes, and pure terraform registry aws resources for other resourcers, such as, S3 bucket, DynamoDB table etc.
You can also enable terraform to create extra Elastic IPs if you desire. Set the eip_config variable to True and select the number of IPs you need.
cd terraform
terraform init
Terraform init is responsible for initializing the resources, now let's plan, with terraform plan
we are able to see the infraestructure changes that are gonna be executed.
terraform plan -var='aws_region=YOUR_REGION_HERE'
Replace the region with your region, you should see an output similar to this:
Let's now apply, with terraform apply
it will create the entire needed infraestructure.
terraform apply -var='aws_region=YOUR_REGION_HERE' --auto-approve
This will take around 20 minutes to execute.
Let's test the clister access by executing the following kubectl
commands:
kubectl get nodes
In this sample we will use Karpenter for scaling our Nodes.
Karpenter launches the right compute resources to handle your cluster's applications.
In Karpenter provisioner manifest we specify the instance types, size, capacity type and more, check this doc to understand more about Karpenter provisioner.
For the purpose of this demonstration, we are not gonna change the default configuration of the provisioner.yaml
manifest file.
cd .. && kubectl apply -f provisioner.yaml
Scrape is the only application that will be executed so we need to create the application image and push it to the ECR repository created by terraform.
./automation.sh
The script requires you to provide the AWS_ACCOUNT_ID
and AWS_REGION
you are using.
For the purpose of this demonstration, we are running both Controller Application
and Dashboard
locally, controllerd by docker-compose.
To run our apps, let's first replace the needed variables on docker-compose.yml
file, the values that we need to replace were generated by terraform output.
services:
controller-app:
build: ./controller_app
networks:
- project-network
ports:
- "3000:5000"
restart: always
environment:
AWS_ACCESS_KEY_ID: YOU HAVE TO CREATE IT (REPLACE)
AWS_SECRET_ACCESS_KEY: YOU HAVE TO CREATE IT (REPLACE)
AWS_DEFAULT_REGION: REGION DEFINED IN TERRAFORM (REPLACE)
EKS_CLUSTER_NAME: web-scrapping-demo
ECS_CLUSTER_NAME: web-scrapping-demo-ecs
SCRAPE_IMAGE_URI: ecr_image_uri (REPLACE)
BUCKET_NAME: s3_bucket_name (REPLACE)
DYNAMO_DB_TABLE_NAME: web-scrapping
PUBLIC_SUBNETS: vpc_public_subnet_ids (REPLACE)
SECURITY_GROUPS: ecs_security_group_id (REPLACE)
TASK_DEFINITION: web-scrapping-app:1
dashboard-app:
depends_on:
- controller-app
build: ./dashboard
networks:
- project-network
ports:
- "5000:5000"
restart: always
environment:
AWS_ACCESS_KEY_ID: YOU HAVE TO CREATE IT (REPLACE)
AWS_SECRET_ACCESS_KEY: YOU HAVE TO CREATE IT (REPLACE)
AWS_DEFAULT_REGION: REGION DEFINED IN TERRAFORM (REPLACE)
DYNAMO_DB_TABLE: web-scrapping
networks:
project-network:
driver: bridge
After filling the environemnt variables, it is time to execute the application locally.
docker-compose up --build
You should see the following result.
NAME STATUS ROLES AGE VERSION
ip-10-24-11-209.ec2.internal Ready,SchedulingDisabled <none> 62m v1.23.13-eks-fb459a0
ip-10-24-12-26.ec2.internal Ready,SchedulingDisabled <none> 62m v1.23.13-eks-fb459a0
SchedulingDisabled
status, it means that we are going to force Karpenter to provision new Nodes to handle the Scape application pods.
Kube-ops-view provides a common operational picture for a Kubernetes cluster that helps with understanding our cluster setup in a visual way.
helm install kube-ops-view christianknell/kube-ops-view --set service.type=LoadBalancer --set rbac.create=True
Kube ops view, will make easier to see how karpenter deals with scaling nodes to handle the scrape pods.
It will deploy a LoadBalancer type service, to get the URL execute the following command.
kubectl get svc kube-ops-view | awk '{print $4}' | grep -vi external
Open the URL into your browser, it should look like the following.
To guarantee that Karpenter will scale new nodes when we create the scrapping pods, let's cordon the nodes of the managed node groups, by executing the following command.
kubectl cordon $(kubectl get nodes | awk '{print $1}' | grep -vi name | xargs) && kubectl get nodes
Open in your browser the following URL, http://localhost:3000/. This URL directs you to the controller interface which is where we are going to define the URLs for our application to scrape.
In the text box, let's define the URLs and where we want to execute, in our case EKS
. It should look like the following.
Check below some sample URLs for testing https://github.com/lusoal/web-scrapping-container-environment/blob/3ec3cee91a005cde331cc00becdb5f397bac5ab3/urls.txt
After defining the URLs for scarapping, submit the request.
You can check what is happenning behind the scenes either using kubectl get nodes && kubectl get pods
or via kube-ops-view if you opted for installing it.
You will notice that a new node was already provisioned by Karpenter
and Kube Scheduler
already have placed the scrape pods
, the amount of nodes will depends on how many CPU and Memory we have defined for our application, and what are the instance types that we've defined on Karpenter provisioner.
The pods will run until completion, after this, Karpenter
will notice that the Node is not needed anymore and scale down the replicas.
The Dashboard application in responsible to give you ability to visualize what URLs were scrapped by the Pods and also which IP it used for scrapping. Just open http://localhost:5000
First check for the name of the bucket and the ECR repo created in the automation. Update the data_clean_up.sh
script with the resources names and region and run.
./data_clean_up.sh
After the files and images are removed, delete the remaining infrastructure:
cd terraform
terraform apply -var='aws_region=YOUR_REGION_HERE' --auto-approve
This process may take several minutes.