This project is a proof of concept (PoC) for ingesting data through Nginx and Fluent Bit into BigQuery. The solution is built with AWS CDK. For more information about CDK, see https://aws.amazon.com/cdk/.
- An AWS account
- An IAM user with permissions for at least CloudFormation, Cloud9, and EC2.
To use AWS CDK in a region for the first time, you need to bootstrap CDK in that region. See the CDK Bootstrapping document for details.
- BigQuery Connection API
- Google service account: export a GCP service account key to a JSON file (service_account.json) and put the file into root/src/lambda/
You can also export BigQuery schemas using the command below, and put the schema files into root/src/lambda/bq_schemas/
bq show --schema --format=prettyjson [project_name:dataset.table]
bq show --schema --format=prettyjson us-bq-project-1:data_ingestion_test_1.lambda-target-1
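If you have several tables, the export can be scripted. The following is a minimal sketch, not part of the project: it wraps the same `bq show` command and writes the output into the schema directory. The file naming (`<table>.json`) and the helper names are assumptions, so check which file names the Lambda function actually expects before relying on it.

```python
import subprocess
from pathlib import Path
from typing import List

# root/src/lambda/bq_schemas/, relative to the project root
SCHEMA_DIR = Path("src/lambda/bq_schemas")


def schema_command(table_ref: str) -> List[str]:
    """Build the `bq show` invocation shown above for one table reference."""
    return ["bq", "show", "--schema", "--format=prettyjson", table_ref]


def schema_path(table_ref: str, schema_dir: Path = SCHEMA_DIR) -> Path:
    """Assumed naming: <table>.json, using the part after the last dot."""
    table = table_ref.rsplit(".", 1)[-1]
    return schema_dir / f"{table}.json"


def export_schema(table_ref: str, schema_dir: Path = SCHEMA_DIR) -> Path:
    """Run bq (must be installed and authenticated) and save its JSON output."""
    result = subprocess.run(
        schema_command(table_ref), check=True, capture_output=True, text=True
    )
    target = schema_path(table_ref, schema_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(result.stdout)
    return target
```

Calling `export_schema("us-bq-project-1:data_ingestion_test_1.lambda-target-1")` from the project root would write the schema to `src/lambda/bq_schemas/lambda-target-1.json` under the assumed naming.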
Create an AWS Cloud9 environment in the region where you want to deploy the project, and upload the ZIP file into the Cloud9 environment.
AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal.
Extend the EBS volume to at least 20 GB in the AWS console, because the default EBS volume (10 GB) is not large enough to compile the CDK package. The resize might take a few minutes. For more information, see the document on extending an EBS volume on Linux. Then run the following commands in the Cloud9 terminal to grow the partition and the file system:
lsblk
sudo growpart /dev/nvme0n1 1
sudo xfs_growfs -d /
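To confirm that the root file system actually grew, you can check its size from the Cloud9 terminal. This is a small optional sketch using only the Python standard library; the 5% margin is an arbitrary assumption to allow for partition and file system overhead.

```python
import shutil

REQUIRED_GIB = 20  # target size from the resize step above


def root_size_gib(path: str = "/") -> float:
    """Return the total size of the file system containing `path`, in GiB."""
    return shutil.disk_usage(path).total / 1024**3


def is_large_enough(size_gib: float, required_gib: float = REQUIRED_GIB) -> bool:
    # The file system reports slightly less than the volume size,
    # so accept anything within 5% of the target (assumed margin).
    return size_gib >= required_gib * 0.95


print(f"root file system: {root_size_gib():.1f} GiB, "
      f"large enough: {is_large_enough(root_size_gib())}")
```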
wget "https://github.com/seeinyou/data-ingestion-with-nginx-fluent-bit/archive/refs/heads/master.zip"
unzip master.zip
cd data-ingestion-with-nginx-fluent-bit/
npm install
npx cdk deploy --context targetEnv=Dev
The CDK configuration is in root/cdk.context.json.
Configurations:
- project_name
- cidr: VPC CIDR. Make sure it does not overlap with your other VPCs.
- nginx_http_port
- asg_min_capacity: The minimum capacity setting of the auto-scaling group
- asg_max_capacity: The maximum capacity setting of the auto-scaling group
- asg_desired_capacity: The desired capacity setting of the auto-scaling group
- ingestion_output_target_s3: Fluent Bit outputs logs to S3
- ingestion_output_target_kinesis_stream: Fluent Bit outputs logs to Kinesis
- ingestion_output_target_msk: Fluent Bit outputs logs to MSK (Kafka)
- fluentbit_log_level
- certificate_arn: ACM ARN of the domain SSL/TLS certificate
- kafka_connect_key_id: Only valid when "ingestion_output_target_msk" is set to true
- kafka_connect_secret: Only valid when "ingestion_output_target_msk" is set to true
- enable_data_enrich: Enable the Lambda function to process log files on S3
- enable_data_enrich_with_sqs: Enable S3 to send object creation events to SQS for the Lambda function
- bq_project_name: The BigQuery project name of the final destination
- bq_dataset: The BigQuery dataset name of the final destination
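Some of these settings constrain each other, so it can help to sanity-check them before deploying. The sketch below is not part of the project. It assumes you have already loaded the settings for your target environment into a flat Python dict (the exact layout of root/cdk.context.json is not shown here), that capacities are integers, and that the output targets are booleans. `check_context` is a hypothetical helper name.

```python
import ipaddress
from typing import Any, Dict, Iterable, List


def check_context(settings: Dict[str, Any],
                  other_vpc_cidrs: Iterable[str] = ()) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    # Auto Scaling requires min <= desired <= max.
    low = settings["asg_min_capacity"]
    desired = settings["asg_desired_capacity"]
    high = settings["asg_max_capacity"]
    if not low <= desired <= high:
        problems.append(
            "asg_desired_capacity must lie between asg_min_capacity and asg_max_capacity"
        )

    # The VPC CIDR should not overlap with CIDRs of your other VPCs.
    vpc = ipaddress.ip_network(settings["cidr"])
    for other in other_vpc_cidrs:
        if vpc.overlaps(ipaddress.ip_network(other)):
            problems.append(f"cidr {vpc} overlaps with {other}")

    # The Kafka Connect settings only take effect when the MSK target is enabled.
    if not settings.get("ingestion_output_target_msk"):
        for key in ("kafka_connect_key_id", "kafka_connect_secret"):
            if settings.get(key):
                problems.append(
                    f"{key} is set but ingestion_output_target_msk is not true"
                )

    return problems
```

For example, `check_context(settings, ["10.1.0.0/16"])` returns an empty list when the capacities are ordered correctly and the VPC CIDR does not overlap with `10.1.0.0/16`.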
After the stack is created, CDK outputs an Elastic Load Balancer endpoint. You can send POST requests to the endpoint on the configured port (nginx_http_port) at the path /log. The endpoint will be an HTTPS endpoint when you provide the ACM certificate ARN (certificate_arn) in root/cdk.context.json.
For example:
curl -d "log-data" http://elb-endpoint:[port]/log
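The same request can be sent from a script. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and the host and port are placeholders for your own ELB endpoint. It sets the same default content type that `curl -d` uses.

```python
import urllib.request


def log_url(host: str, port: int, https: bool = False) -> str:
    """Build the /log URL for the ELB endpoint on the configured port."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}:{port}/log"


def build_log_request(url: str, payload: str) -> urllib.request.Request:
    """Create a POST request carrying the payload as the request body."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        # Same default content type that `curl -d` sends.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send_log(url: str, payload: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(build_log_request(url, payload), timeout=timeout) as resp:
        return resp.status
```

For example, `send_log(log_url("elb-endpoint", 80), "log-data")` posts `log-data` to `http://elb-endpoint:80/log`; replace the host and port with the values from your stack output and configuration.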