This monorepo contains: a feature extraction pipeline that runs on the Spark data processing framework; a Scala application that acts as a web client for an instance of the Weaviate vector database; a Java application for consuming an API that provides access to zone files for generic top-level domains; a web application that lets users perform similarity search across a dataset of newly registered domain names by querying the vector database; and an application for provisioning the required cloud infrastructure.
- What It Does
- Local Setup
- Directory Structure
- Run Job in Local Development Environment
- Deployment
- Run Job in Production Environment
This is a feature extraction pipeline written in Scala that uses Spark's machine learning library to generate feature vectors for a text corpus consisting of domain names.
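The featurisation itself is performed by Spark ML stages, but the underlying idea can be sketched without Spark. The snippet below is a hypothetical plain-Java illustration of the hashing trick applied to character trigrams of a domain name; the class name, bucket count, and trigram choice are illustrative assumptions, not taken from the actual pipeline.

```java
import java.util.Arrays;

// Illustrative sketch only: the real pipeline uses Spark ML stages.
// Maps a domain name to a fixed-length feature vector by hashing its
// character trigrams into buckets (the "hashing trick").
public class DomainFeatures {
    static double[] featurize(String domain, int numBuckets) {
        double[] vector = new double[numBuckets];
        String name = domain.toLowerCase();
        for (int i = 0; i + 3 <= name.length(); i++) {
            String trigram = name.substring(i, i + 3);
            int bucket = Math.floorMod(trigram.hashCode(), numBuckets);
            vector[bucket] += 1.0; // term-frequency count per bucket
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(featurize("example.com", 16)));
    }
}
```

A fixed-length vector like this is what allows nearest-neighbour search over an arbitrary corpus of domain names, regardless of name length.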
If your system runs on the Apple Silicon M1 processor, disable fork safety by adding the following line to your shell initialisation file.

```shell
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```
The Hadoop-AWS module requires access to AWS credentials with permission to write data to S3. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
To start a local instance of the Weaviate vector database using Docker Compose, paste the Compose specification below into a local `docker-compose.yml` file and run `docker-compose up` from the directory containing the file. The database server will be reachable at `localhost:8080`.
```yaml
version: "3.4"
services:
  weaviate:
    image: semitechnologies/weaviate:1.14.0
    ports:
      - 8080:8080
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"
```
📦spark
┣ 📂lib
┃ ┗ 📜weaviate-spark-connector-assembly-v0.1.2.jar
┣ 📂project
┣ 📂target
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┗ 📂scala
┃ ┃ ┃ ┗ 📂spark
┃ ┃ ┃ ┃ ┗ 📜DomainNames.scala
┣ 📜.gitignore
┣ 📜Dockerfile
┗ 📜build.sbt
Execute the following command from the `spark` directory to launch a Spark session in standalone mode and run the `DomainNames` job in the same virtual machine as sbt.

```shell
sbt "run <zone-file-uri> <bucket-name> <weaviate-host>"
```
The placeholder values for the script arguments refer to:
- a path to a TXT file generated by the Zone File Client application documented below;
- the name of an S3 bucket for storing the serialised model that can be accessed using AWS credentials configured for your local environment; and
- the hostname for a local or remote instance of the Weaviate vector database.
An assembly jar containing the project files and dependencies needs to be uploaded to an S3 bucket from where it can be downloaded by the EMR Serverless application when running in production. The AWS CDK app takes care of bundling the source code and uploading the deployment artifact. To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy NameRingersEMRServerlessStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit. The AWS CDK app outputs the ID of the EMR Serverless application created by the CloudFormation stack, the ARN for the IAM execution role, and S3 URIs for the assembly jar and logs folder.
The following is an example of how to submit a job to the EMR Serverless application deployed by the AWS CDK app using the AWS CLI. The placeholder values should be replaced with the values output by the CDK app after deployment of the `NameRingersEMRServerlessStack` and `WeaviateStack` stacks.
```shell
aws emr-serverless start-job-run \
    --region eu-west-1 \
    --application-id <application-ID> \
    --execution-role-arn <role-ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "<assembly-jar-URI>",
            "entryPointArguments": ["<zone-file-URI>", "<bucket-name>", "<weaviate-host>"],
            "sparkSubmitParameters": "--class com.nameringers.spark.DomainNames --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.amzn2.0.1.x86_64 --conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.amzn2.0.1.x86_64 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=2 --conf spark.driver.cores=2 --conf spark.executor.cores=2"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "<logs-URI>"
            }
        }
    }'
```
The `spark.emr-serverless.driverEnv.JAVA_HOME` and `spark.executorEnv.JAVA_HOME` configuration properties point to the JDK installation path inside an image stored in ECR that packages the specific JDK version required by the project. The `spark` directory contains a `Dockerfile` with instructions for building the JDK 11 image.
- What It Does
- Local Setup
- Directory Structure
- Invoke Lambda Function in Local Development Environment
- Deployment
- Invoke Lambda REST API in Production Environment
This is a Lambda function written in Scala that acts as a web client for the database server. The Lambda proxy integration type is used to integrate a REST API with the Lambda function. The function handler method translates the query parameters passed via the REST API into a GraphQL query that is submitted to the GraphQL endpoint exposed by the database server. Search queries are transformed into feature vectors using a fitted pipeline generated by the Spark application; the pipeline is shared via a Lambda layer in the MLeap serialisation format and run using the MLeap execution engine.
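As a rough illustration of the handler's translation step, the sketch below builds a Weaviate-style GraphQL `nearVector` query from a feature vector and a distance threshold. It is written in plain Java for self-containment; the class name `Domain` and field `name` are assumptions for illustration, and the real Scala handler and schema may differ.

```java
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative sketch: builds a GraphQL nearVector query string of the
// general shape Weaviate accepts. "Domain" and "name" are assumed names.
public class GraphQLQueryBuilder {
    static String buildNearVectorQuery(double[] vector, double distance, int limit) {
        // Render the vector as a GraphQL list literal, e.g. [0.100000, 0.200000]
        String vectorLiteral = IntStream.range(0, vector.length)
                .mapToObj(i -> String.format(Locale.ROOT, "%.6f", vector[i]))
                .collect(Collectors.joining(", ", "[", "]"));
        return String.format(Locale.ROOT,
                "{ Get { Domain(nearVector: { vector: %s, distance: %.2f }, limit: %d) { name } } }",
                vectorLiteral, distance, limit);
    }

    public static void main(String[] args) {
        System.out.println(buildNearVectorQuery(new double[] {0.12, 0.34, 0.56}, 0.5, 25));
    }
}
```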
The SAM CLI requires access to AWS credentials with permission to pull Lambda layer versions. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
📦weaviate-client
┣ 📂project
┣ 📂target
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┣ 📂java
┃ ┃ ┗ 📂scala
┃ ┃ ┃ ┗ 📂client
┃ ┃ ┃ ┃ ┗ 📜ScalaHandler.scala
┣ 📜Dockerfile
┣ 📜build.sbt
┣ 📜event.json
┗ 📜template.yaml
To invoke the Lambda function locally using the AWS SAM CLI, run `sam local invoke -e event.json` from the `weaviate-client` directory. The ARN for the Lambda layer used to share the serialised Spark model will need to be added to the SAM template file along with a value for the `WEAVIATE_ENDPOINT` environment variable.
The AWS CDK app takes care of bundling the project files and dependencies into an assembly jar for deployment to Lambda. The CDK app also creates a new version of the Lambda layer used for sharing the serialised model trained by the Spark application. To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy WeaviateClientStack`.
The following example shows how to invoke the API that routes HTTP requests to the Lambda function that interfaces with the database server.
```shell
curl \
    -X GET \
    -G \
    -d query=test \
    -d distance=0.5 \
    https://<API-ID>.execute-api.eu-west-1.amazonaws.com/prod
```
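The `distance` parameter bounds how dissimilar a result may be from the query vector. As a rough sketch, assuming Weaviate's cosine distance metric (defined as 1 minus cosine similarity), the quantity being thresholded can be computed as:

```java
// Sketch of the cosine distance that the `distance` query parameter
// thresholds: 0 means identical direction, 1 means orthogonal vectors.
// Assumes the index uses the cosine metric; other metrics behave differently.
public class CosineDistance {
    static double distance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(distance(new double[] {1, 0}, new double[] {1, 1}));
    }
}
```

With `distance=0.5`, only results whose vectors lie within this distance of the query vector are returned.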
- What It Does
- Local Setup
- Directory Structure
- Deployment
- Invoke Lambda Function in Production Environment
This is a Lambda function written in Java that fetches compressed TXT files containing domain names extracted from the zone files for generic top-level domains, decompresses the data, and uploads the results to an S3 bucket for retrieval by the feature extraction pipeline. To ensure that the Lambda memory limit is not exceeded as a result of decompressing large files, data are written to an EFS volume mounted onto the function rather than the ephemeral Lambda file system.
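The decompression step can be sketched in plain Java. The snippet below assumes gzip compression (the actual archive format used by the zone file API is not specified here) and streams the data in small chunks to the target path, so the full file never needs to be held in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: streams gzip-compressed data to a file in 8 KiB
// chunks. In the Lambda, the target path would sit on the mounted EFS volume.
public class ZoneFileDecompressor {
    static long decompressTo(InputStream compressed, Path target) throws IOException {
        long bytesWritten = 0;
        try (GZIPInputStream gzip = new GZIPInputStream(compressed);
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                bytesWritten += read;
            }
        }
        return bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip demo: compress a small payload, then stream-decompress it.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write("example.com\n".getBytes(StandardCharsets.UTF_8));
        }
        Path tmp = Files.createTempFile("zone", ".txt");
        long written = decompressTo(new ByteArrayInputStream(bos.toByteArray()), tmp);
        System.out.println("bytes written: " + written);
    }
}
```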
📦zonefile-client
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┗ 📂java
┃ ┃ ┃ ┗ 📂com
┃ ┃ ┃ ┃ ┗ 📂nameringers
┃ ┃ ┃ ┃ ┃ ┗ 📂zonefile
┃ ┃ ┃ ┃ ┃ ┃ ┣ 📜App.java
┃ ┃ ┃ ┃ ┃ ┃ ┗ 📜DependencyFactory.java
┣ 📂target
┣ 📜.gitignore
┣ 📜pom.xml
┗ 📜template.yaml
The AWS CDK app takes care of bundling the project files and dependencies into an assembly jar for deployment to Lambda. To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy ZoneFileStack`.
The following example shows how to invoke the Lambda function using the AWS CLI. The function output will be saved to a file named `response.json`.

```shell
aws lambda invoke \
    --function-name <function-name> \
    --payload '{ "zone": "com" }' \
    response.json
```
This is a React-based web application architected with the Next.js framework that allows users to perform similarity search across a dataset consisting of newly registered domain names by querying an index of vectors stored in an instance of the Weaviate vector database.
Install the Node dependencies by running `npm install` from the `web-app` directory. Run `npm run dev` to start the local development server. By default the server is started on port 3000. Navigate to http://localhost:3000 to view the site in a web browser.
To create a file for storing environment variables used by the CDK application during deployment of the web application, change the current working directory to `web-app/cdk` and run `cp .env.example .env`.
📦web-app
┣ 📂cdk
┃ ┣ 📂bin
┃ ┃ ┗ 📜cdk.ts
┃ ┣ 📂lib
┃ ┃ ┗ 📜web-app-stack.ts
┃ ┣ 📜.env.example
┃ ┣ 📜.eslintrc.json
┃ ┣ 📜.gitignore
┃ ┣ 📜.npmignore
┃ ┣ 📜.prettierrc
┃ ┣ 📜cdk.context.json
┃ ┣ 📜cdk.json
┃ ┣ 📜jest.config.js
┃ ┣ 📜package-lock.json
┃ ┣ 📜package.json
┃ ┗ 📜tsconfig.json
┣ 📂nexjs-app
┃ ┣ 📂components
┃ ┃ ┣📜SearchForm.tsx
┃ ┃ ┗📜...
┃ ┣ 📂context
┃ ┃ ┗📜domainsContext.tsx
┃ ┣ 📂hooks
┃ ┃ ┗📜useDomainsObserver.tsx
┃ ┣ 📂pages
┃ ┃ ┣ 📜_app.tsx
┃ ┃ ┣ 📜_document.tsx
┃ ┃ ┗ 📜index.tsx
┃ ┣ 📂public
┃ ┃ ┣ 📜favicon.ico
┃ ┃ ┗ 📜robots.txt
┃ ┣ 📂styles
┃ ┃ ┗ 📜globals.css
┃ ┣ 📂types
┃ ┃ ┗ 📜index.ts
┃ ┣ 📜.eslintrc.json
┃ ┣ 📜.gitignore
┃ ┣ 📜.prettierrc.json
┃ ┣ 📜next-env.d.ts
┃ ┣ 📜next.config.js
┃ ┣ 📜package-lock.json
┃ ┣ 📜package.json
┃ ┣ 📜postcss.config.js
┃ ┣ 📜tailwind.config.js
┃ ┗ 📜tsconfig.json
The project is deployed via an AWS CDK application located in the `web-app/cdk` directory. The CDK app takes care of bundling the project files using the standalone output build mode for deployment to Lambda. To deploy the application using the AWS CDK Toolkit, change the current working directory to `web-app/cdk` and run `cdk deploy NameRingersWebFrontendStack`.
The AWS Cloud Development Kit (CDK) is a framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. This is an AWS CDK application that defines the cloud infrastructure required by the services contained in this repository.
To install the CDK Toolkit (a CLI tool for interacting with a CDK app) using the Node Package Manager, run the command `npm install -g aws-cdk`. The CDK Toolkit needs access to AWS credentials. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
Install the Node dependencies by running `npm install` from the `cdk` directory.
To create a file for storing environment variables, run `cp .env.example .env`.
📦cdk
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┗ 📂java
┃ ┃ ┃ ┗ 📂com
┃ ┃ ┃ ┃ ┗ 📂nameringers
┃ ┃ ┃ ┃ ┃ ┣ 📜CdkApp.java
┃ ┃ ┃ ┃ ┃ ┣ 📜EMRServerlessStack.java
┃ ┃ ┃ ┃ ┃ ┣ 📜WeaviateClientStack.java
┃ ┃ ┃ ┃ ┃ ┣ 📜WeaviateFileSystemStack.java
┃ ┃ ┃ ┃ ┃ ┣ 📜WeaviateStack.java
┃ ┃ ┃ ┃ ┃ ┗ 📜ZoneFileStack.java
┣ 📂target
┣ 📜.env.example
┣ 📜.gitignore
┣ 📜cdk.context.json
┣ 📜cdk.json
┣ 📜pom.xml
┗ 📜weaviate.Dockerfile
To deploy all the stacks defined by the application, change the current working directory to `cdk` and run `cdk deploy --all`.