This monorepo contains: a feature extraction pipeline that runs on the Spark data processing framework; a Scala application that acts as a web client for an instance of the Weaviate vector database; a Java application for consuming an API that provides access to zone files for generic top-level domains; a web application that lets users perform similarity search across a dataset of newly registered domain names by querying the vector database; and an application for provisioning the required cloud infrastructure.
- What It Does
- Local Setup
- Directory Structure
- Run Job in Local Development Environment
- Deployment
- Run Job in Production Environment
This is a feature extraction pipeline written in Scala that uses Spark's machine learning library to generate feature vectors for a text corpus consisting of domain names.
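The featurisation itself is performed by Spark ML stages, but the underlying idea can be sketched without Spark. The snippet below is a hypothetical plain-Java illustration of the hashing trick applied to character trigrams of a domain name; the class name, bucket count, and trigram choice are illustrative assumptions, not taken from the actual pipeline.

```java
import java.util.Arrays;

// Illustrative sketch only: the real pipeline uses Spark ML stages.
// Maps a domain name to a fixed-length feature vector by hashing its
// character trigrams into buckets (the "hashing trick").
public class DomainFeatures {
    static double[] featurize(String domain, int numBuckets) {
        double[] vector = new double[numBuckets];
        String name = domain.toLowerCase();
        for (int i = 0; i + 3 <= name.length(); i++) {
            String trigram = name.substring(i, i + 3);
            int bucket = Math.floorMod(trigram.hashCode(), numBuckets);
            vector[bucket] += 1.0; // term-frequency count per bucket
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(featurize("example.com", 16)));
    }
}
```

A fixed-length vector like this is what allows nearest-neighbour search over an arbitrary corpus of domain names, regardless of name length.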
If your system runs on the Apple Silicon M1 processor, disable fork safety by adding the following line to your shell initialisation file.

```shell
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```
The Hadoop-AWS module requires access to AWS credentials with permission to write data to S3. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
To start a local instance of the Weaviate vector database using Docker Compose, paste the Compose specification below into a local `docker-compose.yml` file and run `docker-compose up` from the directory containing the file. The database server will be reachable at `localhost:8080`.
```yaml
version: "3.4"
services:
  weaviate:
    image: semitechnologies/weaviate:1.14.0
    ports:
      - 8080:8080
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"
```
📦spark
┣ 📂lib
┃ ┗ 📜weaviate-spark-connector-assembly-v0.1.2.jar
┣ 📂project
┣ 📂target
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┗ 📂scala
┃ ┃ ┃ ┗ 📂spark
┃ ┃ ┃ ┃ ┗ 📜DomainNames.scala
┣ 📜.gitignore
┣ 📜Dockerfile
┗ 📜build.sbt
Execute the following command from the `spark` directory to launch a Spark session in standalone mode and run the `DomainNames` job in the same virtual machine as sbt.

```shell
sbt "run <zone-file-uri> <bucket-name> <weaviate-host>"
```
The placeholder values for the script arguments refer to:
- a path to a TXT file generated by the Zone File Client application documented below;
- the name of an S3 bucket for storing the serialised model that can be accessed using AWS credentials configured for your local environment; and
- the hostname for a local or remote instance of the Weaviate vector database.
An assembly jar containing the project files and dependencies needs to be uploaded to an S3 bucket from where it can be downloaded by the EMR Serverless application when running in production. The AWS CDK app takes care of bundling the source code and uploading the deployment artifact. To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy NameRingersEMRServerlessStack`. See the AWS CDK app section for details of how to set up the AWS CDK Toolkit. The AWS CDK app outputs the ID of the EMR Serverless application created by the CloudFormation stack, the ARN for the IAM execution role, and S3 URIs for the assembly jar and logs folder.
The following is an example of how to submit a job to the EMR Serverless application deployed by the AWS CDK app using the AWS CLI. The placeholder values should be replaced with the values output by the CDK app after deployment of the `NameRingersEMRServerlessStack` and `WeaviateStack` stacks.
```shell
aws emr-serverless start-job-run \
    --region eu-west-1 \
    --application-id <application-ID> \
    --execution-role-arn <role-ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "<assembly-jar-URI>",
            "entryPointArguments": ["<zone-file-URI>", "<bucket-name>", "<weaviate-host>"],
            "sparkSubmitParameters": "--class com.nameringers.spark.DomainNames --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.amzn2.0.1.x86_64 --conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.amzn2.0.1.x86_64 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=2 --conf spark.driver.cores=2 --conf spark.executor.cores=2"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "<logs-URI>"
            }
        }
    }'
```
The `spark.emr-serverless.driverEnv.JAVA_HOME` and `spark.executorEnv.JAVA_HOME` configuration properties point to the JDK installation path inside an image stored in ECR that packages the specific JDK version required by the project. The `spark` directory contains a `Dockerfile` with instructions for building the JDK 11 image.
- What It Does
- Local Setup
- Directory Structure
- Invoke Lambda Function in Local Development Environment
- Deployment
- Invoke Lambda REST API in Production Environment
This is a Lambda function written in Scala that acts as a web client for the database server. The Lambda proxy integration type is used to integrate a REST API with the Lambda function. The function handler method translates the query parameters passed via the REST API into a GraphQL query that is submitted to the GraphQL endpoint exposed by the database server. Search queries are transformed into feature vectors using a fitted pipeline generated by the Spark application; the pipeline is shared via a Lambda layer in the MLeap serialisation format and run using the MLeap execution engine.
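As a rough illustration of the handler's translation step, the sketch below builds a Weaviate-style GraphQL `nearVector` query from a feature vector and a distance threshold. It is written in plain Java for self-containment; the class name `Domain` and field `name` are assumptions for illustration, and the real Scala handler and schema may differ.

```java
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative sketch: builds a GraphQL nearVector query string of the
// general shape Weaviate accepts. "Domain" and "name" are assumed names.
public class GraphQLQueryBuilder {
    static String buildNearVectorQuery(double[] vector, double distance, int limit) {
        // Render the vector as a GraphQL list literal, e.g. [0.100000, 0.200000]
        String vectorLiteral = IntStream.range(0, vector.length)
                .mapToObj(i -> String.format(Locale.ROOT, "%.6f", vector[i]))
                .collect(Collectors.joining(", ", "[", "]"));
        return String.format(Locale.ROOT,
                "{ Get { Domain(nearVector: { vector: %s, distance: %.2f }, limit: %d) { name } } }",
                vectorLiteral, distance, limit);
    }

    public static void main(String[] args) {
        System.out.println(buildNearVectorQuery(new double[] {0.12, 0.34, 0.56}, 0.5, 25));
    }
}
```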
The SAM CLI requires access to AWS credentials with permission to pull Lambda layer versions. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
📦weaviate-client
┣ 📂project
┣ 📂target
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┣ 📂java
┃ ┃ ┗ 📂scala
┃ ┃ ┃ ┗ 📂client
┃ ┃ ┃ ┃ ┗ 📜ScalaHandler.scala
┣ 📜Dockerfile
┣ 📜build.sbt
┣ 📜event.json
┗ 📜template.yaml
To invoke the Lambda function locally using the AWS SAM CLI, run `sam local invoke -e event.json` from the `weaviate-client` directory. The ARN for the Lambda layer used to share the serialised Spark model will need to be added to the SAM template file along with a value for the `WEAVIATE_ENDPOINT` environment variable.
The AWS CDK app takes care of bundling the project files and dependencies into an assembly jar for deployment to Lambda. The CDK app also creates a new version of the Lambda layer used for sharing the serialised model trained by the Spark application. To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy WeaviateClientStack`.
The following example shows how to invoke the API that routes HTTP requests to the Lambda function that interfaces with the database server.
```shell
curl \
    -X GET \
    -G \
    -d query=test \
    -d distance=0.5 \
    https://<API-ID>.execute-api.eu-west-1.amazonaws.com/prod
```
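The `distance` parameter bounds how dissimilar a result may be from the query vector. As a rough sketch, assuming Weaviate's cosine distance metric (defined as 1 minus cosine similarity), the quantity being thresholded can be computed as:

```java
// Sketch of the cosine distance that the `distance` query parameter
// thresholds: 0 means identical direction, 1 means orthogonal vectors.
// Assumes the index uses the cosine metric; other metrics behave differently.
public class CosineDistance {
    static double distance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(distance(new double[] {1, 0}, new double[] {1, 1}));
    }
}
```

With `distance=0.5`, only results whose vectors lie within this distance of the query vector are returned.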
- What It Does
- Local Setup
- Directory Structure
- Deployment
- Invoke Lambda Function in Production Environment
This is a Lambda function written in Java that fetches compressed TXT files containing domain names extracted from the zone files for generic top-level domains, decompresses the data, and uploads the results to an S3 bucket for retrieval by the feature extraction pipeline. To ensure that the Lambda memory limit is not exceeded as a result of decompressing large files, data are written to an EFS volume mounted onto the function rather than the ephemeral Lambda file system.
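The decompression step can be sketched in plain Java. The snippet below assumes gzip compression (the actual archive format used by the zone file API is not specified here) and streams the data in small chunks to the target path, so the full file never needs to be held in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: streams gzip-compressed data to a file in 8 KiB
// chunks. In the Lambda, the target path would sit on the mounted EFS volume.
public class ZoneFileDecompressor {
    static long decompressTo(InputStream compressed, Path target) throws IOException {
        long bytesWritten = 0;
        try (GZIPInputStream gzip = new GZIPInputStream(compressed);
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                bytesWritten += read;
            }
        }
        return bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip demo: compress a small payload, then stream-decompress it.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write("example.com\n".getBytes(StandardCharsets.UTF_8));
        }
        Path tmp = Files.createTempFile("zone", ".txt");
        long written = decompressTo(new ByteArrayInputStream(bos.toByteArray()), tmp);
        System.out.println("bytes written: " + written);
    }
}
```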
📦zonefile-client
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┗ 📂java
┃ ┃ ┃ ┗ 📂com
┃ ┃ ┃ ┃ ┗ 📂nameringers
┃ ┃ ┃ ┃ ┃ ┗ 📂zonefile
┃ ┃ ┃ ┃ ┃ ┃ ┣ 📜App.java
┃ ┃ ┃ ┃ ┃ ┃ ┗ 📜DependencyFactory.java
┣ 📂target
┣ 📜.gitignore
┣ 📜pom.xml
┗ 📜template.yaml
The AWS CDK app takes care of bundling the project files and dependencies into an assembly jar for deployment to Lambda. To deploy the application using the AWS CDK Toolkit, change the current working directory to `cdk` and run `cdk deploy ZoneFileStack`.
The following example shows how to invoke the Lambda function using the AWS CLI. The function output will be saved to a file named `response.json`.

```shell
aws lambda invoke \
    --function-name <function-name> \
    --payload '{ "zone": "com" }' \
    response.json
```
This is a React-based web application architected with the Next.js framework that allows users to perform similarity search across a dataset consisting of newly registered domain names by querying an index of vectors stored in an instance of the Weaviate vector database.
Install the Node dependencies by running `npm install` from the `web-app` directory. Run `npm run dev` to start the local development server. By default the server is started on port 3000. Navigate to http://localhost:3000 to view the site in a web browser.
To create a file for storing environment variables used by the CDK application during deployment of the web application, change the current working directory to `web-app/cdk` and run `cp .env.example .env`.
📦web-app
┣ 📂cdk
┃ ┣ 📂bin
┃ ┃ ┗ 📜cdk.ts
┃ ┣ 📂lib
┃ ┃ ┗ 📜web-app-stack.ts
┃ ┣ 📜.env.example
┃ ┣ 📜.eslintrc.json
┃ ┣ 📜.gitignore
┃ ┣ 📜.npmignore
┃ ┣ 📜.prettierrc
┃ ┣ 📜cdk.context.json
┃ ┣ 📜cdk.json
┃ ┣ 📜jest.config.js
┃ ┣ 📜package-lock.json
┃ ┣ 📜package.json
┃ ┗ 📜tsconfig.json
┣ 📂nexjs-app
┃ ┣ 📂components
┃ ┃ ┣📜SearchForm.tsx
┃ ┃ ┗📜...
┃ ┣ 📂context
┃ ┃ ┗📜domainsContext.tsx
┃ ┣ 📂hooks
┃ ┃ ┗📜useDomainsObserver.tsx
┃ ┣ 📂pages
┃ ┃ ┣ 📜_app.tsx
┃ ┃ ┣ 📜_document.tsx
┃ ┃ ┗ 📜index.tsx
┃ ┣ 📂public
┃ ┃ ┣ 📜favicon.ico
┃ ┃ ┗ 📜robots.txt
┃ ┣ 📂styles
┃ ┃ ┗ 📜globals.css
┃ ┣ 📂types
┃ ┃ ┗ 📜index.ts
┃ ┣ 📜.eslintrc.json
┃ ┣ 📜.gitignore
┃ ┣ 📜.prettierrc.json
┃ ┣ 📜next-env.d.ts
┃ ┣ 📜next.config.js
┃ ┣ 📜package-lock.json
┃ ┣ 📜package.json
┃ ┣ 📜postcss.config.js
┃ ┣ 📜tailwind.config.js
┃ ┗ 📜tsconfig.json
The project is deployed via an AWS CDK application located in the `web-app/cdk` directory. The CDK app takes care of bundling the project files using the standalone output build mode for deployment to Lambda. To deploy the application using the AWS CDK Toolkit, change the current working directory to `web-app/cdk` and run `cdk deploy NameRingersWebFrontendStack`.
The AWS Cloud Development Kit (CDK) is a framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. This is an AWS CDK application that defines the cloud infrastructure required by the services contained in this repository.
To install the CDK Toolkit (a CLI tool for interacting with a CDK app) using the Node Package Manager, run the command `npm install -g aws-cdk`. The CDK Toolkit needs access to AWS credentials. Access to your credentials can be configured using the AWS CLI by running `aws configure` and following the prompts.
Install the Node dependencies by running `npm install` from the `cdk` directory.
To create a file for storing environment variables, run `cp .env.example .env`.
📦cdk
┣ 📂src
┃ ┣ 📂main
┃ ┃ ┗ 📂java
┃ ┃ ┃ ┗ 📂com
┃ ┃ ┃ ┃ ┗ 📂nameringers
┃ ┃ ┃ ┃ ┃ ┣ 📜CdkApp.java
┃ ┃ ┃ ┃ ┃ ┣ 📜EMRServerlessStack.java
┃ ┃ ┃ ┃ ┃ ┣ 📜WeaviateClientStack.java
┃ ┃ ┃ ┃ ┃ ┣ 📜WeaviateFileSystemStack.java
┃ ┃ ┃ ┃ ┃ ┣ 📜WeaviateStack.java
┃ ┃ ┃ ┃ ┃ ┗ 📜ZoneFileStack.java
┣ 📂target
┣ 📜.env.example
┣ 📜.gitignore
┣ 📜cdk.context.json
┣ 📜cdk.json
┣ 📜pom.xml
┗ 📜weaviate.Dockerfile
To deploy all the stacks defined by the application, change the current working directory to `cdk` and run `cdk deploy --all`.