
areg-sdk's Introduction

Data Engineering Capstone Project

A core responsibility of the National Travel and Tourism Office (NTTO) is to collect, analyze, and disseminate international travel and tourism statistics.

Introduction

NTTO's Board of Managers is charged with managing, improving, and expanding the system to fully account for and report the impact of travel and tourism in the United States. The analysis results help with forecasting and operations and support decision making; they create a positive climate for growth in travel and tourism by reducing institutional barriers to tourism, administering joint marketing efforts, providing official travel and tourism statistics, and coordinating efforts across federal agencies.

Project Description

In this project, the following source datasets are used for data modeling:

  • I94 Immigration: The source data for I94 immigration is available on the local disk in sas7bdat format. This data comes from the US National Tourism and Trade Office. The data dictionary is also included in this project for reference. The original source of the data is https://travel.trade.gov/research/reports/i94/historical/2016.html. This data is already uploaded to the workspace.

  • World Temperature Data: This dataset came from Kaggle. This data is already uploaded to the workspace.

  • Airport Code: This is a simple table of airport codes. The source of this data is https://datahub.io/core/airport-codes#data. It is recommended to use this data for educational purposes only, not for commercial or any other purpose. This data is already uploaded to the workspace.

  • Other text files, such as I94Addr.txt, I94CIT_I94RES.txt, I94Mode.txt, I94Port.txt and I94Visa.txt, are used to enrich the immigration data for better analysis. These files are created from the I94_SAS_Labels_Descriptions.SAS file provided to describe every field in the immigration data.

The project follows these steps:

  • Step 1: Scope the Project and Gather Data
  • Step 2: Explore and Assess the Data
  • Step 3: Define the Data Model
  • Step 4: Run ETL to Model the Data
  • Step 5: Complete Project Write Up

Table of contents

  1. Step 1: Scope the Project and Gather Data
  2. Step 2: Explore and Assess the Data
  3. Step 3: Define the Data Model
  4. Step 4: Run Pipelines to Model the Data
  5. Step 5: Complete Project Write Up
  6. Final: Contribution Thanks
  7. Examples
  8. Licensing
  9. Call for action

Step 1: Scope the Project and Gather Data

Data Volume Assessment

  • Create a config file etl.cfg to set the basic configuration:
[DIR]
INPUT_DIR = .
OUTPUT_DIR = ./storage

[DATA]
I94_IMMI = ../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat
WORLD_TEMPE = ../../data2/GlobalLandTemperaturesByCity.csv
CITY_DEMOGRAPHIC = ./us-cities-demographics.csv
AIR_PORT = ./airport-codes_csv.csv

[SPLIT]
I94_IMMI_SPLITED_DIR = ./storage/.sas7bdat
WORLD_TEMPE_SPLITED_DIR = ./storage/.csv
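
For illustration only, the sketch below reads this etl.cfg layout using nothing but the C++ standard library, to match the code style used elsewhere in this README; in an actual ETL pipeline the configuration would more likely be read with a dedicated config library of the chosen language. The section and key names are the ones shown above, while the readConfig helper itself is hypothetical.

#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Minimal INI-style reader for etl.cfg (illustrative sketch only).
// Returns a map keyed by "SECTION.KEY", e.g. "DATA.I94_IMMI".
static std::map<std::string, std::string> readConfig(const std::string & path)
{
    std::map<std::string, std::string> values;
    std::ifstream file(path);
    std::string line, section;
    while (std::getline(file, line))
    {
        // Trim leading and trailing whitespace; skip blank lines.
        const auto first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos)
            continue;
        const auto last = line.find_last_not_of(" \t\r");
        line = line.substr(first, last - first + 1);

        if (line.front() == '[' && line.back() == ']')
        {
            section = line.substr(1, line.size() - 2);      // e.g. "DIR", "DATA", "SPLIT"
        }
        else if (const auto eq = line.find('='); eq != std::string::npos)
        {
            std::string key   = line.substr(0, eq);
            std::string value = line.substr(eq + 1);
            key.erase(key.find_last_not_of(" \t") + 1);      // trim key
            value.erase(0, value.find_first_not_of(" \t"));  // trim value
            values[section + "." + key] = value;
        }
    }
    return values;
}

int main()
{
    const auto cfg = readConfig("etl.cfg");
    std::cout << "I94 data   : " << cfg.at("DATA.I94_IMMI")  << std::endl;
    std::cout << "Output dir : " << cfg.at("DIR.OUTPUT_DIR") << std::endl;
    return 0;
}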


Data Attributions Assessment


Scope the Project

Traditionally, devices are connected to clients to stream data to the cloud or fog servers for further processing.

IoT-to-Cloud (Nebula) network


Step 2: Explore and Assess the Data

When we were designing the AREG SDK, the guiding principle was to provide a homogeneous solution for multithreading, multiprocessing and internet communication, wrapped in services that fall into Local, Public and Internet categories. These services are neither processes nor tasks managed by the operating system; they are software components with a predefined interface whose methods are invoked remotely.
AREG SDK distributed services

💡 In the current version, the AREG engine handles multithreading (Local) and multiprocessing (Public) communication.

The AREG engine forms a fault-tolerant system, automatically discovers services, automates communication, simplifies distributed programming, and helps developers focus on application business logic as if they were programming a single-process application with one thread in which object methods are event-driven. The engine guarantees that:

  • The crash of one application does not cause the crash of the system.
  • The service clients are automatically notified about service availability status.
  • The client requests are automatically invoked to run on the service component.
  • The service responses are automatically invoked on the exact client, and they are not mixed or missed.
  • The subscriptions on data, responses and broadcasts are automatically invoked on the client when the service triggers a call.

Step 3: Define the Data Model

AREG SDK consists of:

  1. Multicast router (mcrouter), used for IPC. It runs either as a service managed by the OS or as a console application.
  2. AREG framework (or engine), a library (shared or static) linked into every application.
  3. Code generator tool to create client and server base objects from a service prototype document.

The framework contains a dynamic and configurable logging service. More tools and features are planned for future releases.


Step 4: Run Pipelines to Model the Data

An example of getting the sources and compiling them under Linux. You need at least a C++17 compiler (g++ is the default). Open a Terminal console in your projects folder and take the following steps:

# Step 1: Get sources from GitHub
$ git clone https://github.com/aregtech/areg-sdk.git
$ cd areg-sdk
# Step 2: Compile sources from terminal by calling: make [all] [framework] [examples]
$ make all

After compilation, the binaries are located in <areg-sdk>/product/build/<compiler-platform-path>/bin folder.

AREG SDK sources are developed for:

Supported OS: Linux (POSIX API), Windows 8 and higher.
Supported CPU: x86, x86_64, arm and aarch64.
Supported compilers: C++17 GCC, g++, clang and MSVC.

💡 Other POSIX-compliant operating systems and compilers have not been tested yet.

Compile AREG SDK sources and examples:

Quick actions to use the tools and compile, by operating system:

  • Linux or Windows: Import the projects in Eclipse to compile with the POSIX API (you may need to change the Toolchain).
  • Windows: Open the areg-sdk.sln file in MS Visual Studio (VS2019 and higher) to compile with the Win32 API.
  • Linux: Open gnome-terminal and call make to compile with the POSIX API.

💡 Compilation with Eclipse under Windows might require switching the Toolchain, for example to Cygwin GCC.
💡 For Linux the default compiler is g++. Set the preferred C++17 compiler in the conf/make/user.mk file.

Details on how to change compiler, load and compile sources for various targets are described in HOWTO.


Step 5: Complete Project Write Up

Multicast router

Configure router.init file to set the IP-address and the port of multicast router:

connection.address.tcpip    = 127.0.0.1	# the address of mcrouter host
connection.port.tcpip       = 8181      # the connection port of mcrouter

The multicast router forms the network and can run on any device. For example, in the case of M2M communication it can run on a gateway; in the case of IPC it can run on the same machine. For multithreading application development there is no need to configure router.init or to run mcrouter.

Logging service

Configure log.init to set scopes, log priorities and log file name:

log.file        = %home%/logs/%appname%_%time%.log # create logs in 'log' subfolder of user home 
scope.mcrouter.*= NOTSET ;                         # disable logs for mcrouter.

scope.my_app.*                   = DEBUG | SCOPE ; # enable all logs of my_app
scope.my_app.ignore_this_scope   = NOTSET ;        # disable logs of certain scopes in my_app
scope.my_app.ignore_this_group_* = NOTSET ;        # disable logs of certain scope group in my_app

💡 By default, the router.init and log.init files are located in the config subfolder of the binaries.
💡 To enable all logs of all applications, use scope.* = DEBUG | SCOPE ;.
💡 In the current version, logging is possible only to a file.

Development

The development guidance and step-by-step example to create a simple service-enabled application are described in DEVELOP.


Final: Contribution Thanks

The AREG SDK can be used for a very wide range of multithreaded and multiprocess applications running on Linux or Windows machines.

Distributed solution

AREG SDK is a distributed computing solution, where services can be distributed and run on any node of the network. Automatic service discovery makes the service location transparent, so that applications interact as if the components were located in one process. Developers define a model, which is a description of the service relationships, and load it at runtime to start the services. The services can easily be distributed between multiple processes.

The following is a demonstration of a static model description, which is loaded to start services and unloaded to stop them.

// main.cpp source file.

// Defines static model with 2 services
BEGIN_MODEL(NECommon::ModelName)

    BEGIN_REGISTER_THREAD( "Thread1" )
        BEGIN_REGISTER_COMPONENT( "RemoteRegistry", RemoteRegistryService )
            REGISTER_IMPLEMENT_SERVICE( NERemoteRegistry::ServiceName, NERemoteRegistry::InterfaceVersion )
        END_REGISTER_COMPONENT( "RemoteRegistry" )
    END_REGISTER_THREAD( "Thread1" )

    BEGIN_REGISTER_THREAD( "Thread2" )
        BEGIN_REGISTER_COMPONENT( "SystemShutdown", SystemShutdownService )
            REGISTER_IMPLEMENT_SERVICE( NESystemShutdown::ServiceName, NESystemShutdown::InterfaceVersion )
        END_REGISTER_COMPONENT( "SystemShutdown" )
    END_REGISTER_THREAD( "Thread2" )

END_MODEL(NECommon::ModelName)

// the main function
int main()
{
    // Initialize application, enable logging, servicing and the timer.
    Application::initApplication(true, true, true, true, nullptr, nullptr );

    // load model to start service components
    Application::loadModel(NECommon::ModelName);

    // wait until Application quit signal is set.
    Application::waitAppQuit(NECommon::WAIT_INFINITE);

    // stop and unload service components
    Application::unloadModel(NECommon::ModelName);

    // release and cleanup resources of application.
    Application::releaseApplication();

    return 0;
}

In the example, "RemoteRegistry" and "SystemShutdown" are the names of the components, called roles, and NERemoteRegistry::ServiceName and NESystemShutdown::ServiceName are the interface names. In combination, they define the service name used to access the service in the network. These macros create the static model NECommon::ModelName; the services start when Application::loadModel(NECommon::ModelName) is called and stop when Application::unloadModel(NECommon::ModelName) is called.

In this example, the services can be merged into one thread or distributed across two processes by defining a model in each process (see the sketch after this paragraph). Independent of service location, neither software developers nor service client objects notice a difference, except for possible slight network latency when running IPC. The services must have unique names within their scope of visibility: for Public services the names are unique within the network, and for Local services the names are unique within the process. An example of developing a service and a client in one process and in multiple processes is given in the Hello Service! project described in the development guide.
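
As an illustration of that split, the sketch below defines one of the two possible per-process models. It reuses only the macros and Application calls shown in the example above; the component classes, service names and NECommon constants are assumed to be the same ones, and the includes are omitted exactly as in that example.

// process 1: main.cpp -- runs only the "RemoteRegistry" service (illustrative sketch)

// Model of this process: one thread, one component.
BEGIN_MODEL(NECommon::ModelName)

    BEGIN_REGISTER_THREAD( "Thread1" )
        BEGIN_REGISTER_COMPONENT( "RemoteRegistry", RemoteRegistryService )
            REGISTER_IMPLEMENT_SERVICE( NERemoteRegistry::ServiceName, NERemoteRegistry::InterfaceVersion )
        END_REGISTER_COMPONENT( "RemoteRegistry" )
    END_REGISTER_THREAD( "Thread1" )

END_MODEL(NECommon::ModelName)

int main()
{
    // Same application lifecycle as in the single-process example above.
    Application::initApplication(true, true, true, true, nullptr, nullptr );
    Application::loadModel(NECommon::ModelName);        // starts only RemoteRegistry in this process
    Application::waitAppQuit(NECommon::WAIT_INFINITE);
    Application::unloadModel(NECommon::ModelName);      // stops the service
    Application::releaseApplication();
    return 0;
}

// process 2: an identical main.cpp whose model registers "SystemShutdown" / SystemShutdownService
// with NESystemShutdown::ServiceName instead; mcrouter connects the two processes at runtime.

Because the role and interface names stay the same, client code addressing these services does not change when the services move between threads or processes.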

Driverless devices

Normally, devices are supplied with drivers to install in the system and with header files to integrate into the application(s). Drivers often run in kernel mode, and a driver crash can bring down the entire system. Driver development requires special techniques that differ for each operating system, and drivers are hard to debug.
Kernel-mode driver solution
Our proposal is to deliver driverless service-enabled devices, where device-specific services are described in the interface prototype documents.
AREG SDK driverless solution
In contrast to drivers, service development does not differ from user-mode application development: it is faster, easily serves multiple applications (service clients), carries fewer risks and requires fewer development resources. The client object generated from the supplied service interface prototype document is easily integrated into the application to communicate with and trigger the device-specific service(s).

Real-time solutions

When a remote method of the service interface is called, the engine of the AREG SDK immediately generates and delivers messages to the target component, which invokes the appropriate methods of the addressed service. This makes the communication real-time with ultra-low networking latency. Such solutions are in high demand for time-sensitive applications in automotive, drone fleets, medtech, real-time manufacturing, real-time monitoring and other projects.
AREG SDK and multicast features

Digital twin

Often, digital twin applications use a client-server architecture, where a middleware server collects the data of external devices and a UI application virtualizes them. In such solutions, devices interact either through the server or through UI client applications. The event-driven, service-oriented architecture and real-time communication of the AREG SDK make it well suited to digital twin applications that virtualize, monitor and control external devices and react immediately to environment or device state changes. External devices may also communicate without additional layers, which is an important factor for emergency, security and safety cases.

Simulation and test automations

When provisioning hardware to all employees is impossible, testing rapidly changing software and investigating unexpected behavior in a simulated environment can be the most rational solution. While unit tests are used by developers to test small portions of code, simulations are used by developers and testers to check the system's functionality and stability. Simulations are portable and accessible to everyone, help to optimize solutions and avoid unnecessary risks. Projects using simulations are better prepared for remote work and easier to outsource.
Software application 4 layers
The software components of an application are normally split into Data, Controller, Business and optional Presentation layers. The distributed, service-oriented design of the AREG engine eases system testing in a simulated environment, where a Simulation application provides the implementation of the Data layer services, so that the rest of the application can be tested without any change.

The same technique of simulating data can be used to create API-driven test automations.

