Azure Data Factory Integration Runtime in Windows Container Sample

This repo contains a sample for running the Azure Data Factory Integration Runtime in a Windows container.

Supported SHIR version: 5.0 or later

For more information about Azure Data Factory, see https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime

QuickStart

  1. Prepare Windows for containers.
  2. Build the Windows container image in the project folder:

> docker build . -t <image-name> [--build-arg="INSTALL_JDK=true"]

Arguments list

| Name | Necessity | Default | Description |
|------|-----------|---------|-------------|
| INSTALL_JDK | Optional | false | The flag to install Microsoft's JDK 11 LTS. |
  3. Run the container with specific arguments by passing environment variables:
> docker run -d -e AUTH_KEY=<ir-authentication-key> \
    [-e NODE_NAME=<ir-node-name>] \
    [-e ENABLE_HA={true|false}] \
    [-e HA_PORT=<port>] \
    [-e ENABLE_AE={true|false}] \
    [-e AE_TIME=<expiration-time-in-seconds>] \
    <image-name>

Arguments list

| Name | Necessity | Default | Description |
|------|-----------|---------|-------------|
| AUTH_KEY | Required | | The authentication key for the self-hosted integration runtime. |
| NODE_NAME | Optional | hostname | The specified name of the node. |
| ENABLE_HA | Optional | false | The flag to enable high availability and scalability. Up to 4 nodes can register to the same IR when HA is enabled; otherwise only 1 is allowed. |
| HA_PORT | Optional | 8060 | The port used to set up the high availability cluster. |
| ENABLE_AE | Optional | false | The flag to enable offline-node auto-expiration. If enabled, a node is marked as expired once it has been offline for the timeout defined by AE_TIME. |
| AE_TIME | Optional | 600 | The expiration timeout for offline nodes, in seconds. Must be no less than 600 (10 minutes). |
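The documented constraints on these variables (AE_TIME minimum of 600 seconds, HA_PORT default of 8060) can be enforced in a small pre-flight wrapper before calling `docker run`. This is our illustrative sketch, not part of the repo:

```shell
# Apply the documented defaults and clamp AE_TIME to its 600-second minimum
# before handing the values to 'docker run' via -e flags.
AE_TIME="${AE_TIME:-600}"
if [ "$AE_TIME" -lt 600 ]; then
    echo "AE_TIME below the 600-second minimum; clamping to 600" >&2
    AE_TIME=600
fi

HA_PORT="${HA_PORT:-8060}"
echo "AE_TIME=$AE_TIME HA_PORT=$HA_PORT"
```

The resulting values would then be passed to the container with `-e AE_TIME=$AE_TIME -e HA_PORT=$HA_PORT`.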

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


Issues

Clean up previously connected node

Hi,

When I restart the Docker container, I get the following error:

Registration of new node is forbidden when Remote Access is disabled on another node. To enable it, you can login the machine where the other node is installed and run 'dmgcmd.exe -EnableRemoteAccess "<port>" ["<thumbprint>"]'.

The other node it's referring to is my previous registration of the same node. After deleting the old node (which has the same name, etc.), registration succeeds again.

Is there a workaround for this? It seems that a cleanup before registering again should do the trick.

Context:
I'm running this Windows container in an AKS Edge Essentials environment. Although this is a new and experimental setup, the issue doesn't seem to be related to it.

Update:
I'm not able to run the -EnableRemoteAccess command because the old node is already gone.

New security control to allow or disallow local SHIR file system access through the File system connector

What is the best way to opt in to or out of this new SHIR feature for SHIR Windows containers?

https://go.microsoft.com/fwlink/?linkid=853077
CURRENT VERSION (5.23.8355.1)
  • What's new:
    • Azure Data Factory: introduces a new security control for the Self-hosted Integration Runtime (SHIR) admin that lets them allow or disallow local SHIR file system access through the File system connector. SHIR admins can use the local command line (dmgcmd.exe -DisableLocalFolderPathValidation / -EnableLocalFolderPathValidation) to allow or disallow it. Refer here to learn more.

Note: We have changed the default setting to disallow local SHIR file system access as of SHIR versions >= 5.22.8297.1. Using the command line above, you should explicitly opt out of the security control and allow local SHIR file system access if needed.

Multiple IRs on one VM: Invalid AUTH_KEY Value

Hello,

I built and ran the self-hosted IR in a Windows container on a Windows VM successfully.

However, when I try to connect the second self-hosted IR by running a second container, it does not work.

docker ps -a
poc-shir2 "powershell C:/SHIR/…" About a minute ago Exited (1) 43 seconds ago shir-container-2
poc-shir "powershell C:/SHIR/…" 34 minutes ago Up 34 minutes (healthy) compassionate_matsumoto

The Docker logs show the error message: Invalid AUTH_KEY Value

So why does the second container on the same VM, with the second auth key, not work?

Thanks

SHIR unable to reconnect to Synapse after forced restart

Problem: when the container terminates non-gracefully, it is unable to connect to Synapse on restart.

Background: I have set up the container as suggested, with the alterations below:

  • FROM: servercore:ltsc2022
  • Additionally installed
  1. Denodo ODBC
  2. Oracle Instant Client Basic Lite
  3. VS 2017 redistributable (required for Oracle Instant Client / ODBC)
  4. Oracle ODBC
  5. Oracle Instant Client sqlplus tool
  6. Generated a few DSNs from above
  7. added server certificate

It is running fine on an AKS cluster, and the deployment is controlled using ArgoCD.
When I (or the AKS backend) terminate/delete the pod, it is restarted by ArgoCD. This ends in the error "Registration of new node is forbidden when Remote Access is disabled on another node. To enable it, you can login the machine where the other node is installed and run 'dmgcmd.exe -EnableRemoteAccess "" [""]'."

As a workaround, I can delete the node in Synapse and then restart the pod; with these steps it is able to reconnect.
How can I get around this problem?

Passing in HA_PORT doesn't work, as $PORT isn't being passed in correctly in code

The current logic that implements passing in $HA_PORT from the docker run isn't properly making it to dmgcmd.exe due to this line of code:

$PORT = $HA_PORT -or "8060"
Start-Process $DmgcmdPath -Wait -ArgumentList "-EnableRemoteAccess"

I think -or is being used incorrectly here. With this logic, $PORT always evaluates to True rather than falling back to 8060 when no $HA_PORT is passed in; I believe the intention was a null coalesce.

So you get this error every time:

The value of port is invalid. Please set an integer bigger than or equal to 0 and less than or equal to 65535.

Since the expression evaluates to .\dmgcmd.exe -EnableRemoteAccess TRUE.

Possible solution: (screenshot omitted)
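The intended fallback ("use HA_PORT if set, otherwise 8060") is an ordinary default-value coalesce. The repo's script is PowerShell, where an explicit conditional such as `$PORT = if ($HA_PORT) { $HA_PORT } else { "8060" }` would express it; the same semantics in shell, as an illustration:

```shell
# "${VAR:-default}" expands to VAR when it is set and non-empty,
# and to the default otherwise -- the coalesce the setup script intends.
unset HA_PORT
PORT="${HA_PORT:-8060}"
echo "$PORT"        # prints 8060: no HA_PORT supplied, default applies

HA_PORT=9070
PORT="${HA_PORT:-8060}"
echo "$PORT"        # prints 9070: the supplied value wins
```

Either form yields a real port number instead of the boolean that `-or` produces.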

Performance seems inadequate

Hi
When this container is installed and used in a Synapse environment, pulling data from a DB2 database, transfer times are really slow: 700k records, which is around 3 GB of data, take 9.5 hours to transfer.
If we take Docker out of the equation and install the SHIR directly on the Windows server, the performance issues are gone.

I am told Docker is set up to use the default process isolation mode, which should allow Docker to consume as much of the host's resources as it needs?

I also note that the stats are blank/static in the Integration Runtime monitor when using Docker; when the SHIR runs directly on the server, they work. Memory is always 0 MB, CPU is stuck at 50%, and Network is always 0 (screenshot omitted).

As far as I am aware, there is no other way to install multiple SHIRs on a VM for Synapse, so I really need to get this working!

health-check.ps1 logic is faulty: when setup.ps1 is accessing status-check.txt, the container shuts down

Because health-check.ps1 and setup.ps1 both use status-check.txt to capture the output of the dmgcmd.exe -cgc command, I noticed that when their cycles (120 seconds vs. 60 seconds) clash, the container shuts down, because the health check tries to delete the same text file that setup.ps1 may still be accessing (screenshot omitted).

One option could be to have health-check.ps1 use another filename. I disabled the health check in the Dockerfile as a workaround.
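The unique-filename option can be sketched as follows: each script captures command output into its own temporary file rather than sharing status-check.txt, so neither can delete a file the other is still reading. Illustrated in shell; in the repo's PowerShell scripts, New-TemporaryFile would play the same role:

```shell
# Each caller gets its own uniquely named status file from mktemp, so a
# concurrent consumer can never delete a file still in use elsewhere.
STATUS_FILE="$(mktemp)"
echo "connected" > "$STATUS_FILE"   # stand-in for 'dmgcmd.exe -cgc' output
cat "$STATUS_FILE"                  # prints: connected
rm -f "$STATUS_FILE"                # each script cleans up only its own file
```

With per-script files, the 120-second and 60-second cycles can overlap freely without racing on a shared path.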

Adding second container doesn't work for HA due to order of operations and incorrect remote access command

When adding a second container with ENABLE_HA and HA_PORT specified, the container doesn't get registered correctly and also fails the health check (screenshot omitted).

I found the following fixes the issue:

  1. Use -EnableRemoteAccessInContainer instead of -EnableRemoteAccess
  2. Order of operations matters on the second node: -EnableRemoteAccessInContainer needs to run first, before -RegisterNewNode is executed

With this setup, the second node registers successfully and HA is enabled (screenshots omitted).

Upgrade SHIR to the current version

Hi,

What is the policy for upgrading the SHIR version in this container? I would like to have version 5.37.8767.4, but the setup script explicitly downloads 5.34.8675.1.

Could anyone bring some insight or workaround, or should I produce a pull request?

Creating a file system linked service does not work

Hello,

I have installed and created a self-hosted IR in a Windows container.
Then I tried to add a file system linked service to ADF, but it didn't work.

First, I got the error 'Access to folder is not allowed'.
Then I exec'd into the running container and ran the command '.\dmgcmd.exe -DisableLocalFolderPathValidation'.

Access is now allowed, but when I try to test the connection, this is the error message:
"Cannot connect to 'C:\data'. Detail Message: The operation completed successfully
The operation completed successfully"

So how should I set up the linked service, please?

Thanks

How are local files supposed to work?

I've got SHIR in a Windows container, and I mounted the files it needs as a local volume. How do I make Azure Data Factory see them?

It seems Azure Data Factory requires a username/password even for files local to the SHIR, so I modified the Dockerfile to create a user, but it can't read anything. Am I missing something?

Dockerfile changes:

FROM mcr.microsoft.com/windows/servercore:ltsc2022
ARG INSTALL_JDK=false

# Download the latest self-hosted integration runtime installer into the SHIR folder
COPY SHIR C:/SHIR/

RUN ["powershell", "C:/SHIR/build.ps1"]

# Allow local folder Navigation
RUN ["powershell", "'C:/Program Files/Microsoft Integration Runtime/6.0/Shared/dmgcmd.exe -DisableLocalFolderPathValidation'"]

RUN mkdir "C:/data"

RUN net user dataUser '(PASSWORD_HERE)' /ADD

RUN ["icacls", "C:/data", "/grant", "dataUser:(OI)(CI)F"]

ENTRYPOINT ["powershell", "C:/SHIR/setup.ps1"]

ENV SHIR_WINDOWS_CONTAINER_ENV True

HEALTHCHECK --start-period=120s CMD ["powershell", "C:/SHIR/health-check.ps1"]

I then run the container with the following docker compose:

services:
  shir:
    image: shir
    build: "B:/docker/MSSHIR/Azure-Data-Factory-Integration-Runtime-in-Windows-Container-main"
    volumes:
      - type: bind
        source: "B:/data/"
        target: "C:/data"
    environment:
      AUTH_KEY: "(Auth_Key)"
      NODE_NAME: "shir"
      ENABLE_AE: "true"
      ENABLE_HA: "true"

Is anyone else using local files this way? I'm pulling my hair out; this feels much easier with Linux containers.
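One likely culprit in the Dockerfile above: in the -DisableLocalFolderPathValidation step, the whole path-plus-flag is wrapped in single quotes, so PowerShell evaluates it as a string literal and prints it rather than executing dmgcmd.exe. A corrected RUN line (our sketch, reusing the 6.0 path from the Dockerfile above) would use the PowerShell call operator `&`:

```dockerfile
# Invoke dmgcmd.exe via the call operator instead of passing the whole
# command to PowerShell as one quoted string literal.
RUN ["powershell", "& 'C:/Program Files/Microsoft Integration Runtime/6.0/Shared/dmgcmd.exe' -DisableLocalFolderPathValidation"]
```

With the original quoting, local folder path validation is never actually disabled, which would explain the access failures regardless of the user and ACL setup.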

Multiple SHIR containers with cascading failures - 0x80010002 (RPC_E_CALL_CANCELED))

We are running several Windows SHIR containers on the same physical machines; all containers use the same network and the default NAT Docker switch. Once one container becomes unhealthy, the problem slowly cascades to the rest of the SHIR containers. We do not use a proxy, and network issues are not occurring between on-prem and the Azure ADF/Synapse instances.

Are there any issues with running multiple SHIR containers on the same host that all connect to different Azure ADF/Synapse instances? We need to scale this out to hundreds of SHIR containers.

Server 2019 Standard 1809 build 17763.3406

The Dockerfile is the latest, with this addition:

RUN MD C:\Download
ADD https://github.com/adoptium/temurin8-binaries/releases/download/jdk8u345-b01/OpenJDK8U-jdk_x64_windows_hotspot_8u345b01.zip C:/Download
RUN MD "C:\Program Files\Eclipse Adoptium\jdk8u345-b01"
RUN tar -xf C:/Download/OpenJDK8U-jdk_x64_windows_hotspot_8u345b01.zip -C "C:\Program Files\Eclipse Adoptium"
RUN SETX PATH "%PATH%;C:\Program Files\Eclipse Adoptium\jdk8u345-b01\bin;C:\Program Files\Eclipse Adoptium\jdk8u345-b01\jre\bin\server" /m
RUN SETX JAVA_HOME "C:\Program Files\Eclipse Adoptium\jdk8u345-b01\" /m


The only docker warning that is logged on the host server:
Health check for container 39fbbf4f690da051145d18f9d4df16b6666108c76dd39cf73d177179bf961f60 error: context deadline exceeded

This shows up on all the containers that are unhealthy:
[09/22/2022 12:23:08] Registering SHIR node with the node key: redacted@ServiceEndpoint=usgovva.frontend.datamovement.azure.us@Vredacted
[09/22/2022 12:23:09] Registering SHIR node with the node name: redacted
[09/22/2022 12:23:09] Registering SHIR node with the enable high availability flag: true
[09/22/2022 12:23:09] Registering SHIR node with the tcp port: 8060
[09/22/2022 12:25:54] Start registering a new SHIR node
[09/22/2022 12:25:54] Enable High Availability
[09/22/2022 12:25:54] Remote Access Port: 8060
[09/22/2022 12:31:59] Waiting 60 seconds for connecting

Get-WmiObject : Call was canceled by the message filter. (Exception from
HRESULT: 0x80010002 (RPC_E_CALL_CANCELED))
At C:\SHIR\setup.ps1:17 char:22
+ ... essResult = Get-WmiObject Win32_Process -Filter "name = 'diahost.exe' ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Get-WmiObject], COMException
    + FullyQualifiedErrorId : GetWMICOMException,Microsoft.PowerShell.Commands.GetWmiObjectCommand

[09/22/2022 12:34:02] diahost.exe is not running

(The same Get-WmiObject error and "diahost.exe is not running" message then repeat roughly every two minutes: 12:36:06, 12:38:09, 12:40:11, 12:42:12, 12:44:12.)

Guidelines for dynamic scaling?

I am running the container on an AKS cluster, on a node in a VM scale set with 1-4 pods, using a Horizontal Pod Autoscaler and an automated cluster autoscaler. The SHIR is running as a backend for Synapse.

With this setup I scale up a node and a pod if the CPU goes above 60%, and scale down if it remains below 30% for at least 10 minutes.
As far as I've been able to test, after a scale-up event it takes around 10 minutes before the new node is available to pick up load.
The scale-down event, by contrast, is much faster.

However, I've encountered multiple issues with this setup, so I'm looking for best-practice guidelines on how to do this. Issues I perceive:

  • At the scale-up event (the actual event, before the node is available), some (all?) running copy activities fail on a lost connection.
  • At scale-up, when the new node becomes available, there can be multiple activities in the queue, yet none is redirected to the new node. Instead, once an activity on a previous node completes, any new activity may be scheduled on the new node, while the queued ones are only scheduled on the previous node.
  • Sometimes the node appears healthy as a pod, i.e. the health-check script returns a good status, yet the Synapse monitor reports a connectivity issue; sometimes there is also an error in the pod log. Does the health check only verify that things are running inside the pod? Would it not make more sense to verify not only that it is running, but also that it has good contact with Synapse?
  • A scale-down event can potentially kill and fail any in-flight activity; it does not necessarily terminate the node gracefully. Is there any way to have the SHIR cluster shift the load on a terminate request?

ERROR [internal] load metadata for mcr.microsoft.com/windows/servercore:ltsc2019

Hello,

When I download the repo and run the docker command on my Windows PC, here is the error:
ERROR: failed to solve: mcr.microsoft.com/windows/servercore:ltsc2019: no match for platform in manifest sha256:035894dc32a667c5f6f40054a4e92e208a60691438958c68696084fcce709ed5: not found

My Docker version on Windows: Docker version 24.0.5, build ced0996

Do you have any idea?

Thanks
