
Comments (5)

A-Kibats commented on September 13, 2024

Additionally, during the 16384 runs I'm now getting soft lock-up warnings on the CPU when it reaches kernel execution:

Message from syslogd@nextgenio-amd01 at Dec  1 11:37:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [RunHardware.exe:85487]

Message from syslogd@nextgenio-amd01 at Dec  1 11:37:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [RunHardware.exe:85489]

Message from syslogd@nextgenio-amd01 at Dec  1 11:38:06 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [RunHardware.exe:85487]

Message from syslogd@nextgenio-amd01 at Dec  1 11:38:06 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [RunHardware.exe:85489]

Again, this does not occur with smaller matrix sizes.

from gemm_hls.

A-Kibats commented on September 13, 2024

Hi again,

We have made further progress on the issue. Having built SGEMM for a U250 card, we encounter the same XRT error when running 16k matrices.

Here is the system configuration as given by xbutil examine:

System Configuration
  OS Name              : Linux
  Release              : 3.10.0-1160.99.1.el7.x86_64
  Version              : #1 SMP Wed Sep 13 14:19:20 UTC 2023
  Machine              : x86_64
  CPU Cores            : 128
  Memory               : 257749 MB
  Distribution         : CentOS Linux 7 (Core)
  GLIBC                : 2.17
  Model                : ProLiant DL385 Gen10 Plus

XRT
  Version              : 2.11.634
  Branch               : 2021.1
  Hash                 : 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
  Hash Date            : 2021-06-09 05:08:58
  XOCL                 : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
  XCLMGMT              : 2.11.634, 5ad5998d67080f00bca5bf15b3838cf35e0a7b26

Devices present
  [0000:c3:00.1] : xilinx_u280_xdma_201920_3 

We noticed that both our U250 and U280 cards fail test 7 when using xbutil validate:

Test 7 [0000:c3:00.1]     : Bandwidth kernel 
    Error(s)              : 
                            terminate called after throwing an instance of
                            'std::runtime_error'
                              what():  Multiple instances of XRT core shim library
                            detected, only one
                            can be loaded at any given time.  Please check if
                            application is
                            explicity linked with XRT core library (xrt_core,
                            xrt_hwemu, or
                            xrt_swemu) and remove this linking. Use XCL_EMULATION_MODE
                            set to
                            either hw_emu or sw_emu if running in emulation mode.
    Test Status           : [FAILED]

Could this possibly be the source of the issue?
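The error text above asks whether the application is explicitly linked against an XRT core library. A minimal sketch of how one might check this, assuming the shim libraries show up in `ldd` output under names like `libxrt_core.so`, `libxrt_hwemu.so`, or `libxrt_swemu.so` (the file names are an assumption derived from the library names in the message, and the sample `ldd` text below is invented for illustration):

```python
import re

# Library names taken from the XRT error message; the "lib" prefix and
# ".so" suffix are assumptions about how they appear on disk.
XRT_SHIM_PATTERN = re.compile(r"libxrt_(core|hwemu|swemu)\.so")


def find_xrt_shims(ldd_output: str) -> list[str]:
    """Return the sorted, de-duplicated XRT shim libraries named in ldd output."""
    found = {match.group(0) for match in XRT_SHIM_PATTERN.finditer(ldd_output)}
    return sorted(found)


# Hypothetical ldd output, not taken from the system in this thread.
SAMPLE_LDD = """\
    libxrt_core.so.2 => /opt/xilinx/xrt/lib/libxrt_core.so.2 (0x00007f0000000000)
    libxrt_coreutil.so.2 => /opt/xilinx/xrt/lib/libxrt_coreutil.so.2 (0x00007f0000100000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f0000200000)
"""

if __name__ == "__main__":
    shims = find_xrt_shims(SAMPLE_LDD)
    if shims:
        print("Explicitly linked XRT shim libraries:", ", ".join(shims))
    else:
        print("No XRT shim library found in the link list.")
```

In practice the input would come from running `ldd` on the executable in question. Note that `libxrt_coreutil.so` is deliberately not matched, since the message only names `xrt_core`, `xrt_hwemu`, and `xrt_swemu`.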


definelicht commented on September 13, 2024

Hey! Since this only occurs with large matrix sizes and throws an I/O error, it could be related to the size of the memory transfer. If my math is right, transferring 3x 16384x16384 matrices amounts to 6.4 GB (at 8 bytes per element, as in DGEMM), which I suppose could be an issue for the virtual HBM channels on the U280 (I believe the individual virtual channels have smaller capacity than this), but should work fine in DDR 🤔
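The arithmetic behind that estimate can be sketched as follows. The element widths (4 bytes for SGEMM, 8 bytes for DGEMM) are standard, but the 256 MiB capacity per HBM pseudo-channel is an assumption about the U280 that is not stated in this thread and should be checked against the platform documentation:

```python
def gemm_footprint_bytes(n: int, bytes_per_element: int, num_matrices: int = 3) -> int:
    """Bytes needed to hold num_matrices dense n x n matrices (A, B, and C)."""
    return num_matrices * n * n * bytes_per_element


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed capacity of one HBM pseudo-channel on the U280 (not from the thread).
ASSUMED_HBM_CHANNEL_BYTES = 256 * MIB

if __name__ == "__main__":
    n = 16384
    for name, width in (("SGEMM", 4), ("DGEMM", 8)):
        total = gemm_footprint_bytes(n, width)
        per_matrix = gemm_footprint_bytes(n, width, num_matrices=1)
        fits = per_matrix <= ASSUMED_HBM_CHANNEL_BYTES
        print(
            f"{name}: {total / 1e9:.2f} GB total, "
            f"{per_matrix / GIB:.1f} GiB per matrix, "
            f"fits in one assumed HBM channel: {fits}"
        )
```

Under these assumptions, three double-precision matrices come to about 6.4 GB and three single-precision matrices to about 3.2 GB, and a single 16384x16384 matrix exceeds one pseudo-channel in either precision.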

Are you completely sure the issue you see is identical between the U280 and the U250, or is there any chance that they are separate issues?


A-Kibats commented on September 13, 2024

Hi, thanks for the reply.

AMD/Xilinx also suggested that this is a memory issue, and we're in the process of checking the usage.

The issue is not completely identical: SGEMM works on the U280 but fails on the U250, with the same error that DGEMM produces on the U280. I've checked the Config.h in the build directories, and SGEMM was built with the same parameters for both cards, so why the U250 gives the same XRT error is a mystery at the moment.


definelicht commented on September 13, 2024

Any news @A-Kibats?

