ebay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and can ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

Home Page: https://github.com/eBay/nvidiagpubeat

License: Apache License 2.0

Makefile 4.35% Go 78.34% Python 7.57% HTML 9.74%
nvidia nvidia-gpu nvidia-smi elasticsearch machine-learning ebay elasticbeats golang beats

nvidiagpubeat's People

Contributors

deepujain, minduni, ml-jzimmermann

nvidiagpubeat's Issues

Exit out if nvidia-smi is not present in PATH for production mode.

  1. nvidiagpubeat can be run in production mode.
  2. If nvidia-smi is not in PATH, it should log an appropriate error message and exit.

Possible solution

  1. gpu.go command() must return an err object for the above scenario.
  2. metrics.go must check the error and act accordingly, e.g. if err != nil { return err }

Example

2019-09-04T12:32:57.008-0700 INFO instance/beat.go:400 nvidiagpubeat start running.
2019-09-04T12:32:57.008-0700 INFO beater/nvidiagpubeat.go:57 nvidiagpubeat is running for ** production ** environment. ! Hit CTRL-C to stop it.
2019-09-04T12:32:58.038-0700 ERROR nvidia/gpu.go:52 E! nvidia-smi is not in PATH. Exit !!!
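
A minimal sketch of the check described above, assuming it runs before the first query. The helper name checkBinary is hypothetical and is not taken from gpu.go; exec.LookPath is the standard library call that searches PATH.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// checkBinary (hypothetical helper) returns the resolved path of the named
// executable, or an error when it cannot be found in PATH.
func checkBinary(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s is not in PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := checkBinary("nvidia-smi")
	if err != nil {
		// A beat would return this error from its startup path so that
		// the process stops instead of retrying every period.
		fmt.Fprintln(os.Stderr, "E!", err)
		return
	}
	fmt.Println("using", path)
}
```

Returning the error to the caller, rather than exiting inside the helper, keeps the decision to stop in one place (the beater's Run loop, under the assumptions above).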

Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed

On the GPU machine, I have installed the driver successfully.
OS: SLES 15 SP2
Kubernetes version: v1.18.6
Runtime: containerd

nvidia-smi
Thu Sep 23 14:19:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:D8:00.0 Off |                  N/A |
| 46%   32C    P8     5W / 105W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nerdctl run --rm  --gpus all docker.io/nvidia/cuda:11.0-base  nvidia-smi 
Thu Sep 23 21:19:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:D8:00.0 Off |                  N/A |
| 46%   32C    P8     5W / 105W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The nvidiagpubeat pod is running but is not collecting metrics for that GPU machine.

kubectl -n kube-system logs nvidiagpubeat-qmsp6

2021-09-23T21:12:55.185Z	INFO	instance/beat.go:592	Home path: [/usr/share/nvidiagpubeat] Config path: [/usr/share/nvidiagpubeat/keystore] Data path: [/usr/share/nvidiagpubeat/data] Logs path: [/usr/share/nvidiagpubeat/logs]
2021-09-23T21:12:55.185Z	INFO	instance/beat.go:599	Beat UUID: 246846e6-6a14-47ec-94fb-d74ce0825d73
2021-09-23T21:12:55.185Z	INFO	[beat]	instance/beat.go:825	Beat info	{"system_info": {"beat": {"path": {"config": "/usr/share/nvidiagpubeat/keystore", "data": "/usr/share/nvidiagpubeat/data", "home": "/usr/share/nvidiagpubeat", "logs": "/usr/share/nvidiagpubeat/logs"}, "type": "nvidiagpubeat", "uuid": "246846e6-6a14-47ec-94fb-d74ce0825d73"}}}
2021-09-23T21:12:55.185Z	INFO	[beat]	instance/beat.go:834	Build info	{"system_info": {"build": {"commit": "unknown", "libbeat": "6.5.5", "time": "1754-08-30T22:43:41.128Z", "version": "6.5.5"}}}
2021-09-23T21:12:55.185Z	INFO	[beat]	instance/beat.go:837	Go runtime info	{"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":48,"version":"go1.12.5"}}}
2021-09-23T21:12:55.188Z	INFO	[beat]	instance/beat.go:841	Host info	{"system_info": {"host": {"architecture":"x86_64","boot_time":"2021-09-23T19:10:10Z","containerized":true,"name":"MY_VM","ip":["127.0.0.1/8","16.0.9.206/23","10.4.0.1/24","10.192.0.128/32"],"kernel_version":"5.3.18-22-default","mac":["b8:83:03:6c:87:58","b8:83:03:6c:87:59","86:53:55:aa:03:41","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","06:1e:08:04:ee:33","ee:ee:ee:ee:ee:ee"],"os":{"family":"redhat","platform":"centos","name":"CentOS Linux","version":"7 (Core)","major":7,"minor":9,"patch":2009,"codename":"Core"},"timezone":"UTC","timezone_offset_sec":0,"id":"6e5dfda404614386a9ad9543eec57cf0"}}}
2021-09-23T21:12:55.188Z	INFO	[beat]	instance/beat.go:870	Process info	{"system_info": {"process": {"capabilities": {"inheritable":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"permitted":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"effective":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"bounding":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"ambient":null}, "cwd": "/usr/share/nvidiagpubeat", "exe": "/usr/share/nvidiagpubeat/nvidiagpubeat", "name": "nvidiagpubeat", "pid": 1, "ppid": 0, "seccomp": {"mode":"disabled","no_new_privs":false}, "start_time": "2021-09-23T21:12:54.360Z"}}}
2021-09-23T21:12:55.189Z	INFO	instance/beat.go:278	Setup Beat: nvidiagpubeat; Version: 6.5.5
2021-09-23T21:12:55.189Z	INFO	elasticsearch/client.go:163	Elasticsearch url: https://16.0.14.117:9210
2021-09-23T21:12:55.189Z	INFO	[publisher]	pipeline/module.go:110	Beat name: MY_VM
2021-09-23T21:12:55.189Z	INFO	instance/beat.go:400	nvidiagpubeat start running.
2021-09-23T21:12:55.189Z	INFO	[monitoring]	log/log.go:117	Starting metrics logging every 30s
2021-09-23T21:12:55.189Z	INFO	beater/nvidiagpubeat.go:57	nvidiagpubeat is running for ** production ** environment. ! Hit CTRL-C to stop it.
2021-09-23T21:13:25.191Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":0,"time":{"ms":8}},"total":{"ticks":30,"time":{"ms":46},"value":30},"user":{"ticks":30,"time":{"ms":38}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":30021}},"memstats":{"gc_next":4194304,"memory_alloc":2107976,"memory_total":3747368,"rss":23207936}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"cpu":{"cores":48},"load":{"1":0,"15":0.07,"5":0.07,"norm":{"1":0,"15":0.0015,"5":0.0015}}}}}}
2021-09-23T21:13:25.193Z	ERROR	beater/nvidiagpubeat.go:75	Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:13:55.191Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":0,"time":{"ms":1}},"total":{"ticks":40,"time":{"ms":4},"value":40},"user":{"ticks":40,"time":{"ms":3}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":60021}},"memstats":{"gc_next":4194304,"memory_alloc":2472080,"memory_total":4111472}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.07,"5":0.06,"norm":{"1":0,"15":0.0015,"5":0.0013}}}}}}
2021-09-23T21:13:55.193Z	ERROR	beater/nvidiagpubeat.go:75	Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:14:25.191Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":0},"total":{"ticks":40,"time":{"ms":3},"value":40},"user":{"ticks":40,"time":{"ms":3}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":90022}},"memstats":{"gc_next":4194304,"memory_alloc":2823136,"memory_total":4462528}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.07,"5":0.05,"norm":{"1":0,"15":0.0015,"5":0.001}}}}}}
2021-09-23T21:14:25.193Z	ERROR	beater/nvidiagpubeat.go:75	Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:14:55.191Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10,"time":{"ms":2}},"total":{"ticks":60,"time":{"ms":8},"value":60},"user":{"ticks":50,"time":{"ms":6}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":120022}},"memstats":{"gc_next":4194304,"memory_alloc":1767520,"memory_total":4950528,"rss":303104}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.06,"5":0.05,"norm":{"1":0,"15":0.0013,"5":0.001}}}}}}
2021-09-23T21:14:55.193Z	ERROR	beater/nvidiagpubeat.go:75	Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:15:25.191Z	INFO	[monitoring]	log/log.go:144	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10},"total":{"ticks":60,"time":{"ms":3},"value":60},"user":{"ticks":50,"time":{"ms":3}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":150021}},"memstats":{"gc_next":4194304,"memory_alloc":2119320,"memory_total":5302328}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.06,"5":0.04,"norm":{"1":0,"15":0.0013,"5":0.0008}}}}}}

Thanks in advance.
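
For context, Go's os/exec reports "file already closed" when a StdoutPipe is read after Wait has returned, because Wait closes the pipe. Whether that is what happens inside nvidiagpubeat here is an assumption; the sketch below only shows the read-then-Wait order, with a hypothetical helper name and echo standing in for nvidia-smi.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// readLines (hypothetical helper) starts the command, drains stdout line by
// line, and only then calls Wait, which closes the pipe.
func readLines(name string, args ...string) ([]string, error) {
	cmd := exec.Command(name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, fmt.Errorf("stdout pipe: %w", err)
	}
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start %s: %w", name, err)
	}
	var lines []string
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	scanErr := scanner.Err()
	// Wait comes after the last read; calling it earlier closes the pipe.
	if err := cmd.Wait(); err != nil {
		return lines, fmt.Errorf("wait %s: %w", name, err)
	}
	if scanErr != nil {
		return lines, fmt.Errorf("read %s output: %w", name, scanErr)
	}
	return lines, nil
}

func main() {
	// echo stands in for nvidia-smi so the sketch runs without a GPU.
	lines, err := readLines("echo", "Quadro P4000, 470.57.02")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(lines)
}
```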

Any plans to make this compatible with 7.x?

I am using nvidiagpubeat, installed as per the docs. The Elasticsearch instance is version 7.1. It's my understanding that 7.0 no longer allows doc types in mappings; it's one of the breaking changes from 6.x. I could be wrong. Here's the output I get when nvidiagpubeat tries to install the template on ES:

2019-06-04T13:11:35.674-0700 DEBUG [elasticsearch] elasticsearch/client.go:731 PUT https://es:9200/_template/nvidiagpubeat-6.5.5 map[settings:{"index":{"mapping":{"total_fields":{"limit":10000}},"number_of_routing_shards":30,"query":{"default_field":["fields.*"]},"refresh_interval":"5s"}} index_patterns:[nvidiagpubeat-6.5.5-*] mappings:{"doc":{"_meta":{"version":"6.5.5"},"date_detection":false,"dynamic_templates":[{"strings_as_keyword":{"mapping":{"ignore_above":1024,"type":"keyword"},"match_mapping_type":"string"}}],"properties":{}}} order:1] 2019-06-04T13:11:35.680-0700 ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(https://es:9200)): Connection marked as failed because the onConnect callback failed: Error loading Elasticsearch template: could not load template. Elasticsearch returned: couldn't load template: couldn't load json. Error: 400 Bad Request: {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters: [doc : {_meta={version=6.5.5}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword}, match_mapping_type=string}}], properties={}, date_detection=false}]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: Root mapping definition has unsupported parameters: [doc : {_meta={version=6.5.5}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword}, match_mapping_type=string}}], properties={}, date_detection=false}]","caused_by":{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters: [doc : {_meta={version=6.5.5}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword}, match_mapping_type=string}}], properties={}, date_detection=false}]"}},"status":400}. 
Response body: {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters: [doc : {_meta={version=6.5.5}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword}, match_mapping_type=string}}], properties={}, date_detection=false}]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: Root mapping definition has unsupported parameters: [doc : {_meta={version=6.5.5}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword}, match_mapping_type=string}}], properties={}, date_detection=false}]","caused_by":{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters: [doc : {_meta={version=6.5.5}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword}, match_mapping_type=string}}], properties={}, date_detection=false}]"}},"status":400}. Template is: map[mappings:{"doc":{"_meta":{"version":"6.5.5"},"date_detection":false,"dynamic_templates":[{"strings_as_keyword":{"mapping":{"ignore_above":1024,"type":"keyword"},"match_mapping_type":"string"}}],"properties":{}}} order:%!s(int=1) settings:{"index":{"mapping":{"total_fields":{"limit":10000}},"number_of_routing_shards":30,"query":{"default_field":["fields.*"]},"refresh_interval":"5s"}} index_patterns:[nvidiagpubeat-6.5.5-*]]
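
The rejected template wraps its mappings in a "doc" type, which the error above reports as an unsupported root parameter. As an illustration only (this is not the project's fix, and unwrapDocType is a hypothetical name), the wrapper can be removed so that the mapping body sits directly under "mappings":

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unwrapDocType (hypothetical helper) rewrites a template whose "mappings"
// object contains only a single type wrapper, e.g. {"doc": {...}}, into the
// typeless form {...}. Templates without that wrapper are returned unchanged.
func unwrapDocType(template []byte, typeName string) ([]byte, error) {
	var t map[string]interface{}
	if err := json.Unmarshal(template, &t); err != nil {
		return nil, fmt.Errorf("parse template: %w", err)
	}
	mappings, ok := t["mappings"].(map[string]interface{})
	if !ok || len(mappings) != 1 {
		return template, nil
	}
	inner, ok := mappings[typeName].(map[string]interface{})
	if !ok {
		return template, nil
	}
	t["mappings"] = inner
	return json.Marshal(t)
}

func main() {
	// Shortened version of the template from the log above.
	in := []byte(`{"index_patterns":["nvidiagpubeat-6.5.5-*"],"mappings":{"doc":{"date_detection":false,"properties":{}}},"order":1}`)
	out, err := unwrapDocType(in, "doc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```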

Testimonials - Please submit PR

I have included a testimonial section (close to the end of the file) in Readme.md. When you find time, please share a few words about your use case and submit a PR.
Regards,
Deepak

Wrong template exported by nvidiagpubeat

The exported template is wrong, and it seems to be the cause of the timestamp being mapped as a string (#15).
Whatever I set as arguments (e.g., ./nvidiagpubeat -c nvidiagpubeat.yml -e -d "*" -E seccomp.enabled=false -v export template), the template is this:

{
  "index_patterns": [
    "nvidiagpubeat-6.8.15-*"
  ],
  "mappings": {
    "doc": {
      "_meta": {
        "version": "6.8.15"
      },
      "date_detection": false,
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "mapping": {
              "ignore_above": 1024,
              "type": "keyword"
            },
            "match_mapping_type": "string"
          }
        }
      ],
      "properties": {}
    }
  },
  "order": 1,
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": 10000
        }
      },
      "number_of_routing_shards": 30,
      "query": {
        "default_field": [
          "fields.*"
        ]
      },
      "refresh_interval": "5s"
    }
  }
}

This is completely different from https://github.com/eBay/nvidiagpubeat/blob/master/nvidiagpubeat.template.json, which contains at least this:

[...]
      "properties": {
        "@timestamp": {
          "type": "date"
        },
[...]

Support test mode for Linux environment

The test environment (https://github.com/eBay/nvidiagpubeat#run-in-test-environment-macos) supports macOS, as it uses localnvidiasmi (an nvidia-smi simulator). This executable is built for macOS.

To support test mode in a Linux environment, the following actions are required:

  1. Move away from the localnvidiasmi executable.
  2. Use a Bash script (which can run on macOS or Linux) that will emit mock events. Let's call this new script nvidiasmi-simulator.
  3. nvidiagpubeat can use nvidiasmi-simulator for the test environment.
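
The steps above could be wired up roughly as follows. This is a sketch under assumptions: commandFor is a hypothetical helper, the script path ./nvidiasmi-simulator is only the name proposed in this issue, and the simulator is assumed to accept the same --query-gpu and --format arguments as nvidia-smi.

```go
package main

import "fmt"

// commandFor (hypothetical helper) picks the executable for the configured
// environment and builds the query arguments shared by both.
func commandFor(env, query string) (string, []string, error) {
	args := []string{"--query-gpu=" + query, "--format=csv"}
	switch env {
	case "production":
		return "nvidia-smi", args, nil
	case "test":
		// Assumed location of the proposed mock script.
		return "./nvidiasmi-simulator", args, nil
	default:
		return "", nil, fmt.Errorf("unknown environment %q", env)
	}
}

func main() {
	name, args, err := commandFor("test", "name,driver_version,utilization.gpu")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name, args)
}
```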

fork/exec error

After building on Ubuntu 18.04, with the environment set to production, I run:

./nvidiagpubeat -c nvidiagpubeat.yml -e -d "*"

and I get a lot of fork/exec errors:

2019-02-06T18:21:57.750Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
2019-02-06T18:21:58.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:21:59.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:22:00.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:22:01.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:22:02.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted

Unable to send data to Logstash

We're trying to streamline access to our Elasticsearch servers, so we go through Logstash for some Beats. We also tried sending nvidiagpubeat data to Logstash, but we get these warnings, which are effectively errors since the index stays empty:

[WARN ] 2018-12-09 10:52:59.113 [nioEventLoopGroup-2-1] jsonlines - Received an event that has a different character encoding than you configured. {:text=>"\\x85\\x87S\\xF6\\x92\\xA1peoiij\\xC2\\u0010\\xF2\\u007F\\xDF\\xF3\\u0015E\\xED\\x8F\\a\\u001D\\xC5:NB.\\xC0\\xA0\\xA9\\xB7\\x87\\xF5\\xB6Y\\xD7\\xDD}\\xBD7\\xFA\\x9F\\xD1z\\xD3j\\xFD\\x88\\xE5=\\u0000\\u0000\\xFF\\xFF|[uk", :expected_charset=>"UTF-8"}
[WARN ] 2018-12-09 10:52:59.116 [nioEventLoopGroup-2-1] jsonlines - JSON parse error, original data now in message field {:error=>#<LogStash::Json::ParserError: Unexpected character ('\' (code 92)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: (String)"\x85\x87S\xF6\x92\xA1peoiij\xC2\u0010\xF2\u007F\xDF\xF3\u0015E\xED\x8F\a\u001D\xC5:NB.\xC0\xA0\xA9\xB7\x87\xF5\xB6Y\xD7\xDD}\xBD7\xFA\x9F\xD1z\xD3j\xFD\x88\xE5=\u0000\u0000\xFF\xFF|[uk"; line: 1, column: 2]>, :data=>"\\x85\\x87S\\xF6\\x92\\xA1peoiij\\xC2\\u0010\\xF2\\u007F\\xDF\\xF3\\u0015E\\xED\\x8F\\a\\u001D\\xC5:NB.\\xC0\\xA0\\xA9\\xB7\\x87\\xF5\\xB6Y\\xD7\\xDD}\\xBD7\\xFA\\x9F\\xD1z\\xD3j\\xFD\\x88\\xE5=\\u0000\\u0000\\xFF\\xFF|[uk"}
[WARN ] 2018-12-09 10:52:29.318 [nioEventLoopGroup-2-1] jsonlines - JSON parse error, original data now in message field {:error=>#<LogStash::Json::ParserError: Unexpected character ('\' (code 92)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: (String)"\xBBM\xB3\xD1X\u0014B\u0012\u0012\x869|Sg\xE4\xC4=L\xD3*\xC8$4´\xFBV+\xBCD\xE62o\xBBEA\xD8\u0005\x8E$9\xF2\xD7iÍ®,hH0O\x88\xE9\u0002\x85!s*\x86\xA3\xF5\xF9"; line: 1, column: 2]>, :data=>"\\xBBM\\xB3\\xD1X\\u0014B\\u0012\\u0012\\x869|Sg\\xE4\\xC4=L\\xD3*\\xC8$4´\\xFBV+\\xBCD\\xE62o\\xBBEA\\xD8\\u0005\\x8E$9\\xF2\\xD7iͮ,hH0O\\x88\\xE9\\u0002\\x85!s*\\x86\\xA3\\xF5\\xF9"}[WARN ] 2018-12-09 10:54:29.195 [nioEventLoopGroup-2-2] jsonlines - Received an event that has a different character encoding than you configured. {:text=>"2W\\u0000\\u0000\\u0000\\u00012C\\u0000\\u0000\\u0000\\xFCx^t\\x90OK\\xC3@\\u0010\\xC5ӯ\\xF2\\xCEӒ&)m\\xF6ԫ\\x9E\\xF5\\xA2x\\x98\\x921.d7\\xCB\\xEEl\\xFF\\u0018\\xF2\\xDD%\\x82\\u0005\\u0011o\\xC30\\xEF\\xC7\\xFC^\\xF5X\\u0014Ū(V~\\xC2щr\\xC7\\xCA0\\u0013\\xF4\\u0016\\u0004\\u0006\\xFEl;\\xCB}\\xC8'a\\u0005\\xE1$\\xAC\\u007F\\xD63!$e\\u0015\\x98\\u0003A\\xC5\\u0005\\x89\\xAC9\\xCA\\u0002\\xEAC\\x86\\xA9v3\\xA1\\u000F\\xF9\\xC1wr\\x85)\\xE9?~V;\\xD8OV;\\xFA{\\xB8$8qc\\xBC\\xC1\\x943\\xE1\\xA8\\xD6IRv\\u0001\\u0006U\\xB9=\\xAC\\xB7պl\\x9Fʽ\\xA9[Sכ\\xA6\\u07BD\\x80\\xEE\\x99\\t:*\\u000F0;\\xA9\\t\\xEFQd\\x99\\xB7-!'\\xE9`\\xAAf\\xFE\\u0011\\x9B\\xF01&\\xF5\\xEC\\u0016\\xF7\\x98.\\xA9mA8KL\\xDF\\u000Fa\\xB7\\xA965\\b\\xBF/f\\x82r\\x9F`^\\u0011\\xD3\\u0005\\x84>KZ\\xDA\\u001A\\xAC\\xCFW\\u0010\\x9EO\\xD9k\\u0006\\xE1*\\xDE\\xF2\\x80\\xB7\\xF9+\\u0000\\u0000\\xFF\\xFF\\v&up", :expected_charset=>"UTF-8"}

Any idea what's happening? Has this ever been tested or are we forced to send data directly to ES?

Ciao,
Antonio

Moving existing issues from https://github.com/deepujain/nvidiagpubeat to here
Created on behalf of @antoniokaust

Can nvidiagpubeat be made to also export the process running on each card?

Since nvidiagpubeat is based on nvidia-smi, and nvidia-smi is able to list the processes that are currently using the GPU cards, in theory nvidiagpubeat should be able to export the process info as metrics. Please correct me if I am wrong.

I am interested to know whether there is any plan to do this. It would be very helpful in identifying the GPU resource usage of processes and the code efficiency.

All the best.
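
To make the request concrete: nvidia-smi can print per-process rows with --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits. The sketch below parses output of that shape; the type, the function name and the sample rows are invented for illustration and are not part of nvidiagpubeat.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// gpuProcess is a hypothetical event shape for one process using a GPU.
type gpuProcess struct {
	PID        int
	Name       string
	UsedMemMiB int
}

// parseComputeApps parses CSV rows of the form "pid, process_name, used_memory".
func parseComputeApps(out string) ([]gpuProcess, error) {
	r := csv.NewReader(strings.NewReader(out))
	r.TrimLeadingSpace = true
	r.FieldsPerRecord = 3 // reject rows with a different number of columns
	records, err := r.ReadAll()
	if err != nil {
		return nil, fmt.Errorf("parse csv: %w", err)
	}
	procs := make([]gpuProcess, 0, len(records))
	for _, rec := range records {
		pid, err := strconv.Atoi(strings.TrimSpace(rec[0]))
		if err != nil {
			return nil, fmt.Errorf("pid %q: %w", rec[0], err)
		}
		mem, err := strconv.Atoi(strings.TrimSpace(rec[2]))
		if err != nil {
			return nil, fmt.Errorf("used_memory %q: %w", rec[2], err)
		}
		procs = append(procs, gpuProcess{PID: pid, Name: strings.TrimSpace(rec[1]), UsedMemMiB: mem})
	}
	return procs, nil
}

func main() {
	// Made-up sample rows, not captured from a real machine.
	sample := "1234, python, 5942\n5678, /usr/bin/trainer, 120\n"
	procs, err := parseComputeApps(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range procs {
		fmt.Printf("%+v\n", p)
	}
}
```

Values that nvidia-smi reports as unavailable would fail the integer conversion here; a real implementation would have to decide how to represent them.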

FATAL [nvidiagpubeat] instance/beat.go:154 Failed due to panic. {"panic": "runtime error: index out of range", "stack":

Can someone help solve this issue, @deepujain?
[nvidia-smi is installed on my machine]

2021-02-03T17:33:10.784Z INFO instance/beat.go:592 Home path: [/usr/share/nvidiagpubeat] Config path: [/usr/share/nvidiagpubeat] Data path: [/usr/share/nvidiagpubeat/data] Logs path: [/usr/share/nvidiagpubeat/logs]
2021-02-03T17:33:10.784Z INFO instance/beat.go:599 Beat UUID: 0250fd46-f397-4def-9098-ad27429e08c2
2021-02-03T17:33:10.784Z INFO [beat] instance/beat.go:825 Beat info {"system_info": {"beat": {"path": {"config": "/usr/share/nvidiagpubeat", "data": "/usr/share/nvidiagpubeat/data", "home": "/usr/share/nvidiagpubeat", "logs": "/usr/share/nvidiagpubeat/logs"}, "type": "nvidiagpubeat", "uuid": "0250fd46-f397-4def-9098-ad27429e08c2"}}}
2021-02-03T17:33:10.784Z INFO [beat] instance/beat.go:834 Build info {"system_info": {"build": {"commit": "unknown", "libbeat": "6.5.5", "time": "1754-08-30T22:43:41.128Z", "version": "6.5.5"}}}
2021-02-03T17:33:10.786Z INFO [beat] instance/beat.go:837 Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":48,"version":"go1.12.5"}}}
2021-02-03T17:33:10.788Z INFO [beat] instance/beat.go:841 Host info {"system_info": {"host": {"architecture":"x86_64","boot_time":"2021-01-22T06:04:54Z","containerized":true,"name":"8bb58f1899fe","ip":["127.0.0.1/8","172.17.0.2/16"],"kernel_version":"3.10.0-1160.11.1.el7.x86_64","mac":["02:42:ac:11:00:02"],"os":{"family":"redhat","platform":"centos","name":"CentOS Linux","version":"7 (Core)","major":7,"minor":9,"patch":2009,"codename":"Core"},"timezone":"UTC","timezone_offset_sec":0,"id":"d097cfbbf25e4ea2b5a1f0b530456ab7"}}}
2021-02-03T17:33:10.788Z INFO [beat] instance/beat.go:870 Process info {"system_info": {"process": {"capabilities": {"inheritable":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"permitted":null,"effective":null,"bounding":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"ambient":null}, "cwd": "/usr/share/nvidiagpubeat", "exe": "/usr/share/nvidiagpubeat/nvidiagpubeat", "name": "nvidiagpubeat", "pid": 1, "ppid": 0, "seccomp": {"mode":"filter","no_new_privs":false}, "start_time": "2021-02-03T17:33:09.899Z"}}}
2021-02-03T17:33:10.788Z INFO instance/beat.go:278 Setup Beat: nvidiagpubeat; Version: 6.5.5
2021-02-03T17:33:10.789Z INFO elasticsearch/client.go:163 Elasticsearch url: http://xx.xx.xx.xx:9210
2021-02-03T17:33:10.789Z INFO [publisher] pipeline/module.go:110 Beat name: 8bb58f1899fe
2021-02-03T17:33:10.789Z INFO instance/beat.go:400 nvidiagpubeat start running.
2021-02-03T17:33:10.789Z INFO beater/nvidiagpubeat.go:57 nvidiagpubeat is running for ** production ** environment. ! Hit CTRL-C to stop it.
2021-02-03T17:33:10.789Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
2021-02-03T17:33:40.792Z INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":20,"time":{"ms":20}},"total":{"ticks":50,"time":{"ms":58},"value":50},"user":{"ticks":30,"time":{"ms":38}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"391584c9-bc61-4868-a440-9886fba4a756","uptime":{"ms":30011}},"memstats":{"gc_next":4194304,"memory_alloc":2017704,"memory_total":3635056,"rss":14045184}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"cpu":{"cores":48},"load":{"1":0.21,"15":0.25,"5":0.24,"norm":{"1":0.0044,"15":0.0052,"5":0.005}}}}}}
2021-02-03T17:33:40.798Z INFO [monitoring] log/log.go:152 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":20,"time":{"ms":22}},"total":{"ticks":50,"time":{"ms":61},"value":50},"user":{"ticks":30,"time":{"ms":39}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":6},"info":{"ephemeral_id":"391584c9-bc61-4868-a440-9886fba4a756","uptime":{"ms":30019}},"memstats":{"gc_next":4194304,"memory_alloc":2521296,"memory_total":4138648,"rss":14045184}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"cpu":{"cores":48},"load":{"1":0.21,"15":0.25,"5":0.24,"norm":{"1":0.0044,"15":0.0052,"5":0.005}}}}}}
2021-02-03T17:33:40.798Z INFO [monitoring] log/log.go:153 Uptime: 30.020403448s
2021-02-03T17:33:40.798Z INFO [monitoring] log/log.go:130 Stopping metrics logging.
2021-02-03T17:33:40.798Z INFO runtime/panic.go:522 nvidiagpubeat stopped.
2021-02-03T17:33:40.798Z FATAL [nvidiagpubeat] instance/beat.go:154 Failed due to panic. {"panic": "runtime error: index out of range", "stack": "github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run.func1.1\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:155\nruntime.gopanic\n\t/opt/go/src/runtime/panic.go:522\nruntime.panicindex\n\t/opt/go/src/runtime/panic.go:44\ngithub.com/ebay/nvidiagpubeat/nvidia.Utilization.run\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/gpu.go:123\ngithub.com/ebay/nvidiagpubeat/nvidia.Metrics.Get\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/metrics.go:50\ngithub.com/ebay/nvidiagpubeat/beater.(*Nvidiagpubeat).Run\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/beater/nvidiagpubeat.go:73\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.(*Beat).launch\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:410\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run.func1\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:181\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:182\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd.genRunCmd.func1\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/run.go:37\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).execute\n\t/usr/build/nvidiagpubuild/beats_dev/s
rc/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:704\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:785\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/vendor/github.com/spf13/cobra/command.go:738\nmain.main\n\t/usr/build/nvidiagpubuild/beats_dev/src/github.com/ebay/nvidiagpubeat/main.go:34\nruntime.main\n\t/opt/go/src/runtime/proc.go:200"}
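
The trace shows the panic in nvidia.Utilization.run (gpu.go:123) with "index out of range". The code at that line is not shown here, so the following is only a sketch of one defensive pattern, assuming the panic comes from indexing a row that has fewer fields than the query asked for (for example, empty output). The helper name toEvent is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// toEvent (hypothetical helper) pairs each queried field name with the value
// at the same position in one CSV row. A row with the wrong number of fields
// produces an error instead of a panic.
func toEvent(headers []string, row string) (map[string]string, error) {
	values := strings.Split(row, ",")
	if len(values) != len(headers) {
		return nil, fmt.Errorf("expected %d fields, got %d in row %q", len(headers), len(values), row)
	}
	event := make(map[string]string, len(headers))
	for i, h := range headers {
		event[h] = strings.TrimSpace(values[i])
	}
	return event, nil
}

func main() {
	headers := strings.Split("name,driver_version", ",")
	for _, row := range []string{"Quadro P4000, 470.57.02", ""} {
		event, err := toEvent(headers, row)
		if err != nil {
			fmt.Println("skipping row:", err)
			continue
		}
		fmt.Println(event)
	}
}
```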

Incompatible with driver version 460.32.03 (because of the two dots)

I believe the current version is not compatible with NVIDIA driver version 460.32.03, because the version string contains two dots.

Please see the end of the line below:

2021-01-13T20:22:02.494Z	WARN	elasticsearch/client.go:535	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbff7f37a9d203e60, ext:32072702120, loc:(*time.Location)(0x2090e20)}, Meta:common.MapStr(nil), Fields:common.MapStr{"agent":common.MapStr{"ephemeral_id":"fd12b93b-9db1-4e24-9e0d-747229464c00", "hostname":"dca939cfb9c2", "id":"33c198bd-3989-4a66-9683-fe258efbe53b", "type":"nvidiagpubeat", "version":"7.3.3"}, "clocks":common.MapStr{"gr":0, "mem":405, "sm":0}, "count":2, "driver_version":"460.32.03", "ecs":common.MapStr{"version":"1.0.1"}, "fan":common.MapStr{"speed":30}, "gpuIndex":1, "host":common.MapStr{"name":"dca939cfb9c2"}, "index":1, "memory":common.MapStr{"total":24268, "used":5942}, "name":"GeForceRTX3090", "power":common.MapStr{"draw":7.14, "limit":350}, "pstate":8, "temperature":common.MapStr{"gpu":28}, "type":"nvidiagpubeat", "utilization":common.MapStr{"gpu":0, "memory":0}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x0} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [driver_version] of type [float] in document with id 'tvFp_XYBFpUHIuKQj2b6'. Preview of field's value: '460.32.03'","caused_by":{"type":"number_format_exception","reason":"multiple points"}}
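
The error above shows that the index maps driver_version as a float, which cannot hold "460.32.03". One way to avoid this on the beat side is to keep text-like fields as strings regardless of what they look like. This is a sketch, not the project's code: convertField and the stringFields list are assumptions, and the Elasticsearch mapping for the field would also need to be a string type.

```go
package main

import (
	"fmt"
	"strconv"
)

// stringFields (assumed list) names query fields that are never converted
// to numbers, even when a value such as "460.32" would parse as one.
var stringFields = map[string]bool{
	"name":           true,
	"driver_version": true,
}

// convertField (hypothetical helper) returns the raw text for string fields
// and a float64 for everything else that parses as a number.
func convertField(field, raw string) interface{} {
	if stringFields[field] {
		return raw
	}
	if n, err := strconv.ParseFloat(raw, 64); err == nil {
		return n
	}
	return raw
}

func main() {
	fmt.Printf("%T %v\n", convertField("driver_version", "460.32.03"), convertField("driver_version", "460.32.03"))
	fmt.Printf("%T %v\n", convertField("power.draw", "7.14"), convertField("power.draw", "7.14"))
}
```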

Field @timestamp is mapped as type:string

The @timestamp field is mapped as a string. Please update the mapping before index creation to fix this issue.

This issue was created as a result of issue #11.

Thanks,
Julius

Duplicate field @timestamp

Hi, thanks for the great work so far.

The beater creates the field @timestamp twice, which raises a mapper_parsing_exception with reason "failed to parse", caused by "Duplicate field @timestamp".

Can you change the output fields or show a way to handle this in Elasticsearch / Kibana?

Regards,
Julius

unable to build on CentOS Linux release 7.5.1804 (Core)

I am unable to complete the build on CentOS Linux release 7.5.1804 (Core).

I tried following the instructions but am seeing the error below:

make
go build -i -ldflags "-X github.com/elastic/beats/libbeat/version.buildTime=2020-05-10T21:00:06Z -X github.com/elastic/beats/libbeat/version.commit=8cdb6f8f6994100b312a3095cf0fd4493fb9f62e"
# github.com/ebay/nvidiagpubeat
./main.go:31:15: undefined: cmd.GenRootCmd
make: *** [libbeat] Error 2

The go binary resolves on the PATH:

which go
/usr/local/go/bin/go

I couldn't get the GPU name

Excuse me.

Here is my config:

nvidiagpubeat:
  period: 30s
  query: "name,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  env: "production"

My Result

{
  "index": 3,
  "temperature": {
    "gpu": 33
  },
  "memory": {
    "total": 11172,
    "used": 10
  },
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "clocks": {
    "current": {
      "graphics": 0,
      "sm": 0,
      "memory": 0
    }
  },
  "ecs": {
    "version": "1.5.0"
  },
  "power": {
    "limit": 0,
    "draw": 0
  },
  "gpuIndex": 3,
  "gpu_name": 0,
  "count": 4,
  "fan": {
    "speed": 23
  },
  "driver_version": 0,
  "pstate": 8,
  "type": "nvidiagpubeat"
}

gpu_name is 0.
power.limit is 0.
Other fields also have an incorrect value of 0.

value, _ := strconv.Atoi(record[i])
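
Because the error from strconv.Atoi is discarded, any non-integer value (a GPU name, a dotted driver version, a decimal power reading) silently becomes 0. A minimal sketch of one possible fallback, assuming each field arrives as a raw string: try an integer, then a float, and otherwise keep the string. The helper name `parseField` is hypothetical and not the project's actual code; the sample values are taken from the event logged in the driver version issue above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseField converts one raw nvidia-smi CSV value into an int, a float64,
// or (when it is not numeric) the trimmed string itself.
// Hypothetical helper for illustration; not the project's actual code.
func parseField(raw string) interface{} {
	s := strings.TrimSpace(raw)
	if n, err := strconv.Atoi(s); err == nil {
		return n
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	// Not a number: keep the text instead of collapsing it to 0.
	return s
}

func main() {
	record := []string{"GeForceRTX3090", " 460.32.03", " 28", " 7.14"}
	for _, raw := range record {
		v := parseField(raw)
		fmt.Printf("%T %v\n", v, v)
	}
}
```

With this approach a given field should still keep one type across events, otherwise the Elasticsearch mapping for that field can conflict.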

Please fix it. My boss is going to be happy.

Best regards.

Readme.md - Init Project uses ebay and eBay

The commands in the Init Project section of Readme.md use a mix of ebay and eBay. I think these lines:

mkdir $WORKSPACE/src/github.com/eBay
cd $WORKSPACE/src/github.com/eBay

need to be lower case ('ebay') for the go import statements to work? Or since the GitHub repository is 'eBay' you might want to standardise on using 'eBay' everywhere instead?
