
out-of-memory's Introduction

Out-Of-Memory Investigation .py

Python 2.5.x - 3.6.x compatibility

The following Python script can be used to calculate the estimated RSS (RAM) value of each service at the time the kernel invoked the OOM killer.

At the time of an OOM incident, the system records the estimated RSS value of each service in its system log. From this information the script calculates how much RAM each service was "theoretically" trying to use, the total RAM value of all services, and how much RAM your system actually has to offer these services. This allows further investigation into the memory usage of the top "offending" service(s).
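The per-service calculation described above can be sketched roughly as follows. This is an illustration, not the script's actual code: the row layout (`[ pid ] uid tgid total_vm rss ... name`), the 4 KiB page size, the `rss_by_service` helper and the sample log lines are all assumptions, and the kernel's process table differs between versions.

```python
import re

PAGE_SIZE_KB = 4  # assumption: 4 KiB pages, the usual default on x86

# Assumed layout of one row of the kernel's OOM process table:
# [ pid ]   uid  tgid total_vm      rss ... name
ROW = re.compile(r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+.*\s(\S+)\s*$")


def rss_by_service(lines):
    """Sum the rss column (in pages) per process name and convert to MB."""
    pages_by_name = {}
    for line in lines:
        match = ROW.search(line)
        if not match:
            continue  # not a process-table row
        rss_pages = int(match.group(5))
        name = match.group(6)
        pages_by_name[name] = pages_by_name.get(name, 0) + rss_pages
    return dict(
        (name, pages * PAGE_SIZE_KB / 1024.0)
        for name, pages in pages_by_name.items()
    )


# Made-up sample rows, for illustration only.
sample = [
    "Aug 14 03:12:01 host kernel: [ 1001]     0  1001    50000    25600       60        0             0 httpd",
    "Aug 14 03:12:01 host kernel: [ 1002]    48  1002    50000    25600       60        0             0 httpd",
    "Aug 14 03:12:01 host kernel: [ 2001]    27  2001   200000   128000      300        0             0 mysqld",
]
usage = rss_by_service(sample)
total_mb = sum(usage.values())
```

Comparing `total_mb` with the system's physical RAM shows how far the services were over-committed when the OOM killer ran.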

The script looks in /var/log/messages or /var/log/syslog and takes the values recorded by the system just before the incident occurs.
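As a minimal sketch of that lookup, the function below returns the first candidate log that exists. The name `find_system_log` and the injectable `exists` parameter are hypothetical, chosen so the logic can be tried without touching the real filesystem. The script itself may pick the file differently.

```python
import os

# The two locations named above; the first one found wins.
CANDIDATE_LOGS = ("/var/log/messages", "/var/log/syslog")


def find_system_log(candidates=CANDIDATE_LOGS, exists=os.path.isfile):
    """Return the first existing log path, or None if neither is present."""
    for path in candidates:
        if exists(path):
            return path
    return None
```

A caller should handle the `None` case explicitly, for example on a system that only logs to the journal.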


Running

Usage:

OOMUsage.png output

There are currently 3 methods for running.

If no argument is passed, it will default to using the current ACTIVE system log:

python oom-investigate.py

You can also specify an old/rotated/compressed log:

python oom-investigate.py -f <old_rotated_file>
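Reading a rotated log that may or may not be gzip-compressed could look something like the sketch below. It assumes Python 3 (for `gzip.open` in text mode) and that compressed logs end in `.gz`. The `open_log` helper is hypothetical, not taken from the script.

```python
import gzip


def open_log(path):
    """Open a plain or gzip-compressed log file for reading as text."""
    if path.endswith(".gz"):
        # "rt" makes gzip decode bytes to str, like a normal text file
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")
```

Both branches return a file object that can be iterated line by line, so the rest of the parsing does not need to know whether the file was compressed.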

Or you can summarise the log files quickly with:

python oom-investigate.py -q


Method(s)


wget https://raw.githubusercontent.com/LukeShirnia/out-of-memory/master/oom-investigate.py

or

git clone https://github.com/LukeShirnia/out-of-memory.git

or

curl -s https://raw.githubusercontent.com/LukeShirnia/out-of-memory/master/oom-investigate.py | python

NOTE:

The script currently works on the following OS:

  • RHEL/CentOS 6,7

  • Ubuntu 14.04LTS/16.04LTS

  • Redhat/CentOS 5 - Only works on some devices, AND you may need to specify python2.6 or 2.6



Script Breakdown:

The output from this script can be broken down into 4 main sections:


Section 1 - Log File Information

This section is a quick overview of the log file used for reference.
Example:

LogInformation.png output


Section 2 - Total Services Killed

During out-of-memory investigations it's not always obvious which service(s) have been killed, especially when most of the entries in the system log show httpd/apache. This output allows you to quickly discover whether a backup agent or mysql was killed at some point between the start and end dates of the log file.
Example:

KilledServices.png output


Section 3 - Date of OOM Issues

This helps narrow down problematic periods, such as peak traffic times or backup windows.
Example:

OOMOccurrences.png output


Section 4 - Top 5 OOM Consumers

This section allows you to narrow down the heaviest memory consumers. This gives you a good starting point for preventing the issue from occurring again.
Example:

Top5Consumer.png output




Full Example Output

The following example shows the output of the script when run against a compressed log file in a "non standard" directory:

FullExample.png output



Example - No OOM in log file

This example shows the output when NO OOM has occurred in the log file. NO options were passed when running the script (Method 1 was used).

The script will now prompt you to enter an option if the main file doesn't contain any oom incidents but another file does:

NoOOM.png output


out-of-memory's Issues

Add journalctl compatibility

Fedora 28 (and soon RHEL 8) will require journalctl compatibility, as they no longer log to /var/log/messages.
Add this functionality.
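One way this could be approached is to read kernel messages through `journalctl -k`, which produces lines similar to those in /var/log/messages. The sketch below is only a possible starting point: the function names and the `boot` parameter are hypothetical, and Python 3 is assumed.

```python
import subprocess


def journal_kernel_command(boot=None):
    """Build a journalctl command that prints kernel messages without a pager."""
    cmd = ["journalctl", "-k", "--no-pager"]
    if boot is not None:
        # e.g. boot=-1 selects the previous boot
        cmd.append("--boot=%s" % boot)
    return cmd


def read_kernel_journal(boot=None):
    """Return the kernel journal as a list of text lines."""
    output = subprocess.check_output(journal_kernel_command(boot))
    return output.decode("utf-8", "replace").splitlines()
```

Note that `-k` on its own limits output to the current boot, so older incidents would need an explicit `boot` value.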

Different Return Codes

Investigate implementing different return codes depending on what is returned by the script

Example:

Run script with the following output:

  1. NO files have OOM issues - return code 1
  2. OOM Found in 1 file - return code 2
  3. Loads of OOM found (> 5) - return code 3

This allows automation tools to implement different "time saved" values depending on the script's return code.
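The mapping proposed in this issue might be sketched as below. The issue leaves some cases open, so this sketch makes two assumptions: any OOM count of five or fewer returns 2 regardless of how many files it spans, and "> 5" refers to the total across all files. The `exit_code` function and its input format (a dict of file path to occurrence count) are hypothetical.

```python
def exit_code(occurrences_per_file):
    """Map OOM occurrence counts per log file to the proposed return codes."""
    total = sum(count for count in occurrences_per_file.values() if count > 0)
    if total == 0:
        return 1  # no files have OOM issues
    if total > 5:
        return 3  # loads of OOM found
    return 2      # OOM found, but only a few occurrences
```

The script would then finish with `sys.exit(exit_code(counts))` so that automation tools can branch on the result.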

Ctrl+c, ctrl+d error (--quick)

[root@lga-db ~]# monkey.py -o -- -q
Downloading oom tool ...
----------------------------------------
      _____ _____ _____ 
     |     |     |     |
     |  |  |  |  | | | |
     |_____|_____|_|_|_|
     Out Of Memory Analyser

Disclaimer:
If system OOMs too viciously, there may be nothing logged!
Do NOT take this script as FACT, investigate further
----------------------------------------
Checking other logs, select an option:
Option: 1  /var/log/messages          - Occurrences: 1
           /var/log/messages-20170806.gz - Occurrences: 0
Option: 2  /var/log/messages-20170814.gz - Occurrences: 1
Option: 3  /var/log/messages-20170820.gz - Occurrences: 1
           /var/log/messages-20170827.gz - Occurrences: 0

Which file should we check next?
Select an option number between 1 and 3: ^C
Traceback (most recent call last):
  File "/home/rack/monkeys/monkey-2517a72ed6.py", line 1479, in <module>
    main()
  File "/home/rack/monkeys/monkey-2517a72ed6.py", line 1437, in main
    external_script_action(opt, args)
  File "/home/rack/monkeys/monkey-2517a72ed6.py", line 1402, in external_script_action
    p.communicate(script)
  File "/usr/lib64/python2.7/subprocess.py", line 797, in communicate
    self.wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1376, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
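The traceback above ends inside the wrapper's subprocess call, so the wrapper would need its own handling. On the prompt side, a sketch of exiting quietly on Ctrl+C or Ctrl+D might look like this. The `ask_option` helper and its `read` parameter are hypothetical, and Python 3's `input` is assumed (Python 2 would use `raw_input`).

```python
def ask_option(max_option, read=input):
    """Prompt for an option number; return None on Ctrl+C, Ctrl+D or bad input."""
    prompt = "Select an option number between 1 and %d: " % max_option
    try:
        answer = read(prompt)
    except (KeyboardInterrupt, EOFError):
        # Ctrl+C raises KeyboardInterrupt, Ctrl+D raises EOFError
        return None
    try:
        choice = int(answer)
    except ValueError:
        return None
    if 1 <= choice <= max_option:
        return choice
    return None
```

The caller can treat `None` as "nothing selected" and exit cleanly instead of printing a traceback.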

'NoneType' object has no attribute 'endswith'

Here's the output I'm getting on Fedora 25:

Python 2.7.13

~/out-of-memory $ python oom-investigate.py 
----------------------------------------
      _____ _____ _____ 
     |     |     |     |
     |  |  |  |  | | | |
     |_____|_____|_|_|_|
     Out Of Memory Analyser

Disclaimer:
If system OOMs too viciously, there may be nothing logged!
Do NOT take this script as FACT, investigate further
----------------------------------------
Unsupported OS

Error:
'NoneType' object has no attribute 'endswith'

----------------------------------------
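The report does not show where the error is raised. One plausible reading, stated here as an assumption, is that no log file was found for the unsupported OS and a string method was then called on `None`. A guard along these lines would avoid the second error (the `is_compressed` helper is hypothetical):

```python
def is_compressed(path):
    """Return True for a .gz log path; tolerate a missing (None) path."""
    if path is None:
        # No log was located, e.g. on an unsupported OS
        return False
    return path.endswith(".gz")
```

With such a guard the script could stop after printing "Unsupported OS" rather than continuing into code that expects a path.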

oom in dmesg but not system logs

Dmesg reporting OOM issues but script reporting nothing

This is because dmesg is reporting very old oom incidents.
System has rotated and purged all logs since that report so there is nothing left in the log file.

Grab the oldest date (1st line in the oldest compressed file) and print a message explaining that it's an old incident and that there have been no incidents since $date.

FR: Add override for 300MB file limit

Currently the script does not run when the file is larger than 300MB.
Add an option like --override to bypass this limit if the device has a significant amount of RAM.

Note: Limit was put in place for small devices. Maybe add a check for free RAM and base the size of the file on that.
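A rough sketch of both ideas follows: an override switch for the size limit, and a helper that reads available memory from /proc/meminfo text. The function names are hypothetical. The `MemAvailable` field is only present on newer kernels, so the helper returns `None` when it is missing.

```python
DEFAULT_LIMIT_BYTES = 300 * 1024 * 1024  # the 300MB limit mentioned above


def may_process(size_bytes, override=False, limit=DEFAULT_LIMIT_BYTES):
    """Allow files up to the limit, or any size when override is set."""
    return override or size_bytes <= limit


def mem_available_bytes(meminfo_text):
    """Extract MemAvailable (reported in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    return None  # field not present on this kernel
```

A RAM-based limit could then be some fraction of `mem_available_bytes(...)`. What fraction is safe would need testing on small devices.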

RHEL 5 - compatibility issue

File "<stdin>", line 80
    with open("/proc/meminfo", "r") as meminfo:
            ^
SyntaxError: invalid syntax

Note: RHEL5 and python 2.4.x are EOL. I will not go out of my way to accommodate for EOL OS's and python versions.
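For context, the `with` statement is not available in Python 2.4, which is why the line above fails to parse. For anyone who needs to patch a local copy, an equivalent that avoids `with` is a plain try/finally. This is a sketch with a hypothetical `read_meminfo` helper, not a supported change to the script.

```python
def read_meminfo(path="/proc/meminfo"):
    """Read a file without the `with` statement, closing it in all cases."""
    meminfo = open(path, "r")
    try:
        return meminfo.read()
    finally:
        # runs whether or not read() raised, like leaving a `with` block
        meminfo.close()
```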

CentOS/RHEL 5 - No information provided

The script doesn't provide information when there is an OOM incident on CentOS/RHEL 5.

Although the system does not log in the same manner, more information can be provided.
Update script to provide a little bit more information

Date objects

Fix date objects so that the sort works chronologically rather than alphabetically, as it does currently.
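A minimal sketch of chronological sorting is shown below. It assumes traditional syslog timestamps of the form `Mmm dd hh:mm:ss`, English month abbreviations, and a year supplied by the caller, since such timestamps carry none. The `sort_syslog_dates` helper is hypothetical.

```python
from datetime import datetime


def sort_syslog_dates(stamps, year):
    """Sort 'Mmm dd hh:mm:ss' strings by date rather than alphabetically."""
    def as_datetime(stamp):
        # Prepend the assumed year so strptime gets a complete date
        return datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")
    return sorted(stamps, key=as_datetime)


stamps = ["Oct  1 02:00:00", "Sep 30 23:59:59", "Apr  9 12:00:00"]
ordered = sort_syslog_dates(stamps, 2017)
```

An alphabetical sort would place `Oct` before `Sep`. Parsing into `datetime` objects puts September first.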

Report # of processes

Add functionality to report on the total number of processes recorded when the system ooms
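One possible approach, sketched under assumptions, is to count process-table rows after each "invoked oom-killer" line. The row pattern below assumes the `[ pid ] uid tgid total_vm rss ...` layout, which varies between kernel versions. The helper name and the sample lines are made up for illustration.

```python
import re

# Assumed shape of a process-table row: [ pid ] followed by numeric columns
PROCESS_ROW = re.compile(r"\[\s*\d+\]\s+\d+\s+\d+\s+\d+\s+\d+\s")


def processes_per_incident(lines):
    """Return one process count per OOM incident, in log order."""
    counts = []
    for line in lines:
        if "invoked oom-killer" in line:
            counts.append(0)  # a new incident starts here
        elif counts and PROCESS_ROW.search(line):
            counts[-1] += 1
    return counts


sample = [
    "kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0",
    "kernel: [ 1001]     0  1001    50000    25600       60        0             0 httpd",
    "kernel: [ 1002]    48  1002    50000    25600       60        0             0 httpd",
    "kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0",
    "kernel: [ 2001]    27  2001   200000   128000      300        0             0 mysqld",
]
counts = processes_per_incident(sample)
```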
