
super-speedy-syslog-searcher's Introduction

Super Speedy Syslog Searcher! (s4)

Speedily search and merge log messages by datetime.


Super Speedy Syslog Searcher (s4) is a command-line tool to search and merge varying log messages from varying log files, sorted by datetime. Datetime filters may be passed to narrow the search to a datetime range.

s4 can read standardized log message formats like RFC 3164 and RFC 5424 ("syslog"), Red Hat Audit logs, strace output, and can read many non-standardized ad-hoc log message formats, including multi-line log messages. It also parses binary accounting records acct, lastlog, and utmp (acct, pacct, lastlog, utmp, utmpx, wtmp), systemd journal logs (.journal), and Microsoft Event Logs (.evtx). s4 can read logs that are compressed (.bz2, .gz, .lz4, .xz), or archived logs (.tar).

s4 aims to be very fast.



Use

Install super_speedy_syslog_searcher

Assuming rust is installed, run

cargo install --locked super_speedy_syslog_searcher

A C compiler is required.

allocator mimalloc or jemalloc

The default allocator is the System allocator.

The allocator mimalloc is enabled with feature mimalloc, and the allocator jemalloc with feature jemalloc. Allocator mimalloc is the fastest according to the mimalloc project benchmarks. jemalloc is also very good.


mimalloc

cargo install --locked super_speedy_syslog_searcher --features mimalloc

A Bus error crash is a known issue on some aarch64-unknown-linux-gnu systems:

$ s4 --version
Bus error

Either use jemalloc or the default System allocator.


jemalloc

cargo install --locked super_speedy_syslog_searcher --features jemalloc


Here are the packages for building super_speedy_syslog_searcher with jemalloc or mimalloc on various Operating Systems.

Alpine

apk add gcc make musl-dev

Debian and Ubuntu

apt install gcc make libc6-dev

or

apt install build-essential

OpenSUSE

zypper install gcc glibc-devel make

Red Hat and CentOS

yum install gcc glibc-devel make

feature mimalloc on Windows

Compiling mimalloc on Windows requires lib.exe, which is part of Visual Studio Build Tools. Instructions are at rustup.rs.

Run s4

For example, print all the log messages in syslog files under /var/log/

s4 /var/log

On Windows, print the ad-hoc logs under C:\Windows\Logs

s4.exe C:\Windows\Logs

On Windows, print all .log files under C:\Windows (with the help of Powershell)

Get-ChildItem -Filter '*.log' -File -Path "C:\Windows" -Recurse -ErrorAction SilentlyContinue `
  | Select-Object -ExpandProperty FullName `
  | s4.exe -

• Note that UTF-16 encoded logs cannot be parsed; see Issue #16.
• Note that opening too many files causes the error too many files open; see Issue #270. Using Get-ChildItem -Filter lessens the number of files opened by s4.exe.

On Windows, print the Windows Event logs

s4.exe C:\Windows\System32\winevt\Logs

Print the log messages after January 1, 2022 at 00:00:00

s4 /var/log -a 20220101

Print the log messages from January 1, 2022 00:00:00 to January 2, 2022

s4 /var/log -a 20220101 -b 20220102

or

s4 /var/log -a 20220101 -b @+1d

Print the log messages on January 1, 2022, from 12:00:00 to 16:00:00

s4 /var/log -a 20220101T120000 -b 20220101T160000

Print the record-keeping log messages from up to a day ago (with the help of find)

find /var -xdev -type f \( \
    -name 'lastlog' \
    -or -name 'wtmp' \
    -or -name 'wtmpx' \
    -or -name 'utmp' \
    -or -name 'utmpx' \
    -or -name 'acct' \
    -or -name 'pacct' \
  \) \
    2>/dev/null \
    | s4 - -a=-1d

Print the journal log messages from up to an hour ago, prepending the journal file name (with the help of find)

find / -xdev -name '*.journal' -type f 2>/dev/null \
    | s4 - -a=-1h -n

Print only the log messages that occurred two days ago (with the help of GNU date)

s4 /var/log -a $(date -d "2 days ago" '+%Y%m%d') -b @+1d

Print only the log messages that occurred two days ago during the noon hour (with the help of GNU date)

s4 /var/log -a $(date -d "2 days ago 12" '+%Y%m%dT%H%M%S') -b @+1h

Print only the log messages that occurred two days ago during the noon hour in Bengaluru, India (timezone offset +05:30) and prepended with equivalent UTC datetime (with the help of GNU date)

s4 /var/log -u -a $(date -d "2 days ago 12" '+%Y%m%dT%H%M%S+05:30') -b @+1h
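The datetime filters also accept Unix epoch seconds prefixed with "+" (the "+%s" form described under --help below). A sketch of computing such a value, assuming GNU date is available:

```shell
# Compute a "+%s" epoch filter value with GNU date; for example,
# January 1, 2000 at 00:00 UTC:
epoch="+$(date -u -d '2000-01-01 00:00:00' '+%s')"
echo "${epoch}"
# prints "+946684800"; pass it as: s4 /var/log -a "${epoch}"
```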

--help

Speedily search and merge log messages by datetime.
DateTime filters may be passed to narrow the search.
s4 aims to be very fast.

Usage: s4 [OPTIONS] <PATHS>...

Arguments:
  <PATHS>...  Path(s) of log files or directories.
              Directories will be recursed. Symlinks will be followed.
              Paths may also be passed via STDIN, one per line. The user must
              supply argument "-" to signify PATHS are available from STDIN.

Options:
  -a, --dt-after <DT_AFTER>
          DateTime Filter After: print log messages with a datetime that is at
          or after this datetime. For example, "20200102T120000" or "-5d".
  -b, --dt-before <DT_BEFORE>
          DateTime Filter Before: print log messages with a datetime that is at
          or before this datetime.
          For example, "2020-01-03T23:00:00.321-05:30" or "@+1d+11h"
  -t, --tz-offset <TZ_OFFSET>
          Default timezone offset for datetimes without a timezone.
          For example, log message "[20200102T120000] Starting service" has a
          datetime substring "20200102T120000".
          That datetime substring does not have a timezone offset
          so this TZ_OFFSET value would be used.
          Example values, "+12", "-0800", "+02:00", or "EDT".
          To pass a value with leading "-" use "=" notation, e.g. "-t=-0800".
          If not passed then the local system timezone offset is used.
          [default: -07:00]
  -z, --prepend-tz <PREPEND_TZ>
          Prepend a DateTime in the timezone PREPEND_TZ for every line.
          Used in PREPEND_DT_FORMAT.
  -u, --prepend-utc
          Prepend a DateTime in the UTC timezone offset for every line.
          This is the same as "--prepend-tz Z".
          Used in PREPEND_DT_FORMAT.
  -l, --prepend-local
          Prepend DateTime in the local system timezone offset for every line.
          This is the same as "--prepend-tz +XX" where +XX is the local system
          timezone offset.
          Used in PREPEND_DT_FORMAT.
  -d, --prepend-dt-format <PREPEND_DT_FORMAT>
          Prepend a DateTime using the strftime format string.
          If PREPEND_TZ is set then that value is used for any timezone offsets,
          i.e. strftime "%z" "%:z" "%Z" values, otherwise the timezone offset value
          is the local system timezone offset.
          [Default: %Y%m%dT%H%M%S%.3f%z]
  -n, --prepend-filename
          Prepend file basename to every line.
  -p, --prepend-filepath
          Prepend file full path to every line.
  -w, --prepend-file-align
          Align column widths of prepended data.
      --prepend-separator <PREPEND_SEPARATOR>
          Separator string for prepended data.
          [default: :]
      --separator <LOG_MESSAGE_SEPARATOR>
          An extra separator string between printed log messages.
          Per log message not per line of text.
          Accepts a basic set of backslash escape sequences,
          e.g. "\0" for the null character, "\t" for tab, etc.
      --journal-output <JOURNAL_OUTPUT>
          The format for .journal file log messages.
          Matches journalctl --output options.
          [default: short]
          [possible values: short, short-precise, short-iso, short-iso-precise,
          short-full, short-monotonic, short-unix, verbose, export, cat]
  -c, --color <COLOR_CHOICE>
          Choose to print to terminal using colors.
          [default: auto]
          [possible values: always, auto, never]
      --blocksz <BLOCKSZ>
          Read blocks of this size in bytes.
          May pass value as any radix (hexadecimal, decimal, octal, binary).
          Using the default value is recommended.
          Most useful for developers.
          [default: 65536]
  -s, --summary
          Print a summary of files processed to stderr.
          Most useful for developers.
  -h, --help
          Print help
  -V, --version
          Print version

Given a file path, the file format will be processed based on a best guess of
the file name.
If the file format is not guessed then it will be treated as a UTF8 text file.
Given a directory path, found file names that have well-known non-log file name
extensions will be skipped.

DateTime Filters may be strftime specifier patterns:
    "%Y%m%dT%H%M%S*"
    "%Y-%m-%d %H:%M:%S*"
    "%Y-%m-%dT%H:%M:%S*"
    "%Y/%m/%d %H:%M:%S*"
    "%Y%m%d"
    "%Y-%m-%d"
    "%Y/%m/%d"
    "+%s"
Each * is an optional trailing 3-digit fractional sub-seconds,
or 6-digit fractional sub-seconds, and/or timezone.

Pattern "+%s" is a Unix epoch timestamp in seconds with a preceding "+".
For example, value "+946684800" is January 1, 2000 at 00:00 GMT.

DateTime Filters may be custom relative offset patterns:
    "+DwDdDhDmDs" or "-DwDdDhDmDs"
    "@+DwDdDhDmDs" or "@-DwDdDhDmDs"

Custom relative offset patterns "+DwDdDhDmDs" and "-DwDdDhDmDs" are offsets
from now (program start time) where "D" is a decimal number.
Each lowercase identifier is an offset duration:
"w" is weeks, "d" is days, "h" is hours, "m" is minutes, "s" is seconds.
For example, value "-1w22h" is one week and twenty-two hours in the past.
Value "+30s" is thirty seconds in the future.

Custom relative offset patterns "@+DwDdDhDmDs" and "@-DwDdDhDmDs" are relative
offsets from the other datetime filter.
Arguments "-a 20220102 -b @+1d" are equivalent to "-a 20220102 -b 20220103".
Arguments "-a @-6h -b 20220101T120000" are equivalent to
"-a 20220101T060000 -b 20220101T120000".

Without a timezone, the Datetime Filter is presumed to be the local
system timezone.

Command-line passed timezones may be numeric timezone offsets,
e.g. "+09:00", "+0900", or "+09", or named timezone offsets, e.g. "JST".
Ambiguous named timezones will be rejected, e.g. "SST".

--prepend-tz and --tz-offset function independently:
--tz-offset is used to interpret processed log message datetime stamps that
do not have a timezone offset.
--prepend-tz affects what is pre-printed before each printed log message line.

--separator accepts backslash escape sequences:
    "\0", "\a", "\b", "\e", "\f", "\n", "\r", "\\", "\t", "\v"

Resolved values of "--dt-after" and "--dt-before" can be reviewed in
the "--summary" output.

s4 uses file naming to determine the file type.

s4 can process files compressed and named .bz2, .gz, .lz4, .xz, and files
archived within a .tar file.

Log messages from different files with the same datetime are printed in order
of the arguments from the command-line.

Datetimes printed for .journal file log messages may differ from datetimes
printed by program journalctl.
See Issue #101

DateTime strftime specifiers are described at
https://docs.rs/chrono/latest/chrono/format/strftime/

Only datetimes of the Gregorian calendar are supported.

Only English-language datetimes are supported.

Further background and examples are at the project website:
https://github.com/jtmoon79/super-speedy-syslog-searcher/

Is s4 failing to parse a log file? Report an Issue at
https://github.com/jtmoon79/super-speedy-syslog-searcher/issues/new/choose
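The "@+1d" relative-offset equivalence described in the help text ("-a 20220102 -b @+1d" is "-a 20220102 -b 20220103") can be sanity-checked with GNU date; a sketch, assuming GNU date:

```shell
# "-a 20220102 -b @+1d" is equivalent to "-a 20220102 -b 20220103";
# confirm the day arithmetic with GNU date's relative-item parsing:
date -u -d '2022-01-02 + 1 day' '+%Y%m%d'
# prints 20220103
```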

About

Why s4?

Super Speedy Syslog Searcher (s4) is meant to aid Engineers in reviewing varying log files in a datetime-sorted manner. The primary use-case is to aid investigating problems wherein the time of a problem occurrence is known and there are many available logs but otherwise there is little source evidence.

Currently, log file formats vary widely. Most logs are an ad-hoc format. Even separate log files on the same system for the same service may have different message formats! Sorting these logged messages by datetime may be prohibitively difficult. The result is an engineer may have to "hunt and peck" among many log files, looking for problem clues around some datetime; so tedious!

Enter Super Speedy Syslog Searcher 🦸 ‼

s4 will print log messages from multiple log files in datetime-sorted order. A "window" of datetimes may be passed, to constrain the period of printed messages. This will assist an engineer that, for example, needs to view all log messages that occurred two days ago between 12:00 and 12:05 among log files taken from multiple systems.

The ulterior motive for Super Speedy Syslog Searcher was the primary developer wanted an excuse to learn rust 🦀, and wanted to create an open-source tool for a recurring need of some Software Test Engineers 😄

See the real-world example rationale in the section below, logging chaos: the problem s4 solves.

Features

File name guessing

Given a file path, s4 will attempt to parse it. The file type is determined from the file name, and guesses are made about files with non-standard names.

For example, standard file name utmp will always be treated as a utmp record file. But non-standard name log.utmp.1 is guessed to be a utmp record file. Similar guesses are applied to lastlog, wtmp, acct, pacct, journal, and evtx files. When combined with compression or archive file name extensions, e.g. .bz2, .gz, .lz4, or .xz, then s4 makes a best attempt at guessing the compression or archive type and the file within the archive based on the name. For example, user.journal.gz is guessed to be a systemd journal file within a gzip compressed file. However, if that same file is named something unusual like user.systemd-journal.gz then it is guessed to be a text log file within a gzip compressed file.

When a file type cannot be guessed, the file is treated as a UTF8 text log file. For example, a file named just unknown is not any obvious type, so s4 attempts to parse it as a UTF8 ad-hoc text log file.

tar files are inspected for parseable files.[2]

Directory walks

Given a directory path, s4 will walk the directory and all subdirectories and follow symbolic links and cross file system paths. s4 will ignore files with extensions that are known to be non-log files. For example, files with extensions .dll, .mp3, .png, or .so, are unlikely to be log files and so are not processed.

So given a file /tmp/file.mp3, an invocation of s4 /tmp will not attempt to process file.mp3. An invocation of s4 /tmp/file.mp3 will attempt to process file.mp3. It will be treated as a UTF8 text log file.
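The effect of this extension filtering can be mimicked with find for illustration (this is not s4's actual implementation; the directory and file names here are hypothetical):

```shell
# Create a throwaway directory with one log file and one non-log file,
# then list only files whose extensions are not in a skip list:
d="$(mktemp -d)"
touch "${d}/app.log" "${d}/song.mp3"
find "${d}" -type f ! -name '*.mp3' ! -name '*.png' ! -name '*.so'
# prints only .../app.log
rm -rf "${d}"
```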

Limitations

  • Only processes UTF-8 or ASCII encoded syslog files. (Issue #16)
  • Cannot process multi-file .gz files (only processes first stream found). (Issue #8)
  • Cannot process multi-file .xz files (only processes first stream found). (Issue #11)
  • Cannot process .zip archives (Issue #39)
  • [1] ISO 8601
    • ISO 8601 forms recognized (using ISO descriptive format)
      • YYYY-MM-DDThh:mm:ss
      • YYYY-MM-DDThhmmss
      • YYYYMMDDThhmmss (may use date-time separator character 'T' or character blank space ' ')
    • ISO 8601 forms not recognized:
  • [2] Cannot process archive files or compressed files within other archive files or compressed files (Issue #14)
    e.g. cannot process logs.tar.xz, nor file log.gz within logs.tar

Hacks



More

Comparisons

An overview of features of varying log mergers including GNU tools.

Symbol
Yes
Most
Some
No
with an accompanying GNU program
! with user input
with complex user input

General Features

Program Source CLI TUI Interactive live tail merge varying log formats datetime search range
grep | sort C tail
s4 Rust
lnav C++
logmerger Python
tl Python
logdissect.py Python

Formal Log DateTime Supported

Program RFC 2822 RFC 3164 RFC 3339 RFC 5424 ISO 8601
grep | sort ! ! !
s4
lnav
logmerger ! !
tl

RFC 2822: Internet Message Format: Date and Time Specification; e.g. Wed, 1 Jan 2020 22:00:00 PST message…

RFC 3164: The BSD syslog Protocol: HEADER Part of a syslog Packet; e.g. <8>Jan 1 22:00:00 message…

RFC 3339: Date and Time on the Internet: Internet Date/Time Format; e.g. 2020-01-01T22:00:00-08:00 message…

RFC 5424: The Syslog Protocol: TIMESTAMP; e.g. 2020-01-01T22:00:00-08:00 message…

ISO 8601: Data elements and interchange formats – Information interchange – Representation of dates and times; e.g. 2020-01-01T22:00:00-08:00 message…, 20200101T220000-0800 message…, etc.


Other Log or File Formats Supported

Binary formats supported:

Program journal acct/lastlog/utmp .evtx .pcap/.pcapng .jsonl
grep | sort
s4
lnav
logmerger
tl

Ad-hoc text formats:

Program Ad-hoc text formats Red Hat Audit Log strace Apache Common Log Format
grep | sort !
s4
lnav
logmerger
tl

All programs besides s4 fail to merge different text log formats.


Archive Formats Supported

Program .gz .lz .lz4 .bz .bz2 .xz .tar .zip
grep | sort zgrep lz lz4 bzip bzip2 xz
s4
lnav ?
logmerger
tl
logdissect.py

Speed Comparison

A comparison of merging three large log files running on Ubuntu 22 on WSL2. The three log files have 5000 lines, 2158138 bytes (≈2.1 MB) each, with high-plane unicode. Each program had 30 runs except toolong.

Command Mean (ms) Min (ms) Max (ms) Max RSS (KB) CPU %
grep | sort 16.5 ± 0.6 15.7 18.6 5512 41%
s4 (system) 37.0 ± 1.8 34.3 40.9 48060 182%
s4 (jemalloc) 37.2 ± 2.0 33.9 43.0 71536 165%
s4 (mimalloc) 32.0 ± 2.1 27.4 36.1 75776 182%
lnav 155.9 ± 1.8 153.0 162.7 37320 94%
logmerger 779.3 ± 10.4 760.3 803.2 55288 99%
toolong 53208 40%

Mean is mean runtime in milliseconds.
Min is minimum runtime in milliseconds.
Max is maximum runtime in milliseconds.
Max RSS is maximum Resident Set Size in Kilobytes.
CPU % is an average of CPU used over the runtime.

Programs tested:

  • GNU grep 3.7, GNU sort 8.32
  • s4 0.7.75
  • logmerger 0.9.0 on Python 3.10.12
  • tl 1.5.0 on Python 3.10.12

Using hyperfine to measure timing and GNU time to measure RSS and CPU.

See directory results in compare-log-mergers.txt.


Building locally

See section Install super_speedy_syslog_searcher.

Parsing .journal files

Parsing .journal files requires libsystemd to be installed; libsystemd.so is loaded at runtime.

Requesting Support For DateTime Formats; your particular log file

If you have found a log file that Super Speedy Syslog Searcher does not parse then you may create a new Issue type Feature request (datetime format).

Here is an example user-submitted Issue.

"syslog" and other project definitions

syslog

In this project, the term "syslog" is used generously to refer to any log message that has a datetime stamp on the first line of log text.

Technically, "syslog" is defined among several RFCs prescribing fields, formats, lengths, and other technical constraints. In this project, the term "syslog" is interchanged with "log".

The term "sysline" refers to one log message, which may comprise multiple text lines.

See docs section Definitions of data for more project definitions.

log message

A "log message" is a single log entry for any type of logging scheme; an entry in a utmpx file, an entry in a systemd journal, an entry in a Windows Event Log, a formal RFC 5424 syslog message, or an ad-hoc log message.


logging chaos: the problem s4 solves

In practice, most log file formats are ad-hoc. And among formally defined log formats, there are many variations. The result is that merging varying log messages by datetime is prohibitively tedious. If an engineer is investigating a problem that is symptomatic among many log files then the engineer must "hunt and peck" among those many log files. Log files cannot be merged into a single coherent view.

The following real-world example log files are available in project directory ./logs.

open-source software examples

nginx webserver

For example, the open-source nginx web server logs access attempts in an ad-hoc format in the file access.log

192.168.0.115 - - [08/Oct/2022:22:26:35 +0000] "GET /DOES-NOT-EXIST HTTP/1.1" 404 0 "-" "curl/7.76.1" "-"

which is an entirely dissimilar log format to the neighboring nginx log file, error.log

2022/10/08 22:26:35 [error] 6068#6068: *3 open() "/usr/share/nginx/html/DOES-NOT-EXIST" failed (2: No such file or directory), client: 192.168.0.115, server: _, request: "GET /DOES-NOT-EXIST HTTP/1.0", host: "192.168.0.100"

nginx is following the bad example set by the apache web server.

Debian 11

Here are log snippets from a Debian 11 host.

file /var/log/alternatives.log

update-alternatives 2022-10-10 23:59:47: run with --quiet --remove rcp /usr/bin/ssh

file /var/log/dpkg.log

2022-10-10 15:15:02 upgrade gpgv:amd64 2.2.27-2 2.2.27-2+deb11u1

file /var/log/kern.log

Oct 10 23:07:16 debian11-b kernel: [    0.10034] Linux version 5.10.0-11-amd64

file /var/log/unattended-upgrades/unattended-upgrades-shutdown.log

2022-10-10 23:07:16,775 WARNING - Unable to monitor PrepareForShutdown() signal, polling instead.

binary files

And then there are binary files, such as the wtmp file on Linux and other Unix Operating Systems. Using tool utmpdump, a utmp record structure is converted to text like:

[7] [12103] [ts/0] [user] [pts/0] [172.1.2.1] [172.1.2.2] [2023-03-05T23:12:36,270185+00:00]

And from a systemd .journal file, read using journalctl

Mar 03 10:26:10 host systemd[1]: Started OpenBSD Secure Shell server.
░░ Subject: A start job for unit ssh.service has finished successfully
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit ssh.service has finished successfully.
░░
░░ The job identifier is 120.
Mar 03 10:31:23 host sshd[4559]: Accepted login for user1 from 172.1.2.1 port 51730 ssh2

Try merging those two log messages by datetime using GNU grep, sort, sed, or awk!

Additionally, if the wtmp file is from a different architecture or Operating System, then the binary record structure is likely not parseable by the resident utmpdump tool. What then!?

commercial software examples

Commercial software and computer hardware vendors nearly always use ad-hoc log message formatting that is even more unpredictable among each log file on the same system.

Synology DiskStation

Here are log file snippets from a Synology DiskStation host.

file DownloadStation.log

2019/06/23 21:13:34	(system) trigger DownloadStation 3.8.13-3519 Begin start-stop-status start

file sfdisk.log

2019-04-06T01:07:40-07:00 dsnet sfdisk: Device /dev/sdq change partition.

file synobackup.log

info	2018/02/24 02:30:04	SYSTEM:	[Local][Backup Task Backup1] Backup task started.

(yes, those are tab characters)

Mac OS 12

Here are log file snippets from a Mac OS 12.6 host.

file /var/log/system

Oct 11 15:04:55 localhost syslogd[110]: Configuration Notice:
	ASL Module "com.apple.cdscheduler" claims selected messages.
	Those messages may not appear in standard system log files or in the ASL database.

file /var/log/wifi

Thu Sep 21 23:05:35.850 Usb Host Notification NOT activated

file /var/log/fsck_hs.log

/dev/rdisk2s2: fsck_hfs started at Thu Sep 21 21:31:05 2023
  QUICKCHECK ONLY; FILESYSTEM CLEAN

file /var/log/anka.log

Fri Sep 22 00:06:05 UTC 2023: Checking /Library/Developer/CoreSimulator/Volumes/watchOS_20S75...

file /var/log/displaypolicyd.log

2023-09-15 04:26:56.330256-0700: Started at Fri Sep 15 04:26:56 2023

file /var/log/com.apple.xpc.launchd/launchd.log.1

2023-10-26 16:56:23.287770 <Notice>: swap enabled

file /var/log/asl/logs/aslmanager.20231026T170200+00

Oct 26 17:02:00: aslmanager starting

Did you also notice how the log file names differ in unexpected ways?

Microsoft Windows 10

Here are log snippets from a Windows 10 host.

file ${env:SystemRoot}\debug\mrt.log

Microsoft Windows Malicious Software Removal Tool v5.83, (build 5.83.13532.1)
Started On Thu Sep 10 10:08:35 2020

file ${env:SystemRoot}\comsetup.log

COM+[12:24:34]: ********************************************************************************
COM+[12:24:34]: Setup started - [DATE:05,27,2020 TIME: 12:24 pm]

file ${env:SystemRoot}\DirectX.log

11/01/19 20:03:40: infinst: Installed file C:\WINDOWS\system32\xactengine2_1.dll

file ${env:SystemRoot}/Microsoft.NET/Framework/v4.0.30319/ngen.log

09/15/2022 14:13:22.951 [515]: 1>Warning: System.IO.FileNotFoundException: Could not load file or assembly

file ${env:SystemRoot}/Performance/WinSAT/winsat.log

68902359 (21103) - exe\logging.cpp:0841: --- START 2022\5\17 14:26:09 PM ---
68902359 (21103) - exe\main.cpp:4363: WinSAT registry node is created or present

(yes, it reads hour 14, and PM… 🙄)

Summary

This chaotic logging approach is typical of commercial and open-source software, AND IT'S A MESS! Attempting to merge log messages by their natural sort mechanism, a datetime stamp, is difficult to impossible.

Hence the need for Super Speedy Syslog Searcher! 🦸 s4 merges varying log files into a single coherent datetime-sorted log.


Further Reading


super-speedy-syslog-searcher's People

Contributors

dependabot[bot], jtmoon79


super-speedy-syslog-searcher's Issues

printing prepended data may cause errant �

tldr Printing with prepended data may insert an errant Unicode "unknown glyph" character �.

Reproduction

Given file ./logs/other/tests/dtf9c-12-baddate.log, with sample of contents

$ head -n2 ./logs/other/tests/dtf9c-12-baddate.log
Jan 1 01:00:00 0 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃
Feb 2 02:00:00 1 😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃😀

run s4 as-is (it will print with color, no prepended data).

$ ./target/release/s4 ./logs/other/tests/dtf9c-12-baddate.log | head -n1
Jan 1 01:00:00 0 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃

run s4 with a prepended filename

$ ./target/release/s4 -p ./logs/other/tests/dtf9c-12-baddate.log | head -n1
./logs/other/tests/dtf9c-12-baddate.log:Jan 1 01:00:00 0 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻�😽😾😿🙀🙁🙂🙃�

ERROR: two errant � characters are in the output.

This errant output will occur for any mode of prepended data, but at different rates:

  • with -l the errant � occurs 4 times
  • with -p the errant � occurs 20 times
  • with -l -p the errant � occurs 4 times

Other observations


dpo!("print_color_line_highlight_dt! slice_a.len() {} slice_b_dt.len() {} slice_c.len() {}", slice_a.len(), slice_b_dt.len(), slice_c.len());

Running the debug version of the build, with and without -p, showed the selected slices were the same, e.g.

without -p, debug output

     print_color_line_highlight_dt! slice_a.len() 0 slice_b_dt.len() 15 slice_c.len() 277
Dec 12 12:00:00 11 😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃😀😁😂😃😄😅😆😇😈😉😊

with -p, debug output

     print_color_line_highlight_dt! slice_a.len() 0 slice_b_dt.len() 15 slice_c.len() 277
./logs/other/tests/dtf9c-12-baddate.log:Dec 12 12:00:00 11 😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃😀😁�😃😄😅😆😇😈😉😊�

  • Comparing the first line of output when using -p, colorbindiff.pl shows no errant difference
$ ./target/release/s4 -p ./logs/other/tests/dtf9c-12-baddate.log | head -n1 > out-p1.txt

$ ./target/release/s4 ./logs/other/tests/dtf9c-12-baddate.log | head -n1 > out1.txt

$ perl ./colorbindiff.pl out1.txt out-p1.txt
OFFSET   00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F                   OFFSET   00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F
0x0000*  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  .. ................  0x0000* +2e +2f +6c +6f +67 +73 +2f +6f +74 +68 +65 +72 +2f +74 +65 +73 ./logs/other/tes
0x0000*  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  .. ................  0x0010* +74 +73 +2f +64 +74 +66 +39 +63 +2d +31 +32 +2d +62 +61 +64 +64 ts/dtf9c-12-badd
0x0000*  ..  ..  ..  ..  ..  ..  ..  ..                                 ........          0x0020* +61 +74 +65 +2e +6c +6f +67 +3a                                 ate.log:
0x0000   1b  5b  30  6d  1b  5b  33  36  6d  1b  5b  30  6d  1b  5b  34 .[0m.[36m.[0m.[4  0x0028   1b  5b  30  6d  1b  5b  33  36  6d  1b  5b  30  6d  1b  5b  34 .[0m.[36m.[0m.[4
0x0010   6d  1b  5b  33  36  6d  4a  61  6e  20  31  20  30  31  3a  30 m.[36mJan 1 01:0  0x0038   6d  1b  5b  33  36  6d  4a  61  6e  20  31  20  30  31  3a  30 m.[36mJan 1 01:0
0x0020   30  3a  30  30  1b  5b  30  6d  1b  5b  33  36  6d  20  30  20 0:00.[0m.[36m 0   0x0048   30  3a  30  30  1b  5b  30  6d  1b  5b  33  36  6d  20  30  20 0:00.[0m.[36m 0
0x0030   f0  9f  98  80  f0  9f  98  81  f0  9f  98  82  f0  9f  98  83 ................  0x0058   f0  9f  98  80  f0  9f  98  81  f0  9f  98  82  f0  9f  98  83 ................
0x0040   f0  9f  98  84  f0  9f  98  85  f0  9f  98  86  f0  9f  98  87 ................  0x0068   f0  9f  98  84  f0  9f  98  85  f0  9f  98  86  f0  9f  98  87 ................
0x0050   f0  9f  98  88  f0  9f  98  89  f0  9f  98  8a  f0  9f  98  8b ................  0x0078   f0  9f  98  88  f0  9f  98  89  f0  9f  98  8a  f0  9f  98  8b ................
0x0060   f0  9f  98  8c  f0  9f  98  8d  f0  9f  98  8e  f0  9f  98  8f ................  0x0088   f0  9f  98  8c  f0  9f  98  8d  f0  9f  98  8e  f0  9f  98  8f ................
0x0070   f0  9f  98  90  f0  9f  98  91  f0  9f  98  92  f0  9f  98  93 ................  0x0098   f0  9f  98  90  f0  9f  98  91  f0  9f  98  92  f0  9f  98  93 ................
0x0080   f0  9f  98  94  f0  9f  98  95  f0  9f  98  96  f0  9f  98  97 ................  0x00A8   f0  9f  98  94  f0  9f  98  95  f0  9f  98  96  f0  9f  98  97 ................
0x0090   f0  9f  98  98  f0  9f  98  99  f0  9f  98  9a  f0  9f  98  9b ................  0x00B8   f0  9f  98  98  f0  9f  98  99  f0  9f  98  9a  f0  9f  98  9b ................
0x00A0   f0  9f  98  9c  f0  9f  98  9d  f0  9f  98  9e  f0  9f  98  9f ................  0x00C8   f0  9f  98  9c  f0  9f  98  9d  f0  9f  98  9e  f0  9f  98  9f ................
0x00B0   f0  9f  98  a0  f0  9f  98  a1  f0  9f  98  a2  f0  9f  98  a3 ................  0x00D8   f0  9f  98  a0  f0  9f  98  a1  f0  9f  98  a2  f0  9f  98  a3 ................
0x00C0   f0  9f  98  a4  f0  9f  98  a5  f0  9f  98  a6  f0  9f  98  a7 ................  0x00E8   f0  9f  98  a4  f0  9f  98  a5  f0  9f  98  a6  f0  9f  98  a7 ................
0x00D0   f0  9f  98  a8  f0  9f  98  a9  f0  9f  98  aa  f0  9f  98  ab ................  0x00F8   f0  9f  98  a8  f0  9f  98  a9  f0  9f  98  aa  f0  9f  98  ab ................
0x00E0   f0  9f  98  ac  f0  9f  98  ad  f0  9f  98  ae  f0  9f  98  af ................  0x0108   f0  9f  98  ac  f0  9f  98  ad  f0  9f  98  ae  f0  9f  98  af ................
0x00F0   f0  9f  98  b0  f0  9f  98  b1  f0  9f  98  b2  f0  9f  98  b3 ................  0x0118   f0  9f  98  b0  f0  9f  98  b1  f0  9f  98  b2  f0  9f  98  b3 ................
0x0100   f0  9f  98  b4  f0  9f  98  b5  f0  9f  98  b6  f0  9f  98  b7 ................  0x0128   f0  9f  98  b4  f0  9f  98  b5  f0  9f  98  b6  f0  9f  98  b7 ................
0x0110   f0  9f  98  b8  f0  9f  98  b9  f0  9f  98  ba  f0  9f  98  bb ................  0x0138   f0  9f  98  b8  f0  9f  98  b9  f0  9f  98  ba  f0  9f  98  bb ................
0x0120   f0  9f  98  bc  f0  9f  98  bd  f0  9f  98  be  f0  9f  98  bf ................  0x0148   f0  9f  98  bc  f0  9f  98  bd  f0  9f  98  be  f0  9f  98  bf ................
0x0130   f0  9f  99  80  f0  9f  99  81  f0  9f  99  82  f0  9f  99  83 ................  0x0158   f0  9f  99  80  f0  9f  99  81  f0  9f  99  82  f0  9f  99  83 ................
0x0140   0a                                                             .                 0x0168   0a

The difference is as expected: the output with -p has extra ASCII bytes to print the prepended filename. However, there are no errant control codes or partial unicode characters.
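The � glyph itself is the terminal's rendering of U+FFFD, substituted when a multi-byte UTF-8 sequence is cut short. A minimal illustration, assuming bash printf and od:

```shell
# U+1F600 "😀" is the four bytes f0 9f 98 80; emitting only the first
# three yields an invalid UTF-8 sequence, which terminals display as �.
# od shows the three raw bytes that were actually written:
printf '\xf0\x9f\x98' | od -An -tx1
```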


Using Windows Terminal 1.15, Ubuntu 22.

lines with '\r' but no '\n' will overwrite prepended options

Syslog lines with '\r' but no '\n' may overwrite data prepended by options like -u or -p.

For example, from /var/log/apt/term.log

$ s4 -p /var/log/apt/term.log
...
/var/log/apt/term.log:
/var/log/apt/term.log:Log started: 2022-09-01  06:21:45
(Reading database ... 111770 files and directories currently installed.)
/var/log/apt/term.log:Removing linux-modules-5.15.0-43-generic (5.15.0-43.46) ...

The line (Reading database ... 111770 files and directories currently installed.) is several visually overlapping Reading database ... statements.
As seen in hexyl

$ grep -Fe '(Reading database ... 111770 files' /var/log/apt/term.log | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│00000010│ 65 20 2e 2e 2e 20 0d 28 ┊ 52 65 61 64 69 6e 67 20 │e ... _(┊Reading │
│00000020│ 64 61 74 61 62 61 73 65 ┊ 20 2e 2e 2e 20 35 25 0d │database┊ ... 5%_│
│00000030│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│00000040│ 65 20 2e 2e 2e 20 31 30 ┊ 25 0d 28 52 65 61 64 69 │e ... 10┊%_(Readi│
│00000050│ 6e 67 20 64 61 74 61 62 ┊ 61 73 65 20 2e 2e 2e 20 │ng datab┊ase ... │
│00000060│ 31 35 25 0d 28 52 65 61 ┊ 64 69 6e 67 20 64 61 74 │15%_(Rea┊ding dat│
│00000070│ 61 62 61 73 65 20 2e 2e ┊ 2e 20 32 30 25 0d 28 52 │abase ..┊. 20%_(R│
│00000080│ 65 61 64 69 6e 67 20 64 ┊ 61 74 61 62 61 73 65 20 │eading d┊atabase │
│00000090│ 2e 2e 2e 20 32 35 25 0d ┊ 28 52 65 61 64 69 6e 67 │... 25%_┊(Reading│
│000000a0│ 20 64 61 74 61 62 61 73 ┊ 65 20 2e 2e 2e 20 33 30 │ databas┊e ... 30│
│000000b0│ 25 0d 28 52 65 61 64 69 ┊ 6e 67 20 64 61 74 61 62 │%_(Readi┊ng datab│
│000000c0│ 61 73 65 20 2e 2e 2e 20 ┊ 33 35 25 0d 28 52 65 61 │ase ... ┊35%_(Rea│
│000000d0│ 64 69 6e 67 20 64 61 74 ┊ 61 62 61 73 65 20 2e 2e │ding dat┊abase ..│
│000000e0│ 2e 20 34 30 25 0d 28 52 ┊ 65 61 64 69 6e 67 20 64 │. 40%_(R┊eading d│
│000000f0│ 61 74 61 62 61 73 65 20 ┊ 2e 2e 2e 20 34 35 25 0d │atabase ┊... 45%_│
│00000100│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│00000110│ 65 20 2e 2e 2e 20 35 30 ┊ 25 0d 28 52 65 61 64 69 │e ... 50┊%_(Readi│
│00000120│ 6e 67 20 64 61 74 61 62 ┊ 61 73 65 20 2e 2e 2e 20 │ng datab┊ase ... │
│00000130│ 35 35 25 0d 28 52 65 61 ┊ 64 69 6e 67 20 64 61 74 │55%_(Rea┊ding dat│
│00000140│ 61 62 61 73 65 20 2e 2e ┊ 2e 20 36 30 25 0d 28 52 │abase ..┊. 60%_(R│
│00000150│ 65 61 64 69 6e 67 20 64 ┊ 61 74 61 62 61 73 65 20 │eading d┊atabase │
│00000160│ 2e 2e 2e 20 36 35 25 0d ┊ 28 52 65 61 64 69 6e 67 │... 65%_┊(Reading│
│00000170│ 20 64 61 74 61 62 61 73 ┊ 65 20 2e 2e 2e 20 37 30 │ databas┊e ... 70│
│00000180│ 25 0d 28 52 65 61 64 69 ┊ 6e 67 20 64 61 74 61 62 │%_(Readi┊ng datab│
│00000190│ 61 73 65 20 2e 2e 2e 20 ┊ 37 35 25 0d 28 52 65 61 │ase ... ┊75%_(Rea│
│000001a0│ 64 69 6e 67 20 64 61 74 ┊ 61 62 61 73 65 20 2e 2e │ding dat┊abase ..│
│000001b0│ 2e 20 38 30 25 0d 28 52 ┊ 65 61 64 69 6e 67 20 64 │. 80%_(R┊eading d│
│000001c0│ 61 74 61 62 61 73 65 20 ┊ 2e 2e 2e 20 38 35 25 0d │atabase ┊... 85%_│
│000001d0│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│000001e0│ 65 20 2e 2e 2e 20 39 30 ┊ 25 0d 28 52 65 61 64 69 │e ... 90┊%_(Readi│
│000001f0│ 6e 67 20 64 61 74 61 62 ┊ 61 73 65 20 2e 2e 2e 20 │ng datab┊ase ... │
│00000200│ 39 35 25 0d 28 52 65 61 ┊ 64 69 6e 67 20 64 61 74 │95%_(Rea┊ding dat│
│00000210│ 61 62 61 73 65 20 2e 2e ┊ 2e 20 31 30 30 25 0d 28 │abase ..┊. 100%_(│
│00000220│ 52 65 61 64 69 6e 67 20 ┊ 64 61 74 61 62 61 73 65 │Reading ┊database│
│00000230│ 20 2e 2e 2e 20 31 31 31 ┊ 37 37 30 20 66 69 6c 65 │ ... 111┊770 file│
│00000240│ 73 20 61 6e 64 20 64 69 ┊ 72 65 63 74 6f 72 69 65 │s and di┊rectorie│
│00000250│ 73 20 63 75 72 72 65 6e ┊ 74 6c 79 20 69 6e 73 74 │s curren┊tly inst│
│00000260│ 61 6c 6c 65 64 2e 29 0d ┊ 0a                      │alled.)_┊_       │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

allow any leading timezone

Currently

CLI options -u and -l will preprint the UTC or Local timezone. However, a user may want to review syslogs according to a different timezone.

Feature Request

Allow a CLI option for any timezone to be preprinted, e.g.

   -pz, --prepend-timezone    Prepend DateTime in the passed Timezone for every line.
                              Accepts all strftime timezone formats.

extra TODOs

Update other CLI options.

   -u , --prepend-utc   Prepend DateTime in the UTC Timezone for every line. This is the same as passing "-pt UTC".
   -l , --prepend-local Prepend DateTime in the Local Timezone for every line. This is the same as passing "-pt Local".

Add -pz to the same CLI option grouping.

Add example to README.md

            Print only the syslog lines that occurred two days ago during the noon hour in
            Bengaluru, India (timezone offset +05:30) but shown in timezone Bengaluru, India.

            ```lang-text
            s4 /var/log -pz "+05:30" -a "$(date -d "2 days ago 12:00" '+%Y-%m-%dT%H:%M:%S') +05:30" -b "$(date -d "2 days ago 13:00" '+%Y-%m-%dT%H:%M:%S') +05:30"
            ```

prepend -w includes unprinted files

Problem

Passing -p -w computes the prepended filename width from all files that might be printed, not only from the files actually printed.

Given files

  • syslog with two syslines
  • fileWithLongNameNoSyslines with no syslines

The prepended filename syslog will have many extra spaces:

$ s4 -p -w syslog fileWithLongNameNoSyslines
syslog                    : 20220101T000001 sysline 1
syslog                    : 20220101T000002 sysline 2

Solution

Compute the prepended filename width only from files that will actually be printed.

preprinted datetime dropping fractional information

Problem

Preprinted datetimes (-l, -u) do not set the fractional value; it is always zero.

For example, from an Ubuntu host

$ ./target/release/s4 -l -n -w -a 20220813T120000 -b 20220813T200000 /var/log/
...
ubuntu-advantage-timer.log:20220813T180826.000000 +0000:2022-08-13 18:08:26,175 - timer.py:(46) [DEBUG]: Executed job: update_messaging
...

In the terminal, the datetime is correctly underlined, including the fractional, 2022-08-13 18:08:26,175 (notice 175).
But the preprinted datetime has fractional value 0: 20220813T180826.000000 +0000.

Solution

Set the fractional value in the preprinted datetime.

printed errors while piping data will errantly colorize error messages

Errors printed are the same color as the previously printed sysline.

Reproduction

Given some syslog file with multiple syslines, run

$ s4 /var/log/syslog | head -n1

Prints

Aug  7 00:00:00 ubuntu22-s4b systemd[1]: logrotate.service: Deactivated successfully.
ERROR: failed to print Broken pipe (os error 32)

The first line is colorized, which is expected. However, the line ERROR: failed to print Broken pipe (os error 32) is also colorized. It should be printed with the default text settings.

This was run in bash shell within Windows Terminal.

More...

The reason is that the shell escape code returning to default text settings occurs after the newline. This reset escape code is written to the pipe | and then head drops it. So the terminal never sees the reset escape code, and subsequent lines retain the colorization settings of the prior sysline.

This relates to a larger problem: user-facing errors are currently handled generically via eprintln!. However, user-facing errors should have awareness of TermColor settings and handle their own colorization. This Issue may require significant refactoring, depending upon the completeness of the solution.
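A minimal sketch of a fix, using hypothetical helper names (not the actual s4 printing code): emit the ANSI reset sequence before the newline, so a pipe consumer like `head -n1` still receives the reset with the last kept line.

```rust
// Sketch: place the ANSI reset before '\n' so it survives piping.
const ANSI_GREEN: &str = "\x1b[32m";
const ANSI_RESET: &str = "\x1b[0m";

/// Colorize `line`, ensuring the reset code precedes the newline.
fn colorize_line(line: &str) -> String {
    format!("{}{}{}\n", ANSI_GREEN, line, ANSI_RESET)
}

fn main() {
    let out = colorize_line("Aug  7 00:00:00 host systemd[1]: ok");
    // The reset arrives before '\n', so `head -n1` keeps it.
    assert!(out.ends_with("\x1b[0m\n"));
    print!("{}", out);
}
```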

read `.xz` file by requested block

Problem

An .xz file is entirely read during BlockReader::new.
This may cause problems for very large compressed files (the s4 program will hold the entire uncompressed file in memory; it would use too much memory).

The crate lzma-rs does not provide API xz_decompress_with_options which would allow limiting the bytes returned per call. It only provides xz_decompress which decompresses the entire file in one call. See gendx/lzma-rs#110

Solution

Read an .xz file per block request, as done for normal files.


Update: see Issue #283

Meta-Issue #182

parse CLI -a -b dates as written expressions, e.g. "two days ago"

Currently, dates must be given exactly and in full, e.g. -a "2000-01-01T03:04:05" or -a "2000-01-01". But often, end-users will be thinking in terms of relative past dates, e.g. "12 hours ago". It takes some mental effort to translate a relative verbal expression like "two days ago" into an exact numeric date. And often, users are Test Engineers elbows-deep in investigating a difficult bug, and spending those brain cycles is a burden.

Add parsing for written expressions of relative dates, e.g. "two days ago at noon", "10 hours ago", "1 year ago", etc., that can be passed to CLI options -a and -b.

This need not handle every written form and abbreviation of an expression with super-duper AI: "2 d noonish", "two days in the past!", "seek the past two days!", etc. It only needs a few simple written patterns, similar to GNU date: "2 days ago", "2 days ago at 12:00", "2 hours ago", "12:00" (today at 12:00), etc.
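The few simple patterns reduce to a Duration subtracted from "now"; a sketch using only the standard library (hypothetical helper, covering just a handful of phrasings):

```rust
use std::time::Duration;

/// Sketch: parse a few simple relative-date phrases into a Duration
/// in the past. A hypothetical helper, not part of s4.
fn parse_relative(expr: &str) -> Option<Duration> {
    let mut parts = expr.split_whitespace();
    let count: u64 = parts.next()?.parse().ok()?;
    let unit_secs = match parts.next()? {
        "hour" | "hours" => 3600,
        "day" | "days" => 86_400,
        "year" | "years" => 31_536_000,
        _ => return None,
    };
    match parts.next()? {
        "ago" => Some(Duration::from_secs(count * unit_secs)),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_relative("2 days ago"), Some(Duration::from_secs(172_800)));
    assert_eq!(parse_relative("10 hours ago"), Some(Duration::from_secs(36_000)));
    assert_eq!(parse_relative("noonish"), None);
    // A real implementation would subtract this Duration from "now"
    // to build the -a or -b DateTime filter.
}
```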

support parsing more written languages

Problem

Currently, only English-language logging phrases are matched by the regular expressions, e.g.

    // from file `logs/Windows10Pro/debug/mrt.log`
    // example with offset:
    //
    //               1         2         3         4          5         6         7         8         9
    //     01234567890123456789012345678901234567890012345678901234567890123456789012345678901234567890
    //     ---------------------------------------------------------------------------------------
    //     Microsoft Windows Malicious Software Removal Tool v5.83, (build 5.83.13532.1)
    //     Started On Thu Sep 10 10:08:35 2020
    //     ...
    //     Results Summary:
    //     ----------------
    //     No infection found.
    //     Successfully Submitted Heartbeat Report
    //     Microsoft Windows Malicious Software Removal Tool Finished On Tue Nov 10 18:54:47 2020
    //
    DTPD!(
        concatcp!("(Started On|Started on|started on|STARTED|Started|started|Finished On|Finished on|finished on|FINISHED|Finished|finished)[:]?", RP_BLANK, CGP_DAYa, RP_BLANK, CGP_MONTHb, RP_BLANK, CGP_DAYde, RP_BLANK, CGP_HOUR, D_T, CGP_MINUTE, D_T, CGP_SECOND, RP_BLANK, CGP_YEAR, RP_NOALNUM),
        DTFSS_YbdHMS, 0, 140, CGN_DAYa, CGN_YEAR,

https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.6.62/src/data/datetime.rs#L2715-L2740

In a non-English version of Windows, the phrases will be different; e.g. the phrase Started on in Windows set to locale es_MX might be comenzo en (just a guess, I haven't checked). I know for a fact that Windows does a fair amount of work to use localized strings for all user-facing messages (since I worked on that project at Microsoft 😁). This might not affect all the obscure text logs under C:\Windows but I bet it will affect some.

The only way to know the phrasing used in the non-English version of mrt.log is to prop up an instance of Windows using a different locale/language. Then do whatever is necessary to generate the file (save it as part of ./logs), and then update the code to match that non-English phrase. It's a bit of work for just one locale instance. There are many possible locales for Windows to use.

I would skip an approach that merely switched the language of the running Windows system. IME on every OS, only some applications correctly switch locale when the user or system changes locale. Most applications permanently stick to the locale that was present when the application first ran.

The same work would be necessary for Linux and Linux-based things.

dmesg file errantly parsed

File dmesg (and dmesg.1, dmesg.2.gz, etc.) is errantly parsed because of a datetime within the file.

Problem

dmesg contents

...
[    0.868675] kernel: rtc_cmos 00:03: setting system clock to 2022-07-31T20:39:27 UTC (1659299967)
[    0.869775] kernel: rtc_cmos 00:03: alarms up to one day, y3k, 242 bytes nvram, hpet irqs
[    0.870734] kernel: i2c_dev: i2c /dev entries driver
...

The datetime 2022-07-31T20:39:27 UTC is successfully parsed. The remainder of lines in the file are (wrongly) presumed to be part of one very large sysline. They are individual syslines.

$ ./target/release/s4 -l /var/log/dmesg
20220804T052008.000000 +0000:[    1.203365] kernel: rtc_cmos 00:03: setting system clock to 2022-08-04T05:20:08 UTC (1659590408)
20220804T052008.000000 +0000:[    1.204496] kernel: rtc_cmos 00:03: alarms up to one day, y3k, 242 bytes nvram, hpet irqs
20220804T052008.000000 +0000:[    1.205674] kernel: i2c_dev: i2c /dev entries driver
...

Solution

Either:

  • Add special handling for the dmesg file format; this is a fair amount of refactoring to support.
  • Add special handling to reject parsing of dmesg files.
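The rejection option could start from a cheap heuristic on the first line (a sketch with a hypothetical helper; a real detector would sample several lines):

```rust
/// Sketch: detect the dmesg-style "[   12.345678] " prefix so such
/// files can be rejected. Hypothetical heuristic, not the s4 code.
fn looks_like_dmesg_line(line: &str) -> bool {
    let Some(rest) = line.strip_prefix('[') else { return false };
    let Some(end) = rest.find(']') else { return false };
    // dmesg timestamps are seconds-since-boot, e.g. "0.868675"
    rest[..end].trim().parse::<f64>().is_ok()
}

fn main() {
    assert!(looks_like_dmesg_line("[    0.868675] kernel: rtc_cmos 00:03: ..."));
    assert!(!looks_like_dmesg_line("2022-07-31T20:39:27 UTC some message"));
}
```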

handle text encodings besides ASCII and UTF-8

Problem

Only ASCII encoded and UTF-8 encoded files can be processed. Log files with different encodings are processed but no text is found by s4.

For example, on a Windows 11 host, among .log files under C:\Windows, 28 of 128 files were not printed. A spot check of a few of those non-printed files showed they were UTF-16 encoded.

PS> Get-ChildItem -Filter '*.log' -File -Path "C:\Windows" -Recurse -ErrorAction SilentlyContinue `
   | Select-Object -ExpandProperty FullName `
   | s4.exe - --summary

Solution

Handle other text encodings.

Handling UTF-16 and UTF-32 would be satisfactory.
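UTF-16 decoding itself needs no external crate; a sketch using only the standard library (a real reader would also honor a BOM, detect endianness, and stream by block rather than decode whole buffers):

```rust
/// Sketch: decode UTF-16LE bytes, as produced by many Windows loggers.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None; // truncated code unit
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // "log" encoded as UTF-16LE
    let bytes = [0x6c, 0x00, 0x6f, 0x00, 0x67, 0x00];
    assert_eq!(decode_utf16le(&bytes).as_deref(), Some("log"));
}
```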

support ISO 8601 format; Unicode "minus sign"

According to Wikipedia

To represent a negative offset, ISO 8601 specifies using a minus sign, (−). If the interchange character set is limited and does not have a minus sign character, then the hyphen-minus should be used, (-). ASCII does not have a minus sign, so its hyphen-minus character (code is 45 decimal or 2D hexadecimal) would be used. If the character set has a minus sign, then that character should be used. Unicode has a minus sign, and its character code is U+2212 (2212 hexadecimal); the HTML character entity invocation is &minus;.

Problem

chrono parse_from_str fails to parse a Unicode "minus sign" character (U+2212).

Solution

To more fully support the ISO 8601 standard, s4lib should parse Unicode "minus sign" character (U+2212) in numeric timezone offsets, e.g. −07:00.


Follows from chronotope/chrono#835
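A minimal pre-pass (a sketch, not necessarily how s4lib would implement it) normalizes U+2212 to the ASCII hyphen-minus before the chrono parse:

```rust
/// Sketch: replace the Unicode minus sign (U+2212) with ASCII
/// hyphen-minus so chrono's parse_from_str can handle the offset.
fn normalize_minus(s: &str) -> String {
    s.replace('\u{2212}', "-")
}

fn main() {
    assert_eq!(
        normalize_minus("2022-01-02T03:04:05 \u{2212}07:00"),
        "2022-01-02T03:04:05 -07:00"
    );
}
```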

support parsing datetime without seconds

Problem

Arch package manager pacman uses the following syslog message format:

[2019-03-01 16:56] [PACMAN] synchronizing package lists

(there are no seconds)

Solution

Support parsing datetime without seconds. The seconds value would be 0.

Add built-in datetime.rs test cases that comment on the format.
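One low-cost approach is normalizing the captured time before datetime parsing, appending ":00" when seconds are absent, so an existing "%H:%M:%S" pattern still applies (a sketch with a hypothetical helper; the real code operates on regex captures):

```rust
/// Sketch: default missing seconds to 0 by appending ":00".
fn fill_missing_seconds(hm: &str) -> String {
    // count ':' separators: "16:56" has one, "16:56:07" has two
    if hm.matches(':').count() == 1 {
        format!("{}:00", hm)
    } else {
        hm.to_string()
    }
}

fn main() {
    // as in pacman's "[2019-03-01 16:56] [PACMAN] ..." format
    assert_eq!(fill_missing_seconds("16:56"), "16:56:00");
    assert_eq!(fill_missing_seconds("16:56:07"), "16:56:07");
}
```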

lines ending in only '\r' will overwrite prepended data

Lines that end in \r and not \n or \r\n will overwrite prepended data (-l, -p, -u).

Reproduction

Given a file with lines ending in \r, prepended output will be overwritten.

Using file ./logs/other/tests/dtf5-3-LF-CR.log, run

$ ./target/release/s4 -p ./logs/other/tests/dtf5-3-LF-CR.log
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:00 [dtf5-6a] LF
./logs/other/tests/dtf5-3-LF-CR.log:second line, first sysline LF
2000-01-01 00:00:02 [dtf5-6a] LFlog:2000-01-01 00:00:01 [dtf5-6a] CR
./logs/other/tests/dtf5-3-LF-CR.log:sixth line, third sysline LF

Hoping for

$ ./target/release/s4 -p ./logs/other/tests/dtf5-3-LF-CR.log
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:00 [dtf5-6a] LF
./logs/other/tests/dtf5-3-LF-CR.log:second line, first sysline LF
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:01 [dtf5-6a] CR
./logs/other/tests/dtf5-3-LF-CR.log:fourth line, second sysline CR
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:02 [dtf5-6a] LF
./logs/other/tests/dtf5-3-LF-CR.log:sixth line, third sysline LF

This is a contrived example. However, log files that record updating output will have many lines ending in \r, e.g. a percent indicator that is repeatedly overwritten. The s4 output will look strange.
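A possible mitigation when prepending (-l, -p, -u) is rewriting a bare trailing '\r' as '\n' before printing, so the carriage return cannot rewind the cursor over the prepended data (a sketch with a hypothetical helper; not current s4 behavior):

```rust
/// Sketch: neutralize a lone trailing '\r' line terminator.
/// "\r\n" and "\n" endings pass through unchanged.
fn sanitize_line_ending(line: &str) -> String {
    if line.ends_with("\r\n") || !line.ends_with('\r') {
        line.to_string()
    } else {
        // '\r' is one byte, safe to slice off
        format!("{}\n", &line[..line.len() - 1])
    }
}

fn main() {
    assert_eq!(sanitize_line_ending("(Reading database ... 25%\r"),
               "(Reading database ... 25%\n");
    assert_eq!(sanitize_line_ending("done\r\n"), "done\r\n");
    assert_eq!(sanitize_line_ending("done\n"), "done\n");
}
```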

support for datetime format for nvidia-installer

Problem

NVidia program nvidia-installer writes a log with an unsupported datetime format:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri May 31 13:49:08 2019
installer version: 340.107

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/bin

This program runs in Ubuntu and so should be supported.

Solution

Add support for the datetime format. Add built-in tests to datetime.rs and comment which test cases are that format.

chained block reads

Problem

Currently, only one "depth" of compressed or archived file is supported.
e.g. can read syslog stored in logs.tar. Cannot read syslog.gz in logs.tar, nor logs.tar stored in logs.tar.xz.
e.g. can read syslog stored in syslog.gz. Cannot read syslog.gz stored in syslog.gz.xz. Cannot read the special gzip+tar file logs.tgz.

Related, only plain text files are extractable from compressed files or archived files. EVTX, Journal files, and utmp files stored as a compressed or archived file are not readable. See FileType

Solution

Refactor BlockReader reading to handle arbitrary "chains" of reads for text files and UTMPX files.

Currently, JournalReader reads Journal files using libsystemd calls. BlockReader is not used by the JournalReader. Processing Journal files that are compressed or archived is outside the scope of this issue.

Currently, EvtxReader reads EVTX files using EvtxParser. BlockReader is not used by the EvtxReader. Processing EVTX files that are compressed or archived is outside the scope of this issue.

Relates to Issue #7.

support for datetime format for Synology custom log synocrond-execute.log

Problem

Synology DiskStation OS has an unsupported syslog datetime format in file synocrond-execute.log

12-06 00:59:44 running job: builtin-synosharing-default with command: /usr/syno/bin/synosharingcron as user root
12-06 05:33:44 running job: pkg-DownloadStation-DownloadStationUpdateJob with command: /var/packages/DownloadStation/target/bin/synodlupdate --update as user DownloadStation
12-06 09:00:44 running job: builtin-synosharesnapshot-default with command: /usr/syno/sbin/synosharesnapshot misc subvol-clean as user root

Solution

Either:

  • Support the ad-hoc message datetime format. Add test cases in datetime.rs and comment where they originate.
  • Do not support this poor ad-hoc format; mark as wontfix. The MM-DD format is ambiguous, as it could also mean DD-MM. While it could be determined by checking values, i.e. are the first two chars in 1-12?, that gets troublesome to do, and may not be possible to determine (e.g. the syslog file has one message like 01-02 00:00:00 hello; what then?).
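The value-checking idea can be sketched (hypothetical helper) to show both its use and its limit:

```rust
/// Sketch: "MM-DD" is a believable month-first date only if the first
/// field is in 1..=12. It cannot disambiguate values that fit both
/// interpretations, e.g. "01-02".
fn plausible_month_first(md: &str) -> Option<bool> {
    let (m, _d) = md.split_once('-')?;
    let m: u32 = m.parse().ok()?;
    Some((1..=12).contains(&m))
}

fn main() {
    assert_eq!(plausible_month_first("12-06"), Some(true));
    assert_eq!(plausible_month_first("31-06"), Some(false));
    // plausible either way; the check cannot resolve it
    assert_eq!(plausible_month_first("01-02"), Some(true));
}
```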

partial hour timezone offset not correctly correlated

Passing a "partial hour" timezone offset, e.g. +05:30, is not interpreted to UTC correctly.

$ s4 /var/log -u -a "$(date -d "2 days ago 12:00" '+%Y-%m-%dT%H:%M:%S') +05:30" -b "$(date -d "2 days ago 12:05" '+%Y-%m-%dT%H:%M:%S') +05:30"
20220915T063000.000000 +0000:Sep 14 22:30:00 ubuntu22-s4b sshd[1426912]: Received disconnect from 187.35.147.87 port 56531:11: Bye Bye [preauth]

Notice the first -a datetime should be a :30 minute mark. Yet it appears the time Sep 14 22:30:00 is associated with -u datetime 20220915T063000.000000 +0000, which has the same :30 minute offset. These times should be 30 minutes different.
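One way this bug could arise is truncating the offset to whole hours; a correct conversion must keep the minutes component. A sketch with a hypothetical helper:

```rust
/// Sketch: convert a "+HH:MM" / "-HH:MM" offset to signed seconds,
/// keeping the partial-hour minutes rather than truncating them.
fn offset_to_seconds(tz: &str) -> Option<i32> {
    let (sign, rest) = match tz.chars().next()? {
        '+' => (1, &tz[1..]),
        '-' => (-1, &tz[1..]),
        _ => return None,
    };
    let (h, m) = rest.split_once(':')?;
    let h: i32 = h.parse().ok()?;
    let m: i32 = m.parse().ok()?;
    Some(sign * (h * 3600 + m * 60))
}

fn main() {
    // +05:30 is 5.5 hours, not 5 or 6 whole hours
    assert_eq!(offset_to_seconds("+05:30"), Some(19_800));
    assert_eq!(offset_to_seconds("-07:00"), Some(-25_200));
}
```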

refactor regex parsing of datetime, build strftime pattern during regex capture processing

tl;dr removing const variables CGP_* representing strftime patterns, and instead building the strftime pattern dynamically based on regex capture results, would simplify processing and simplify declarations of DTPD!

Currently

Parsing a datetime from a string consists of several tedious steps:

  1. call fn bytes_to_regex_to_datetime https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2910
  2. does a regex capture to named capture groups
  3. passes named capture groups to fn captures_to_buffer_bytes https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2690
  4. fn captures_to_buffer_bytes rearranges captured data into a buffer
  5. the buffer and a strftime pattern is passed to fn datetime_parse_from_str https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2992-L2999

The strftime pattern is unnecessarily duplicated information. The patterns are defined at compile time within the DATETIME_PARSE_DATAS
https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L1535
The DTP_* variables complicate the transformation of bytes to regex captures to a datetime string.
https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L446-L494

The complicating part is the regular expression data must be transformed by fn captures_to_buffer_bytes to fit the predefined pattern (declared as const CGP_*).

Improvement Approach 1

Instead, the strftime pattern could be created dynamically alongside the extracted data within captures_to_buffer_bytes.
https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2689-L2696
The predefined global patterns (DTP_* variables) could be removed.
And the listings of DTPD! would be significantly simpler. And, the captures_to_buffer_bytes would not have to transform variables to match a predefined DTP_* pattern.

Example

For example, currently fn captures_to_buffer_bytes must look at the length of the named capture group fractional. From there it transforms the captured data to fit strftime specifier %f. This is tedious.

Instead fn captures_to_buffer_bytes can look at the captured fractional and determine if the associated strftime specifier should be %3f, %6f, or %9f. Then the function need not modify the data, it only needs to set the correct strftime specifier, say %3f for captured data "123". The fn captures_to_buffer_bytes can dynamically create the strftime pattern to pass to fn datetime_parse_from_str.

There would be no need for the tedious predeclared CGP_* variables within DTPD! declarations.
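The dynamic specifier selection described above could be sketched as (hypothetical helper; the real captures_to_buffer_bytes works on bytes and named capture groups):

```rust
/// Sketch: choose the chrono fractional-seconds specifier from the
/// captured digit count instead of transforming the captured data.
/// chrono supports the fixed-width specifiers %3f, %6f, and %9f.
fn fractional_specifier(captured: &str) -> Option<&'static str> {
    match captured.len() {
        3 => Some("%3f"),
        6 => Some("%6f"),
        9 => Some("%9f"),
        _ => None, // other widths would need padding or truncation
    }
}

fn main() {
    assert_eq!(fractional_specifier("123"), Some("%3f"));
    assert_eq!(fractional_specifier("123456"), Some("%6f"));
    assert_eq!(fractional_specifier("12"), None);
}
```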

Improvement Approach 2

Instead, the strftime pattern could be determined from the extracted capture group names found in the capture groups.

Currently, there are capture group names "<month>", "<day>", etc.

These capture group names could be modified to represent the format of data they have, e.g. "<month_b>" would signify data for strftime month specifier %b, "<month_m>" would signify data for strftime month specifier %m, etc.

In captures_to_buffer_bytes, a sequence of name queries would be made to the captures variable, e.g. if captures.name("month_b") { copy to buffer the "Jan"-like data; month = DTFS_Month::b; } else if captures.name("month_m") { copy to buffer the "01"-like data; month = DTFS_Month::m; } ....

Then in captures_to_buffer_bytes, a long match statement would choose the correct strftime specifier string. e.g.

   let strftime_format: &str = match(year, month, day, ...) {
      (DTFS_Year::Y, DTFS_Month::m, DTFS_Day::d, ...) => {
           "%Y%m%d..."
      }
     ...
   }

Later, the filled buffer and variable strftime_format would be passed to local fn datetime_parse_from_str.

user-passed `--prepend-dt-format` with variable width format string will be vertically unaligned

Problem

Recently added features --prepend-dt-format (Issue #28) allows users to pass a variable width strftime format string, e.g. "%A".
The following command will print vertically unaligned output:

$ cat /tmp/a.log
2000-01-01T00:00:00 [dtf2-2]

2000-01-02T00:00:01 [dtf2-2]a

$ s4 -d '%A:' -u /tmp/a.log
Saturday:2000-01-01T00:00:00 [dtf2-2]
Saturday:
Sunday:2000-01-02T00:00:01 [dtf2-2]a

Previously, the prepended datetime format was the consistent width strftime format, %Y%m%dT%H%M%S%.3f%z:.

This vertical un-alignment can be visually difficult to grok for end-users.

Solution

Allow the user to pass a variable width --prepend-dt-format format string, e.g. "%A", but maintain vertical alignment.

This could be mandatory behavior, or another CLI option, like --prepend-dt-align (similar to current option --prepend-file-align).
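A sketch of the proposed alignment (hypothetical helper; a real implementation would track the maximum width as files print):

```rust
/// Sketch: pad each formatted prepend to the widest value seen so
/// the prepended column stays vertically aligned.
fn align_prepends(prepends: &[&str]) -> Vec<String> {
    let width = prepends.iter().map(|p| p.len()).max().unwrap_or(0);
    prepends
        .iter()
        .map(|p| format!("{:<width$}", p, width = width))
        .collect()
}

fn main() {
    // "%A:" produces variable-width day names
    let aligned = align_prepends(&["Saturday:", "Sunday:"]);
    assert_eq!(aligned, vec!["Saturday:", "Sunday:  "]);
}
```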

support journal files

Problem

systemd .journal files are common and will become more common. Not being able to process a .journal file loses very important information (which degrades the utility of this project).

Solution

Support processing .journal files. They are remarkably different from ad-hoc syslog files, so the strategies within super-speedy-syslog-searcher will not apply. However, they are designed well and should be fast to process.

A "journal processor class" would be adjacent to a SyslogProcessor. In other words, all functionality of a SyslogProcessor (and its contained SyslineReader, LineReader, BlockReader) would be created anew in some JournalProcessor "class".

Here is a technical description of .journal file format (archived).

allow HMS only in -a -b values

Currently, an end-user can pass only a date, e.g. -a "2020-01-02". The Hours, Minutes, Seconds (HMS) is then set to 00:00:00 for that DateTime instance, e.g. the -a "2020-01-02" arguments become Local offset DateTime value "2020-01-02T00:00:00".

Allow the same for HMS. If passed, then the Date portion is set to chrono::Local::today(). For example, passing CLI arguments -a 01:02:03 would become DateTime instance 2022-09-21T01:02:03 (today) with Local timezone offset.

Similar area as #35

refactor chained paths

Problem

Currently, a file within a .tar file is represented as file.tar:log.txt using the : separator. Only one "depth" is supported, e.g. cannot have file.tar:log.tar:syslog.

Solution

  • refactor path passing to use something that understands "chained depth". It should not rely on an arbitrary separator character like :.
  • use : for user-facing printed separator of such paths

support processing bzip compressed files (`.bz`, `.bz2`)

Support processing .bz and .bz2 files; both compressed by bzip2.

Such compression is rarely used for log files, though it appears on Mac OS and other BSDs.

The good news: IIUC, bzip only compresses one file to one compressed file.
From Wikipedia

It is not an archiver like tar or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting

Hopefully this "one file to one file" compression means bzip is easier than gz and xz support in that there is no need for hacks to handle/workaround "multiple streams" (see Issue #11 , Issue #8 ).


Rust crates available:

  • bzip2_rs
    bzip2_rs is a pure Rust bzip2 decoder.
  • bzip2
    This library contains bindings to libbz2 to support bzip compression and decompression for Rust.
  • bzip2_sys
    Does not look ready for consumption.

process return code is confusing

The process return code for success 0 or failure 1 is confusing.


A few successful return codes:

$ ./target/release/s4 /var/log/wtmp
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
0

$ ./target/release/s4 /var/log/ 1> /dev/null
WARNING: not a parseable type "/var/log/journal/c3a57680c1d26ca313b9c7ec36a5beaa/system.journal"
WARNING: no syslines found "/var/log/lastlog"
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
0

Failure return code for some unknown files.

$ ./target/release/s4 /var/log/journal/c3a57680c1d26ca313b9c7ec36a5beaa/system.journal /var/log/wtmp /var/log/lastlog
WARNING: no syslines found "/var/log/lastlog"
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
1

Yet this succeeds:

$ ./target/release/s4 /var/log/wtmp /var/log/lastlog
WARNING: no syslines found "/var/log/lastlog"
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
0

No permissions causes error code (notice file noaccess)

$ ls -a /tmp/a
total 500
drwxr-xr-x  2 root root   4096 Aug  8 21:26 .
drwxrwxrwt 18 root root 479232 Aug  8 21:26 ..
-rw-r--r--  1 root root     32 Aug  6 03:50 dtf5-1.log
-rw-r--r--  1 root root      0 Aug  6 06:30 emptry
----------  1 root root     32 Aug  6 06:09 noaccess
-rw-r--r--  1 root root   4345 Aug  8 21:26 syslog.gz
-rw-r--r--  1 root root    200 Aug  6 05:50 wtmp200
-rw-r--r--  1 root root   1319 Aug  8 21:26 wtmp.gz

$ ./target/release/s4 /tmp/a 1>/dev/null
WARNING: no syslines found "/tmp/a/wtmp.gz"
WARNING: no syslines found "/tmp/a/wtmp200"

$ echo $?
0

$ sudo -u nobody -- ./target/release/s4 /tmp/a 1>/dev/null
WARNING: no syslines found "/tmp/a/wtmp200"
WARNING: no syslines found "/tmp/a/wtmp.gz"

$ echo $?
1

Dead symlink does not cause an error

$ ln -vs /tmp/DOES_NOT_EXIST /tmp/DEAD_LINK
'/tmp/DEAD_LINK' -> '/tmp/DOES_NOT_EXIST'

$ ./target/release/s4 /tmp/DEAD_LINK

$ echo $?
0

Parsing a symlink to an unparsable file does cause an error

$ ls -l /lib/libdmmp.so
lrwxrwxrwx 1 root root 16 Feb 21 21:18 /lib/libdmmp.so -> libdmmp.so.0.2.0

$ ./target/release/s4 /lib/libdmmp.so

$ echo $?
1


Overall, when and why a return code is success or failure requires further thought. Additionally, some unparsed files print an error while others do not (somewhat relates to Issue #3).

What does similar program GNU grep do?

CLI passed %Z fails to parse

Problem

The following command-line fails:

$ ./target/debug/s4 /var/log/syslog -a '2022-08-08 12:00:00 PST'
...
     process_dt: datetime_parse_from_str("2022-08-08 12:00:00 PST", "%Y-%m-%d %H:%M:%S %Z", true, +00:00)
        →datetime_parse_from_str: datetime_parse_from_str(pattern "%Y-%m-%d %H:%M:%S %Z", tz_offset +00:00, data "2022-08-08 12:00:00 PST")
        ←datetime_parse_from_str: DateTime::parse_from_str("2022-08-08 12:00:00 PST", "%Y-%m-%d %H:%M:%S %Z") failed ParseError: input is not enough for unique date and time
...
ERROR: Unable to parse a datetime from "2022-08-08 12:00:00 PST"

The --help message suggests possible datetimes:

$ ./target/release/s4 --help
super_speedy_syslog_searcher 0.0.25
....
DateTime Filter patterns may be:
    "%Y%m%dT%H%M%S"
    "%Y%m%dT%H%M%S%z"
    "%Y%m%dT%H%M%S%Z"
    "%Y-%m-%d %H:%M:%S"
    "%Y-%m-%d %H:%M:%S %z"
    "%Y-%m-%d %H:%M:%S %Z"
    "%Y-%m-%dT%H:%M:%S"
    "%Y-%m-%dT%H:%M:%S %z"
    "%Y-%m-%dT%H:%M:%S %Z"
    "%Y/%m/%d %H:%M:%S"
    "%Y/%m/%d %H:%M:%S %z"
    "%Y/%m/%d %H:%M:%S %Z"
    "%Y%m%d"
    "%Y%m%d %z"
    "%Y%m%d %Z"
    "+%s"
...

Solution

Support %Z parsing for command-line passed -a or -b.

allow setting prepended datetime format

Currently

CLI options -u and -l are hardcoded to strftime format "%Y%m%dT%H%M%S%.6f %z:".

Feature Request

Allow the user to pass a strftime format string. A format with a timezone is required.

When passed, do a sanity check that a DateTime can be formatted with it.
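The timezone requirement could be checked with a simple substring scan before accepting the format (a sketch; a fuller validation would also format a known DateTime with the pattern):

```rust
/// Sketch: verify a user-passed strftime format contains one of
/// chrono's timezone specifiers (%z, %:z, %#z, %Z).
fn has_timezone_specifier(fmt: &str) -> bool {
    ["%z", "%:z", "%#z", "%Z"].iter().any(|tz| fmt.contains(*tz))
}

fn main() {
    // the current hardcoded format carries %z, so it passes
    assert!(has_timezone_specifier("%Y%m%dT%H%M%S%.6f %z:"));
    assert!(!has_timezone_specifier("%Y-%m-%d %H:%M:%S"));
}
```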

read tar file by requested block

Problem

A .tar file is entirely read during BlockReader::read_block_FileTar.
This may cause problems for very large archived files (the s4 program will hold the entire unarchived file in memory; it may use too much memory).

This is due to design of the tar crate. The crate does not provide a method to store tar::Archive<File> instance and tar::Entry<'a, R: 'a + Read> instance due to inter-instance references and explicit lifetimes. (or is prohibitively complex; I made many attempts using various strategies involving references, lifetimes, pointers, etc.)

A tar::Entry holds a reference to data within the tar::Archive<File>. I found it impossible to store both related instances during new() or read_block_FileTar() and then later, during another call to read_block_FileTar(), utilize the same tar::Entry.
A new tar::Entry could be created per call to read_block_FileTar(). But then to read the requested BlockOffset, the entire .tar file entry would have to be re-read. This means reading an entire file entry within a .tar file would be an O(n^2) algorithm.

Solution

Read a .tar file per block request, as done for normal files.


Meta Issue #182
Similar problem as Issue #12.
