
pack2's Issues

feature: unhex of in-line HEX

Support in-context unhex of outfiles / potfiles / etc.

For example, this input:

96a1bbb41c713dce96b49dd13b6f6d07:$HEX[636f6c6f6e3a636f6c6f6e]

... should produce this output:

96a1bbb41c713dce96b49dd13b6f6d07:colon:colon
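
A minimal Rust sketch of the decode step (the function name is illustrative, not pack2's actual API):

// Decode a single field of the form $HEX[...] into raw bytes.
// Returns None if the field is not $HEX-encoded or the hex is malformed.
fn unhex_field(field: &str) -> Option<Vec<u8>> {
    let inner = field.strip_prefix("$HEX[")?.strip_suffix("]")?;
    if inner.len() % 2 != 0 {
        return None;
    }
    inner
        .as_bytes()
        .chunks(2)
        .map(|pair| {
            let hi = (pair[0] as char).to_digit(16)?;
            let lo = (pair[1] as char).to_digit(16)?;
            Some((hi * 16 + lo) as u8)
        })
        .collect()
}

For the input above, unhex_field("$HEX[636f6c6f6e3a636f6c6f6e]") yields the bytes of colon:colon; fields without the $HEX[ prefix would be passed through untouched.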

feature: operating on multiple files

For the subcommands that take input (pack2 cgrams, etc.), it would be handy to read multiple files directly, instead of piping them in with cat.
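
A hedged sketch of how the input side could be generalized, chaining the given files into one stream and falling back to stdin (reader setup only; names are illustrative):

use std::fs::File;
use std::io::{self, BufRead, BufReader, Read};

// Build one sequential reader over all given paths, or stdin if none.
fn open_inputs(paths: &[String]) -> io::Result<Box<dyn BufRead>> {
    if paths.is_empty() {
        return Ok(Box::new(BufReader::new(io::stdin())));
    }
    let mut readers: Vec<Box<dyn Read>> = Vec::new();
    for path in paths {
        readers.push(Box::new(File::open(path)?));
    }
    // Chain the files so they read like one concatenated stream (cat-style).
    let chained = readers
        .into_iter()
        .reduce(|a, b| Box::new(a.chain(b)) as Box<dyn Read>)
        .unwrap(); // non-empty: checked above
    Ok(Box::new(BufReader::new(chained)))
}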

option to extract exhausted masks from a hashcat logfile

Option to extract masks with a specific exit status from a hashcat logfile. The exit status could be an optional parameter, with exhausted masks (status 5) the default.

In other words, the Rust equivalent of this crude shell example, which extracts all exhausted masks (status 5):

$ egrep -h "mask_ctx->mask|status-after-work" ~/.hashcat/sessions/blah*.log \
    | cut -f3- \
    | egrep -B 1 'status-after-work.*5$' \
    | grep '>mask' \
    | awk '{print $2}'

It would also be useful if multiple logfiles (globbed) could be accepted.
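
A hedged Rust sketch of the same extraction (the log line layout is assumed from the shell pipeline above; field positions may differ in real logs):

use std::io::{self, BufRead};

// Remember the most recent mask line; when the following
// status-after-work line reports the wanted status (5 = exhausted),
// emit the remembered mask.
fn extract_masks(wanted_status: &str) -> io::Result<()> {
    let stdin = io::stdin();
    let mut last_mask: Option<String> = None;
    for line in stdin.lock().lines() {
        let line = line?;
        if line.contains("mask_ctx->mask") {
            // Assumption: the mask is the second whitespace field.
            last_mask = line.split_whitespace().nth(1).map(str::to_string);
        } else if line.contains("status-after-work")
            && line.trim_end().ends_with(wanted_status)
        {
            if let Some(mask) = last_mask.take() {
                println!("{}", mask);
            }
        }
    }
    Ok(())
}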

option: sort a wordlist by mask, then by natural sort order

It would be very useful to be able to sort a wordlist first by length, then by mask, then by natural sort order.

Example:

2345
z5678
1234
C5309
9999a
2345a
1234a
!6666
z9933
a1234

... would sort as:

1234
2345
1234a
2345a
9999a
a1234
z5678
z9933
C5309
!6666

Suggested mask order: ?d?l?u?s, to match the rough order of frequency groups.

It would be useful for this option to support sorting both raw wordlists and potfiles. Automatic interpretation of $HEX would also be good.
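
A minimal sketch of the proposed sort key, with class ranks in the suggested ?d?l?u?s order ("natural" order is approximated here by plain byte order):

// Rank a byte by character class, in the suggested ?d?l?u?s order.
fn class_rank(b: u8) -> u8 {
    match b {
        b'0'..=b'9' => 0, // ?d
        b'a'..=b'z' => 1, // ?l
        b'A'..=b'Z' => 2, // ?u
        _ => 3,           // ?s and everything else
    }
}

fn sort_wordlist(words: &mut Vec<String>) {
    words.sort_by_key(|w| {
        let mask: Vec<u8> = w.bytes().map(class_rank).collect();
        // Length first, then mask, then the word itself.
        (w.len(), mask, w.clone())
    });
}

Applied to the example list above, this key reproduces the expected ordering exactly.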

new tool: analyze and split strings on character-class changes

As an aid to extracting likely base words, it would be very useful to split strings on character-class changes. I've been calling the resulting strings 'tokens', but I think there's probably a better word. :)

An optional flag to consider changes in case to be significant could be useful.

For example, this list:

Hello123
PaSsWoRd$
hashes4evar

... might produce the following output, if case were treated as a character-class change:

Hello
123
Pa
Ss
Wo
Rd
$
hashes
4
evar

... and might produce this output, if case were not treated as a character-class change:

Hello
123
PaSsWoRd
$
hashes
4
evar

I'd argue that optionally normalizing the strings on the fly would also be useful, such that it might produce the output below. This somewhat artificially inflates the significance of the lower-case version of the word, but since the lower-case form is likely the most "basic" / "proto" version of a given base word, it could be argued that this is a feature, not a bug :)

Hello
hello
123
Pa
Ss
Wo
Rd
PaSsWoRd
password
$
hashes
4
evar

Since a common use case for this is to obtain frequency counts, an optional flag to accumulate frequency counts at the same time would be ideal (while preserving the ability to skip this, to support larger data sets).

Either way, finding a way to do this efficiently (in terms of both memory and speed) would be highly useful.

How to handle the long tail of non-ASCII / non-Unicode strings is up for discussion.
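
A minimal sketch of the splitter (names are illustrative). Judging from the examples, a "case change" boundary means a lower-to-upper transition, which keeps Hello whole while splitting PaSsWoRd:

// Split a string wherever the character class changes; with
// split_case set, a lower-to-upper transition inside a letter run
// also counts as a boundary.
fn split_tokens(s: &str, split_case: bool) -> Vec<String> {
    #[derive(PartialEq)]
    enum Class { Digit, Letter, Other }

    let classify = |c: char| {
        if c.is_ascii_digit() {
            Class::Digit
        } else if c.is_ascii_alphabetic() {
            Class::Letter
        } else {
            Class::Other
        }
    };

    let mut tokens: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut prev: Option<char> = None;
    for c in s.chars() {
        if let Some(p) = prev {
            let class_change = classify(p) != classify(c);
            let case_change =
                split_case && p.is_ascii_lowercase() && c.is_ascii_uppercase();
            if class_change || case_change {
                tokens.push(std::mem::take(&mut current));
            }
        }
        current.push(c);
        prev = Some(c);
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

The normalized variants from the third example could then be produced by additionally emitting the lowercased form of each letter token (and, in case-split mode, of the whole letter run) alongside each split.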

define and document default behavior

As we add more functionality, I'd like to have some clearly defined default behavior.
The ones I have in mind right now are:

  • Unless a file is specified, it must read from stdin
  • Unless a file is specified, it must write to stdout
  • Info output (e.g. stats) must always be written to stderr
  • Input lines in the $HEX[] format must always be decoded before processing
  • Output must always be encoded using the $HEX[] encoding if at least one character is outside of \x20 - \x7e

Of course there will be the occasional exception. For example, formatting the output of the unhex tool in $HEX[] would be pointless, as its purpose is to decode such lines.

I'm open to ideas and suggestions, hence the discussion label.
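
A minimal sketch of the output-encoding rule from the list above, assuming the decode side already exists (helper name illustrative):

// Encode an output line as $HEX[...] if any byte falls outside the
// printable ASCII range \x20 - \x7e, per the proposed default.
fn encode_line(line: &[u8]) -> String {
    if line.iter().all(|&b| (0x20..=0x7e).contains(&b)) {
        String::from_utf8_lossy(line).into_owned()
    } else {
        let hex: String = line.iter().map(|b| format!("{:02x}", b)).collect();
        format!("$HEX[{}]", hex)
    }
}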

feature: boundary window for cgrams

It would be useful to optionally produce cgrams within a window of X characters "beyond" the character change boundary.

For example, the default behavior of pack2 cgrams would have a range value of 0. With a specified range of '1', this input:

abcd1234

... would produce:

abcd
abcd1
d1234
1234

... and a range of '2' would produce:

abcd
abcd1
abcd12
cd1234
d1234
1234

In other words, this would produce a focused subset of what would be generated by window/slider tools, or tools like cutb, still informed by character-change boundaries.
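
A hedged sketch of the windowing step, assuming the class-boundary segments are already computed (as cgrams does today); byte indexing assumes ASCII input, and output order may differ from the listings above:

// For each boundary segment, also emit variants extended by up to
// `range` bytes past the leading and trailing boundaries.
fn windowed_cgrams(word: &str, segments: &[(usize, usize)], range: usize) -> Vec<String> {
    let len = word.len();
    let mut out = Vec::new();
    for &(start, end) in segments {
        out.push(word[start..end].to_string());
        for r in 1..=range {
            if end + r <= len {
                out.push(word[start..end + r].to_string()); // extend right
            }
            if start >= r {
                out.push(word[start - r..end].to_string()); // extend left
            }
        }
    }
    out
}

For "abcd1234" with segments (0,4) and (4,8), a range of 2 yields exactly the six strings listed above.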

Properly handle UTF-8 characters

Currently we treat any byte outside of 0x20 - 0x7e as the mask character ?b. This is not ideal, as we already know we don't have to check ?a (which is, of course, also part of ?b). (Still more accurate than PACK, which uses ?s.)
Rust has native support for UTF-8 strings, but it's too slow for us. The current idea is to check if at least one byte is outside of the ?a range and handle those lines in a slow path.
Once we have a validated UTF-8 character, we map it to its Unicode block.
Mapping a Unicode block to a mask is possible using custom charsets in combination with the --hex-charset flag.
Example input: Röschti
ö is part of the Latin-1 Supplement block.
This block in UTF-8 encoding ranges over [c2,c3] [80-bf], therefore our custom charsets would be ?1 c2c3 and ?2 808182...bf.
Full mask:

c2c3,808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf,?u?1?2?l?l?l?l?l

We could even go further, detect that it falls in the letters "sub-block", and use only that range in our mask.
This is a very basic example of how I think about this problem. I'm totally aware there will be cases which aren't this simple. This whole idea isn't set in stone and I'm open to any ideas and suggestions.
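
A hedged sketch of the fast/slow split (only two blocks shown; a real implementation would table all Unicode blocks and emit the matching custom-charset byte ranges):

// Returns None when the ASCII fast path applies or the line is not
// valid UTF-8; otherwise lists the block of each non-ASCII char.
fn non_ascii_blocks(line: &[u8]) -> Option<Vec<&'static str>> {
    if line.is_ascii() {
        return None; // fast path: the existing ?a handling applies
    }
    let s = std::str::from_utf8(line).ok()?; // slow path: validate UTF-8
    Some(
        s.chars()
            .filter(|c| !c.is_ascii())
            .map(|c| match c as u32 {
                0x0080..=0x00ff => "Latin-1 Supplement",
                0x0100..=0x017f => "Latin Extended-A",
                _ => "other",
            })
            .collect(),
    )
}

For "Röschti" this reports Latin-1 Supplement for the ö, which maps to the ?1/?2 charsets above.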

new tool: merge frequency counts

Expanding to the more general case mentioned in #8 (comment), it would be very useful to have an optimized tool to efficiently merge frequency-count data.

The use case is merging frequency counts across large datasets, and incrementally adding new frequency counts over time as new data is discovered. Calculating a frequency count for a delta or a new data source, and then merging it with an existing frequency count, is significantly more efficient than recalculating the entire frequency count.

The uniq -c format (integer frequency count, a space, and the item being counted) is the most obvious case, but other formats could be supported.

It would be nice to be able to assume that the list is sorted by the item being counted, but the implementation should assume that it's not. Or, perhaps, like rli vs rli2, one version that does not assume sorting but is memory-bound, and another version that has no size limits but requires sorted input (or a flag to switch between the two).

Reference awk implementation is here.
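
A minimal sketch of the memory-bound variant (the rli-style one), assuming uniq -c input on stdin:

use std::collections::HashMap;
use std::io::{self, BufRead};

// Accumulate counts per item across all concatenated inputs; uniq -c
// pads counts with leading spaces, hence the trim.
fn merge_counts() -> io::Result<HashMap<String, u64>> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        if let Some((count, item)) = line.trim_start().split_once(' ') {
            if let Ok(n) = count.parse::<u64>() {
                *totals.entry(item.to_string()).or_insert(0) += n;
            }
        }
    }
    Ok(totals)
}

The rli2-style variant would instead stream two sorted inputs in lockstep, summing counts when the items match, so memory stays constant.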

option: mask list set operations

Would be super useful to have a general way to merge, split, expand, and detect overlap in lists of masks.

For example, these masks:

?l?l?l?l?l?l?l
?l?l?l?l?l?l?u
?l?l?l?l?l?l?d
?l?l?l?l?l?l?s

... could be merged to:

?l?l?l?l?l?l?a

"Splitting" would be the opposite - turning ?a into its components.

Expanding could also be useful, perhaps with thresholding that is a little more sophisticated, based on target keyspace or runtime. For example:

?l?l?l?l?l

... could be expanded to:

?l?l?l?la
?l?l?l?lb
?l?l?l?lc
[etc]

... to fit a specific target runtime (--PPS).
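
A hedged sketch of the merge direction only, representing a mask as a vector of tokens (not pack2's internal representation); a real tool would also carry unmerged masks through:

use std::collections::{HashMap, HashSet};

// Group masks that are identical except for one position; if that
// position covers ?l, ?u, ?d and ?s, collapse the group into a
// single mask with ?a there.
fn merge_masks(masks: &[Vec<String>]) -> Vec<Vec<String>> {
    let mut groups: HashMap<(usize, Vec<String>), HashSet<String>> = HashMap::new();
    for mask in masks {
        for pos in 0..mask.len() {
            let mut hole = mask.clone();
            hole[pos] = String::new(); // blank out this position
            groups
                .entry((pos, hole))
                .or_default()
                .insert(mask[pos].clone());
        }
    }
    let full: HashSet<String> =
        ["?l", "?u", "?d", "?s"].iter().map(|s| s.to_string()).collect();
    let mut merged = Vec::new();
    for ((pos, mut hole), tokens) in groups {
        if tokens.is_superset(&full) {
            hole[pos] = "?a".to_string();
            merged.push(hole);
        }
    }
    merged
}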

Add a command line parser

We want to have options and a nice CLI interface. Most likely we will be using the structopt crate for this task.
It supports subcommands, options (both short and long), and a help generator, so our interface would look like this:

$ pack2 statsgen --option=value
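
A minimal sketch of what that could look like (the --option flag is just the placeholder from the example invocation):

use structopt::StructOpt;

#[derive(StructOpt)]
#[structopt(name = "pack2")]
enum Pack2 {
    /// Generate mask statistics from a wordlist.
    Statsgen {
        /// Placeholder option from the example above.
        #[structopt(long)]
        option: Option<String>,
    },
}

fn main() {
    match Pack2::from_args() {
        Pack2::Statsgen { option } => {
            // dispatch to the statsgen implementation here
            println!("statsgen, option = {:?}", option);
        }
    }
}

Doc comments double as the generated --help text, which is one of the main draws of structopt.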

Add an option for the separator between mask and count

A sensible default separator will be used, but allowing the user to choose a different one would be nice.
The comma will be blacklisted, because it would make parsing harder: it's already used for specifying the custom charsets described in #2.

ability to compare two mask stats file

Compare the percentages in a stats file from statsgen against a reference file (e.g. rockyou.masks), then sort from highest to lowest difference.
This could be helpful to pick some low-hanging fruit and get insight into which masks to try first.
To make it even more powerful, we can add the keyspace of each mask into the equation.
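
A minimal sketch of the comparison, assuming both stats files have been parsed into mask -> percentage maps:

use std::collections::HashMap;

// Diff two statsgen outputs and sort by largest gap, i.e. masks
// over-represented in the target list relative to the reference.
fn diff_stats(
    target: &HashMap<String, f64>,
    reference: &HashMap<String, f64>,
) -> Vec<(String, f64)> {
    let mut diffs: Vec<(String, f64)> = target
        .iter()
        .map(|(mask, &pct)| {
            let ref_pct = reference.get(mask).copied().unwrap_or(0.0);
            (mask.clone(), pct - ref_pct)
        })
        .collect();
    // Descending by difference; percentages are assumed non-NaN.
    diffs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    diffs
}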

feature: allow hard-coded characters in filtermask masks

filtermask doesn't appear to properly process hard-coded strings in masks:

$ pack2 filtermask ?l0 test.list
[snip]
a
c
d
e
f
g
h
l
m
n
p
q
s
t
u
v
x
y
z
$ pack2 filtermask ?l00 test.list
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libcore/slice/mod.rs:2842:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
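
The panic suggests the parser indexes the mask by charset position only. A generic sketch of tokenization that treats any non-? character as a hard-coded literal (not pack2's actual parser, just the direction of the fix):

// One mask position: either a built-in charset or a literal char.
enum MaskToken {
    Charset(char), // ?l, ?u, ?d, ?s, ?a, ?b, ...
    Literal(char), // hard-coded character such as '0'
}

fn parse_mask(mask: &str) -> Option<Vec<MaskToken>> {
    let mut tokens = Vec::new();
    let mut chars = mask.chars();
    while let Some(c) = chars.next() {
        if c == '?' {
            // '?' must be followed by an identifier; "??" is a literal '?'.
            match chars.next()? {
                '?' => tokens.push(MaskToken::Literal('?')),
                id => tokens.push(MaskToken::Charset(id)),
            }
        } else {
            tokens.push(MaskToken::Literal(c));
        }
    }
    Some(tokens)
}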

Error message displayed when piping into the head command

When running the following command

./pack2 statsgen plains.txt | head

The following error message is displayed. It appears to be cosmetic.

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }', src/statsgen.rs:174:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Add an option to choose the encoding of the input list

The default encoding will always be UTF-8, but we are not living in a perfect world, so there will be lists with a different encoding. Converting to UTF-8 isn't a problem, but there are some open questions.

  • Do we always use UTF-8 as the output encoding for the approach described in #2?
  • If not, we would have to implement the same for each and every encoding (that's a nightmare).
  • Do we simply ignore/drop invalidly encoded input? If not, what should be the fallback method? (applies to any encoding)
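
A hedged sketch of the conversion step using the encoding_rs crate (an assumed dependency), leaving the drop-vs-replace policy to the caller:

use encoding_rs::Encoding;

// Decode a line from a user-selected encoding label (e.g.
// "windows-1252") to UTF-8; None means unknown label or invalid input.
fn to_utf8(bytes: &[u8], label: &str) -> Option<String> {
    let enc = Encoding::for_label(label.as_bytes())?;
    let (text, _, had_errors) = enc.decode(bytes);
    if had_errors {
        return None; // caller decides: drop, or keep replacement chars
    }
    Some(text.into_owned())
}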

avoid panic if output terminates due to a downstream pipe

It would be clean to handle an expected end of output due to a downstream pipe closing:

$ tail -n 100000 potfile | cut -d: -f2- | pack2 unhex | head
[redacted]
[redacted]
[redacted]
[redacted]
[redacted]
[redacted]
[redacted]
[redacted]
[redacted]
[redacted]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }', src/unhex.rs:15:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
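
A minimal sketch of the fix, applicable to unhex, statsgen, and any other writer: treat BrokenPipe as a normal end of output instead of unwrap()ing it.

use std::io::{self, Write};

// Exit quietly when the downstream pipe (e.g. `head`) closes;
// propagate every other error.
fn write_line(out: &mut impl Write, line: &[u8]) -> io::Result<()> {
    match out.write_all(line).and_then(|_| out.write_all(b"\n")) {
        Err(e) if e.kind() == io::ErrorKind::BrokenPipe => {
            std::process::exit(0);
        }
        other => other,
    }
}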
