
combine's People

Contributors

alexcpsec, darkan, davidski, gbrindisi, jedisct1, jeffbryner, krmaxwell, waffle-iron


combine's Issues

Speed up enrichment

We perform this process each time an indicator appears rather than reusing the data for the same core indicator from earlier in the same run. In other words, if four different feeds each list 8.8.8.8, we perform that lookup four times. This is inefficient.

Instead, we should create a list of all the unique indicators in the current dataset, enrich those, and then map back to the original data.
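The proposed fix can be sketched as a one-pass cache: look each unique indicator up once, then map the enrichment back to every row. This is a minimal illustration, not combine's real code; `enrich_ip()` is a hypothetical stand-in for the expensive lookup.

```python
# Sketch: enrich each unique indicator once, then map results back to all rows.
# enrich_ip() is a placeholder for the real (slow) enrichment lookup.
def enrich_ip(indicator):
    # stand-in for the expensive DNS/ASN lookup
    return {"asn": "AS15169"} if indicator == "8.8.8.8" else {}

def enrich_all(rows):
    """rows: (indicator, source) pairs -> (indicator, source, enrichment)."""
    cache = {}
    for indicator, _ in rows:
        if indicator not in cache:  # each unique indicator is looked up once
            cache[indicator] = enrich_ip(indicator)
    return [(ind, src, cache[ind]) for ind, src in rows]
```

With four feeds all listing 8.8.8.8, `enrich_ip` runs once instead of four times.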

Group enrichments

From @alexcpsec in #21:

I would separate the enrichments by "groups" (for the lack of a better name) in a config file. And the groups would have a list of the sources that would be harvested by them.

And we start these groups out as "inbound" and "outbound".

If too generic (i.e., too much work for now), it is fine. But I think this would give you a lot of flexibility for further research (like a "CnC" group, a "malware download" group, etc., etc.).

Currently we separate by inbound/outbound which is fine for initial release, but can be enhanced.
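One way the "groups" config could look, sketched with the stdlib `configparser`; the section layout and source URLs are illustrative assumptions, not a format combine actually defines.

```python
# Hypothetical sketch of a "groups" config: each group names the sources it
# harvests. The layout is an assumption, not combine's actual config format.
from configparser import ConfigParser
from io import StringIO

CONFIG = """
[group:inbound]
sources = http://www.blocklist.de/lists/apache.txt

[group:outbound]
sources = http://www.nothink.org/blacklist/blacklist_malware_dns.txt
"""

parser = ConfigParser()
parser.read_file(StringIO(CONFIG))

# map group name -> list of source URLs
groups = {
    name.split(":", 1)[1]: parser.get(name, "sources").split()
    for name in parser.sections() if name.startswith("group:")
}
```

Adding a "CnC" or "malware download" group would then just be a new section, with no code change.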

Running with "--tiq-test" without "-e" gives an error

Maybe the tiq-test option should have a "compulsory" -e?

Storing parsed data in crop.json
Reading processed data from crop.json
Output regular data as CSV to harvest.csv
Traceback (most recent call last):
  File "combine.py", line 42, in <module>
    tiq_output('crop.json', 'enrich.json')
  File "/Users/alexcp/src/combine/baler.py", line 19, in tiq_output
    with open(enr_file, 'rb') as f:
IOError: [Errno 2] No such file or directory: 'enrich.json'
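One way to make `-e` compulsory alongside `--tiq-test` is a post-parse check; the flag names below are assumed from the issue text, not confirmed against combine.py.

```python
# Sketch: reject --tiq-test without -e at parse time, since tiq-test output
# needs enrich.json, which only enrichment produces. Flag names are assumptions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-e", "--enrich", action="store_true")
parser.add_argument("--tiq-test", action="store_true", dest="tiq_test")

def parse(argv):
    args = parser.parse_args(argv)
    if args.tiq_test and not args.enrich:
        parser.error("--tiq-test requires -e (enrichment produces enrich.json)")
    return args
```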

Define normalized data model

For the "threshing" step, we need to define a normalized data model. This should be aligned with whatever MLSec already uses for ease of "baling" but does not necessarily need to be the same.
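As a starting point, the fields combine's CSV output already exposes (entity, type, direction, source, notes, date) suggest one possible normalized record; this is a sketch of that shape, not a settled model.

```python
# One possible normalized record, mirroring the columns already present in
# combine's harvest.csv output (entity, type, direction, source, notes, date).
FIELDS = ("entity", "type", "direction", "source", "notes", "date")

record = {
    "entity": "8.8.8.8",
    "type": "IPv4",
    "direction": "inbound",
    "source": "http://www.blocklist.de/lists/apache.txt",
    "notes": "",
    "date": "2014-09-04",
}
```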

Exception in dnsdb queries

Enriching mail.TIKTIKZ.COM
Traceback (most recent call last):
  File "winnower.py", line 150, in <module>
    winnow('crop.json', 'crop.json', 'enriched.json')
  File "winnower.py", line 138, in winnow
    e_data = (addr, addr_type, direction, source, note, date, enrich_DNS(ipaddr, date, dnsdb))
  File "winnower.py", line 53, in enrich_DNS
    records = dnsdb.query_rrset(address, rrtype='A')
  File "/home/kmaxwell/src/combine/dnsdb_query.py", line 55, in query_rrset
    return self._query(path)
  File "/home/kmaxwell/src/combine/dnsdb_query.py", line 77, in _query
    http = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
  File "/usr/lib/python2.7/urllib2.py", line 258, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /lookup/rrset/name/95.85.191.8/A

Plugins

Implement a plugin system to thresh new sources (and likely for baling as well).

This is not a "first release" feature (i.e. post-DEFCON).

Make sure we can export in a dir structure that tiq-test can handle

MOAR work!

Here is how things look on the tiq-test data directory right now:

aperture-2:data alexcp$ ls
enriched    population  raw
aperture-2:data alexcp$ ls raw
public_inbound  public_outbound
aperture-2:data alexcp$ ls raw/pu
public_inbound/  public_outbound/
aperture-2:data alexcp$ ls raw/public_inbound/
20140615.csv.gz 20140618.csv.gz 20140622.csv.gz 20140625.csv.gz 20140628.csv.gz 20140701.csv.gz 20140704.csv.gz 20140707.csv.gz 20140710.csv.gz 20140713.csv.gz
20140616.csv.gz 20140619.csv.gz 20140623.csv.gz 20140626.csv.gz 20140629.csv.gz 20140702.csv.gz 20140705.csv.gz 20140708.csv.gz 20140711.csv.gz 20140714.csv.gz
20140617.csv.gz 20140620.csv.gz 20140624.csv.gz 20140627.csv.gz 20140630.csv.gz 20140703.csv.gz 20140706.csv.gz 20140709.csv.gz 20140712.csv.gz 20140715.csv.gz

Basically we have the following structure:
data/[DATATYPE]/[DATAGROUP]/[YYYYMMDD].csv.gz, where:

  • DATATYPE should be either raw or enriched. The names indicate what to expect in the data structure of the CSVs inside (as described in the README). Disregard the population type; it should not be a target for this presentation.
  • DATAGROUP refers to the group name of the combine output (currently the "inbound"/"outbound" separation). They can be whatever you like; I am using public_inbound and public_outbound for the presentation data.
  • YYYYMMDD is the way dates should be represented in the whole world.

Please note the CSVs are gzipped. The code expects that as well.
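The layout above can be sketched in a few lines; `bale_path`/`write_day` are illustrative helper names, not combine's actual API.

```python
# Sketch of writing one day's output where tiq-test expects it:
# data/[DATATYPE]/[DATAGROUP]/[YYYYMMDD].csv.gz, gzipped as the issue notes.
import csv
import gzip
import os

def bale_path(root, datatype, datagroup, day):
    # datatype: "raw" or "enriched"; datagroup e.g. "public_inbound"
    return os.path.join(root, datatype, datagroup, day + ".csv.gz")

def write_day(path, rows):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)
```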

Validate domain names

.dns.isValidDomain <- function(domain) {  
  # Check domains
  retval = !is.na(domain)
  domainLengths = sapply(domain[retval], nchar, USE.NAMES=F) 
  retval[retval] = (domainLengths > 0) & (domainLengths <= 253)
  rm(domainLengths)

  if(length(retval[retval]) > 0) {
    retval[retval] = sapply(
      str_split(domain[retval], fixed(".")), 
      function(x) { 
        labelSizes = nchar(x)
        return(length(x) <= 127 && all(labelSizes > 0) && all(labelSizes <= 63)) 
      },
      USE.NAMES=F
    )
  }

  # If it is a valid IP address, it is not a domain
  retval[retval] = !isIPv4(domain[retval])
  # If it has a slash, it is not a domain
  retval[retval] = !grepl("/", domain[retval], fixed=T)

  return(retval)
}

Parallelize DNS lookups

Not everybody has DNSDB-type availability. Fail gracefully if they don't, preferably falling back to some other source.

Enrichment strategies

We should think about alternative strategies for enrichment (e.g. not just maxhits).

Handle dates

Some sources provide a "last observed" date that we should handle. Specifically, we should exclude observations last seen more than 24 hours ago.
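A minimal sketch of the 24-hour cutoff; the date format is assumed to match the `YYYY-MM-DD` values already seen in harvest.csv.

```python
# Sketch: drop observations whose "last observed" date is older than 24 hours.
# The YYYY-MM-DD format mirrors the dates in combine's CSV output.
from datetime import datetime, timedelta

def is_fresh(date_str, now=None, fmt="%Y-%m-%d"):
    now = now or datetime.utcnow()
    return now - datetime.strptime(date_str, fmt) <= timedelta(hours=24)
```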

CSV output requires quotes

Please make sure the CSV output has quotes (") around the strings and numbers (i.e., output everything as strings).

We had settled on a non-quoted format before, but that confuses the parsers in R when I have the AS name information in the enriched versions. I am exporting/importing all my data with quotes because of this.

Please adjust accordingly.
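In Python's `csv` module this is a one-flag change: `quoting=csv.QUOTE_ALL` wraps every field, including empty ones, in quotes.

```python
# csv.QUOTE_ALL quotes every field, which is the format R's readers expect here.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["entity", "type", "direction", "source", "notes", "date"])
writer.writerow(["8.8.8.8", "IPv4", "inbound", "feed", "", "2014-09-04"])
```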

Handling of "orphan" indicators

Today, indicators that for some reason do not match our "IPv4" or "FQDN" validation just stay there without a type. An example:

$ cat harvest.csv | grep -v FQDN | grep -v IPv4
"entity","type","direction","source","notes","date"
"2001:41d0:8:dcd4::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2002:5f18:8f82::5f18:8f82","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2002:c3d3:9a9f::c3d3:9a9f","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a00:1210:fffe:145::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a00:1210:fffe:72::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a01:238:20a:202:1000::25","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a01:540:2:bd5d:d849:1e69:7736:be41","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:140:3:a90f:3bd1:d8d9:3485","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:140:3:b86c:62e8:3e0e:a0fb","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:2380:0:501b:91a5:76ff:8fa8","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:2380:0:95db:5adb:685d:a0f0","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2001:41d0:1:c9b2::1","","inbound","http://www.blocklist.de/lists/bots.txt","","2014-09-04"
"2a01:430:17:1::ffff:376","","inbound","http://www.blocklist.de/lists/bots.txt","","2014-09-04"
"Export","","inbound","http://virbl.org/download/virbl.dnsbl.bit.nl.txt","","2014-09-04"
"ckaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","","outbound","http://www.nothink.org/blacklist/blacklist_malware_dns.txt","","2014-09-04"

We are not interested (for now) in IPv6, and the other entries look like parsing errors.

I believe we should filter out the indicators that do not match a specific type.
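The proposed filter is a simple whitelist on the type field; the row shape below is a sketch based on the harvest.csv columns.

```python
# Sketch: keep only rows whose type matched a known validator ("IPv4"/"FQDN");
# orphans (IPv6, parsing artifacts) are dropped.
KNOWN_TYPES = {"IPv4", "FQDN"}

def drop_orphans(rows):
    return [r for r in rows if r.get("type") in KNOWN_TYPES]

rows = [
    {"entity": "8.8.8.8", "type": "IPv4"},
    {"entity": "2001:41d0:8:dcd4::1", "type": ""},  # IPv6: no validator yet
    {"entity": "Export", "type": ""},               # parsing artifact
]
```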

Enrichment

For each IP address, get the ASN and hostnames (if enrichment is enabled).

@alexcpsec: How do we want to handle multiple names for an IP address?

DNSDB API usage question

Is Farsight providing an API key for this project? Are they aware that thousands of people may be hitting them on this key each day?

UnicodeDecodeError in winnower

Dumping results
Traceback (most recent call last):
  File "winnower.py", line 150, in <module>
    winnow('crop.json', 'crop.json', 'enriched.json')
  File "winnower.py", line 146, in winnow
    json.dump(enriched, f, indent=2)
  File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/usr/lib/python2.7/json/encoder.py", line 431, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "/usr/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 313, in _iterencode_list
    yield buf + _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 22: invalid continuation byte

IndexError

While working on repro for #49 got:

(venv)kmaxwell@newton:~/src/combine$ python thresher.py
Loading raw feed data from harvest.json
[...]
Parsing feed from http://www.autoshun.org/files/shunlist.csv
Traceback (most recent call last):
  File "thresher.py", line 189, in <module>
    thresh('harvest.json', 'crop.json')
  File "thresher.py", line 166, in thresh
    harvest += thresher_map[site](response[2], response[0], 'inbound')
  File "thresher.py", line 108, in process_autoshun
    date = line.split(',')[1].split()[0]
IndexError: list index out of range
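The failing line indexes `line.split(',')[1]` blindly; a guard that skips malformed rows would avoid the IndexError. The autoshun row format below is inferred from the traceback, not confirmed against the feed.

```python
# Sketch of guarding process_autoshun's failing line: return None for rows
# that lack the expected comma-separated date field instead of raising.
def parse_autoshun_line(line):
    parts = line.split(",")
    if len(parts) < 2 or not parts[1].split():
        return None  # malformed row: no date field to extract
    return parts[1].split()[0]
```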

Change urllib2 to requests (maybe?)

We need something more configurable so we can set a proper User-Agent or something like that.

There are probably other adjustments we may want to make (header info, etc.) to make it easier to download/scrape stuff.
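With requests this would be `requests.get(url, headers={"User-Agent": ...})`; the stdlib sketch below shows the same header injection without the extra dependency. The User-Agent string is a made-up example.

```python
# Sketch: set a proper User-Agent on outgoing requests. The UA string is an
# illustrative placeholder, not a value combine actually uses.
import urllib.request

HEADERS = {"User-Agent": "combine/0.1 (example)"}

def build_request(url):
    return urllib.request.Request(url, headers=HEADERS)

req = build_request("http://www.blocklist.de/lists/apache.txt")
```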

Thresher is not filtering IP addresses correctly

(venv)kmaxwell@newton:~/src/combine$ python winnow.py
Traceback (most recent call last):
  File "winnow.py", line 88, in <module>
    winnow('crop.json', 'winnowed.json')
  File "winnow.py", line 78, in winnow
    ipaddr = IPAddress(addr)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/netaddr/ip/__init__.py", line 307, in __init__
    'address from %r' % addr)
netaddr.core.AddrFormatError: failed to detect a valid IP address from u'199.222.35.192.in-addr.arpa.'
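Rather than letting `IPAddress()` raise on reverse-DNS names like the one above, a pre-check could reject non-addresses first. This sketch uses the stdlib `ipaddress` module as a stand-in for the netaddr check.

```python
# Sketch: validate candidate addresses before enrichment so reverse-DNS names
# like "199.222.35.192.in-addr.arpa." are filtered instead of raising.
import ipaddress

def is_ipv4(addr):
    try:
        return isinstance(ipaddress.ip_address(addr), ipaddress.IPv4Address)
    except ValueError:
        return False
```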

Select TI feeds we are going to use on the presentation

We should select the mix of public and semi-private feeds we are going to use on the presentation, and adapt the 'harvester' code as necessary to be able to gather them.

I don't believe that we need to have a full-fledged tool implementation for the initial milestone, but at least the minimum we require to prove the concept for the CFP.

Traceback on combine.py

python combine.py
Fetching inbound URLs
Fetching outbound URLs
Storing raw feeds in harvest.json
Loading raw feed data from harvest.json
Parsing feed from http://www.projecthoneypot.org/list_of_ips.php?rss=1
---snip---
Parsing feed from http://www.nothink.org/blacklist/blacklist_malware_irc.txt
Storing parsed data in crop.json
Reading processed data from crop.json
Output regular data as CSV to harvest.csv
Traceback (most recent call last):
File "combine.py", line 41, in
if args.tiq-test:
AttributeError: 'Namespace' object has no attribute 'tiq'
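The traceback is explained by argparse's naming rule: a `--tiq-test` flag lands on `args.tiq_test`, while `args.tiq-test` parses as the subtraction `args.tiq - test`, hence the complaint about attribute 'tiq'.

```python
# argparse converts dashes in long option names to underscores on the
# Namespace, so the attribute to read is args.tiq_test, not args.tiq-test.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--tiq-test", action="store_true")
args = parser.parse_args(["--tiq-test"])
```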

Test suite

We should have a proper test suite to detect regressions, support TDD, and help with refactoring.

SANS produces padded data

Which leads to things like:

Enriching 150.164.082.010
Traceback (most recent call last):
  File "combine.py", line 38, in <module>
    winnow('crop.json', 'crop.json', 'enrich.json')
  File "/home/kmaxwell/src/combine/winnower.py", line 122, in winnow
    ipaddr = IPAddress(addr)
  File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/netaddr/ip/__init__.py", line 307, in __init__
    'address from %r' % addr)
netaddr.core.AddrFormatError: failed to detect a valid IP address from u'150.164.082.010'
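Stripping the zero padding before handing addresses to `IPAddress()` would fix this; a minimal sketch (no octet range check, and non-dotted-quad strings pass through untouched):

```python
# Sketch: normalize SANS-style zero-padded dotted quads,
# e.g. 150.164.082.010 -> 150.164.82.10. Octet range checks are omitted.
def unpad_ipv4(addr):
    try:
        octets = [str(int(o, 10)) for o in addr.split(".")]
    except ValueError:
        return addr  # not a dotted quad of integers; leave untouched
    if len(octets) != 4:
        return addr
    return ".".join(octets)
```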

Additional sources to evaluate

We have some of these but need to evaluate the list for possible additional stuff.


http://1d4.us/archive/network-28-07-2014.txt
http://1d4.us/archive/network-29-07-2014.txt
http://1d4.us/archive/ssh-28-07-2014.txt.txt
http://1d4.us/archive/ssh-29-07-2014.txt.txt
http://1d4.us/archive/ssh-today.txt
http://1d4.us/archive/today.txt
http://atlas-public.ec2.arbor.net/public/ssh_attackers
http://bitcash.cz/misc/log/blacklist
http://charles.the-haleys.org/ssh_dico_attack_hdeny_format.php/hostsdeny.txt
http://cybercrime-tracker.net/all.php
http://danger.rulez.sk/projects/bruteforceblocker/blist.php
http://feodotracker.abuse.ch/blocklist.php?download=ipblocklist
http://jeroen.steeman.org/FS-PlainText
http://lists.blocklist.de/lists/all.txt
http://lists.clean-mx.com/pipermail/phishwatch/20140729.txt
http://lists.clean-mx.com/pipermail/phishwatch/20140730.txt
http://lists.clean-mx.com/pipermail/viruswatch/20140729.txt
http://lists.clean-mx.com/pipermail/viruswatch/20140730.txt
http://malc0de.com/bl/IP_Blacklist.txt
http://multiproxy.org/txt_all/proxy.txt
http://osint.bambenekconsulting.com/feeds/goz-iplist.txt
http://rules.emergingthreats.net/fwrules/emerging-PF-CC.rules
http://rules.emergingthreats.net/open/snort-2.9.0/rules/emerging-tor.rules
http://stefan.gofferje.net/sipblocklist.zone
http://torstatus.blutmagie.de/ip_list_all.php/Tor_ip_list_ALL.csv
http://torstatus.blutmagie.de/ip_list_exit.php/Tor_ip_list_EXIT.csv
http://un1c0rn.net/?module=hosts&action=list&page=1
...
http://un1c0rn.net/?module=hosts&action=list&page=200
http://vmx.yourcmc.ru/BAD_HOSTS.IP4
http://vxvault.siri-urz.net/URL_List.php
http://www.autoshun.org/files/shunlist.csv
http://www.ciarmy.com/list/ci-badguys.txt
http://www.cruzit.com/xwbl2txt.php
http://www.falconcrest.eu/IPBL.aspx
http://www.infiltrated.net/blacklisted
http://www.infiltrated.net/vabl.txt
http://www.infiltrated.net/voipabuse/netblocks.txt
http://www.infiltrated.net/webattackers.txt
http://www.malwaredomainlist.com/hostslist/ip.txt
http://www.michaelbrentecklund.com/whm-cpanel-cphulk-banlist-whm-cpanel-cphulk-blacklist/
http://www.nothink.org/blacklist/blacklist_malware_dns.txt
http://www.nothink.org/blacklist/blacklist_malware_http.txt
http://www.nothink.org/blacklist/blacklist_malware_irc.txt
http://www.nothink.org/blacklist/blacklist_ssh_day.txt
http://www.openbl.org/lists/base_1days.txt
http://www.spamhaus.org/drop/drop.txt
http://www.spamhaus.org/drop/edrop.txt
http://www.stopforumspam.com/downloads/listed_ip_1_all.zip
http://www.stopforumspam.com/downloads/toxic_ip_cidr.txt
http://www.voipbl.org/update/
https://blocklist.sigmaprojects.org/api.cfc?method=getList&lists=atma
https://blocklist.sigmaprojects.org/api.cfc?method=getList&lists=spyware
https://blocklist.sigmaprojects.org/api.cfc?method=getList&lists=webexploit
https://isc.sans.edu/api/sources/attacks/10000/2014-07-30
https://isc.sans.edu/api/topips/records/1000/2014-07-30
https://lists.malwarepatrol.net/cgi/getfile?receipt=f1377916320&product=8&list=smoothwall
https://palevotracker.abuse.ch/blocklists.php?download=ipblocklist
https://raw.githubusercontent.com/EmergingThreats/et-open-bad-ip-list/master/IPs.txt
https://reputation.alienvault.com/reputation.generic
https://security.berkeley.edu/aggressive_ips/ips
https://spyeyetracker.abuse.ch/blocklist.php?download=ipblocklist
https://www.dan.me.uk/torlist/
https://www.gpf-comics.com/dnsbl/export.php
https://www.maxmind.com/en/anonymous_proxies
https://zeustracker.abuse.ch/blocklist.php?download=ipblocklist
