
awesome-siem's Introduction

Curated solutions for SIEM, visibility, and event-driven architecture.

Architecture - the SecOps Data Engineering Framework

Diagram from my Security Operations Data Engineering Framework (SODEF) paper. These are the pieces you can look at when trying to build a cost-effective data engineering architecture and system for analytics and threat detection in your organization.

SIEM Research - A_Cost_Effective_SIEM_Framework___BenR___2024_IUPUI_MSCTS

Grad School Projects and Content - https://github.com/cybersader/grad-school-projects


SIEM Components, Features, Constraints

These are some things to consider when looking at SIEMs and security data analytics architectures.

  • Pricing models
    • Ingest-based for smaller implementations
    • Self-hosted
    • Compute-based models that estimate or measure your computation and infrastructure usage, with a premium on top to cover the technology and other costs (support, etc.) (example: Splunk SVCs)
  • Data models, compatibility, or normalization approaches and/or logic
  • Data transformation capabilities
    • Transforming logs
  • Query language/syntax and support for building analytics
    • Sigma rules?
  • Control over what gets warehoused or used with computation or infrastructure?
    • This is why data pipelines like Cribl should be used, so that there is control
  • Detection Engineering
    • Machine Learning and AI
      • Anomaly Detection Models
      • Types of functions
      • Correlation
    • Community Marketplace / Database for Queries/Detections
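The anomaly-detection idea above can be illustrated with a minimal statistical sketch. This is not any vendor's actual model; the threshold and sample data are illustrative assumptions:

```python
# Minimal sketch of a statistical anomaly detector of the kind SIEM
# "anomaly detection models" build on. Threshold is illustrative.
from statistics import mean, stdev

def zscore_anomalies(counts, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []  # no variation, nothing stands out
    return [v for v in counts if abs(v - mu) / sigma > threshold]

# e.g. hourly login counts with one burst (hypothetical data)
hourly_logins = [12, 15, 11, 14, 13, 12, 400, 15, 13]
print(zscore_anomalies(hourly_logins))  # [400]
```

Real SIEM models are far richer (seasonality, per-entity baselines, ML), but they rest on this same compare-to-baseline idea.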

Use a Data Pipeline to Save Resources/Money!!

Send all of your log data from your IT systems to a data pipelining system like any of the ones mentioned below.

The biggest reason to use data pipelines is that IT teams need to be log users rather than just log keepers. Pipelines let them align infrastructure costs with usage: they stop spending an exorbitant amount of money warehousing data that never needed warehousing (sidenote: warehousing or indexing data to make it easier to query and analyze costs A LOT). Without a central data transformation proxy like a data pipeline tool, teams will likely be playing into the hands of analytics/SIEM pricing models that benefit from customers who don't use all of their data, much as Costco banks on members who don't make good use of their memberships.
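As a rough illustration of the routing idea, here is a sketch of the decision a pipeline makes per event. This is not Cribl's (or any vendor's) actual configuration syntax; the field names, severities, and tier names are assumptions:

```python
# Illustrative sketch: route only high-value events to the expensive
# indexed SIEM tier, and everything else to cheap object storage.
# Field names ("severity", "sourcetype") and tiers are hypothetical.
def route_event(event: dict) -> str:
    """Return the destination tier for a log event."""
    high_value = (
        event.get("severity", "info") in {"high", "critical"}
        or event.get("sourcetype") in {"auth", "edr"}
    )
    return "siem_index" if high_value else "object_storage"

events = [
    {"sourcetype": "auth", "severity": "high"},
    {"sourcetype": "netflow", "severity": "info"},
]
print([route_event(e) for e in events])  # ['siem_index', 'object_storage']
```

Real pipeline tools express this as declarative routes and filters, but the economics come from exactly this fork: index what you analyze, archive the rest.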

Architecture Examples - WIP

Data Formats

  • Normalize all of your data to use...
    • CSV or DELIMITED format for "flat" data,
    • JSON for "nested" data,
    • and Parquet where optimization is valuable AND where not all of the columns are always being used
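The format-selection rule above can be sketched programmatically. A minimal example, assuming records arrive as Python dicts (the flat/nested check is the whole idea):

```python
# Sketch of the rule: CSV for "flat" records, JSON for "nested" ones.
import csv, io, json

def is_flat(record: dict) -> bool:
    """A record is flat if no value is itself a dict or list."""
    return all(not isinstance(v, (dict, list)) for v in record.values())

def serialize(records: list[dict]) -> tuple[str, str]:
    """Return (format_name, serialized_text) for a batch of records."""
    if records and all(is_flat(r) for r in records):
        buf = io.StringIO()
        # assumes uniform keys across records, fine for a sketch
        writer = csv.DictWriter(buf, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
        return "csv", buf.getvalue()
    return "json", json.dumps(records)

print(serialize([{"user": "alice", "count": 3}])[0])          # csv
print(serialize([{"src": {"ip": "10.0.0.5"}}])[0])            # json
```

Parquet (via a library like pyarrow) would slot in where column-pruned analytical reads dominate.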

SIEM Focus

Open Source

For at-home implementations

  • Graylog
  • Wazuh

Proprietary

Not Marketed to Security

Open Source

Advanced implementations for at-home use.

Proprietary Analytics Platforms, Lakehouses, etc

Curated Solutions

Research & Consulting

  • SIEM Matrices

Acronyms

  • SIEM - aggregating data and doing analysis
  • SOAR - logic to react to analysis (integrates with SIEM)
    • SOAR tools are a combination of threat intelligence platforms, Security Incident Response Platforms (SIRP) and Security Orchestration and Automation (SOA).
  • SIRP - security incident response platform
  • TIP - threat intel platform

Security-Focused Analytics Platforms (SIEM, SIRP)

search terms


Open-Source

Proprietary

Security-Focused Data Engineering

Security Data Lakes

  • Matano | Cloud native SIEM
  • Why Security Teams Are Adopting Security Data Lakes As Part Of A SIEM Strategy
  • The Average SIEM Deployment Costs $18M Annually…Clearly, Its time for a change! | by Dan Schoenbaum | Medium
    • Security-driven data can be dimensional, dynamic, and heterogeneous, so data warehouse solutions are less effective at delivering the agility and performance users need.
    • A data lake is considered a subset of a data warehouse, but in terms of flexibility it is a major evolution. The data lake is more flexible and supports unstructured and semi-structured data in its native format, and can include log files, feeds, tables, text files, system logs, and more. You can stream in all of your security data, none is turned away, and everything is retained. This can easily be made accessible to a security team at low cost - for example, about $0.03 per GB per month in an S3 bucket. This capability makes the data lake the next evolution of the SIEM.
    • The value of the process is to compare newly observed behavior with historical trends, sometimes comparing to datasets spanning 10 years. This would be cost-prohibitive in a traditional SIEM.
    • Interesting companies to power your security data lake:
      • If you are planning on deploying a security data lake or already have, here are three cutting edge companies you should know about. I am not an employee of any of these companies, but I am very familiar with them and believe that each will change our industry in a very meaningful way and can transform your own security data lake initiative.
        1. Panther: Snowflake is a wildly popular data platform primarily focused on mid-market to enterprise departmental use. It was not a SIEM and had no security capabilities. Along came engineers from AWS and Airbnb who created Panther, a platform for threat detection and investigations. The company recently connected Panther with Snowflake and can join data between the two platforms to make Snowflake a "next-generation SIEM" or, perhaps better positioned, evolve Snowflake into a high-performing, cost-effective security data lake. It is still a newer solution, but it's a promising idea and has been replacing Splunk implementations at companies like Dropbox and others at an impressive clip.
        2. Team Cymru is the most powerful security company you have yet to hear of. They have assembled a global network of sensors that "listen" to IP-based traffic on the internet as it passes through ISPs, and can therefore "see" and know more than anyone in a typical SOC. They have built the company by selling this data to large public security companies such as CrowdStrike, FireEye, Microsoft, and now Palo Alto Networks, with its acquisition of Expanse, snapped up for a cool $800M. In addition, cutting-edge SOC teams at JPMC and Walmart are embracing Cymru's telemetry data feed. Now that you can get access to this same data, you will want their 50+ data types and 10+ years of intelligence inside your data lake to help your team better identify adversaries and bad actors based on traits such as IP or other signatures.
        3. Varada.io: The entire value of a security data lake is easy, rapid, and unfettered access to vast amounts of information. It eliminates the need to move and duplicate data and offers the agility and flexibility users demand. As data lakes grow, queries become slower and require extensive data-ops to meet business requirements. Cloud storage may be cheap, but compute becomes expensive quickly because query engines are most often based on full scans. Varada solved this problem by indexing and virtualizing all critical data in any dimension. Data is kept closer to the SOC - on SSD volumes - in its granular form, so that data consumers have full flexibility to run any query whenever they need. The benefit is query response time up to 100x faster at a much lower cost by avoiding time-consuming full scans, enabling workloads such as searching for attack indicators, post-incident investigation, integrity monitoring, and threat hunting. Varada was so innovative that data vendor Starburst recently acquired them.
    • The security data lake, while not a simple "off the shelf" approach, centralizes all of your critical threat and event data in a large central repository with simple access. It can still leverage an existing SIEM, which may use correlation, machine learning algorithms, and even AI to detect fraud by evaluating patterns and triggering alerts. However configured, the security data lake is an exciting step you should be considering, along with the three innovative companies mentioned in this article.
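To see why the storage economics above matter, here is a back-of-the-envelope comparison using the article's ~$0.03/GB/month S3 figure. The daily volume and the SIEM ingest price are assumed, illustrative values, not any vendor's actual list price:

```python
# Back-of-the-envelope cost comparison. Only the S3 price comes from the
# article; the volume and SIEM ingest price are illustrative assumptions.
GB_PER_DAY = 500            # daily log volume (assumption)
S3_PER_GB_MONTH = 0.03      # ~$0.03/GB/month S3 figure from the article
SIEM_INGEST_PER_GB = 2.00   # assumed ingest-based SIEM price per GB

monthly_ingest_gb = GB_PER_DAY * 30
siem_cost = monthly_ingest_gb * SIEM_INGEST_PER_GB
# cost of keeping one month of logs in S3 (retention grows this over time)
lake_cost = monthly_ingest_gb * S3_PER_GB_MONTH
print(f"SIEM ingest: ${siem_cost:,.2f}/mo vs lake storage: ${lake_cost:,.2f}/mo")
```

Storage-side retention being orders of magnitude cheaper is what makes the "10 years of history" comparisons in the article feasible at all; compute for querying that history is the remaining cost to manage.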

Security Data Pipelines

Innovative, Unorthodox Visibility and Analysis (not marketed for security)

search terms

  • General Data Engineering, Storage, Analytics, Visualization?
    • multi-language, agnostic analytics engine
    • data engineering platform
    • distributed data analytics
    • data warehouse analytics
    • data lake analytics

Open-Source

  • Kibana - While commonly used with the ELK (Elasticsearch, Logstash, Kibana) stack for log analysis, Kibana can be adapted for non-security-related data visualization and exploration: https://www.elastic.co/kibana/
  • Pentaho - An open-source business analytics and data integration platform used for data visualization, reporting, and ETL (Extract, Transform, Load) tasks: https://www.hitachivantara.com/en-us/products/pentaho-platform/data-integration-analytics.html
  • Metabase - A business intelligence and data exploration tool that lets users create interactive dashboards and analyze data stored in various databases: https://www.metabase.com/
  • Jupyter - An open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text; widely used in data science and research: https://jupyter.org/

Cloud - Proprietary

Threat Intel?

Open Source

Cloud - Proprietary


SOAR - Security Orchestration, Automation, and Response

search terms

  • Automated Incident Response
  • Incident Handling
  • Security Automation
  • IFTTT

Open-Source

Cloud - Proprietary

If this, then that: security integrations and automations

ETL (Extract Transform Load), Data Transformation & Integration, Stream Processing, Moving Data, Data Quality, Data Pipelines, Observability

Data Engineering Landscape

Data Pipeline Architecture

Data Formats

My Approach

  • Normalize all of your data to use...
    • CSV or DELIMITED format for "flat" data,
    • JSON for "nested" data,
    • and Parquet where optimization is valuable AND where not all of the columns are always being used
  • From my research
    • Today, two data formats stand out more than any others and can cover most use cases of log and event data: the tabular or "flat" CSV (comma-separated values) format and the nested JSON (JavaScript Object Notation) format.
    • In terms of tabular formats (think Excel files), the CSV file has been around since the 1970s and was made official by RFC 4180. Tabular formats are incredibly useful for transporting relational and structured data. Where this structure is not possible, JSON is the next option.
    • Some logs or events follow neither format - for example, a delimited log using a mix of colons and brackets, which can be nested or flat.
    • Extensible Markup Language (XML) used to be the most popular option for nested data, but over the years JSON has been adopted as the simpler, more practical option (shown in Figure 1). JSON was first introduced in 2001 and grew quickly in popularity as big tech companies like Google and Facebook started using it. XML also carried additional security risks in implementation, so REST APIs and JSON became standard for development.
    • As a side note, formats like Parquet are good for optimizing certain use cases, such as when only one column of a tabular dataset is being read. Parquet performs better than CSV in those cases, so there are caveats to CSV for flat data.
    • Based on these facts, a good approach for threat detection is to use these two formats as much as possible, with some outliers where optimization is worth it. This makes data easier to transport and more likely to interface well with data engineering infrastructure and threat detection setups.
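A small sketch of the flat-vs-nested distinction discussed above: flattening a nested JSON event into a single CSV row. The dot-joined key convention is an arbitrary illustration, not a standard:

```python
# Flatten a nested JSON event into one CSV row, showing why JSON suits
# nested data while CSV requires flat structure. Dot-joined keys are an
# illustrative convention only.
import csv, io, json

def flatten(obj: dict, prefix: str = "") -> dict:
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))  # recurse into nested objects
        else:
            out[key] = v
    return out

event = json.loads('{"src": {"ip": "10.0.0.5", "port": 443}, "action": "allow"}')
row = flatten(event)  # {'src.ip': '10.0.0.5', 'src.port': 443, 'action': 'allow'}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

Flattening loses the ability to represent arrays and optional subtrees cleanly, which is exactly the gap JSON (and, for columnar reads, Parquet) fills.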

All

  • Apache Parquet - (Apache foundation / Data Format / Open Source / Free).
  • Apache ORC - (Apache foundation / Hortonworks / Facebook / Data Format / Open Source / Free).
  • Apache Avro - (Apache foundation / Data Format / Open Source / Free).
  • Apache Kudu - (Apache foundation / Cloudera / Data Format / Open Source / Free).
  • Apache Arrow - (Apache foundation / Data Format / Open Source / Free).
  • Delta - (Databricks / Data Format / Free or License fee).
  • JSON - (Data Format / Free).
  • CSV - (Data Format / Free).
  • TSV - (Data Format / Free).
  • HDF5 - (The HDF Group / Data Format / Open Source (licensed by HDF5) / Free).

Security Data Lakes

ETL & Data Transformation Tool Links Misc

More Gartner Magic Quadrants

  • Analytics & Business Intelligence
  • iPaas (Integration Platform)
  • Data Science & ML Platforms
    • 2023
    • 2021
    • 2020
  • Personalization Engines?

Related Awesome List Curations
