
Comments (6)

ethack avatar ethack commented on June 4, 2024

I've been running RITA with only a mocked conn log and haven't noticed anything break. I use this script to generate the data, though I haven't tried it with real-world data.
https://gist.github.com/ethack/182fa4c1e6099f23acc31cd90874f6b8

from ipfix-rita.

Zalgo2462 avatar Zalgo2462 commented on June 4, 2024

I think the only issue will be with the meta database. Since we don't know when we're finished inserting into a database, we create the MetaDB record when we make the first insert.

RITA could start analyzing a database while it is still being written to, which may produce oddities. (I'm not sure exactly which.)


Zalgo2462 avatar Zalgo2462 commented on June 4, 2024

The main issue is inserting data into the database while RITA is analyzing. If the database is already analyzed, then nothing should break.

The simple solution is to have RITA copy the input collections when starting analysis. However, a copy could be very expensive, so we need to benchmark it. I have a suspicion that an identity-based aggregation may be faster than .find().foreach(x => .insert(x)). Unfortunately, aggregations only work within the same database. That's not a deal breaker though. We could have conn-in, http-in, dns-in, and conn, http, dns collections all in the same database: one collection for input and the other copied from the input collection for analysis.

https://stackoverflow.com/questions/10624964/whats-the-fastest-way-to-copy-a-collection-within-the-same-database
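
As a rough sketch of the identity-aggregation idea, the copy can be written as a two-stage pipeline: an empty $match that passes every document through, followed by $out naming the destination collection in the same database. The helper name copyPipeline and the plain-map representation are assumptions for illustration only; a real implementation would hand an equivalent pipeline to whatever MongoDB driver RITA uses, and whether it actually beats a find/insert loop still needs benchmarking.

```go
package main

// copyPipeline builds an aggregation pipeline that copies every document
// of the collection it is run against into dst, within the same database.
// Hypothetical helper: the name and the map representation are illustrative.
func copyPipeline(dst string) []map[string]interface{} {
	return []map[string]interface{}{
		{"$match": map[string]interface{}{}}, // identity stage: matches every document
		{"$out": dst},                        // writes the result to dst, replacing it
	}
}
```

Running copyPipeline("conn") against conn-in would replace conn with a copy of the input collection as it stood when the aggregation ran.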


Zalgo2462 avatar Zalgo2462 commented on June 4, 2024

If someone runs rita analyze daily without specifying which database to analyze, RITA is guaranteed to attempt to analyze every loaded database at some point in the day.

If IPFIX-RITA inserts before the analyze command is run on the database, all is well.
If IPFIX-RITA inserts while the analyze command is being run on the database, inconsistent results are likely to appear.
If IPFIX-RITA inserts after the analyze command is run on the database, the data will not be included in the analysis results.

If we had a flag in the MetaDB that marked whether a database was ready for analysis, this problem would be somewhat solved. rita analyze would only pick up on databases ready to be analyzed, avoiding the analysis of loading databases. However, at some point IPFIX-RITA would need to mark a database ready for analysis. If we received the data in order, we could mark the database ready after the output stream timestamps cross midnight. Unfortunately, we do not receive the data in order. A threshold of so-long-after-midnight could probably be established, but more testing would need to be done to come up with a good value for that threshold.

The traditional Bro importer also suffers from this problem. If two instances of rita are run concurrently, one instance of rita could start an import, while the other instance runs analysis, leading to the same issues as described above. A ready to analyze flag would solve this case as well. In this case, the time to set the flag is clearly defined.


Zalgo2462 avatar Zalgo2462 commented on June 4, 2024

We can implement a ready to analyze flag by adding the field import_finished to RITA's MetaDatabase database records.

Current MetaDB Database schema:

type DBMetaInfo struct {
	ID             bson.ObjectId `bson:"_id,omitempty"`   // Ident
	Name           string        `bson:"name"`            // Top level name of the database
	Analyzed       bool          `bson:"analyzed"`        // Has this database been analyzed
	ImportVersion  string        `bson:"import_version"`  // Rita version at import
	AnalyzeVersion string        `bson:"analyze_version"` // Rita version at analyze
}

How to Alter the Import Process

  • Before a record is inserted into RITA, the appropriate MetaDatabase database record is created.
  • Records are inserted into the database referenced by the MetaDatabase database record
  • (new) When it is known that no more records will be inserted into the database referenced by the MetaDatabase record, the import_finished flag is set to true

How to Alter the Analyze Process

  • Loop over the databases registered in the MetaDatabase database collection
    • If the database is already analyzed, remove it from consideration
    • If the database is incompatible with the running version of rita, remove it from consideration
    • (new) If the import process is still altering the database (import_finished == false), remove it from consideration
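
A minimal sketch of the altered analyze loop, assuming the new field is stored as import_finished. The ID field is left out to keep the example free of driver imports, and the compatible callback is a hypothetical stand-in for RITA's real version check.

```go
package main

// DBMetaInfo mirrors the MetaDB record with the proposed flag added.
// The ID field is omitted here so the sketch needs no driver import.
type DBMetaInfo struct {
	Name           string `bson:"name"`
	Analyzed       bool   `bson:"analyzed"`
	ImportVersion  string `bson:"import_version"`
	AnalyzeVersion string `bson:"analyze_version"`
	ImportFinished bool   `bson:"import_finished"` // proposed new field
}

// analyzable returns the names of the databases the analyze step should
// pick up. compatible is a hypothetical stand-in for the version check.
func analyzable(dbs []DBMetaInfo, compatible func(importVersion string) bool) []string {
	var names []string
	for _, db := range dbs {
		if db.Analyzed {
			continue // already analyzed
		}
		if !compatible(db.ImportVersion) {
			continue // imported by an incompatible version
		}
		if !db.ImportFinished {
			continue // (new) the importer may still be writing
		}
		names = append(names, db.Name)
	}
	return names
}
```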


Zalgo2462 avatar Zalgo2462 commented on June 4, 2024

For IPFIX-RITA, it is difficult to know with certainty when incoming data corresponding with a database will stop.

Currently, IPFIX-RITA chooses which database to send a record to based on its closing timestamp. If the data does not arrive in order, which it usually does not, it is hard to determine ahead of time if any more records will be sent to a given database.

We make several assumptions to ease the decision making process:

  • The system is efficient. The input buffer will not grow without bound, which would otherwise create an ever-growing lag between wall time and the timestamps of the input data.
  • The data is loosely correlated in time.
    • While the data may arrive somewhat out of order, there exists a window large enough that, when the timestamps are averaged within that window, the averages increase monotonically with time.
    • The closing timestamps of the data arriving at the collector will match either the current day or previous day by wall time
      • NOTE: This assumption breaks current functionality when processing arbitrary YAF flows from pcap files
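
One way to test the "loosely correlated in time" assumption against captured data is sketched below. It reads "window" as consecutive, non-overlapping groups of w records and "monotonically" as strictly increasing; both readings, and the function name, are assumptions rather than anything IPFIX-RITA defines.

```go
package main

// windowMeansIncrease reports whether the mean timestamps of consecutive,
// non-overlapping windows of w records strictly increase.
// A trailing partial window is ignored.
func windowMeansIncrease(ts []int64, w int) bool {
	if w <= 0 {
		return false
	}
	havePrev := false
	var prev float64
	for start := 0; start+w <= len(ts); start += w {
		var sum int64
		for _, t := range ts[start : start+w] {
			sum += t
		}
		mean := float64(sum) / float64(w)
		if havePrev && mean <= prev {
			return false // this window did not advance past the previous one
		}
		prev, havePrev = mean, true
	}
	return true
}
```

For example, the arrival order 3, 1, 2, 6, 4, 5, 9, 7, 8 is out of order record by record, but its window means for w = 3 are 2, 5, 8, so the assumption holds at that window size.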

Using these assumptions we come up with the following:

We must insert records into databases according to the following plan:

  • Given a duration, d, a relative timestamp to act as a cutoff, rcutoff, a record's
    closing timestamp, tclose, and the current time tcurrent
    • Calculate the current period, pcurrent as floor(tcurrent / d)
    • Calculate the current relative timestamp, rcurrent as tcurrent % d
    • Calculate the period of the closing timestamp, pclose as floor( tclose / d)
    • If pclose == pcurrent:
      • Insert the record into the current period's database
    • Else If pclose == pcurrent - 1 AND rcurrent < rcutoff:
      • Insert the record into the previous period's database
    • Else
      • Drop the record
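
The plan above can be sketched as a single routing function. Timestamps are taken to be non-negative integers in one unit (for example Unix seconds) so that integer division matches floor; that representation and the names route and Destination are assumptions made for the example.

```go
package main

// Destination says where a record should be written.
type Destination int

const (
	Drop Destination = iota
	CurrentPeriod
	PreviousPeriod
)

// route applies the plan. All arguments share one unit (e.g. Unix seconds)
// and are assumed non-negative, so integer division equals floor.
func route(tclose, tcurrent, d, rcutoff int64) Destination {
	pcurrent := tcurrent / d // current period
	rcurrent := tcurrent % d // time elapsed within the current period
	pclose := tclose / d     // period of the record's closing timestamp
	switch {
	case pclose == pcurrent:
		return CurrentPeriod
	case pclose == pcurrent-1 && rcurrent < rcutoff:
		return PreviousPeriod
	default:
		return Drop
	}
}
```

With d = 86400 and rcutoff = 3600, a record that closed late in period 9 is still accepted 30 minutes into period 10, but dropped two hours in.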

If we follow this plan, then once tcurrent reaches d * pcurrent + rcutoff, no more records will be routed to the previous period's database, so we can set import_finished to true for period pcurrent - 1.

This plan allows a grace period for records from the previous period to make it into their period's database.
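
To make the timing concrete, here is a small sketch that answers "which periods can be marked finished right now?". It uses the same integer-timestamp assumption as the plan (non-negative values in one unit); the function name is hypothetical.

```go
package main

// finishedThrough returns the newest period whose database can have
// import_finished set to true at time tcurrent. Every older period is
// finished as well.
func finishedThrough(tcurrent, d, rcutoff int64) int64 {
	pcurrent := tcurrent / d
	if tcurrent%d >= rcutoff {
		// The grace period has elapsed: the previous period is closed.
		return pcurrent - 1
	}
	// Still inside the grace period: the previous period may get more records.
	return pcurrent - 2
}
```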

If we set d to 24 hours, the algorithm reads a bit more simply:

  • If the date of the closing timestamp of the record is the current day,
    • Insert the record into today's database
  • Else if the date of the closing timestamp of the record is yesterday, and today's grace period has not elapsed,
    • Insert the record into yesterday's database
  • Else
    • Drop the record

Then, for each day, we can set the import_finished flag to true for yesterday's database after the grace period has elapsed.
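
For the daily case, a sketch using Go's time package might look like the following. Day boundaries are taken in UTC and the one-hour grace period is a placeholder value, since the thread notes the right threshold still needs testing; targetDay, dayStart, exampleGrace, and mustParse are all hypothetical names.

```go
package main

import "time"

// exampleGrace is a placeholder; a good value still has to be measured.
const exampleGrace = time.Hour

// dayStart truncates t to midnight. Using UTC day boundaries is an assumption.
func dayStart(t time.Time) time.Time {
	y, m, d := t.UTC().Date()
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

// targetDay returns the day whose database should receive a record with the
// given closing time, or ok == false if the record should be dropped.
func targetDay(closeTime, now time.Time, grace time.Duration) (day time.Time, ok bool) {
	today := dayStart(now)
	yesterday := today.AddDate(0, 0, -1)
	closeDay := dayStart(closeTime)
	switch {
	case closeDay.Equal(today):
		return today, true
	case closeDay.Equal(yesterday) && now.Sub(today) < grace:
		return yesterday, true
	default:
		return time.Time{}, false
	}
}

// mustParse is a small helper for trying the sketch with RFC 3339 strings.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}
```

For instance, targetDay(mustParse("2018-06-04T23:59:00Z"), mustParse("2018-06-05T00:30:00Z"), exampleGrace) selects the 2018-06-04 database, while the same record arriving at 02:00 would be dropped.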

