
dodo's Introduction

Hi there, I'm Sebastian Schmidl 👋

I'm a software engineer and PhD student at the Information Systems chair (💻) of the Hasso Plattner Institute for Digital Engineering (HPI). Currently, I'm working in the distributed computing research group, where we investigate computationally complex problems and how they can be solved in distributed environments.

🔭 Research Interests

  • Scalable and reactive systems, especially using actor programming concepts
  • Time series anomaly detection
  • Distributed computing
  • Data profiling

💻 Current open-source activity

  • Maintainer @ TimeEval - An Evaluation Tool for Anomaly Detection Algorithms on Time Series
  • Core developer @ aeon - A toolkit for conducting machine learning tasks with time series data

📫 How to reach me


dodo's People

Contributors

codelionx, julkw


dodo's Issues

Bad code style and performance hit with `return`

Desired Solution

We should change all algorithm-related code to not use return for early exits, but an if-else expression instead (see the sketch below).
In Scala, a return inside a closure (a non-local return) is compiled to a thrown and caught exception, which has performance implications in hot code. Additionally, the last expression of a method is automatically its result, so the return keyword is redundant.
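For illustration, a minimal before/after sketch of the requested change (isSorted and its logic are made up for this example, not actual dodo code):

// before: early exit via the `return` keyword
def isSortedWithReturn(values: Seq[Long]): Boolean = {
  if (values.isEmpty) return true
  values.zip(values.tail).forall { case (a, b) => a <= b }
}

// after: a single if-else expression; the last expression is the method's
// result, so no `return` keyword is needed
def isSorted(values: Seq[Long]): Boolean =
  if (values.isEmpty) true
  else values.zip(values.tail).forall { case (a, b) => a <= b }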

Additional context

Link to codacy issue: https://app.codacy.com/app/dodo/dODo/file/35053121925/issues/source?bid=12812539&fileBranchId=12812539#l66

Handle null values correctly

Problem

We replace null values in columns of a specific data type (i.e. any type other than NullType) with a default value, e.g.:

// for LongType
import scala.util.Try

def parse(value: String): Long =
  Try {
    value.toLong
  }.getOrElse(0L) // on error or `null`, we set the cell's value to `0L`

This eliminates all nulls from the column, which means we cannot follow standard SQL semantics: NULL equals NULL, and NULLS FIRST for sorting.

Possible Solutions

  1. Use null as the default value (on parse error) and check that the whole algorithm can deal with null values.
    • 👍 default sorting usable and follows SQL semantics
    • 👎 high likelihood of NPEs
  2. Use Option[T] for cell values
    • 👍 default sorting usable and follows SQL semantics
    • 👍 no NPEs, explicit null-handling enforced by the compiler
    • 👎 uglier / more complex code
    • 👎 possible performance degradation and increased memory usage
  3. Use a separate data structure nullMask: BitSet (BitSet doc) to mask null values in the original array (see the sketch after this list)
    • 👍 null handling stays encapsulated inside TypedColumn
    • 👎 every access to a cell must first go through the mask
    • 👎 we cannot use Scala's sorting algorithms out of the box
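A rough sketch of option 3, assuming a simplified Long-valued column; NullMaskedColumn and its members are illustrative names, not the project's actual TypedColumn API:

import scala.collection.immutable.BitSet

final case class NullMaskedColumn(values: Array[Long], nullMask: BitSet) {

  // every cell access goes through the mask first
  def apply(i: Int): Option[Long] =
    if (nullMask(i)) None else Some(values(i))

  // sorting must be implemented manually (NULLS FIRST, then by value),
  // because Scala's built-in sorting knows nothing about the mask
  def sortedIndices: IndexedSeq[Int] =
    values.indices.sortBy(i => (!nullMask(i), if (nullMask(i)) 0L else values(i)))
}

// usage: cells 1 and 3 are null and sort to the front
val col = NullMaskedColumn(Array(5L, 0L, 2L, 0L), BitSet(1, 3))
col.sortedIndices // Vector(1, 3, 2, 0)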

Handle delayed/lost `AckWorkReceived`

Potential Problem

When sending work to another node, the sending node waits for an acknowledgement of arrival. If that acknowledgement does not arrive within 5 seconds, the sender assumes the work was not received and re-adds it to its own queue.
If the work was in fact received but the acknowledgement message (AckWorkReceived) got lost or delayed, this leads to that work (and all work generated from these candidates) being done twice.

Potential Solution

The work sender could resend the work instead of re-adding it to its own queue. If the missing acknowledgement was caused by an overloaded system, though, this might worsen the problem. (The receiving node would also have to check received work for duplicates before adding it to its own queue; a sketch of that check follows below.)
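A minimal sketch of the receiver-side deduplication, assuming each work item carries an id; Work, its fields, and the actor are illustrative, only the message name AckWorkReceived is taken from the description above:

import akka.actor.{Actor, ActorLogging}

final case class Work(workId: Long, candidates: Seq[Int])
final case class AckWorkReceived(workId: Long)

class WorkReceiver extends Actor with ActorLogging {
  private var receivedIds = Set.empty[Long]    // ids of all work items seen so far
  private var queue       = Vector.empty[Work] // local work queue

  override def receive: Receive = {
    case work: Work if receivedIds.contains(work.workId) =>
      // duplicate caused by a resend: acknowledge again, but do not enqueue twice
      log.debug("ignoring duplicate work item {}", work.workId)
      sender() ! AckWorkReceived(work.workId)

    case work: Work =>
      receivedIds += work.workId
      queue :+= work
      sender() ! AckWorkReceived(work.workId)
  }
}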

Add Downing Protocol

Desired Solution

A node should only shut down when all ODs etc. have been found, i.e. when no node has anything pending anymore. The protocol to determine that would start when a node discovers during workStealing that the other nodes' work queues are empty.
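For illustration, the stricter shutdown condition could be expressed as a pure check over the known node states (NodeState and its fields are hypothetical names, not actual dodo code):

final case class NodeState(queueSize: Int, pendingCount: Int) {
  def isIdle: Boolean = queueSize == 0 && pendingCount == 0
}

// a node may only initiate shutdown if it is idle itself AND every other
// node has reported an empty work queue and no pending work
def mayShutDown(self: NodeState, others: Seq[NodeState]): Boolean =
  self.isIdle && others.forall(_.isIdle)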

Alternatives?

Right now, nodes shut down when they have nothing pending themselves and all the other nodes' work queues are empty. However, this might lead to premature shutdowns.

Use column names (header from CSV file)

Desired Solution

Currently, we expect CSV files to have no header (any header row is ignored during parsing).

We should have a setting that allows CSV files to have headers and extracts the column names from them. If this setting is not set, we should assign upper-case letters to the columns (like Excel does: A, B, ..., Z, AA, AB, ..., AZ, ...). The mapping from column indices to column names should be stored somewhere. The ResultCollector actors can use the mapping to map the column indices back to names, so we can output proper column names in the ODs (e.g. A, D ↦ B).
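A possible helper for the Excel-style fallback names; the function name and placement are just a suggestion:

// maps 0 -> "A", 25 -> "Z", 26 -> "AA", 27 -> "AB", 51 -> "AZ", 52 -> "BA", ...
def excelColumnName(index: Int): String = {
  @scala.annotation.tailrec
  def loop(i: Int, acc: String): String =
    if (i < 0) acc
    else loop(i / 26 - 1, ('A' + i % 26).toChar + acc)

  loop(index, "")
}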

Alternatives?

  • Just output the indices in the ODs (e.g. 0, 3 ↦ 1) and ignore headers if the setting is set.
  • The additional setting specifying whether a CSV file has a header is still needed. Otherwise, the header row would be parsed as data and set to default values.

Split detection does not work for `checkODCandidate`

Describe the bug

The checkOrderDependent() method does not return false for splits, as described in the pending test "The DependencyChecking should identify a split":

"identify a split" in pendingUntilFixed{
val dataset: Array[TypedColumn[_ <: Any]] = Array(
TypedColumnBuilder.from("A", "A", "A"),
TypedColumnBuilder.from(0L, 1L, 4L)
)
// the stable sort of the first column list makes this dependency true...
DependencyCheckingTester.checkOrderDependent(Seq(0) -> Seq(1), dataset) shouldEqual false
DependencyCheckingTester.checkOrderDependent(Seq(0) -> Seq(0, 1), dataset) shouldEqual false
}

To Reproduce

Steps to reproduce the behavior:

  1. Run the mentioned test

Expected behavior

The expected behavior is not yet clear. This could be the desired behavior, or, viewed formally, a split should prevent an OD from being found.

Accessing Array methods throws ClassCastExceptions

Describe the bug

Calling col.array.distinct instead of col.distinct on TypedColumns throws the following ClassCastException:

[00:07:38.735 ERROR]                    akka://dodo-system/user/systemcoordinator| [D cannot be cast to [Ljava.lang.Object;
java.lang.ClassCastException: [D cannot be cast to [Ljava.lang.Object;
	at com.github.codelionx.dodo.Preprocessor$.$anonfun$constantColumnIndices$1(Preprocessor.scala:12)
	at scala.runtime.java8.JFunction1$mcZI$sp.apply(JFunction1$mcZI$sp.java:23)
	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:251)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at com.github.codelionx.dodo.Preprocessor$.constantColumnIndices(Preprocessor.scala:9)
	at com.github.codelionx.dodo.actors.SystemCoordinator$$anonfun$receive$1.applyOrElse(SystemCoordinator.scala:57)
	at akka.actor.Actor.aroundReceive(Actor.scala:539)
	at akka.actor.Actor.aroundReceive$(Actor.scala:537)
	at com.github.codelionx.dodo.actors.SystemCoordinator.aroundReceive(SystemCoordinator.scala:18)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:610)
	at akka.actor.ActorCell.invoke(ActorCell.scala:579)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
	at akka.dispatch.Mailbox.run(Mailbox.scala:229)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

The rich TypedColumn API introduced in #7 is not affected by this bug; it only occurs when accessing .array directly.

To Reproduce

Run the following snippet with the parsed data from the iris.csv data file:

val data: Array[TypedColumn[Any]] = _ // placeholder: the columns parsed from iris.csv
val constColIndices = data.indices.filter { i =>
  val col = data(i)
  println(s"working on ${col.dataType}")
  col.dataType == NullType || col.array.distinct.length <= 1
}

Expected behavior

col.array.distinct should behave the same as col.distinct and return the distinct values in the column.
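The error message suggests that the underlying storage is a primitive array (double[], printed as [D by the JVM), which cannot be cast to Array[AnyRef] ([Ljava.lang.Object;). As an assumption about the cause (not taken from the dodo code), this standalone snippet reproduces the same exception:

// a primitive Array[Double] is backed by double[] on the JVM and cannot be
// cast to an object array; the cast below throws
// java.lang.ClassCastException: [D cannot be cast to [Ljava.lang.Object;
val doubles: Array[Double] = Array(1.0, 2.0, 3.0)
val objects = doubles.asInstanceOf[Array[AnyRef]]

This would be consistent with the rich TypedColumn API (see above) not being affected, since it does not expose the raw array.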

Environment (please complete the following information):

  • OS: Kubuntu 18.10
  • Java:
    openjdk version "1.8.0_191"
    OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.10.1-b12)
    OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
    
  • Scala v2.12.8, SBT v1.2.8, Akka v2.5.22
  • Branch or Commit-Hash: feature/richTypedColumn (066fab1)

Share `reducedCollumns` between nodes

Describe the bug

Currently, the reducedCollumns calculated after pruning are not shared across nodes. This leads to problems during candidateGeneration on nodes that did not prune themselves.

To Reproduce

Start more than one node.

Expected behavior

The nodes need to share the reducedColumns somehow, or all nodes need to prune after receiving the data.
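For illustration, one way to share the pruning result, assuming the pruning node knows the other nodes' coordinating actors; the message and names below are hypothetical, not actual dodo code:

import akka.actor.ActorRef

final case class ReducedColumns(columnIndices: Set[Int])

// broadcast the locally computed pruning result to all other nodes
def broadcastReducedColumns(columnIndices: Set[Int], otherNodes: Seq[ActorRef]): Unit =
  otherNodes.foreach(_ ! ReducedColumns(columnIndices))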
