mitdbg / aurum-datadiscovery
License: MIT License
Find a (possibly) approximate way of computing percentiles in a streaming fashion.
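One possible direction, only as a hedged sketch: keep a bounded reservoir sample of the stream and read percentiles off the sorted sample. The class name and capacity below are placeholders; a production choice might instead be a dedicated quantile sketch (t-digest, GK), which the issue leaves open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class StreamingPercentileSketch {
    // Approximate streaming percentiles via reservoir sampling: keep a
    // fixed-size uniform sample of the stream and sort it on demand.
    private final List<Double> reservoir = new ArrayList<>();
    private final int capacity;
    private final Random rnd = new Random();
    private long seen = 0;

    public StreamingPercentileSketch(int capacity) { this.capacity = capacity; }

    public void offer(double value) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(value);
        } else {
            long j = (long) (rnd.nextDouble() * seen);  // uniform in [0, seen)
            if (j < capacity) reservoir.set((int) j, value);
        }
    }

    public double percentile(double p) {                // p in [0, 100]
        if (reservoir.isEmpty()) throw new IllegalStateException("no data");
        List<Double> sorted = new ArrayList<>(reservoir);
        Collections.sort(sorted);
        int idx = (int) Math.round(p / 100.0 * (sorted.size() - 1));
        return sorted.get(idx);
    }
}
```

Accuracy depends directly on the reservoir size, which keeps memory bounded regardless of stream length.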
De-noising data in general will help overall performance by:
This requires counting the errors that occur for a given column and abandoning the processing of that column when they exceed a given threshold (a sketch follows the example below):
For example, in data.gov I observe multiple messages like:
WARN preanalysis.PreAnalyzer - Error while parsing: For input string: "523986004252398600465239860072"
In any case, this requires a more in-depth study of which other errors are causing trouble.
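A minimal sketch of the counting mechanism described above, assuming a simple per-column counter keyed by column name; the threshold value would presumably come from ProfilerConfig rather than being hardcoded as here.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnErrorBudget {
    // Count parse errors per column and tell the caller when a column has
    // exceeded its error budget, so its processing can be abandoned.
    private final Map<String, Integer> errors = new HashMap<>();
    private final int threshold;

    public ColumnErrorBudget(int threshold) { this.threshold = threshold; }

    /** Record one error; returns true if the column should be abandoned. */
    public boolean recordError(String columnName) {
        int count = errors.merge(columnName, 1, Integer::sum);
        return count >= threshold;
    }

    /** Final per-column error counts, for reporting when the job is done. */
    public Map<String, Integer> report() { return errors; }
}
```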
The analyzer can be configured with stemming and word removal. We must understand all the options and tune them for the best accuracy possible.
Our entity analyzer task consists of understanding the entities contained in a set of values. As these values are all supposed to "mean" the same thing, we should stop analyzing entities (which is costly) as soon as we are confident we have discovered the underlying entities.
When reading from a DB, do we need to pool connections to make better use of resources? Are we otherwise limited by the database's throttling mechanism? Explore these questions.
Quantiles and cardinalities depend on approximations at the moment. We need some infrastructure to measure the accuracy of these methods, compare them, and decide how to tune them for a better accuracy/performance trade-off.
This leads to lots of problems when retrieving an id from the store (Java-generated) and trying to use it to look up the graph (Python-generated). Unify them with CRC32 or similar.
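A small sketch of the CRC32 option, assuming the id is derived from a (source, field) pair; the helper name and the separator are placeholders, not the repo's actual scheme.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class NodeId {
    // Derive a stable 32-bit id from (source, field) so that the Java
    // profiler and the Python graph code agree on the same value.
    public static long idFor(String source, String field) {
        CRC32 crc = new CRC32();
        crc.update((source + "." + field).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();   // unsigned 32-bit value carried in a long
    }

    public static void main(String[] args) {
        System.out.println(idFor("mydb.csv", "last_name"));
    }
}
```

Java's CRC32 and Python's zlib.crc32 implement the same CRC-32 polynomial, so the Python side can reproduce the same id with `zlib.crc32(b"mydb.csv.last_name") & 0xffffffff`.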
Approximate strategies are ok
Needed eventually
Schemas have names like:
last_name
us_phone_number
Find a better tokenizer to cover these cases. This will boost accuracy a lot.
These new tokenizers would be part of Elasticsearch.
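In Elasticsearch itself this would typically be expressed as a custom analyzer (for example a pattern tokenizer or a word_delimiter token filter), not Java code; the following is only a minimal sketch, with hypothetical names, of the token boundaries we want for schema names like the ones above.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaNameTokenizer {
    // Split snake_case / camelCase schema names into word tokens before
    // indexing, so that e.g. "us_phone_number" matches a search for "phone".
    public static List<String> tokenize(String name) {
        String spaced = name
                .replaceAll("([a-z])([A-Z])", "$1 $2")   // camelCase boundary
                .replaceAll("[_\\-.]+", " ")              // snake/kebab/dots
                .toLowerCase();
        return Arrays.asList(spaced.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("us_phone_number")); // [us, phone, number]
        System.out.println(tokenize("lastName"));        // [last, name]
    }
}
```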
Right now this is hardcoded, which is horrible.
Parse input options, based on the options defined in ProfilerConfig, so that users can start the profiler with a set of parameters. Add a unit test.
Explain the answer to schema_complement queries
edges() on a networkx graph fails if the nodes contain a cardinality attribute (not confirmed that this is the reason). If so, figure out how to attach that attribute information properly so that edges() does not break.
Noticed today while running ddprofiler against my local installation of elasticsearch 2.3.5. The ddprofiler server started throwing the following exception:
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.liveness.LivenessResponse]]
This only happens after commit 0ee4a9a - Merge algebra into master. I'm guessing it is because of the transport changes in the store client?
Workaround: I downgraded to elasticsearch v2.3.0 as per requirements.txt, and then it works fine.
I have created a "SanitizationTest" that contains a few class attributes with real data. In some cases those values have the form:
160,124.05 (note the first comma)
When this is part of a field in a CSV file, the protocol says that the value should be escaped in quotes. For example, if we have fields A,B,C, with B of the form shown previously, a file could look like:
"Boston", "160,124.05", 89
The current implementation does not parse the second field properly, which introduces errors that propagate through the entire prototype.
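A minimal sketch of the quote-aware splitting the parser needs, assuming standard CSV quoting rules (commas inside double-quoted fields are literal, doubled quotes are escapes); class and method names are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvLine {
    // Split one CSV line so that commas inside double-quoted fields
    // (e.g. "160,124.05") do not act as separators.
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); i++;        // escaped quote inside a field
                } else {
                    inQuotes = !inQuotes;        // toggle quoted state
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString().trim()); cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Boston\", \"160,124.05\", 89"));
        // -> [Boston, 160,124.05, 89]
    }
}
```

A mature CSV library (e.g., Apache Commons CSV or OpenCSV) already implements this quoting behavior and would avoid maintaining a hand-rolled parser.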
So that there's no need for the parent directory to contain the files
These two features are commingled.
We cannot fully clean the data before profiling it due to cost-value arguments. However, we should make a best-effort attempt to remove noise (null values, "X" values that mean NO, evident outliers, etc.) to raise the quality of the profiled information.
Provide some option to enumerate this for users to configure the system
For example:
rec.getTuples().add(v1)
v1, however, is not a tuple, but one value of one column.
Identify the other cases and fix them too.
So that we understand what errors occur:
and we can report them when the job is done.
Right now we have (source, field).
As we merge sources from different databases and repositories, it becomes useful to identify these too. So for example:
((database, source) , field)
Right now data is split per table to avoid redundant reads.
By splitting data in memory we could provide more fine-grained parallelism, i.e., per column, while still avoiding redundant reads.
The Connector class now defines an additional function that was not there initially:
"public abstract Map<Attribute, List> readRows(int num) throws IOException, SQLException;"
This abstract function is implemented for FileConnector, but not for DBConnector (empty method in line 208 of DBConnector).
This issue consists of implementing that method and testing it against a database.
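A hedged sketch of what a JDBC-backed readRows(num) could look like, with plain String column names standing in for the repo's Attribute class and an already-open ResultSet assumed; the real DBConnector would keep the statement/result set as state.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DbReadRowsSketch {
    // Read up to `num` rows from the result set into a column-oriented map
    // (column name -> list of string values), mirroring the Connector API.
    public static Map<String, List<String>> readRows(ResultSet rs, int num)
            throws SQLException {
        ResultSetMetaData md = rs.getMetaData();
        int cols = md.getColumnCount();
        Map<String, List<String>> batch = new LinkedHashMap<>();
        for (int c = 1; c <= cols; c++) {
            batch.put(md.getColumnName(c), new ArrayList<>());
        }
        int read = 0;
        while (read < num && rs.next()) {
            for (int c = 1; c <= cols; c++) {
                batch.get(md.getColumnName(c)).add(rs.getString(c));
            }
            read++;
        }
        return read == 0 ? null : batch;   // null signals "no more rows"
    }
}
```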
The profiler can run in two modes. In online mode it starts a server (embedded jetty) that receives requests in the form of a path and the name of a source.
In the offline mode, the user must configure a path in the command line, then Main reads the files in the path and creates tasks that are submitted to Conductor.
This issue consists of:
1- Make sure a user can configure a path to (CSV) files through the command line, and that these are added to ProfilerConfig correctly.
2- When reading files, get their name and path so that the WorkerTasks can be created properly. There are TODOs indicating where the missing info is (lines 85, 86, 87 in Main.java). A small sketch follows.
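For point 2, a minimal sketch of recovering the (parent path, file name) pair for each CSV file under the configured directory; the class name is a placeholder and the real code would feed these values into the WorkerTask instead of printing them.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvLister {
    // Walk the configured directory and, for each CSV file, recover the
    // pieces a WorkerTask needs: the parent path and the file name.
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.csv")) {
            for (Path file : stream) {
                String path = file.getParent().toString();
                String name = file.getFileName().toString();
                System.out.println(path + " / " + name);
            }
        }
    }
}
```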
So that it is easier for users to understand the results of a query.
EDIT: [Profiling is about 1 order of magnitude faster than indexing. Decouple both processes.]
Decouple the profiling process into the smallest pieces possible. For example, one should be able to index only schemas and no data, or data alone, etc. Then find a way of combining them back again into the original form. The challenge is to maintain all performance guarantees.
There are 3 store implementations right now that the profiler can choose from. Let's create a new property so that we can choose which one to use, and give that property to StoreFactory to do the rest.
For OpenNLP
At the moment there is an enum for all types of data sources and another one for db types. Make this consistent and simplify it.
For similarity functions, mainly.
It is straightforward to compute this in a streaming fashion. Add the attribute to the Analyzer class, compute it and return it in the object NumericalRange.
Search in content, schema name, table name, db name.
Then indicate the context of the results.
schema_name_search("molecule")
won't return a field named "molecular"
The CardinalityAnalyzer uses HyperLogLogPlus to estimate the cardinality. However, the results are error prone. For example, I used the module to derive the cardinality of 500 records.
The results show 531 unique elements while the total number of records is 500.
Probably we should use a better estimator.
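A small accuracy harness, assuming the stream-lib HyperLogLogPlus implementation: feed the same values to the estimator and to an exact set, then report the relative error. This is also a starting point for the measurement infrastructure mentioned earlier; the precision parameter 14 is just an example to vary.

```java
import java.util.HashSet;
import java.util.Set;

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class CardinalityAccuracyCheck {
    public static void main(String[] args) {
        HyperLogLogPlus hll = new HyperLogLogPlus(14);   // precision p = 14
        Set<String> exact = new HashSet<>();
        for (int i = 0; i < 500; i++) {
            String v = "record-" + i;
            hll.offer(v);       // approximate estimator
            exact.add(v);       // ground truth
        }
        long est = hll.cardinality();
        double err = Math.abs(est - exact.size()) / (double) exact.size();
        System.out.printf("exact=%d estimated=%d relative error=%.2f%%%n",
                exact.size(), est, err * 100);
    }
}
```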
The method EntityAnalyzer.loadModel() should instead return a list of TokenNameFinderModel, each corresponding to a different entity model. These models should then be applied in the feed() method, so that all existing entities are detected.
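A hedged sketch of that shape using the OpenNLP classes named above; the model file paths are placeholders, not the repo's actual configuration, and error handling is minimal.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class MultiEntitySketch {
    // loadModel(): one TokenNameFinderModel per entity type.
    static List<TokenNameFinderModel> loadModels(List<String> modelPaths) throws IOException {
        List<TokenNameFinderModel> models = new ArrayList<>();
        for (String p : modelPaths) {
            try (InputStream in = new FileInputStream(p)) {
                models.add(new TokenNameFinderModel(in));
            }
        }
        return models;
    }

    // feed(): apply every model to the batch of tokens and collect the
    // entity types found, so all known entities are detected.
    static List<String> findEntities(List<TokenNameFinderModel> models, String[] tokens) {
        List<String> found = new ArrayList<>();
        for (TokenNameFinderModel m : models) {
            NameFinderME finder = new NameFinderME(m);
            for (Span s : finder.find(tokens)) {
                found.add(s.getType());
            }
        }
        return found;
    }
}
```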
Replace the existing try/catch model with a regex-based one that would provide the data type with higher accuracy and more efficiently.
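A minimal sketch of the regex-based detection, assuming only integer/float/text need to be distinguished; the patterns and type labels are illustrative, not the profiler's actual type system.

```java
import java.util.regex.Pattern;

public class TypeSniffer {
    // Classify a raw value without throwing/catching NumberFormatException.
    private static final Pattern INT =
            Pattern.compile("[+-]?\\d+");
    private static final Pattern FLOAT =
            Pattern.compile("[+-]?(\\d+\\.\\d*|\\.\\d+|\\d+)([eE][+-]?\\d+)?");

    public static String typeOf(String value) {
        String v = value.trim();
        if (INT.matcher(v).matches()) return "INT";
        if (FLOAT.matcher(v).matches()) return "FLOAT";
        return "TEXT";
    }

    public static void main(String[] args) {
        System.out.println(typeOf("42"));        // INT
        System.out.println(typeOf("160124.05")); // FLOAT
        System.out.println(typeOf("Boston"));    // TEXT
    }
}
```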
There is a comment saying that the usage of getTableAttributes is wrong. I left a FIXME asking what is wrong. Please clarify so that we can see how to fix this.
Far too much object generation without a good reason when reading records. Streamline the implementation for CSV files. Write a perf test to track this in the future.
This is a desirable property: total values, unique values, matching values, and overlap among columns.
For quick lookup of values and access to statistics
When profiling a column fails, this should not break the profiler; it should only log the error and its cause to some file, skip the column, and keep working.
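A sketch of the intended behavior, using java.util.logging only as a stand-in for whatever logger the profiler actually uses; the ColumnProfiler interface is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ColumnProfilingLoop {
    private static final Logger LOG = Logger.getLogger("profiler.errors");

    interface ColumnProfiler {
        void profile(String columnName) throws Exception;
    }

    // A failure while profiling one column is logged with its cause and the
    // column is skipped, instead of letting the exception kill the whole run.
    public static void profileAll(Iterable<String> columns, ColumnProfiler profiler) {
        for (String column : columns) {
            try {
                profiler.profile(column);
            } catch (Exception e) {
                LOG.log(Level.WARNING, "Skipping column " + column, e);
            }
        }
    }
}
```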
Configure store, properties through command line (make sure they are compatible) and possibly package profiler into a fat jar for easy deployment.
Make sure the DBConnector code is compatible with as many dbs as possible. In particular test that the method:
"public List getAttributes() throws SQLException {"
in DBConnector works well.
In particular, in the short-term, we need the code to work with an Oracle 10g database.
One cannot reduce dimensions to X when the current number of dimensions is < X. Fix the parameter so that it can be set from a config file, and add a check that adjusts it dynamically if necessary.
When searching for schema names, sometimes it is useful to search for exact matches, rather than approximations. In certain cases, users know the exact name of the schema, and therefore it is more useful to do an exact search. Add some property to permit this.
This is due to the multiple inputs that indexing data produces (in the case of large input attribute values).