mitdbg / aurum-datadiscovery
License: MIT License
Find a (possibly) approximate way of computing percentiles in a streaming fashion.
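One possible direction, only as a hedged sketch: keep a bounded reservoir sample of the stream and read percentiles off the sorted sample. The class name and capacity below are placeholders; a production choice might instead be a dedicated quantile sketch (t-digest, GK), which the issue leaves open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class StreamingPercentileSketch {
    // Approximate streaming percentiles via reservoir sampling: keep a
    // fixed-size uniform sample of the stream and sort it on demand.
    private final List<Double> reservoir = new ArrayList<>();
    private final int capacity;
    private final Random rnd = new Random();
    private long seen = 0;

    public StreamingPercentileSketch(int capacity) { this.capacity = capacity; }

    public void offer(double value) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(value);
        } else {
            long j = (long) (rnd.nextDouble() * seen);  // uniform in [0, seen)
            if (j < capacity) reservoir.set((int) j, value);
        }
    }

    public double percentile(double p) {                // p in [0, 100]
        if (reservoir.isEmpty()) throw new IllegalStateException("no data");
        List<Double> sorted = new ArrayList<>(reservoir);
        Collections.sort(sorted);
        int idx = (int) Math.round(p / 100.0 * (sorted.size() - 1));
        return sorted.get(idx);
    }
}
```

Accuracy depends directly on the reservoir size, which keeps memory bounded regardless of stream length.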
De-noising data in general will help overall performance by:
This requires counting the errors that occur for a given column and abandoning the processing of that column when they exceed a given threshold (a sketch follows the example below):
For example, in data.gov I observe multiple messages like:
WARN preanalysis.PreAnalyzer - Error while parsing: For input string: "523986004252398600465239860072"
In any case, this requires a more in-depth study of which other errors are causing trouble.
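A minimal sketch of the counting mechanism described above, assuming a simple per-column counter keyed by column name; the threshold value would presumably come from ProfilerConfig rather than being hardcoded as here.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnErrorBudget {
    // Count parse errors per column and tell the caller when a column has
    // exceeded its error budget, so its processing can be abandoned.
    private final Map<String, Integer> errors = new HashMap<>();
    private final int threshold;

    public ColumnErrorBudget(int threshold) { this.threshold = threshold; }

    /** Record one error; returns true if the column should be abandoned. */
    public boolean recordError(String columnName) {
        int count = errors.merge(columnName, 1, Integer::sum);
        return count >= threshold;
    }

    /** Final per-column error counts, for reporting when the job is done. */
    public Map<String, Integer> report() { return errors; }
}
```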
The analyzer can be configured with stemming and word removal. We must understand all the options and tune them for the best accuracy possible.
Our entity analyzer task consists of understanding the entities contained in a set of values. As these values are all supposed to "mean" the same thing, we should stop analyzing entities (which is costly) as soon as we are confident we have discovered the underlying entities.
When reading from a DB, do we need to pool connections to make better use of resources? Are we otherwise limited by the database's throttling mechanism? Explore these questions.
Quantiles and cardinalities depend on approximations at the moment. We need some infrastructure to measure the accuracy of these methods, compare them, and decide how to tune them for a better accuracy/performance trade-off.
This leads to lots of problems when retrieving an id from the store (Java-generated) and trying to use it to look up the graph (Python-generated). Unify them with CRC32 or similar.
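A small sketch of the CRC32 option, assuming the id is derived from a (source, field) pair; the helper name and the separator are placeholders, not the repo's actual scheme.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class NodeId {
    // Derive a stable 32-bit id from (source, field) so that the Java
    // profiler and the Python graph code agree on the same value.
    public static long idFor(String source, String field) {
        CRC32 crc = new CRC32();
        crc.update((source + "." + field).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();   // unsigned 32-bit value carried in a long
    }

    public static void main(String[] args) {
        System.out.println(idFor("mydb.csv", "last_name"));
    }
}
```

Java's CRC32 and Python's zlib.crc32 implement the same CRC-32 polynomial, so the Python side can reproduce the same id with `zlib.crc32(b"mydb.csv.last_name") & 0xffffffff`.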
Approximate strategies are ok
Needed eventually
Schemas have names like:
last_name
us_phone_number
Find a better tokenizer to cover these cases. This will boost accuracy a lot.
These new tokenizers would be part of Elasticsearch.
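In Elasticsearch itself this would typically be expressed as a custom analyzer (for example a pattern tokenizer or a word_delimiter token filter), not Java code; the following is only a minimal sketch, with hypothetical names, of the token boundaries we want for schema names like the ones above.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaNameTokenizer {
    // Split snake_case / camelCase schema names into word tokens before
    // indexing, so that e.g. "us_phone_number" matches a search for "phone".
    public static List<String> tokenize(String name) {
        String spaced = name
                .replaceAll("([a-z])([A-Z])", "$1 $2")   // camelCase boundary
                .replaceAll("[_\\-.]+", " ")              // snake/kebab/dots
                .toLowerCase();
        return Arrays.asList(spaced.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("us_phone_number")); // [us, phone, number]
        System.out.println(tokenize("lastName"));        // [last, name]
    }
}
```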
Right now this is hardcoded, which is horrible.
Parse input options, based on the options defined in ProfilerConfig, so that users can start the profiler with a set of parameters. Add a unit test.
Explain the answer to schema_complement queries
edges() on a networkx graph fails if the nodes contain a cardinality attribute (not confirmed that this is the reason). If so, figure out how to attach that attribute information properly so that edges() does not break.
Noticed today while running ddprofiler against my local installation of elasticsearch 2.3.5. The ddprofiler server started throwing the following exception:
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.liveness.LivenessResponse]]
This only happens after commit 0ee4a9a - Merge algebra into master. I'm guessing it is because of the transport changes in the store client?
Workaround: I downgraded to elasticsearch v2.3.0 as per requirements.txt, and then it works fine.
I have created a "SanitizationTest" that contains a few class attributes with real data. In some cases those values have the form:
160,124.05 (note the first comma)
When this is part of a field in a CSV file, the protocol says that the value should be escaped in quotes. For example, if we have fields A,B,C, with B of the form shown previously, a file could look like:
"Boston", "160,124.05", 89
The current implementation does not parse the second field properly, which introduces errors that propagate through the entire prototype.
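A minimal sketch of the quote-aware splitting the parser needs, assuming standard CSV quoting rules (commas inside double-quoted fields are literal, doubled quotes are escapes); class and method names are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvLine {
    // Split one CSV line so that commas inside double-quoted fields
    // (e.g. "160,124.05") do not act as separators.
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); i++;        // escaped quote inside a field
                } else {
                    inQuotes = !inQuotes;        // toggle quoted state
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString().trim()); cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Boston\", \"160,124.05\", 89"));
        // -> [Boston, 160,124.05, 89]
    }
}
```

A mature CSV library (e.g., Apache Commons CSV or OpenCSV) already implements this quoting behavior and would avoid maintaining a hand-rolled parser.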
So that there's no need for the parent directory to contain the files
These two features are commingled.
We cannot fully clean the data before profiling it due to cost-value arguments. However, we should make a best-effort attempt to remove noise (null values, "X" values that mean NO, evident outliers, etc.) to raise the quality of the profiled information.
Provide some option to enumerate this for users to configure the system
For example:
rec.getTuples().add(v1)
v1, however, is not a tuple, but one value of one column.
Identify the other cases and fix them too.
So that we understand what errors occur:
and we can report them when the job is done.
Right now we have (source, field).
As we merge sources from different databases and repositories, it becomes useful to identify these too. So for example:
((database, source) , field)
Right now data is split per table to avoid redundant reads.
By splitting data in memory we could provide more fine-grained parallelism, i.e., per column, while still avoiding redundant reads.
The Connector class now defines an additional function that was not there initially:
"public abstract Map<Attribute, List> readRows(int num) throws IOException, SQLException;"
This abstract function is implemented for FileConnector, but not for DBConnector (empty method in line 208 of DBConnector).
This issue consists of implementing that method and testing it against a database.
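A hedged sketch of what a JDBC-backed readRows(num) could look like, with plain String column names standing in for the repo's Attribute class and an already-open ResultSet assumed; the real DBConnector would keep the statement/result set as state.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DbReadRowsSketch {
    // Read up to `num` rows from the result set into a column-oriented map
    // (column name -> list of string values), mirroring the Connector API.
    public static Map<String, List<String>> readRows(ResultSet rs, int num)
            throws SQLException {
        ResultSetMetaData md = rs.getMetaData();
        int cols = md.getColumnCount();
        Map<String, List<String>> batch = new LinkedHashMap<>();
        for (int c = 1; c <= cols; c++) {
            batch.put(md.getColumnName(c), new ArrayList<>());
        }
        int read = 0;
        while (read < num && rs.next()) {
            for (int c = 1; c <= cols; c++) {
                batch.get(md.getColumnName(c)).add(rs.getString(c));
            }
            read++;
        }
        return read == 0 ? null : batch;   // null signals "no more rows"
    }
}
```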
The profiler can run in two modes. In online mode it starts a server (embedded jetty) that receives requests in the form of a path and the name of a source.
In the offline mode, the user must configure a path in the command line, then Main reads the files in the path and creates tasks that are submitted to Conductor.
This issue consists of:
1- Make sure a user can configure a path to (CSV) files through the command line, and that these are added to ProfilerConfig correctly.
2- When reading files, get their name and path so that the WorkerTasks can be created properly. There are TODOs indicating where the missing info is (lines 85, 86, 87 in Main.java). A small sketch follows.
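For point 2, a minimal sketch of recovering the (parent path, file name) pair for each CSV file under the configured directory; the class name is a placeholder and the real code would feed these values into the WorkerTask instead of printing them.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvLister {
    // Walk the configured directory and, for each CSV file, recover the
    // pieces a WorkerTask needs: the parent path and the file name.
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.csv")) {
            for (Path file : stream) {
                String path = file.getParent().toString();
                String name = file.getFileName().toString();
                System.out.println(path + " / " + name);
            }
        }
    }
}
```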
So that it is easier for users to understand the results of a query.
EDIT: [Profiling is about 1 order of magnitude faster than indexing. Decouple both processes.]
Decouple the profiling process into the smallest pieces possible. For example, one should be able to index only schemas and no data, or data alone, etc. Then find a way of combining them back again into the original form. The challenge is to maintain all performance guarantees.
There are 3 store implementations right now that the profiler can choose from. Let's create a new property so that we can choose which one to use, and give that property to StoreFactory to do the rest.
For OpenNLP
At the moment there is an enum for all types of data sources and another one for db types. Make this consistent and simplify it.
For similarity functions, mainly.
It is straightforward to compute this in a streaming fashion. Add the attribute to the Analyzer class, compute it and return it in the object NumericalRange.
Search in content, schema name, table name, db name.
Then indicate the context of the results.
schema_name_search("molecule")
won't return a field named "molecular"
The CardinalityAnalyzer uses HyperLogLogPlus to estimate the cardinality. However, the results are error prone. For example, I used the module to derive the cardinality of 500 records.
The results show 531 unique elements while the total number of records is 500.
Probably we should use a better estimator.
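A small accuracy harness, assuming the stream-lib HyperLogLogPlus implementation: feed the same values to the estimator and to an exact set, then report the relative error. This is also a starting point for the measurement infrastructure mentioned earlier; the precision parameter 14 is just an example to vary.

```java
import java.util.HashSet;
import java.util.Set;

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class CardinalityAccuracyCheck {
    public static void main(String[] args) {
        HyperLogLogPlus hll = new HyperLogLogPlus(14);   // precision p = 14
        Set<String> exact = new HashSet<>();
        for (int i = 0; i < 500; i++) {
            String v = "record-" + i;
            hll.offer(v);       // approximate estimator
            exact.add(v);       // ground truth
        }
        long est = hll.cardinality();
        double err = Math.abs(est - exact.size()) / (double) exact.size();
        System.out.printf("exact=%d estimated=%d relative error=%.2f%%%n",
                exact.size(), est, err * 100);
    }
}
```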
The method EntityAnalyzer.loadModel() should instead return a list of TokenNameFinderModel, each corresponding to a different entity model. These models should then be applied in the feed() method, so that all existing entities are detected.
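A hedged sketch of that shape using the OpenNLP classes named above; the model file paths are placeholders, not the repo's actual configuration, and error handling is minimal.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class MultiEntitySketch {
    // loadModel(): one TokenNameFinderModel per entity type.
    static List<TokenNameFinderModel> loadModels(List<String> modelPaths) throws IOException {
        List<TokenNameFinderModel> models = new ArrayList<>();
        for (String p : modelPaths) {
            try (InputStream in = new FileInputStream(p)) {
                models.add(new TokenNameFinderModel(in));
            }
        }
        return models;
    }

    // feed(): apply every model to the batch of tokens and collect the
    // entity types found, so all known entities are detected.
    static List<String> findEntities(List<TokenNameFinderModel> models, String[] tokens) {
        List<String> found = new ArrayList<>();
        for (TokenNameFinderModel m : models) {
            NameFinderME finder = new NameFinderME(m);
            for (Span s : finder.find(tokens)) {
                found.add(s.getType());
            }
        }
        return found;
    }
}
```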
Replace the existing try/catch model with a regex-based one that would provide the data type with higher accuracy and more efficiently.
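A minimal sketch of the regex-based detection, assuming only integer/float/text need to be distinguished; the patterns and type labels are illustrative, not the profiler's actual type system.

```java
import java.util.regex.Pattern;

public class TypeSniffer {
    // Classify a raw value without throwing/catching NumberFormatException.
    private static final Pattern INT =
            Pattern.compile("[+-]?\\d+");
    private static final Pattern FLOAT =
            Pattern.compile("[+-]?(\\d+\\.\\d*|\\.\\d+|\\d+)([eE][+-]?\\d+)?");

    public static String typeOf(String value) {
        String v = value.trim();
        if (INT.matcher(v).matches()) return "INT";
        if (FLOAT.matcher(v).matches()) return "FLOAT";
        return "TEXT";
    }

    public static void main(String[] args) {
        System.out.println(typeOf("42"));        // INT
        System.out.println(typeOf("160124.05")); // FLOAT
        System.out.println(typeOf("Boston"));    // TEXT
    }
}
```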
There is a comment saying that the usage of getTableAttributes is wrong. I left a FIXME asking what is wrong. Please clarify so that we can see how to fix this.
Far too much object generation without a good reason when reading records. Streamline the implementation for CSV files. Write a perf test to track this in the future.
This is a desirable property: total values, unique values, matching values, and overlap among columns.
For quick lookup of values and access to statistics
When profiling a column fails, this should not break the profiler; it should only log the error and its cause to some file, skip the column, and keep working.
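A sketch of the intended behavior, using java.util.logging only as a stand-in for whatever logger the profiler actually uses; the ColumnProfiler interface is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ColumnProfilingLoop {
    private static final Logger LOG = Logger.getLogger("profiler.errors");

    interface ColumnProfiler {
        void profile(String columnName) throws Exception;
    }

    // A failure while profiling one column is logged with its cause and the
    // column is skipped, instead of letting the exception kill the whole run.
    public static void profileAll(Iterable<String> columns, ColumnProfiler profiler) {
        for (String column : columns) {
            try {
                profiler.profile(column);
            } catch (Exception e) {
                LOG.log(Level.WARNING, "Skipping column " + column, e);
            }
        }
    }
}
```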
Configure store, properties through command line (make sure they are compatible) and possibly package profiler into a fat jar for easy deployment.
Make sure the DBConnector code is compatible with as many dbs as possible. In particular test that the method:
"public List getAttributes() throws SQLException {"
in DBConnector works well.
In particular, in the short-term, we need the code to work with an Oracle 10g database.
One cannot reduce dimensions to X when the current number of dimensions is < X. Fix the parameter so that it can be set from a config file, and add a check that adjusts it dynamically if necessary.
When searching for schema names, sometimes it is useful to search for exact matches, rather than approximations. In certain cases, users know the exact name of the schema, and therefore it is more useful to do an exact search. Add some property to permit this.
This is due to the multiple inputs that indexing data produces (in the case of large input attribute values).