blocking's People

Contributors

berenz

blocking's Issues

add quality metrics

Add quality metrics for the blocking result. This would require a new argument: true_block
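
A minimal sketch of what such metrics could look like, assuming every pair of records is classified by whether it shares a block and whether it shares a true block. The helper name `pair_metrics` and its arguments are hypothetical and not part of the package; the counts are toy values.

```r
# Hypothetical helper: pair-level quality metrics from confusion counts.
# tp: pairs in the same block and the same true block
# fp: pairs in the same block but different true blocks
# fn: pairs in different blocks but the same true block
# tn: pairs in different blocks and different true blocks
pair_metrics <- function(tp, fp, fn, tn) {
  c(
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn),
    fpr       = fp / (fp + tn),
    fnr       = fn / (fn + tp),
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
  )
}

# Toy counts, chosen only for illustration
m <- pair_metrics(tp = 2, fp = 4, fn = 2, tn = 2)
print(round(m, 3))
```

Which of these metrics the package should report is left open here.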

bug when `true_blocks` are provided

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

testing <- blocking(x = df_example$txt,
                    deduplication = T)

testing2 <- blocking(x = df_example$txt,
                    deduplication = T,
                    true_blocks = testing$result[c(1,4,6), .(x,y,block)])

Error message

Error in modularity.igraph(graph, membership) : 
  At vendor/cigraph/src/community/modularity.c:132 : Membership vector size differs from number of vertices. Invalid value
> traceback()
4: modularity.igraph(graph, membership)
3: modularity(graph, membership)
2: igraph::make_clusters(eval_g1, membership = eval_blocks$block.x)
1: blocking(x = df_example$txt, deduplication = T, true_blocks = testing$result[c(1, 
       4, 6), .(x, y, block)])

Full `mlpack` support

Support for mlpack in:

  1. lsh functions
  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index
  2. knn functions
  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `rnndescent` support

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

remove duplicated pairs from the result

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

result <- blocking(x = df_example$txt,
                   ann = "hnsw",
                   control_ann = controls_ann(hnsw = list(M = 5, ef_c = 10, ef_s = 10)))

result$result

The result contains duplicated pairs: the same two records appear in both orders (e.g. 1-2 and 2-1).

> result$result
       x     y block       dist
   <int> <int> <num>      <num>
1:     1     2     1 0.09999990
2:     2     1     1 0.09999990
3:     2     3     1 0.14188361
4:     2     4     1 0.28286278
5:     5     6     2 0.08333343
6:     5     7     2 0.13397467
7:     6     5     2 0.08333343
8:     6     8     2 0.27831221
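
One way to drop the reversed pairs, sketched in base R on the rows printed above (the `dist` column is omitted for brevity). This is only an illustration of the idea, not the fix adopted in the package.

```r
# Pairs copied from the output above
res <- data.frame(
  x     = c(1, 2, 2, 2, 5, 5, 6, 6),
  y     = c(2, 1, 3, 4, 6, 7, 5, 8),
  block = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Order each pair so that the smaller index comes first,
# then keep only the first occurrence of every unordered pair.
lo <- pmin(res$x, res$y)
hi <- pmax(res$x, res$y)
res_unique <- res[!duplicated(data.frame(lo, hi)), ]
print(res_unique)
```

The pairs 2-1 and 6-5 are dropped, leaving six rows.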

Release 0.2.0

Plans:

  • Support for rnndescent as it will be shipped to CRAN.

blocking by blocking variables

Allow the user to specify a blocking variable that is applied before ANN blocking. For instance, the user may want to block records by gender or first letter first and then apply ANN blocking within each group.
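
A rough sketch of the idea in base R. The wrapper `block_within` and the stand-in `toy_blocker` are hypothetical; in practice the ANN step would take the place of `toy_blocker`, and the real result is a table of pairs rather than one block id per record.

```r
# Hypothetical wrapper: run a blocking function separately within each
# level of a user-supplied blocking variable and keep block ids distinct.
# `block_fun` must return one block label per input element.
block_within <- function(x, by, block_fun) {
  out <- integer(length(x))
  offset <- 0L
  for (idx in split(seq_along(x), by)) {
    ids <- as.integer(factor(block_fun(x[idx])))  # relabel as 1..k
    out[idx] <- ids + offset
    offset <- offset + max(ids)
  }
  out
}

# Stand-in for ANN blocking (an assumption for illustration only):
# records sharing the same first letter fall into the same block.
toy_blocker <- function(txt) substr(txt, 1, 1)

txt    <- c("jankowalski", "jankowalska", "montypython", "montypyton", "janina")
gender <- c("m", "f", "m", "m", "f")
blocks <- block_within(txt, by = gender, block_fun = toy_blocker)
print(blocks)
```

Records 1 and 2 would share a block without the pre-blocking step, but end up in different blocks here because they differ in `gender`.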

`pair_ann` does not work with `data.table`

> pair_ann(x = df_example, on = "txt")
  First data set:  8 records
  Second data set: 8 records
  Total number of pairs: 10 pairs
  Blocking on: 'txt'

       .x    .y block
    <int> <int> <num>
 1:     1     2     1
 2:     1     3     1
 3:     1     4     1
 4:     2     3     1
 5:     2     4     1
 6:     5     6     2
 7:     5     7     2
 8:     5     8     2
 9:     6     7     2
10:     6     8     2
> pair_ann(x = setDT(df_example), on = "txt")
Error: j (the 2nd argument inside [...]) is a single symbol but column name 'on' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..on]. The .. prefix conveys one-level-up similar to a file system path.
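
The error message points at a known difference between `data.frame` and `data.table` column selection. The sketch below shows that difference in isolation; it assumes, without having checked the source of `pair_ann`, that the function selects columns through a variable named `on`.

```r
library(data.table)

dt <- data.table(txt = c("jankowalski", "kowalskijan"), id = 1:2)
on <- "txt"  # column name held in a variable, as with the `on` argument

# In a data.frame, df[, on] selects the column named by `on`.
# In a data.table, dt[, on] looks for a column literally called `on`,
# which is what the error message above reports.

col_a <- dt[, ..on]              # `..` prefix: look `on` up in the calling scope
col_b <- dt[, on, with = FALSE]  # same selection, older idiom
vec   <- dt[[on]]                # plain vector; works for data.frame and data.table
```

Of the three, `[[` is the form that behaves the same for both classes.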

Improvement of performance

Ideas for improving performance:

  • if the dataset used to build the index is large, the index should be built iteratively, because converting the sparse matrix to a dense matrix is a bottleneck
  • if the query dataset is large, the same procedure should be applied to it.

Consider using:

  • sparse matrix (Matrix) as an input for x, y
  • bigmemory::big.matrix as an input for x, y
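
The iterative idea can be sketched with the Matrix package: only a slice of rows is converted to dense form at a time. The helper `process_in_chunks` is hypothetical, and the callback merely stands in for "add these rows to the index"; whether a given ANN backend supports adding items incrementally is not covered here.

```r
library(Matrix)

# Hypothetical helper: apply `fun` to dense row chunks of a sparse matrix,
# so that only `chunk_size` rows are ever held in dense form at once.
process_in_chunks <- function(x_sparse, chunk_size, fun) {
  starts <- seq(1, nrow(x_sparse), by = chunk_size)
  lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(x_sparse))
    fun(as.matrix(x_sparse[rows, , drop = FALSE]), rows)
  })
}

set.seed(1)
x <- rsparsematrix(nrow = 10, ncol = 5, density = 0.3)

# Stand-in for the indexing step: here we only record each chunk's shape.
shapes <- process_in_chunks(x, chunk_size = 4, function(dense, rows) dim(dense))
print(shapes)
```

With 10 rows and a chunk size of 4, the callback sees chunks of 4, 4 and 2 rows.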

Full `RcppHNSW` support

Support for RcppHNSW in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `RcppAnnoy` support

Support for RcppAnnoy in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

error in the `confusion` construction

A reminder that there is an error in how the confusion matrix is constructed:

> confusion
          same_truth
same_block   FALSE    TRUE
     FALSE 1926684      11
     TRUE       11     960

Cell (1, 2) is exactly the same as cell (2, 1).
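
For comparison, a small base R sketch that builds a pair-level confusion matrix by counting each unordered pair of records once. The vectors `pred` and `truth` are toy data, and this is not the package's implementation. It uses `outer()`, so it is only practical for small inputs.

```r
pred  <- c(1, 1, 1, 1, 2)  # block assigned to each record
truth <- c(1, 1, 2, 2, 2)  # true block of each record

# Compare every unordered pair of records exactly once (upper triangle).
same_block <- outer(pred, pred, "==")
same_truth <- outer(truth, truth, "==")
ut <- upper.tri(same_block)

confusion <- table(
  same_block = factor(same_block[ut], levels = c(FALSE, TRUE)),
  same_truth = factor(same_truth[ut], levels = c(FALSE, TRUE))
)
print(confusion)
```

In this toy case the off-diagonal cells differ (2 and 4), and the four cells sum to `choose(5, 2) = 10`. Equal off-diagonal cells are possible by coincidence, so the check is a hint rather than proof of a bug.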
