
Comments (5)

davidbp commented on May 13, 2024

Where can you see that the call to this function (ivf_search) actually takes most of the time at query time?

from annlite.

numb3r3 commented on May 13, 2024

ivf_search is where the search actually happens. BTW, I also ran a profile of search_cells:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   132                                               @line_profile
   133                                               def search_cells(
   134                                                   self,
   135                                                   query: np.ndarray,
   136                                                   cells: np.ndarray,
   137                                                   where_clause: str = '',
   138                                                   where_params: Tuple = (),
   139                                                   limit: int = 10,
   140                                                   include_metadata: bool = False,
   141                                               ):
   142        15         15.0      1.0      0.0          topk_dists, topk_docs = [], []
   143        30         48.0      1.6      0.0          for x, cell_idx in zip(query, cells):
   144                                                       # x.shape = (self.dim,)
   145        30     910683.0  30356.1     98.5              dists, doc_ids, cells = self.ivf_search(
   146        15          3.0      0.2      0.0                  x,
   147        15          3.0      0.2      0.0                  cells=cell_idx,
   148        15          3.0      0.2      0.0                  where_clause=where_clause,
   149        15          7.0      0.5      0.0                  where_params=where_params,
   150        15          5.0      0.3      0.0                  limit=limit,
   151                                                       )
   152
   153        15         32.0      2.1      0.0              topk_dists.append(dists)
   154        15        160.0     10.7      0.0              match_docs = DocumentArray()
   155       165        186.0      1.1      0.0              for dist, doc_id, cell_id in zip(dists, doc_ids, cells):
   156       150       5574.0     37.2      0.6                  doc = Document(id=doc_id)
   157       150         67.0      0.4      0.0                  if include_metadata:
   158       150       5568.0     37.1      0.6                      doc = self.doc_store(cell_id).get([doc_id])[0]
   159
   160       150       1736.0     11.6      0.2                  doc.scores[self.metric.name.lower()].value = dist
   161       150        196.0      1.3      0.0                  match_docs.append(doc)
   162        15         13.0      0.9      0.0              topk_docs.append(match_docs)
   163
   164        15          6.0      0.4      0.0          return topk_dists, topk_docs


davidbp commented on May 13, 2024

Even if that is where the search happens, the boilerplate code that joins the results could actually take more time. Nevertheless, this post suggests that is not the case, at least at the search_cells level.
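A quick way to check this kind of split between search time and result-joining time, without a line profiler, is to time the two phases separately. The following is only a minimal sketch: `fake_search` and `assemble` are hypothetical stand-ins invented for illustration, not annlite functions, and the numbers it prints say nothing about annlite itself.

```python
import time


def fake_search(n):
    # Hypothetical stand-in for the expensive search step (not annlite code).
    return sorted((i * 7919) % 10007 for i in range(n))[:10]


def assemble(hits):
    # Hypothetical stand-in for the boilerplate that joins results.
    return [{"id": h, "score": float(h)} for h in hits]


def timed_phases(n_queries=15, n=20000):
    """Return (seconds in search, seconds in assembly, results) over all queries."""
    t_search = t_assemble = 0.0
    results = []
    for _ in range(n_queries):
        t0 = time.perf_counter()
        hits = fake_search(n)
        t1 = time.perf_counter()
        docs = assemble(hits)
        t2 = time.perf_counter()
        t_search += t1 - t0
        t_assemble += t2 - t1
        results.append(docs)
    return t_search, t_assemble, results


t_search, t_assemble, results = timed_phases()
total = t_search + t_assemble
print(f"search: {100 * t_search / total:.1f}%  assembly: {100 * t_assemble / total:.1f}%")
```

Replacing the two stand-ins with the real search call and the real result-building loop would give a coarse version of the per-line percentages in the profile.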


numb3r3 commented on May 13, 2024

Profiling results for table.query:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   221                                               @line_profile
   222                                               def query(
   223                                                   self,
   224                                                   where_clause: str = '',
   225                                                   where_params: Tuple = (),
   226                                               ) -> Iterator[dict]:
   227                                                   """Query the records which matches the given conditions
   228
   229                                                   :param where_clause: where clause for query
   230                                                   :param where_params: where parameters for query
   231                                                   :return: iterator to yield matched doc
   232                                                   """
   233        15         20.0      1.3      0.0          sql = 'SELECT _id, _doc_id from {table} WHERE {where} ORDER BY _id ASC;'
   234
   235                                                   # where_conds = ['_deleted = ?']
   236        15         10.0      0.7      0.0          where_conds = []
   237        15          8.0      0.5      0.0          if where_clause:
   238        15         14.0      0.9      0.0              where_conds.append(where_clause)
   239        15         17.0      1.1      0.0          where_conds += ['_deleted = ?']
   240        15         15.0      1.0      0.0          where = ' and '.join(where_conds)
   241        15         54.0      3.6      0.0          sql = sql.format(table=self.name, where=where)
   242
   243                                                   # params = (0,) + tuple([_converting(p) for p in where_params])
   244        15         81.0      5.4      0.0          params = tuple([_converting(p) for p in where_params]) + (0,)
   245
   246                                                   # for row in self._conn.execute(f'PRAGMA index_list("{self.name}")'):
   247                                                   #     print(row)
   248
   249                                                   # # sql = 'EXPLAIN QUERY PLAN ' + sql
   250                                                   # for row in self._conn.execute('EXPLAIN QUERY PLAN ' + sql, params):
   251                                                   #     print(row)
   252
   253        15       9597.0    639.8      1.2          cursor = self._conn.execute(sql, params)
   254    500015     555813.0      1.1     68.7          for row in cursor:
   255    500000     243938.0      0.5     30.1              yield {'_id': row[0] - 1, '_doc_id': row[1]}
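In the profile above, almost all of the time in table.query is spent iterating the cursor and yielding one dict per row (500,000 rows over 15 calls). One commonly tried variation is to pull rows from SQLite in batches with `cursor.fetchmany`. The sketch below uses only the standard `sqlite3` module; the table name `cell`, its schema, and the helper `iter_rows_batched` are assumptions made for illustration, and this is not necessarily what annlite does. Whether batching actually helps would need to be measured on the real workload.

```python
import sqlite3


def iter_rows_batched(conn, sql, params=(), batch_size=10000):
    """Yield dicts like the profiled loop, but fetch rows in batches."""
    cursor = conn.execute(sql, params)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for _id, doc_id in rows:
            # Same shape as the profiled code: zero-based _id plus _doc_id.
            yield {"_id": _id - 1, "_doc_id": doc_id}


# Tiny in-memory table with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cell ("
    "_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "_doc_id TEXT, "
    "_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO cell (_doc_id) VALUES (?)",
    [(f"doc{i}",) for i in range(25)],
)
conn.execute("UPDATE cell SET _deleted = 1 WHERE _doc_id = 'doc3'")

sql = "SELECT _id, _doc_id FROM cell WHERE _deleted = ? ORDER BY _id ASC;"
matches = list(iter_rows_batched(conn, sql, (0,), batch_size=10))
print(len(matches), matches[0])
```

If the caller only needs the ids, returning the fetched tuples directly instead of building a dict per row is another option to measure.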


numb3r3 commented on May 13, 2024

PR #74 achieves a 3x improvement.

