
Comments (9)

llvtt commented on July 17, 2024

Hi @B0rner,

Upserts using the Solr DocManager do commit on every upsert, which could very well be excessive, but that is the default in pysolr's add method. From what I've read in the Solr documentation, commits are by default "hard commits," which flush to disk. I don't know whether "soft commits" are supported in Solr 4.x, given the warning exclamation point next to "Solr4.0" in the docs. I'd be willing to look into this further this week. At the very least, it seems as if there should be a feature to add documents to Solr without committing, and then commit documents every X number of upserts or when there is no other activity happening. Does this sound reasonable to you?

As for your second question, you can do an "initial sync" of all your documents from MongoDB into Solr by truncating the oplog progress file (called "config.txt" by default). After you restart mongo-connector, it will replicate all the documents from your targeted namespace over to Solr, regardless of their presence in the oplog. Note that each document from the collection being dumped is inserted into Solr using the Solr DocManager's update method, which is where you've identified a possible performance bottleneck from over-committing.
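As a minimal sketch of that step, assuming the default progress file name and that mongo-connector is stopped while you do it:

# With mongo-connector stopped, truncate the oplog progress file (assumed here
# to be the default "config.txt" in the working directory), then restart
# mongo-connector to trigger a full re-sync into Solr.
open("config.txt", "w").close()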

If you need a quick solution to your problem, you might try changing the call to pysolr's add method to use commit=False, if only for the "initial sync" part of the process. I'm not sure when or if uncommitted documents make it to disk in Solr, so you may need to commit these upserts manually:

from pysolr import Solr

# Connect to the Solr core that mongo-connector targets and issue a manual
# commit so that any uncommitted upserts become visible (and are flushed).
connection = Solr("http://localhost:8983/solr")
connection.commit()
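For reference, the workaround itself is just passing commit=False to pysolr's add call. A rough sketch (the document contents and URL below are illustrative, not the actual solr_doc_manager.py code):

from pysolr import Solr

solr = Solr("http://localhost:8983/solr")
doc = {"id": "example-1", "title": "example"}  # illustrative document

# Pass commit=False so Solr buffers the document instead of committing
# (and flushing to disk) on every single upsert.
solr.add([doc], commit=False)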

Hope this helps. I'll look into a more performant way to handle commits in Solr soon.


llvtt commented on July 17, 2024

Update about better commit practices on Solr:

It looks as if the commit behavior can be configured in the solrconfig.xml file that ships with Solr. There's an autoCommit option, available since Solr 1.2, that lets the user specify that documents should be committed after every X ms or after every Y documents needing to be written to disk. Since version 4.0-alpha, there's even an autoSoftCommit option (see https://issues.apache.org/jira/browse/SOLR-2193). Since the user already has a lot of control over commit behavior in Solr, I don't think it's necessary for mongo-connector to make any commits in solr_doc_manager.py. Instead, there should be a blurb in the README informing the user that mongo-connector makes no commits in Solr and providing a link to the relevant Solr documentation about solrconfig.xml (given above) for how to configure commit behavior.
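For illustration, the relevant section of solrconfig.xml might look something like this (the values are examples only, not recommendations):

<!-- Illustrative autoCommit/autoSoftCommit settings in solrconfig.xml -->
<autoCommit>
  <maxDocs>10000</maxDocs>          <!-- hard commit after 10,000 pending docs -->
  <maxTime>60000</maxTime>          <!-- or after 60 seconds, whichever comes first -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>           <!-- soft commit (visibility) every second -->
</autoSoftCommit>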

@B0rner, does this make sense to you?

As a side note, it looks as if elastic_doc_manager.py suffers from exactly the same problem being described here about the Solr DocManager, i.e., the DocManager "refreshes" after every upsert, even though this "refresh" behavior is also in the run_auto_commit method. I'll have to do some more research to determine what should be done in ES.


llvtt commented on July 17, 2024

An update about better commit practices on Elasticsearch:

According to the ES "refresh" documentation, refreshes are already scheduled "periodically" by default (according to that page, periodically = every second). Flushes are handled automatically in ES. If mongo-connector were to assert control over refreshes in ES (it currently refreshes after every update/insert), there is the notion of a refresh interval that refreshes an index every X seconds. It could be worth switching off periodic refreshes during collection dumps to improve write throughput and then resetting the refresh interval to whatever it was previously for that index. I think that either leveraging the refresh interval or removing the manual refreshes from elastic_doc_manager.py and letting ES do its default "periodic" refreshing would be more performant than the current behavior.
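A rough sketch of that idea, using the elasticsearch Python client; the host, index name, and restored interval are illustrative (ideally you would restore whatever interval the index used before):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
index = "my_index"  # illustrative index name

# Disable periodic refreshes for the duration of the collection dump.
es.indices.put_settings(index=index, body={"index": {"refresh_interval": "-1"}})

# ... bulk-load documents here ...

# Restore the refresh interval (ES defaults to every second) and force one
# refresh so the loaded documents become searchable immediately.
es.indices.put_settings(index=index, body={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index=index)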

Another side note:

I'm noticing that the auto_commit parameter in the constructors of both the Solr DocManager and the ES DocManager is not configurable by the user of mongo-connector. It seems like this feature was originally meant to let the user configure whether mongo-connector should automatically refresh/commit on each upsert (auto_commit=True) or let the underlying configuration take care of this (auto_commit=False). Since both Solr and ES have functionality for "commit within X amount of time," the auto_commit parameter could instead be a number X specifying how long operations may hang around before being committed. For DocManagers that don't support committing within X amount of time, X != 0 could mean a commit on each upsert. Either way, this parameter should probably be exposed through a command-line option in connector.py, and both the Solr and ES DocManagers could take advantage of it.
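As a hypothetical illustration of the "commit within X" idea on the Solr side, assuming a pysolr version whose add() supports the commitWithin argument (values and document contents are made up for the example):

from pysolr import Solr

solr = Solr("http://localhost:8983/solr")
doc = {"id": "example-1", "title": "hello"}  # illustrative document

# Instead of committing on each upsert, ask Solr to commit this document
# within 10 seconds (commitWithin is expressed in milliseconds).
solr.add([doc], commit=False, commitWithin=10000)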


B0rner commented on July 17, 2024

I'd be willing to look into this further this week. At the very least, it seems as if there should be a feature to add documents to Solr without committing, and then commit documents every X number of upserts or when there is no other activity happening. Does this sound reasonable to you?

Yes, I think this would be very useful for bigger databases. But of course, there is no need to re-implement an existing feature such as Solr's autocommit, as you wrote.

As for your second question, you can do an "initial sync" of all your documents from MongoDB into Solr by truncating the oplog progress file (called "config.txt" by default).

Does this mean that truncating the oplog progress file results in something like dumping the output of a mongo "find" command to Solr?
That is good to know.

Instead, there should be a blurb in the README informing the user that mongo-connector makes no commits in Solr and providing a link to the relevant Solr documentation about solrconfig.xml (given above) for how to configure commit behavior.
@B0rner, does this make sense to you?

For my case, that makes sense. I don't think the goal should be to re-implement an autocommit feature equivalent to Solr's built-in one. On the other hand, there is a wide range of users with different needs, so I don't know whether this is a solution that helps most people.
For me, mongo-connector is two tools in one, and I can't use it for only one of those jobs:
1.) It's a tool to migrate data from MongoDB to Solr, and
2.) it's a tool to establish a link between Solr and MongoDB that keeps syncing newly incoming documents to Solr over time.
For the 1st scenario (migration tool), I think no commit is necessary after every update. There is not even a commit necessary at the end of the initial sync, because if I build a migration script around mongo-connector, I can trigger the commit from my own script after mongo-connector has finished indexing.

For the 2nd use case of this tool (Solr gets an initial sync and then waits for new documents), the commit after each document is useful, as long as there are not too many incoming documents (which depends on the environment). For systems with only a few new updates per hour, the autocommit interval is probably set too high.

I will try your workaround... in that case, the new documents should still become visible in Solr because of Solr's autocommit feature. Right?


B0rner commented on July 17, 2024

Update: I have implemented your workaround by changing the Solr doc manager to set commit=False. The initial import runs faster: with commit=True the update process needs 0.05-0.08 seconds per document, now it needs 0.04 seconds, so the update is 25-50% faster. I expect to need 40-50 hours to index 10,000,000 docs (83 hours with commit=True).
But it's still much slower than the DataImportHandler indexing the same data from MySQL (3 hours). There are probably other reasons, too, why the update takes so long, such as establishing an HTTP connection for every document, etc.


llvtt commented on July 17, 2024

@B0rner,

The Solr DocManager uses the pysolr library to connect to Solr. Looking at the source, pysolr uses a requests.Session object to manage connections to Solr, and these sessions already take advantage of keep-alive, so I don't think the bottleneck here is establishing new connections. After running some cursory tests, it seems like one of the biggest performance killers is the fact that upsert() only inserts or updates a single document at a time, whereas the Solr API is capable of batch operations. From these tests, it looks like batch upsert could be up to 30x faster than serial :). Having a batch insert/upsert method in the DocManager API is definitely worthwhile. I'll open another issue for this feature.

edit: new issue is #56
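For a sense of what the batching idea looks like with pysolr (a rough sketch; the URL, document contents, and batch size are illustrative):

from pysolr import Solr

solr = Solr("http://localhost:8983/solr")

# Serial approach: one HTTP update request (and, with commit=True, one commit)
# per document.
# for doc in docs:
#     solr.add([doc], commit=True)

# Batched approach: send many documents in a single update request and let
# Solr's autocommit settings decide when to commit.
docs = [{"id": str(i), "title": "doc %d" % i} for i in range(1000)]
solr.add(docs, commit=False)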


B0rner commented on July 17, 2024

From these tests, it looks like batch upsert could be up to 30x faster than serial :)

more than this! ;-)

Sorry, but I have now developed my own importer, mainly because my Python skills are not so good. So I wrote some lines of PHP that fit my needs. It works much like the Solr DataImportHandler: the "Mongo Solr Importer".

The first version was able to push 2,500 docs per second from MongoDB to Solr, which is 75x faster than mongo-connector. After adding multi-threading, the script was able to import 6,700 docs per second, a factor of 200. Thus, the duration of the import drops from 83 hours to 20 minutes.
I know that your script works at a much finer granularity and can handle changes to documents, while my script can only do one thing, a full import, but it does that very fast.
So maybe my final solution will be a combination of both: run the initial import with my PHP script and handle new docs with your mongo-connector.

By the way, you can find the "Mongo Solr Importer" tool here:

https://github.com/5missions/mongoSolrImporter

B0rner


llvtt commented on July 17, 2024

The bulk_upsert method is now available and should make collection dumps much faster (as of commit 7e48f55). Working on better commit behavior in #68.


llvtt commented on July 17, 2024

Better commit behavior: closed by #68 in 447a80f.

