
derp-engine's Introduction

Derp Engine:

Derp engine is my Vandy Hacks III project!

What are you doing?

We are writing a search engine from scratch!

Why?

Why not?

What's special about this engine?

I am writing it, and I am special. Therefore, vicariously, so is this engine.

derp-engine's People

Contributors

allenh1

Watchers

James Cloos

derp-engine's Issues

Remove Duplicate Pages

A lot of pages are indexed separately because they have distinct URLs, even though their content is not distinct.

In such a situation, I'd prefer we keep the shorter URL version of the page.

bool crawler::discovered(const QString & url)
{
    /* check for the url locally, first. */
    if (m_local_url.find(url) != m_local_url.end()) {
        return true;
    }
    if (!m_db.open()) {
        std::cerr << "Error! Failed to open database connection!" << std::endl;
        /* report the url as discovered so it is not crawled without a database. */
        return true;
    }
    QSqlQuery query(m_db);
    /* bind the url instead of concatenating it, so quotes in it cannot break the query. */
    query.prepare("SELECT url FROM websites WHERE url = :url LIMIT 1");
    query.bindValue(":url", url);
    if (!query.exec()) {
        std::cerr << "Error: Query failed to execute!" << std::endl;
        std::cerr << "Query: \"" << query.lastQuery().toStdString() << "\"" << std::endl;
        m_local_url[url] = url;
        return true;
    }
    /* size() is -1 when the driver cannot report it; next() is true only if a row exists. */
    return query.next();
}

I think the best place to make this change is the function above. We'll add a local cache of the page content alongside the page URL (for simplicity, we can just change the cached value to a std::pair of QStrings).

Then we can check the database, in case the page has already been discovered by another node.
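A minimal sketch of the local half of that idea, using only the standard library so it stands alone. The class name `content_cache` and its methods are hypothetical, not part of the existing crawler, and it uses std::string where the crawler uses QString. It maps page content to the shortest URL seen so far for that content; a real version would probably key on a hash of the content rather than the full text, and would still need the database check for other nodes.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical local cache: page content -> shortest URL seen for that content.
class content_cache
{
public:
    // Record a page. Returns true if this content was already known,
    // i.e. the page is a duplicate reached through a different URL.
    bool record(const std::string & url, const std::string & content)
    {
        auto it = m_by_content.find(content);
        if (it == m_by_content.end()) {
            m_by_content.emplace(content, url);
            return false;
        }
        // Duplicate content: keep whichever URL is shorter.
        if (url.size() < it->second.size()) {
            it->second = url;
        }
        return true;
    }

    // URL currently kept for this content, or an empty string if unseen.
    std::string kept_url(const std::string & content) const
    {
        auto it = m_by_content.find(content);
        return it == m_by_content.end() ? std::string() : it->second;
    }

private:
    std::unordered_map<std::string, std::string> m_by_content;
};
```

With this, recording the same body first under a long URL and then under a shorter one reports the second page as a duplicate and leaves the shorter URL as the one kept.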

Modernize the Code

It would be nice to look at this again and decide what parts can be removed.

There's a lot of code here, and I'm pretty sure most of it is not being used.

Filter Content to Words

|      59950 | https://dota2.gamepedia.com/Pangolier | Pangolier - Dota 2 Wiki | Kt-1delete tniilength%Kttntenfunction btreturnstringtypeof t0 rConfig                  iadbinoitceteddlohserhtrHHt0ntfunction iterHtenFtevar rthisyeAfunctiontivttSeAfunctiontiwttReAfunctiontibttforvar o0o thisj599 itNteHteHtIeHtJitt tHtH                  tegtegtItt tgtgtttaddEventListenerunloadfunctionnWndaJnConfigiadbitemeletdbatniopdneyrnfthisccbfunctiontifnfztpushteiftnWt1evar oncmolaoNvoid 0nchmtnC                  onfigiadbilufelbanenoitcetedlvoid 0nchEtvoid 0nchStvoid 0nchjtvoid 0nchDtvoid 0nchNtvoid 0nchgtvoid 0nchktt0tndaJnConfigiadbitemele |

This is less than desirable. Maybe a dictionary lookup should be imposed: as long as x percent of the content is not like the above, it may be stored (otherwise, it slides the window)?
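One way the dictionary check could look, as a rough sketch only. The function names `dictionary_ratio` and `should_store`, the tokenization rule (split on whitespace, keep letters, lower-case), and the `min_ratio` parameter standing in for "x percent" are all assumptions made for illustration. The window-sliding fallback is not covered here.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>

// Fraction of whitespace-separated tokens that appear in the dictionary.
// Each token is lower-cased and stripped of non-letter characters first.
double dictionary_ratio(const std::string & content,
                        const std::unordered_set<std::string> & dictionary)
{
    std::istringstream stream(content);
    std::string token;
    std::size_t total = 0;
    std::size_t known = 0;
    while (stream >> token) {
        std::string word;
        for (unsigned char c : token) {
            if (std::isalpha(c)) {
                word += static_cast<char>(std::tolower(c));
            }
        }
        if (word.empty()) {
            continue;  // token contained no letters, e.g. "0" or "--"
        }
        ++total;
        if (dictionary.count(word) != 0) {
            ++known;
        }
    }
    return total == 0 ? 0.0 : static_cast<double>(known) / static_cast<double>(total);
}

// Store the content only if at least min_ratio of its words are recognised.
bool should_store(const std::string & content,
                  const std::unordered_set<std::string> & dictionary,
                  double min_ratio)
{
    return dictionary_ratio(content, dictionary) >= min_ratio;
}
```

Text like the row above would score close to zero against any reasonable word list, so it would be rejected for any sensible threshold, while ordinary prose would pass. The right threshold would have to be found by experiment.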

Switch to PostgreSQL

It would be nice to use PostgreSQL for the backend database instead of MySQL, or, better still, have a setting to change the backend database and maintain support for both.

I'm a MySQL fan myself, but it lacks good Qt support for threaded insertion (at least as of when I last checked, around Qt 5.5, I think): you cannot insert from multiple threads, even synchronously.
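The "setting to change the backend" part could start as small as the sketch below. The setting values "mysql" and "postgresql" and the function name `qt_driver_for` are made up for illustration; "QMYSQL" and "QPSQL" are the driver names Qt's SQL module uses for MySQL and PostgreSQL.

```cpp
#include <stdexcept>
#include <string>

// Map a (hypothetical) backend setting to the Qt SQL driver name that
// would be passed to QSqlDatabase::addDatabase().
std::string qt_driver_for(const std::string & backend)
{
    if (backend == "mysql") {
        return "QMYSQL";
    }
    if (backend == "postgresql") {
        return "QPSQL";
    }
    throw std::invalid_argument("unsupported database backend: " + backend);
}
```

In the crawler, the result would presumably be handed to Qt with something like QSqlDatabase::addDatabase(QString::fromStdString(qt_driver_for(setting))). Any SQL that differs between the two backends would still need its own handling.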

Create Docker Image

It would be cool to set up a self-contained Docker image that includes a clean database, both for testing purposes and to make it easier to play around with this.
