universal-search-recommendation's Introduction

Note: Universal Search has been retired and is no longer under development. Read about our takeaways and try out another Test Pilot experiment.

universal-search-recommendation

Recommendation server for the Universal Search Test Pilot add-on.

Documentation

Contributing

If you'd like to get involved, take a look at our defined bugs, or say hello on IRC (#universal-search on Mozilla IRC) or on our mailing list.

By participating in this project, you agree to abide by our code of conduct.

universal-search-recommendation's People

Contributors

april, chuckharmston, clouserw, hannosch, jaredhirsch, mostlygeek, pdehaan

universal-search-recommendation's Issues

Get images from `apple-touch-icon` and/or `og:image`.

Suggested approach:

  • In SearchRecommendation, after the result property is determined, but before the classifiers are applied, do an HTTP request for result.url and store the DOM on a property of the SearchRecommendation instance.
  • Pass the DOM during the construction of each classifier instance for the result.
  • Have helpers on the base classifier to help with the querying of the DOM.
  • Each classifier will then have the ability to query the DOM as necessary.
  • A new classifier can be written to get the apple touch icon for websites, and a separate classifier can be written to get the og:image for article-type content.
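
The DOM-querying helpers suggested above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual classifier code: the names HeadImageParser and extract_images are hypothetical, and fetching result.url is left out so that the parsing step stands on its own.

```python
from html.parser import HTMLParser


class HeadImageParser(HTMLParser):
    """Collects the first apple-touch-icon and og:image found in a document."""

    def __init__(self):
        super().__init__()
        self.apple_touch_icon = None
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            # rel may hold several space-separated tokens.
            rels = (attrs.get("rel") or "").lower().split()
            if "apple-touch-icon" in rels and self.apple_touch_icon is None:
                self.apple_touch_icon = attrs.get("href")
        elif tag == "meta":
            if attrs.get("property") == "og:image" and self.og_image is None:
                self.og_image = attrs.get("content")


def extract_images(html):
    """Hypothetical helper: return both image candidates for one HTML string."""
    parser = HeadImageParser()
    parser.feed(html)
    return {
        "apple_touch_icon": parser.apple_touch_icon,
        "og_image": parser.og_image,
    }
```

Parsing once and handing the result to each classifier keeps it to a single HTTP request per result, which is the point of storing the DOM on the SearchRecommendation instance.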

Write favicon classifier.

We should write a classifier applied to all results that attempts to return a favicon from Embedly.

When that is finished, we should back out the changes from #63.

Reconsider conditions under which Embedly data is included

In #63, we temporarily turned EmbedlyClassifier on for all results in order to get access to favicons for every site. In #77, we will revert those changes when adding a new classifier solely responsible for dealing with favicons. That means we will no longer see other Embedly data, most notably key images, for results where EmbedlyClassifier is not applied.

When toying around with it, that feels a lot worse. One example of that:

[Two screenshots, captured 2016-03-24 at 4:38:09 PM and 4:33:17 PM]

Embedly data is currently included for top level domains, directories, and Wikipedia articles. We don't want to always include it—there are quite a few bytes involved in that—but we could definitely afford to expand the current definition.

Send correct headers for CDN caching

Moved from mozilla/universal-search#123

Any CDN or proxy we end up with will respect the standard caching headers, so we should make sure we are sending them.

This can be a no-cache header for empty results and a cache for 24 hours header for results with stuff in them or whatever. cache-control, expires, etags, pragma, etc. whatever you kids are using these days.
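
One way to picture the split described above, as a rough sketch rather than a decided policy: the function name cache_headers and the exact header values are assumptions, and only Cache-Control is shown.

```python
def cache_headers(results, max_age=24 * 60 * 60):
    """Pick caching headers for a response (hypothetical helper).

    Empty results get "no-cache", so caches revalidate before reusing them.
    Non-empty results may be cached for max_age seconds (24 hours by default).
    """
    if not results:
        return {"Cache-Control": "no-cache"}
    return {"Cache-Control": "public, max-age={}".format(max_age)}
```

A CDN or proxy that respects standard caching headers would then hold populated responses for a day and keep checking back on empty ones.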

🚨

Add heartbeat endpoints

Copied from mozilla/universal-search#1

The operations infrastructure is expecting some endpoints for each project to reply in a standardized fashion. This will be expected for all Test Pilot experiments as well as Test Pilot itself. They are handy for debugging too!

1. Respond to /__version__ with a JSON version object (moved to #41).
2. Respond to /__heartbeat__ with an HTTP 200, or a 5xx on error.
     • This endpoint is used by external monitoring to check the health of the service.
     • It is recommended to check the app's dependencies (database, caches, etc.) here.
3. Respond to /__lbheartbeat__ with an HTTP 200.
     • Used by the load balancer to check if the server and application are OK.
     • Do not include dependency checks, as this check may trigger automatic node termination and replacement.
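
The difference between the two heartbeat endpoints can be sketched as a bare WSGI app. This is only an illustration under assumptions: make_app and the check_dependencies callable are hypothetical names, 503 is picked as one possible 5xx status, and the real service would do this inside its own web framework.

```python
def make_app(check_dependencies):
    """Build a tiny WSGI app; check_dependencies() returns True when all is well."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == "/__lbheartbeat__":
            # No dependency checks here: a failure could get the node replaced.
            status = "200 OK"
        elif path == "/__heartbeat__":
            # External monitoring: report a 5xx when a dependency is unhealthy.
            status = "200 OK" if check_dependencies() else "503 Service Unavailable"
        else:
            status = "404 Not Found"
        start_response(status, [("Content-Length", "0")])
        return [b""]

    return app
```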

Discontinue use of Travis

  • Get CircleCI builds running (#40)
  • Remove Travis badge from README
  • Get coverage/coveralls working with CircleCI
  • Remove Travis webhook

Load testing

This issue was originally posted as mozilla/universal-search#120 on the universal-search repo and was later moved to the server repo.

@wresuolc said:

We talked about this in a meeting, but I forget which one.

Now that we have stage, we should run a load tester on it to get a rough idea of what kind of performance we can expect. Even if it's just returning nulls from memcache, it would be helpful to get a rough idea of requests/sec.

Ideally testing this won't cost us a fortune in API queries. :)

@chuckharmston replied:

I don't think we should consider this until we have some sort of upstream cache; either Varnish-like or a CDN.

We could avoid spending money on API queries by disallowing outbound HTTP requests from the worker server. All the jobs would fail, but as far as load testing the application server goes, it should be fine (still 1 memcached read per request).

Should we filter adult content?

We don't verify the user is above the age of consent, so I'm not sure we should be showing preview content (images/text) from adult sites in response to user keystrokes. I dunno. Maybe legal should be involved? cc @nchapman

Technically speaking, the BOSS API docs mention filters for 'porn' and 'hate', but only provide an example for 'porn':

https://yboss.yahooapis.com/ysearch/web?q={keywords}&format=xml&filter=-porn
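
For illustration, the example URL above could be assembled like this. The function name boss_search_url and the filter_porn flag are made up for the sketch; only the 'porn' filter is shown because that is the only one the docs give an example for, and any authentication the API requires is omitted.

```python
from urllib.parse import urlencode

BOSS_WEB_ENDPOINT = "https://yboss.yahooapis.com/ysearch/web"


def boss_search_url(keywords, filter_porn=True):
    """Build a BOSS web search URL, optionally adding the documented porn filter."""
    params = {"q": keywords, "format": "xml"}
    if filter_porn:
        # Mirrors the filter=-porn parameter from the documented example.
        params["filter"] = "-porn"
    return BOSS_WEB_ENDPOINT + "?" + urlencode(params)
```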

¯\_(ツ)_/¯

Web server creates but does not close redis connections

I got a PagerDuty alert. Investigating, I found thousands of open connections to the redis server:

> netstat -aln | grep 6379 | wc -l
28232

In the logs I see:

...
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 482, in _connect
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: sock.connect(socket_address)
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: OSError: [Errno 99] Cannot assign requested address
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: During handling of the above exception, another exception occurred:
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: Traceback (most recent call last):
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: File "/usr/local/lib/python3.5/site-packages/redis/client.py", line 572, in execute_command
Apr 08 04:01:59 ip-172-31-3-146 docker[2782]: connection.send_command(*args)
...

The important line in the log is: OSError: [Errno 99] Cannot assign requested address

This is an OS/kernel error that comes up when no more connections to an IP can be made. Most likely this is the result of the __heartbeat__ endpoint being monitored: it opens connections to redis but doesn't close them.
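
If that diagnosis is right, one common remedy is to create the client once per process and reuse it, rather than building a new one on every heartbeat request. The sketch below shows only the pattern: FakeRedisClient is a stand-in that counts instances, not the real redis library, and get_client and heartbeat are hypothetical names.

```python
import functools


class FakeRedisClient:
    """Stand-in for a real client; counts how many instances were created."""

    instances = 0

    def __init__(self):
        FakeRedisClient.instances += 1

    def ping(self):
        return True


@functools.lru_cache(maxsize=None)
def get_client():
    """Create the client on first use and return the same object afterwards."""
    return FakeRedisClient()


def heartbeat():
    # Reuses the shared client instead of opening a new connection per request.
    return get_client().ping()
```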

Remove Yahoo! BOSS Search references

TL;DR: Access to the BOSS APIs will continue until March 31, 2016.


Per discussions in #66.

I'm going to keep Yahoo around for now, since the plan is apparently to keep riding those coattails as long as we are able.

Looks like the BOSS service will be removed in ~2 weeks, according to https://developer.yahoo.com/boss/search/:

"At Yahoo, we’re always looking for ways to streamline and simplify products for our customers. With this focus in mind, we will discontinue the BOSS JSON Search API on March 31, 2016.

Access to the BOSS APIs will continue until March 31, 2016. Moving forward, customers leveraging the BOSS JSON Search API can instead use YPA, a Javascript Solution that provides algorithmic web results with search ads for publishers who manage their own search engine results pages (SERPs). Click here to apply or learn more about YPA, or if you are working with a Yahoo Partner Manager, they can help you explore your options."

Refactor to Dockerflow-compliant single-container application.

The original plan was to use a multiple-container Docker configuration and use one of AWS' services that allow multiple-container deployment. That's changed, and we've now been asked to use Dockerflow, which is a substantial departure. I've tried to reuse as much work as possible, but there have been enough gotchas that I've talked myself into starting over with the production architecture in mind.

This bug will track design efforts, and bugs resulting from it will be tracked in the Deployment-ready milestone.

Desired end state:

  • On successful builds on master, CI will build the application into a Docker container that is then published to Docker Hub and deployed.
  • The built container is Dockerflow-compliant.
  • Local development is as contributor-friendly as possible.

Proposal:

  • Refactor from a multiple-container configuration to a single-container one that only manages the application.
  • For local development, use a Compose configuration that manages data storage and task queues.
  • Eliminate nginx entirely; use uWSGI directly as a server, and use Flask's send-file for our few static files.

Feedback wanted: @mostlygeek @6a68

Unit Testing

We'd like (as-close-as-is-pragmatically-possible-to-) 100% unit testing coverage on all the code introduced by e2c278b before deploying for usage.

__heartbeat__ should return a body hinting at what's down

To help debug which part of a service is down, __heartbeat__ should return a hint about what's down.

For example:

  • memcache down
  • redis down
  • celery down

It would help us determine which part of the stack to start investigating.
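
A rough sketch of what such a body could look like, assuming each dependency has a zero-argument check that returns a truthy value when healthy. The name heartbeat_response, the JSON shape, and the 503 status are all assumptions, not an agreed format.

```python
import json


def heartbeat_response(checks):
    """Run each named check and report which dependencies are up.

    checks maps a name (e.g. "redis") to a callable; a check that raises
    is treated as down. Returns (status_code, json_body).
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True)
```

With a body like this, a failing response points straight at the component that needs investigating.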

Move flake8 to Travis; fix linting errors.

@mostlygeek found a bug in the Circle CI configuration; tests weren't actually being run, which means that linting failures weren't actually breaking the build. Once #70 is merged, do the following:

  • Move linting to Travis
  • Remove the excludes being added in #70
  • Fix any linting errors that have snuck through

What should RECOMMENDATION_ENV be set to in stage/prod

I'm seeing this in the logs:

Apr 15 00:18:00 ip-172-31-36-68 docker[2852]: [pid: 10|app: 0|req: 73/73] 127.0.0.1 () {32 vars in 395 bytes} [Fri Apr 15 00:18:00 2016] GET /__lbheartbeat__ => generated 0 bytes in 0 msecs (HTTP/1.0 200) 5 headers in 189 bytes (1 switches on core 0)
Apr 15 00:18:00 ip-172-31-36-68 docker[2852]: --------------------------------------------------------------------------------
Apr 15 00:18:00 ip-172-31-36-68 docker[2852]: INFO in middleware [/app/recommendation/mozlog/middleware.py:63]:
Apr 15 00:18:00 ip-172-31-36-68 docker[2852]: --------------------------------------------------------------------------------

Should I set RECOMMENDATION_ENV to prod or stage? Looking through the code, it seems that setting it to anything other than development will disable the debugging output.
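
The behaviour described above would look something like the sketch below. It is a guess at the pattern, not the project's actual code: the helper name debug_enabled and the "development" default for an unset variable are assumptions.

```python
import os


def debug_enabled(environ=os.environ):
    """Return True only when RECOMMENDATION_ENV is "development".

    Assumption: an unset variable is treated as "development".
    """
    env = environ.get("RECOMMENDATION_ENV", "development")
    return env == "development"
```

Under this reading, either prod or stage would silence the debugging output, so the choice between them would only matter where the code checks for one of those values specifically.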

Return favicon for more results

We currently only run Embedly for Wikipedia pages and top level directories/domains. Let's run it on everything so we can get access to all the favicons.
