search-engine's People

Contributors

lmffeexd, thesil, w32zhong, yzhan018

search-engine's Issues

Forced title search

I really like the search, but right now it's way too broad. It returns too many useless results.

One thing that would greatly help is a "forced title" setting to force the title to match the keywords and expressions.

What's nice about the loose math expression search is that variable names don't matter (for example, $y^2$ and $x^2$ yield the same results). However, this also causes problems: a search for $x^2$ also matches expressions like $\mathbb{R}^2$, which is unwanted. I can imagine that it is difficult to come up with a good rule set so that, for example, $Q^2$ (where $\mathbb{Q}^2$ is intended) matches $\mathbb{R}^2$ but $x^2$ doesn't.

Unexpected Mathjax error in SERP

Search results page

  • Entry number 8 has an error in two mathjax displays ("Missing close brace" and "Extra close brace or missing open brace" respectively).
  • the original post on MSE is free from any mathjax error though
  • A similar problem exists for entry 7 as well
  • I could not come up with any possible reason for this

Broken Mathjax of entry 8 copy-pasted for reference:

...Put it this way \int {\frac{x}{{\sqrt { ... t {\frac{{2ax + b}}{{\sqrt {a{x^2} + bx + c} }}dx}  - \frac{b}{{2a}}\int {\frac{{dx}}{{\sqrt {a{x^2} + bx + c} }}} 
\displaystyle \frac{c}{a} - \frac{{{b^2}}}{{4{a^2}}} < 0 =  -  ... 2}. 

2 FPD & 2 reflected xss vulnerability in web app

Reflected xss:

1- https://approach0.xyz/search/?q=test&p=1%22%3E%3Csvg/onload=alert(/test/)%3E
2- https://approach0.xyz/search/?q=%24test%22%3E%3Csvg/onload=alert(/test/)%3E&p=1
Fix: sanitize the q and p parameters at L149-150 of index.php, for example with htmlspecialchars($string, ENT_QUOTES, 'UTF-8');.

Full path disclosure:

1- https://approach0.xyz/demo/search-relay.php?p=1&q[]=test
2- https://approach0.xyz/demo/?q[]=test&p=1
Case 1 is caused by strlen($qry_str) at search-relay.php L45. To fix it, use is_scalar when checking $_GET['q'].
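The two fixes above can be illustrated outside PHP. The following is a minimal Python sketch, not the project's code: `escape_param` and `scalar_param` are hypothetical helper names, and `html.escape` stands in for `htmlspecialchars` while the `isinstance` check stands in for `is_scalar`.

```python
import html


def escape_param(value: str) -> str:
    """Escape &, <, > and both quote characters before echoing into HTML."""
    return html.escape(value, quote=True)


def scalar_param(params: dict, name: str, default: str = "") -> str:
    """Return params[name] only when it is a plain string.

    A request such as ?q[]=test yields a list instead of a string;
    falling back to the default avoids passing it to string functions.
    """
    value = params.get(name, default)
    return value if isinstance(value, str) else default


# The payload from the first reflected-XSS URL, already URL-decoded.
payload = '1"><svg/onload=alert(/test/)>'
print(escape_param(payload))
```

After escaping, the payload contains no raw `<`, `>` or `"`, so the browser renders it as text instead of parsing it as markup.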

Update MathJax library.

MathJax v3 is coming out, and I am thinking of updating the rendering library that A0 currently uses. This should give a better search experience, since rendering a page of search results currently takes too long. In my experience v3 is much faster than the previous version. KaTeX is also fast, but when I was choosing between the two as A0's search-result renderer I found that it does not handle as much content as robustly as MathJax.

Docker Swarm MPI SSH connection unstable

In the new infrastructure based on Docker Swarm, the SSH connection randomly breaks, which causes the search daemons to restart. Does anyone know how to narrow the problem down further?

blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  0] (in memo) prefix/VAR/BASE (pf=138766737, ipf=1.86)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  1] (in memo) prefix/NUM/SUPSCRIPT (pf=6633044, ipf=4.90)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  2] (on disk) prefix/VAR/BASE/HANGER (pf=9006427, ipf=4.60)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  3] (in memo) prefix/NUM/SUPSCRIPT/HANGER (pf=138766395, ipf=1.86)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  4] (on disk) prefix/VAR/BASE/HANGER/SIGN (pf=6569091, ipf=4.91)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  5] (on disk) prefix/VAR/BASE/HANGER/TIMES (pf=8928046, ipf=4.60)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  6] (on disk) prefix/NUM/SUPSCRIPT/HANGER/SIGN (pf=15976531, ipf=4.02)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [  7] (on disk) prefix/VAR/BASE/HANGER/SIGN/ADD (pf=30797021, ipf=3.37)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [  8] (on disk) prefix/VAR/BASE/HANGER/TIMES/SIGN (pf=3410477, ipf=5.57)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [  9] (on disk) prefix/NUM/SUPSCRIPT/HANGER/SIGN/ADD (pf=11096355, ipf=4.39)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [ 10] (on disk) prefix/VAR/BASE/HANGER/TIMES/SIGN/ADD (pf=9205922, ipf=4.57)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:merge time cost: 2411 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:merge time cost: 3240 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:Query handle cost: 3390 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:merge time cost: 4611 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:merge time cost: 13286 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:Query handle cost: 13932 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:Query handle cost: 13937 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:Query handle cost: 13984 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | Connection to blue-shard3 closed by remote host.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | --------------------------------------------------------------------------
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | ORTE was unable to reliably start one or more daemons.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | This usually is caused by:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * not finding the required libraries and/or binaries on
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   settings, or configure OMPI with --enable-orterun-prefix-by-default
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * lack of authority to execute on one or more specified nodes.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   Please verify your allocation and authorities.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   Please check with your sys admin to determine the correct location to use.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | *  compilation of the orted with dynamic libraries when static are required
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   (e.g., on Cray). Please check your configure cmd line and consider using
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   one of the contrib/platform definitions for your system type.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * an inability to create a connection back to mpirun due to a
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   lack of common network interfaces and/or no route found between
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   them. Please check network connectivity (including firewalls
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   and network routing requirements).
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | --------------------------------------------------------------------------
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:node[2] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:node[1] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:node[3] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:shutdown httpd...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:node[0] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stderr>:Caught signal: 15
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | + set +x

Follow up on the updated Mathquill fork.

I have built and pulled the forked version of Mathquill from @TheSil, see: 8c51528

Now the \lVert, \lvert, \langle and \rangle can be successfully entered.
And \dots is correctly handled after updating KaTeX.

Anyone is welcome to keep adding things to the list and send a pull request to the forked Mathquill.

@TheSil Thank you again for your help. Feel free to keep working on your forked repo and expand the "subs" there if you find that other things are missing from that list. I would be more than willing to build your source and merge it into the Approach0 online demo.

Server is down

The webpage says

Opps! Server is down right now, but will be back shortly. (return code #101)

How to search a formula locally?

@t-k- Dear sir, thanks a lot for your contribution. Since the notes aren't very detailed, I wonder: if I want to search for formulas in my local database, how can I use your code?

I will be grateful for your reply.

Add breadth-first directory traversal.

Currently we use the dir_search_podfs() function (podfs is short for post-order DFS) to search directories. We need to add a function (maybe named dir_search_bfs) that does BFS on directories. This should make search more effective, because we are more likely to get more results within a small depth delta.

relevant files: $PROJECT/dir-util/dir-util.[ch]
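The traversal order being requested can be sketched as follows. This is an illustration in Python, not the C code in dir-util: the name `dir_search_bfs` is taken from the suggestion above, while the `max_depth` parameter and the `(depth, path)` return shape are assumptions made for the example.

```python
import os
from collections import deque


def dir_search_bfs(root, max_depth=None):
    """Yield (depth, path) for every regular file under root, shallowest first.

    Files directly inside root have depth 0. If max_depth is given,
    directories deeper than max_depth are not visited.
    """
    queue = deque([(0, root)])
    while queue:
        depth, directory = queue.popleft()
        try:
            # Sort by name so the order is deterministic across file systems.
            entries = sorted(os.scandir(directory), key=lambda e: e.name)
        except OSError:
            continue  # unreadable directory: skip it and keep going
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                if max_depth is None or depth + 1 <= max_depth:
                    queue.append((depth + 1, entry.path))
            elif entry.is_file(follow_symlinks=False):
                yield depth, entry.path
```

Because subdirectories are queued rather than entered immediately, every file at depth d is reported before any file at depth d + 1, which is the property a depth-limited search relies on.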

AoPS crawler is not working.

I have not used the AoPS crawler for several months. However, today (since I am playing with Docker Swarm) I ran into an AoPS crawler issue:

./crawler-artofproblemsolving.com.py  -n 0 -o 3650 -c 3
WARNING: Couldn't write lextab module <module 'slimit.lextab' from '/usr/local/lib/python3.7/dist-packages/slimit/lextab.py'>. Won't overwrite existing lextab module
WARNING: yacc table file version is out of date
WARNING: Token 'BLOCK_COMMENT' defined, but not used
WARNING: Token 'CLASS' defined, but not used
WARNING: Token 'CONST' defined, but not used
WARNING: Token 'ENUM' defined, but not used
WARNING: Token 'EXPORT' defined, but not used
WARNING: Token 'EXTENDS' defined, but not used
WARNING: Token 'IMPORT' defined, but not used
WARNING: Token 'LINE_COMMENT' defined, but not used
WARNING: Token 'LINE_TERMINATOR' defined, but not used
WARNING: Token 'SUPER' defined, but not used
WARNING: There are 10 unused tokens
WARNING: Couldn't create <module 'slimit.yacctab' from '/usr/local/lib/python3.7/dist-packages/slimit/yacctab.py'>. Won't overwrite existing tabmodule
[curl] /community/
Traceback (most recent call last):
  File "./crawler-artofproblemsolving.com.py", line 483, in <module>
    main(sys.argv)
  File "./crawler-artofproblemsolving.com.py", line 466, in main
    extra_opt)
  File "./crawler-artofproblemsolving.com.py", line 350, in crawl_category_topics
    for category, topic, e in list_category_topics(category, newest, oldest, c):
  File "./crawler-artofproblemsolving.com.py", line 264, in list_category_topics
    session = parsed['AoPS.session']
TypeError: 'NoneType' object is not subscriptable

It looks like AoPS has changed its API. It could also be that my network is blocking AoPS; I have not tested that yet.

@TheSil If you get time, please help me and see if you can reproduce this issue, thanks!
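Whatever the root cause turns out to be, the traceback shows that `parsed` was `None` when it was indexed. A small guard would turn the `TypeError` into a clearer message. The sketch below is hypothetical: `extract_session` is not a function in the crawler, and only the `'AoPS.session'` key is taken from the traceback.

```python
def extract_session(parsed):
    """Return parsed['AoPS.session'] or raise an error that names the likely cause."""
    if parsed is None:
        # Nothing could be parsed from the page: the site layout may have
        # changed, or the request may have been blocked.
        raise RuntimeError(
            "no data parsed from the AoPS page "
            "(changed page layout or blocked request?)"
        )
    try:
        return parsed['AoPS.session']
    except KeyError as err:
        raise RuntimeError("parsed data has no 'AoPS.session' entry") from err
```

This does not fix the crawler, but it separates "nothing was parsed" from "the expected key is missing", which helps when trying to reproduce the problem.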

Questions about datasets

Dear author, excuse me, I have two questions:

  1. Where did you get the datasets? (topics.txt, NTCIR12_MathWiki-qrels_judge.dat, NTCIR12_latex_expressions.txt, all.dat in the google drive)
  2. What is the meaning of the datasets? (topics.txt, NTCIR12_MathWiki-qrels_judge.dat, NTCIR12_latex_expressions.txt, all.dat in the google drive)
    Thanks for any help!

Parsing optree to python program

Excuse me, if I want to pass the operator tree generated by tex-parser to my Python program for use in other applications, how should I do it? Maybe I can use the preorder and inorder outputs to reconstruct the original tree in Python, but how should I revise the code to get the inorder output? Thanks for your help.
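For reference, the reconstruction mentioned in the question looks like this in Python. It is a sketch under two assumptions that may not hold for tex-parser output: the tree is binary, and every node label is unique. An operator tree with n-ary nodes or repeated symbols cannot be rebuilt from the two traversals alone. `Node` and `build_tree` are names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(preorder: List[str], inorder: List[str]) -> Optional[Node]:
    """Rebuild a binary tree with unique labels from its two traversals."""
    position = {label: i for i, label in enumerate(inorder)}
    next_root = 0  # index of the next unread label in preorder

    def build(lo: int, hi: int) -> Optional[Node]:
        nonlocal next_root
        if lo > hi:
            return None
        label = preorder[next_root]
        next_root += 1
        mid = position[label]  # labels left of mid belong to the left subtree
        node = Node(label)
        node.left = build(lo, mid - 1)
        node.right = build(mid + 1, hi)
        return node

    return build(0, len(inorder) - 1)


# a + b * c, with "+" at the root and "*" as its right child
tree = build_tree(["+", "a", "*", "b", "c"], ["a", "+", "b", "*", "c"])
print(tree.label, tree.left.label, tree.right.label)
```

If the tree is not binary, a format that records the structure directly (for example nested lists or one line per node with its parent) is easier to read back than a pair of traversals.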

Filtering results by source

I thought I would create an issue for this. Sometimes it would be nice to filter which sources I want to display (some other MSE users expressed this request).

[easy] server PHP client IP log

In the server-side PHP script, log the client IP for later traffic analysis (an IP-blocking function based on this may be added in the future).

relevant files: demo/web/search-relay.php

Adding approach0.xyz in the HSTS preload list for (albeit small) faster loading time

Without HSTS, or without the domain being preloaded, a first-time visitor has to connect over plain HTTP and be redirected before connecting again over HTTPS, something that won't happen if the domain is in the preload list. Right now there are problems that prevent it from being added to the HSTS preload list (which is included in all major browsers, from Chromium to Firefox):

https://securityheaders.io/?q=approach0.xyz&followRedirects=on
https://hstspreload.org/?domain=approach0.xyz

Error 1: No HSTS header is present on the response.

The website doesn't have HSTS to begin with :]

Error 2: Too many redirects: There are more than 3 redirects starting from http://approach0.xyz.

Error 3: Insecure redirect: http://approach0.xyz redirects to an insecure page on redirect #2: http://approach0.xyz/search

Error 4: Insecure redirect https://approach0.xyz redirects to an insecure page: http://approach0.xyz/search

Even solving these other redirect errors will result in faster speeds.
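For Error 1, the preload list expects a Strict-Transport-Security header with a max-age of at least one year plus the includeSubDomains and preload directives. The following Python sketch checks a header value against those three conditions; `parse_hsts` and `preload_ready` are hypothetical helper names, and the check does not cover the redirect requirements from Errors 2 to 4.

```python
def parse_hsts(header: str) -> dict:
    """Split a Strict-Transport-Security value into lower-cased directives."""
    directives = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


def preload_ready(header: str, min_age: int = 31536000) -> bool:
    """True if the header has max-age >= min_age, includeSubDomains and preload."""
    directives = parse_hsts(header)
    try:
        max_age = int(directives.get("max-age", ""))
    except ValueError:
        return False  # max-age missing or not a number
    return (
        max_age >= min_age
        and "includesubdomains" in directives
        and "preload" in directives
    )


print(preload_ready("max-age=63072000; includeSubDomains; preload"))
```

A value such as `max-age=63072000; includeSubDomains; preload` passes the check, while a response with no header at all (the current situation) does not.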

Parentheses

The search engine can't associate e.g. this:
\arcsin{\frac{a}{b}}
with this:
\arcsin{\left(\frac{a}{b}\right)}

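A small suggestion

C library functions called from the base libraries could be replaced with simple wrapper functions. For example:
Take printf: if the library calls printf frequently, I suggest replacing it with something like the following,
static inline int lib_printf(const char *fmt, ...) { .... printf( some string ); .... }
The benefit is that when someone else uses the library, they can customize whether output is produced.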

cannot visit resource webpage

Hello! I was trying to access http://approach0.xyz/ecir2019/ to reproduce the ecir2019 NTCIR math formula retrieval results.

When I was trying to access the above webpage, I got a "502 bad gateway" error.

Can I download the resources elsewhere, especially the corpus.txt file, to be able to run the indexer?

Thanks!

Suggestion

I'd recommend having some ability to force searches to return only pages where certain words are found. For instance, when I searched for

Dieudonne absolutely convergent series

I only want to see results that have the word "dieudonne" in them because I'm specifically looking for an explanation of one of his theorems--but I get a lot of search results that don't have him. Having good support for exact and approximate matches, "and", "or", and other sorts of advanced search features would be helpful.

Thanks for developing this!

Best wishes,
Adam
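The "must contain" behaviour requested above can be pictured as a post-filter over result text. This Python sketch is only an illustration: it treats each hit as a plain string, and `has_required_terms` and `filter_hits` are made-up names, not part of the search engine.

```python
import re


def has_required_terms(text: str, required: list) -> bool:
    """True if every required word occurs in text (case-insensitive, whole words)."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term.lower() in words for term in required)


def filter_hits(hits: list, required: list) -> list:
    """Keep only the hits that contain all of the required words."""
    return [hit for hit in hits if has_required_terms(hit, required)]


hits = [
    "Dieudonne's theorem on absolutely convergent series",
    "Absolutely convergent series in Banach spaces",
]
print(filter_hits(hits, ["dieudonne"]))
```

Filtering after retrieval is the simplest way to show the idea; a real implementation would more likely apply the constraint while merging posting lists so that pages of results stay full.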

Redirect

Visiting http:\approach.xyz doesn't redirect to https:\approach0.xyz\search; instead, Chrome downloads a file called "download" with the following content:

<?php
header("HTTP/1.1 301 Moved Permanently");
header("Location: https://approach0.xyz/search/");
?>

In IE, it just displays a page with the plain text above.

On demo page, add buttons for user to choose math symbols.

Even though we provide users with a mathquill math edit box, users who do not know TeX still need a "table" in which to look up symbols and functions, and we will make these symbols clickable:

When a user clicks a symbol or function in the table, it is sent to the query box.

relevant files: demo/web/index.html

Inside a docker container, a0 eats string following some utf-8 characters.

With the Debian debian:buster image, A0 eats the part of the string that follows some UTF-8 characters.

Example:

# docker run -it -p 8921:8921 -v `pwd`/../indexerd/tmp:/mnt/index a0 searchd.out -i /mnt/index -c0 -C0

$ curl -X POST http://localhost:8921/search --header "Content-Type: application/json" -d '{"ip":"127.0.0.1","page":1,"kw":[{"type":"tex","str":"1+2+\u2026+100"}]}'

Output:

[inverted lists]
[0] (level 2)   9.90 `1+2+' (TeX, upp=9.90, th=0.77)
        [  0] (on disk) prefix/ONE/SIGN (pf=4, ipf=4.90)
        [  1] (on disk) prefix/NUM/SIGN (pf=8, ipf=4.20)
        [  2] (on disk) prefix/ONE/SIGN/ADD (pf=4, ipf=4.90)
        [  3] (on disk) prefix/NUM/SIGN/ADD (pf=8, ipf=4.20)

Here the TeX string is wrong: it is expected to be `1+2+…+100', which is the actual behavior outside the container.
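To rule out the request itself, the JSON body from the curl command can be decoded independently. The Python sketch below shows what a conforming JSON parser produces for the `str` field and where the first non-ASCII byte falls in its UTF-8 encoding; `decoded_tex` is a helper name made up for this example.

```python
import json

# Same body as in the curl command; the raw string keeps \u2026 as a JSON escape.
RAW_BODY = r'{"ip":"127.0.0.1","page":1,"kw":[{"type":"tex","str":"1+2+\u2026+100"}]}'


def decoded_tex(body: str) -> str:
    """Return the TeX string of the first keyword in a search request body."""
    return json.loads(body)["kw"][0]["str"]


tex = decoded_tex(RAW_BODY)
encoded = tex.encode("utf-8")
# Index of the first byte outside the ASCII range.
first_non_ascii = next(i for i, byte in enumerate(encoded) if byte >= 0x80)

print(ascii(tex))                 # the decoded string, shown with escapes
print(encoded)                    # its UTF-8 bytes
print(encoded[:first_non_ascii])  # the ASCII prefix before the ellipsis
```

The ASCII prefix is exactly the `1+2+` seen in the container output, so the string appears to be cut at the first multi-byte character. That would be consistent with a locale or character-handling difference inside the image, but this is a guess that the output above does not confirm.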
