Comments (9)

matiskay commented on May 30, 2024

Hi @thouger,

Thanks for using the package.

Similarity between http://bbs.gfan.com/forum-22-1.html and http://bbs.gfan.com/android-9172442-2-1.html

You should not rely on structural_similarity here, because the number of HTML tags in a forum page can grow dramatically: users can add p, a and font tags as they need. You can still use it, but you should give more weight to the style similarity (1 - k = 0.7, i.e. k = 0.3). My suggestion is to use k = 0.3 and set a low threshold.

In [35]: from html_similarity import similarity, style_similarity, structural_similarity

In [36]: structural_similarity(html_1, html_3)
Out[36]: 0.1694300518134715

In [37]: style_similarity(html_1, html_3)
Out[37]: 0.4748603351955307

In [55]: from html_similarity.style_similarity import get_classes

In [56]: class_html_1 = get_classes(html_1)

In [57]: len(class_html_1)
Out[57]: 146

In [58]: class_html_3 = get_classes(html_3)

In [59]: len(class_html_3)
Out[59]: 118

In [60]: len(class_html_1 & class_html_3)
Out[60]: 85
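
The counts above are enough to reproduce the style score by hand, assuming style_similarity is the Jaccard index of the two class sets. That is an inference from these numbers, not something stated in this thread, and `jaccard_from_counts` is a hypothetical helper that is not part of html_similarity.

```python
def jaccard_from_counts(size_a: int, size_b: int, size_common: int) -> float:
    """Jaccard index |A & B| / |A | B| computed from set sizes alone."""
    size_union = size_a + size_b - size_common  # inclusion-exclusion
    if size_union == 0:
        return 0.0  # two empty sets: treated as no similarity (assumption)
    return size_common / size_union


# Counts taken from the session above: 146 and 118 classes, 85 shared.
style_score = jaccard_from_counts(146, 118, 85)
print(round(style_score, 4))  # 85 / 179, rounds to 0.4749
```

Under that assumption the result agrees with the 0.4748... value returned by style_similarity(html_1, html_3).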

Similarity between http://bbs.gfan.com/android-9172442-2-1.html and http://bbs.gfan.com/android-9161132-1-1.html

The similarity should work for the second case:

In [11]: from html_similarity import similarity

In [12]: html_1 = open('android-9161132-1-1.html').read()

In [13]: html_2 = open('android-9172442-2-1.html').read()

In [14]: similarity(html_1, html_2)
Out[14]: 0.6515255079848381

I got 0.65. As far as I can see from the web page, the forum allows users to add their own content (multiple font and p elements, which make the structure differ). In this case I suggest putting less weight on the structure.

Using k=0.3 I got:

In [26]: similarity(html_1, html_2, 0.3)
Out[26]: 0.7067047784751134
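
To see why lowering k raises the score, here is a minimal sketch that assumes the combined score is the linear blend k * structural + (1 - k) * style. That is my understanding of how similarity weights its two parts, so treat it as an assumption; `blend` is a hypothetical helper, not the package's function. The inputs are the two scores reported for html_1 and html_3 in the first part.

```python
def blend(structural: float, style: float, k: float = 0.5) -> float:
    """Weighted mix of the two scores; k is the weight on structure."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


# Scores reported earlier for html_1 vs html_3.
structural, style = 0.1694300518134715, 0.4748603351955307

for k in (0.5, 0.3):
    print(f"k={k}: {blend(structural, style, k):.4f}")
```

Because the style score is the larger of the two, shifting weight away from structure moves the combined score up.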

I hope it helps. Let me know if you have any other questions or doubts.

from html-similarity.

thouger commented on May 30, 2024

I am very grateful for your answer!
Part 1 worked for me, but I didn't get 0.6515255079848381 for part 2. I suspected that my HTML source for html_1 and html_2 differs from yours, so I tried the following.
I fetched the HTML source with requests, urllib.request and phantomjs, and also by saving the page as HTML from the browser, but these are the scores I got:
requests

similarity(h1,h2)
0.5741937488348972

urllib.request

similarity(h3.decode('utf-8'),h4.decode('utf-8'))
0.5741937488348972

phantomjs

similarity(h9,h10)
0.5900613016825356

page saved as HTML from the browser

similarity(h5,h6)
0.5897597237261845

None of them is as high as your result, so I would like to know how you open the URL and save the page. I think that is the key.

matiskay commented on May 30, 2024

I'm using Python 3.6 and I use wget to download the pages. Bear in mind that the website may be sending additional information to me because I'm located in South America. For the last part you can use a threshold of 0.55 to consider two pages similar.
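
As a rough illustration of that threshold, the sketch below applies 0.55 to the four scores reported earlier in this thread. `is_similar` is a hypothetical helper, not part of the package.

```python
def is_similar(score: float, threshold: float = 0.55) -> bool:
    """Treat two pages as similar when the score reaches the threshold."""
    return score >= threshold


# Scores reported above for the same page pair fetched four different ways.
scores = {
    "requests": 0.5741937488348972,
    "urllib.request": 0.5741937488348972,
    "phantomjs": 0.5900613016825356,
    "browser save": 0.5897597237261845,
}

for method, score in scores.items():
    print(method, is_similar(score))
```

All four scores sit above 0.55, so with this threshold the pages count as similar whichever way they were downloaded.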

matiskay commented on May 30, 2024

@thouger, here is the html that I'm using https://www.dropbox.com/sh/6p0f4e9k9ldei6j/AABTb-ApCNfq6cdcWVHMAx2ca?dl=0

thouger commented on May 30, 2024

In the end I got the same result as you. Thank you for your help; I think I was fetching the pages the wrong way.
As for South America, I think that is fine: I am not in a hurry, and the encouragement you have given me makes me want to keep following this.
Actually, I am writing my bachelor's thesis, titled "form data extraction". Before I saw this package I was learning simple tree matching (STM). When I saw this package I thought: it does an excellent job and is much easier than STM! I think the package may help me finish my paper, so I hope I can share its idea and analyze its structure in my thesis. I will clearly credit your name and where the package comes from.
Three years ago, a student who presented his bachelor's thesis at graduation got me to love data extraction. He is in the same major as me, three years ahead. Now that I am preparing to graduate, I hope my paper can get more people to love this field.
I would be very grateful if you agree.

matiskay commented on May 30, 2024

Cool. This package uses a heuristic to measure the structural similarity. I will do some experiments on my own today and check if I find something.

Great to hear about form data extraction. That sounds really interesting. I would like to know what strategy you are planning to use for the form data extraction.

Note: There is a package called Formasaurus which extracts forms from web pages. Formasaurus classifies which kind of form is in a web page (login, signup, search, mailing list, etc.).

import formasaurus
import requests

# extract_forms expects the HTML text, not the Response object
html = requests.get('http://github.com/').text
formasaurus.extract_forms(html)

Formasaurus uses logistic regression to make the classification, using the following features:

  • Form method (POST/GET)
  • Text of the submit button
  • Names of the CSS classes and ids
  • Labels of the inputs
  • Presence of certain strings in the URL

You can read more about it in: http://formasaurus.readthedocs.io/en/latest/

thouger commented on May 30, 2024

I am very sorry (;´д`)ゞ, I left out a letter: it is forum, not form. In my paper I extract the next-page URL, forum user data, forum post data and post URLs.
I would be very happy to share my strategy with you, but I can't show it in GitHub issues because the original paper I refer to is 88 pages long. We could use another chat tool instead. I am also very interested in your other repositories, so I will spend some time looking at them.
Thank you for https://github.com/TeamHG-Memex/Formasaurus, I had never heard of it before. I hope you do not mind my mistake with the missing u. Now I am going to sleep.

matiskay commented on May 30, 2024

Hi @thouger, sorry for the delay. I will run the experiments over the weekend. I was busy these days.

I have some ideas to solve your problem:

Have fun writing your paper. I like the problem, so feel free to ask any questions. My email is my-github-username AT gmail DOT com ;)

thouger commented on May 30, 2024

Thanks for your help! I thought you would never reply to me. You gave me so much information that I must spend some time analyzing it.

  1. Using a classifier to detect forums is something I had never thought of. After a shallow search I found that naive Bayes is amazing. It is profoundly significant that the world is uncertain, because human observation has limitations. Naive Bayes makes hypotheses about what we can't see based on what we can see. I'm sorry, my English ability is limited (;´д`)ゞ(;´д`)ゞ
  2. The second point is the focus of my research, but I have made another discovery that is specific to forums. I will share it with you after I complete it.
  3. For the third point, I found the paper (but it is written in Chinese!) and I have already implemented it in Java, but I'm still curious what else can be done with autopager.

It is wonderful to meet you while I am doing related research. I am fascinated by what you gave me above. I will study it next.
