i had to compare the url is:[http://bbs.gfan.com/forum-22-1.html,http://bbs.gfan.com/a

but i think sometime it not similar about html-similarity HOT 9 CLOSED

matiskay commented on May 30, 2024

but i think sometime it not similar

from html-similarity.

Comments (9)

matiskay commented on May 30, 2024

Hi @thouger,

Thanks for using the package.

Similarity between http://bbs.gfan.com/forum-22-1.html and http://bbs.gfan.com/android-9172442-2-1.html

You should not rely on structural_similarity here: the number of HTML tags in a forum page can grow dramatically, because users can add p, a, and font tags as they need. You can still use it, but you should give more weight to the style similarity (1 - k = 0.7, i.e. k = 0.3). My suggestion is to use k = 0.3 and set a low threshold.

In [35]: from html_similarity import similarity, style_similarity, structural_similarity

In [36]: structural_similarity(html_1, html_3)
Out[36]: 0.1694300518134715

In [37]: style_similarity(html_1, html_3)
Out[37]: 0.4748603351955307

In [55]: from html_similarity.style_similarity import get_classes

In [56]: class_html_1 = get_classes(html_1)

In [57]: len(class_html_1)
Out[57]: 146

In [58]: class_html_3 = get_classes(html_3)

In [59]: len(class_html_3)
Out[59]: 118

In [60]: len(class_html_1 & class_html_3)
Out[60]: 85
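
A quick check on the numbers above: 146 and 118 classes with 85 in common give a Jaccard index that matches Out[37], which suggests (but does not prove) that style_similarity is the Jaccard similarity of the two class sets. A minimal sketch of that calculation follows; the helper name `jaccard_from_counts` is made up for illustration and is not part of the package.

```python
def jaccard_from_counts(size_a, size_b, size_common):
    """Jaccard index |A & B| / |A | B|, computed from the set sizes."""
    union = size_a + size_b - size_common
    return size_common / union if union else 0.0


# Sizes taken from Out[57], Out[59] and Out[60] above.
score = jaccard_from_counts(146, 118, 85)
print(round(score, 4))  # 0.4749, in line with Out[37]
```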

Similarity between http://bbs.gfan.com/android-9172442-2-1.html and http://bbs.gfan.com/android-9161132-1-1.html

The similarity should work for the second case:

In [11]: from html_similarity import similarity

In [12]: html_1 = open('android-9161132-1-1.html').read()

In [13]: html_2 = open('android-9172442-2-1.html').read()

In [14]: similarity(html_1, html_2)
Out[14]: 0.6515255079848381

I got 0.65. As far as I can see from the web page, the forum allows users to add their own content (multiple font and p elements, which make the structure differ). In this case I suggest putting less weight on the structure.

Using k=0.3 I got:

In [26]: similarity(html_1, html_2, 0.3)
Out[26]: 0.7067047784751134
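
To see why a lower k raises the score here, it helps to assume the combined score is the linear blend k * structural + (1 - k) * style, which matches the "1 - k = 0.7" weighting mentioned earlier, and that the default k is 0.5. Both are assumptions about the package, not something shown in this thread. Under them, the two outputs above are enough to recover the component scores for this pair of pages. The function names below are hypothetical.

```python
def blend(structural, style, k=0.5):
    """Assumed combination rule: k weights structure, (1 - k) weights style."""
    return k * structural + (1 - k) * style


def recover_components(score_a, k_a, score_b, k_b):
    """Solve the two blend equations for (structural, style)."""
    det = k_a - k_b  # determinant of the 2x2 system; zero if both k are equal
    structural = (score_a * (1 - k_b) - score_b * (1 - k_a)) / det
    style = (k_a * score_b - k_b * score_a) / det
    return structural, style


# Scores from Out[14] (default k, assumed to be 0.5) and Out[26] (k=0.3).
structural, style = recover_components(0.6515255079848381, 0.5,
                                       0.7067047784751134, 0.3)
print(round(structural, 4), round(style, 4))  # 0.5136 0.7895

# Lowering k moves the result toward the (higher) style score.
for k in (0.7, 0.5, 0.3):
    print(k, round(blend(structural, style, k), 4))
```

Under these assumptions the style score (about 0.79) is well above the structural score (about 0.51), so shifting weight toward style lifts the combined value.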

I hope it helps. Let me know if you have any other questions or doubts.

thouger commented on May 30, 2024

I'm very grateful for your answer!
Part 1 worked for me, but I didn't get 0.6515255079848381 for part 2. I suspected the difference was in how you obtained html_1 and html_2, so I tried several ways.
I used requests, urllib.request, PhantomJS, and the browser's "Save page as HTML" to get the HTML source, and got these results:
requests

similarity(h1,h2)
0.5741937488348972

urllib.request

similarity(h3.decode('utf-8'),h4.decode('utf-8'))
0.5741937488348972

phantomjs

similarity(h9,h10)
0.5900613016825356

Browser "Save page as HTML"

similarity(h5,h6)
0.5897597237261845

But I didn't get a score as high as yours, so I'd like to know how you open the URL and save the page. I think that's the key.

matiskay commented on May 30, 2024

I'm using Python 3.6, and I use wget to download the pages. Bear in mind that the website may be sending additional information to me because I'm located in South America. For the last part, you can use a threshold of 0.55 to consider two pages similar.
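
As a rough illustration of the threshold idea, the sketch below applies 0.55 to the scores reported in this thread for the same pair of pages, fetched in different ways. The helper name `is_similar` is hypothetical, and the threshold is only the value suggested here, not a general recommendation.

```python
THRESHOLD = 0.55  # value suggested above; tune it for your own pages

# Scores reported in this thread for the same pair of pages.
scores = {
    "requests": 0.5741937488348972,
    "urllib.request": 0.5741937488348972,
    "phantomjs": 0.5900613016825356,
    "browser save": 0.5897597237261845,
    "wget": 0.6515255079848381,
}


def is_similar(score, threshold=THRESHOLD):
    """Treat two pages as similar when the score reaches the threshold."""
    return score >= threshold


for method, score in scores.items():
    print(f"{method:15s} {score:.4f} similar={is_similar(score)}")
```

With this cutoff every download method gives the same yes/no answer, even though the raw scores differ.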

matiskay commented on May 30, 2024

@thouger, here is the html that I'm using https://www.dropbox.com/sh/6p0f4e9k9ldei6j/AABTb-ApCNfq6cdcWVHMAx2ca?dl=0

thouger commented on May 30, 2024

At last I got the same answer as you. Thank you for your help. I think I was doing it the wrong way.
As for South America, I think it's OK: I'm not in a hurry, and the hope you gave me keeps me following along.
Actually, I'm writing my bachelor's thesis, which is titled 《form data extraction》. Before I saw this package, I was learning simple tree matching (STM). When I saw this package, I thought: it does an excellent job, and in an easier way than STM! I think the package may help me finish my paper. So I hope I can share the idea and analyze its structure in my bachelor's thesis. I will clearly credit your name and where the package comes from.
Three years ago, a student in my major, three years ahead of me, presented his bachelor's thesis when he graduated, and that made me fall in love with data extraction. Now that I'm preparing to graduate, I hope my paper can lead more people to love it too.
I would be very grateful if you agree.

matiskay commented on May 30, 2024

Cool. This package uses a heuristic to measure the structural similarity. I will do some experiments on my own today and check if I find something.

Great to hear about form data extraction. That sounds really interesting. I would like to know what strategy you are planning to use for the form data extraction.

Note: There is a package called Formasaurus which extracts forms from web pages. Formasaurus classifies which type of form is in a web page (login, signup, search, mailing list, etc.).

import formasaurus
import requests
# extract_forms expects the HTML text, not the Response object
html = requests.get('http://github.com/').text
formasaurus.extract_forms(html)

Formasaurus uses logistic regression to make the classification, using the following features:

Form method (POST/GET)
Text of the submit button
Names of the CSS classes and ids
Labels of the inputs
Presence of certain strings in the URL

You can read more about it in: http://formasaurus.readthedocs.io/en/latest/
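
To make the feature list more concrete, here is a minimal sketch that pulls a few of those signals out of a form using only the standard library. It is not Formasaurus's own code: the class name `FormFeatureParser` and the feature dictionary layout are invented for illustration, URL features are left out, and features from multiple forms in one page are simply merged.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect simple features from the <form> elements in an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.in_label = False
        self.features = {"method": "get", "submit_text": [], "css": [], "labels": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            self.features["method"] = (attrs.get("method") or "get").lower()
        if not self.in_form:
            return
        # Class names and ids of the form and of everything inside it.
        self.features["css"] += (attrs.get("class") or "").split()
        if attrs.get("id"):
            self.features["css"].append(attrs["id"])
        if tag == "input" and attrs.get("type") == "submit":
            self.features["submit_text"].append(attrs.get("value") or "")
        if tag == "label":
            self.in_label = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        if tag == "label":
            self.in_label = False

    def handle_data(self, data):
        # Only text inside a <label> within a form counts as a label.
        if self.in_form and self.in_label and data.strip():
            self.features["labels"].append(data.strip())


sample = ('<form method="POST" class="auth login" id="f1">'
          '<label>Email</label><input name="e">'
          '<input type="submit" value="Sign in"></form>')
parser = FormFeatureParser()
parser.feed(sample)
print(parser.features)
```

A dictionary like this could then be turned into a feature vector for a classifier such as logistic regression.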

thouger commented on May 30, 2024

I am very sorry (；´д｀)ゞ because I left out a letter: it's forum, not form. In my paper, I extract the URL of the next page, forum user data, forum post data, and post URLs.
I will be very happy to share my strategy with you. But I can't show it in GitHub issues because the original paper I refer to is 88 pages long. We could also use another chat tool. I'm also very interested in your other repositories, so I will spend some time looking at them.
Thank you for https://github.com/TeamHG-Memex/Formasaurus, I had never heard of it before. I hope you don't mind my mistake of leaving out the "u". Now I'm going to sleep.

matiskay commented on May 30, 2024

Hi @thouger, sorry for the delay. I will run the experiments over the weekend. I was busy these days.

I got some ideas to solve your problem:

1. I would create a classifier to detect whether a page is a forum or not. You can use a simple Naive Bayes classifier. The simplest way is to look for keywords in the HTML: https://github.com/Sotera/webpageclassifier/blob/master/Keywords/forum.txt.
2. Detect the list of forum entries with https://github.com/scrapinghub/mdr or a similar tool.
3. Detect and classify pagination links with https://github.com/TeamHG-Memex/autopager.
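
The keyword approach from the first idea can be sketched in a few lines. This is a plain keyword-overlap heuristic, not a Naive Bayes classifier, and the keyword set below is a small made-up sample rather than the contents of the linked forum.txt. English keywords would also not match a Chinese forum such as bbs.gfan.com, so the list would need to be adapted.

```python
import re

# Hypothetical keyword sample for illustration only.
FORUM_KEYWORDS = {"forum", "thread", "reply", "replies",
                  "post", "posts", "topic", "board"}


def forum_keyword_score(html):
    """Fraction of the keywords that occur at least once in the page text."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # crude tag stripping
    words = set(re.findall(r"[a-z]+", text))
    return len(FORUM_KEYWORDS & words) / len(FORUM_KEYWORDS)


def looks_like_forum(html, threshold=0.25):
    """Flag a page as a forum when enough distinct keywords are present."""
    return forum_keyword_score(html) >= threshold


page = ("<html><body><h1>Android forum</h1>"
        "<a>New thread</a><p>12 replies</p></body></html>")
print(forum_keyword_score(page), looks_like_forum(page))
```

The same word counts could later serve as input features if you move on to a trained classifier.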

Have fun writing your paper. I like the problem, so feel free to ask any questions. My email is my-github-username AT gmail DOT com ;)

thouger commented on May 30, 2024

Thanks for your help! I thought you would never reply to me. You gave me so much information that I must spend some time analyzing it.

About the classifier to detect forums: I had never thought of that. After a quick search, I find Naive Bayes amazing. It has profound significance: the world is uncertain because human observation has limitations, and Naive Bayes forms hypotheses about what we can't see from what we can see. I'm sorry my English is limited (；´д｀)ゞ(；´д｀)ゞ
The second point is the focus of my research, but I have made another discovery that is specific to forums. I will share it with you after I finish.
About the third point: I found the paper (but it is written in Chinese!) and I have already implemented it in Java, but I'm still curious what else can be done with autopager.

It is wonderful to meet you while I'm doing related research. I am fascinated by what you gave me. I will study it next.
