Comments (9)
Hi @thouger,
Thanks for using the package.
Similarity between http://bbs.gfan.com/forum-22-1.html and http://bbs.gfan.com/android-9172442-2-1.html
You should not rely on structural_similarity alone here, because the number of HTML tags in a forum page can grow dramatically: users can add p, a, and font tags as they need. You can still use it, but you should give more weight to the style similarity (1 - k = 0.7) than to the structural similarity (k = 0.3). My suggestion is to use k = 0.3 and set a low threshold.
In [35]: from html_similarity import similarity, style_similarity, structural_similarity
In [36]: structural_similarity(html_1, html_3)
Out[36]: 0.1694300518134715
In [37]: style_similarity(html_1, html_3)
Out[37]: 0.4748603351955307
In [55]: from html_similarity.style_similarity import get_classes
In [56]: class_html_1 = get_classes(html_1)
In [57]: len(class_html_1)
Out[57]: 146
In [58]: class_html_3 = get_classes(html_3)
In [59]: len(class_html_3)
Out[59]: 118
In [60]: len(class_html_1 & class_html_3)
Out[60]: 85
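The counts above are consistent with style_similarity being the Jaccard index of the two class sets (shared classes divided by all distinct classes). That is an assumption inferred from the numbers in this thread, not a statement about the package internals. A minimal sketch, using a hypothetical helper name and synthetic sets that only reproduce the sizes shown above:

```python
def jaccard_index(set_a, set_b):
    """Return |A & B| / |A | B|, or 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Synthetic stand-ins for the real class sets: 146 and 118 items, 85 shared.
classes_1 = set(range(146))
classes_3 = set(range(61, 179))

score = jaccard_index(classes_1, classes_3)
print(round(score, 4))
```

With 85 shared classes out of 146 + 118 - 85 = 179 distinct ones, the ratio is 85 / 179, roughly 0.4749, which agrees with the style_similarity value reported above.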
Similarity between http://bbs.gfan.com/android-9172442-2-1.html and http://bbs.gfan.com/android-9161132-1-1.html
The similarity should work for the second case:
In [11]: from html_similarity import similarity
In [12]: html_1 = open('android-9161132-1-1.html').read()
In [13]: html_2 = open('android-9172442-2-1.html').read()
In [14]: similarity(html_1, html_2)
Out[14]: 0.6515255079848381
I got 0.65. As far as I can see from the web page, the forum allows users to add their own content (multiple font and p elements, which make the structure differ). In this case I suggest putting less weight on the structure.
Using k=0.3, I got:
In [26]: similarity(html_1, html_2, 0.3)
Out[26]: 0.7067047784751134
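To make the role of k concrete, here is a small sketch of a weighted combination in which k weights the structural score and 1 - k weights the style score. The function name is hypothetical and the formula is an assumption based on how k is described in this thread. The inputs are the structural and style scores reported for the first pair of pages.

```python
def weighted_similarity(structural, style, k=0.5):
    """Blend two scores: k weights structure, (1 - k) weights style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


# Scores reported earlier in the thread for the first pair of pages.
structural = 0.1694300518134715
style = 0.4748603351955307

balanced = weighted_similarity(structural, style, k=0.5)
style_heavy = weighted_similarity(structural, style, k=0.3)
print(round(balanced, 4), round(style_heavy, 4))
```

Lowering k moves the result toward the style score, which is why k = 0.3 gives a higher value whenever the style score is the larger of the two.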
I hope it helps. Let me know if you have any other questions or doubts.
from html-similarity.
I am very grateful for your answer!
I succeeded with part 1, but I didn't get 0.6515255079848381 in part 2. I suspected the difference was in how you obtained the source for html_1 and html_2, so I tried the following:
I used requests, urllib.request, phantomjs, and the browser's "save page as HTML" option to get the HTML source, and got these results:
requests
similarity(h1,h2)
0.5741937488348972
urllib.request
similarity(h3.decode('utf-8'),h4.decode('utf-8'))
0.5741937488348972
phantomjs
similarity(h9,h10)
0.5900613016825356
Page saved as HTML from the browser
similarity(h5,h6)
0.5897597237261845
But I didn't get a value as high as yours, so I would like to know how you open the URL and save the page. I think that is the key.
from html-similarity.
I'm using Python 3.6 and I use wget to download the pages. Bear in mind that the website may be sending additional information to me because I'm located in South America. For the last part you can use a threshold of 0.55 to consider two pages similar.
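As a small illustration of the threshold suggestion, the sketch below applies a 0.55 cut-off to the scores reported in this thread for the different download methods. The helper name is hypothetical, and the threshold is only the value suggested here, not a general recommendation.

```python
def pages_are_similar(score, threshold=0.55):
    """Treat two pages as similar when their score reaches the threshold."""
    return score >= threshold


# Scores reported above for the same pair of pages, by download method.
scores = {
    "requests": 0.5741937488348972,
    "urllib.request": 0.5741937488348972,
    "phantomjs": 0.5900613016825356,
    "browser save": 0.5897597237261845,
}

decisions = {name: pages_are_similar(value) for name, value in scores.items()}
print(decisions)
```

All four download methods land above 0.55, so they lead to the same decision even though the raw scores differ.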
from html-similarity.
@thouger, here is the html that I'm using https://www.dropbox.com/sh/6p0f4e9k9ldei6j/AABTb-ApCNfq6cdcWVHMAx2ca?dl=0
from html-similarity.
At last I got the same answer as you. Thank you for your help. I think I was doing it the wrong way.
Even though you are in South America, I think that's fine, because I am not in a hurry, and the hope you have given me keeps me going.
Actually, I am writing my bachelor's thesis, titled "form data extraction". Before I saw this package, I was learning simple tree matching (STM). When I saw this package, I thought: it works very well and is much easier than STM! I think the package may help me finish my paper, so I hope I can share the idea and analyze the structure in my bachelor's thesis. I will clearly credit your name and state where the package comes from.
Three years ago, a student who presented his bachelor's thesis when he graduated got me interested in data extraction. He is in the same major as me, but three years ahead. Now that I am preparing to graduate, I hope my paper can get more people interested in this field.
I would be very grateful if you agree.
from html-similarity.
Cool. This package uses a heuristic to measure the structural similarity. I will do some experiments on my own today and check if I find something.
Great to hear about form data extraction. That sounds really interesting. I would like to know what strategy you are planning to use for the form data extraction.
Note: There is a package called Formasaurus which extracts forms from web pages. Formasaurus classifies which type of form is in a web page (login, signup, search, mailing list, etc.).
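import formasaurus
import requests
# Download the page and pass its HTML text (not the Response object) to Formasaurus
html = requests.get('http://github.com/').text
# Returns the forms found in the page together with their predicted types
formasaurus.extract_forms(html)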
Formasaurus uses logistic regression to make the classification, using the following features:
- POST/GET
- Text of the submit button
- Names of the CSS classes and ids
- Labels of the inputs
- Presence of some strings in the URL
You can read more about it in: http://formasaurus.readthedocs.io/en/latest/
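To give a feel for the kind of features listed above, here is a rough sketch that collects a few of them from the first form in a page using only the standard library. It is not how Formasaurus is implemented. The class name, function name, and feature keys are all hypothetical, and the URL-based feature is left out.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collects a few hand-picked features from the first <form> in a page."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.in_label = False
        self.features = {"method": None, "submit_text": [],
                         "classes_ids": [], "labels": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only the first form is inspected; later forms are ignored.
        if tag == "form" and not self.in_form and self.features["method"] is None:
            self.in_form = True
            self.features["method"] = (attrs.get("method") or "get").upper()
        if not self.in_form:
            return
        for key in ("class", "id"):
            if attrs.get(key):
                self.features["classes_ids"].extend(attrs[key].split())
        if tag == "input" and attrs.get("type") == "submit" and attrs.get("value"):
            self.features["submit_text"].append(attrs["value"])
        if tag == "label":
            self.in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self.in_label = False
        elif tag == "form":
            self.in_form = False

    def handle_data(self, data):
        # Label text is kept only while inside a <label> within the form.
        if self.in_form and self.in_label and data.strip():
            self.features["labels"].append(data.strip())


def extract_form_features(html):
    parser = FormFeatureParser()
    parser.feed(html)
    return parser.features


sample = ('<form method="post" class="login-form" id="login">'
          '<label>Username</label><input name="u">'
          '<input type="submit" value="Sign in"></form>')
features = extract_form_features(sample)
print(features)
```

A dictionary like this could then be turned into a feature vector for a classifier such as logistic regression.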
from html-similarity.
I am very sorry (;´д`)ゞ. I left out a letter: it is "forum", not "form". In my paper, I extract the URL of the next page, forum user data, forum post data, and post URLs.
I would be very happy to share my strategy with you, but I can't present it in GitHub issues because the original paper I refer to is 88 pages long. We could use another chat tool instead. I am also very interested in your other repositories, so I will spend some time looking at them.
Thank you for https://github.com/TeamHG-Memex/Formasaurus, I had never heard of it before. I hope you do not mind my mistake with the missing "u". Now I am going to sleep.
from html-similarity.
Hi @thouger, sorry for the delay. I will run the experiments over the weekend. I was busy these days.
I have some ideas to solve your problem:
- I would create a classifier to detect whether a page is a forum or not. You can use a simple Naive Bayes classifier. The simplest way is to look for keywords in the HTML: https://github.com/Sotera/webpageclassifier/blob/master/Keywords/forum.txt.
- Detect the list of forum entries with https://github.com/scrapinghub/mdr or a similar tool.
- Detect and classify pagination links with https://github.com/TeamHG-Memex/autopager
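The first idea above can be sketched with a tiny multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the training texts, and the labels are made-up toy data for illustration only. A real classifier would need many labelled pages and a proper keyword list.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical), standing in for text taken from pages.
docs = [
    "forum thread reply post member topic",
    "board posts replies quote thread",
    "buy cart price checkout shipping",
    "news article author published headline",
]
labels = ["forum", "forum", "other", "other"]

model = TinyNaiveBayes().fit(docs, labels)
print(model.predict("new thread with many replies from each member"))
print(model.predict("add to cart and checkout price"))
```

Working in log space avoids numeric underflow on long pages, and the smoothing keeps unseen words from forcing a probability of zero.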
Have fun writing your paper. I like the problem, so feel free to ask any questions. My email is my-github-username AT gmail DOT com ;)
from html-similarity.
Thanks for your help! I thought you would never reply to me. You gave me so much information that I must spend some time analyzing it.
- About the classifier to detect forums: I had never thought of that. After a quick search, I find Naive Bayes amazing. It is profoundly significant that the world is uncertain because human observation has limitations. Naive Bayes makes hypotheses about what we cannot see based on what we can see. I'm sorry my English is limited (;´д`)ゞ(;´д`)ゞ
- The second point is the focus of my research, but I have made another discovery that is specific to forums. I will share it with you after I finish.
- About the third point: I found the paper (but it is written in Chinese!) and I have already implemented it in Java, but I'm still curious what else can be done with autopager.
It is wonderful to have met you while I am doing related research. I am fascinated by what you gave me above. I will study it next.
from html-similarity.